Two Tier Architectures Are Anachronistic [video]

40 points by amcaskill 2 years ago · 24 comments

Reader

It's a tech talk so I watched it and made some notes. It's an argument for using in-process databases when doing data science, rather than an external DB. The speaker pitches DuckDB as a concrete example, which seems to be an in-process DB for Python data frames.

The speaker presents measurements showing how much overhead the wire protocols for various DBs have. MySQL is the best, Postgresql is orders of magnitude worse due to a very inefficient binary format design. The best is still 10x worse than netcat.

Apache Arrow is trying to design a universal protocol for DB access that's more efficient than what's out there currently.

Speaker asserts that scale-out is usually not needed in data analytics, no need to use Spark etc unless you want it on your CV.

Audience member asks "what about multi-user/multi-process access", speaker admits DuckDB basically doesn't do that.

Speaker pitches for using embedded in-proc DBs inside AWS Lambda functions. Not practical to install Oracle RDBMS in something that only runs for 100msec.

A web shell for DuckDB is demonstrated, it uses WASM.

Decentralization is pitched as a reason to avoid 2-tier architecture (separate db engine w/ client protocol).

TN1ck 2 years ago

> Speaker pitches for using embedded in-proc DBs inside AWS Lambda functions. Not practical to install Oracle RDBMS in something that only runs for 100msec.
It's not only unpractical, but hard to get it done. Recently tried to run Postgres in an AWS Lambda to create an anonymized DB dump. It was so painful that I gave up and created an access restricted database to do the anonymization instead. An in-memory mode for Postgres that would be as easy to run as sqlite or duckdb would be so useful for things where one can not replace it with either of them (sql dumps, testing).
- wild_egg 2 years ago
  
  You may be interested in this
  https://github.com/zonkyio/embedded-postgres-binaries
  I've been using this for test runners in Node and Go for a while now and it's been quite painless. Would be nice to have wider language support though
  - TN1ck 2 years ago
    
    Thank you, that's indeed super interesting for me (we use go as our backend language).
fulafel 2 years ago

DuckDB is actually a interesting case because it seems to have some history of memory corruption / segfault problems. The robustness provided by a process boundary is traditionally been valued a lot, though usually for keeping the app from corrupting the db and not the other way around.
city_guy_1 2 years ago

So this is only relevant for personal developer environments?

0xbadcafebee 2 years ago

One of the major failures of the modern computer science age (among others) is a lack of direction away from traditional i/o. We still are stuck on files and directories and tcp sockets. Yet what we actually want to do with i/o is not read a file from a local disk, or connect to a server and transmute the contents of the file over some additional protocol.

What we really want is to store some data somewhere, and later be able to retrieve it, without necessarily knowing what it was we stored or where or how. And we don't want to think about what server it's on, or what hard drive, or what folder. And we don't want to think about client protocols or query languages.

All of that would be possible if we reinvented i/o. Basically, just imagine what you want your experience to be, and then start making up names for functions that do that. Stuff that in a kernel, or a standard library. Now you have i/o that's based on how you really want to use data. The backend implementation of it can vary, but the point is to make the user experience what we actually want rather than what somebody else thinks is practical. Make the data interface you want to use, and make it a standard.

onetimeuse92304 2 years ago

Many decades ago, before we had files, we had data storages that do what you describe. They stored records of data. No need for files.
What happened, is we discovered that files are really useful because you don't need to declare the format of data that goes into the file. So the operating system can handle things like reading and writing and the application can organise how it wants to keep the data in the file.
The same really is for sockets. It is really useful to have somebody transfer the data for you in a stream and you, the application, only worry about the format of the data.
- kuchenbecker 2 years ago
  
  A junior engineer on my team asked me why we store bytes in our Blobstore/Filesystem rather than something structured like a DB.
  Bytes are a "narrow waist" and in-fact DBs actually use our system for storage. By supporting bytes, anything that can be serialized can be stored by the next layer up and the contract is very simple.
ozr 2 years ago

We have this right now. It's abstractions on top of the real primitives. That's what client protocols and query languages are.
devbent 2 years ago

> And we don't want to think about what server it's on, or what hard drive, or what folder. And we don't want to think about client protocols or query languages.
Different types of data are legal in different jurisdictions (for example the definition of PII data), the physical location of the hard drive matters.
When medical data is stored, where and how is important. When handling data that needs to, legally, needs an audit trail, abstractions won't do.
When data is needed at low latency, the details matter. When cost is important (egress charges per operation or counted by size of data transfered), details matter.
- 0xbadcafebee 2 years ago
  > the physical location of the hard drive matters
  Not exactly: what matters is the legal designation of the data storage device. The location of that device is one of many factors that "matter", but not to the application, or developer, or user. They only "matter" to the law. We aren't going to start writing UnitedStatesFileWrite() functions, now, are we?
  Instead of considering the physical location of a hard drive, what we should be doing is querying a data storage object which has the properties we want:
  io_construct = DataStorage() storage_search = io_construct.DataStorageSearch({ "contains": [ { "legal": { "jurisdiction": { "location": [ { "country": "US", "state": "California" } ] } } }, { "record": [ { "email": "foo@bar.domain" } ] } ] }) with io_object as io_construct.AttachDataStorage(device = storage_search): io_object.read()
  We should never have to think about what building a hard drive is located in, much less the complexities of dealing with specific data laws. The IO construct should deal with that.
mike_hearn 2 years ago

I think the details of IO are already abstracted pretty well, it's a topic that's had a lot of effort put into it. The remaining things you have to think about are pretty fundamental and not fundamentally technical in nature, like:
1. Price
2. Brand of whoever is providing the storage (matters because it's a proxy for lots of other details)
3. General physical location
Once you made those decisions services like S3 abstract the rest. There are tools that let you access these via FUSE (in which case client protocols don't matter).
Kinrany 2 years ago

Wouldn't the data interface still be a stream of bytes?
- 0xbadcafebee 2 years ago
  
  No reason it has to be. You could have the data interface accept plugins which preprocess data in different formats and expose it as something else, like an object, document, stream of documents, etc.
  - ozr 2 years ago
    
    Those are all ultimately represented as bytes. You're just looking for a different query language.

blibble 2 years ago

that chart of the "inefficiency of client protocols" tripped my bullshit alarm

the paper is here: https://15721.courses.cs.cmu.edu/spring2023/papers/15-networ...

it's a super-contrived example that's not using any of the functionality of the database and is just using it as "cat"

basically just doing cat over localhost, well, what a surprise, if you add a layer of serialisation of course it's slower that just doing memcpy()

if you're using your database to store files... maybe don't do that

DrDroop 2 years ago

I know of a DSP Engineer that used memcpy as a baseline to compare the speed of a sound filter. I think it is a good measure for first principle thinking.
There are other things wrong with the talk, it takes way too long to get to the point for one thing. DuckDB is cool and all but most of data management is getting the data in the right format/place and doing security or stuff like that, not running some query.
- blibble 2 years ago
  
  memcpy seems like a reasonable baseline for a function designed to operate on things in memory
  not for a database
  - DrDroop 2 years ago
    
    Man, you should read the original thread where memcopy was brought up as another example why netcat is a bad baseline for a network protocol and I was like, yea no that part at least sort of makes sense because that is the baseline. Sometimes I don't know why I keep commenting in this website. It is like talking with idiots all day.
onetimeuse92304 2 years ago

Also I know some of these databases. For example, if you use something like MongoDB with its default configuration, it will be slow as molasses. It will send 20 documents over the network (default cursor batch size) and then yield its time to the operating system and wait for further instructions.
If that document is just three small fields, then you just effectively succeeded receiving maybe couple packets before the server gave up. Pitiful.
Change the batch size to maybe 2 or 20 thousand, enable network compression, increase client read buffer size from its ridiculously low default size, and this could start looking more like a data transfer we expect.
mlyle 2 years ago

For a lot of data science/analytics, what you really want is "cat" of the data.
The database can't always do the data reduction and analysis you want to do quickly, and even in many of the cases where it can, trying to tell it about them in SQL and stored procedures can be pretty gross.
I say this as a huge proponent of SQL, stored procedures, and doing lots of work in the database.

sertbdfgbnfgsd 2 years ago

Starts by saying "i sent the abstract drunk, then i had to create a talk", basically admitting he started with the conclusion and then built the argument.

vrosas 2 years ago

Well at least he’s honest. Pretty much every talk I’ve given was a shotgun of random abstracts that I came up with content for later. The downside of conference talks being a required component of promotion packets.

Settings

Two Tier Architectures Are Anachronistic [video]

Keyboard Shortcuts