Announcing SurrealDB 1.0
surrealdb.com

This feels like satire, written by someone who should have spent more time studying the history of database systems.
> Imagine a world where the majority of your backend logic is seamlessly embedded within the database itself
This is not a good idea. It has been done many times and never quite caught on because it is not a good idea.
From a security perspective it is a nightmare.
Or if you do put all the correct isolation around the code to protect the database, then you have basically created an "app server" (old term) inside a database, and it would happily run outside of the database, since in essence it is already doing so.
> never quite caught on
It was quite normal in the 70s, 80s and even 90s; all the MS SQL, DB2, Oracle and AS/400 systems I encountered in those days had all or almost all of their logic as stored procs. Very large ones.
There's also a ton of optimization savings in doing this. Your DB already has to move the data from media to cache, so why not do the pipelined processing in place, where it's cheapest?
Instead, we haul that shit to the NIC, then across the network, then copy it into memory on some server (probably inefficiently), do the operations there, or we have to reinvent this with pushdown functions for distributed databases.
There are many cases where moving function-to-data is the right answer.
During the Java era the standard quickly became ORM instead of stored procedures.
Largely thanks to NeXT showing the way with EOF (the original ORM).
ORMs basically exist because people doing coding refuse to learn the relational model and want to program with OO instead. Most of the time they are, at best, irrelevant, but eventually they become expensive, slow, complex and painful. Everyone in coding learns, once they get enough experience, that translation/mapping layers are a waste of cycles and memory, that they are complex, and that wherever assumptions mismatch the result is tons of bugs; that applies to ORMs in particular.
I've just spent the past week refactoring code that was too slow because joins were being performed by a loop in the code rather than just doing a join in the first place.
Replacing 3 classes and numerous methods with two 6-line queries that do the same task in a thousandth of the time is quite satisfying.
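For illustration, a minimal sketch of that kind of refactor (hypothetical `users`/`orders` tables, sqlite3 only so it runs standalone; not the actual code in question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO orders VALUES (1, 1, 9.99), (2, 1, 5.00), (3, 2, 20.00);
""")

# Before: a "join" performed by a loop in application code,
# issuing one query per user (N+1 queries in total).
slow = []
for user_id, name in conn.execute("SELECT id, name FROM users"):
    for (total,) in conn.execute(
        "SELECT total FROM orders WHERE user_id = ?", (user_id,)
    ):
        slow.append((name, total))

# After: let the database do the join in a single query.
fast = conn.execute("""
    SELECT u.name, o.total
    FROM users u JOIN orders o ON o.user_id = u.id
""").fetchall()
```

Same result, one round trip, and the query planner gets to pick the join strategy.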
That seems due to the leaky abstraction that ORMs invite; they appear to be just 'normal code', just working with structures, objects and variables like you would normally do in code. In reality you have to really know the internals, performance characteristics, etc. of the db you are using. We encounter a massive number of codebases from the last ~10 years that are built as if there is no database to account for, so there are usually not even indices, and there are indeed loops doing inserts, etc.
Stored procs might be considered evil, but they did make the developer acutely aware of the innards of the database, instead of not even knowing how a db works and which things are expensive. To know that, you need to look past the leaky abstraction of ORMs, and that's distracting from making features.
I have heard tech leads in this situation claim that because everything runs fine in the cloud without knowing anything about postgres/mysql/dynamo etc, and those extra costs are cheaper than people, it's fine. But our team wasn't hired because it was fine; we were hired because hosting costs were eclipsing the developer costs. It's not that hard to do when stuff is just incredibly badly built...
They don't even need to be stored procedures, just having proper queries in the code to perform the database actions would be preferable to ORM.
Another big issue I have with ORMs is that a lot of them are hostile to you writing queries in the first place. Network IO is the worst, please stop trying to force us to use your terrible overlay.
I’m going to start having a seizure if I keep reading this thread because I agree so strongly.
It is painful looking at modern software development. People don't know the basics, and I think the whole Agile/TDD thing was the beginning of the end, because table schema design (and every other kind of planning) is not really compatible with agile.
> Or if you do put all the correct isolation around the code to protect the database, then you have basically created an "app server" (old term) inside a database, and it would happily, run outside of the database since in essence it is already doing so.
You can drastically cut down on end-to-end latency by eliminating the network hop from your app server to your DB, especially in cases where you are forced to make multiple back-to-back DB queries per request.
It's possible and even relatively straightforward to eliminate that network hop without moving all of the application logic into the database.
"all" is doing heavy lifting here in a way that distorts the observation the person you are replying to was making.
Fair enough. It wasn't my intention, perhaps just being careless.
I think these conversations usually boil down to some kind of pragmatism versus idealism, so let me be frank about that. If everything could be done as close to the data as possible, that would be faster than not. Every boundary / transformation is a few burned cycles and delays waiting for the arrival of photons/electrons. The ideal here, really, is to have everything occur in one process on one machine without hitting the disk. Ideally the User's machine, for whatever value of User.
However, there are other considerations which cause us frequently to draw the boundaries differently. Often just as valid.
Where certain things happen is relatively flexible; it's software, and we can come up with whatever architecture we like. If you're moving more logic into the database process, there are costs to that, but they're not measured in cycles. There are other costs if you instead move the database into the application process.
Maybe those costs make sense, in which case go for it. We've had the means to do arbitrary work in the database for many decades, and for some periods it was even fashionable.
It's a nightmare from a lot of angles, especially debugging. Mashing everything together is a recipe for disaster. To debug something you want to isolate a small unit of execution and be able to replay the same data through it continually until you get the right results back.
A world where the database is seamlessly embedded in the backend code instead would be more useful. Oh wait…
This line in particular is a head-scratcher:
> Advanced inter-document relations and analysis. No JOINs. No pain. […] queries allow for multi-table, multi-depth document retrieval, efficiently in the database, without the use of complicated JOINs
That sounds like they have a network-model database, not a relational database. That might be fine, but as I understand it, relational databases won because they offer more flexible access patterns and it’s easier to write correct queries.
Not Open Source (in an OSI way):
https://github.com/surrealdb/surrealdb/blob/ed60a35b9b539e1b...
Is it claiming anywhere to be open source? I've heard an interview with the creators and they seemed pretty open about which aspects were open source and which were closed.
I could be missing something from their branding, but I'd rather see projects with non-open-source licenses from the beginning than projects that bait-and-switch with a license change when they want to become profitable. (Although I'd much rather just see a fully open source project.)
Sure. I wasn't really trying to insinuate anything, although I think in a more-ideal world the licensing arrangements of all projects would be in a prominent place, especially if they use a "Star us on GitHub" call-to-action. It's a vital part of evaluating this sort of thing.
> Is it claiming anywhere to be open source?
They have a page on their website where they claim to be "an open source company" (https://surrealdb.com/opensource)
Certainly debatable copy, but in total fairness that page is a list of actually-Open-Source licensed projects.
It would indeed be better to state that clearly; then I wouldn’t have to check it (our company has a policy under which this won’t pass).
> Imagine a world where the majority of your backend logic is seamlessly embedded within the database itself
thinks of Oracle APEX
thinks of IBM AS400
shudders
Hahaha, loved coding RPG on the AS400. The file system on top of the database layer! Awesome stuff
Fireship did a 'SurrealDB in 100 seconds' in case anyone would like a quick video summary: https://youtu.be/C7WFwgDRStM?si=va17q7zagn4lIoi3
It benches very, very slow, and there are no options to scale horizontally beyond manual sharding.
API has a cool design though.
The dev team needs to look at perf before the complexity hole gets too insane if it hasn't already.
What’s your source on the benchmark thing? It seems like they have a lot of backend KV store options including FoundationDB and TiKV which are both distributed, so I’d be curious how the backends perform relative to one another.
Strong vibes from RethinkDB. I hope SurrealDB will have a different fate, I'm a big big fan of live queries and change feeds and don't like most existing implementations.
This immediately sprung into my mind. Already got bitten by RethinkDB (trusted it too much to do some logic, rewrite was painful), so I’d rather have reasonably easily replaceable third-party components than depend on something unique, especially without any guarantees.
Can you give a quick elaboration on what you mean about the fate of RethinkDB? I'm not familiar with it.
The project didn't really take off as a business and the team joined Stripe. See the "History" section. https://en.wikipedia.org/wiki/RethinkDB
My biggest concern with SurrealDB is the license, otherwise I'd be very interested in adopting it.
I know the license is okay, but from my reading of it, it sounds like if SurrealDB really wanted to shut you down they could.
Otherwise, I find the architecture of it to be beautiful and compelling.
Unless you're operating a "database service", a well-defined term (in the license) that excludes almost every use case.
I don't see how SurrealDB could shut you down?

> A "Database Service" is a commercial offering that allows third parties (other than your employees and contractors) to access the functionality of the Licensed Work by creating tables whose schemas are controlled by such third parties.

So if you create schemas on the fly, you cannot? Why would I use such a db?
Right, so could I use it to create an Airtable clone?
Generating a table schema per airtable spreadsheet sounds like an anti-pattern to me.
To me it seems like the options for user-defined schema are either push-down the schema (and do one schema per spreadsheet) which lets you take full advantage of your DB's query planner statistics system etc, do an entity-attribute-value table and deal with building your own query planner and collecting your own statistics, or go with a document store and build some kinds of dynamic indexes instead of dynamic table schemas.
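As a rough illustration of the entity-attribute-value option (hypothetical table and column names, sqlite3 only so it runs standalone): every user-defined "column" becomes a row, so adding a field needs no DDL, but filtering turns into self-joins that the planner has little statistical insight into.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One fixed table holds every user-defined "cell".
    CREATE TABLE eav (
        entity_id INTEGER,   -- the row in the user's sheet
        attribute TEXT,      -- the user-defined column name
        value     TEXT,      -- everything stored as text here, for brevity
        PRIMARY KEY (entity_id, attribute)
    );
    CREATE INDEX eav_by_attr ON eav (attribute, value);
""")

# "Adding a column" is just inserting rows with a new attribute name.
conn.executemany("INSERT INTO eav VALUES (?, ?, ?)", [
    (1, "name", "alice"), (1, "status", "active"),
    (2, "name", "bob"),   (2, "status", "inactive"),
])

# Filtering on a user-defined column means self-joining the EAV table;
# this is where you start doing the query planner's job yourself.
rows = conn.execute("""
    SELECT n.value
    FROM eav AS n
    JOIN eav AS s ON s.entity_id = n.entity_id
    WHERE n.attribute = 'name'
      AND s.attribute = 'status' AND s.value = 'active'
""").fetchall()
```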
That being said, I generally agree: I’ve been bitten by enough database products that are some startup’s main product to really trust them. I’d rather use a database that’s either maintained by something like the Apache Foundation (Druid, Pinot, etc.) or one that’s a side project of some big company (FoundationDB) or one that’s been around long enough to be unlikely to surprise us when the team is acquihired.
I've never seen a database that either (a) allows you to create that many schemas at all or (b) doesn't become less stable because of it.
One example of this is that the database may create a directory structure per schema which will then result in file system performance degradation or hitting end user limits.
“Imagine a world where the majority of your backend logic is seamlessly embedded within the database itself.”
Don’t need to. Majority of my work is de-coupling systems “designed” around this idea.
Why can’t the database be the database, the backend be the backend, etc.? You can’t have a toolbox full of only multi-use tools. Sometimes you just need a fucking Phillips head screwdriver.
I like where they are going. Are there benchmarks out yet?
Is there any trick to implement live updates in a scaleable way? In the limit for a naive implementation, every mutation would have to check every subscriber to see if the change is relevant which seems like it would cause large bottlenecks for writes.
One of the guys behind Convex explained the rough idea; I hope I do it justice.
The strategy is to break the subscription up into listens based on the read-set ranges of the query. Then you put the individual read-set ranges into a system table that you index. Finally, when writes happen, you notify all queries whose read-set intersects with the write-set.
For example, say I have a query `SELECT * from block WHERE parent_id = X AND last_modified_at > Y`.
This query might create two subscriptions for its read sets:
    { query_id: Q, table: block, column: parent_id,        min: X, max: X }
    { query_id: Q, table: block, column: last_modified_at, min: Y, max: Infinity }

Now a write happens: `INSERT INTO block { id: Z, parent_id: X, last_modified_at: Y + 10, title: "hi" }`

We can find the subscriptions to notify by doing queries like this:

    SELECT query_id FROM subscription
    WHERE table = block
      AND (   (column = id               AND min <= new_block.id               AND max >= new_block.id)
           OR (column = parent_id        AND min <= new_block.parent_id        AND max >= new_block.parent_id)
           OR (column = last_modified_at AND min <= new_block.last_modified_at AND max >= new_block.last_modified_at))

Then you notify all those query_ids.

I'm sure there's a lot of details and optimizations you can do on top of this; finding the right read sets seems pretty tricky for complex queries. Plus stuff like batching/debouncing, etc.
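A toy in-memory version of the intersection step (names are made up; a real implementation would keep these ranges in the indexed subscription table described above):

```python
from dataclasses import dataclass
from math import inf

@dataclass
class RangeSub:
    query_id: str
    table: str
    column: str
    lo: float
    hi: float

# The two read-set ranges from the example query, with X = 42 and Y = 100.
subscriptions = [
    RangeSub("Q", "block", "parent_id", 42, 42),
    RangeSub("Q", "block", "last_modified_at", 100, inf),
]

def queries_to_notify(table, written_row):
    """Return ids of queries whose read-set ranges intersect this write."""
    hits = set()
    for sub in subscriptions:
        value = written_row.get(sub.column)
        if sub.table == table and value is not None and sub.lo <= value <= sub.hi:
            hits.add(sub.query_id)
    return hits

# The write from the example: parent_id = X, last_modified_at = Y + 10.
print(queries_to_notify("block", {"id": 7, "parent_id": 42, "last_modified_at": 110}))  # {'Q'}
```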
Is there a talk you can point me to?
There are several challenges with this approach that come up for me (which, unless I’m mistaken, is the naive approach of checking each write against a notification set).
The first is that maintaining the read set seems very expensive since it scales with the number of live queries installed. In a multi tenant DB, that seems tricky to manage in terms of memory used.
The second is that the computation of intersection with a read range seems potentially very expensive. Imagine you have a string column. You now have to do a string comparison for every insertion that might hypothetically match the query.
Finally, computing that read set correctly seems challenging (as you mention) and it’s not immediately clear to me it’s always tractable (eg could you do this with arbitrary complicated joins?).
Additionally, in your description, each write has an implicit table scan to do the select to find the intersection. That will tank write throughput even for small total intersection sets (eg there’s a reason LSM databases do deletes by inserting a tombstone instead of checking whether the data exists first, same with the merge operator in RocksDB - a db read in the write path significantly kills performance)
See https://news.ycombinator.com/item?id=31836545
To the specifics of my solution, you’re going to visit far fewer than O(subscriptions) rows using a btree index on min & max as I suggested, and I’m sure there are use-case-specific data structures that could do an even better job.
You also don’t need to stall the write transaction until subscriptions are notified; you can batch up that work into big chunks and process it on a background queue after the transaction commits.
Finally, you don’t need to subscribe to read sets that exactly match the query. You can simplify them a lot for those kinds of complex cases, because you can tolerate over-subscription: re-run the query even if it hasn’t changed, check the new result, and then decide whether to deliver it to the external subscriber or not.
Listener callout does not have to be part of the synchronous writes. Say if you keep a changelog, listeners can be asynchronously notified. These checks can be batched and throttled as well to minimize call-outs.
I would suspect this is what Google does with Cloud Firestore.
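A rough sketch of that decoupling (all names here are hypothetical): the write path only appends to a changelog queue, and a background worker drains it in batches before doing the more expensive subscription matching.

```python
import queue
import threading

changelog = queue.Queue()

def handle_write(table, row):
    # ... perform the actual write and commit the transaction ...
    changelog.put((table, row))  # cheap bookkeeping; nothing else blocks the write

def notify_matching_subscribers(batch):
    # Placeholder: run the read-set intersection check from upthread for the
    # whole batch and call out to the affected listeners.
    print(f"would notify listeners for {len(batch)} writes")

def notifier_loop(batch_size=100):
    while True:
        batch = [changelog.get()]            # block until at least one event
        while len(batch) < batch_size:
            try:
                batch.append(changelog.get_nowait())
            except queue.Empty:
                break
        # One matching pass for the whole batch amortises the cost across writes.
        notify_matching_subscribers(batch)

threading.Thread(target=notifier_loop, daemon=True).start()
```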
If you’re throttling, then your database can’t actually keep up with broadcasting changes. Throttling only helps ensure the system keeps working even when there’s a sudden spike.
Batching only helps if you’re able to amortize the cost of notifications by doing that, but it’s not immediately clear to me that there’s an opportunity to amortize, since by definition all the bookkeeping to keep track of the writes would have to happen (roughly) in the synchronous write path (and would require a fair amount of RAM at scale). The sibling poster mentioned taking read/write sets and doing intersections, but I don’t think that answers the question, for several reasons I listed (i.e. it seems to me like you’d be taking a substantial performance hit on the order of O(num listeners) for all writes, even if the write doesn’t match any listener). You could maybe shrink that to log(n) if you sort the set of listeners first, but that’s still an insertion speed of m log(n) for m items and n listeners, or O(mn) if you can’t shrink it. That seems pretty expensive to me because it impacts all tenants of the DB, not just those using this feature…
While I applaud any attempts to innovate in this area, I'd be more interested in seeing the opposite approach - integrating persistent storage in a programming language. If we already have default implementations of dictionaries, linked lists, and other in-memory data structures in programming languages, why not have default implementations of permanent data structures such as object collections and KV-stores?
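Python's standard library already gestures in this direction: `shelve` gives you a dict-like object whose contents persist across runs, with no separate database process. A minimal example:

```python
import shelve

# A dict-like object backed by a file on disk; writes survive process restarts.
with shelve.open("app_state") as store:
    store["counter"] = store.get("counter", 0) + 1
    store["users"] = {"alice": {"visits": 3}}
    print(store["counter"])  # increments every time the program is run
```

Of course, this is a long way from the querying, indexing and concurrency a real database gives you, which is part of why such approaches stay niche.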
There have been many attempts over the years. There's Prevayler [0], in Java, and Madeleine [1], in Ruby. Neither of them became wildly popular.
I worked with Madeleine for a few years, and while the performance was amazing, RDBMSs are ahead because of their many years of standardisation, tooling and capabilities (e.g. migrations).
The other very important thing is that DB latency is an issue only in the hands of a novice. In the real world, network latency is a couple of orders of magnitude higher.
Check this out: https://microstream.one/products/microstream-for-java/
Io also has persistence baked into it: https://iolanguage.org/reference/
But I'm sure for most of the HN crowd one has Java in the URL and the other is an esoteric cool language from the early 00s, so... not practical. :)
I feel like databases are the new JS framework, there is a new one every week.
1.0 version but https://github.com/surrealdb/surrealdb/issues/1548 is still open :)
A lot of the issues here have been resolved in documentation. For example, there are now new docs on permissions and deployment, and also more information on backups. Others are currently being worked on ;)
The issue is still open as we haven’t resolved all of it yet, so it makes a good reference.
Is it a database where the only date/time type is a melting clock, and a join is defined as “the chance encounter of a sewing machine and an umbrella on an operating table”?
Live queries & change-feeds! Reactive data stores are great; neat to see.