Ask HN: SQL or NoSQL?
Just came across this tweet criticizing the Parler social network for using a relational database:
https://twitter.com/sarahmei/status/1348477224467394560
My understanding was always that for relational data (e.g., social networks) you should use a relational database. Is the person in this tweet correct? If so what is a better option? Don't worry too much about this tweet, my guess is that the author wanted to express a strong opinion to provoke a reaction form a certain audience. Once you reach a large scale, relational databases start being a problem for availability and replication of data across different availability zones. Operations become complicated (you have replication chains, master/slave setups, etc.) If your data is relatively simple and doesn't require a lot of relations and foreign keys, then something like Cassandra can save a lot of headaches. Btw, a common trick to make a relational database perform at scale by limiting joins is to "flatten data", i.e. replicate data across different tables to avoid joining them. Finally, don't let yourself be fooled by anyone who claims they know "the better option." There is no better option. There is only a better option for a particular use case you're looking at, given the specific constraints at hands. That's what engineering is about, including software engineering. If you want to learn more about designing storage systems by constraints, I recommend that you read the 2007 Dynamo paper from Amazon, and in particular section 2.3 "Design Considerations". Below is a link, you can easily find a PDF online if you need. https://www.allthingsdistributed.com/2007/10/amazons_dynamo.... I no longer buy the "use relational databases for relational data, use NOSQL for non-relational data" mentality. Basically all meaningful data in an application context has relationships. There is no real such thing as "non-relational data" Instead the question really is do you want a planned, enforced schema or an unplanned, freeform one. Use a SQL database for the former. Take a long look in a mirror and question the decisions that made you the way you are if the latter. To add to this: with Postgresql's JSON type, it's very easy to mix the two. We use relational data for most things to get the benefits of an enforced schema, but in places where freeform is preferable we stuff that data in a JSONB column. I can't think of any projects that would need more flexibility than that provides. Adding to this is the support for queries on the JSONB column which combined with computed table columns and the support for indexes (including sparse indexes) on JSONB columns. You can throw a bunch of stuff you don’t know how to deal with yet that you know you will need to deal with later... then gradually massage out the important parts, expose them to code using the table in a normal SQL style manner, test if things make a difference to indexes, and then eventually promote relevant information out of the unstructured data in the JSONB column up to a place in the fully specified SQL table with an appropriate column type and everything. It’s a fantastic way to build an exit strategy for situations where original developers a picked NoSQL database and current developers want to replace it with a more structured SQL database. Freeform flexibility is one aspect, but a document-style could also simply be a preference for how you want to structure your data or typically has an impact on how joins happen (if the document-database offers joins). Those joins will work in a graph-like fashion instead of how flat sets are typically joined. Or a nested document could be an optimization to provide data in the exact format that your client wants to see it. Although some document databases have popularized the idea that you should join in the client because they didn't provide joins initially, it doesn't have to be that way. Mixing paradigms in one database is probably going to be happening more. Just like Postgres is offering a 'document' style, some document databases are offering documents with relations. It wouldn't surprise me to see document databases offer optional schemas. I think that the future is a mix of options and tools in one database (which JSONB columns are a first step for). Depending on the situation we'll just model differently. The best database might become the one that makes us use these different tools most elegantly together. The difference between a document and a table is only a 'flatten' and a 'schema' away. Having worked with NoSQL apps, I would say that NoSQL simply move the burden of maintaining data integrating inside the application code. I would recommend default should be SQL and NoSQL should be used only if it really is impossible to do it in SQL. Exactly, but it goes further than that. The mentality never made sense since the term NoSQL never made sense to start. It's amazing how many people use a term that just originated from a meeting to talk about alternative databases. How we keep using it, although it's practically impossible to say what NoSQL is. Depending on whom you ask that term means different things. This is a very good introduction to the term: https://www.youtube.com/watch?v=qI_g07C_Q5I Graph databases are considered 'NoSQL' yet they have relations and transactions. Schemaless is often also one of the properties give to NoSQL, but it's also a bit strange to consider that a NoSQL attribute. Some traditional databases offer schemaless options and databases like Cassandra has schema yet is considered NoSQL. I work at Fauna which has relations, stronger consistency than many traditional databases. It is schemaless at this point but that might change in the future. Since it doesn't offer SQL it's thrown into the NoSQL bucket with the assumptions that come along with it. None of these one-liners in computer science make sense IMHO and we listen way too often to colleagues who use them. Similarly "Use SQL for enforced schema" might be accurate in many cases but in essence it depends on your situation, and we need to do research about what we use instead of following one-liners ;) There is some truth in this tweet but it doesn't mean you should use a nosql document database. Storing more context in document helps obviously because you don't have to fetch the data many times, it's actually also done in relational databases whenever needed. But you can't store a lot in one document, that doesn't scale nor work. For example, if someone changes its avatar or want to delete its account, do you want to parse all your social network documents to update an avatar or remove the comments on a tiny subset of them ? If a post is popular, are you going to update its document thousands of times per second ? In practice you will most likely find a mix of everything. Relational databases, in memory data stores, cache layers, perhaps a few nosql documents database, some big data stuff and a probably some excel sheets. Exactly! That's how I've built until now, a mix of databases. But it's also harder to manage. Database vendors notice this and that results in databases that start offering alternative ways of modeling. It's not about NoSQL vs SQL. Facebook's Tao is still backed by MySQL, so it's not like there's some intrinsic limitation. The issues are number of records examined to return a result, lock contention, sharding, and replication/consistency. NoSQL databases generally trade some of the conveniences of relational to be able to provide stronger properties in these aspects. The limitations that Sarah Mei identifies as clownpants is using a 32 bit primary key for an identifier for an ephemeral thing. That is again nothing to do with SQL vs NoSQL. It would affect both of them the same way. I would go a step further, you can't even talk about NoSQL vs SQL. It's about database features, the join patterns, how scaling happens, both are overlapping more and more and will continue to overlap more. Products built on SQL are aiming to scale and 'NoSQL' is aiming to provide the features that SQL provides in a scalable manner. Her original statement was already quite confusing. A relational store doesn't necessarily mean SQL, many 'NoSQL' offer relations and are a perfect fit for social media or were even built to support this kind of applications :) Unless you've seen their code and their data structures, we don't know the impact on performance of their technical choices. I would say that there is no black and white answer of what type of product needs what type of database -- it all depends on how you design the solution. I'd also venture to say that with so many databases now supporting JSON as a native data type, you can blend relational and non-relational data as needed within a relational DB. I'm using NoSQL only if I feel that I can't do it properly in a SQL database. So far I never used NoSQL. All ways use SQL NoSQL is for incompetent people who can't figure out how to convert a JSON request to a table structure. They just put the entire JSON as it is in a DB and call it NOSQL. Anyone using NoSQL for anything is either lying or clueless. You’re completely ignoring the situations where your doing things like exploratory development or free form data processing. If you’re dealing with enough data you want to start having an index here or there to speed up the queries you’re developing for whatever reason. I’ve done some data exploration work against things like the Python Package Index where the size of the dataset had put off more significantly resourced groups have abandoned projects in the past. For me to get useful data out of that in a reasonable timespan required repeatedly prototyping the queries what data I kept, and how I was loading it. UnQlite (sort of a nosql version of SQLite) was my secret weapon. I could build things in iterations extremely easily with each generation of the code wrapped up in simple scripts for easily measuring the timing of each stage allowing me to optimise things to eventually get the answers I was looking for with a job that took less than an entire weekend to run and with an optimal data set size that only ended up keeping a few GB of the much larger data I started with. This would have been WAY more work for me with with a SQL database. > for easily measuring the timing of each stage Why would easily measuring the timing of each stage be way more work with an SQL database but way easier with a NoSQL database? Martin Kleppmann's "Designing Data-Intensive Applications" discusses this https://dataintensive.net/ Besides being a good read overall, the book discusses topics like this one in detail and with a healthy attitude (people tend to have strong opinions on this) > My understanding was always that for relational data (e.g., social networks) you should use a relational database. I thought you were supposed to use a graph database for that, like dgraph. Do I remember incorrectly? > Dgraph is a horizontally scalable and distributed GraphQL database with a graph backend. --- Edit: found the source... According to https://www.infoworld.com/article/3251829/why-you-should-use...: "However, as with any popular technology, there can be a tendency to apply graph databases to every problem. It’s important to make sure that you have a use case that is a good fit. For example, graphs are often applied to problem domains like: - Social networks - Recommendation and personalization - Customer 360, including entity resolution (correlating user data from multiple sources) - Fraud detection - Asset management" This is what I thought the tweet would be referencing, but since the author is talking about posts, comments, etc. the alternative is clearly not saying a graph database (which is good for social connections) is optimal. Realistically, it just seems like a low-effort attempt to dunk on Parler for likes and retweets. I think the choice boils down to a few questions: - do you need relational data, or something more simple, or something more flexible ? - do you need transaction integrity ? Transaction integrity is a nice feature, but you can also design all your code so that if something blows "in the middle", it is somehow repaired automatically in a further event. Maybe a third point: most of our relational / transactional database technology is quite old. Could we do something better than SQL query language, common database types, and the actual database code that was very optimized for magnetic spinning disks, but maybe is not optimized for SSD ? Maybe, we would need something like SQLV2. And my god how much hype bullshit is inserted in those technical discussions. 'NoSQL' can be transactional and relational. The question should always be: "this is my problem, what's the best database?". NoSQL is such a huge bucket that the original question doesn't make sense imo. So is SQL, some traditional databases have quite some nifty features to support specific patterns. SQL will (maybe sadly?.. maybe not?) not go away. Many so-called 'NoSQL' are looking into providing SQL or already provided SQL (with or without limitations) to their users because they just want to use what they know.
I would be stoked for an SQLV2 standard! I think it always originates from business analysis requirements. Do you have some analysis that could be difficult to perform if using X? If it is then maybe switch to Y, or find a balance, or even build duplicates. Nine joins is not a big deal, you simply filter data before joining. And that is only if tables are fully normalized, no one does that. For example, current avatar and user info maybe in same table. Post and permission will likely be in the same table. Of course, you will use materialized views for even better performance. whenever someone makes blanket technical statements in this crazy boasting fashion, i think of Yeats: > ...the worst are full of passionate intensity. that said, it's difficult to feel sympathy for people supporting a platform that encourages terrorism, murder, etc. Social media are typically quite heavy on tree traversals. That kind of pattern is very similar to trying to resolve a deep ORM query or a deep GraphQL query which also doesn't map very well on 'traditional' relational databases https://en.wikipedia.org/wiki/Object%E2%80%93relational_impe.... I believe this 'issue' depends on:
A) the type of join
B) whether your relational databases flattens between consecutive joins.
C) is there easy/efficient pagination on multiple levels The type of join shouldn't be a problem, SQL engines should in most cases be able to determine the best join. In the cases it can't you can go start tweaking (although tricky to get right, especially if your data evolves, it's possible, you probably want to fix your query plan). B is however tricky and a performance loss since it's really a bit silly that data is flattened into a set each time to be then (probably) put into a nested (Object-Oriented or JSON) format to provide the data to the client. This is closely related to C, in a social graph you might have nodes (popular people or tweets) who have a much higher amount of links than others. That means if you do a regular join on tweets and comments and sort it, on the tweet you might not get beyond the first person. Instead, you probably only want the first x comments. That query might result in an amount of nested groups. So it might look more like the following SQL (wrote it by heart, probably not correct): SELECT
tweet.*,
jsonb_agg(to_jsonb(comment)) ->> 0 as comments,
FROM tweet
JOIN comment ON tweet.id = comment.tweet_id GROUP BY tweet.id
HAVING COUNT(comment.tweet_id) < 64
LIMIT 64 That obviously becomes increasingly complex if you want a feed with comments, likes, retweets, people, etc.. all in one.
There are reasons why two engineers that helped to scale twitter create a new database (https://fauna.com/) where I work. Although relational, the relations are done very differently. Instead of flattening sets, you would essentially walk the tree and on each level join. I did an attempt to explain that here for the GraphQL case: https://www.infoworld.com/article/3575530/understanding-grap... TLDR, in my opinion you can definitely use a traditional relational database. But it might not be the most efficient choice due to the impedance mismatch. Relational applies to more than traditional SQL databases though, graph database or something like fauna is also relational and would be a better match (Fauna is similar in the sense that joins are very similar to how a graph database does these). Obviously I'm biased though since I work for Fauna.