Ask HN: Learning NoSQL, papers and books

142 points by wareotie 8 years ago · 50 comments · 1 min read


In your opinion, which papers and books are mandatory to really understand the subject of NoSQL?

indogooner 8 years ago

I don't know what your current knowledge/experience with NoSQL databases is, but I would suggest starting with the well-known Bigtable paper [1]. After that, instead of reading more papers, have a look at the AOSA chapter on NoSQL [2]. You can then either go through the Bigtable paper again to improve your understanding, if you feel the need, or jump to the Dynamo paper [3]. To develop your understanding further, I think it would be good to go through the documentation and source code of some open source databases. This would help you connect the usage scenarios with the design choices you saw in the papers.

After this it is up to you. The papers reference a lot of distributed systems literature. If you are interested you can go through the resources here [4]. If you want to go a more hands-on way, I would also recommend reading the AWS DynamoDB best practices documentation [5] (you can read up on Cassandra or CouchDB as well) to see the practical considerations when using these systems. Then try to use it or any other NoSQL database in a side project and see whether it is a good fit. The data modelling would involve thinking hard about use cases and would also help you compare this to relational systems.

[1] https://static.googleusercontent.com/media/research.google.c...

[2] http://www.aosabook.org/en/nosql.html

[3] http://www.allthingsdistributed.com/files/amazon-dynamo-sosp...

[4] https://github.com/aphyr/distsys-class

[5] http://docs.aws.amazon.com/amazondynamodb/latest/developergu...

tzury 8 years ago

Not ranting or trolling, but in the vast majority of cases I've come across, PostgreSQL or even mySQL or SQLite would have been a better choice.

(There must be something appealing to developers about using JSON-style syntax rather than a Structured Query Language.)

There should be a solid reason to pick NoSQL in general, and when such a reason appears, picking the right one among the available NoSQL platforms is another job.

https://en.wikipedia.org/wiki/NoSQL

  • greendisc 8 years ago

    > Not ranting or trolling, but in the vast majority of cases I've come across, PostgreSQL or even mySQL or SQLite would have been a better choice.

    This is ranting.

    I am a Postgres proponent, but saying that PostgreSQL/mySQL/SQLite is the better choice in the vast majority of cases the parent has come across is reckless. The words were well chosen, making the rant not that obvious.

    There aren't good or bad DBs. Every DB has its strengths and respective trade-offs. As much as I like Postgres, there are so many use cases for other DBs, including NoSQL ones. I am not going to feed the troll by arguing why NoSQL can be terrific or SQL can be a big struggle. I am on both sides; both SQL and NoSQL have their place.

    It's sad that a thread about learning NoSQL gets hijacked by an unrelated top comment opposing NoSQL.

    • traviscj 8 years ago

      There definitely are bad databases. You can easily make a system that is NOT consistent and NOT available and NOT partition tolerant, for example.

    • cnlwsu 8 years ago

      “Yeah, but Postgres” is the new “Is it webscale?” For any db related thread. They found a blue hammer that will work for every problem and want everyone to know.

  • cloverich 8 years ago

    Not only that, but Postgres has great JSON support. I think a good way to put it is: If you don't know the ins and outs of SQL, start there. It solves the vast majority of problems you'll encounter. Expand out to NoSQL as your needs (and knowledge) arise.

    • no_wizard 8 years ago

      I know you mentioned Postgres; I was just wondering if you or anyone else has experience using JSON with MySQL. I am currently in a place at work where the decision was made to use MySQL and we can’t back out of that, and I find their documentation a little terse on this subject (I’m not a SQL expert), so I was wondering if anyone could speak to it at all. Does it automatically parse the keys in a JSON file as table names for where to put the values, or are you just calling a file every time? Is one per se more effective or efficient than the other?

      Sorry to latch on, I’m very eager to learn. Our stacks of choice are Django and Flask respectively, if that helps.

    • tzury 8 years ago

      Right on! This is the reason I've actually mentioned Postgres.

  • jjirsa 8 years ago

    “I want to learn about planes”

    “Trains are usually a better choice. Most people don’t need planes”

  • pgwhalen 8 years ago

    Can you explain why? I agree, but I’d like to be able to justify it as much as possible when arguing for SQL.

    • ajmurmann 8 years ago

      Not the author, but here is my explanation: SQL databases are similar to a Swiss army knife. You can apply them to pretty much every use case. However, for most use cases they won't be as good as a more specialized tool. NoSQL DBs usually make stronger trade-offs that limit them to fewer use cases, but make them incredibly well suited for others. If you know for sure what your problem is and you gotta scale, go NoSQL. If you and your company are starting out, you are most likely better off with Postgres. Even if your current use case is a perfect fit for a specific NoSQL store, your business needs are likely to change and then you gotta migrate. For all but intense cases Postgres will scale well. Once you are super successful you can migrate the pieces of your system that need it to a better-scaling solution. You must make 100% sure though that you understand the trade-offs that you are making. There is no system that is just in general better than any other reasonable system. If a new system claims otherwise, we just don't know the trade-offs yet, which is super dangerous.

      • kpil 8 years ago

        >However, for most use case they won't be as good as a more specialized tool

        It's just a small set of problems that really requires a nosql database.

        Most (if not all) nosql databases are perceived as less complicated since they hand-wave away all complicated things to the users of the database, while focusing on being fast and simple to use and run in a cloud or cluster.

        Anyone running a database system in a fault-tolerant configuration immediately hits the CAP theorem, and SQL and nosql databases sacrifice or ignore different aspects of both CAP and ACID in order to scale.

        As you write, you really have to know what you are sacrificing before making that choice. Perceived complexity is probably not a good selector.

        One problem is that SQL databases are normally installed in "pet mode", where you have two or three servers that you really have to take care of. This feels less satisfactory when developing for the cloud, and typically also doesn't scale very well horizontally. Instead of running your own distributed database in the cloud (and failing), there are also PaaS databases, but their SQL tends to be vendor-flavoured, making it hard to change the infrastructure.

        Maybe another problem is the model mismatch - relational databases are imposing restrictions on how data is represented, and how it's retrieved that makes no sense from a "rest-interface based" view as there's a mismatch between the relation-entity view (objects and lists) and relational algebra.

        There are graph databases, and I personally think that they might be the future. Building strong models within a bounded context is still probably the best way to model complex data and processes that operate on that data.

        Unfortunately the future isn't here yet and most graph databases are still slower than my laptop.

        The best compromise is probably to use CQRS (Command Query Responsibility Segregation), meaning that queries and commands (modifications) are handled by separate stacks: read-only data might be distributed and updated ("cached") for use, but actual processing is done against a single consistent database running on a few "pet" servers.

        This only makes sense for systems that mostly read data and update it relatively seldom.
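
The CQRS split described above can be sketched in a few lines. This is a minimal in-memory illustration, not a production design: `WriteStore`, `ReadReplica`, and the deposit command are hypothetical names, and the replicas are refreshed synchronously here, whereas a real deployment would usually propagate changes asynchronously and tolerate slightly stale reads.

```python
from dataclasses import dataclass, field


@dataclass
class WriteStore:
    """Stands in for the single consistent database on the "pet" servers."""
    balances: dict = field(default_factory=dict)
    subscribers: list = field(default_factory=list)

    def handle_deposit(self, account: str, amount: int) -> None:
        # Command path: validate and apply the change in one place.
        if amount <= 0:
            raise ValueError("deposit must be positive")
        self.balances[account] = self.balances.get(account, 0) + amount
        # Publish the new state so read replicas can refresh their copy.
        for notify in self.subscribers:
            notify(account, self.balances[account])


@dataclass
class ReadReplica:
    """Stands in for a distributed, read-only ("cached") copy of the data."""
    view: dict = field(default_factory=dict)

    def apply(self, account: str, balance: int) -> None:
        self.view[account] = balance

    def get_balance(self, account: str) -> int:
        # Query path: served from the local copy, never from the write store.
        return self.view.get(account, 0)


store = WriteStore()
replicas = [ReadReplica(), ReadReplica()]
for replica in replicas:
    store.subscribers.append(replica.apply)

store.handle_deposit("alice", 50)
store.handle_deposit("alice", 25)
balances_seen = [r.get_balance("alice") for r in replicas]
print(balances_seen)  # [75, 75]
```

Reads scale by adding replicas, while all writes still go through one store, which is why the pattern suits read-heavy systems.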

  • paulddraper 8 years ago

    Q: How can I learn about noX?

    A: Not trolling, but X is vastly usually better than noX.

    IDK what trolling is.

  • zzzcpan 8 years ago

    Vast majority of cases I've come across, if not all of them, only suffer from any reliance on 90s era RDBMS systems.

    And it's never about JSON, it's about latency and resilience, about being able to simply add and replace nodes, about just working in a modern distributed environment.

    • collyw 8 years ago

      How many systems actually need a distributed database though? In my experience it's usually resume-driven development that makes the choice to go NoSQL.

      • projectileboy 8 years ago

        Dear God, this. 90% of the people reading this (or more) - myself included - are currently working on a system that averages fewer than 100 concurrent users. I’ve worked on big systems, and DBs like Cassandra are great, and absolutely have their place, and that place is likely not your system. Quit overcomplicating everything, please. Please.

        • federicoponzi 8 years ago

          I love this comment, because it's exactly how I feel about when people talk about these systems, designed for big scale.

      • lern_too_spel 8 years ago

        Anybody who needs more availability than an individual instance can provide.

    • cglace 8 years ago

      That's all great until you need to perform a join.

      • overcast 8 years ago

        RethinkDB handles joins just fine :)

        • elcritch 8 years ago

          As does Couchbase. :-) Personally I like map-reduces.

          • ifcologne 8 years ago

            So does the multi-model database ArangoDB. https://docs.arangodb.com/3.2/AQL/Examples/Join.html

            And some NoSQL databases speak SQL as well - without being relational.

            I like the JSON support in PostgreSQL a lot. It makes it very easy to deal with unstructured JSON data while still keeping common attributes in a relational format. But there are more cases than one might think - as a relational guy - that benefit from graph databases, document stores or optimized time-series databases.
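
The idea of keeping common attributes in relational columns while storing the unstructured part as JSON can be sketched as follows. To stay self-contained, this uses Python's bundled SQLite and its `json_extract` function as a stand-in, and it assumes a SQLite build with JSON support; it is not PostgreSQL, where the analogous approach would typically use a `jsonb` column with operators such as `->>`. The `events` table and its fields are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,  -- common attribute, stored relationally
        payload TEXT NOT NULL      -- unstructured JSON document
    )
    """
)

rows = [
    (1, {"type": "click", "target": "signup"}),
    (1, {"type": "view", "page": "/pricing"}),
    (2, {"type": "click", "target": "login"}),
]
conn.executemany(
    "INSERT INTO events (user_id, payload) VALUES (?, ?)",
    [(user_id, json.dumps(doc)) for user_id, doc in rows],
)

# Filter on a field inside the JSON, and select it next to a regular column.
clicks = conn.execute(
    "SELECT user_id, json_extract(payload, '$.target') "
    "FROM events "
    "WHERE json_extract(payload, '$.type') = 'click' "
    "ORDER BY id"
).fetchall()
print(clicks)  # [(1, 'signup'), (2, 'login')]
```

Documents with different shapes share one table, and the relational column (`user_id`) remains available for ordinary joins and constraints.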

  • ChemicalWarfare 8 years ago

    NoSQL is a great fit for OLAP-type systems where there are tons of high-volume writes, eventual consistency (or BASE in general) is good enough, a strict schema is not enforced, and the consumer (a data scientist, a customer service rep, etc.) is not affected too much if they have to wait a few extra seconds for the search results to come back.

hdra 8 years ago

I highly recommend Martin Kleppmann's Designing Data-Intensive Applications (http://dataintensive.net).

It will not only help you understand what "SQL" and "NoSQL" data stores are, it also covers the differences between them, what problems they are designed to solve, how they try to solve them, and whether they'll help with your problems as well.

ozanonay 8 years ago

I teach a course on database systems, including one class on distributed databases (like Dynamo and Spanner) and another on dataflow engines (like MapReduce/Hadoop and Spark).

Students seem to find the Dynamo paper to be the single most enlightening resource. It does a great job of explaining Amazon's use case and how the solution fits the problem. I also reference the relevant Red Book chapter and some students value that context.

It's worth noting that students are very comfortable with relational DBMSs by this point, both in theory and in practice. It quickly becomes clear to them that NoSQL is better called "no transactions", as they know the costs and benefits of various isolation levels in a traditional RDBMS. If you don't yet have an undergraduate-level background in database systems I'd encourage you to seek that out either first or at least along the way to understanding NoSQL systems. My recommendations for how to do this as a self-learner are up on https://teachyourselfcs.com.

  • itcmcgrath 8 years ago

    Yet many non-relational systems do support ACID transactions across multiple resources. Just from Google there is Megastore, Cloud Datastore, Spanner, Cloud Firestore

zzzcpan 8 years ago

Distributed systems. Consensus [0], CAP, PACELC theorems [1], CRDTs [2], maybe Chord DHT [3] for hash rings. Oh, and jepsen.io for actual database choices.

[0] https://en.wikipedia.org/wiki/Consensus_(computer_science)

[1] https://en.wikipedia.org/wiki/PACELC_theorem

[2] https://en.wikipedia.org/wiki/Conflict-free_replicated_data_...

[3] https://en.wikipedia.org/wiki/Chord_(peer-to-peer)
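
For the hash ring idea mentioned above, here is a minimal consistent-hashing sketch. It only shows the placement rule (a key belongs to the first node at or after its position on the ring); it is not the Chord protocol itself, which adds finger tables and node join/leave handling. The `HashRing` class, the node names, and the 32-bit ring size are assumptions made for illustration.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    # Map any string onto a 2**32-slot ring using a stable hash.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class HashRing:
    def __init__(self, nodes):
        # Each node sits at the ring position given by hashing its name.
        self._points = sorted((ring_position(n), n) for n in nodes)
        self._positions = [p for p, _ in self._points]

    def node_for(self, key: str) -> str:
        # The owner is the first node at or after the key's position,
        # wrapping around to the start of the ring if necessary.
        index = bisect.bisect_left(self._positions, ring_position(key))
        return self._points[index % len(self._points)][1]


keys = [f"user:{i}" for i in range(1000)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])

moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
# Any key that changed owner must now live on the newly added node.
moved_to_new_node = all(after.node_for(k) == "node-d" for k in moved)
print(len(moved), moved_to_new_node)
```

The property checked at the end is what makes adding and replacing nodes cheap: keys that do not fall in the new node's arc keep their previous owner.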

TruthSHIFT 8 years ago

The most important thing to understand about NoSQL is when you should use it. For many circumstances, NoSQL isn't the right tool for the job. The key is being able to recognize when it is.

I'm still learning how to determine when I should use NoSQL instead of SQL. My best advice is to carefully consider how you plan on querying your data. If you plan on making complex queries that link multiple relationships, NoSQL is not for you.

  • analogic 8 years ago

    Or in a slightly different form, what I'd personally love to always have an answer for: Is there a way to run this complex query in reasonable time in an RDBMS, or do we have to force it into a NoSQL-ish solution (say, Solr)?

    After I've optimized my query/indexes to get from 60s to like 4s by running through the usual stuff and trying not to do anything too stupid, how do I get it to <200ms? Maybe a better question: how do you structure the data so you don't need the complex query?

sciurus 8 years ago

Seven Databases in Seven Weeks https://pragprog.com/book/rwdata/seven-databases-in-seven-we...

Designing Data-Intensive Applications http://dataintensive.net/

brudgers 8 years ago

Same as for SQL databases: Readings in Database Systems, 5th Edition -- Peter Bailis, Joseph M. Hellerstein, Michael Stonebraker, editors

http://www.redbook.io/

WoodenChair 8 years ago

As a starting point, if you have little background in NoSQL, I strongly recommend this 1 hour talk by Martin Fowler: https://www.youtube.com/watch?v=qI_g07C_Q5I

It's slightly dated, but it still gives a strong overview of the different paradigms. The truth is that what you want to learn probably differs greatly depending on the paradigm that fits your application. NoSQL databases can broadly be categorized into document-oriented, key-value store, columnar, and graph. This video will help you understand what (at least three of) those are. Then you can focus on books/articles about the paradigm that makes the most sense for you.

rolandm 8 years ago

Video from Martin Fowler about Introduction to NoSQL: https://www.youtube.com/watch?v=qI_g07C_Q5I

Tutorial from Felix Gessert about NoSQL https://medium.baqend.com/nosql-databases-a-survey-and-decis...

and Slides https://www.slideshare.net/felixgessert/nosql-data-stores-in...

zitterbewegung 8 years ago

Designing Data-Intensive Applications [1] is a good all-around book on building applications and managing the data they handle, including with NoSQL.

[1] See http://dataintensive.net

dahart 8 years ago

I don't know of any mandatory books or theory about NoSQL, I picked it up on the fly using Firebase for a web app. Not affiliated, but I'm a reasonably happy customer. It's super easy to learn, and they have lots of tips and pointers about how to use it well, as I'm sure others do.

Their tips are here, and I think this applies to most/all NoSQL (someone correct me if I'm wrong.) https://firebase.google.com/docs/database/web/structure-data

The tl;dr is:

- Avoid complex queries. Structure data so that you can make simple queries that execute fast.

- Avoid nesting & flatten data as much as is reasonable.

NoSQL is easier to learn & use than SQL, and there's a lower barrier to entry, but the trade-off is that it's less powerful than SQL, so you have to keep your data simple too.

  • jklein11 8 years ago

    >Avoid nesting & flatten data as much as is reasonable.

    Isn't this contradictory?

    • dahart 8 years ago

      Yes, it is a little bit - if you mean that one reason to use NoSQL is to store nested JSON.

      This is referring more to schema than data. In part what that means is to avoid nested indexes... subtle but different than avoiding any nesting at all. In other words, if you can treat the nested data as a blob, it's probably okay, but if it's being used for a query, it's adding complexity that can cause trouble.

      Some of the reasons for that are Firebase-specific, it has to do with security rules and how security can get too complicated if you're not careful with nesting.

      But I'd guess it still applies to other NoSQL data... nesting data as part of the schema is like making another table, and all the complexity that comes with it. Except it's a new table you can only get to by going through the first table.

      A common problem with nesting is thinking you got the order right for your use case and later finding out you sometimes want to index by the inner data rather than the outer data. If you only have A/B (B nested in A) and you need to query for As, then you're fine. When you find out you need to query for Bs, you have a problem.

      Firebase even recommends duplicating data, if necessary, to have two indexes A/B and B/A, rather than trying to query for nested data.
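
The A/B and B/A duplication can be sketched with plain dictionaries standing in for the JSON tree. The paths `groups_by_user` and `users_by_group` and the helper functions are hypothetical, and no Firebase API is used here; the point is only that every write updates both indexes, so a lookup in either direction is a direct read rather than a query over nested data.

```python
# Two flat indexes over the same membership facts.
db = {
    "groups_by_user": {},  # index A/B: user -> groups
    "users_by_group": {},  # index B/A: group -> users
}


def add_membership(db, user, group):
    # Duplicate the fact in both directions in the same write step.
    db["groups_by_user"].setdefault(user, {})[group] = True
    db["users_by_group"].setdefault(group, {})[user] = True


def remove_membership(db, user, group):
    # The cost of duplication: both copies must be kept in sync.
    db["groups_by_user"].get(user, {}).pop(group, None)
    db["users_by_group"].get(group, {}).pop(user, None)


def groups_of(db, user):
    return sorted(db["groups_by_user"].get(user, {}))


def members_of(db, group):
    return sorted(db["users_by_group"].get(group, {}))


add_membership(db, "alice", "hikers")
add_membership(db, "alice", "chess")
add_membership(db, "bob", "hikers")

print(groups_of(db, "alice"))    # ['chess', 'hikers']
print(members_of(db, "hikers"))  # ['alice', 'bob']

remove_membership(db, "alice", "hikers")
print(members_of(db, "hikers"))  # ['bob']
```

With only the first index, answering "who is in this group?" would mean scanning every user, which is the problem described above when you later need to query for Bs.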

manigandham 8 years ago

Start with a general understanding of SQL/NoSQL/ACID/CAP and how they relate: https://www.quora.com/What-is-the-relation-between-SQL-NoSQL...

Then read this book for in-depth details - Designing Data-Intensive Applications : https://dataintensive.net/

unkown-unknowns 8 years ago

I found the book CouchDB: The Definitive Guide to be a good introduction when I first read it some years ago. I bought the dead tree edition but they have an online version that I think may have been updated.

http://guide.couchdb.org/

opendomain 8 years ago

- NoSQL Distilled

- NoSQL for Dummies

- 7 Databases in 7 Weeks

- NoSQL for Mere Mortals

- Professional NoSQL

and of course the original papers from Amazon and Google.

If you have more questions - contact me at HN AT NoSql dot Com
