Settings

Theme

Startups should use a relational database

raycmorgan.com

57 points by raycmorgan 12 years ago · 78 comments

Reader

robconery 12 years ago

One thing that could likely get you fired rather quickly is running analytics on your live transactional system. Yes, your business needs to make decisions based on data, this is not terribly new. To think that you only have one data store is a bit short-sighted.

Many businesses (including startups) have moved to using document stores for high read environments and scraping nightly drops to their backend analytics systems. This is smart - you don't want to run summing/aggregation on a live transactional system for (hopefully) obvious reasons.

EDIT: it's also worth noting that map/reduce is typically much more powerful when aggregating large datasets. When trying to run analytics on top of a transactional system, developers like Ray here would end up with multiple joins and groupings - all of which slow everything down. Map/reduce certainly isn't perfect, but the author dismisses it as difficult witchcraft when, in practice, parallel execution of MR queries can greatly decrease resources and time to information.

I sort of think we've moved beyond this discussion.

  • nostromo 12 years ago

    The disconnect between your comment and the article is the term "startup" now means giant companies like Airbnb and tiny two person companies that haven't yet created an MVP.

    I think this article is targeted at the latter: pre-MVP and just post-MVP. For those startups, having two databases with one dedicated to a backend analytics system reeks of premature optimization.

    • karterk 12 years ago

      Having taken this path from a 2-person running a "nobody cares about this" app to an app with some decent traction, there are always other things more worthwhile to do in the early stages than trying to get insights from the invariably small amounts of data in the DB. A 2-person company should be talking to users rather than trying to analyze patterns from database tables. Sample size is just too small, and you will be evolving so fast that the trends will almost be meaningless.

      If you grow to be more than 2-people, then taking a mirror dump of the prod db to run queries against is pretty trivial effort.

    • vidarh 12 years ago

      Think of it this way: You really ought to have a failove/disaster recovery copy of your database, and an easy way of making offsite backups.

      Set up a slave of your MySQL or Postgres database (no the slave is not a backup, but dumping it regularly is an easy first step to a very basic backup setup), hosted in a different data centre, and if you're a two person company you now also have somewhere to run analytics whenever you feel ready.

      It can be <1 hour effort and a few tens of dollars a month in extra costs for a small system, yet makes a tremendous difference in resilience and gives you that db to run analytics against "for free" whenever you do need it.

    • VLM 12 years ago

      "having two databases with one dedicated to a backend analytics system reeks of premature optimization."

      Its free software, you don't have to pay for two instances of Oracle.

      One thing that will quickly kill a biz is combining the functions of PROD and DEV/TEST. Making the DEV/TEST box the DEV/TEST/REPORTS box is not a big deal, and you can't run a (real) biz without a DEV/TEST box.

      • waps 12 years ago

        Having run the technical side of ~5 "small" businesses now (no more than $1.5M revenue), I disagree.

        Eventually, nothing can match the performance of storing binary blobs on a cluster. But that only becomes worthwhile if you database is significantly larger than a terabyte. And I'm only talking about the operational "core" database, not your "data warehouse" (the log dumping ground, which should be split off when your database gets to be a few dozen gigs).

        Meanwhile, mysql has big advantages :

        1) can do basic optimization with "ALTER TABLE", even (mostly) live.

        2) you can mix PROD and DEV/TEST (though obviously you need to use good judgement). Obviously you should also have a DEV/TEST instance for actual testing. Sometimes you want to run a test quickly against PROD though. Adding a slave, having it sync and then running against the slave is a joy.

        3) creating reports is quick, customizable and everything you want.

        4) It's "idiot-friendly". Employees can ramp up to the structure in a mysql db in 2 weeks flat. Try that with custom document stores.

        5) It's typesafe and relational safe (if correctly designed), with the advantages that brings : significantly less weirdness in the database.

        6) Phpmyadmin. Mysql workbench. Django. Php ...

        I'm even going to argue that the GP's argument, that running analytics on PROD can get you fired, is not just wrong, it's actually an advantage of using mysql. (And the open source SAP database can run "live" analytics. You just can't believe how great that is for dashboards)

    • robconery 12 years ago

      True - but the backend analytics system can be Excel :) which it often is. In fact in my "two person startup" (that was alive happily for 5 years) this is exactly what I did :). I do take your point, however. I still think it's wise to not have an admin backend that runs rollups on your live system (which we've all done).

  • hedgehog 12 years ago

    It would be unusual that someone would run slow queries against the production master, typically you either use a replica or a backup restored into a separate environment. In my experience SQL has been pretty good for exploratory querying. By the time a company grows out of that setup they'll probably understand what tools will answer the questions they've got at scale.

  • marcosdumay 12 years ago

    What I don't get about those discussions is why people assume databases that can ONLY do map/reduce are a good thing? SQL is perfectly able to express map/reduce operations, and they main relational databases handle it almost as well as it's possible.

    • robconery 12 years ago

      Completely agree here. Your live transactional system needs to export reportable data - this can come in the form of a DB backup/download (which I've done many times) so you can run queries locally or by pushing CSVs using a cron job.

      You can use a SQL query for this or a simple map/reduce - either way my argument is to focus on transactional design that works best for your system - don't conflate reporting needs with your live system (i.e. BA queries).

lmm 12 years ago

The unhappy truth is that for many startups, relational integrity and transaction safety are simply not very valuable. Customers of an early-stage startup are by definition willing to take a risk on whatever they're getting from that startup. So simply not thinking about these problems - accepting that occasionally a partial write will happen, or two writes will collide, or a migration will not quite work correctly and your pages will crash until it's fixed - is a worthwhile sacrifice to increase development speed.

  • stiff 12 years ago

    This is just a very bad excuse for doing ridiculously shitty systems that stay shitty way after the startup phase. If you at all bother to understand your problem domain, and are not just a monkey at the typewriter, writing down a good domain model and adding the constraints is not going to decrease development speed, quite to the contrary, it's going to increase it, web developers often spend whole workdays just tracking down in the logs the "story" of some now angry customer that happened to violate some unformalized assumption of the system and got mishandled later in the process, especially if this customer happened to pay already. Not to mention that with a good domain model, the code for the individual functionality flows out naturally, while with a shitty one, you might end up with three times as much code for the same thing.

    Also, those mistakes in modelling the domain and in enforcing the constraints are often there to stay and slowly become impossible to fix, once you have 10000 records that do not fall into a few well specified states, it's hard to go through all of them, find some common denominators, and migrate the database. Not to mention that with the mess people can do in the code, and with the messy stack in use today, it's easy to introduce bugs that might be hard for anyone to notice but seriously harm your business.

    The amount of fashionable nonsense in software engineering seems to be higher than ever, unfortunately.

    • raverbashing 12 years ago

      > If you at all bother to understand your problem domain

      This is a startup. There is no problem domain. There is no spec. There's none for a startup. You may not even have code.

      If you're doing a "startup" but have this ironed out, great, everything you said then applies, but it's not a startup, since you found your business model, it's a small company.

      And I would love to enforce constraints on the DB but unfortunately, I already had "primary keys" that repeat, unbeknownst to the project customers.

      • marcosdumay 12 years ago

        > This is a startup. There is no problem domain.

        Is the startup just a game developers do on their free time? Because if it is trying to solve a problem, there is a problem domain, even if it's the wrong problem.

    • lmm 12 years ago

      If you understand the problem well enough to write a good domain model then you're basically already there - which is to say, you can only know you're modelling the right domain after you've found product/market fit. But to be able to find that out you need some level of product, which often means having working code before you fully understand what your product really is. At that stage, a schemaless database is very helpful; working out your domain model in detail and typing it out twice would be a waste of time when it's likely to change quickly.

    • memracom 12 years ago

      Why do you call it "software engineering"? Do they try to follow SEMAT? SWEBOK? If not then they probably are just software hackers and do not deserve to be called software engineers. A real software engineer is someone who follows rigourous engineering processes, does not reinvent the wheel and who does their work by applying the lessons of those who have gone before. It requires mental discipline to do this, in particular because you have to resist the urge to reinvent the wheel.

      Reinventing the wheel is great fun. Without constraints you can appear to be amazingly productive. But it is all a fantasy because without constraints and a rigorous process you are probably piling up technical debt faster than anything else.

      • memracom 12 years ago

        And what is wrong with pointing out that "software engineering" is not used to talk about engineering?

  • raverbashing 12 years ago

    Exactly

    Relational DBs may even be the wrong choice for some specific problems.

    And about this quote from the article: "At some point, you will need to ask your primary database questions. If you chose the wrong database, this is where things get tricky. "

    Yes, this is correct. However, I know how to read the manual of whatever db I'm using and maybe code something simple to process the output to the format of my liking.

dclara 12 years ago

I'm totally with you. We've experienced to have Object database or XML database and NoSQL database. Now we understand that relational database is just the right way to go for web applications, because it deals with structured data so well, keeps querying and sorting, filtering seamlessly and effortlessly. It's a must.

It is the same thing for choosing Linux distribution and JDK mode. See the references here:

http://bingobo.info/blog/table-of-contents.jsp

BTW, your title should have "relational" instead of "relation".

HorizonXP 12 years ago

Serious question: what are NoSQL databases really good for? I'm only really used to relational DBs, and I'm unclear about which problems a NoSQL database is useful for.

  • BadassFractal 12 years ago

    Take Redis for example: simple KV database with in-memory perf and a very comfortable API with option to flush data to disk periodically. Addresses a lot of interesting scenarios such as session storage, request throttling by a certain key etc. Not used as a replacement for a RDBMS most of the time, but rather as a specialized tool for a certain use case. At some point your RDBMS is already hammered hard enough, you don't need to dump everything into it, but ultimately you could, especially at first. Comes with clustering for free as long as you're aware of the risks and use it appropriately.

    This is not to say you couldn't turn Postgres into something similar. Put the data on a ramfs, relax various write guarantees (lots of knobs in PG) etc.

  • nl 12 years ago

    which problems a NoSQL database is useful for

    It depends in which NoSQL database. Depending on your problem, you can probably find a NoSQL DB optimised for it. It will often be unclear if that NoSQL DB is actually better than a relational database until you try it.

    Examples:

    High write throughput: Cassandra

    Simple key-value: Redis

    Text search: Solr/Elastic Search

    etc..

    • twic 12 years ago

      Redis does more than simple key-value - rather than just reading and writing values, since values can be complex types like lists and dictionaries, it can insert into them, append to them, etc. It's still key-value, it's just not simple key-value!

      That said, i would hesitate to describe Redis as a database at all. A key characteristic of databases is that they store every write in a durable way. Redis can checkpoint its state periodically, but as i understand it, it either can't or typically isn't used to safely keep every write. Redis is something in between a database and memcached. I doubt there's ever a situation where you have to choose between PostgreSQL/Cassandra/CouchDB and Redis; Redis is something you would use in addition to a database.

      As for text search - RDBMSs have full text search, and at least in the case of PostregSQL, it seems pretty good - see slide 49 in http://es.slideshare.net/billkarwin/full-text-search-in-post... from 2009. You might not want to be leaning on your database for text search when you're at scale (for operational reasons more than performance ones), but it's a plausible way to start.

  • SiVal 12 years ago

    Somewhat similar to the difference between C/C++ or enterprise Java and various dynamic or "scripting" languages in terms of structure, the "S" in SQL or NoSQL. The former tends to require more up-front structural design to do anything, which is a nuisance at first when you're not sure what you will end up needing. The latter tends to be quicker to throw together and repeatedly rearrange in the early days of its evolution, which is a huge advantage to a small startup trying to figure out what business it's in, but often runs into trouble later, once the needs have stabilized and the structure becomes more of a benefit than a cost.

    For areas other than structure, the difference can be more like the difference between a general purpose programming language and a domain-specific language. SQL databases are general purpose, so they offer a large, nicely integrated package of features, while NoSQL databases often sacrifice some of those features to optimize for some niche. From this perspective, the SQL database can be the best choice for a startup trying to find its niche with the NoSQL chosen as an optimization once that niche has been identified.

    The fact that a new business can't be sure what its business will end up being is an argument both for and against using a SQL database.

  • AznHisoka 12 years ago

    I have been using redis as a write though cache(not just a read cache). So I used to have 1000 concurrent writes to my Postgres db. Obviously this had a negative impact on performance even with modern specs(64gb ram, 8 cores). I realized that a lot of those writes we're updating the same column for the same row but in different threads. So I basically use redis to buffer these queries and perform 1 single write instead of multiple writes for the same row.

    • robbles 12 years ago

      What's the algorithm you use for consolidating those repeated writes? I've thought about doing something like this in the past, but was put off by the worry that there might be edge cases where data would be lost (e.g. insufficient writes to trigger a write through, many concurrent writes, etc.)

      • AznHisoka 12 years ago

        So I actually store the rows as serialized JSON objects in my redis cache, and when a write occurs, I update the JSON representation in the cache (if it's not in the cache, I read it from the DB and store it there). After X hours, I have a daemon that goes through all expired keys, deserializes the JSON, and executes a transaction, updating 500 objects at a time. To prevent possible race conditions, you need a mutex to allow only 1 thread to modify the cache for that row.

        Each object has a field that keeps track of what fields were updated, so I can construct an update query from that.

        When I read a row from the DB, I first check if it's in the write through cache, if so I retrieve it there.

        Of course, this means that all writes go through my cache. There can't be any other process that updates that table that avoids writing to the cache, or else we have database inconsistency.

        This improved the load in my DB server by more than ten-fold, as well as the indexing time to my ElasticSearch server.

        • robbles 12 years ago

          Sounds pretty effective to me. Thanks for sharing this.

          Possibly stupid question - how do you retrieve all the expired keys? I was under the impression that once keys expired, they were effectively invisible to clients?

          • AznHisoka 12 years ago

            So I actually set a soft time to expire as a value for the key. My daemon reads all objects < this time, and deletes those keys itself. The hard time to expire is set to a time much further than this to prevent the scenario of the daemon exiting for some reason, and memory overflowing.

  • marcosdumay 12 years ago

    When you remove non-local constrains (like foreign keys) you gain scalability, that enables companies like Google to query their data better.

    Now, if you are doing anything more mundane (like, for example, logging every bank operation in a country) that's overkill, and non-local constrains are a great thing to have. Thus, I'd advise not using it unless you have a clear necessity.

    And always keep in mind that changing yor data is hard, whatever technology you are usig. Relational databases making the change itself hard, while schemaless databases make maintaining the software that reads it hard. Schemaless does give you a small upfront benefit, for a huge cost later.

  • VLM 12 years ago

    Via analogy:

    RDBMS = swiss army chainsaw, can do anything although maybe painfully and slowly.

    NoSQL = 5/16th inch torque pre-set wrench for tightening sma connectors to 10 inch-lbs torque. Can't do much else, but its near perfect at its tiny little area of expertise.

    On a meta level its a battle between hardware and applications, where app demands have (momentarily, of course) fallen behind commodity hardware ability.

  • microtonal 12 years ago

    For example, XML databases are handy when you have large XML datasets that you like to query. The database indexes the XML, allowing you to execute most XPath queries and XQuery programs quickly.

    • taspeotis 12 years ago
      • microtonal 12 years ago

        It's not about indexing paths you indicate, XML databases are about indexing the whole structure (remember, XML is not relational, but are graphs). Besides that they provide XPath/XQuery processing and optimization. You can query large XML documents or sets of documents, like you'd query them with e.g. XQilla.

        Of course, it is possible to implement all of this on top of existing database technology. E.g. Oracle's Berkeley DB XML is implemented on top of Berkeley DB. But, a relational database with some indexing of XML does not provide the same functionality as an XML database.

        • taspeotis 12 years ago

          > It's not about indexing paths you indicate, XML databases are about indexing the whole structure

          From my first link, MSSQL does both:

              The primary XML index indexes all tags, values, and paths within the XML instances in an XML column
              ...
              To enhance search performance, you can create secondary XML indexes. A primary XML index must first exist before you can create secondary indexes
          
          And it supports some reasonable subset of XQuery [1].

          [1] http://technet.microsoft.com/en-us/library/ms187854.aspx

        • sixbrx 12 years ago

          Oracle for one does allow full document xml indexing (via XMLIndex) which is trivial to implement, also a rather full featured XQuery integrated with SQL.

  • daliusd 12 years ago

    Map-reduce style scalability. SQL databases are not good for that but can be used for that as well. However then your DB will be not normalized most probably.

    But again different NoSQL DBs have different purpose.

Gulthor 12 years ago

People too often forget about graph databases when talking about NoSQL solutions. Graph databases offer an interesting and elegant alternative to relational databases and I could definitely see a startup decide to use this kind of technology.

As far as I know, most graph databases support transactions and offer great scalability. Such databases are also schema-less and can be queried with Gremlin, a powerful graph traversal language (see www.tinkerpop.com).

With respect to scalability and transactions, Titan (http://thinkaurelius.com/) looks very promising: it supports various backends for storage (Cassandra, HBase, etc.) and indexing (currently Elastic Search and Lucene). Graph analytics can be done via Faunus (http://thinkaurelius.github.io/faunus/), backed by Hadoop.

There are other vendors out there (Neo4J, OrientDB, etc.) which offer interesting solutions worth looking at - I'm just a bit less familiar with them.

The major downside I see with graph databases is that most of them are fairly recent and their ecosystem is tiny (though growing). Should a startup venture on such young technologies, or stick to mature and battle-tested solutions (ie. relational databases)?

Could startups use this kind of graph "NoSQL" databases? I don't see why not. If your startup is some kind of social network, graph databases are certainly an option worth considering. If I were to create a startup, I'd hardly use a document database like MongoDB but I will really consider using a graph database. In the end, it's all about having the right tool in hand, and knowing how to assert what is "right" for you.

adamnemecek 12 years ago

A blog post titled "Startups should use NoSQL databases" in 3, 2, 1, ...

VLM 12 years ago

There can only be one DB much like the LOTR can only have one ring. Why? Thats the only area the linked article falls down on. Its a pretty good article other than that.

So you properly normalized your entire system, customer billing transaction records all the way up to article tags. Then article tags gets too huge. So next version looks at RDBMS and Redis, and the next version after that only looks at Redis. Customer billing transactions remains on a "real" DB and the tag cloud lives on redis. And the problem with that is... what exactly?

Its obsolete thinking. I can't have two databases because we're a poor startup and the only databases that exist are DB2 and Oracle and everyone knows they're super expensive so super expensive times two is unaffordable. Dude, its almost 2014 not 1980, Postgres/mysql/redis its all free.

  • raycmorganOP 12 years ago

    Thank you for your comment. I agree that there is nothing wrong with using a plurality of systems when needed. Your example of moving from one to another is great! Start with a simple system, and once you find bottlenecks, optimize with specialized stores.

jwilliams 12 years ago

This comes up every now and then on HN. There are plenty of NoSQL horror stories.

Thing is. Most SQL database at scale is a bit of a horror too. Have you seen real-life production relational databases? Gawd. Hacks on hacks. Then you add another database. And another analytics database. And a bunch of point to point data feeds. Argh.

But hey. That's data.

If you think choosing SQL will solve your analytics woes down the line -- it's just not true. You're in for some pain no matter what you do.

... That's unless you get a porcelain schema first time. Which, if you're in a startup, probably means you're working on the wrong problem.

That's not an argument for using NoSQL (I used MongoDB daily, but I've got plenty of love for PostgreSQL). It's a rebuttal that SQL magically solves a different problem.

luzero 12 years ago

Startups should use the right tool.

A relational database might be just as wrong as a nosql one if all you need is redis.

sedlich 12 years ago

Strongly disagree with the article as simplification always looks shiny. Start-Ups should sit back for a few hour and days and invest the work to answer some serious questions as these http://nosql-database.org/select-the-right-database.html (there are other cataloges like this one).

Then you get a little closer to the truth.

lampe3 12 years ago

I often read that argument that NoSQL Databases are Schemaless and yes the Database is but your Data is or it isn't. YOU must know your Data.

"All the while moving work onto the developers to standardize how they handle different migration cases."

I know a startup is fast and bla bla... BUT your team should know the tools that you are using... For me SQL DB's force me to add a new field and some kind of value and i don't like to be forced to a solution.

"In document stores, you have two choices: store related data as sub-documents, or store related data as separate documents with references. It is up to the developers to understand the trade-offs of both approaches. Selecting one over the other can lead to performance gains or issues, scalability issues and above all, make asking certain questions of the data a lot harder."

Again know the tools you are using. And for example MongoDB has good ORM's too.

"But that takes much more forethought and is dependent on a particular problem."

If your startup is doing something new and shiny you don't have the knowledge and forethought and you often dont know what particular problem will come at you.

Most of the point's look like: You learned at your University SQL now you know it(but in really life you don't) and now use it because you know how to normalize a Database. This argumentation is often used to say why java is so great or why javascript is bad.

I personally started with php then moved to rails and now to meteor(uses MongoDB) and we never before meteor could make so fast a good prototype which for a startup is very important.

So yeah if you are comfy with SQL use it if your comfy with NoSQL use it.

guard-of-terra 12 years ago

Don't they? For some tasks relational databases are good, for some they are worse. Call me captain.

However relational databases will have hard time with big data because your dataset is bigger than your database and you have no relational integrity.

RyanZAG 12 years ago

Depends what your startup is doing. If you are only using your database to store some basic transactions, then a relational database is a very good fit. This is really the case for most startups tackling common problems. However, if your startup is tackling a problem with unique technical challenges, then you can't just ignore the issue. For example, a geo-location startup tracking the location in real time of users with a free app is simply not going to be able to use a relational database.

  • robbles 12 years ago

    I don't see why not - use a relational database for storing long-term data, such as users, friendships, preferences, etc. and store ephemeral data such as location in a key-value store like Redis. Store summary statistics of the location data in your primary data store in scheduled background tasks.

    Just because you're storing a huge amount of one specific type of data, that doesn't prevent you from taking advantage of the features of a relational database.

  • Sanddancer 12 years ago

    Precluding yourself from using a relational database means that you're not going to be able to use some of the best tools available for geographical data. Things like PostGIS are built around, and heavily dependent on, the fact that it is Postgres. Now, you may not want to use an RDBMS for everything, but at the same time, you don't want to pull it completely out of the system either.

  • charliesome 12 years ago

    > For example, a geo-location startup tracking the location in real time of users with a free app is simply not going to be able to use a relational database.

    Why not?

    • nl 12 years ago

      Rapidly growing, infrequently queried data is not the ideal scenario for most relational databases.

      1) Relational databases typically aren't optimised for write-throughput. It's quite possible to do it, but you'll need fast and large disks (eg, FusionIO in a SAN or something).

      2) Location-tracking applications typically don't require interactive queries - generally it is more a batch-based system that can be run offline.

      Saying you are not going to be able to use a relational database is overstating it a bit in my view.

      Clearly you can make it work, but something like Cassandra will give you better write thoughput, won't force you to rely on a SAN/NAS for data storage and will let you use Map/Reduce to batch process the data.

  • VLM 12 years ago

    Total disagree. The most important job of a startup at startup time is probing the market.

    If you have a billion customers each with a million records you are the Google Maps location thingy and are not a startup anymore and a relational solution may, or may not, work.

    If you have ten users the ideal database is probably ten interns and some whiteboards. I'm not kidding.

    I observe there is a common claim brought up every week on HN about fake it till you make it. You don't automate (insert menial task here) until you have ten customers.

    I observe there is another common claim brought up every week on HN about how you need a massively scalable design which can never change from day one, because your schema, either formal or informal, is perfect and unchanging, LOL.

    Those two stylistic outlooks are not compatible.

    Start with something that has too many features, too many abilities, too much room for expansion, like a relational DB, and then later on if you need to, put some stuff on another platform. IF you need to. IF your company survives. Lots of IF.

    • RyanZAG 12 years ago

      Fake it till you make it is why you'd use a quick and dirty mongodb with json dumped straight from your clients, though. Maybe we have different ideas on the speed of development on mysql vs mongodb though.

      • VLM 12 years ago

        Also probably different outlooks on which letters to emphasize in CRUD, and probably different assumptions about what "all" startups are doing with a database, anyway.

yeukhon 12 years ago

I don't know. It's a hard question. MongoDB is pretty much used in any hackathons simply because it's easy to setup, driver support is good, and schemaless. The last one is really why people use MongoDB over SQL DBMS. For startup, there might be a concern that schema migration is tough.

But one can argue that not careful with schema design can break api and make codebase messy.

I guess I will stick with the hard work now... I guess not careful with schema will definitely bite me.

  • microtonal 12 years ago

    I don't know. It's a hard question. MongoDB is pretty much used in any hackathons simply because it's easy to setup, driver support is good, and schemaless. The last one is really why people use MongoDB over SQL DBMS.

    I find that a poor argument. One can use an ORM that automatically creates a schema based on classes. E.g. I like Ebean with DDL generation. You just write classes and add @Entity annotations. Ebean automatically creates the schema. Combine this with an embedded database, such as h2, and there is virtually nothing to set up.

    Once you are out of the rapid iteration phase, you can take the latest Ebean generated schema and use to proper migrations for later changes.

  • est 12 years ago

    Another reason to choose MongoDB is built-in array and nested dict support, with good enough indexing.

    So you don't have to create bullshit m2m tables with tedious joins for a fucking tagging system

    • mwhite 12 years ago

      Obligatory JSON and hstore in Postgres comment.

      • est 12 years ago

        stable version, every value is string. No atomic incremental operations, no nesting, shitty index.

        • mwhite 12 years ago

          JSON has non-string values and nesting. [1]

          I read somewhere that nesting in hstore is coming in the next version (Q3 2014?) and non-string types are on deck.

          Compared to the nightmarish development workflows and processes I've had to deal with resulting from using CouchDB as a main datastore, having to get the entire JSON value in order to update one key seems like not that big of a deal. What NoSQL databases even let you do incremental operations in that sense?

          Shitty index? It seems like you should be able to make an index on a value inside the JSON just as easily as any other index.

          Then maybe some advanced features of Postgres can really shine: http://www.postgresql.org/docs/8.3/static/indexes-bitmap-sca... https://wiki.postgresql.org/wiki/Index-only_scans

          I'm also exploring a solution for abstracting that as a normal, non-JSON table for semi-structured data using views.

          Basically, it seems like for semi-structured data where you know what the schema is, but maybe it just changes over time or isn't 100% certain, so it's not possible to store it using a typical schema, JSON + indexes + views offers the best of both worlds.

          [1] http://clarkdave.net/2013/06/what-can-you-do-with-postgresql...

joshguthrie 12 years ago

"Blog titles should stop using should."

pkolaczk 12 years ago

He forgot one of the very important reasons to use (some) NoSQL databases: high availability. Relational database systems are very poor at providing that. Most often the availability options are limited to resistance to node failures. RDBMSes have several SPOFs and must use failover which is not dependable, hard to test, and in many times needs manual intervention. Forget resistance to network partitions.

  • jahewson 12 years ago

    CAP theorem tells us that you can't have availability without sacrificing consistency or partition tolerance, which means that there isn't a NoSQL database which can do that either.

    It is not true that relational databases must have a single point of failure (SPoF) or must use failover: MySQL Cluster is a sharded multi-master distributed database without a SPoF.

    On the other hand Redis, for example, is a master-slave failover NoSQL datastore.

    • pkolaczk 12 years ago

      CAP theorem says it cannot be done at the same time. But it is perfectly fine to sacrifice consistency for availability at the time partition happens and restore consistency once the partition is fixed. Still better than nothing if revenue counts. Financial institutions do like that all the time.

  • gaius 12 years ago

    None of what you wrote is true.

    • pkolaczk 12 years ago

      And you provided no arguments at all. Sure, there are things like multi-master replication in RDBMS world, but I yet have to see a scalable system that utilizes it and at the same time is fully ACID compliant (which rules out async replication). We tried multimaster HA replication once, but the write throughput was terrible.

      • gaius 12 years ago

        There is no possible argument to make. There are a million - probably more - highly available RDBMS systems all over the world handling real money and real goods and services, 24/7. You can argue if you wish that tables are not a good way to store particular data structures, fine. But the claim that RDBMSs are unreliable is just ludicrous and had been for 30 years.

        • pkolaczk 12 years ago

          First, there is difference between reliability and availability. IMHO availability is a subset of reliability.

          Not saying they are unreliable per se, but making them really highly available is much, much harder than some NoSQL stores designed for HA, and the solutions are much more complex, usually beyond the point of being able to prove their correctness. It will cost you lots of effort, money and hardware. And to make it network-partition-tolerant, you'll have to give up ACID anyway, so one of the main advantage of RDBMSes over NoSQL stores goes away. IMHO not worth the trouble. In fact, most of the RDBMS systems operating in banks and insurance institutions I've seen were not even fully ACID. They were AD + eventually C and very relaxed I. You really don't need full ACID to handle money, it is just a convenient model for programmers.

          I can see systems based on RDBMSes, even the most expensive ones, claiming 7 nines availability on paper do actually fail and sometimes in a totally weird ways, that fixing the mess takes too long. I know some of the companies migrated to NoSQL stores exactly because of this reason - an expensive RDBMS cluster failing after hardware accident while another NoSQL cluster still operating fine in the same datacenter, despite networking problems. I've seen that simply way too often happening to big names, including a few commercial banks and telecoms in Poland, to believe in marketing of HA RDBMS store. Sure, all of them recovered (sometimes after minutes and in one case after a week) and none lost any data, therefore I'm not saying RDBMSes are unreliable ;)

          This is exactly a similar story as with scalability. Can RDBMSes be scaled? Yes, they can. But it is expensive, hard and requires very careful application design. It does not work automagically by "I'll simply normalize and throw my queries at it".

  • coolsunglasses 12 years ago

    Rarely matters for a startup.

    • pkolaczk 12 years ago

      For the startup I once worked for, it mattered much more than we had thought at the beginning. The investors were smart enough to notice we had some considerable periods of downtime. Additionally, once we got first million of users (not really that much and nowhere near the scale of Google or FB) we ran into performance problems which couldn't be easily solved just by indexing, optimizing queries or adding more hardware, and "buying" a beefy Oracle superserver was not an option as we didn't have enough revenue yet. So we had to dump joins, relax transactions, denormalize a lot and ended with a half-baked, bug-ridden NoSQL store on top of PostgreSQL, that couldn't even do horizontal partitioning well. I wished we had a proper solution like Cassandra right from the start. It would save us lots of pain.

      • coolsunglasses 12 years ago

        First million users. Come on. 99% of your audience is never going to have that problem.

        Especially if they spend their early days fucking with a Cassandra cluster instead of talking to customers.

        And it should be noted, you made it anyway.

        When you make it by the skin of your teeth, that means you probably timed it right.

        Preempting a problem far-ahead of time in startups means time and effort was wasted, especially if it was done before the existence of the problem was established.

        It is unreal to me that people still can't figure out how to apply Maslow's hierarchy to startups.

        • pkolaczk 12 years ago

          A million registered users is nothing for an MMORPG. You can get it pretty quickly even in a national-level game, without going global. A completely different story is keeping those users active and earning on them. The problem here is - you need to sustain a pretty massive load, yet only a few % of that load brings you revenue. And things like being out of service for even 10 minutes during peak hours (and peak load can be 100x higher than average load if you do special events in the game) can put you out of business or at least seriously worry potential investors.

          The sad thing is they actually didn't make it. I don't think the revenue ever crossed the cost of f*ing with all the scalability and availability problems. AFAIK currently they use Membase.

          • coolsunglasses 12 years ago

            A million registered users is $15,000,000 a month if you're charging $15 a month, which was standard for a long time.

            Are you fucking serious? A product that fails with a million users has a monetization problem not a technological one.

            • pkolaczk 12 years ago

              Not in the freemium model. In the freemium model it is easy to get many registered users because you're giving the basic version of the game for free and charging for extras. And if 10% of users pay for some extras at all, you may consider yourself very lucky. Typical paying users share is 2-5%. Another problem is 1 million of those users don't play all the time (we had only 5% of them logged at a time), however it is possible to get most of them active for a short time by organizing special events / competitions for them. Therefore you need to have capacity for handling that 1 million of users for a very short period of time, therefore you need to scale, but you are not going to make money from them constantly.

        • pkolaczk 12 years ago

          Additionally, when you do a startup and it succeeds it is already too late to redesign your app completely to make it scale. Changing the RDBMS in the middle of the game and migrating data is risky and would cost you probably more than using a properly scalable database right from the beginning.

          Additionally once you know you need to scale, your competition will see your product. Knowing the idea works, if they start from scratch, but using a better database for the job, they can easily put you out of the business, because instead of competing and adding new feauters, you'll be busy fighting scalability problems.

Keyboard Shortcuts

j
Next item
k
Previous item
o / Enter
Open selected item
?
Show this help
Esc
Close modal / clear selection