Jepsen Disputes MongoDB's Data Consistency Claims
In the circles I run in, MongoDB is regarded as a joke and the company behind it as basically duplicitous. For example, they still list Facebook as their first user of MongoDB on their website, but there is no MongoDB use at Facebook and hasn't been for years (it came in only via a startup acquisition).
I had the misfortune to use MongoDB at a previous job. The replication protocol wasn't atomic. You would find partial records that were never fixed in replicas. They claimed they fixed that in several releases, but never did. The right answer turned out to be to abandon MongoDB.
I was floored by this comment yesterday from one of their Developer Relations people:
> Did any of you actually read the article? We are passing the Jepsen test suite and it was back in 2017 already. So, no, MongoDB is not losing anything if you know what you are doing.
https://twitter.com/MBeugnet/status/1253622755049734150?s=20
Can you imagine saying the phrase "if you know what you are doing," in public, to your users, as a DevRel person? Unbelievable.
Generally speaking, there are many levels of “if you know what you are doing.”
- The system warns about unsafe usage at either compile time or runtime, and you ignore at your peril.
- The system does not warn, but official documentation is consistently verbose about what is required for safety.
- Official documentation isn’t consistently helpful and can be downright dangerous, but the community picks up the slack.
- The company gaslights the community into believing it is possible for a non-core-team member to “know what they are doing” from one of the above levels when Jepsen provides written evidence that this is not true.
I’m fine with things that are the third level from the top. I like to live dangerously. But I don’t think anyone can look at that last level and say “people are giving informed consent to this.”
The problem is that if you know what you are doing (i.e. you configure MongoDB to be safe) you may lose your main reason to use it (it no longer has the same performance, of course).
That comment wasn't yesterday, it was a month ago. It was actually after that comment that Jepsen decided to retest MongoDB and write this new article.
Ah, good to know. I saw it when someone I follow retweeted it yesterday.
Firstly, let me point out that this response is intended neither as a defence of MongoDB's defaults, which are atrocious, nor of the company, which is arguably duplicitous.
However I can _quite easily_ see how a non-native English speaker could use the phrase “if you know what you are doing” to mean “if you are careful”.
Even so, they're just implicitly admitting to the very accusations that they are defending against. If "running it right" is so insufficiently defined that Jepsen cannot figure out how to set it up, maybe your product just doesn't actually offer consistency. Sure, you're conceding at a different abstraction level, but it still means that you're failing to solve the problem of consistency for businesses that want to solve it with your product.
He just replaced "cannot consistently synchronize data" with "cannot consistently deploy a system that can consistently synchronize data". But what's the difference between those statements to people that need to solve that problem? None.
Perhaps, but the point is, if you're working in DevRel and you think your role is to defend the product from criticism, and you do it by placing the onus on the developer to figure out how to use your product safely, you've totally lost the plot.
Also, if you're a DevRel, maybe you should have strong verbal and written communication skills in the language of your users? Just an idea
Personally, I think it's an abusive tactic on their part: lots of devs are concerned with looking smart, so by saying "if you know what you're doing" you can get the insecure ones to shut up.
But obviously it is a tactic of the doomed.
"If you are careful you might not lose all your data" doesn't sound very reassuring, either.
> I can _quite easily_ see how a non-native English speaker could use the phrase “if you know what you are doing” to mean “if you are careful”.
This kind of thing is a scourge. I had a Chinese friend respond to something I said once by saying "that's nice". It looks so innocent... but it's really hard to overlook the fact that "that's nice" is a serious insult coming from a native speaker. I had to ask them to please never use that phrase to me again.
Wow! Where are you from that you consider it a "fact" that "that's nice" is a "serious insult"? (I'm genuinely curious... I assume not the US based on the time this was posted?)
To my ear (American who grew up in the South, lives in the Midwest, and works with people largely from the Mountain West), "that's nice" could definitely be used dismissively or sarcastically, but there's any number of ways to say it that are actually sincere and genuine, and I can't imagine a situation in which it would be a "serious insult".
I mean, I can imagine that in a professional setting, if the coworker was saying it with a sarcastic tone or being showily bored or dismissive when saying it, that their behavior might be insulting. But anything can be insulting if it's delivered in an insulting way. "That's nice" has no particular edge to it to my ear.
So I have to assume you aren't American, or that this is a regional thing that I don't have experience with. In any case, your reaction to "that's nice" reminds me of an American friend of mine who moved to London and when her coworkers would ask her if she had a preference about where to get lunch, she would reply "I don't care", which would be totally normal (to me) in the US. But to her British colleagues, that word choice made it a very rude thing to say (the appropriate reply being "I don't mind" to indicate that she didn't have a preference and was willing to go anywhere).
Language is complicated. Even to your "I don't care" example in the US, it indeed often means "I don't have a real personal preference, so if others do please go ahead and make a reasonable decision and I'll be fine with it." However, in other circumstances, it can mean something more dismissive in the vein of "I don't give an f what you do. Just go away."
I'm from California.
Did the joke just go over my head, or is "that's nice" a bad thing for real?
No it's not, it depends mostly on the tone of the voice. It's often enough sarcastic, but really doesn't have to be.
Tone of voice does not exist in text.
Just like data consistency doesn’t exist in MongoDB
Then you have to follow the principle of charity and assume they're being nice.
I think it sits on the same level as "it's interesting", when 9 times out of 10 you can see that the person who said it is not even remotely interested.
I'm not sure I agree with that. A lot of the time "interesting" serves as a filler word for "I'm not totally sure what to think about it" or "I'm not ready to articulate in 1000 words why X may or may not be a good approach right now."
"That's nice" is a way of explicitly stating that you don't care to any degree about whatever the other person just said.
It's very context sensitive. "That's nice" (with at least an implied exclamation point) can be a literal reaction to something you like. "Yeah, that's nice" as a reaction to a statement/argument can also be the equivalent of "Whatever." i.e. it's stupid but I'm not even going to dignify it with an explanation of why it's stupid.
What if it really is actually nice?
"How nice!" is a positive exclamation with minimal wording change from "that's nice".
You can also be explicit about who's being nice. "That's nice of you."
This really depends on what you want to mean by "nice".
Then you have to struggle to find another word to say that it's nice, remembering that saying "it's interesting" is forbidden, too.
I use "Great" so consistently sarcastically that my kids have unfortunately learned that "Great" means something not great.
Indirectly stating that they aren't good enough to use MongoDB properly could be offensive to thin-skinned developers, but that's only a bad attitude.
I'm much more concretely worried by a software design for which the authors (not hostile critics) consider "if you know what you are doing" an acceptable safety and quality standard for data integrity.
I've never used MongoDB and hopefully never will, but I have to defend that phrase in the larger context. A great number of programming problems I've seen have been to do with people not knowing what they're doing with complex tools, it really is that simple.
Actual example: a long time ago someone in my company introduced a race condition into their product because they didn't know about transaction isolation levels (or the locking facilities exposed by MSSQL). I can give many more, as I'm sure anybody here can.
All complex tools need a large investment in time to understand. I suspect MongoDB's team are using that as an excuse, but in general you must know what you're doing with any tool.
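To make the isolation-level example concrete, here's a minimal sketch of a read-modify-write made safe by asking for a stricter isolation level, using node-postgres rather than MSSQL (the table, columns and "reserve stock" logic are invented for illustration, not the incident described above):

    // Racy under the default READ COMMITTED isolation: two concurrent calls can
    // both read the same stock value and both decide the reservation is fine.
    // Asking for SERIALIZABLE makes Postgres reject one of them instead.
    const { Pool } = require('pg');
    const pool = new Pool();

    async function reserveStock(sku, qty) {
      const client = await pool.connect();
      try {
        await client.query('BEGIN ISOLATION LEVEL SERIALIZABLE');
        const { rows } = await client.query(
          'SELECT stock FROM items WHERE sku = $1', [sku]);
        if (rows.length === 0 || rows[0].stock < qty) {
          throw new Error('insufficient stock');
        }
        await client.query(
          'UPDATE items SET stock = stock - $1 WHERE sku = $2', [qty, sku]);
        await client.query('COMMIT');
      } catch (err) {
        await client.query('ROLLBACK');
        throw err; // serialization failures (SQLSTATE 40001) should be retried
      } finally {
        client.release();
      }
    }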
Let's make an analogy. To drive a car one needs to "know what they are doing" to the point that driving licenses are mandatory. Let's consider a common accident type: a collision because you turn but your directional lights remain off.
If it happens because the driver just forgot to signal, it's a mistake and it's hard to blame the car. It's like knowing what transaction isolation levels can be used, but asking for the wrong one or trusting a random default.
On the other hand, if the directional lights remain off despite pulling the lever because the car is in a probably illegal power-saving mode, activated by turning the headlights off, and the driver should have reactivated directional lights with an ambiguously named item in the depths of some options menu, blaming the accident on user error would be very partisan: it's the car that doesn't know what it's doing. It's like figuring out the correct transaction isolation level according to DBMS documentation, asking for it, and expecting it is actually used.
It’s not that he’s lying. He just has a tenuous arrangement with the truth.
The joke I learned early on: "Migrating away from Mongo is trivial: wait long enough, and all your data will be gone anyway."
I imagine things are better now.
MongoDB: the Snapchat of databases.
That’s mean to SnapChat.
Snapchat at least warns you that your data will disappear.
Snapchat has disappearing messages as a feature.
So does MongoDB! It's a feature!
it's documented too. as per the dev rel's tweet.
it's documented, if you know how to read!
on edit: between the lines.
“Does your database have snapshot isolation?” “No, we have snapchat isolation!”
I just call it a probablistic datastore.
They’re not.
Let’s hope the data disappears after you’ve collected revenue but before you paid the expenses. Crap product, shady company, sleazeball salesforce.
There was a time when I advocated for MongoDB with the usual caveats. The ability to easily store and index complex data was of great value. And then in October 2015, within a week of each other, SQLite and MySQL both learned how to index on expressions and store JSON (SQLite 3.9 on 2015-10-14, MySQL 5.7 on 2015-10-21). PostgreSQL added jsonb the year prior in 9.4. At that moment the value of MongoDB for me diminished greatly.
Why is storing json in a database important to you? Whenever I see json fields in PostgreSQL/MySQL, I know I'm most likely in for inconsistent data and a world of pain.
I use json (in postgres) for user generated config data (that is, both the schema and the values of the config are user generated). I will never query it other than to read the entire data to send elsewhere for processing and write it when the user sends new data. I don’t know or care about the content, since I can’t know what’s going to be there, but if I ever did want to query inside it, postgres allows that.
I could store it in a table with key and value columns, but since I always use it together as one thing, I don’t see the benefit and it just means more rows would need to be accessed.
Maybe its not a good design, but it works well for me and makes my life easier.
Good use case indeed!
Why is storing all data normalized in a database important to you? Whenever I see an exceedingly complicated database schema, I know I'm most likely in for unfixable legacy DB warts and incessant schema updates for every little feature.
Hyperbole aside, the best option often is somewhere in between. I find that a relational database with columns for primary keys/relations/anything used in WHERE statements in the normal application path, and a json blob for everything else that's just attributes for the current row/object/document, makes for a very flexible and successful DB schema. You get all the strong relational consistency benefits of traditional schemas, plus the flexibility to add and modify features that don't require relation changes without touching the schema, and the ability to represent complex structures while still allowing ad-hoc and admin path queries to peek into them where necessary.
In fact, most "fully normalized" databases end up reimplementing a key/value store or twelve in there anyway (e.g. config settings, user attributes, and the like). Might as well just use JSON at that point.
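For what it's worth, a minimal sketch of the hybrid layout described above, using node-postgres (the table, columns and the queried attribute are assumptions for illustration):

    // Relational columns for anything used in joins and WHERE clauses on the
    // normal application path; a jsonb blob for the remaining attributes.
    const { Pool } = require('pg');
    const pool = new Pool();

    async function setup() {
      await pool.query(`
        CREATE TABLE IF NOT EXISTS users (
          id         bigserial PRIMARY KEY,
          org_id     bigint NOT NULL,               -- used in joins / WHERE
          email      text   NOT NULL UNIQUE,        -- used in WHERE
          attributes jsonb  NOT NULL DEFAULT '{}'   -- everything else
        )`);
    }

    // Ad-hoc/admin-path query that peeks into the blob without a schema change.
    async function usersWithUiLanguage(lang) {
      const { rows } = await pool.query(
        "SELECT id, email FROM users WHERE attributes->>'ui_language' = $1",
        [lang]);
      return rows;
    }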
Are you saying fully normalized RDBs end up reimplementing key/value stores, and then suggesting we just use a JSON blob instead? The reason RDBs are normalized is to avoid update anomalies, and it isn't really difficult to understand. Technically, adding a column to a DB table is a matter of "ALTER TABLE T ADD (mycol NUMBER)", but it sounds like the real problem to be worked on is the organizational overhead involved. At least the way you're describing your use of JSON blobs in a relational DB really sounds like a guerilla tactic to sidestep DBAs gatekeeping your DB.

JSON is just an accidental data serialization format supported by browsers OOTB. If you're using any other language in your backend, like Java, then you need additional serialization libs to read and write JSON anyway (as with every other format such as XML), and the convenience of JSON is lost. Though lately browser apps have become so bloated with MVW frameworks and whatnot, I wonder if the original point of using JSON - that it can be handled as a JavaScript object literal OOTB - is really all that important. If you've worked with MongoDB, you'll know that Mongo's JSON is not quite JSON, but uses additional ad-hoc constructs to re-introduce some typing (such as for dates and IDs).
So I have to agree with GP wondering why JSON is so important to people, and is even portrayed as a relief or saviour to developers. In my experience, JSON in a relational DB is always a sign of organizational failure, developers using the wrong tool for the job, or not knowing what they want or do.
I'm not sidestepping the DBA, because I'm the DBA :-)
Fully normalized databases are a nice academic idea, but the supposed benefits of going all the way don't materialize in the real world. That kind of approach is just old school cargo cult - just like full NoSQL is new school nonsense. Good developers know that the answer isn't to use whatever the fad was when they were in school, be it fully relational databases or NoSQL or anything else, but rather to look at the available technologies and take the most appropriate bits and pieces in order to make a project successful.
After all, if JSON were nonsense, why would a serious relational database like PostgreSQL be adding full support for it? They know it has good use cases.
I know it has good use cases, so I use it, along with proper primary and foreign keys and a properly relational base data model. Yes, all my primary objects are tables with proper relations and foreign keys (and a few constraints for critical parts; not for everything, because 100% database-side consistency is also an impossible pipe dream, as not every business rule can be encoded sanely in SQL). Just don't expect me to add a user_attribute table to build a poor man's KVS just because people in the 90s thought that was the way to go and databases didn't support anything better. I'll have an attributes JSONB column instead.
And yes, JSON is just a trivial data serialization format that happens to have become de facto. It has an intuitive model and the concept isn't new. It just happens to have been what became popular and there is no reason not to use it. Keep in mind that PostgreSQL internally stores it in a more compact binary form anyway, and if clients for some programming languages don't yet support skipping the ASCII representation step that is an obvious feature that could be added later. At that point it ceases to be JSON and just becomes a generic data structure format following the JSON rules.
It really depends on applications. In most databases I've seen, if information is unimportant it is just omitted and if it's important it's worth putting in a proper column to allow relational operations; and if important information is big it's usually unstructured or ad-hoc (e.g. log messages, PDF documents), not a "document" and particularly not JSON.
I disagree. Most of the information in a database is not something I'd want to query by (otherwise I'd be indexing every column, which I'm not), certainly not as part of the normal application.
What's the point of putting, say, every single user management field into columns in a "users" table? The regular application is never going to have to relate users by their CSRF token, or what their UI language is, or any of the other dozens of incidental details associated to a user, some visible, some implementation details.
What matters are things like the username, email, name - things the app needs to actually run relational operations on.
If you look at any real application, pretty much everyone has given up on trying to keep everything relational. That would be a massive pain in the ass and you'd end up with hundreds of columns in your users table. You'll find some form of key/value store attached to the user instead. And if you're going to do that, you might as well use a json field.
And if you do use a json field with a database engine that supports it well, like PostgreSQL, you'll find that it can be indexed if you need it anyway, and querying it is easier than joining a pile of tables implementing a KVS or two. Because yes, I might want to make a report on what users' UI language is some day, but that doesn't mean it has to be a column when Postgres is perfectly happy peeking into jsonb. And I don't need an index that will just cause unnecessary overhead during common update operations that don't use it, when I can just run reports on a DB secondary and not care about performance.
I designed an application in this manner and it has turned out exceedingly well for me. We have only had about a dozen schema changes total across the lifetime of the app. One of them involved some JSON querying to refactor a field out of JSON and into its own table (because requirements changed) and that was no problem to run just like any other database update. If we need to move things to columns we will, but starting off with an educated guess of what will need to be a column and dumping everything else into JSON has undoubtedly saved us a lot of complexity and pain.
One of our tables is just a single primary key and then a json blob. It stores event configuration. Why? Because it's always loaded entirely and there is never any reason to run relational ops on it. It's a huge json blob with lots of little details, only edited administratively, with nested fields and sub-fields (which is way more readable than columns, which are a flat namespace), including a sub-document that is actually a structure that drives generation of a pile of HTML forms. If I'd tried to normalize this I would have ended up with a dozen tables, probably more complexity than the entire rest of the DB, and a huge pile of code to drive it all, and every single application change that added an admin knob would've had to involve a database schema change... All for what? Zero benefit.
I use it for when I get extra fields from a third party API which our system does not need at the moment but which may or may not become useful in the future. Sometimes these extra fields are undocumented, so creating columns for them would require investigation; just throwing them in there as JSON is a much better use of my time.
Also good use case! :)
> Why is storing json in a database important to you?
I don't need JSON per se, I want to store data with a predefined, recursive format which most of the time will be just serialized / deserialized but occasionally also queried, without having any idea ahead of time what the query will be. For all I care, it could be Python pickle, PHP serialize, XML or whatever ungodly format you want as long as the above stands. (XML actually works in MySQL thanks to ExtractValue and UpdateXML but alas, I break out in rashes when I touch XML :P)
Let's say you want to display a page of text and images. (Doh.) But you want to give your authors great flexibility and yet nice styling so you give them components: a rolodex, a two-by-two grid, tab groups and so forth (we have 44 such components). This stuff is recursive, obviously. Are you going to normalize this? It's doable but a world of pain is an accurate description for the results. The data you get back from deserializing JSON should map pretty well to the templating engine.
Rarely you want to query and update things, some analytics, some restructuring etc. If it were just a key-value store with page id and serialized data the only way to do maintenance would be to deserialize each page and manually dig in. Sure, it's doable but having the database do that is simply convenient. That's the reason we use database engines, right? At the end of the day, you don't need SQL, you can just iterate a K-V store and manually filter in the application in whatever ways you want -- it's just more convenient to have the engine do so. Same here. The nicest thing here is that if someone wants ongoing analytics you actually can add an index on a particular piece in the blob and go to town.
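As a rough sketch of that last point (adding an index on one particular piece of the blob), assuming a pages table with a jsonb body column, again with node-postgres:

    // Expression index on a single path inside the jsonb document, so the
    // occasional analytics query doesn't have to deserialize every page in
    // application code. Table name and blob shape are assumptions.
    const { Pool } = require('pg');
    const pool = new Pool();

    async function addComponentIndex() {
      await pool.query(`
        CREATE INDEX IF NOT EXISTS pages_first_component_idx
          ON pages ((body->'components'->0->>'type'))`);
    }

    async function pagesStartingWith(componentType) {
      const { rows } = await pool.query(
        "SELECT id FROM pages WHERE body->'components'->0->>'type' = $1",
        [componentType]);
      return rows;
    }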
Inconsistent json data is exactly why you use a json data type or database.
I know that you, dear developer, would never produce inconsistent data. But lots of other developers do.
It is often the case that you will need to query that inconsistent json data, but either the pain is too great, or the value too low, to normalize that data. Thus, you dump it as is into a json field or database.
Ah, well, I guess I prefer having consistent data. The only way to have consistent data is to forbid the database from representing inconsistent states. If inconsistent states are allowed, they will appear.
Every single time I've had to work on top of a Mongo cluster, it has gone into "three stooges" mode, where each node insists that one of the others is master.
I pretty much refuse to deploy a new instance of it now, I've been burned too often.
"I'm trying to sync, but nothing happens!"
Tangentially related: their sales strategy is questionable.
As an intern at Shopify, I got an email from MongoDB asking us to switch. Shopify was 10 years old at the time. Plus, several coworkers would also receive similar emails two years later (and some in between, of course).
I have a shirt from MemSQL that says "Friends don't let friends NoSQL" and I wear it proudly.
If their sales team is sending emails to interns, their strategy is very questionable indeed.
I don't know any details here, but it might have been just one of those "hey <developer position> we'd love to show you how <product> can help you save <time/money/both> - contact us for a demo " kinda things which wouldn't be too weird.
There was a custom blurb about Shopify and talk about "adoption" and alignment with my team, along with a followup email when I didn't respond. The weirder part was that my email wasn't public (but guessable, first.last@shopify), so they must've gone through some manual process to find people to contact.
They will happily waste the time of anyone willing to take a call. They are so sleazy they make Oracle look clean.
Isn't it amazing MongoDB is a 12 billion dollar company? Someone is using it and actually paying for it, even though it's not any of the developers you or I know.
It’s the switching costs to get rid of it...
Can confirm: stuck on MongoDB. Paying them for hosting is one less pain.
That’s like giving the heroin dealer your ATM card.
I run in two circles: the one you mention, but also the other: I have gotten pushback from people (usually devs at clients of mine) for saying it’s lunacy to run a real, actual business on Mongo. (This has always happened from orgs with <10TB of data in the database.)
You’d be astounded how common it is at so-called “enterprise” startups. It blew my mind.
A lot of people simply never went through the LAMP stack days and have little/no experience with real databases like Postgres (or even MySQL). It’s disheartening.
I have found their salespeople to be the most sleazy and unethical of any that I’ve worked with. Much worse than all the other database vendors combined.
I know they are trying to beat Oracle in this sector, but I didn’t realize they meant it like this!
Can't agree with this more! I have never met any sleazier sales reps than MongoDB's.
Try their managers.
I got my PM friend to prototype his idea on the MEAN stack, but when we got more serious, we immediately transitioned to postgres and started using sequelize as the ORM. Pretty good decision so far. I don't think they will have cases that won't scale with an ORM for the foreseeable future.
How do you manage dev vs prod instances with sequelize? Specifically, do you use migrations for your local DB or just force sync it?
They use sequelize migrations for now; eventually we may want to go to something more advanced. Take a backup, apply migrations has worked so far. Force sync is extremely harmful; sequelize is terrible for not being able to incrementally do schema changes. I don't think it can actually add a column, for example.
For us, as a result, it means anytime we add a new model or change a table, we write the table definition in the more verbose API, and sometimes resort to SQL commands for adding things like defaults in the migrations. (Sequelize for some reason cannot specify a uuid default value for postgres, for example, so we set a default ourselves, even though we don't need one, as we have a time-dependent uuid generator on the client to help with indices.)
We kind of learned to work around sequelize's shortcomings :)
I am still looking for a tool that'd make incremental backups on postgres easier than it is, but for now things are ok.
Sequelize documentation is pretty bad for some of this stuff, but it has the functionality you listed. You can give Postgres a default uuid by setting the column options to ‘new Sequelize.UUIDV4()’, and queryInterface supports add, remove and change column.
The biggest annoyance I had with migrations was anything that needed a model to be defined inside the migration. I eventually gave up on that and just wrote my own scripts to work around it. But perhaps I just wasn’t persistent enough...
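For reference, here's a rough, untested sketch of the kind of migration being discussed; the table and column names are made up, and the database-side default uses a Postgres function via Sequelize.literal (Sequelize.UUIDV4 generates the value in the application instead):

    // Hypothetical sequelize migration. Assumes Postgres 13+ (or the pgcrypto
    // extension) so that gen_random_uuid() exists.
    'use strict';

    module.exports = {
      up: async (queryInterface, Sequelize) => {
        await queryInterface.addColumn('Orders', 'external_id', {
          type: Sequelize.UUID,
          allowNull: false,
          // Push the default down to the database itself.
          defaultValue: Sequelize.literal('gen_random_uuid()'),
        });
      },
      down: async (queryInterface) => {
        await queryInterface.removeColumn('Orders', 'external_id');
      },
    };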
I am not super sure, but I remember trying this, and it did not work inside migrations :(
You might be right. It's quite a while since I was working on that project, and now I'm starting to doubt how sure I am about this...
As a skateboarder I've always found the name itself rather amusing as the term mongo has relatively negative connotations in skating.
It also does in Spanish, it's a (rather old, but still used) shortened version of "retard" (from "mongólico", originally used to describe people with Down's syndrome).
This is true in virtually all languages, as John Langdon Down used it to refer to the syndrome he first characterized, that today instead bears his name.
Lose the O and it's also a nasty piece of British slang meaning the same thing.
It's funny. Someone I used to know from Scotland is/was a software tester by day. She went mental about the name of their company's new DB. All over social media. "Derogatory, backwards."
Same in Swedish
Same in most languages, but there's also Emperor Mongo in Flash Gordon stories, definitely alluding to Mongol conquering hordes.
Most languages would be a stretch, there are over 7000 languages in the world.
Nobody seems to like it. Does anyone have any idea why the company's revenue is still increasing?
I like it, but I don't really chime in on threads where it's mostly Mongo bashing and jokes, and I'm guessing others who like MongoDB do the same.
But I'd think MongoDB the company increasing in revenue isn't totally related to the quality of MongoDB the database. In fact a lot of their products seem to be targeting the "I don't want to learn how to set it up and understand indexes" crowd.
Which is funny because one of the things hosted Mongo will do is constantly send you automated emails suggesting that you should create indexes (for tables that MySQL or Postgres would laugh at)
It sends emails when the scanned-vs-returned record ratio is greater than a threshold. Not quite sure how something like this would be different for MySQL/postgres. If an index is missing for your query pattern, wouldn't you create an index in MySQL/postgres?
> If an index is missing for your query pattern, wouldn't you create an index in MySQL/postgres?
No. I’d consider adding an index. An index is not free, it comes at a cost and that cost may well be higher than the costs of not having that index. For example, if a reporting query that runs once every few hours is lacking an index, the cost of updating that index on every write (and the disk space/memory used) may well exceed its benefits.
Exactly this. If you connect to your database and run a couple un-indexed queries to explore your data, any query which takes over 100ms will appear for the next 30 days in the "Performance Advisor" tab and it will offer you indexes to create targeted at that query. Just based on a query you did manually one time.
Just to note, this is referring to the features for a hosted database on cloud.mongodb.com, and not something built into MongoDB the database.
I prefer this being shown for slow queries over not showing it at all, given my experience with people running bad queries manually on production MySQL instances and starting a domino effect.
I guess to each their own.
Secondary indices can be a lot cheaper in a non-ACID database, where you can acknowledge a write when it is durably committed but index updates may not have completed.
Agreed, but a notification about a potential performance issue is not forcing anyone to add an index.
My claim isn't that indexes are a bad idea. The comment I replied to suggested that Mongo's marketing targets people who don't want to learn how to use indexes; my claim is that the reality of the hosted product is that it frequently suggests that you add indexes (presumably doing so helps with Mongo Corp's hosting margins, as well as with performance [which is laughable even with indexes]).
> my claim is that the reality of the hosted product is that it frequently suggests that you add indexes
But it auto-suggests what index to use and has a button for you to immediately apply it. I'd say it definitely intends for you to avoid learning how indexes in MongoDB work. The index suggestions it makes are often terrible.
> performance [which is laughable even with indexes]
It really depends on your use-case and how you can structure and query your data. For one such service I'm the lead on I haven't found anything faster that doesn't sacrifice the ability to do more generic queries when we occasionally need to. (Yes we've benchmarked against postgres json fields which is always everyone's first suggestion)
> But it auto-suggests what index to use and has a button for you to immediately apply it. I'd say it definitely intends for you to avoid learning how indexes in MongoDB work. The index suggestions it makes are often terrible.
Heh, yeah, you're right. So I guess it's the worst of both worlds: they try to not have you learn about indexes, but they give you terrible index suggestions, so when those don't work out you still end up having to learn about indexes :)
MSSQL can automatically create indexes for frequently run queries, instead of sending emails about it.
I have a legacy app that uses CouchDB under the hood. It's not actually appropriate for how it's being used, but I can see how it could be used in an appropriate way. Over time I got to respect and even like CouchDB (just not the way we're using it ;-) ). However, before CouchDB we apparently started with MongoDB and then migrated to Couch after problems with data integrity in Mongo (this was a long time ago and I've heard that the problems we had have been fixed).
But I've always been curious: as a person who likes MongoDB, do you have an opinion about Mongo vs. Couch? Keep in mind that you don't have to convince me that there are niches where that style of DB is appropriate ;-) I'm mostly just interested in the comparison since I've never spent any time looking at MongoDB.
IMO, MongoDB is really good if all your indexes and your data (after sharding) can fit in memory, and you need queries that are a simple match query against a small number of keys or a range query against a date or float type. And also your data doesn't fit into a row/column structure very easily (otherwise go SQL). The aggregation framework, joins, other features are all nice to have for occasional use, but you can't be doing those 10k times a second.
It's great for things like a realtime layer of some app that merges data with a slower and more historical layer of data running on a SQL engine or something safer. Or for services that provide realtime or recent-time analytics, while storing your historical data somewhere else (see any patterns here so far? :P ). In my case the main usage is for an advertising bid and content serving engine, which was pretty much the ideal example use-case for MongoDB mentioned in books I read years ago when first learning it.
Just to note, yes the data integrity problems are "fixed", but only if you configure your instances properly and your read and write statements properly. It's not terribly hard to do, but I don't know if I would really go recommending MongoDB for newbies. If you know how to configure it properly for your data-safety needs, and would benefit from being able to have a flexible schema in early development.... I'd still maybe suggest looking at other document DBs unless you need the read/write speed Mongo can give on simple queries (and fresh projects probably do not need that)
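To make "configure it properly" a bit more concrete, here is a minimal sketch with the official MongoDB Node.js driver; the connection string, database and collection names are invented, and the settings shown (majority write concern, majority read concern, and a causally consistent session) are the general direction the Jepsen report points in, not a guarantee that they are sufficient for your workload:

    // Minimal sketch of a "configured for safety" client, not a recommendation
    // for every workload (these settings trade latency for durability).
    const { MongoClient } = require('mongodb');

    async function main() {
      const client = new MongoClient('mongodb://localhost:27017/?replicaSet=rs0', {
        writeConcern: { w: 'majority' },     // acknowledge only majority-replicated writes
        readConcern: { level: 'majority' },  // only return majority-committed data
      });
      await client.connect();
      const orders = client.db('shop').collection('orders');

      // Causally consistent session: reads observe previously acknowledged writes.
      const session = client.startSession({ causalConsistency: true });
      try {
        await orders.insertOne({ sku: 'abc', qty: 1 }, { session });
        console.log(await orders.findOne({ sku: 'abc' }, { session }));
      } finally {
        await session.endSession();
        await client.close();
      }
    }

    main().catch(console.error);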
I like it in certain situations.
For situations where you don't know the schema, or where you have different schemas per record, mongo is a great place to dump data.
Also for data where you care about speed and don't care about losing some of it. Think sending back a game screen when the client moves and requires a redraw. Depending on how fast the screen is changing, dropping a screen isn't the biggest deal.
Reporting was a little bit more difficult but somehow rewarding.
Marketing and advertising. They do a lot of it, trying to brand themselves as the core of the modern stack. The same thing they have been doing from the very beginning, when they had no product to back their claims. Reading the comments here, little seems to have changed.
Because good engineers don't take for granted what's written on HN and do a thorough evaluation of a product, MongoDB or something else.
If there's a more thorough evaluation than Jepsen, please let me know.
But their marketing team early on was amazing.
That acquisition that used mongo got spun off and is now Honeycomb.io. Nice people.
Honeycomb.io is a reimplementation (with some nice additions) of some of Facebook's internal monitoring tools. Mongo was never involved in those tools, as far as I know. I have no idea if honeycomb is using it internally.
> but there is no MongoDB use at Facebook and hasn't been for years
Are you sure?
Until very recently I worked there, and at one point I dug quite deeply into the company to see if I could find whether it was still in use. I couldn't find documents referencing its present existence, packages to install it, or anything else, so I'm pretty certain that it's not in use.
There are three VIPs actively being used to load balance MongoDB traffic as of this moment.
The docs you likely found on the wiki are dated, but MongoDB is definitely being used in Enterprise Engineering.
Source: I'm currently on the EE traffic team.
It's difficult to make accurate blanket assertions about large technical organizations, which Facebook certainly is.
I am amused by the downvotes that my previous comments received.
Ah yes, but an engineer at Facebook installed it on his work laptop once
It’s certainly not used for any mission critical apps. Facebook’s stack is pretty well known. They’ve been using sharded MySQL for a while now. Instagram started on PostgreSQL but I believe has switched to Cassandra.
> mission critical apps
Thanks for the clarification.
As an example: would you consider the backend software stack that manages physical access to the campus 'mission critical'?
Ubiquiti EdgeRouters' management interface has a MongoDB backend and I have one on my desk at work, but I would not consider that I run "MongoDB" to provide services for my customers.
In much the same way I wouldn't say that my site is powered by Microsoft Excel; but you can be sure Microsoft Excel is used in my company.
I think they mean for first party backend tools. For example at my job I would say we use react in production - not on the frontend, but only for an internal warehouse application interface, nobody outside can use it but it's vital to our business, and we're the ones who built the application.
I would not. Facebook.com still operates with or without that system. Especially now with COVID-19, I’m sure it’s not being used at all.
Mission critical - essential for operating Facebook.com
> I’m sure it’s not being used at all
Is it a company-wide system? It's all fun and games until the cooling in a DC starts to have issues and the facilities team can't gain access because... mongo...?
Sure, that's mission critical (ish), but it's not really what people think of when they think of Facebook. It's not millions of posts per second, ultra scalable whatever.
Disclosure: I worked at Facebook, but not in that department, or anywhere near any MongoDB.
From the jepsen report:
"""
Curiously, MongoDB omitted any mention of these findings in their MongoDB and Jepsen page. Instead, that page discusses only passing results, makes no mention of read or write concern, buries the actual report in a footnote, and goes on to claim:
> MongoDB offers among the strongest data consistency, correctness, and safety guarantees of any database available today.
We encourage MongoDB to report Jepsen findings in context: while MongoDB did appear to offer per-document linearizability and causal consistency with the strongest settings, it also failed to offer those properties in most configurations.
"""
This is a really professional way to tell someone to stop their nonsense.
Amazing that anyone can trust Mongo after this BS.
MySQL and PG are not truly consistent by default; they don't fsync every write.
MongoDB explains that pretty well: https://www.mongodb.com/faq and https://docs.mongodb.com/manual/core/causal-consistency-read...
> MySQL and PG are not truly consistent by default; they don't fsync every write.
Postgres most certainly does fsync by default.
It's true, you can disable it, but there is a big warning about "may corrupt your database" in the config file.
No, PG does not fsync every write; more details here: https://dba.stackexchange.com/questions/254069/how-often-doe...
My point is that the people complaining about MongoDB are most likely the ones not using it; MongoDB is very different from 10 years ago.
I like to remind people that PG did not have an official replication system 10 years ago and as of today is still behind MySQL. No DB is perfect, it's about trade-offs.
> It writes out and syncs the accumulated WAL records at each transaction commit, unless the committed transaction touched only UNLOGGED or TEMP tables, or synchronous_commit is turned off.
So wal is synced before commit returns, and if you power cycle immediately after, the wal is played back and your transaction is not lost? So it's fine?
It does not need to sync all writes, only the records needed to play back the transaction after restart. This is what all real databases do.
“PG writes out and syncs the accumulated WAL (= Transaction log) records at each transaction commit [snip] It also syncs at the end of each WAL file (16MB by default). The wal_writer process also wakes up occasionally and writes out and syncs the WAL.“
So PG keeps data consistent by default - unlike MongoDB.
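If you want to check what your own instance is actually doing, the relevant knobs are visible from any client; a quick sketch with node-postgres (with stock Postgres defaults, fsync and synchronous_commit are both on, i.e. COMMIT does not return until the transaction's WAL records are flushed):

    // Print the durability-related settings discussed above.
    const { Client } = require('pg');

    async function showDurabilitySettings() {
      const client = new Client();
      await client.connect();
      const { rows } = await client.query(
        `SELECT name, setting
           FROM pg_settings
          WHERE name IN ('fsync', 'synchronous_commit', 'wal_sync_method')`);
      console.table(rows);
      await client.end();
    }

    showDurabilitySettings().catch(console.error);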
> MySQL and PG are not truly consistent by default; they don't fsync every write. MongoDB explains that pretty well [links]
Where in those MongoDB doc links is there anything about MySQL or PG?
I don't know if this is true or not, but it's beside the point; MongoDB omitted various failings from the Jepsen report to make their product look better than it actually is. This is not only unethical, but may also be illegal in various jurisdictions under false advertising laws.
Whatever failings MySQL or PostgreSQL may or may not have are not important at all here.
The default in MySQL and in postgresql is to fsync before commit and afaik that has always been the default.
No, that was not the case, and there were several serious issues with fsync and PG in the past: https://www.percona.com/blog/2019/02/22/postgresql-fsync-fai...
On MySQL: https://dev.mysql.com/doc/refman/8.0/en/innodb-dedicated-ser...
InnoDB uses O_DIRECT during flushing I/O, but skips the fsync() system call after each write operation.
The fsync thing is more complex than it looks like.
That bug was unfortunate, but you can't say that "it doesn't fsync" because, pedantically, it does, it just ignores the return value.
And, obviously that's a bug, it's designed to do so.
Also, if you write with O_DIRECT, a fsync is not needed, as it's how you tell the OS to block until written.
From top of linked article:
>>> I have to admit raising an eyebrow when I saw that web page. In that report, MongoDB lost data and violated causal by default. Somehow that became "among the strongest data consistency, correctness, and safety guarantees of any database available today"! <<<
It's not wrong, just misleading. Seems overblown given that most practitioners know how to read this kind of marketing speak.
> It's not wrong, just misleading. Seems overblown given that most practitioners know how to read this kind of marketing speak.
So basically, whatever MongoDB was doing 10 years ago, they are continuing to do now. They did not change at all. Yesterday or two days ago there were a few people defending Mongo, saying that in its early years Mongo indeed wasn't the greatest, but that it is now and people should just stop being hung up on the past.
The reason why people lost their trust with mongo wasn't technical, it was this.
I appreciate your optimism in thinking that most (all?) people reaching for distributed systems actually know enough in the space to evaluate such claims.
Agree, and the "Mongo and Jepsen" page isn't targeting distributed systems experts, most of them know to stay away, because even if there are things that mongo does right, other systems do it better.
What other systems would you recommend?
I don't consider myself an expert in that area. Just someone who learned a lot from Kyle's articles.
Based on this, my understanding is: most of the time you want a relational database. If a relational database becomes a bottleneck for certain data, and you don't want to do typical scaling solutions for relational data, then you need to know what you'll trade for the higher performance. Based on what you trade, you then decide what kind of data store you will use.
What do you want to do?
FoundationDB
Isn't that too low-level?
You can tell a lot about a developer by their preferred database.
* Mongo: I like things easy, even if easy is dangerous. I probably write Javascript exclusively
* MySQL: I don't like to rock the boat, and MySQL is available everywhere
* PostgreSQL: I'm not afraid of the command line
* H2: My company can't afford a database admin, so I embedded the database in our application (I have actually done this)
* SQLite: I'm either using SQLite as my app's file format, writing a smartphone app, or about to realize the difference between load-in-test and load-in-production
* RabbitMQ: I don't know what a database is
* Redis: I got tired of optimizing SQL queries
* Oracle: I'm being paid to sell you Oracle
This might be a stupid question, but surely no one thinks of RabbitMQ as a database, right? I've used it from 2012 to 2018 extensively, including using things like shovels to build hub-and-spoke topologies, however not once did I think of it as anything but a message broker.
Did I miss something huge?
>This might be a stupid question, but surely no one thinks of RabbitMQ as a database, right?
Arguably the world's most popular database is Microsoft Excel.
It’s definitely popular, that much is inarguable.
I once worked on a system for notifying customers of events by posting to their APIs. Events came in on a Rabbit queue and got posted.
If a customer's API was down, the event would go back on the queue with a header saying to retry it after some time. You can do some sort of incantation to specifically retrieve messages with a suitable header value, to find messages which are ready to retry. We used exponential backoff, capped at one day, because the API might be down for a week.
I didn't think of RabbitMQ as a database when I started that work, but it looked a lot like it by the time I finished.
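One common way to get that kind of behaviour with amqplib is a retry queue whose dead-letter exchange routes expired messages back to the work queue, with a per-message TTL that doubles on every attempt; a hedged sketch follows (queue and exchange names are invented, and this is a generic pattern, not the original system):

    const amqp = require('amqplib');

    const DAY_MS = 24 * 60 * 60 * 1000;

    async function setup() {
      const conn = await amqp.connect('amqp://localhost');
      const ch = await conn.createChannel();
      await ch.assertExchange('events', 'direct', { durable: true });
      await ch.assertQueue('deliveries', { durable: true });
      await ch.bindQueue('deliveries', 'events', 'deliveries');
      // Messages parked here flow back to the work queue once their TTL expires.
      await ch.assertQueue('deliveries.retry', {
        durable: true,
        deadLetterExchange: 'events',
        deadLetterRoutingKey: 'deliveries',
      });
      return ch;
    }

    function scheduleRetry(ch, msg) {
      const headers = msg.properties.headers || {};
      const attempt = (headers['x-attempt'] || 0) + 1;
      const delay = Math.min(1000 * 2 ** attempt, DAY_MS); // exponential backoff, capped at a day
      ch.sendToQueue('deliveries.retry', msg.content, {
        persistent: true,
        expiration: String(delay),      // per-message TTL in milliseconds
        headers: { ...headers, 'x-attempt': attempt },
      });
      ch.ack(msg);
    }

(One caveat with per-message TTLs: messages only expire at the head of a queue, so a single long delay can hold up shorter ones behind it, which is why some setups use a few fixed-delay queues instead.)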
Sounds like delay line memory.
lol api could be down for a week? What?
Not everything is or needs to be webscale.
RabbitMQ stores your data, right? Then it's a database! That's pretty much all it takes. A flat file, memory-store, SQL DB, Document store, any of them can be databases if that's where you stick your data!
But also no, RabbitMQ and Kafka and the like are clearly message buses and though they might also technically qualify as a DB it would be a poor descriptor.
Ah I see, we are going with “well technically it stores something therefore it is database joke”. Now I’m fully onboard :)
Back when I worked in LA, my CTO used to joke that most places use Microsoft Outlook as a database and Excel as a BI tool.
If memory serves, the original EToys.com code treated the filesystem as a tree-structured database using atomic operations (though no transactions). It worked just fine; then the rewrite with an RDBMS that should have been stabler and faster resulted in the famous meltdowns. Admittedly this is cheating a bit since you can name folders & files with semi-arbitrary or internally structured string keys. By 1997 standards, pure disk access without having to walk the filesystem hierarchy was blazingly fast compared to many of the databases I was using.
[Source: I was friends with the guy who wrote it as well as other EToys employees. God that was a trainwreck.]
Interesting, is there a blog post around discussing this in detail? If not, would you be kind enough to go into more detail?
I don't think anyone posted about their particular system, but it's not unknown now. If you google "filesystem as a database" there are some relevant hits. One super simple and probably not ideal, but at least balanced version uses a hash of some primary key like customer row id as the file index, then partitions the items into directories with all permutations at each level (or only populated ones) based on successive parts of the hash. For example an item key that hashes to a32c4214585e9cb7a55474133a5fc986 would be located somewhere like this:

    a32c/4214/585e/9cb7/a554/74133a5fc986

    a32c/
      4214/
        585e/
          9cb7/
            a554/
              74133a5fc986

The advantage of this kind of structure is that you never need to manually scan a directory since you know exactly what path you're trying to open. You still incur the OS lookup time for the inode-equivalent in the directory entry, but a deeper hierarchy keeps that faster. You can trade off time to traverse the hierarchy versus number of entries in the final directories by adjusting the length of the hash chunk you use at each level. Two characters will put vastly fewer entries at a given level, but vastly increase your directory depth.

Basically, if you're manually scanning the hierarchy for anything but a consistency check or garbage collection, you've already lost.
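A hedged sketch of that layout in Node.js (the hash, chunk sizes and key scheme are illustrative, not eToys' actual code):

    const crypto = require('crypto');
    const path = require('path');
    const fs = require('fs');

    // Map a record key to root/a32c/4214/585e/9cb7/a554/74133a5fc986,
    // matching the example path above.
    function recordPath(root, key, levels = 5, chunk = 4) {
      const digest = crypto.createHash('md5').update(String(key)).digest('hex');
      const parts = [];
      for (let i = 0; i < levels; i++) {
        parts.push(digest.slice(i * chunk, (i + 1) * chunk));
      }
      parts.push(digest.slice(levels * chunk)); // remainder becomes the file name
      return path.join(root, ...parts);
    }

    function writeRecord(root, key, value) {
      const file = recordPath(root, key);
      fs.mkdirSync(path.dirname(file), { recursive: true });
      // Write to a temp file and rename: rename is atomic on POSIX filesystems,
      // which avoids the overwrite-in-place corruption mentioned below.
      const tmp = file + '.tmp';
      fs.writeFileSync(tmp, JSON.stringify(value));
      fs.renameSync(tmp, file);
    }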
That's how git stores its objects:
    18:35 $ tree .git/objects/
    .git/objects/
    ├── 02
    │   └── 9581d0c8ecb87cf1771afc0b4c2f1d9f7bfa82
    ├── 3b
    │   └── 97b950623230bd218cef6aebd983eb826b2078
    (...)
    ├── info
    └── pack
        ├── pack-b1fe2364423805afb6b1c03be0811c93b19dedc9.idx
        └── pack-b1fe2364423805afb6b1c03be0811c93b19dedc9.pack

    10 directories, 10 files

One important note: make sure you carefully consider using atomic renames and such for manipulating the files! Overwrite in place is a great way to end up with a corrupted item if something goes desperately wrong and you're not protected by COW or data journaling.
Usually you write these sorts of things as append only with an optional garbage collect. You get a minimal sort of Atomicity with that.
I was thinking of doing something similar as a lightweight embedded datastore: apply structure to the file system like you would a redis key.
Would love to talk to anyone on the EToys team or anyone who has done something similar.
I'm @akamaozu on twitter.
Unfortunately eToys imploded a couple of years later (2001) and there were only a few people involved at that stage, so it's possible none of them are in the industry anymore. You might start by looking at email servers; I believe there are a few that use a deeply nested directory hierarchy for much the same reasons. IIRC Apple also does something similar with the interior of the sparsebundles used in Time Machine backups, but I don't know if any of that code is open source.
You laugh, but I bet Excel produces orders of magnitude more real "business intelligence" than all other "BI" tools combined.
Here is an anecdote.
I had to work on a tool that shows what's wrong with an assembly line: missing parts, delays, etc... So that management can take corrective action. Typical "BI" stuff but in a more industrial setting.
The company went all out on new technologies. Web front-end, responsive design, "big data", distributed computing, etc... My job was to use PySpark to extract indicators from a variety of data sources. Nothing complex, but the development environment was so terrible it turned the most simple task into a challenge.
One day, the project manager (sorry, "scrum master") came in, opened an excel sheet, imported the data sets, and in about 5 minutes, showed me what I had to do. It took me several days to implement...
So basically, my manager with Excel was hundreds of times more efficient than I was with all that shiny new technology.
That experience made me respect Excel and people who know how to use it a lot more, and modern stacks a lot less.
I am fully aware that Excel is not always the right tool for the job, and that modern stacks have a place. For example, Excel does not scale, but there are cases where you don't need scalability. An assembly line isn't going to start processing 100x more parts anytime soon, and one that does will be very different. There are physical limits.
I think you drew the right conclusion from your experience, but I also want to point out that building the first prototype is always anywhere from one to three orders of magnitude easier than building the actual product.
The devil is in the details, and software is nothing but details. The product owner at the company I work for likens it (somewhat illogically, but it works) to constructing walls. You can either pick whatever stones you have lying around, and then you'll spend a lot of time trying to fit them together and you'll have a hell of a time trying to repair the wall when a section breaks. Or you can build it from perfectly rectangular bricks, and it will be easy to make it taller one layer at a time.
Using whatever rocks you have lying around is like building a prototype in Excel. Carefully crafting layers of abstraction using proper software engineering procedures means taking the time to make those rectangular bricks before building the wall. End result more predictable when life happens to the wall.
Well, in these situations, the implicit ask of your company (I've been there myself) is to basically rebuild Excel but trade some of the power/flexibility of Excel for safety, and to move the risk of error away from front end users (aka move the risk to the back end developers).
Unfortunately which specific features of Excel are acceptable to remove are unknown until you have already way over invested into the project.
The best I've seen this done is having Excel as a client for your data store. Where read access is straightforward and write can be done via csv upload (and heavy validation and maybe history rollback).
That way the business can self-service every permutation of dashboard/report they need and only when a very specific usecase arises do you need to start putting engineering effort behind it.
I suppose you can also supplement the Excel workflow with a pared down CRUD interface for the inevitable employee allergic to excel.
I posted elsewhere[0] in this thread about my employer's successful practice of replacing shared spreadsheets with web applications.
Here is another option that we use instead of CSV import.
Our applications support custom reports and custom fields.
Users can define new reports and run them on demand.
They can also define custom field types with validation, data entry support, etc.
This combination provides some of the extensibility of Excel while retaining the advantages of an application.
Edited for wording changes.
...And orders of magnitude more wasted time and capital due to inaccurate and isolated data.
People use what they know to solve the problems they have.
You can complain about their solution or see it as an opportunity.
I posted elsewhere[0] in this thread about my employer's practice of replacing shared spreadsheets with web applications.
This approach works quite well for us and I would encourage you to consider it as an option.
You bet, but I'd really love to see data that supports that.
well technically it stores something therefore it is database joke
Confluent, the company behind Kafka, are 100% serious about Kafka being a database. It is however a far better database than MongoDB.
Excel can be an excellent source of new line-of-business applications.
Many of my employer's applications started out as a shared spreadsheet or Access database.
Our development team worked with the users and built a web application to solve the same problem.
This approach has a lot of advantages:
* The market exists and has an incumbent. There's a lower risk of a write-off.
* The users are open to process changes. You still have to migrate people off of the spreadsheet, though.
* It's easy to add value with reporting, error checking, concurrent access, and access control.
* You can import the existing data to make the transition easier. This will require a lot of data cleaning.
Edited to add the following text from another post.
You can cover most of the requirements with a set of fixed fields.
The last 10% to 20% of the use cases requires custom reports and custom fields.
Users should be able to define their own reports and run them without your involvement.
They should also be able to define custom field types with validation, data entry support, etc.
If your web application has these two features and other advantages then you should be able to replace Excel.
> Kafka and the like are clearly message buses and though they might also technically qualify as a DB
ksqldb is actually a database on top of this.
The thing is that they have an incrementally updated materialized view that is the table, while the event stream is similar to a WAL ("ahead of write logs?" in this case).
Because eventually you can't just go over your entire history for every query.
Oh ho ho ho. What weird things we use as databases. I remember when I first started out as a consultant developer, we were using a CMS as our data repository because someone thought that was a good idea. (It wasn't.) The vendor was flown in from the States to help troubleshoot. I will never forget how he looked at me when I had to explain to him why we made so many nodes in the content tree: it was because we were using the CMS as a repository.
It's both. It's best used when it's being used as a message broker, but any sufficiently advanced message broker will need many of the features of a database – durability of messages, querying in various ways, etc. I think it's reasonable to think of it as a very specialised database.
I interpret it as they'd probably not call it a database, but they might use it in places where a database would be better suited, and effectively store data in it.
As someone who chose MySQL and provides direction to developers who really like Postgres, and who also uses Postgres for fun, I do find myself having to both defend MySQL as a prudent option and convince them that I know anything at all about Postgres or computer science. :)
I've heard MySQL (well, MariaDB, really) has improved a lot in recent years, but I still can't imagine why I'd ever choose it over Postgres for a professional project. Is there any reason?
It used to be that bargain basement shared-hosting providers would only give you a LAMP stack, so it was MySQL or nothing. But if you're on RDS, Postgres every time for my money.
mysql's admin tools are still far superior to what's available for postgres
What tools are these? Curious as a Postgres user
MySQL Workbench is actually pretty damn good.
As a PostgreSQL user: PgAdmin 4 is the worst.
PgAdmin3 was much better, but then they decided to turn it into a web application and we got PgAdmin4. Very annoying.
I used to use PgAdmin 3, but after... I dunno how many years now, PgAdmin4 is still a buggy mess.
It's really sad, because all the contributors to Postgres have made an AMAZING database that's such a joy to work with. And then there's PgAdmin4, where it's almost like they just don't care...
I don't feel I'm smart enough to contribute anything to PgAdmin4 to try to make it better. So I stick to DataGrip and DBeaver.
I've probably tried a dozen or so.
For MySQL, I haven't found anything that beats SequelPro. For Postgres, I haven't found anything that comes close to parity, but my favorite is Postico.
I know people that swear by IntelliJ for their db stuff, it just never hit home for me personally though.
TablePlus is fantastic and supports all the most popular DB’s, including Postgres & MySQL. https://tableplus.com/
It’s the only DB client that doesn’t look like it was built in the 90’s. Slick UX & UI. Nice balance between developer tool & admin tool
For PG, DBeaver does a damn good job for me.
DBeaver is certainly underhyped. Moved over to using it from MySQL Workbench and so far it's been a big improvement.
I can't compare against SequelPro as I don't have a Mac, but DBeaver's worth a try for anyone looking for a cross platform DB editor
I spent a little time in it and I really like DBeaver. Free tier works great.
I tend to find people who argue with me against MySQL bring up things that haven't been true in a long time such as Unicode or NULL handling.
I'd probably choose Postgres over MySQL for a new project just to have the improved JSON support, but there's upsides to MySQL too:
- Per-thread vs per-process connection handling
- Ease of getting replication running
- Ability to use alternate engines such as MyRocks
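For the last point, using an alternate engine is just a table option, assuming your build ships it (MariaDB and Percona Server include MyRocks; stock Oracle MySQL does not). A rough sketch with pymysql, with placeholder connection details:

    # Sketch: creating a table on the RocksDB (MyRocks) engine, if available.
    import pymysql

    conn = pymysql.connect(host="localhost", user="app", password="secret", database="app")
    with conn.cursor() as cur:
        cur.execute("SHOW ENGINES")  # check whether ROCKSDB is listed before relying on it
        print([row[0] for row in cur.fetchall()])
        cur.execute("""
            CREATE TABLE IF NOT EXISTS events (
                id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
                payload JSON
            ) ENGINE=ROCKSDB
        """)
    conn.commit()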
Does MySQL have transactional DDL yet? Because if not, it is not a contender as a primary datastore.
Does that mean they fixed “utf-8” or that everyone is just supposed to know that it’s fucking bullshit and always has been?
You can’t cut corners like that without inviting questions about the character of the primary committers. The pecking order in software is about trust.
People don’t let that stuff go easily, which is why you still see people harping on MongoDB. Once someone is labeled a liar and a cheat, everything they say that doesn’t add up is “guilty until proven innocent.”
The utf-8 situation is on top of a bed of half truths. Things like publishing benchmarks with backends that don’t support isolation. A cornerstone of a good DB is handling concurrent access efficiently and correctly. Drawing attention to other benchmarks is a lie by omission. Better than just being incorrect for a decade, certainly, but still sketchy.
MyRocks is a little bit janky in my experience - it doesn't support some transaction isolation levels, fails to handle some workloads that rely on locking (such as a job queue), has failed to upgrade MariaDB minor versions [0], has very sparse documentation, and overall has given me some amount of unexpected behavior.
Though I'm willing to put up with it due to its incredible compression capabilities...
MySQL also has great JSON features (json data type, virtual indexes on it, multi-value (array) indexes, json_table, ...)
I'd be impressed if these were better than PostgreSQL's JSON features. Do you know how they compare?
2 years ago, they were worse, from what I recall.
MySQL has definitely improved a lot since 5.0 but it still had a few warts.
I prefer PostgreSQL, but MySQL provides a better clustering experience if you need more read capacity than a lone node can provide.
Oracle is great if and only if you have a use case that fits their strengths, you have an Oracle-specific DBA, and you do not care about the cost. I have been on teams where we met those criteria, and I genuinely had no complaints within that context.
Given both my experience and prior research, I don't believe you that Oracle is ever better than half the stuff on the above list, and I think it's worse than Postgres on every metric.
Every time I need to work with an Oracle DB it costs me weeks of wasted time.
For a specific example, I was migrating a magazine customer to a new platform, and all of the Oracle dumps and reads would silently truncate long textfields... The "Oracle experts" couldn't figure it out, and I had to try 5 different tools before finally finding one that let me read the entire field (it was some flavor of JDBC or something). To me, that's bonkers behavior, and is just one of the reasons I've sworn them off as anything other than con artists.
SQL Server: I use C# and write line-of-business applications.
My day job involves developing for / customizing / maintaining two separate third-party systems that rely on SQL Server (one of them optionally supports Oracle, but fuck that).
I gotta say, as much as I hate it with a passion, and as often as it breaks for seemingly silly reasons (so many deadlocks), it's at least tolerable (even if I feel like Postgres is better by just about every metric).
Datomic: I'm done already, send more work please
I've been working with a partner company that is using Datomic to back a relatively impressive product - but I don't really see much written about it. What has been your experience?
>* H2: My company can't afford a database admin, so I embedded the database in our application (I have actually done this)
I'm familiar with the variant, "InfoSec won't let us deploy a DB on the same host".
SQLite: I enjoy using reliable and correct databases even at the cost of scalability
SQLite has always intentionally failed to report this error:
    sqlite> create table foo (n int);
    sqlite> insert into foo (n) values ('dave');
    sqlite> select count(*) from foo where n = 'dave';
    1

That is documented as a feature: https://www.sqlite.org/different.html under "Manifest typing".
SQLite does not use column typing except in integer primary keys.
Fair enough, but just because you call a bug a feature doesn’t mean it’s not a bug.
I think most people have realised weak typing is not a good idea in programming languages. It’s especially not a good idea in databases.
Most people have not in fact realized weak typing is not a good idea. I myself vastly prefer strongly typed languages and think they are superior. However there are a huge number of people I work with and know professionally who prefer dynamically typed languages. Weak versus strong typing is a highly subjective opinion. Each one has different costs and benefits and which camp you land in depends in large part on what you value personally.
Most "scripting" languages are both strongly and dynamically typed at the same time.
SQLite accepts a type constraint without enforcing it. That's a bug just like MySQL sometimes accepting a foreign key constraint without enforcing it.
It's not a type constraint. It's a hint to SQLite to try and coerce values when it can. Here's what that link parent posted says:
> As far as we can tell, the SQL language specification allows the use of manifest typing. Nevertheless, most other SQL database engines are statically typed and so some people feel that the use of manifest typing is a bug in SQLite. But the authors of SQLite feel very strongly that this is a feature. The use of manifest typing in SQLite is a deliberate design decision which has proven in practice to make SQLite more reliable and easier to use, especially when used in combination with dynamically typed programming languages such as Tcl and Python.
It's intended behavior that's compatible with the SQL spec.
That doesn't change the fact that it's not a good idea.
Why?
I admit I was kind of thinking that, even though I appreciated the humor. :) I imagine an awful lot of web sites out there would do just fine with SQLite as their back end.
It's great for small, read-only websites that aren't quite simple enough to just be static HTML sites. Also for desktop app config/data DBs.
Can you elaborate? I've seen benchmarks, and from their website what I understood is that it can handle really massive reads and writes, tens (maybe hundreds) of thousands of ops per second, but I have personally never tested it to this extent.
We're using it in Quassel, and as soon as you go over ~3-4 parallel write/read threads, it starts locking up completely, sometimes taking 30 seconds for simple queries that should really take milliseconds.
The big issue is that sqlite does full db locking for any operation, so during any write you can't just easily read at all.
This can be fixed with WAL mode, but WAL mode was broken in its early versions, and new versions of sqlite aren't in all distros yet, despite being out for almost a decade. And even WAL mode gets abysmal performance.
As long as you're only doing reads, though, you can forego the locking and it's pretty nice!
Definitely would not use in a read/write situation for a web API for various reasons, though. :-)
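For what it's worth, switching to WAL is a one-liner; whether it helps enough depends on the SQLite version and workload, as noted above. A minimal sketch using Python's built-in sqlite3 (file name is a placeholder):

    # Sketch: enabling WAL mode so readers aren't blocked by a single writer.
    import sqlite3

    conn = sqlite3.connect("app.db")
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    print(mode)  # 'wal' if the SQLite version and filesystem support it

    conn.execute("CREATE TABLE IF NOT EXISTS log (id INTEGER PRIMARY KEY, msg TEXT)")
    conn.execute("INSERT INTO log (msg) VALUES (?)", ("hello",))
    conn.commit()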
Which version 'fixes' WAL in your opinion?
It really can (LXD cloud setup from personal experience), the problem is that if you don't serialise your writes then yeah, fun times to be had. There are compromises for all databases. People just like telling others their opinion as fact, and how wrong everybody is apart from themselves of course.
How far does SQLite scale? Obviously not good for anything public facing with thousands of concurrent users, obviously good enough for something you only use yourself, but what about internal tools with a couple hundred users total (few of them concurrent) - where's the limit when it starts slowing down?
Curiously, I just read this: https://blog.expensify.com/2018/01/08/scaling-sqlite-to-4m-q...
Expensify aren't really scaling SQLite in the way that people would expect. To say it's scaling SQLite is not exactly wrong, but probably gives the wrong impression. The users of their database likely wouldn't see it as SQLite, and they don't use the stock SQLite code.
They have their own layer on top that happens to use SQLite as the storage format on disk[1]. This layer means they aren't using full SQLite at the application level, but rather using their custom database in the application, and SQLite within their custom database.
Further, they've customised the SQLite codebase as far as I can tell to remove much of the functionality that SQLite uses to ensure that multiple instances can safely edit the same file on disk together, then they memory map the file and just have many threads all sharing the same data.
[1]: FoundationDB also does this, and scales to thousands of nodes. The trick is that it's essentially _many_ separate, very simple SQLite databases, each being run independently.
Agreed, it's not vanilla SQLite, but it can give you a rough idea of the quality of the SQLite codebase.
* Cockroach / Spanner: you know what's cooler than millions?
What if they prefer an XML database (like basex, exist, marklogic)?
We ask them politely, yet firmly, to leave.
Psychopath
I've used H2 in a couple of my personal JVM applications, mainly because when it comes to the JVM it's a somewhat nicer fit than SQLite.
I love postgresql, but I don't remember when I last interacted with it via the command line instead of pgAdmin.
What about MSSQL?
"We are a Microsoft-only shop"
I have other boats to rock than MySQL! ;-)
Neo4j?
ha, nailed it dude.
HAHAHAH The RabbitMQ one got me. Have your upvote, sir.
MySQL is actually amazing: it scales better than PGsql, supports JSON, and is available everywhere. I see no reason to use any other DB for 90% of the use cases you need a DB for.
MySQL does not scale better than PostgreSQL.
I can tell you this emphatically as I spent 6 months trying to eke out performance with MySQL (5.6). PostgreSQL (9.4) handled the load much better without me having to change memory allocators or do any kind of aggressive tuning to the OS.
MySQL has some kind of mutex lock that stalls all threads; it's not noticeable until you have 48 cores, 32 databases and completely unconstrained I/O.
EDIT: it was PG 9.4 not 9.5
You're comparing tech from 2 different eras... redo the benchmark today and I’ll be surprised if you come to the same results. PGsql even has a wiki page where they discuss implementing MySQL features and changing their architecture so they can scale. https://wiki.postgresql.org/wiki/Future_of_storage#MySQL.2FM...
> You're comparing tech from 2 different eras...
They were both the latest and greatest at the time
> redo the benchmark today and I’ll be surprised if you come to the same results.
I would, but it was not just a benchmark, it was a deep undertaking including but not limited to: optimisations made in the linux kernel, specialised hardware along with custom memory allocators and analysing/tracing/flamegraphing disk/memory access patterns to find hot paths/locks/contention. (and at different scales: varying the number of connections, transactions per connection, number of databases, size of data, etc)
It was 6 months of my life.
> PGsql even has a wiki page where they discuss implementing MySQL features and changing their architecture so they can scale.
Just because mysql has some good ideas doesn't mean it scales better. I know for a fact that it didn't in 2015. I doubt that they have fixed the things I found, I could be wrong. But it would have to be a large leap forward for MySQL and PostgreSQL has had large performance improvements since then too.
also, I read that page and it talks nothing about scaling, just that some storage drivers have desirable features (memory tables are very fast, and PGSQL doesn't support it; archive tables are useful for writing to slower media, you can do this with partitioning but it's not intuitive)
MySQL 5.6 was released in 2012. Postgres 9.5 was released in 2016, so this isn't really a fair comparison.
Whups, I meant 9.4, it's not possible that I could have tested 9.5 as the test was conducted in early 2015, both databases were the latest versions.
yes, I should run the test again, but it was 6 months of my life, and I don't see how much could have changed.
How is MongoDB dangerous or less consistent than PG? I have one for you: I can't use PG or MySQL because my app will go down if the master is down, and then the entire backend fails. How do you do HA with default PG?
https://www.postgresql.org/docs/10/different-replication-sol...
Logical replication or synchronous multimaster replication may meet your needs.
And you can tell a lot about a developer when they post comments like this.
Almost none of it is remotely accurate, e.g. RabbitMQ isn't even a database.
I can't believe the one item that was so obviously added as a joke went right over your head.
It may be good idea to take a break from the computer and find something less stressful to do.
Perhaps that’s because some other message brokers are now being touted as databases[0][1]; I remember seeing a thread about it on HN a couple of days ago.
[0] https://www.confluent.io/blog/okay-store-data-apache-kafka/
[1] https://dzone.com/articles/is-apache-kafka-a-database-the-20...
Kafka is a very different beast from RabbitMQ.
Kafka is much more like a distributed file system that has queuing semantics baked in than it is an ephemeral queue that implements some level of persistence.
The fact that you put Kafka and RabbitMQ in the same category sort of makes me feel like you're out of your element, Donnie.
Watch this talk: https://m.youtube.com/watch?v=fU9hR3kiOK0
Re RabbitMQ, isn't that the OP's point?
Man, people really hate Mongo.
We use it for a very specific use case and it's been perfect for us when we need raw speed over everything. Data loss is tolerable.
It seems like you have the only good use case for it pegged down. I've worked at multiple companies that really, really didn't understand that putting something into the DB comes with some probability that it'll never come out. The arguments were "but it's a dataBASE, it stores data. They'd never sell this as a product if it LOST data; then it wouldn't be a database..."
LOL
All: we've changed the submitted URL from https://www.infoq.com/news/2020/05/Jepsen-MongoDB-4-2-6 to the work it is reporting on. You might want to read both, since the infoq.com article does give a bit of background.
Edit: never mind, I think the other URL - http://jepsen.io/analyses/mongodb-4.2.6 - deserves a more technical thread, so will invite aphyr to repost it instead. It had a thread already (https://news.ycombinator.com/item?id=23191439) but despite getting a lot of upvotes, failed to make the front page (http://hnrankings.info/23191439/). I have no idea why—there were no moderation or other penalties on it. Sometimes HN's software produces weird effects as the firehose of content tries to make it through the tiny aperture of the frontpage.
Lying about your test results from Jepsen is like going onto a reality show with Chef Ramsey, being thrown off for incompetence, then putting his name on your restaurant's ads: "Chef Ramsey ate here!"
I'd pay to watch Kyle screaming at people in the MongoDB offices, not that he screams or anything. Just a spectacular mental image: "IT'S NOT ATOMIC! IT COULDN'T SERIALIZE A DOG'S DINNER!"
I would watch a tech version of Ramsey's show.. oh boy!
Yep, always thought it a shame there isn't one, but too small of a niche I guess. Also, almost everyone claiming online that they apply best practices at their company is maybe lying or engaging in wishful thinking; that would come out, so no-one would apply for the show. So maybe more of a startup show where 'a Ramsey' comes in when a (bootstrapped or angel-invested; VC-funded is not saveable that way imho) company is in distress for tech reasons put in by the founders. Relevant pet peeve for this thread: let us (a tiny, cash-strapped startup with founders who know just enough about prod envs to do a lot of damage) do everything autoscale in the cloud, and now we have a burn rate of $28k/mo on AWS bills with 5 users.
MongoDB's big problem is that their present user base does not want the problems fixed, particularly at default settings, because it would mean going slower. Their users are self-selected as not caring much about integrity and durability. There are lots of applications where those qualities are just not very important, but speed is. People with such applications do need help with data management, and have money to spend on it.
The stock market wants to see the product as a competitor with Oracle, so demands all the certifications that say so. MongoDB marketing wants to be able to collect money as if the product were competitive. Many of the customers have management that would be embarrassed to spend that kind of money on a database that is not. And, ultimately, many of the applications do have durability requirements for some of the data.
So, MongoDB's engineers are pulled in one direction by actual (paying) users, and the opposite direction by the money people. It's not a good place to be. They have very competent engineers, but they have set themselves a problem that might not be solvable under their constraints, and that they might not be able to prove they have solved, if they did. Time spent on it does not address what most customers want to see progress on.
Translation: They were trying to be everything for everybody.
The syntax is very nice; I honestly think a lot of its early success came from ease of use.
If they only cared about performance then they would've left the write concern defaults to not acknowledge writes either locally or within a replica set. Or just read from the nearest replica and don't worry about potential consistency issues.
Also, this isn't 2011. MongoDB is not a competitor to Oracle and never really has been, to anyone who knew that a document DB was not usable as a SQL one. It's other SQL databases that are the real competitors, e.g. Snowflake and Redshift.
You know it, I know it, MDB knows it, and most of their customers know it, but that doesn't matter: the stock market doesn't. MDB wants to be valued like a durable-database company, and to be able to charge durable-database prices. They need a plausible durable-database story to get those, regardless of what actual current users want.
It is possible there are still potential users not buying until they get that story. MDB wants those users.
MongoDB started life as a database designed for speed and ease of use over durability. That's not a good look for a database.
People have told me that they have since changed, but the evidence is overwhelmingly and repeatedly against them.
They seem to have been successful on marketing alone. Or people care more about speed and ease of use than durability, and my assumptions about what people want in a database are just wrong.
> MongoDB started life as a database designed for speed and ease of use over durability. That's not a good look for a database.
I think it depends. One could say the same about Redis, but it's wildly successful and people love it.
The difference is how they are advertised. Redis makes no claims to be anything other than what it is - a fast in-memory database that has some persistence capability but isn't meant to be a long-term data store. MongoDB, on the other hand, made (and continues to make) claims about being comparable in atomicity and durability to traditional SQL databases (but magically much faster!) that haven't withstood scrutiny.
Keep in mind, too, that most data ain't worth much. It's one thing to entrust data of low value in MongoDB; another to store mission-critical data in it. I would look askew at leadership who didn't ask hard questions about storing data worth millions or billions of dollars in MongoDB without frequent snapshots -- and even then, the value mustn't be contingent on the 100% accuracy of said data.
When I'm thinking about data stores in large systems I like to break them down depending on how they are used, on two main axes: how fast- or slow-moving the data is, and durability, ranging from "we don't care" to "we must never lose data".
It's easier to reason about systems if there's fewer things that require durability guarantees, ideally you want to be able to draw data flows that look like a tree instead of a graph.
I find that Redis fits great because it's perfect for a whole bunch of different temporal shared state needs, everything from sessions to partial results. I've also deployed things like Ehcache, MongoDB, and Memcached to fit these needs and found other tools such as Kafka or RabbitMQ to be great "glue".
Having the root of your important data be something "boring" like Postgres or MySQL (or even Oracle!) is just good risk management to me. I wouldn't want to trust Redis or MongoDB for important data because it adds to the things I have to worry about. It's "keeping your eggs in one basket" while making sure that basket is really well looked after.
Yes. What I love most about Redis is that the fundamental tradeoffs of the algorithms it's built on are surfaced up through the interface, and made very plain in the documentation.
Reading past marketing blurbs and using products for the things they are designed for is part of any engineer's job. I was irritated by MongoDB's claims and defaults, but that didn't stop us from putting it in production. We used it from 2012 to 2016 (their most infamous years?), and for our use cases, scale, size+expertise, and feature set, it was a perfect match. In our case, durability was a smaller concern by design (lots of write-only data, lots of ephemeral data), but we still configured it carefully and never ran into any data loss whatsoever; snapshots worked, migrations worked, etc.
If the service had lasted longer, scaled bigger, and the business it supported had been more successful, we might have ended up with a now-classic MongoDB to pg migration. That was always an acceptable outcome, and it would have not invalidated going with Mongo at the start.
>In our case, durability was a smaller concern by design (lots of write-only data, lots of ephemeral data),
I assume that you mean write once data. If you mean write only you might as well use /dev/null.
The performance gains are very impressive https://tech.wayfair.com/2013/04/devnull-vs-mongodb-benchmar...
Most backups are never actually read / used after writing but /dev/null is not a good substitute for them.
[Edit] in case I wasn't clear: Imagine data that is written and rewritten and overwritten before it gets read; a user that never returns to your product - his state data was written but will never be read. That sort of thing.
We[1] have done 50+ conversations with developers this year (mostly indie and small startups). You’re right about the ease of use. The top reasons are:
- they don’t know why, it was just the one they learned/heard about first
- there is a lot of tooling for it
A lot of them even knew about the limitations of MongoDB but they still chose it. We concluded that other databases need to start prioritising usability; something few developer tools usually care about.
[1] https://supabase.io
So Supabase is like Hasura for REST?
When are you launching?
Very similar - our realtime is done a bit differently (we use the WAL rather than triggers), and we plan to use Postgres' RLS rather than a custom Auth system.
> When are you launching?
We're in alpha now - https://app.supabase.io (desktop only)
Maybe it's just because I know SQL reasonably well but I don't even find Mongo particularly easy to use. Not for complex queries anyway.
I think the ease of use was more in the administration. It was (is?) super easy to set up and run (for small installations).
> Or people care more about speed and ease of use than durability
I think 90% of the Mongo installs I've been exposed to were set up by people that were tired of fighting with Hibernate configurations and schema migrations.
It's also popular among people whose definition of "legacy software" is "that app I stopped working on after three months because I have something shiny and new."
We have it at our work. I bet it's because it was the hip new thing to try out in 2013. Our tech lead is more into tech challenges than building a maintainable app.
I used it effectively to denormalize and combine some data from other services... sort of like a 2nd level, queryable cache. Worked very well for my needs. This was 7-8 yrs ago.
Yes, this is true, it's good as a cache since it values speed over durability. But since it's not built for that, you could potentially do better with an in-memory database like Redis.
I find with the MongoDB style of database, it's easy to prototype without needing to do the heavy schema management of SQL.
But, if you need a traditional ACID database, the flexibility comes with punch in the groin technical debt.
The Jepsen analysis : https://jepsen.io/analyses/mongodb-4.2.6
I wonder if I'm the only sysadmin in the world who doesn't hate MongoDB. Yes, I wouldn't use it for new projects, and yes, I wish RethinkDB had taken its place, but it's not as horrible as people seem to think. Default configuration... If it weren't for RDS doing PgBouncer-style connection management, 95% of production postgres instances would probably fail. If innodb_buffer_pool_size wasn't set properly, plenty of data-centers would light on fire. If no one set up a firewall or AOF for Redis, it's data loss and data exposure waiting to happen. If no one adds auth to an HTTP route, it's open to the world, etc etc etc. If tech stacks were legos, software engineers would earn a heck of a lot less.
I absolutely agree it's been used by people who just don't want to write SQL queries, or used as a text search engine in place of something more appropriate like Elasticsearch, but mocking successful projects that were built on it seems silly. It reminds me of interviewing candidates at a startup that primarily used PHP/MySQL. Most of them openly laughed and called it all horrible. I voted "no" on them, and sometimes injected a somewhat toxic "ah, you're right - we should close up shop. Someone call Facebook - tell them their tech stack is horrible - shut it all down!".
You can learn a lot about a developer by asking "What do you think about Mongo, JavaScript, or PHP", and if their response isn't a shrug, they're probably more concerned with what editor is correct than if the product they're building is useful. It's an exceptional filter to reject zealots and find pragmatists.
All that said, MariaDB with MyRocks is _awesome_, but certainly not with the default settings :)
> You can learn a lot about a developer by asking "What do you think about Mongo, JavaScript, or PHP", and if their response isn't a shrug, they're probably more concerned with what editor is correct than if the product they're building is useful. It's an exceptional filter to reject zealots and find pragmatists.
Sure, if they’re being rude about it. A developer saying that it will not fit the use case, or talking about spending a month of their time fixing a production issue caused by MongoDB, will definitely not get a “no” from me. I’m not hiring subservient people; I’m hiring people who can think for themselves and choose the right tool for the job, which Mongo rarely is.
Yes, I couldn't agree more. Didn't mean people can't have opinions of course - just that they shouldn't be scoffing/laughing at tools - to me it's a particular red flag. Not wanting to use a particular tool is totally fine!
> just that they shouldn't be scoffing/laughing at tools
Why on earth would that be a problem? Why are you sensitive about tools? Do the tools have feelings?
Tools are just tools. Imagine scoffing at someone using a hammer, when in your less experienced opinion, they should be using NewHammer3.1.
Tools are means to an end. I much prefer working with people who care about the product and the business than what tools they used to build said product/business. I'm looking for "hey, you could save a lot of time by using tool Y", rather than "You people are idiots for even considering tool Z".
No. I can certainly imagine anyone scoffing at someone using a hammer when they are trying to dig a hole in the ground screaming "Tools are just tools!"
So yes, writing a website with assembler raises eyebrows and it is quite OK to ridicule bad languages such as PHP.
You must know that's a fairly ridiculous reading of what I wrote...
RethinkDB is a better solution to every problem that MongoDB claims to solve. I wouldn't use it for everything. But once my need for a document store outgrows what's convenient and easy in Postgres with JSONB, I reach for Rethink. It's great. There's a Jepsen analysis of it a while back too that is quite positive.
It's a shame that Rethink did so many things right and failed as a company while Mongo continues to do almost everything wrong as a company and still gets business.
> It's a shame that Rethink did so many things right and failed as a company while Mongo continues to do almost everything wrong as a company and still gets business.
This seems to be more the rule than the exception, doesn't it?
It's even not that hard to come up with explanations for this, main one certainly being that popularity depends essentially upon simplicity.
And simplicity might not even be economically as inept as we would like it to be. Indeed, since only a small minority of all the systems that are designed reach production and stay there for long then it can make sense to use the quickest piece of junk available, at least until proven it will stick.
My current data solution is layers of code on top of redis, trying really hard to be everything Rethink was.
Easy access to changelogs should be a standard feature in all databases. Event-driven systems aren't rare: the data store needs to be able to tell interested parties that underlying data has changed.
Imagine how I feel about it :)
There is much amusement to be obtained from reading Jepsen's report:
"MongoDB’s default level of write concern was (and remains) acknowledgement by a single node, which means MongoDB may lose data by default.
...Similarly, MongoDB’s default level of read concern allows aborted reads: readers can observe state that is not fully committed, and could be discarded in the future. As the read isolation consistency docs note, “Read uncommitted is the default isolation level”.
We found that due to these weak defaults, MongoDB’s causal sessions did not preserve causal consistency by default: users needed to specify both write and read concern majority (or higher) to actually get causal consistency. MongoDB closed the issue, saying it was working as designed"
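For concreteness, here is roughly what opting out of those defaults looks like from a driver. This is a sketch using pymongo; the connection string, database and collection names are placeholders, and it assumes a replica set:

    # Sketch: majority read/write concern plus a causal session, since (per the
    # report) the defaults alone do not give causal consistency.
    from pymongo import MongoClient
    from pymongo.write_concern import WriteConcern
    from pymongo.read_concern import ReadConcern

    client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")

    db = client.get_database(
        "app",
        write_concern=WriteConcern("majority"),
        read_concern=ReadConcern("majority"),
    )
    orders = db.get_collection("orders")

    with client.start_session(causal_consistency=True) as session:
        orders.insert_one({"sku": "abc", "qty": 1}, session=session)
        doc = orders.find_one({"sku": "abc"}, session=session)  # sees the write above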
MongoDB is horrible, I get it.
What do I use in this situation:
1) I need to store 100,000,000+ json files in a database
2) query the data in these json files
3) json files come from thousands upon thousands of different sources, each with their own drastically different "schema"
4) constantly adding more json files from constantly new sources
5) no time to figure out the schema prior to adding into the database
6) don't care if a json file is lost once in awhile
7) only 1 table, no relational tables needed
8) easy replication and sharding across servers sought after
9) don't actually require json, so long as data can be easily mapped from json to database format and back
10) can self host, no cloud only lock-in
Recommendations?
Elasticsearch? http://smnh.me/indexing-and-searching-arbitrary-json-data-us...
Depends on what your queries look like, I guess.
Just adding that I have used elasticsearch for a use case under the above constraints several times in the past and it worked well.
Ironically once because mongo was such a pain to work with I dumped the data from it into ES to get the better API, usability and Kibana.
I don't think it's that simple (being horrible). MongoDB can be great for some specific situations, perhaps yours. It's just that it's not for many others, and you'd need to be an expert to find this out from the docs.
Postgresql with 1 table with JSON fields?
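Something like this is all it takes to get started. A rough sketch with psycopg2; the DSN and table layout are made up, and replication/sharding across servers (point 8) is a separate question:

    # Sketch of the "one Postgres table with a jsonb column" suggestion.
    import json
    import psycopg2

    conn = psycopg2.connect("dbname=docs")  # placeholder DSN
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS docs (
            id     bigserial PRIMARY KEY,
            source text NOT NULL,
            body   jsonb NOT NULL
        );
        CREATE INDEX IF NOT EXISTS docs_body_gin ON docs USING gin (body jsonb_path_ops);
    """)

    cur.execute(
        "INSERT INTO docs (source, body) VALUES (%s, %s)",
        ("feed-42", json.dumps({"name": "jane", "tags": ["a", "b"]})),
    )

    # Containment query: find documents whose body contains this sub-object.
    cur.execute("SELECT id FROM docs WHERE body @> %s::jsonb", (json.dumps({"name": "jane"}),))
    print(cur.fetchall())
    conn.commit()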
I think it's remarkable this report has been out for a week now and no one at MongoDB has commented on it. At least, not that I have seen.
Maybe they're too busy spending their MDB money.
I genuinely am confused by comments like this.
Are companies not supposed to invest money into their product, sales, people etc ?
And why does being listed on the NASDAQ imply being flush with money ?
> Are companies not supposed to invest money into their product, sales, people etc ?
> And why does being listed on the NASDAQ imply being flush with money ?
It was intended to be a playful reference to MDB's stock price being on a tear right now, not simply being listed on NASDAQ.
Expand the timeline on the graph to "Max", it's at an all time high.
The culture changes to one of sales & corporatism from one of engineering and focus on product. Why would they be here?
"We found that due to these weak defaults, MongoDB’s causal sessions did not preserve causal consistency by default: users needed to specify both write and read concern majority (or higher) to actually get causal consistency. MongoDB closed the issue, saying it was working as designed, and updated their isolation documentation to note that even though MongoDB offers “causal consistency in client sessions”, that guarantee does not hold unless users take care to use both read and write concern majority. A detailed table now shows the properties offered by weaker read and write concerns."
That sounds like a valid redress, or am I missing something?
Kyle's point is that it's arguably valid but certainly unhelpful: the default settings are liable to lead to data loss. Moreover, he draws attention specifically to transactions as something which you would expect to make things safer, but in fact there's a rather arcane part of the documentation that notes that you need to manually specify both read and write concerns on every transaction individually if you want transactions to behave consistently, regardless of the concerns specified at the database level.
Basically, there are a large number of pitfalls that it's very easy to fall into unless you have an encyclopaedic knowledge of the documentation, and you need to ignore some of the words that are used (like "transaction" or "ACID") because they carry connotations that either do not apply or only apply if you do extra work to make it so.
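To make that concrete, here is roughly what "specify both on every transaction" looks like from pymongo. Connection string, database and collection names are placeholders, and it assumes a replica set that supports transactions:

    # Sketch: setting read/write concern on the transaction itself, since
    # database/collection-level settings are not applied to transactions.
    from pymongo import MongoClient
    from pymongo.read_concern import ReadConcern
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
    accounts = client.bank.accounts

    def transfer(session):
        accounts.update_one({"_id": "a"}, {"$inc": {"balance": -10}}, session=session)
        accounts.update_one({"_id": "b"}, {"$inc": {"balance": 10}}, session=session)

    with client.start_session() as session:
        session.with_transaction(
            transfer,
            read_concern=ReadConcern("snapshot"),
            write_concern=WriteConcern("majority"),
        )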
> the default settings are liable to lead to data loss
In Mongo's defense, the defaults are similar to what you would likely have with a replicated MySQL/Postgres cluster (single node accepting writes with slaves replicating from there; no concept of write concern). My assumption here is that he is assuming the primary dies before the writes have replicated to the secondaries; that is exactly how master-slave fails too. Maybe there are systems folks can use for having write concern in those databases, but in the companies I've worked for we didn't have them and we definitely didn't have automated failovers
How is this any different than DynamoDB where you specify that you want either eventual consistency vs strong consistency? DDB also does eventual consistent reads by default.
Is the argument that Mongo’s documentation isn’t clear?
I trust Kyle when he tells me that the behaviour he observes is surprising. From the analysis (https://jepsen.io/analyses/mongodb-4.2.6):
"In order to obtain snapshot isolation, users must be careful not only to set the read concern to snapshot for each transaction, but also to set write concern for each transaction to majority. Astonishingly, this applies even to read-only transactions."
"This behavior might be surprising, but to MongoDB’s credit, most of this behavior is clearly laid out in the transactions documentation… MongoDB offers database and collection-level safety settings precisely so users can assume all operations interacting with those databases or collections use those settings; ignoring read and write concern settings when users perform (presumably) safety-critical operations is surprising!"
There is difference between “Mongo’s documentation sucks” and “Mongo is technically deficient”. The former can be corrected by updating the documentation.
Yes, I agree as far as the end user is concerned, they are losing data either way.
I think the implication here is that "Mongo's documentation is deliberately bad in order to hide their technical deficiencies," i.e. they're hoping people will use the defaults, be impressed by the speed, and never realize until it's too late that they're not getting the consistency they were promised.
DynamoDB conditional writes are strongly consistent. Defaulting to inconsistent reads was reckless and I would never defend that, but the worst case is non-repeatable stale results, never lost writes.
It’s clearly documented everywhere that reads are eventually consistent by default.
That's the right way to disclose a dangerous default, but defaults should be as safe as possible, and users should think very carefully about whether they can get away with opting out. Consistency failures can be very non-intuitive, and hard to clean up after.
Oops. Turns out I was right.
https://news.ycombinator.com/item?id=23271211
The timing of this is absolutely beautiful.
So now we shouldn’t ever trust a project because they don’t have good technical writers?
I don’t have a dog in the Mongo fight. I haven’t done an implementation on top of it in years, and the next time I do something with “Mongo” it will probably be AWS’s Document DB with Mongo support. That’s based on AWS’s own code and storage tier and doesn’t have the same characteristics as Mongo proper.
For what it's worth, Document DB doesn't support a lot of the Mongo API, such as $$ROOT in aggregations, and it can't use indices on (paraphrased) "SELECT * FROM x WHERE id IN [list]" if the list length is > 10.
If you ask me, if there's something worse than Mongo it's Document DB.
Given a choice between the limitations of DynamoDB and DocumentDB, DocumentDB is still far more versatile and written on top of the same storage engine and is just as reliable.
And yes I know most of the data modeling tricks around using GSI’s and LSI’s.
> So now we shouldn’t ever trust a project because they don’t have good technical writers?
> the newer MongoDB 4.2.6 has more problems including “retrocausal transactions” where a transaction reverses order so that a read can see the result of a future write.
Oh, Jepsen and MongoDB again? Somebody get the popcorn!
Unfortunately, not an entertaining showdown - too one-sided.
I remember having immensely enjoyed the original "Call me maybe" analysis [https://aphyr.com/posts/284-jepsen-mongodb]. Sometimes it's just fun to see someone beaten.
Because MongoDB is web scale?
Some readers might not be familiar with that particular meme: https://m.youtube.com/watch?v=b2F-DItXtZs
IMHO it perfectly describes the hype-reality disconnect at the early days of MongoDB. Yeah it was that bad.
Mongo has improved since, the hype has toned down and the NoSQL space is more crowded these days.
i remember diaspora chanting about using mongodb.
then a year or two later they admitted that their data model mostly fitted the relational model, and that they spent a lot of time basically reimplementing relational integrity in application code, in ruby.
yeah, diaspora has never been fast. I'm not sure they can blame it on mongodb though.
I remember the Mongo hype when it came out and I really couldn't understand it. You are just throwing away a lot of useful features of a relational database because "schemaless" and "big data". The majority of people using it were on single server setups.
The damage is done. I have to use this crap at work when we should be using an SQL database. We have been planning a migration since before I started a year and a half ago. I won't be surprised if we are still on Mongo in another year and a half.
Fantasy: "Don't use special database features (by which I mean, like, any features) and make sure our ORM supports a ton of different datastores because we might want to change to a different database at some point and don't want to be tied to this one."
Reality: Three app rewrites later plus another application written talking to the same DB, and the database is still the same.
Oh that brings back memories ;)
I still remember when MongoDB was the new kid on the block and it was lauded as the only thing you should be using here on HN.
I'm glad my gut instinct was correct and that it really wasn't worth the hype. It reminds me of Ruby on Rails.
I've never used RoR but I know people that still swear by it. It's outdated by today's "standards," but ActiveRecord was and is still a gem (heh) and a lot of RoR's foundational principles have been adopted by the existing major frameworks.
Regardless of technical acumen, I believe RoR doesn't deserve to be compared to Mongo for one reason: the RoR developers never tried to gaslight their users into thinking they're the reason everything broke; they never said only "if you know what you're doing" can you avoid these hidden pitfalls.
Every time I see a post about Mongo it makes me wonder what could have been if RethinkDB was managed differently.
I worked at one company where the network traffic just on the MongoDB master was around 2gb/s. We had machines with terabytes of memory, and Mongo worked fine - until we had some replica set nightmares. Mongo support is amazing, but when replication breaks it's very hard to diagnose (usually it was our fault, but it felt very fragile).
I used MongoDB for 1 year for a multi-million-user app. I abandoned it. The reliability and stability are just not good. I wanted it to be good, but it turned out to be a different story.
Ok, so defaults suck, marketing is misleading, and documentation and error messages are not exactly obvious. Assuming you are already stuck in the soup, putting those issues aside and getting practical instead of throwing more fire on the discussion:
If you set w: majority and r: linearizable/snapshot on the collection, the client, and on transactions, and assuming you accept snapshot isolation: how bad are those remaining cases in reality, and how do these issues compare to other databases? The final "read your future writes" error looks quite scary and does not seem to be caused by a configuration error; same with "duplicate effects".
"Informally, I would summarize the CAP theorem as: If the network is broken, your database won’t work."
- Dwight Merriman, former CEO, and "one of the original authors of MongoDB" [1]
A word to the wise suffices. Sometimes the word in question is implied by other words.
For those who get this oblique post, note that throwing the above bon mot into an interview session for a "distributed systems engineer" and asking for an opinion is an excellent way to differentiate between Peter Principle and Principal Engineer.
[1]: https://web.archive.org/web/20100903213540/http://blog.mongo...
Discussed previously:
Surprisingly, it seems not to have made the front page: http://hnrankings.info/23191439/. There's clearly community appetite to discuss this, so we won't treat the current submission as a dupe.
“ Did HN's antispam measures get a lot more aggressive recently? The last handful of Jepsen reports have really struggled to make it to frontpage, despite significantly higher vote-to-age ratios than comparable posts. Once they're on FP, they reliably hit top 10, but Dgraph's (1/2) ” https://twitter.com/jepsen_io/status/1261640852666855426
Funnily enough I emailed aphyr earlier this afternoon to let him know that the current submission was at #1 on HN and that https://news.ycombinator.com/item?id=23285249 had strangely failed to make the front page despite all the upvotes it got. There wasn't any moderation in either case. Nothing has changed recently. There's just a lot more randomness than people assume...2000 submissions a day competing for the same 30 slots creates a lot of weird high-pressure effects.
Our company migrated away from MongoDB, here's a talk about how we did it, in case you're thinking about what is involved and how to do it safely: https://www.youtube.com/watch?v=Knd3m2qh0o8
Ubiquity used MongoDB for their CloudKey Gen1 series. When there was an unexpected shutdown, there’s a random chance it would lose its configuration [1]. If your SD backup didn’t work, you’d lose configuration for all WiFi hotspots. If you did client installs like I did, this was a total nightmare. How did they solve it? Release new, more expensive hardware with a battery backup acting like a UPS. Never solved Gen1 issues. Imagine your phone corrupting after a hard reset. Thanks Ubiquity & MongoDB
[1] https://community.ui.com/questions/MongoDB-corrupt-after-eve...
If you want to be "that guy" at parties, ask people what MongoDB is trying to solve. If they bring up the typical "NoSQL document store" stuff, ask them why you'd want to use MongoDB for that.
MongoDB uninstalled our cloud hosted cluster once and the site was down and we needed to setup a large database from backups. Their response was very unhelpful. I would never touch MongoDB again.
Regardless of tech, MDB is a weird stock that just keeps going up.
I have never understood the stock price. I tried shorting at one point, that was a mistake.
It looks like relatively few people clicked through to read the analysis itself, so @dang's kindly offered to repost it. You can find the analysis here:
https://jepsen.io/analyses/mongodb-4.2.6
... and the corresponding HN thread here:
If you're looking for MongoDB done right, it does exist and it's called RethinkDB. For some reason it didn't catch on and become popular — but it's nicer, and most importantly, it doesn't lose your data.
Data point: I have been running my production system (a fairly complex SaaS) on RethinkDB for the last 4 years.
RethinkDB is no longer supported, its major caveat.
Yes. Although the degree of "support" always depends on how much you pay for it :-) I doubt MongoDB is "supported" in the way most people understand that word.
From my point of view, RethinkDB is not regularly developed and improved. There is progress, but it's slow. Which is a pity, because it's a really good database, and one that tries really hard to be correct above all else.
The only other correct distributed database with strict serializable guarantees that I know of is FoundationDB, which is nowhere near as easy to use as RethinkDB (but it's somewhat easier with their document layer, which pretends to be MongoDB, just done right).
Main argument for using document-oriented databases: https://martinfowler.com/bliki/AggregateOrientedDatabase.htm...
Does anyone have a recommendation for a NoSQL database?
https://news.ycombinator.com/item?id=23253870
(not Mongo obviously)
This question sounded familiar - turns out I replied to it in another thread: https://news.ycombinator.com/item?id=23286054
To repeat my (non)answer:
There is no way to recommend a NoSQL database without knowing what you need it for, because NoSQL databases are highly specialized systems. If you need a general-purpose database, use an SQL one.
It's kind of a weird question, now that I think about it. Why would anyone seek out a database based on what it doesn't have?
I'd actually say the reverse. SQL databases are highly specialised datastores: they make sense if you need one particular transaction model and one particular query language and are prepared to coerce your data into one particular model to do so.
If you're starting from just "I need to store some data" I'd look to e.g. Riak or Cassandra before looking to an SQL database.
SQL DBs are not specialized.... they're incredibly general...
You are never starting from "I need to store some data" you're always going to start from "I need to store and read some data" otherwise /dev/null would work if you are not going to read the data back.
the problem with cassandra and riak is precisely the read aspect of the problem which quickly degrades the performance of those systems.
I've used both cassandra and postgresql at scales most companies never reach. cassandra I'd only touch for immutable time series data, and only if that information was large enough not to fit on a single server and I didn't care about consistency. everything else is a SQL rdbms.
For simple reads, the SQL model forces significantly worse performance: MySQL benchmarks found that 75% of the time for a pkey lookup was spent on parsing the SQL. For more complex querying, SQL databases can be fast... and they can also be extremely slow, and you can't tell for any given query just by looking at it.
The much-vaunted consistency comes at a significant cost: index updates block writes, and more insidiously, it's very easy to be surprised by a deadlock or a stale transaction with a long-running query. I've seen an SQL database stop committing any new writes because someone ran a seemingly innocuous query 23 days ago. And a lot of the time - including every web use case I've seen - you can't actually make any real use of those consistency guarantees.
Writing either a transformation pipeline that serves the same function as a secondary index, or a deliberate map-reduce style aggregation, takes more up-front effort. But it means you understand what's actually going on a lot more clearly and are much less likely to hit that kind of unpleasant surprise.
I wonder, given your experience, did you ever try FaunaDB? It grew from the not optimal experience when scaling databases like Cassandra etc @ Twitter. Consistent + relational + multi-region.
[repost - asking for help] I am disappointed with the direction that MongoDB took this past few years. Going ACID shows in benchmarks [1] and it’s not advisable if you are using MongoDB for stats and queue. (No one uses MongoDB for financial transactions despite the changes.)
And the recent change to a restrictive license is worrisome as well. I have been thinking of forking 3.4 and making it "true" open source again, with the awesome performance. (If any C++ devs want to help out, reach out to me! username @gmail.com)
Why not use PostgreSQL instead? It supports a JSON document data type natively. It also has exceptional stewardship as an open source project.
Mongo should never be a first choice, but a last choice for edge cases.
I really enjoy using PostgreSQL, I just don't know how to make it scale easily. Running it on a large VM in the cloud works fine until you have lots of data or need it easily accessible. How can you have the data in three different regions (e.g. Europe, US, Asia) when you're using something like Google Cloud? Seems to be a hard problem to crack.
generally speaking the only reason that would be difficult would be due to legal constraints on the data locality.
if it was just performance, then read-only replicas in each region would solve most of the issues.
for the legal case generally I just end up with a separate postgresql DB in each region and during login the region is determined by user/company.
usually ends up being like 1k LOCs total.
Try with Aurora Postgres Global Database, it's on AWS, but should be accessible from other clouds.
Postgres has terrible indexing with JSON. It doesn't keep statistics, so simple queries sometimes take much longer than expected due to the query planner not knowing much about the data.
“It doesn’t keep statistics” is a weird way to say “I expect full table scans to always be fast”.
Create a functional index.
I’m not sure if I can even understand what you are talking about or why you said that.
If you have pid,{name:’val’, others...}. And an index of name with a million John and one Jane. Good luck getting fast results.
> If you have pid,{name:’val’, others...}. And an index of name with a million John and one Jane. Good luck getting fast results.
Fast results: https://dbfiddle.uk/?rdbms=postgres_10&fiddle=dd9370966b1528....
Postgres does actually keep statistics on json columns, but if you've got a functional index on the table and the query uses it then it doesn't matter if there is one "jane" and a million "johns". You're looking up a key in a btree index.
No it doesn’t. But what do I know running several clusters in production for a $100 million business? Please read up on the subject before arguing.
> SELECT most_common_vals
> FROM pg_stats
> WHERE tablename = 'json_test'
> AND attname = 'json_column';
> {"{\"name\": \"john\"}"}
Hmm. Looks like it does though. Not that it makes a damn bit of difference because if you haven't got a functional index (i.e the stats are next to useless) then you're doing a full table scan, and in that case it sounds like you “expect full table scans to always be fast” :)
And sure, the statistics don't help with the query planner, unless you've got a computed column, but again see "I expect full table scans to always be fast" and re-consider the statement "postgres doesn't keep statistics on json columns" given the fact that it actually does, just like any other column.
Read up indeed!
I’ve seen that as well, the default estimate for jsonb can seriously confuse the query planner. There is a patch in PG13 that addresses this as far as I understand, but I’m not familiar enough with PG internals to be sure I’m reading that right. I’ll be playing with this when PG13 is out, the jsonb feature is really useful, though I wouldn’t recommend to shove relational data into it. Many things are much, much harder to query inside jsonb than regular columns.
There are ways around the statistics issue in some cases, e.g. defining a functional index on a jsonb property will collect proper statistics.
DB noob question: if you know that you should be indexing on a json attribute, can’t you put it into a “proper column” and index it there?
There are a number of ways to do this:
* Extract the attributes you're interested in into their own columns, index these. With the extraction happening outside the database, this is the most flexible option.
* Similar to above, use a trigger to automatically extract these attributes.
* Also similar to above, use a generated column[0] to automatically extract these attributes.
* Create an index on the expression[1] you use to extract the attributes.
My use of JSON in PostgreSQL tends towards the first option. This works well enough for cases where documents are ingested and queried, but not updated. The last three options are automatic: add or change the JSON document and the extracted/indexed values are updated automatically (see the sketch after the links below).
[0] https://www.postgresql.org/docs/12/ddl-generated-columns.htm...
[1] https://www.postgresql.org/docs/12/indexes-expressional.html
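As a rough sketch of the generated-column variant (PostgreSQL 12+; the table and field names here are hypothetical):

    -- Hypothetical documents table; "customer_name" is derived from the
    -- jsonb document automatically on insert/update (PostgreSQL 12+).
    CREATE TABLE documents (
        id            bigserial PRIMARY KEY,
        doc           jsonb NOT NULL,
        customer_name text GENERATED ALWAYS AS (doc ->> 'name') STORED
    );

    -- Index the extracted column like any other column; it also gets
    -- normal per-column statistics.
    CREATE INDEX documents_customer_name_idx ON documents (customer_name);

    -- Queries against the extracted column are plain indexed lookups.
    SELECT id FROM documents WHERE customer_name = 'jane';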
You could, of course. But that would mean you are effectively not using json anymore. You need to pull the data out of your json on each write, update it in two places, and so on. And if you need to drop that value from the json, what do you do with the extracted column? You need to delete it as well. You end up managing two things.
There is always a trade-off. If the column is important enough, then you are right, it should stand on its own, but then you lose the json flexibility. I personally almost always use jsonb only when I care about the object as a whole and rarely need to poke around to find an exact value. As the grandparent comment mentions, if you do need a particular value, it might be slow if your JSON records are too different (if you think about it, how can you calculate selectivity stats on a value when you have no idea how wide or how different the JSON records are?).
> Why not use PostgreSQL instead? It supports a JSON document data type natively.
Yes, that's the thing, it's just a field type. It's not really that different from dumping your JSON in a TEXT column. MongoDB is fun because it's truly JSON (BSON), so you don't have to run migrations, you can store complex documents, and you have a more object-oriented way of storing your data than SQL.
You should probably read the Postgres documentation [1] before you make erroneous claims like this. Postgres JSON fields can be destructured, queried, and aggregated sufficiently to cover at least the 90% cases in MongoDB usage.
I'll grant that Postgres probably isn't as much fun as Mongo, what with all its tiresome insistence on consistency and reliability. I would, however, argue that quantity of available fun isn't really a figure of merit here.
> It's not really that different than dumping your JSON in a TEXT column
That was true of the initial "JSON" type support.
It is very much not true of the "JSONB" type, which was added in 2014 as part of Postgres 9.4. JSONB uses a binary serialization that supports efficiently selecting into JSON documents, putting regular BTREE indexes on specific fields inside the documents, or even putting Elasticsearch-like inverted indexes on complete JSON documents.
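For example (a sketch, not the parent's code; the table and document shape are made up), the "inverted index" option is a GIN index plus containment queries:

    -- Hypothetical events table with a jsonb payload.
    CREATE TABLE events (
        id      bigserial PRIMARY KEY,
        payload jsonb NOT NULL
    );

    -- GIN index over the whole document (the inverted-index style).
    CREATE INDEX events_payload_gin ON events USING gin (payload);

    -- Containment query: "does the payload contain this sub-document?"
    -- This can be answered from the GIN index.
    SELECT id FROM events WHERE payload @> '{"user": {"country": "DE"}}';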
This is so not true that it hurts. Postgres jsonb is highly queryable.
It's completely different than dumping json into a text field...
Read the docs, you can do a lot of fancy JSON stuff in plain ol' Postgres. It's really powerful and guarantees your data.
Postgres already handles JSON well. MySQL does a good job now too. And there are tons of other JSONB/document stores like Couchbase, CouchDB, RavenDB, MarkLogic, ElasticSearch, ArangoDB, CosmosDB, AWS DocumentDB, and even RethinkDB, which still exists.
It's a nice goal but there's likely not much of a commercial market for it, if that's your roadmap.
> And the recent change to a restrictive license is worrisome as well. I have been thinking of forking 3.4 and make it back to “true” open source and awesome performance.
Please do; someone needs to take that first step, and then many more could potentially contribute.
The license change was needed to keep MDB alive. Amazon's DocumentDB is just a fork of MDB from before the new license.
<rant>
This corruption is brought on by the stock market.
Have a look also at Shopify. They go and tack on 2% fees when customers use Google Pay or Apple Pay to check out. They recently announced that FB would be pulling e-commerce sales into its own app, and yet Shopify plans to charge 2% on top of FB's fees. That's what I could gather despite the pricing being rather opaque.
Is this a step forward or backwards? Charging 2% / transaction for modern Internet protocols running on cheap hardware across a public network?
</rant>
Obligatory https://www.youtube.com/watch?v=b2F-DItXtZs
Can anyone share any positive experiences with MongoDB? I wouldn't expect MongoDB to be perfect, any more than any other piece of tech, but the unanimous hatred for it seems a little overblown. Not trying to discredit the bad experiences people have had with it. Just curious to know where people are using it successfully.
This has been a known issue for a while:
https://hackingdistributed.com/2013/01/29/mongo-ft/
MongoDB: Broken By Design
Might want to read up as this involves a completely different set of issues.
And most of those listed in the blog were fixed many years before 2013.
Actually, I read both articles. In fact, the author of the first article was my very own distributed systems professor in school. The persisting issue in both articles is a non-rigorous specification of when a write has actually completed. Both articles point out that a fault-tolerant database should be ACID compliant, and that MongoDB does not live up to its own claims.
This is a good example of the HN MDB hate: everything referenced had been addressed long before 2013. It was a new DB then, and early adopters should know what they're getting into.
No. New doesn’t mean broken. Hyperdex was brand new at the time and still managed to be consistent with data.
MongoDB is the /dev/null of databases
How is Cassandra as an alternative to MongoDB?
Did you ever play a game of Civilization and attack an entry-level Warrior with a stack of machine gun infantry?
Well, the warrior has lower upkeep costs. Keep that in mind.
It's not.
The only thing similar about the two is that they both store data and have the letter D in their name. Otherwise they are completely different, Cassandra being a BigTable style database and MongoDB being a document one.
I mean, they are completely different. MongoDB is more-or-less a traditional RDBMS with automated failover and trying to staple on more advanced features. Cassandra is a masterless DynamoDB-ish database with features like hinted handoffs. You really need to know how consistency and distributed systems work if you're looking to pick Cassandra. It's a great implementation, you just can't compare it to MySQL/Postgres/etc like you can with Mongo
wat.
I hope this is a joke.
Dan Luu suggested on Twitter that MongoDB trolled Kyle into testing Jepsen again. I think they've made a mistake though. :-)
It seems that the only tangible benefit remaining for DocumentDBs over SQL platforms (PostgreSQL, SQL Server, etc.) is scalability. Jr. devs thinking they can have a career in software dev without learning SQL is not a benefit.
Typical HN posts of late, hating on Javascript and MongoDB from database elitists -- the thing is, there's a tool for every job, and as engineers we need to figure out what tool best suits our use cases. It could very well be a NoSQL database such as Mongo or a relational one like Postgres or MySQL.
> the thing is there's a tool for a job
Really? Which job do you believe needs a "maybe store some of this data, sometimes" kind of database?
I'm not defending mongodb in any sense, and I've had stern talks with some of my junior developers who were too eager to try out this new hot mongo thingy on a new website, but there are plenty of such jobs.
For example, climate data gathered from hundreds of thousands of devices every minute can very much survive some data being lost. Or some astronomical observation data.
I wouldn't choose mongoDB for it, though.
your example is a perfect use case for postgresql via the timescaledb extension.
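Roughly, assuming the timescaledb extension is installed on the server (the table and column names here are made up for illustration):

    -- Assumes the timescaledb extension is available.
    CREATE EXTENSION IF NOT EXISTS timescaledb;

    -- Hypothetical per-device climate readings.
    CREATE TABLE climate_readings (
        time        timestamptz NOT NULL,
        device_id   bigint      NOT NULL,
        temperature double precision,
        humidity    double precision
    );

    -- Turn the plain table into a hypertable partitioned by time.
    SELECT create_hypertable('climate_readings', 'time');

After that it behaves like an ordinary Postgres table for inserts and queries, with the time-partitioning handled for you.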
I actually agree. I love postgresql, and we've been using it for all our projects since our company was founded (well, except mobile apps, obviously), and it has never failed us.