Snowflake’s response to Databricks’ TPC-DS post (snowflake.com)
Can someone ELI5 what Snowflake and Databricks are? I spent a few minutes on the Databricks website once and couldn't really penetrate the marketing jargon.
There are also some technical terms I don't know at all, and when I've searched for them, the top results are all more Azure stuff. Like wtf is a datalake?
A data lake is a system designed for ingesting, and possibly transforming, lots of data: a "lake" where you dump your data. This is different from, e.g., a Postgres DB (a single source of truth for a CRUD app, for example), because it captures more data (e.g. events) and it's normally not consistent with the single source of truth (the data may arrive in batches, be imported from other databases, etc). Because the volume of data is normally huge, you need a cluster to store it, and some way of querying it.
Snowflake and Databricks are companies that operate in this space, providing ways to ingest, transform and analyze large volumes of data.
Snowflake is (amongst other things, but primarily to me) a SQL database as a service, designed for analytical queries over large datasets.
It separates compute and storage, so there's just a big ol' pile of data and tables, then it spins up large machines to crunch the data on demand.
Data storage is cheap and the machines are expensive per hour but running for shorter times, and with little to no ops work required it can be a cheap overall system.
Bunch of other features that are handy or vital depending on your use case (instant data sharing across accounts, for example).
I've used it to transform terabytes of JSON into nice relational tables for analysts to use with very little effort.
Hopefully that's a useful overview of what kind of thing it is and where it sits.
Snowflake is a hosted database that uses SQL. Two distinctions it has are that (1) it lets users pay for data storage and compute power separately and independently and (2) it takes decisions about data indexing out of your hands.
Databricks is a vendor of hosted Spark (and is operated by the creators of Spark). Spark is software for coordinating data processing jobs on multiple machines. The jobs are written using a SQL-like API that allows fairly arbitrary transformations. Databricks also offers storage using their custom virtual cloud filesystem that exposes stored datasets as DB tables.
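To make "SQL-like API for fairly arbitrary transformations" concrete, here's a minimal PySpark sketch (bucket paths and column names are made up for illustration, not any particular vendor's setup):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Spark coordinates this work across however many machines are in the cluster.
spark = SparkSession.builder.appName("example").getOrCreate()

# Hypothetical event data sitting in cloud storage as JSON.
events = spark.read.json("s3://my-bucket/events/")

# SQL-like transformations: filter, aggregate, write back out as columnar files.
daily_purchases = (
    events
    .filter(F.col("event_type") == "purchase")
    .groupBy("user_id", F.to_date("event_ts").alias("day"))
    .agg(F.sum("amount").alias("total_spent"))
)

daily_purchases.write.mode("overwrite").parquet("s3://my-bucket/daily_purchases/")
```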
Both vendors also offer interactive notebook functionality (although Databricks has spent more time on theirs). They're both getting into dashboarding (I think).
Ultimately, they're both selling cloud data services, and their product offerings are gradually converging.
A data lake is a company wide data repository. All the "data streams" from all of the different departments will flow into the data lake. Aim is to use this data to get both macro and micro insights.
They are a data warehouse with analytics? So data warehouse as a service in the cloud?
So they can collect data from different sources like SQL databases, images, etc. I think a better question would be what type of data can't they ingest?
Once you have your data, I guess you can run some analytics to find out what your data tells you.
A data lake can be home to many different data formats, e.g. Parquet, Avro, Thrift, protobuf, ORC, HDF5, CSV, JSON, all co-existing together. Spark lets you create a virtual abstraction over all of this, and query it as though it were a homogeneous database. There's no need to import data into a centralized format and schema.
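A toy sketch of what that looks like in practice (paths, schemas and column names are made up): three teams dump data in three different formats, and Spark queries them as if they were tables in one database.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Three different formats, dumped into the lake by different teams.
orders = spark.read.parquet("s3://lake/orders/")
clicks = spark.read.json("s3://lake/clickstream/")
customers = spark.read.option("header", "true").csv("s3://lake/crm_export/")

# Register them as temporary views and join across all three with plain SQL.
orders.createOrReplaceTempView("orders")
clicks.createOrReplaceTempView("clicks")
customers.createOrReplaceTempView("customers")

spark.sql("""
    SELECT c.customer_id,
           COUNT(DISTINCT o.order_id) AS orders,
           COUNT(k.event_id)          AS clicks
    FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.customer_id
    LEFT JOIN clicks k ON k.customer_id = c.customer_id
    GROUP BY c.customer_id
""").show()
```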
This really all ties back to the "old" Hadoop days, and is an evolution of compute over data not in a fixed and managed format/schema.
I'd like to add some points: I've used Snowflake for several years. Snowflake works with structured and semi-structured data (think spreadsheets and JSON). I've never tried working with pics or videos - and I'm not sure it would make sense to do that.
I've evaluated Databricks. It works with the above mentioned structured and semi-structured data. I also suspect it could process unstructured data. My understanding is that it runs Python (and some others), so you can do any "Python stuff, but in the cloud, and on 1000s of computers"
Databricks used to be an Apache Spark as a service company. And Spark is a predominantly Scala code base. PySpark is just a Python binding for the real engine popular in ML circles. In the last couple of years the Databricks platform migrated from open-source Spark to a new proprietary engine written in C++.
You're referring to PySpark, which still does all the heavy lifting in the JVM.
People who downvoted this, please take a minute and reflect that your world is not the whole world. There is a serious question in this comment and there are myriads of topics _you_ have no clue about.
sure, but if I see the term 'data lake' I'm gonna Bing it, with the first result being https://aws.amazon.com/big-data/datalakes-and-analytics/what... which explains it nicely.
ELI5 is for reddit, generally here we expect you can google it to get the ELI5 explanation before giving us your hot take in a comment
Yeah, that's exactly the kind of content I found unsuitable when I did a web search for the term. It spends a whole two sentences giving an explanation that tells me very little about how data lakes are anything more specific than a cloud-hosted database solution, and moves on to
> Organizations that successfully generate business value from their data, will outperform their peers.
at which point I'm like
> ok, I'm reading a covert advertisement about Fancy Cloud Technology aimed at some kind of big-spending manager, which is unlikely to tell me meaningfully what this actually is
and I'm out. I was looking for content that was in a more neutral, purely educational genre, and wondering what collection of non-cloud analogues it replaces/is composed of. Someone writing in the comments
> I used it to transform several terabytes of JSON into nice relational data for analysts without too much effort
is way, way more direct and helpful than mentioning that 'unlike data warehouses, data lakes support non-relational data'. Like great, it's a cloud thing that supports a variety of databases. But what is it?
> before giving us your hot take in a comment
I didn't give any take at all? I just really found all the sources that came up on the first page of search results to be almost in the wrong genre for me, and expected (correctly) that people on this site would be able to produce descriptions in 1-5 sentences that worked way better for me.
Pretty much all of the answers I got here were really good, and I'm glad I asked.
> What is a data lake?
> A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.
This may be self-explanatory for you, but what it means in practice is not as self-evident as you believe. For all it describes, it could be an FTP upload directory that loads things into an sqlite database. It's not until the scale is invoked (multi-terabyte/day) that the inadequacies of a naive solution become apparent. For those in that area of the industry, Snowflake is already known. (Seriously, if you're running into issues with limitations of RedShift, it behooves you to take a look at Snowflake.) For those that aren't, data warehousing is unfamiliar, never mind data lake. For those outside the ML sphere, the finer points of training runs are also non-obvious.
It’s probably just me but the distinction between data lake and data warehouse seems like splitting hairs. Unstructured data can always be stored in structured databases. What’s the main reason for both to coexist?
History matters here, and I don't know how well this is documented, but: data warehouses have been around since the 70s or so; data lake is a newer term. Data warehouses came from an era where nearly all data was stored in the database itself (typically Oracle), owned and controlled by one or a few groups, and there were only a few databases, which were the source of truth. The two databases would normally be a transaction engine handling real-time load (just what's required to authorize a credit card transaction, for example), and a "warehouse" which contained all the long-term data, like every transaction that had ever occurred.
Data lakes are more modern and came about as people realized they had 30 databases and the business wanted to do queries against all of them simultaneously (i.e., join your credit card transaction history with historical rates of default in a zip code), quickly. The data warehouse solution was to use federated database queries (JOINs across databases), or force everybody to consolidate. A data lake is a single virtual entity that represents "all your data in one place".
It's based on a weak analogy where a warehouse is a place where you put stuff in very well organized locations while a lake is a place where a bunch of different waters slosh together.
Storing unstructured data in a database is dumb because databases cost about 10X storage space due to indexing, while unstructured data often can just sit around passively in a filesystem (and/or have a filesystem index built into it for fast queries).
I view this through the lens of web tech, for example, see the wars between the mapreduce and database people and how Google evolved from MapReduce against GFS to Flumes against Spanner, showing we just live in an endless cycle of renaming old technology.
It's absolutely correct that the terminology doesn't map perfectly
This was really helpful, too. Thanks!
It used to be that way. Old data warehouses (built on relational dbs) couldn't handle large scale data, and old data lakes used to be hard to use (write a map-reduce job to query data).
It is barely true nowadays.
I worked at excite.com right after the IPO, and front and center in the HQ building was a MASSIVE glass wall showcasing the Oracle data warehouse machine room.
I didn't enjoy working with either the datastore directly or the DBA team that ran it. An early, more old-white-dude "I just want to serve 5T".
Snowflake conceding they have a 700% markup between Standard and Premium editions, which has zero impact on query performance, is ... well, it's something. I'd start squeezing my sales engineers about that, definitely not sustainable...
Also proof that lakehouse and spot compute price performance economics are here to stay, that's good for customers.
Otherwise, as a vendor blog post with nothing but self-reported performance, this is worthless.
Disclaimer: I work at Databricks but I admire Snowflake's product for what it is - iron sharpens iron.
How do you get 700% markup? The difference between Standard and Enterprise is 50%. Enterprise does have features which do make workloads run faster, but this benchmark didn't need them.
I've used Snowflake for the past few years, and it's worth pointing out that when it comes to overall cost, there's a lot you get with Snowflake for free. For example, they have HA across 3 AZs out of the box, included in the price and with no configuration required.
If I'm reading what Databricks published correctly, it seems that they've only used 1 driver node for this benchmark, in other words it's a dev setup. If they want to compare apples-to-apples then they should configure, and price, a multi-AZ HA set-up.
I'm not sure if this is still applicable to Photon, however - can anyone confirm?
The _data_ should be replicated, but the compute infrastructure doesn't need to be. Many companies I suspect would be fine having to restart pipelines on driver failure (increasing tail latency, basically) if it yields a substantial cost reduction.
Take all the problems you have had with data warehousing and throw them in a proprietary cloud. That is Snowflake. They are the best today.
Databricks started with the cloud datalake, sitting natively on parquet and using cloud native tools, fully open. Recently they added SQL to help democratize the data in the data lake versus moving it back and forth into a proprietary data warehouse.
The selling point in Databricks is why move the data around when you can just have it in one place IF performance is the same or better.
This is what led to the latest benchmark which in the writing appears to be unbiased.
In Snowflake's response, however, they condemn it but then submit their own findings. Sounds a lot like Trump telling everyone he had billions of people attend his inauguration, doesn't it?
Anyhow, I trust independent studies more than I do coming from vendors. It cannot be argued or debated unless it was unfairly done. I think we are all smart enough to be careful with studies of any kind, but I can see why Databricks was excited about the findings.
Whose result can be trusted is beside the point - I actually believe both experiments were likely conducted in good faith but with incomplete context. The point is there's no good reason to start a benchmark war to begin with.
> While performing the benchmarks, we noticed that the Snowflake pre-baked TPC-DS dataset had been recreated two days after our benchmark results were announced. An important part of the official benchmark is to verify the creation of the dataset. So, instead of using Snowflake’s pre-baked dataset, we uploaded an official TPC-DS dataset and used identical schema as Snowflake uses on its pre-baked dataset (including the same clustering column sets), on identical cluster size (4XL). We then ran and timed the POWER test three times. The first cold run took 10,085 secs, and the fastest of the three runs took 7,276 seconds. *Just to recap, we loaded the official TPC-DS dataset into Snowflake, timed how long it takes to run the power test, and it took 1.9x longer (best of 3) than what Snowflake reported in their blog.* https://databricks.com/blog/2021/11/15/snowflake-claims-simi...
Delta Lake is not meaningfully more "open" than whatever Snowflake (or BigQuery and Redshift) are doing. It does not require any less "moving data around".
With all of these, the data sits on cloud storage and compute is done by cloud machines - the difference between Databricks and the others is that with Databricks, you can take a look at that bucket. But you're not going to be able to do much with that data without paying for Databricks compute, since the open source Delta library is not usable in the real world.
Since commercial data warehouses are an enterprise product for enterprise companies (small companies can stick with normal databases or SaaS, and unicorns seem to roll their own with Presto/Trino, Iceberg, Spark and k8s nowadays), the vendor and the product need, most of all, to be a reliable partner. And Databricks' behavior does not inspire confidence that they are one.
If I'm outsourcing my analytical platform to a vendor, I want them to be almost boring. Not some growth hacking, guerrilla marketing, sketchy-benchmark-posting techbros.
At the end of the day, anyone making years lasting million dollar decisions in this space should run their own evaluation. Our evaluation showed that there's a noticeable gap between what Databricks promises and what they deliver. I have not worked with Snowflake to compare.
Delta Lake is very much open. You can install Delta Lake and run it yourself. It's a transaction layer running over parquet files. You can go to the delta.io GitHub and install binaries yourself. Snowflake cannot be run independently of their cloud.
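To make that concrete, here's a minimal sketch of running open-source Delta Lake locally with the delta-spark package (the local path is a placeholder; this assumes delta-spark is installed via pip, per the delta.io quickstart):

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Plain open-source Spark plus the delta-spark package -- no Databricks involved.
builder = (
    SparkSession.builder.appName("delta-local")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Writing a Delta table produces parquet files plus a _delta_log/ transaction log.
spark.range(0, 1000).write.format("delta").mode("overwrite").save("/tmp/demo_table")

# Reads go through the transaction log, which is what gives you ACID semantics.
print(spark.read.format("delta").load("/tmp/demo_table").count())
```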
The rest of this is vague claims about Databricks being unreliable techbros, blah blah, which is just emotionally charged hot air rather than being based on anything.
RE who to pick: run them side by side. Use Snowflake for non-technical staff/BI load on prepared cuts of data; it's batteries-included and has fewer knobs to twiddle for optimisation. Databricks/Spark has a learning curve and isn't suitable for non-technical staff, but it gives a lot more options for processing all the stuff that doesn't fit neatly into data clustering.
Sort of. You can stop using the Databricks service and keep using Delta Lake. But Databricks' code is not open: open-source Delta Lake is not equivalent to Databricks' Delta. The value prop is that customers who choose not to retain the Databricks service can migrate off Databricks and still use the open-source version of Delta Lake, which, again, is not as good as Databricks' Delta.
OK, you've got me there - it's not 100% the exact same code Databricks are using; there are some optimisations (that normally do end up downstream anyway). But I think it's getting a bit philosophical to say it's not open when you can run a Delta Lake "on-prem" and shuffle data between Databricks and your own setup with few or no changes. Now, Databricks' SQL product AFAIK is not open - that's a proprietary C++ engine comparable to Snowflake - so I think these discussions might get a lot more confusing in the future when Databricks doesn't just mean various flavours of Spark.
Yes, Photon is completely proprietary. Databricks does have a "Delta" version, but it is actually completely baked into the Databricks runtime. So we are both correct. Ali (Databricks CEO) has actually gone on record to say Databricks is 90% proprietary code. There is an open-source version, but it is not as good. The culture within Databricks, though, is completely open source - unlike Snowflake, where the culture is definitely not open source. I think it affects the culture, too.
You need to be able to code a little bit at a minimum to use Spark effectively. Even if a lot of the time you can just go with the SQL interface, it isn't actually a SQL database under the surface, so that can be a bit misleading if you don't know what's going on.
* Databricks is unethical
* Nobody should benchmark anymore, just focus on customers instead
* But hey, we just did some benchmarks and we look better than what Databricks claims
* Btw, please sign up and do some benchmarks on Snowflake, we actually ship TPC-DS dataset with Snowflake
* Btw, we agree with Databricks, let's remove the DeWitt clause, vendors should be able to benchmark each other!
* Consistency is more important than anything else!!!
I don’t think they are saying benchmarks are not important, but rather that a public benchmark war is a distraction.
If people have never heard of Databricks, now is the time, because a $100 billion company just started a war against them. Great marketing win, Databricks.
Databricks is at a $28B valuation with 2800 employees; Snowflake is at a $109B valuation with 2500 employees.
They are both billion-dollar companies; we're hardly talking David and Goliath here.
For Databricks that's an old number - the recent valuation is $38B.
To be fair, I've been evaluating Databricks for a month or so. Databricks is coming after Snowflake. Snowflake doesn't care. Snowflake has a pretty solid moat with:
EASY SQL, data sharing (they have a marketplace), simple scaling
You'll need to revisit this again. In the last two years Databricks has built a lead and a bigger moat. They're essentially nice chaps with a huge community backing them. And we all love their open source tools, which essentially power not only their big data platforms, but everyone else's too (AWS, GCP).
Databricks introduced an open source data sharing feature earlier this year. I don't know Databricks well enough to comment on the other two.
The interesting part is that Snowflake omits Databricks' performance scores in their graphs. Here is how they compare on TPC-DS benchmark, based on two companies' self-reports:
* Elapsed time: 3108s (Databricks) vs 3760s (Snowflake)
* Price/Performance: $242 (Databricks) vs $267 (Snowflake)
Needless to say, these numbers seriously need a verification by independent 3rd parties, but it seems that Databricks is still 18% faster and 10% cheaper than Snowflake?
The way I read this is: Databricks benchmarked against us, and they messed it up. Here is how YOU should evaluate Snowflake performance. And, by the way, it is pretty easy to do it.
Databricks broke the record (by 2x) and is 10x more cost effective, in an audited benchmark. Snowflake should participate in the official, audited benchmark. Customers win when businesses are open and transparent…
Databricks and snowflake should pay an independent third party to re-run these. In-house benchmarks by either company don't count with results this different.
Databricks didn't run the Snowflake comparison in-house. From their article it says: "These results were corroborated by research from Barcelona Supercomputing Center, which frequently runs TPC-DS on popular data warehouses. Their latest research benchmarked Databricks and Snowflake, and found that Databricks was 2.7x faster and 12x better in terms of price performance."
I don't trust a supercomputer center to do a good job running a TPC benchmark (I do trust them to run LINPACK benchmarks).
Audited how? If you look at the Snowflake response the numbers being posted by Databricks look outright faked or otherwise false.
There's an official TPC process to audit and review the benchmark process. This debate can be most easily settled by everybody participating in the official benchmark, like we (Databricks) did.
The official review process is significantly more complicated than just offering a static dataset that's been highly optimized for answering the exact set of queries. It includes data loading, data maintenance (insert and delete data), sequential query test, and concurrent query test.
You can see the description of the official process in this 141 page document: http://tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v3....
Consider the following analogy: Professional athletes compete in the Olympics, and there are official judges and a lot of stringent rules and checks to ensure fairness. That's the real arena. That's what we (Databricks) have done with the official TPC-DS world record. For example, in data warehouse systems, data loading, ordering and updates can affect performance substantially, so it’s most useful to compare both systems on the official benchmark.
But what’s really interesting to me is that even the Snowflake self-reported numbers ($267) are still more expensive than the Databricks’ numbers ($143 on spot, and $242 on demand). This is despite Databricks cost being calculated on our enterprise tier, while Snowflake used their cheapest tier without any enterprise features (e.g. disaster recovery).
Edit: added link to audit process doc
Thanks for the additional context here. As someone who works for a company that pays for both databricks and snowflake, I will say that these results don't surprise me.
Spark has always been infinitely configurable, in my experience. There are probably tens of thousands of possible configurations; everything from Java heap size to parquet block size.
Snowflake is the opposite: you can't even specify partitions! There is only clustering.
For a business, running snowflake is easy because engineers don't have to babysit it, and we like it because now we're free to work on more interesting problems. Everybody wins.
Unless those problems are DB optimization. Then snowflake can actually get in your way.
Totally. Simplicity is critical. That’s why we built Databricks SQL not based on Spark.
As a matter of fact, we took the extreme approach of not allowing customers (or ourselves) to set any of the known knobs. We want to force ourselves to build the system to run well out of the box and yet still beat data warehouses in price/perf. The official result involved no tuning: we partitioned by date, loaded the data in, provisioned a Databricks SQL endpoint, and that's it. No additional knobs or settings. (As a matter of fact, Snowflake's own sample TPC-DS dataset has more tuning than the ones we did. They clustered by multiple columns specifically to optimize for the exact set of queries.)
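For readers wondering what "partitioned by date" amounts to in Spark terms, it's roughly one line (a generic sketch with made-up paths and a hypothetical date column, not the actual TPC-DS load script; shown writing parquet for brevity, the partitionBy() call is the same when writing a Delta table):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Generic sketch: load raw files and write them back partitioned by a date column.
raw = spark.read.parquet("s3://raw-bucket/sales/")

(raw.write
    .partitionBy("sold_date")   # one directory per date value; no other tuning knobs
    .mode("overwrite")
    .parquet("s3://lake/sales_partitioned/"))
```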
>That’s why we built Databricks SQL not based on Spark.
Wait... really? The sales folks I've been talking to didn't mention this. I assumed that when I ran SQL inside my Python, it was decomposed into Spark SQL with weird join problems (and other nuances I'm not fully familiar with).
Not that THAT would have changed my mind. But it would have changed the calculus of "who uses this tool at my company" and "who do I get on board with this thing"
Edit: To add, I've been a customer of Snowflake for years. I've been evaluating Databricks for 2 months, and put the POC on hold.
it's different - rxin talks about this: https://databricks.com/product/databricks-sql
when you run Python, it's on Spark, although you can now use the Photon engine, which is used for DB SQL by default
Credit to you for these amazing benchmark scores via an official process. You've certainly proved to naysayers such as Stonebraker that lakes and warehouses can be combined in a performant manner!
Shame on you for quoting a fake, non-official score for Snowflake in your blog post, with crude suggestions to make it seem you're showing an apples-to-apples comparison.
I run a BI org in an F500 company that uses both Databricks & Snowflake on AWS. I can tell you that such dishonest shenanigans take away much from your truly noteworthy technical achievements and make me not want to buy your stuff for lack of integrity. Not very long ago, Azure+GigaOM did a similar blog post with fake numbers on AWS Redshift and it resulted in my department and a bunch of large F500 enterprises that I know moving away from Synapse for lack of integrity.
On many occasions, I've felt that Databricks product management and sales teams lack integrity (especially the folks from Uber & VMW) and such moves only amplify this impression. Your sales guys use arm-twisting tactics to meet quotas and your PM execs are clueless about your technology and industry. My suggestion is to overhaul some of these teams and cull the rot - it is taking away from the great work your engineers and Berkeley research teams are doing.
Snowflake claims the snowflake result from Databricks was not audited. It’s not that Databricks numbers were artificially good but rather Snowflake’s number was unreasonably bad.
Please also refer to my comment below on the value of the TPC audit process: https://news.ycombinator.com/item?id=29208172
Hey jiggawatts - TPC is the official way to audit benchmarks in the database industry. They’ve been around for a bit, but let me know if you want more info, I’m happy to share more about them.
It sounds fundamentally busted if a competitor can submit benchmarks for someone else. TPC is great in general, but I didn't realize it had such a gaping flaw.
TPC submissions take real time/$/energy/expertise, so I don't know anyone who has ever done it casually. Ex: It was a multi-company effort for the RAPIDS community to get enough API coverage & edge case optimization for an end-to-end GPU submission on the big data one (SQL, ...), and even there the TPC folks made them resubmit if I remember right.
Also, note how the parent's response did not actually answer 'audited how'. Pushing the work to the questioner is on the shortlist of techniques studied by misinformation researchers. I'm a fan of both companies, so disappointing to see from a company rep.
Check my reply, Leo.
The audit question is on Databricks marketing unaudited Snowflake TPC numbers. I do think Snowflake is big enough to run TPC, but how you guys choose to market is on you.
But: I think it's cool both companies got it to $200-300. Way better than years ago. Next stop: GPUs :)
Ah ok. Wasn't clear. I think some repro scripts will be available soon.
The results are so crazy different that either Snowflake or Databricks is wrong or outright lying.
This is my point also, and I'm being downvoted for it.
If two people are in disagreement about the same facts, then one of them is either misinformed or lying. It's that simple.
If the only recourse seems to be to sink to the level of mud-slinging, with no clear ability to point to the audit trail and say "this is where it all went wrong", then it calls into question the value of that auditing process.
I'm personally unimpressed with the TPC process in general. I remember one "benchmark" that showed the performance of a 2RU server breaking some record, and it was a minor footnote that it was using a disk array with 7,500 drives in it -- dedicated to that one server for the duration of the test. That's an absurd setup that will never exist at any customer, ever.
I ran that same software myself on literally the exact same server, and it couldn't even begin to approach the posted TPC numbers on typical storage. It was at least two orders of magnitude slower.
The rub was that its inefficient usage of storage was the main problem, and the vendor was pulling a smoke & mirrors trick to hide this deficiency of their product. The TPC numbers were an outright fraud in this case, at least in my mind.
So to me, TPC looks like a staged show where the auditors are more like the referees in a WWE wrestling competition.
The TPC audit process tends to be thorough and strict.
Possibly you missed a configuration that was included in the Full Disclosure Report or Supporting Files?
The Databricks official, audited benchmark was executed against Databricks SQL which is a PaaS service that doesn't allow special tuning btw.
That doesn't allow end users any configuration, but this doesn't apply to the company itself which can apply settings from the background on behalf of end users.
I didn’t miss it. That doesn’t make it any less misleading.
This is the sort of FUD testing that gets thrown back and forth between companies of all kinds.
If you're in networking, it's throughput, latency or fairness. If you're in graphics, it's your shaders or polygons or hashes. If you're in CPUs, it's your clock speed. If it's cameras, it's megapixels (but nobody talks about lenses or real measures of clarity). If you're in silicon, it's your die size (none of that has mattered for years; those numbers are like versions, not the largest block on your die). If you're in finance, it's about your returns or your drawdowns or your Sharpe ratios.
I'm a little bit surprised how seriously Databricks is taking this, but maybe it's because one of the cofounders made this claim. Ultimately what you find is that one company is not very good at setting up the other company's system, and the result is that the benchmarks are less than ideal.
So why not have a showdown? Both founders, streamed live, running their benchmarks on the data. NETFLIX SPECIAL!
Exactly. Not sure about Netflix special, but there are experts that have dedicated their professional careers to creating fair benchmarks. Snowflake should just participate in the official TPC benchmark.
Disclaimer: Databricks cofounder who authored the original blog post.
The benchmark itself is kinda useless, so I don't see why they should. If you look at TPC-H, for years you had Exasol as the top dog, but in the real world that meant nothing for them.
Exactly - companies learnt from Exasol. Out-of-the-box performance is the name of the game. Executing a benchmark as complex as TPC-DS without tuning, by Databricks or Snowflake, is a big accomplishment.
Come on, you're going to make a ton of money on the IPO now focus on the things that matter in life... ie: starring in a netflix special.
I still don't get how much optimization was done for the Snowflake TPC-DS power run. This is what I am seeing so far and what I am foggy on:
DB1. Databricks generated the TPC-DS datasets from the TPC-DS kit before the clock started. Databricks starts the clock, then generates all queries. Then Databricks loaded from CSV to Delta format (some Delta tables were partitioned by date) and also computed statistics. Then all of the queries, 1-99, are executed for TPC-DS 100TB.
SF1. Databricks generated the TPC-DS datasets from the TPC-DS kit before the clock started. Databricks starts the clock, then generates all queries. Then they load from S3 to Snowflake tables by - I'm not sure about these next parts - creating external stages and then "COPY INTO" statements, I guess? Or maybe just using COPY INTO from an S3 bucket; that part doesn't matter much. But it's not clear whether they also allowed the target tables to have partitioning/clustering keys at all. Then all of the queries, 1-99, are executed for TPC-DS 100TB.
It's just hard to say what "They were not allowed to apply any optimizations that would require deep understanding of the dataset or queries (as done in the Snowflake pre-baked dataset, with additional clustering columns)" means exactly. At a glance, though, this looks very impressive for Databricks, but I just want to be sure before I commit to an opinion.
Personally I think it’s a great response and very well written. I didn’t jump on the congrats-Databricks wagon when the result first came out because of the weird front page comparison against snowflake. Both companies are doing great work. Focusing on building a better product for your customer is much more meaningful than making your competitor look bad.
It is well written, but there's some sleight of hand here and there too. Like using your lowest tier product to demonstrate price/performance against a competitor's highest tier. The Snowflake lowest tier doesn't have failover, for example...or compliance features.
This is incorrect. Every edition of Snowflake is deployed across multiple availability zones with automatic failover in the case of failure or AZ outage. This is included in the price and requires no configuration by the customer. Cross-cloud/region failover requires the top edition and a few lines of SQL to configure (plus cloud egress costs for data replication).
The higher editions of Snowflake include features like materialised views, dynamic data masking, BYOK, PCI & HIPAA compliance etc., none of which are required for the benchmark.
I'm getting it from Snowflake's own page:
https://www.snowflake.com/pricing/
Amongst other things, listed under the enterprise tier, and not lower tiers, is "Database failover and failback for business continuity".
"The higher editions of Snowflake include features like materialised views, dynamic data masking, BYOK, PCI & HIPAA compliance etc., non of which are required for the benchmark."
Yeah, but they are referencing a price/performance comparison to a Databricks tier that DOES have those things. That's the point. Update your own numbers with a lower tier, but don't update the competitor tier too?
The "failover and failback for business continuity" is specifically for cross-region/cloud, i.e. this is something you explicitly have to do. tbh I've never used it, as I guess this would be only for very large accounts. But all editions have automatic failover between AZs out-of-the-box.
[Edit] Highly Available would be a better description per region, as that's out of the box with no configuration. e.g. if a node dies, your cluster will automatically heal and resubmit your query. If there's an entire AZ outage, your query should be resubmitted in another AZ. I think this is why failover/back is called out separately, as that's not automatic, incurs additional costs etc. Here's a link with an explanation: www.snowflake.com/blog/how-to-make-data-protection-and-high-availability-for-analytics-fast-and-easy
I didn't know DB did MVs, masking etc., so yes, that makes sense. Maybe a better idea would be to have a minimum offering comparison, and then a maximum offering comparison (with multi-AZ failover, masking feature costs etc. included) - the reality for a customer would be somewhere between those extremes.
Exactly. That’s why I think a public benchmark war is just a waste of time. There will ALWAYS be some subtle differences between the two platforms, so the results will never be apples to apples.
The audience for these posts are enterprise managers who don’t actually understand their compute needs.
For the more technically inclined, don’t let any corporate blog post / comms piece live in your head rent-free. If you’re a customer, make them show you value for their money. If you’re not, make them provide you tools / services for free. Just don’t help them fuel the pissing contest, you’ll end up a bag holder (swag holder?).
Linking to the discussion on the follow up from Databricks: https://news.ycombinator.com/item?id=29232346
I've been a customer/user of Snowflake. They make it simple to run SQL. There is a bunch of performance stuff that I don't need to worry about.
I'm interested in using Databricks, but I haven't done it yet. I've heard good things about their product.
"Posting benchmark results is bad because it quickly becomes a race to the wrong solution. But somebody showed us sucking on a benchmark, so here's our benchmark results showing we're better."
I disagree. It makes sense for Snowflake to respond to what they think is an unreasonably bad result published by Databricks. And they focused more on Snowflake’s result and only compared dollar cost against Databricks. It’s consistent with their philosophy that a public benchmark war is beside the point and mostly a distraction.
Their cofounder was behind Vectorwise, which kicked ass in benchmarks but died as no one had even heard of it. You can run the benchmark queries fast, that's great, but can you handle code migrated from Vertica? Will your optimiser come up with a good plan for queries built on 15 layers of views? That's what companies in the real world have, not some synthetic benchmark that you can make sure you can run for marketing purposes.
The thing is even that response doesn't show them to be better. As someone pointed out, they're comparing their cheapest offering with Databricks' most expensive one and saying they're 3% better in price-perf. What does someone read into that?
I'm not familiar enough with this realm to comment on the veracity of the claims, but it could very well be
"Posting benchmark results is bad because it quickly becomes a race to the wrong solution. Someone misrepresented our performance in a benchmark, here are the actual results."
The main question I have for DB is: how good is their query optimiser/compiler? It's fun that you can run some predefined set of queries fast. More important is how well you can run queries in the real world, with suboptimal data models, layers upon layers of badly written views, CTEs, UDFs... That is what matters in the end. Not some synthetic benchmark based on known queries you can optimise for specifically.
@AtlasLion you are right, real-world performance matters. We test extensively with actual workloads, and the speed-up holds there too. For example: lots of real-world BI queries are repeated over smallish data sets of 10 to 50 GB. We test that size factor and pattern all the time.
Performance is only one part of the story. The major advantage Snowflake (and to some extent Presto/Trino) brings to the table is it's pretty much plug and play. Spark OTOH usually requires a lot of tweaking to work reliably for your workloads.
I think the comparison was Snowflake vs Databricks SQL. Databricks SQL is a PaaS service just like Snowflake. Also, it uses their Photon engine, which is a proprietary engine written in C++. It is not Spark.
I'm aware that Databricks is a PaaS service, but what Databricks runs under the hood is Spark (with a few proprietary extensions). So your jobs/queries do require some tuning, just like with OSS Spark.
Spark has had SQL engines (SparkSQL/Hive on Spark) for a long time. Photon is just a new, faster one. Photon tasks also run on Spark executors only, so it's not independent of Spark[1]. Also, while it's proprietary now, I wouldn't be surprised if Databricks open-sources it in the future, like they did with Delta Lake.
1. https://databricks.com/blog/2021/06/17/announcing-photon-pub...
Very much true. I saw a joke tweet recently something along the lines of - It's amazing how many data engineering scaling issues these days are being solved by just paying Snowflake more money.
Spark does take a lot of tuning, but then I'm guessing Databricks offer that service as part of your licensing fee? (I'd hope so if they're selling a product based on FOSS code, there has to be a value add to justify it)
> I'd hope so if they're selling a product based on FOSS code, there has to be a value add to justify it
They have some proprietary features like DBIO [1]. They also have some cloud-specific features like storage autoscaling [2] that would not be available in OSS Spark. Even Delta Lake [3] used to be proprietary, but I suspect the rise of open-source frameworks like Iceberg led them to open-source it.
Shameless plug - when working at a since-shutdown competitor to Databricks, I'd come up with storage autoscaling long before them [4], so it's not unlikely that they were "inspired" by us :-) .
1. https://docs.databricks.com/spark/latest/spark-sql/dbio-comm...
2. https://databricks.com/blog/2017/12/01/transparent-autoscali...
4. https://www.qubole.com/blog/auto-scaling-in-qubole-with-aws-...
The open source Delta is not a replacement for the real thing - they did not include features like optimizing small files (small file problem is well known in big data, and much more of a problem once streaming gets involved) and others. It is more of a demo of the real thing. Which does not stop them from repeating everywhere how open they are, of course.
EDIT: the delta also still keeps partitioning information in the hive metastore, while iceberg keeps it in storage, making it a far superior design. Adopting iceberg is harder due to third party tools like AWS Redshift not supporting it - you have to go 100 % of the way.
>the delta also still keeps partitioning information in the hive metastore, while iceberg keeps it in storage, making it a far superior design.
Check out https://github.com/delta-io/delta/blob/3ffb30d86c6acda9b59b9... when you get a chance. You don't need hive metastore to query delta tables since all metadata for a Delta table is stored alongside the data
>they did not include features like optimizing small files
For optimizing small files, you could run https://docs.delta.io/latest/best-practices.html#compact-fil...
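For reference, the compaction recipe in those Delta docs boils down to rewriting the table into fewer, larger files. A sketch (path and target file count are placeholders; assumes the delta-spark package, set up as in the delta.io quickstart):

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Rewrite a table that has accumulated many small files into fewer, larger ones.
path = "s3://lake/events_delta/"
num_files = 16

(spark.read.format("delta").load(path)
    .repartition(num_files)
    .write
    .option("dataChange", "false")   # signals to readers that only the layout changed
    .format("delta")
    .mode("overwrite")
    .save(path))
```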
So much to read. TLDR; Databricks still holds the world record and they beat us on price/performance
> At the end of the script, the overall elapsed time and the geometric mean for all the queries is computed directly by querying the history view of all TPC-DS statements that have executed on the warehouse.
The geometric mean? Really? Feels a lot easier to think in terms of arithmetic mean, and perhaps percentiles.
Geometric mean is commonly used in benchmarks when the workloads consists of queries that have large (often orders of magnitude) differences in runtime.
Consider 4 queries. Two run for 1sec, and the other two 1000sec. If we look at arithmetic mean, then we are really only taking into account the large queries. But improving geometric mean would require improving all queries.
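A quick check of that in plain Python (runtimes from the four-query example above):

```python
import math

runtimes = [1, 1, 1000, 1000]  # seconds: two fast queries, two slow ones

arithmetic = sum(runtimes) / len(runtimes)
geometric = math.prod(runtimes) ** (1 / len(runtimes))
print(arithmetic)  # 500.5  -- dominated almost entirely by the two slow queries
print(geometric)   # ~31.6

# Halving the two fast queries barely moves the arithmetic mean (500.25),
# but the geometric mean drops to ~22.4: every query carries equal weight.
faster_small = [0.5, 0.5, 1000, 1000]
print(sum(faster_small) / len(faster_small))               # 500.25
print(math.prod(faster_small) ** (1 / len(faster_small)))  # ~22.36
```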
Note that I'm on the opposite side (Databricks cofounder here), so when I say that Snowflake didn't make a mistake here, you should trust me :)
> But improving geometric mean would require improving all queries.
No. Improving the geometric mean only requires reducing the product of their execution times. So if you can make the two 1 ms queries execute in 0.5 ms at the expense of the two 1000 ms queries taking 1800 ms each then that’s an improvement in terms of geometric mean.
So… kind of QED. The geometric mean is not easy to reason about.
Usually making a 1 ms query execute in 0.5 ms is a lot harder than making a 10 second query execute in 5 second.
One of the benefits of geometric mean is that all queries have "equal" weight in the metric, this keeps vendors from focusing on the long running queries and ignoring the short running ones. It is one way to balance between long and short query performance.
A similar concept is applied to TPC-DS for data load, single user run (Power), multi user run (Throughput) and data maintenance (Concurrent Delete and Inserts).
Check clause 7.6.3.1 in the TPC-DS spec in http://tpc.org/tpc_documents_current_versions/pdf/tpc-ds_v3....
> Usually making a 1 ms query execute in 0.5 ms is a lot harder than making a 10 second query execute in 5 second.
Eh, okay... It produces the same reduction in geometric mean though, right?
I genuinely think the DeWitt clause is good for users (bad for researchers). Without it, especially in the context of corporate competition, the company with the most marketing power will win. Users can always compare different products themselves. I am likely wrong, but please help me understand.
What do you know, here's an article[1] from 2017 about Databricks making an unfortunate mistake that showed Spark Streaming (which they sell) as a better streaming platform than Flink (which they don't sell).
I really hope this is not the case again.
(yes, I understand my sarcasm is unneeded, I couldn't help myself)
[1]: https://www.ververica.com/blog/curious-case-broken-benchmark...