ELT your data with PlanetScale Connect
planetscale.com

PlanetScale is really being pushed on this site lately, but all I see is theory and promises of doing things differently for vague reasons. I hope I'm wrong or just not understanding something, but it feels like a CEO trying to sell a product without understanding the technical reasons why a project like this hasn't existed before. I can't imagine that PlanetScale is going to solve something that Google, Amazon, Facebook, Apple, and Netflix all failed to solve for a similar use case, and for well-documented reasons.
I thought the common denominator for this was (and has been, for decades) ETL? Extract, Transform, Load and in that order, because what other order would make sense?
Getting such basic things wrong doesn’t exactly give the reader the impression that the writer knows the subject particularly well.
It's not a typo; it describes a product difference. ELT is a common term for a newer type of data engineering workflow that doesn't map one-to-one onto ETL. In most ELT products the raw data is loaded into, and then transformed by, the data warehouse tool, as opposed to the ETL pipelines many of us are used to, where the data gets transformed in a separate process before being dropped into the DW.
Data warehouses are now processing engines that scale compute independently of storage. You can simply dump your raw data into the warehouse and do your transform there. Hence ELT.
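Roughly, in code (a minimal sketch; sqlite3 stands in for the warehouse so it runs anywhere, and the table names and CSV source are made up for illustration):

```python
# Minimal ELT sketch; sqlite3 is a stand-in for the warehouse.
# Table names and orders.csv are hypothetical.

import csv
import sqlite3

def extract(path):
    # E: read the raw source rows as-is; no cleanup happens here
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def load(conn, rows):
    # L: dump the raw rows straight into a staging table, untransformed
    conn.execute("CREATE TABLE raw_orders (id, amount, created_at)")
    conn.executemany(
        "INSERT INTO raw_orders VALUES (?, ?, ?)",
        [(r["id"], r["amount"], r["created_at"]) for r in rows],
    )

def transform(conn):
    # T: runs last, as SQL executed by the "warehouse" itself
    conn.execute("""
        CREATE TABLE orders_clean AS
        SELECT id,
               CAST(amount AS REAL) AS amount,
               DATE(created_at)     AS order_date
        FROM raw_orders
    """)

conn = sqlite3.connect(":memory:")
load(conn, extract("orders.csv"))  # orders.csv is a hypothetical source file
transform(conn)
```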
Using a data warehouse as a processing engine is one of the dumbest things to come out of the data world.
ELT is genius marketing though!
why is it dumb?
It allows providers to double- and triple-bill a user where, under ETL, they would only be charged once.
Not sure I follow this. Why wouldn't you pay more if you are doing more inside the warehouse? Also, setting up systems elsewhere to run pre-warehouse transforms isn't free.
They aren’t the only ones using ELT instead of ETL. I hear they are different but I have no motivation to even Wikipedia the topic to find out.
ETL: dumb data warehouse. You do your own transform before inserting the data.
ELT: smart data warehouse. You dump the raw data and transform in the data warehouse later.
Or, DWH is just storage vs DWH is storage + transform.
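In toy code, the whole difference is where the T sits (every name below is a made-up stub):

```python
# Toy contrast of the two orderings; all functions here are stubs.

def extract(src):          return list(src)                       # pull raw rows
def transform(rows):       return [r.strip().lower() for r in rows]  # clean them
def load(warehouse, rows): warehouse.extend(rows)                 # write them in

source = ["  Alice ", " BOB "]

# ETL: data is transformed before the warehouse ever sees it
etl_warehouse = []
load(etl_warehouse, transform(extract(source)))

# ELT: the warehouse receives raw data; the transform runs inside it later
elt_warehouse = []
load(elt_warehouse, extract(source))
elt_warehouse[:] = transform(elt_warehouse)  # the "in-warehouse" transform

assert etl_warehouse == elt_warehouse == ["alice", "bob"]
```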
When I built an ETL pipeline before the term ELT existed, we had multiple pipelines that transformed data in the data warehouse as well as transforms of the raw data on the way in. The transforms on the way in were minor, like renames or formats, while the transforms within the DWH were much more involved.
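For the "way in" step, something like this (field names and the date format are made up for illustration):

```python
# Sketch of minor on-the-way-in transforms: renames and format fixes only,
# with the heavier reshaping left to SQL inside the warehouse.
# Field names and the source date format are hypothetical.

from datetime import datetime

RENAMES = {"cust_id": "customer_id", "amt": "amount"}

def light_transform(row):
    out = {RENAMES.get(k, k): v for k, v in row.items()}
    # normalize dates to ISO 8601; everything else passes through untouched
    out["created_at"] = (
        datetime.strptime(out["created_at"], "%m/%d/%Y").date().isoformat()
    )
    return out

print(light_transform({"cust_id": "42", "amt": "9.99", "created_at": "03/01/2022"}))
# {'customer_id': '42', 'amount': '9.99', 'created_at': '2022-03-01'}
```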
Every pipe from DB to DW is going to be some form of ETL; it's just that with "raw" data the transform is minimal. Unless you're going apples-to-apples on data storage, there are going to be some transforms in the actual underlying data types.
I agree with a comment down below, it's a genius marketing term.
> we had multiple pipelines that transformed data in the data warehouse
I've done this in the past too. But that by itself isn't what ELT is. In the past, compute was tied to storage, which made these in-warehouse transforms interfere with load on the actual system.
ETL is the standard terminology that I learned, going back decades.
See dbt (data build tool) for an example of an ELT workflow.
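dbt's models themselves are SQL files; a rough hand-rolled analogue of the workflow it automates (this is not dbt's actual API, and the model names are made up) looks like:

```python
# Rough analogue of what an ELT tool like dbt automates: run an ordered list
# of SQL models inside the warehouse, each building on tables produced by
# earlier steps. NOT dbt's actual API; model names are hypothetical.

MODELS = [
    ("stg_orders",
     "SELECT id, amount, created_at FROM raw_orders WHERE id IS NOT NULL"),
    ("daily_revenue",
     "SELECT DATE(created_at) AS day, SUM(amount) AS revenue "
     "FROM stg_orders GROUP BY DATE(created_at)"),
]

def run_models(conn):
    # conn is any DB-API connection to the warehouse (hypothetical here)
    cur = conn.cursor()
    for name, select_sql in MODELS:  # dependency order: staging model first
        cur.execute(f"DROP TABLE IF EXISTS {name}")
        cur.execute(f"CREATE TABLE {name} AS {select_sql}")
    conn.commit()
```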
Emergency locator transmitter your data?
How many ETL/ELT projects do we really need?
I feel like every day of the month there's a new flavour coming out.
This isn't an 'ELT project'. They are supporting ELT out of their PlanetScale database for their users.
Step 1: write an open-source connector for an open-source ELT tool
Step 2: paywall customers’ access to their own data so you can brand it as a premium feature/product
Step 3: ???
Step 4: profit
Worse, their pricing is based on innodb_rows_read and even includes cached rows, so every single ELT job costs $$$ based on how many rows you have, even if it's all in the buffer pool cache already.
Double-worse, innodb_rows_read is known to be buggy! Example: https://jfg-mysql.blogspot.com/2022/03/rows-examined-not-tru... This bug is in the customer's favor, but what if other bugs are not?!
This is at least the second time you've made this exact same comment on an article. So in an attempt to respond to what I will assume are arguments made in good faith...
A database stores and serves requested data. Pricing is based on your usage of this service — how much data are you requesting it to serve in this case. It does not matter if a piece of data is cached or not, it's a database service not a block device service — and certainly not one with an independent cache in front of it that is somehow magically free to operate. Are you insinuating that e.g. Redis should always be free because its keys/records are in memory? I'm sorry, but I fail to see the logic in your argument here.
Of course ELT jobs consume resources. You're using the service, and you're paying based on your usage. This is simply how serverless / usage based models work. Again, I fail to follow the logic here.
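To make that concrete, back-of-the-envelope (the per-row rate below is invented purely for illustration, not an actual price):

```python
# Back-of-the-envelope cost of a recurring ELT job under rows-read billing.
# The $ per million rows rate is a made-up illustration, not a real price.

PRICE_PER_MILLION_ROWS = 1.50  # hypothetical rate, $ per 1M rows read

def monthly_cost(rows_read_per_job, jobs_per_day, days=30):
    total_rows = rows_read_per_job * jobs_per_day * days
    return total_rows / 1_000_000 * PRICE_PER_MILLION_ROWS

# e.g. a nightly ELT job that reads 10M rows per run:
print(monthly_cost(rows_read_per_job=10_000_000, jobs_per_day=1))  # 450.0
```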
innodb_rows_read is a count of the rows that MySQL (the query execution layer) reads from InnoDB (via the storage engine API). It does not count reads from internal / system / data dictionary tables, as those are covered by innodb_system_rows_read.

It's not "a bug" that it is what it is. InnoDB internally reads (B+ tree) pages and index records (as it uses index-organized tables), and that's what happens with index condition pushdown: InnoDB can apply query predicates directly as it examines the index leaf nodes, and it's only when pushing results back up to MySQL through the storage engine interface that InnoDB records get converted to MySQL's generic row format (which includes going from InnoDB's arch-independent big-endian format to the system's little-endian format). The metric is telling you exactly what it should (not a bug): how many rows MySQL reads from InnoDB. There are other handler-level and InnoDB-specific internal metrics you can use if you want other numbers (e.g. information_schema.innodb_metrics).

JFG's blog post was (correctly) noting that if you, as a system operator, are trying to calculate the full cost of a query, you cannot rely solely on how many rows MySQL reads from InnoDB; you instead have to look at how much work InnoDB does (which is much more challenging, since B-tree traversal and maintenance, MVCC overhead, prefetching, etc. are involved) and ultimately how many system resources are used in total (I/O amplification being one factor with InnoDB's index-organized tables and update-in-place model). So: parsing and optimizing costs, how many bytes are read from disk, pages from memory, whether a temp table was used, whether a sort file was used, and so on.

Again, this innodb_rows_read metric's intended behavior/meaning is NOT a bug, and the metric will always favor the user for usage-based billing. As for your insinuation that it is somehow buggy and cannot be trusted, so beware... here I would say that you are simply mistaken. This was an intentional decision made to offer simplicity and transparency and to benefit users (we also go through great efforts to subtract rows read due to internal operations, just as MySQL itself does).
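If you want to sanity-check the metric yourself, you can diff the global Innodb_rows_read status counter around a query (sketch using mysql-connector-python; connection details and the example query are placeholders, and the counter is server-wide, so do this on an otherwise idle instance):

```python
# Diff the global Innodb_rows_read status counter around a single query.
# Connection parameters and the example query are placeholders;
# Innodb_rows_read is server-wide, so concurrent sessions skew the delta.

import mysql.connector

def innodb_rows_read(cur):
    cur.execute("SHOW GLOBAL STATUS LIKE 'Innodb_rows_read'")
    return int(cur.fetchone()[1])

conn = mysql.connector.connect(
    host="localhost", user="app", password="...", database="app"
)
cur = conn.cursor()

before = innodb_rows_read(cur)
cur.execute("SELECT COUNT(*) FROM orders WHERE created_at >= '2022-01-01'")
cur.fetchall()
after = innodb_rows_read(cur)

print(f"rows read from InnoDB by this query: {after - before}")
```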
Hopefully this helps to allay your concerns.