Ask HN: Good starting point for OLAP/OLTP from RDBMS background
I have experience doing traditional web apps using PostgreSQL, and a bit with SQL Server back in the 6.5 days.
Lately, I have been hearing about OLAP/OLTP/OLAP cubes, etc., at various meetings.
Doing a search does not seem to throw up a canonical reference on all these.
What would be a good way to go about learning these things given my background, which is more traditional RDBMS-based web apps/business apps?

There is no good canonical reference. OLAP for RDBMS is different from OLAP as originally defined and as provided in specialized tech stacks for planning apps, business modeling, and so on. Absent a good canonical reference for what OLAP actually is, it's hard to have a good one for the RDBMS flavor. The Data Warehouse Toolkit by Ralph Kimball is the classic for dimensional design in that vein. I've been working in OLAP since five years before the term was coined; happy to discuss further.

Thanks. My initial question arose from wondering where products like Snowflake/Redshift fall into this. They seem to be billed as data warehouses, but they appear to do the same work as OLAP. However, in the Snowflake docs I read:

"""Currently there is no option similar to an SSAS cube in Snowflake. Once data is loaded into the databases Snowflake allows to query the data in a way similar to traditional OLTP databases.
For data aggregations the Snowflake library offers rich sets of in-built functions. Together with UDFs, SPs and Materialized Views we can build custom solutions for precomputed data aggregations.
For data analysis we still have to rely upon third party tools. Snowflake provides a variety of different connectors to access its database objects from other analytical tools. There are plans in near future to introduce an integrated tool for data aggregations, analysis and reporting."""

So they say it's not really OLAP. And how exactly does it differ from OLTP? Is a web app OLTP?

The difference between OLAP and OLTP, interpreting a bit from the original "OLAP Manifesto", is that OLAP is about providing access to data using as much resource as necessary to produce analytical results in the query. OLTP was about transaction processing, and the emphasis was on reducing the number of elements that had to be locked and changed during transaction processing. Normalizing data reduces moving parts; long chains of aggregations and calculations consume lots of resources. OLAP was introduced to contrast the performance of specialized machinery with what RDBMSs were typically good at, circa 1993, which was an imperfect framing, since DSS databases had been around for a long time by then.

A typical dimensional star schema is good for giving you various sums, counts, etc. Unlike classic specialized multidimensional analysis and planning tools, you can derive an arbitrary set of subselections and additional calculations within a query, subject to the limits of the SQL version. For example, concomitance/basket analysis of which products sell with other products gives you a result with two product keys, product and 'along-with-product'; in an OLAP cube with just a single product dimension, you can't derive this at all. So, given a data set, the RDBMS gives you a greater ability to filter and aggregate. It just comes at the cost of generating the right SQL. (Rough SQL sketches of both cases are at the end of this comment.)

For the specialized tools like SSAS (sometimes called multidimensional, though they're all multidimensional), you trade away overall flexibility of querying when you create a cube, since you predefine all the major relationships, but you gain expressiveness in how you model your calculations, and you make consuming the data easier. An RDBMS provides no metadata that supports user traversal of the dimensional space, and there are lots of useful requests that cannot be modeled as a relational query without a huge amount of information being injected into the query. As a very small but tractable example, suppose you want result rows sequenced as (budget, forecast, actual). This is not alphabetical, so the query needs to define some intermediate table of labels and sort keys like (budget, 1), (forecast, 2), (actual, 3) (see the third sketch below).

Specialized systems make it much easier to create a precise result form, especially when the data would otherwise map to multiple fact tables, and they also make long chains of calculations much easier to specify. A simple set of metadata definitions and a small amount of query text can specify what would take many KB of SQL.

The code required to provide a user interface to the results is also probably higher for relational. A relational query has only one cursor. A query in MDX or a similar language has N cursors, one for each axis of the result, plus the matrix data set formed by crossing the axes. This makes it much easier to specify the result that will be rendered for the user to consume. Shredding a single SQL result cursor into a crosstab requires plenty of code, and issuing a batch of queries (one per crosstab edge, plus one for the matrix) requires less code but still a fair amount (the last sketch below shows the pure-SQL pivot alternative).

Hope this helps!
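To make a couple of the points above concrete, here are some rough SQL sketches against a hypothetical star schema (a sales_fact table with foreign keys into product_dim and date_dim, plus a finance_fact table further down; every table and column name is invented for illustration, PostgreSQL syntax). First, the kind of sums and counts a dimensional design handles well:

    -- a minimal star-schema rollup: sales by product category and month
    SELECT p.category,
           d.year_month,
           SUM(f.sales_amount) AS total_sales,
           COUNT(*)            AS line_items
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    JOIN date_dim    d ON d.date_key    = f.date_key
    GROUP BY p.category, d.year_month
    ORDER BY p.category, d.year_month;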
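Second, the basket-analysis case with two product keys, which a cube with a single product dimension can't express. This sketch assumes each basket is identified by an order_id column on the fact table:

    -- which products sell along with other products in the same order
    -- (two product keys in the result: product and along-with-product)
    SELECT a.product_key              AS product,
           b.product_key              AS along_with_product,
           COUNT(DISTINCT a.order_id) AS baskets
    FROM sales_fact a
    JOIN sales_fact b
      ON  b.order_id    = a.order_id
      AND b.product_key <> a.product_key   -- pair each product with its co-purchases
    GROUP BY a.product_key, b.product_key
    ORDER BY baskets DESC;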
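Third, the (budget, forecast, actual) ordering: the sort sequence has to be injected into the query as data, for example with a small values list of labels and sort keys:

    -- non-alphabetical ordering injected as data
    WITH scenario_order (scenario, sort_key) AS (
        VALUES ('budget', 1), ('forecast', 2), ('actual', 3)
    )
    SELECT f.scenario,
           SUM(f.amount) AS amount
    FROM finance_fact f
    JOIN scenario_order o ON o.scenario = f.scenario
    GROUP BY f.scenario, o.sort_key
    ORDER BY o.sort_key;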
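Finally, a taste of the crosstab problem on the SQL side: if you don't shred a long result in application code, you pivot it yourself with conditional aggregation, one expression per output column, whereas an MDX-style result hands you the axes directly:

    -- conditional aggregation as a poor man's crosstab: one expression per column
    SELECT d.year_month,
           SUM(f.amount) FILTER (WHERE f.scenario = 'budget')   AS budget,
           SUM(f.amount) FILTER (WHERE f.scenario = 'forecast') AS forecast,
           SUM(f.amount) FILTER (WHERE f.scenario = 'actual')   AS actual
    FROM finance_fact f
    JOIN date_dim d ON d.date_key = f.date_key
    GROUP BY d.year_month
    ORDER BY d.year_month;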