Amazon RedShift vs. local PostgreSQL
rarestblog.comWhat this test is essentially doing is comparing Postgres against a single node of Redshift. It is not surprising that Postgres is faster. But Redshift is not meant to be used on a single node.
What Postgres & Redshift represent are are two different products for two very different problems. Postgres is good for small sets of transactional data like orders in a shopping cart system (less than 1TB). Redshift is good for big sets of data involving user behavior and clickstream analysis (greater than 1TB). I would not want to manage clickstream data on a single instance of Postgres nor would I want to manage an order system in Redshift.
A better test of Redshift would be to see how it compares to Asterdata...particularly with both in AWS. That should be telling.
We don't run a shopping cart, but one of our databases at present is at 11.3TB on PostgreSQL 9.1 and we're by no means dealing with small sets. We routinely juggle several Gigs at a time when we need to do analytics. We didn't see a reason to put this on a cloud since bandwidth + electricity is still cheaper for us than bandwidth + storage in the cloud at present.
If you have a few servers to spare, I'd recommend installing Cloudera Impala on them. You can use Apache Sqoop to pull the data out of Postgres and into HDFS.. Directly after, you can run SQL queries which will query the data in parallel (similar to redshift).
I don't think comparing RedShift to Postgres is accurate, RedShift was not designed for transactions, it was designed to store/query billions of rows using a columnar storage format...it's more like an analytic database (Greenplum, Teradata). Also, these databases are designed to scale out, and so you usually don't really see compelling performance gains until you start adding a few nodes to help influence parallelization.
With that being said, I'd be interested to see how RedShift compares to Impala.
I really don't understand what's going on here.
* You're measuring request latency. What part of that (for RedShift) is due to the network? (EDIT: I re-read and saw you're using `SELECT 1` as a gauge for round-trip latency and subtracting it from the results. Are you only doing this for RedShift, or also for local PostgreSQL? To me, it seems like that heuristic is over broad -- it encapsulates not only network latency, but syscall overhead, query parsing, etc).
* In your tests, PostgreSQL without indices performs on-par with RedShift. Does RedShift not support indexing? Is there some metric you're trying to show by not using indices? As designed, this benchmark does not map to any use-case I've ever seen.
* I noted in post that "SELECT 1" was done in 100ms round-trip, so I subtracted that from all points where it was reasonable. (No point to subtract it from 8 seconds).
* Not sure about the index support. Didn't try.
My idea was quite simple. I have some data at work (databases up to 30GB). Sometimes we hope to find something better. The main question was - will RedShift help, will it be radically faster? Will it be radically easier?
The answer for me - no, it won't help in my case, we need that 30GB data in real time, it looks like RedShift is more when you have 1TB+ data. Yes, it is radically easier.
Thanks for the reply. I figured for a dataset of that size, the main bottleneck might be not indexing -- maybe RedShift stores row data in a higher-latency medium while keeping indexes in-memory. Curious, I checked the documentation[1] and found this:
> Amazon Redshift doesn’t require indexes or materialized views and so uses less space than traditional relational database systems.
Reading through the rest of their FAQ, it sounds like they echo your conclusion -- RedShift shines the most for use-cases where the dataset is large enough that, to use PostgreSQL, you'd have to shard out multiple instances.
--
I don't see your python code defining a distribution or sort key for Redshift which is an important design consideration. (For my own use case of log analysis, I sort on datetime and use an "even" distribution). Also doesn't look like you ran "vacuum" or "analyze" after doing the loads to Redshift. So the query optimizer has no statistics to drive its decisions.
And as others have pointed out, your 30 GB data set is pretty tiny. You could look at some of the in-memory DB options out there if you need to speed things up.
very interesting. one of the reasons we picked mysql for a very high-volume app over postgres is that we have RDS and didn't want to do backups/snapshots/etc. Could we now use RedShift as a postgres-API RDS?
You wouldn't want to replace RDS (or MySQL/postgres) with Redshift for OLTP workloads. Redshift is built for analytics and complex query workloads. The major benefit is the ability to do MPP (Massive Parallel Processing) and distribute the work out to nodes. I believe you can have up to 100 nodes in the cluster with the 8XL machines and 32 with the single XL.
RedShift is tuned for analytics, so if you're using MySQL today for analytics, it could be a good fit. But if you're looking for more of an OLTP engine, RedShift is probably not the right choice for you.
You need to run this with a column store database like Infobright. Postgres is more of a transactional database, Infobright is suited towards the similar large dataset analytics that this is aimed towards.
Redshift isn't Postgres. It's built on top of Postgres, similar to how Infobright is built on top of MySQL.
Redshift is built on ParAccel, not Postgres. It speaks the Postgres API, but the storage engine is not Postgres.
Some of the code is based on Postgres, based on their own marketing materials: "Paraccel has leveraged Postgres for some of its parsing and planning functions". So the head node continues to have Postgres origins, but not the compute nodes. Also see this write-up on their paper discussing the links to Postgres: http://dbmsmusings.blogspot.com/2009/07/paraccel-and-their-p...
I run 3 large oracle RDS instances I wonder if redshift could be effectively used the same way, we have been thinking about migrating to postgresql.
How big is your data?
16tb per instance
Cool. :)
I wouldn't know a better way to deal with that with Postgres than sharding across instances. But my "big data" is about 0.75 TB so it fits nicely in one instance. I don't know how people with real problems do it.
Wouldn't it have been better to do an EXPLAIN ANALYZE for the timing measurements instead of having the results returned locally?
It would certainly be interesting to see what Redshift returns when you do that.
Honestly I don't know. Do you think it would be better? Could you explain why?
because it shows the estimated cost (plan) of the queries independent of the magically subtracted network roundtrip time. it also serves to show if the redshift even supports it.it.
Ahhh, you meant on the RedShift side. I did not think of it. It would have definitely been better. Good point!
It seems that it actually supports it http://docs.aws.amazon.com/redshift/latest/dg/r_EXPLAIN.html
This is a pretty interesting. I wonder how query performance differs between Redshift and local PostgreSQL for other types of benchmarks as well, say TPC-H queries. (And I guess how Redshift scales out as the dataset size increases in TPC-H.)
I'd like to see some more information about the local setup, including hardware and the postgresql.conf. Otherwise, this tells me very little in terms of comparison.
In anyway this test won't tell you much, just how different systems behave to bigger load.
The local setup was quite usual: PostgreSQL 9.2, Mint 13, default conf in VirtualBox in iMac i5 12GB. (read: home computer, no tuning)
For me the result is that mostly RedShift is on par with local PostgreSQL, sometimes even winning for <5M rows. So with better PostgreSQL tuning you can probably stretch it, but not for as much as RedShift can do for REALLY big data.
Also the big deal was that RedShift scaled linearly.
It scaled linearly, but also went unresponsive for five minutes. (Yeah yeah, it's a new service).
The default Postgres configuration is pretty weak. work_mem is set way to low, for instance, and that's bitten me a few times. I wouldn't say it's unrealistic--lots of people run with it that way in production and never find out how easily they could speed things up. Even me, for years.
But ultimately I'm more swayed by your interaction with it and I hate the endless benchmark tweaking that comes after every blog post about performance testing stuff. The point of this Redshift thing is hugeness first and foremost, so it's interesting.
At least on the small scale, I'd expect realistic "local" hosting to outperform any offering from Amazon.
For example, until your data scales above 20 GB, you'd be able to host it on a $5 SSD-based server from Digital Ocean. Most databases are bottlenecked by I/O. So switching to SSDs gives you the biggest performance boost.
On the high end, Amazon's offer probably will be better. (After all, the major draw to Amazon is "automatic scaling", so that you don't have to worry about Replication or other server administration duties at the high end). But considering how powerful a $5 SSD-virtual machine is today, I think a more realistic test would be with some sort of SSD-based cloud server.
Like others have mentioned, Postgres and Redshift are very different animals, Postgres is a row store and Redshift is a column store. On large data sets analytic queries that return a few columns will significantly outperform a row store DB.
We have found that Redshift is comparable to other columnar databases we work with, while we cannot publish any comparative benchmarks, we did put a blog post on what we found (link in another comment here)
Data warehousing often involves star schemas, which means lots of joins in your queries. I'd love to see how a real world OLAP tool performs on this.
While you could use Redshift as a source for OLAP, most OLAP tools will have their own data store. But if you are referring to ROLAP, then it can perform well if tuned properly for the star schema. This would include BI tools like Microstrategy and Mondrian with Jaspersoft.
The main issue with Redshift is the lack of multiple sort orders on a table. Take a look at our blog post on first impressions gleaned during the preview. Disclosure: we are one of a couple of systems integrator partners for Redshift.
http://www.full360.com/2013/02/14/aws-redshift-full360-first...