Postgresql 9.6: Support parallel aggregation

postgresql.org

267 points by eMerzh 10 years ago · 63 comments

pilif 10 years ago

I am totally impressed by the work 2ndQuadrant is doing: many of the recent innovations in Postgres have come from them, all without any obligation to do so. The BSD license would allow them to add all of these things to a proprietary fork that they could be selling.

Or they could release their own fork under an open license and focus solely on adding features.

But that's not how they work. All of their contributions are pushed upstream, which is a considerable effort given how conservative Postgres is about accepting new functionality.

Aside from that: there are 2ndQuadrant employees in the #postgres IRC chat room, helping people with daily support issues. This is their core business and yet they still help people for free (within reason). This is bloody impressive.

If I'm ever at a point where I need help with a Postgres issue, they will be at the very top of the list of companies I would consider.

Thank you very much for all that you are doing.

  • pgaddict 10 years ago

    While I do work for 2ndQuadrant and think that David did a very fine job on this great patch, it'd be a bit unfair not to mention this patch stands on parallel infrastructure built by EDB (and Robert Haas & Amit Kapila in particular) and others.

    Kudos to them.

  • sandGorgon 10 years ago

    I wish they would build an RDS alternative. I would use them in a heartbeat. There are all these companies/startups who are not in the target market for consulting, but who would gladly pay for hosted Postgres.

    • pilif 10 years ago

      Just a little heads-up: Postgres is very easy to run on your own, and even in the default configuration it runs well for a considerable number of users.

      Over here, we only started thinking seriously about what we were doing once we were handling on the order of 10K transactions per second.

      Once you are at that level, you're probably going to need optimizations specific to your application, and a generic database host might not be able to help you anyway.

      I get that as a startup you don't have people for everything, but can you really afford to outsource the knowledge about the central piece of your application, where all the value is stored?

      • sandGorgon 10 years ago

        I have a fairly sophisticated Docker VM with WAL-E, replication and everything running. Yes, I am aware of what you say.

        Hosting the DB has never been the core knowledge about the central piece of the application. You may disagree (and I respect that), but for me it has been similar to building a RAID-10 dedicated server vs. using AWS.

        You can argue that maintaining data resiliently is a critical part of the organization, but at what stage? In the first 3 years of a startup, you are iterating on the product. You are pretty much agonizing over drop rates and conversion rates for every single minute of your life. The risk that your data store will crash NEEDS to come a distant second.

        Dropbox has only just moved out of AWS. Storage was probably the most critical part of Dropbox, but it chose (rightly) to focus on customers first.

        What you are talking about will come, but it will come after some time. And until then I would love to pay some Postgres devs to run a hosted DB for me... like RDS.

    • htn 10 years ago

      Please check out Aiven (https://aiven.io): managed, hosted, pay-per-hour PostgreSQL, available on GCE, AWS and DigitalOcean.

      • sandGorgon 10 years ago

        Very cool! I would definitely use you guys once my AWS credits run out. Quick suggestion: please make your pricing front and center. If you hadn't told me on HN, I would have dropped off on your home page. Check the Cloudways pricing page for an example from a very similar business model.

    • jpgvm 10 years ago

      If you don't mind hosting it yourself, check out Flynn: our pg appliance is constructed very similarly to RDS. Or if you are more adventurous, look at Joyent's Manatee, which the Flynn code is based on.

      • sandGorgon 10 years ago

        But that is precisely what I want to avoid. For most startups at an early stage, what we are short of is people. It's the GitHub vs. self-hosted argument: most teams with limited people resources would default to GitHub.

        RDS is the most expensive thing we pay for. It's worth it.

pgaddict 10 years ago

FWIW, a simple benchmark by David Rowley (one of the authors of the patch) is here:

http://blog.2ndquadrant.com/parallel-aggregate/

  • andruby 10 years ago

    The article claims a near linear speedup for a very large query. On a 64 core machine, the query takes:

    1375s with 0 workers

    131s with 10 workers

    56s with 30 workers
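    For anyone wanting to try this kind of measurement themselves, here is a minimal sketch against a 9.6 build. The table, column names and row count are invented for illustration; `max_parallel_workers_per_gather` is the 9.6 setting that caps the number of workers a Gather node may use (0 disables parallelism):

    ```sql
    -- Hypothetical large table; any table big enough to trigger
    -- a parallel plan will do.
    CREATE TABLE measurements AS
      SELECT (random() * 1000)::int AS sensor_id,
             random() * 100         AS reading
      FROM generate_series(1, 10000000);

    ANALYZE measurements;

    -- Allow up to 8 workers per Gather node.
    SET max_parallel_workers_per_gather = 8;

    -- The plan should show Partial Aggregate nodes under the workers,
    -- combined by a Finalize Aggregate above the Gather node.
    EXPLAIN ANALYZE
    SELECT sensor_id, avg(reading), count(*)
    FROM measurements
    GROUP BY sensor_id;
    ```

    Comparing the timing with `max_parallel_workers_per_gather = 0` against higher worker counts gives numbers directly comparable to the ones above.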

harel 10 years ago

It looks like in 9.4 PG went after the document databases and now they are after Oracle. I don't know why I get excited with point releases of postgres, but I do.

  • stuartaxelowen 10 years ago

    Because they are building an open source, industry-leading database. What's not exciting about that?

    • harel 10 years ago

      It's not just that. I waited patiently for 9.4 to come out, as I needed exactly that functionality of indexable JSONB for something I'm working on. It was pointless for me to start the project without it. 9.4 came out better than I could have expected. That was followed by upserts, and now this... I'm directly and personally affected by each point release. It's fantastic.
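      The 9.4 feature being referred to is JSONB with GIN indexing. A minimal sketch of what that enables (table and key names invented for illustration):

      ```sql
      CREATE TABLE events (
          id      serial PRIMARY KEY,
          payload jsonb NOT NULL
      );

      -- A GIN index makes containment (@>) and key-existence (?)
      -- queries indexable instead of requiring a sequential scan.
      CREATE INDEX events_payload_idx ON events USING gin (payload);

      INSERT INTO events (payload)
      VALUES ('{"type": "signup", "plan": "pro"}');

      -- Can be answered via the GIN index:
      SELECT * FROM events WHERE payload @> '{"type": "signup"}';
      ```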

    • pbreit 10 years ago

      Is anyone still starting with MySQL anymore?

      • harel 10 years ago

        "MySQL" is dead to me. Literally. Anything Oracle touches becomes a potential legal liability. It's MariaDB now, and although I do maintain a legacy 100GB+ MariaDB database, I will never use it again for new projects. The thing about Postgres is that once you start using it, you understand there is really very little competition in either the open source or commercial worlds. My previous big project relied on MongoDB for statistical data. At the time there was no other compelling document database for our use case. Today, that decision is no longer relevant, because if I can get the same functionality and performance out of Postgres, I will choose Postgres. The ability to combine SQL with "NoSQL" is just too compelling.

  • pgaddict 10 years ago

    Because in the PostgreSQL world those are actually major versions, not "point updates" ;-)

chris_wot 10 years ago

There is going to come a point where Postgres passes Oracle in terms of features and performance.

There is also a point that Postgres will perform with enough features needed by most businesses that they'll choose it even though it doesn't match Oracle on a feature by feature basis.

We have hit peak Oracle. From this point forward it's going to be hard for Oracle to regain momentum. Expect a lot of FUD against Postgres: the more you see, the more worried you know Oracle execs are becoming.

  • willlll 10 years ago

    Oracle is writing to Russian IT shops to try to talk them out of moving to Postgres: http://www.postgresql.org/message-id/CANNMO++6tPiwBv2OKcy-Hh...

  • reactor 10 years ago

    First-hand experience: I work for a Fortune 100 bank, and our department was relying heavily on Oracle Exadata, paying between $1-2 million (I don't know the exact figure). We moved to PostgreSQL and Cassandra. Cost was not the only motive ($1-2 million was actually pennies for this bank), but our Director was convinced that Exadata was unnecessary.

    • chris_wot 10 years ago

      Out of interest, what are the benefits of Exadata, or is it really a bit of emperor's new clothes? Can you not get similar performance with similarly good hardware?

      • jerven 10 years ago

        Exadata hardware used to be competitively priced (at least in the X4 era); it's the licensing for the software on top that makes it really expensive.

        The other thing is that some Oracle features only work on Exadata or other engineered systems: hybrid columnar compression, for one, and query offload, as well as some forms of the in-memory stuff. The problem is that these are very high-end features, and most of us live happily without them. Postgres is going after the bread-and-butter of the Oracle DB market, while Exadata is the exclusive cheese shop: a much smaller market, even if the margins are better.

      • reactor 10 years ago

        We were on an X5-2 and it was a beast, but in reality we never needed those specs. I think the Oracle sales guys got to the MDs who were calling the shots.

  • kbenson 10 years ago

    > There is going to come a point where Postgres passes Oracle in terms of features and performance.

    I wouldn't be so sure. Oracle has deep pockets, and I'm not sure there's a reason why anything PostgreSQL wants to roll out they can't pay to prioritize and have done sooner.

    > There is also a point that Postgres will perform with enough features needed by most businesses that they'll choose it even though it doesn't match Oracle on a feature by feature basis.

    No argument there.

    • chris_wot 10 years ago

      Oh, I'm not saying this is an overnight thing or that Oracle will be destroyed. In fact, I don't want to see them destroyed, but what I am interested in is that their market power and influence wanes over time. Ironically, this will be good for Oracle - they will be forced to dump immoral and illegal business practices.

      The problem commercial businesses have when it comes to open source is that you might be able to reduce customer take-up, but you can't compete with it like you would with other businesses. In the closed source world you can purchase a company and shut down its product, and thus kill off the competition.

      As Microsoft have found, with open source software, that's not possible.

      The other unfortunate thing for Oracle is that when they attack Postgres they have to publish lists of competitive advantages. All this does is give Postgres developers a todo list, and they then work towards implementing the features that matter.

      • kbenson 10 years ago

        Yes, in a way open source acts as a vibrant market of competitors forcing proprietary products to innovate where they may have been willing to stagnate, and in a way immune to some of the more anti-capitalistic ways of preventing innovation (e.g. buying out the main competitors). As unions worked to combat some of the major criticisms of capitalism by Marx (as I understand it), open source works to combat some of the major criticisms of companies in the software market (patents are put in an interesting light in this theory).

        • chris_wot 10 years ago

          It's funny, I really feel that until open source became widely known as a viable thing, patent reform was dead and non-existent. Not only that, but things like the Sonny Bono Copyright Term Extension Act passed just before the start of the true mainstreaming of open source culture, and wouldn't fly now.

  • pbarnes_1 10 years ago

    This migration has been happening for many years anyway.

    I used to work consulting doing these types of migrations for customers for a huge IT services company.

filereaper 10 years ago

Short of doing a deep dive into Postgresql, do any universities use Postgresql as a basis for their DB implementation courses? I wanted to dive into Postgres and was hoping for some training wheels. Thanks.

  • __jal 10 years ago

    PG is, IMHO, probably one of the best code bases for this.

    I'm far and away not a database hacker, but have read parts of the code at various times to better understand what I was seeing. The codebase is incredibly well written and organized, and the documentation (both developer and user) is top-notch.

    The only problem I can think of with using it in a course would be choosing what to focus on with only a semester. Add in the long revision history and I think there are also multiple theses in there on the sociology of open source.

  • spariev 10 years ago

    There is a "Hacking PostgreSQL" course[1], which tries to explain PostgreSQL internals and architecture, but it is available in Russian only[2]

    [1] http://postgrespro.ru/education/courses/hacking [2] https://www.youtube.com/watch?list=PLaFqU3KCWw6Jfb8IBNk3hZ07...

  • calinet6 10 years ago

    CS186 at Berkeley was entirely in the guts of PostgreSQL. But then, I suppose that's not too surprising given the history.

    Personally, I found it a delight to work with, and that has translated into a great respect for the product itself. It is no great surprise that the pace of development stays strong and that engineers gravitate toward it; it's set up for exactly that.

  • morgante 10 years ago

    My university does, though unfortunately we don't have any public course information.

  • ddorian43 10 years ago

    Dive into the documentation + mailing list + Planet PostgreSQL.

kazagistar 10 years ago

Blog post with some details and benchmarks by one of the authors of the patch:

http://rhaas.blogspot.com/2015/11/parallel-sequential-scan-i...

  • pgaddict 10 years ago

    That's not about parallel aggregate but about parallel scan.

    Also, Robert is not the author of the patch; he reviewed and committed it. The actual authors are listed in the commit message: David Rowley and Haribabu Kommi.

kbenson 10 years ago

I would love to see some sample benchmarks for the type of gains you might see from this. I always see these interesting new PostgreSQL features being posted, and they sound cool, but it's hard to know how much they help in practice. I understand it's often highly workload and data dependent, but something would be better than nothing.

  • pgaddict 10 years ago

    David Rowley (one of the authors of the patch) posted some measurements on TPC-H Q1: https://news.ycombinator.com/item?id=11332713

    The machine has 4x E5-4620, so 32 physical cores. And with 30 workers it gets ~80% of the theoretical speedup. Not bad, I guess.

  • olavgg 10 years ago

    It would help a lot for typical data warehouse queries that involve aggregating millions to billions of rows. EXPLAIN ANALYZE will tell you if it's relevant for you ;-)
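    A minimal sketch of what that check looks like (table and column names invented for illustration); on 9.6 with parallelism enabled, aggregate-heavy plans will show Gather and Partial Aggregate nodes:

    ```sql
    -- ANALYZE actually runs the query and reports real timings;
    -- BUFFERS additionally shows I/O activity per plan node.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT customer_id, sum(amount)
    FROM orders
    GROUP BY customer_id;
    ```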

    • kbenson 10 years ago

      > EXPLAIN ANALYZE will tell you if it's relevant for you

      That assumes I run PostgreSQL already, which I don't. I am interested in possibly switching at some point if it's worthwhile, but it's hard to muster the effort to do concerted testing of a representative sample of my data, including possibly changing how queries are done to take advantage of specific features, when I have little information to go on.

      Not that I expect PostgreSQL to do in-depth analysis of everything, but it would be great from both a promotional and technical standpoint if there was something like "we've seen something like X% speedup of queries utilizing Y, and up to Z% speedup in extreme cases." I mean, I assume they at least have rudimentary numbers for this, otherwise they would be making blind changes without knowing whether they improved or degraded performance. Providing just enough to get people interested in doing their own benchmarking (and possibly publishing it) would be great for everyone.

      Edit: One of the top comments is actually what I'm talking about (though apparently for a different feature). So it does get done, which is really nice. :)

      Edit2: Now there's the link to the blog post for this feature. :)

      • Someone 10 years ago

        If you aren't using it now, and thus don't know how fast or slow it would be on your data and queries, why would you care about percentage speed ups?

        It could be a 1% speed up from insanely fast to slightly more insanely fast, or a 100% speed up from unbearably slow to just extremely annoyingly slow.

        I gather that people considering migrating who want some reassurance before investing time in testing should be more interested in such things as benchmarks that show whether PostgreSQL can saturate your hardware for various query types, benchmarks that compare its query planner against that of competitors, and benchmarks that show how well it works under load.

        • kbenson 10 years ago

          >> I am interested in possibly switching at some point if it's worthwhile

          > If you aren't using it now, and thus don't know how fast or slow it would be on your data and queries, why would you care about percentage speed ups?

          I'm working under the assumption that similar operations in PostgreSQL and MySQL, where neither relies on an advanced feature the other plainly doesn't support, will be roughly comparable in performance. I understand this isn't strictly true, but it's useful in that it gives me a baseline to reason from without forcing me to migrate my application (or at least a significant portion of the data) to PostgreSQL just to test. If I'm then confronted with a feature that one supports and the other doesn't, and that can in some cases reduce the query time by an order of magnitude, then it starts approaching the criteria needed for me to invest that time to see if it's worth migrating.

          > I gather that people considering migrating who want some reassurance before investing time in testing should be more interested in such things things as benchmarks that show whether PostgreSQL can saturate your hardware for various query types

          Isn't that exactly what this is? Throwing CPU cores at aggregating data to reduce time? I asked for benchmarks regarding this specific feature, to see what type of gain we were talking about with this. I didn't ask for comparisons to other databases because those benchmarks are generally harder to perform fairly, so I would have been (and am) happy with comparative gain benchmarks.

HoyaSaxa 10 years ago

A slight tangent, but I am still shocked that many open source projects lack downloads via https (including PostgreSQL). Sure you can offer some layer of security by signing the distributions, but ultimately users are lazy.

  • snuxoll 10 years ago

    The -release RPM (which contains the yum repository configuration + GPG signing key) for any RedHat family distros is available over HTTPS. Check any of the links on this page http://yum.postgresql.org/repopackages.php#pg95 - all available over HTTPS.

    The GPG signing key used by the apt repo for Debian and derivatives is also served over HTTPS @ https://www.postgresql.org/media/keys/ACCC4CF8.asc, and the instructions for use direct you to install it as such.

    At this point it literally doesn't matter whether downloads are delivered over HTTPS or not, outside of anonymity (which is almost moot, because you are obviously downloading PostgreSQL or one of the few related packages in these repositories), since package signatures are verified.

  • ak4g 10 years ago

    As a project, I don't think PostgreSQL wants (or should want) anybody running a DB from the source tarball that's (eventually) stamped as 9.6; its existence is a mere artifact of the actual packaging work that goes on during a release. In all cases you should be installing packages via your OS's installation mechanism, and those packages will be authenticated (or not, as the case may be, but it's not in scope for Postgres-the-project, and it shouldn't be different for installing PG vs. any other piece of software on the system). A production-ready setup needs an OS user, logging/logfile rotation, some tooling for managing DB clusters, and a half-dozen other things I've never even thought about. If you just want to check out the source, clone the repo.

    But in a production environment, you always want to use a package provided by your OS vendor. Postgres has excellent packagers for both Debian and RHEL-flavored distros (and I imagine more), and you really want the system-level considerations thought through by someone who knows what needs to be thought about.

    • pdkl95 10 years ago

      > you should be installing packages via your OS's installation mechanism.

      My OS's install mechanism downloads the source tarball. It does authenticate the download, of course. The point being you don't know what platform someone is using, and the source may be the only way they can install PostgreSQL.

      > PostgreSQL wants

      It doesn't matter what they want - if the source is available for download, it will be used. Unauthenticated downloads are an "attractive nuisance" that puts users at risk. The actual download links[1] at www.postgresql.org do use https, but the HTML that contains the download URLs irresponsibly redirects https requests back to http. While the download of the actual source tarball is authenticated, the URL to that download can easily be modified in transit.

      [1] https://www.postgresql.org/ftp/source/v9.5.1/

  • takeda 10 years ago

    It's OSS so there's not much to hide.

    I'm sure that someone who has the capability to not only tap into but also modify traffic on the fly won't have a problem obtaining a valid certificate from one of the hundreds of CAs that everyone has in their web browsers.

    If anything I feel https actually hurts, because it gives you false sense of security and you're more likely not to validate files with GPG, which you should do whether it is https or not.

  • harel 10 years ago

    Out of curiosity what does it matter if your postgres download is via http or https? Is there any benefit of hiding your download from prying eyes?

    • HoyaSaxa 10 years ago

      One main benefit of HTTPS is confidentiality, but for OSS downloads it more importantly provides an integrity and authentication benefit as well. Sure, checksums and/or package signing should provide the same benefit, but most users cannot say they have checked these for every installation (and checksums are useless if served via HTTP, because they can be tampered with via a man-in-the-middle attack).

    • brazzledazzle 10 years ago

      To put it another way it's less about prying eyes and more about MitM attacks.

      • derefr 10 years ago

        I really do wonder whether there's a verified "path of trust" where I can get a copy of e.g. Debian (without knowing what the canonical website for Debian is), install it on a machine, connect it to the internet and update it, without being at all vulnerable to a MitM attack at any step.

        I would be suspicious that it's the very first step that poses the most danger: there's little way to know whether minor-distro-X is "properly" hosted at minordistrox.com or distrox.org or even distrox.github.io, other than by blindly trusting the relative PageRanks of each.

        ---

        Something that would be very nice, I think, would be a way to rely on the package-signing infrastructure of the OS you've already got (and trust), to guarantee for you the copies of any new OS images—even for different OSes!—you download. Sort of the same way you download new versions of OSX through the Mac App Store, but going a bit further.

        Now that I think about it, the browser-preloaded HSTS list might do the trick... if it was coupled with a mapping of 'important well-known downloads' to a single (HSTS-preloaded) domain you should be allowed to get those downloads from. (Presumably with heuristic detection, so it could find "something looking mostly, from its metadata, like a Debian ISO image.") Then your browser would just tell you you're being phished if you're trying to get a Debian ISO from debianisamazing.info.

      • RI_Swamp_Yankee 10 years ago

        This is true, and MitM attacks are becoming less and less theoretical as more traffic moves to the various wireless protocols. Rogue APs and cell towers are a thing in these modern times.

    • im_down_w_otp 10 years ago

      You can validate the certificate used to establish the connection, and thus the endpoint, if you want to, and thereby avoid one potential vector for someone serving you a malicious Postgres build. Otherwise you're putting your trust in the network.

allan_s 10 years ago

It's pretty interesting, especially for those of us who are creating analytics tools on top of PostgreSQL; the increased performance would certainly permit fewer denormalization workarounds, like creating "manual" aggregates directly in the database (with all the problems of keeping them in sync with the rest).

And I'm still here waiting for 9.5 to arrive in Amazon RDS :(
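The "manual aggregates" workaround mentioned above often looks something like a rollup refreshed out of band (table and column names invented for illustration); faster parallel aggregation makes it more viable to just run the real aggregate query instead:

```sql
-- Pre-computed rollup, with all the staleness/sync problems noted above:
CREATE MATERIALIZED VIEW daily_totals AS
  SELECT date_trunc('day', created_at) AS day,
         sum(amount)                   AS total
  FROM orders
  GROUP BY 1;

-- Must be re-run periodically; results are stale between refreshes.
REFRESH MATERIALIZED VIEW daily_totals;
```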

tiglionabbit 10 years ago

Does it have generic upserts yet?

  • ddorian43 10 years ago

    What do you mean by generic? Upserts were implemented in 9.5.

    • tiglionabbit 10 years ago

      As in you don't need to write a unique PL/pgSQL routine for each kind of upsert you want to do.

      Oh, it does have upserts now. Awesome =]
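      For reference, the 9.5 upsert syntax, with no per-case PL/pgSQL routine needed (table and column names invented for illustration):

      ```sql
      CREATE TABLE counters (
          name text   PRIMARY KEY,
          hits bigint NOT NULL DEFAULT 0
      );

      -- Insert a new row, or atomically bump the counter on conflict;
      -- EXCLUDED refers to the row that was proposed for insertion.
      INSERT INTO counters (name, hits)
      VALUES ('homepage', 1)
      ON CONFLICT (name)
      DO UPDATE SET hits = counters.hits + EXCLUDED.hits;
      ```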
