TechEmpower Web Framework Benchmarks Round 10

techempower.com

110 points by pneumatics 11 years ago · 65 comments

chrisan 11 years ago

Come for the stats, stay for the comedy

> The project returns with significant restructuring of the toolset and Travis CI integration. Fierce battles raged between the Compiled Empire and the Dynamic Rebellion and many requests died to bring us this data. Yes, there is some comic relief, but do not fear—the only jar-jars here are Java

Brilliant!

redstripe 11 years ago

There's something I don't understand in these comments. Why is everyone interested in language comparisons instead of the huge difference between EC2 and bare metal?

  • sauere 11 years ago

    Got to agree. Bare metal performance isn't appreciated enough. I know a few companies that run fine with a mixed metal/AWS combo. Metal handles 80% of the workload, and if that fails for some reason, EC2 instances are fired up to take over until it is fixed. This setup doesn't work for every scenario, but it is something to take into consideration.

    • ckluis 11 years ago

      I remember being in a meeting discussing how tax companies use X bare-metal servers for the whole year and scale up Y cloud instances for the 3 months when they have excessive usage.

  • buster 11 years ago

    Indeed... I am wondering the same. I fear that at some point the baseline for performance will be some virtualized server and people will forget just how fast real hardware can be. At some point people will think it's normal to host a simple blog on 5 virtual servers ;)

    Also, by its nature EC2 should be terrible for serious benchmarking, since you have no control whatsoever about the infrastructure.

  • wmf 11 years ago

    Because people think they can't use bare metal and thus its advantages are irrelevant.

  • stefantalpalaru 11 years ago

    From what I know about the Xen virtualization used by EC2, the instance variability is too high to get consistent results over time, and the overhead for this type of benchmark is so big it's not even funny.

    That's why the bare metal results are the only relevant ones. The playing field is fair and stable. Let the battle begin ;-)

Joeri 11 years ago

Every time, I'm struck by how slow the big PHP frameworks are compared to raw PHP. Either nobody cares about making those tests perform better, or something is very wrong in the architecture of those frameworks.

I expect the cause is that too much code is being loaded for every request. PHP tears down and rebuilds the world for every request, and the popular frameworks load a lot of code and instantiate a lot of objects for every request.
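
To make that cost concrete, here is a minimal, illustrative Python sketch (an analogy only, since the comment is about PHP): it compares rebuilding a made-up object graph for every request with reusing one built once per process. The class and sizes are invented for illustration.

    import time

    class Service:
        """Stand-in for one of the many objects a framework wires up per request."""
        def __init__(self, n):
            self.payload = list(range(n))

    def cold_request():
        # Rebuild the whole object graph for every request (PHP-framework style).
        return len([Service(1000) for _ in range(200)])

    WARM_SERVICES = [Service(1000) for _ in range(200)]  # built once per process

    def warm_request():
        # Reuse the already-built graph.
        return len(WARM_SERVICES)

    def bench(fn, reps=200):
        start = time.perf_counter()
        for _ in range(reps):
            fn()
        return time.perf_counter() - start

    print("rebuild per request:", bench(cold_request))
    print("reuse warm graph:   ", bench(warm_request))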

  • aikah 11 years ago

    > Either nobody cares about making those tests perform better

    Very few do. Just look at Symfony's complexity: the devs managed to shove proxies into the IoC container to "lazy load services" so you could inject them into controller class constructors without actually instantiating them until you call a route handler's controller instance method... Using complex class hierarchies in PHP has a big cost, while "raw PHP" mostly just calls straight into the underlying C code. To be fair, people using heavy frameworks come for the batteries included first, not really for the speed of the router. And since devs want more batteries, it's unlikely it will get faster.
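
    (For readers unfamiliar with the lazy-proxy trick described above, here is a minimal Python sketch of the general idea; Symfony's real implementation generates PHP proxy classes, and the names below are made up.)

        class LazyProxy:
            """Defer building an expensive service until it is first used."""
            def __init__(self, factory):
                self._factory = factory
                self._instance = None

            def __getattr__(self, name):
                # Only reached for attributes not set in __init__, i.e. real service calls.
                if self._instance is None:
                    self._instance = self._factory()
                return getattr(self._instance, name)

        class ReportService:
            def __init__(self):
                print("expensive construction happens here")

            def run(self):
                return "report"

        # The controller is handed the proxy; ReportService is only constructed
        # if a route handler actually calls it.
        service = LazyProxy(ReportService)
        print(service.run())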

  • panopticon 11 years ago

    PHP frameworks generally rely on heavy amounts of caching at every level (database, bytecode, Varnish, etc) to make up for this.

    Why this is the status quo is something I also question.

  • makeitsuckless 11 years ago

    Because we know these raw tests are bullshit, since it's almost trivial to set up the frameworks and the infrastructure to make them perform perfectly fine.

    And because of the shared-nothing architecture it scales like a mofo before we have to start jumping through hoops (and 99% of us never reach that scale).

    The lack of raw speed is a minor inconvenience that's easily fixed in practice.

    Also, I doubt big PHP frameworks are the only ones with a disadvantage that only shows up in these kinds of benchmarks. It's like testing cars solely for straight-line speed.

tolas 11 years ago

Still no Elixir/Phoenix inclusion? I'd really love to see how it stacked up.

DAddYE 11 years ago

A few notes:

* Impressive Dart

* JRuby > MRI (I'd like to see JRuby 9k)

* Padrino, which offers basically everything that Rails does, performs impressively well. [Shameless Plug]

  • cheald 11 years ago

    My experience has been that JRuby 9k is slightly slower than MRI in a number of cases right now, mostly because its IR has been completely rewritten, and is still pending a performance pass.

    That said, it's still a huge step up from the 1.7 series, and once the team starts knocking out performance problems it should be pretty magnificent.

aikah 11 years ago

I'd like to see the memory benchmarks as well. It's fine to have something fast, but the more memory it uses, the more expensive the boxes are.

vdaniuk 11 years ago

Okay, Nim is really fast in these benchmarks on EC2 servers, impressive.

Does anyone have experience with the Nim web stack? Is it ready for prime time? How much effort is required to create a simple CRUD JSON API?

On a side note, I am really looking forward to comparing Rust and Elixir results in the next round of benchmarks.

saryant 11 years ago

The chart says Play Framework didn't complete, but looking at the output, the logs say it did.

https://github.com/TechEmpower/TFB-Round-10/blob/master/peak...

What am I missing?

  • richdougherty 11 years ago

    An error occurs, which is logged to stderr, but the benchmark logs don't capture stderr so it's hard to know what's happening. (Or maybe Play redirects stderr to a log file?)

    The test passes in the preview runs, in the TechEmpower continuous integration tests and in the EC2 tests so it's probably some transient error that only occurred in the final bare metal test. Maybe there's a race condition in the Play 2 test scripts which only shows up sometimes.

    I've spent a fair bit of time maintaining the Play 2 benchmark tests so it's very frustrating to get no result on the final test. Oh well!

    • saryant 11 years ago

      Out of the box, the start script from "play stage" does not redirect stderr.

      Though I didn't think to check the classpath when I was poking around the TechEmpower github repo. I wonder if a logback.xml slipped in somewhere that's siphoning off stderr to some unknown destination?

  • wheaties 11 years ago

    It's not that they didn't complete; they never even started. Definitely configured incorrectly for the tests.

    • kainsavage 11 years ago

      Agreed. Our logging has undergone some solid improvements in the last week or two, and so round 11 will, if not resolve this issue completely, make the logged output more useful for tracking down issues like this during the preview runs.

  • bhauer 11 years ago

    Sorry, the links to logs were going to logs from a preview run. I've just changed the links to direct to the final logs. In the final run, play2-scala did not respond:

    https://github.com/TechEmpower/TFB-Round-10/blob/master/peak...

    • saryant 11 years ago

      Generally that error happens when Play tries to bind to a port that's already in use. Looking at the start scripts for the Scala projects, the RUNNING_PID file is just being removed if it exists, but your script should probably kill that PID before deleting the file (roughly the order sketched below).

      https://github.com/TechEmpower/FrameworkBenchmarks/blob/mast...
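
      (A minimal sketch of that kill-then-delete order, written here in Python rather than the actual shell start script; the PID file path and the choice of SIGTERM are assumptions.)

          import os
          import signal

          PID_FILE = "RUNNING_PID"  # hypothetical path; Play writes this next to the app

          if os.path.exists(PID_FILE):
              with open(PID_FILE) as f:
                  pid = int(f.read().strip())
              try:
                  os.kill(pid, signal.SIGTERM)  # stop the old server before reusing its port
              except ProcessLookupError:
                  pass  # stale PID file: the old process is already gone
              os.remove(PID_FILE)  # only now is it safe to delete the file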

      • kainsavage 11 years ago

        Our suite takes a nuke-from-orbit approach when it comes to killing processes, as this has come up in every round. The idea is that all tests are now run as a specific user, and instead of relying on each test to shut down properly (which MANY could not reliably do), we simply nuke all processes owned by this runner.

        It has the downside that if a process forks other processes and drops them under a different user (recently addressed for hhvm, for example), we cannot capture that. However, we have made great strides in trying to avoid that. Additionally, the application's logging WOULD indicate if a port were already bound prior to start-up, and that does not seem to be the case in this example.

        • saryant 11 years ago

          Yeah, you should see a java.net.BindException in that case. Play sends it to STDERR rather than STDOUT; I thought maybe that output was being redirected, but that doesn't appear to be the case.

          The other primary cause of that "oops" message is when evolutions can't be applied but that also doesn't appear to be the case.

  • virtualwhys 11 years ago

    Maybe they need to pass `-Dhttp.address=...` to the start script[1]; Play binds to `0.0.0.0` and port 9000 by default.

    Here's the corresponding error from the log in Play[2]. Not sure what else it could be...

    [1] https://github.com/TechEmpower/FrameworkBenchmarks/blob/mast...

    [2] https://github.com/playframework/playframework/blob/2.2.x/fr...

sauere 11 years ago

Bottle handling 5x more requests than Flask. Impressive, but overall Python framework performance is still... meh.

WoodenChair 11 years ago

Dart dominated the multiple queries test type.

  • Cyph0n 11 years ago

    That is quite surprising. Anyone have an idea why that is?

    • wheaties 11 years ago

      That's easy to answer: instead of using a relational database or even PostgreSQL's JSON data store, it's using MongoDB. I hate MongoDB, but for tests like these where there are absolutely no writes whatsoever, Mongo is going to fly. Hence, I read this as comparing MongoDB read performance to PostgreSQL, MySQL or SQL Server performance (or any of the other DBs that are listed).

      That's why benchmarks like these have to be scrutinized. I like looking them over, but in reality they're not apples to apples.

      • EugeneOZ 11 years ago

        If each framework uses a different DB, then all results with DB queries are useless (for me). Very frustrating. Really, any framework with Redis will have a HUGE advantage.

        • plorkyeran 11 years ago

          You can filter by DB, and most of the frameworks (including the Dart ones) are run against multiple DBs.

          • EugeneOZ 11 years ago

            Thanks, I see it, but it significantly reduces the set of results. For example, there are no MySQL+Dart runs.

    • kainsavage 11 years ago

      Actually, not really. We checked the code to ensure that there was no gaming of the system, and it definitely APPEARS to be making separate database queries as we require in our rules. In fact, we had this same question in round 9 and had a number of people audit it. We cannot explain it other than that it might be pretty darn fast.

      • emn13 11 years ago

        A better requirement would be to define some minimum level of durability, and some minimal level of freshness in the face of concurrent modifications.

        Frankly, who cares if a caching driver avoids some database queries entirely if it still provides the same level of durability and freshness guarantees? If mongo+redis are OK, what's wrong with a plain hashtable?

      • z5h 11 years ago

        The benchmark requirements aren't specific about the durability the database must provide. https://www.techempower.com/benchmarks/#section=code

        This is where a difference between Redis and other databases will show up, depending on configuration.

sker 11 years ago

I'd like to see some ASP.NET running on OWIN. Perhaps I'll find the time to add it myself before round 11.

hamiltont 11 years ago

I've been working with this project for a while; here are some unorganized thoughts:

   1) Statistics
   2) Running Overhead
   3) Travis-CI
   4) Memory/Bandwidth/Other info
   5) Windows
   6) IRC
   7) Ease of Contributing

1) Currently, the TFB results are not statistically sound in any sense - for each round you're looking at one data point. EC2 has higher variability in performance, so that one data point is worth less than the bare metal data point. Re-running this round, I would expect to see at least a 5 to 10% difference for each framework permutation. See point (2) to understand why we're not yet running 30 iterations and averaging (or something similar; a rough sketch of that kind of aggregation follows after point 7)

2) Running a benchmark "round" takes >24 hours and still (sadly) requires a nontrivial amount of manpower. It's currently really tough to do lots of previews before an official round, and therefore tough to let framework contributors "optimize" their frameworks iteratively. I'm working on continuous benchmarking over at https://github.com/hamiltont/webjuice - it's a bit early for PRs, but open an issue if you want to chat

3) As you can imagine, our resource usage on Travis-CI is much higher than other open source projects. They have been nothing but amazing, and even reached out to chat about mutual solutions to potentially reduce our usage. Really great team

4) We do record a lot of this using the dstat tool. dstat outputs a huge amount of data, and no one has sent in a PR to help us aggregate that data into something easy to visualize. If you want this info, it's available in raw form in the results GitHub repository.

5) Sadly, Windows support is struggling at the minute. We need something set up like Travis-CI but for our Windows system. Currently Windows PRs have to be manually tested, and few of the contributors have either a) time to do it manually in a responsive manner or b) Windows setups (a few do, but many of us don't). Any takers to help set something up? FYI, we have put a ton of work into keeping Mono support just so we can at least verify that changes to the C# tests run and pass verification, but naturally that isn't as nice as having real native Windows support

6) Join us on freenode at #techempower-fwbm - it's really fun meeting the brilliant people behind the frameworks

7) If I had to pick one big thing that's happened between R9 and R10, it would be the drastically reduced barrier to entry. Running these benchmarks requires configuring three computers, which is much harder than something like pip install. Adding Vagrant support that can set up a development environment in one command, or deploy to a benchmarking-ready AWS EC2 environment, has really reduced the barrier to getting involved. Adding Travis-CI made it better - it will automatically verify that your changes check out! Adding documentation at https://frameworkbenchmarks.readthedocs.org/en/latest/Projec... made it even easier. Having a stable IRC community is even better! Tons of changes have added up to mean that it's now easier than ever for someone to get involved
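
Regarding point (1) above, here is a rough Python sketch of what aggregating repeated runs could look like; the requests-per-second numbers below are invented purely for illustration.

    import statistics

    # Hypothetical requests-per-second results for one framework permutation,
    # collected by re-running the same test several times instead of once.
    runs = [112400, 118900, 109700, 121300, 115800]

    mean = statistics.mean(runs)
    stdev = statistics.stdev(runs)  # sample standard deviation across runs

    print(f"mean RPS: {mean:.0f}")
    print(f"run-to-run spread: +/- {100 * stdev / mean:.1f}% (1 sigma)")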

MCRed 11 years ago

Reading through these tests, they are measuring database performance as much as framework performance.

They are also single node, which is great if your entire system is only ever going to need one machine's worth of capacity (e.g. vertical scaling).

vinceyuan 11 years ago

Some frameworks which I had never heard of performed very well. But it looks like they are not mature. Which framework do you recommend? I used Node.js/Express, Rails and Sinatra but am not satisfied with them. I am learning Go.

cagenut 11 years ago

Since the "peak" hardware is a dual E5-2660 v2, that's 32 threads, so a c3.8xlarge would be a much more comparable instance.

  • kainsavage 11 years ago

    We aren't trying to measure each hardware set as apples-to-apples, but rather give the reader an idea of how performance characteristics for a chosen stack are affected by hosting environment. Specifically, we wanted the middle-of-the-road EC2 instances versus the extremely high-end Peak option to illustrate that difference.

    • cagenut 11 years ago

      thanks for the response

      I've noticed a weird trend where Amazon created various slices of instance types a long time ago, and people have mentally gotten used to using larger ones far slower than Moore's law adds cores. So people will refer to something with 2 cores as "middle-of-the-road" and 32 as "extremely high-end" when in my brain that's "a cell phone" and "a 2 year old server".

merb 11 years ago

Keep in mind that most of these benchmark numbers won't happen in production, especially not the netty and lwan ones.

dilatedmind 11 years ago

What are the benefits of using this benchmark over using ab?

  • hamiltont 11 years ago

    The main benefit is this allows rough comparison to a ton of other frameworks. Just running ab against your one server setup gives you one RPS/latency result on one hardware setup - that's good to know as an absolute metric, but tells you very little about your performance relative to other frameworks.

    This project gives you RPS/latency metrics for many frameworks, on a few hardware setups. This enables a rough comparison of "how does my framework perform relative to all these other well-known or established frameworks". Naturally, the comparison is not perfect - there are a ton of reasons that measuring just requests/sec and latency doesn't allow a complete comparison between two frameworks. However, once you accept that it is basically impossible to fully compare any two frameworks using just quantitative methods, and that these numbers should inform your choice of framework (instead of totally controlling it), we can talk about why it's valuable.

    Want to run a low-cost server in language X that you happen to love? This project can provide guidance about which frameworks written in language X are performing the best. Want to ensure your service can support 50k requests per second without latency suffering? This project provides latency numbers for you to examine, letting you know which frameworks appear to maintain acceptable latency even under high load.

    If you wanted to, you could re-create this project by running ab against 100+ frameworks - that's the cornerstone of what is happening here. Granted, we currently use https://github.com/wg/wrk instead of ab, but the principle is the same - start up framework, run load generation, capture result data. Most of the codebase is dedicated to ensuring that these 100+ frameworks don't interfere with each other, setting up pseudo-production environments with separate server/database/load generation servers, and other concerns that have to be addressed.
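
    (To make that loop concrete, here is a rough, illustrative Python sketch; the framework commands and URL are made up, the wrk flags are just typical values, and the real toolset does far more setup, isolation and verification than this.)

        import re
        import subprocess
        import time

        # Hypothetical frameworks to compare: (name, command that starts the server).
        FRAMEWORKS = [
            ("my-flask-app", ["python", "flask_app.py"]),
            ("my-bottle-app", ["python", "bottle_app.py"]),
        ]

        results = {}
        for name, start_cmd in FRAMEWORKS:
            server = subprocess.Popen(start_cmd)   # start the framework under test
            time.sleep(5)                          # crude warm-up wait
            try:
                # Load-generate with wrk: 8 threads, 256 connections, 30 seconds.
                out = subprocess.run(
                    ["wrk", "-t8", "-c256", "-d30s", "http://127.0.0.1:8080/json"],
                    capture_output=True, text=True, check=True,
                ).stdout
                match = re.search(r"Requests/sec:\s+([\d.]+)", out)  # parse wrk's summary line
                results[name] = float(match.group(1)) if match else None
            finally:
                server.terminate()                 # tear down before the next framework
                server.wait()

        print(results)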

    Over time, this project has started to collect more statistics than just requests/second and latency, which makes it more valuable than just running ab. As more metrics and more frameworks are added, this becomes a really valuable project for understanding how frameworks perform relative to one another.
