Rap Genius (YC S11) responds to Heroku’s call for ‘respect’ (venturebeat.com)
It doesn't matter how efficient or inefficient RG was with their Rails app. It's almost certainly true that they could have done things better on their end, and their performance penalty wouldn't have been as severe -- but that really is not the point.
The point is that one company promised a level of service with their product that they did not deliver, and the difference was significant and persistent. The fact that the consumer could have used the product more efficiently is immaterial to that fact.
Other things that don't matter:
- that RG could/should move to another provider. That is of course their choice now, but it does not change the money they've spent and wasted with Heroku.
- that the routing problem is hard. If anything this makes it worse: it's a hard problem, so people would pay a lot of money for a solution. What matters is that Heroku claimed to solve it and did not.
- that other consumers of the product managed to figure this out before RG. Heroku was still advertising through their documentation that they offered a routing solution, and they did not make clear to their customers that a significant feature of their product was now different.
Furthermore, Heroku appeared to obfuscate this fact and shift blame to the customer during the time RG was trying to diagnose their issues.
Now, by attacking RG's tone, Heroku have employed argument-level DH2 [1], which at least according to pg is not even worth considering. They have at least acknowledged their mistake, but to me that means that by extension they have sold something that they did not deliver on. The only honest way to move forward is for Heroku to offer some kind of compensation to the customers that were affected.
Yes, the comment quality on HN seems to be quite bad when it comes to Heroku threads. Why do so many CS professionals appear to be attaching themselves emotionally to software tools? That's pretty much what I have to conclude if you can't admit that this PaaS provider has screwed up and deserves more scrutiny when deciding for the platform of your next project/migration.
Isn't one of the great things about the Software startup scene that we can decide freely on what tools to use? Except for very niche markets we always have alternatives, even if it means a bit more work on our sides.
I'm not really certain why, but there is a much greater tendency for members of the Ruby community to get emotionally attached to certain tools or services, and to defend them unequivocally, even when this is completely unjustified.
I haven't seen this to such a high degree with any other programming language/platform/technology community. Yes, there are developers in these communities who do prefer certain tools, but they're generally reasonable when it comes to criticism of these tools, or the suggestion of using alternatives. It's much rarer to see this when dealing with Ruby developers.
On more than one occasion, I've witnessed several different Ruby developers yell and scream in meetings when told they can't use a particular library or framework. I've never seen this kind of reaction from the many Java, C#, C, C++, Fortran, COBOL, Ada, Perl or Python developers I've worked with over the years, for instance.
Everyone knows Heroku screwed up; there's not much left to say about that part of this story... so then we get to the application.
there's not much left to say about that part of this story... so then we get to the application.
If we're going to be bikeshedding, why bikeshed RapGenius' Rails app? Surely Heroku's request routing is a meatier and more exciting problem to talk about. Or is it simply that, because Rails is a known quantity, it's easier to fling shit at RG for not having the foresight to make exactly the decisions that are obvious to people with hindsight and an incomplete view of their application?
Nobody's bikeshedding anything.... Heroku have admitted it's a real problem, they have begun addressing it and increasing visibility and awareness into it. There is no news or data being added at this point, it's just RG retelling the exact same story.
And the story includes an application where sizable fractions of a second are spent constructing their pages, one very slow request at a time (regardless of what Heroku adds), which is interesting to a lot of us because that's not how many other platforms work.
You have to feel comfortable that those people will generally give you good value for your money (since you can’t literally observe everything they do) and that they will tell you when something’s wrong as soon as they know, rather than covering it up.
I used to feel this way about Heroku, and I might again in the future, but I don’t right now.
I have a hard time understanding why, for all the money Rap Genius pays Heroku, they don't simply set up their own instances on EC2 and run the app there themselves. It seems like for a few days work with Puppet or Chef you could automate getting your code onto dozens of EC2 instances and installing the necessary tools/server processes, plus you don't have to complain anymore about how you can't run Unicorn.
Yes I get that there is a certain amount of value in being able to pay someone else to do all these things for you and saving time - but if you aren't happy with the result and the value given the money you are paying (and RG is not), then at a certain point it's time to just bite the bullet and fix things yourselves instead of continuing to be hamstrung by problems that the hosting provider won't/can't fix. There comes a point where you get large enough, and you are paying enough to Heroku, that it would be worth it to do things yourself and eliminate the problems.
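For what it's worth, the "few days of work with Puppet or Chef" could plausibly start with a recipe along these lines. This is only a sketch: the repo URL and paths are placeholders, and a real setup would also need deploy keys, secrets, monitoring, and a load balancer in front:

```ruby
# Hypothetical Chef recipe sketch for provisioning an EC2 node to run a
# Rails app behind Unicorn. Names and paths below are placeholders.
package "git"
package "build-essential"

git "/srv/app" do
  repository "git@example.com:ourco/app.git"   # placeholder repository
  revision "master"
  action :sync
end

execute "bundle install --deployment" do
  cwd "/srv/app"
end

service "unicorn" do
  action [:enable, :start]   # assumes an init script is already in place
end
```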
This is so true. The fact of the matter is that Rap Genius has obviously had someone spend a ton of time diagnosing problems with Heroku -- and it is objectively cheaper just to host some servers yourself compared to Heroku dynos.
This is why I always tell people that Heroku is actually NOT a good solution if you truly need scale. They're good for staging, launch, and an early traffic emergency or two. After that, ONCE YOU NEED TO SCALE, it's cheaper just to run your own servers, because the problem that Heroku is solving for you becomes a smaller and smaller percentage of your overall operations budget.
Also worth considering how much time of RG's has been spent not just diagnosing Heroku issues, but giving interviews and writing blog posts about the ordeal. Using Heroku might allow them to spend zero time on "ops" but they've spent some non-zero time now just talking about and raising awareness of this issue!
I had never heard of Rap Genius before this Heroku thing and their app is aimed at dissecting the types of textual messages that are being exchanged back and forth here. Seems like they decided to take the "pick a fight" approach to publicity quite literally...
Although I'm getting bored of this little scuffle, I am glad that they made some noise initially because it let me know that I wasn't crazy. I was trying to profile, understand, and optimize a Heroku app and ultimately gave up because it was relatively easy to migrate.
This reminds me of the quote, "All press is good press," but I can't recall who said it. You're right it's a waste of engineering time, but the shitstorm also has benefits.
IIRC that quote evolved from Oscar Wilde's:
"The only thing worse than being talked about is not being talked about"
I agree with this point, however, how Rap Genius spends its money isn't an issue here. Whatever the reason, they paid and expected to get an adequate service from the company, which they didn't. And on top of this, they found the shady practice at work. And this is a big fucking issue, if you ask me.
I have a hard time understanding why, for all the money Rap Genius pays Heroku, they don't simply set up their own instances on EC2 and run the app there themselves.
Who says they won't do that now?
Obviously when they started, they had no idea they'd have these problems or that they'd spend so much time diagnosing them, because Heroku told them that they wouldn't have these problems to begin with.
Fact of the matter with anything outsourced is that you can outsource responsibility, but you can't outsource accountability.
Ultimately RG's devs are responsible for their choice to leave all the admin work up to heroku.
Yeah you would think the cost savings from EC2 and the 60K they spent on New Relic would cover paying for a quality sysadmin to run that stuff.
"Yes, one solution is to run a concurrent web server like Unicorn, but this is very difficult on Heroku since concurrent servers use more memory and Heroku’s dynos only have 512mb of ram, which is low for even processing one request simultaneously."
Is this really accurate? 512mb is barely adequate for serving a single request at a time? I'm not a Rails developer, but that sounds terrible. I'm all for trading off some performance for rapid development, but that seems a bit extreme.
I'm currently running twelve Django apps on one 512MB Rackspace VM. It's a bit tight, and I don't get a lot of traffic on them, but it's basically fine. And that's with Apache worker mpm + mod_wsgi (with an Nginx reverse proxy in front) which probably isn't even the lightest approach. And having been writing apps in Erlang and Go recently, I'm starting to feel like Python/Django are unforgivably bloated in comparison.
It really depends on your application. A fresh rails app will take up ~30mb of memory (iirc, been a while since I checked). Thirty gems and 11,000 lines of code later, yes, it can spike to 256mb.
If I were to toss down an average, seems like ~100mb is what I see most of the time for non-trivial rails apps.
The main Rails app that I work on is a medium-sized app and runs at 220MB memory usage on a dyno on average. It spikes to the 350MB range occasionally, probably from image generation with RMagick or PDF generation.
OK. That sounds reasonable and pretty comparable to what I expect to see from a full-stack framework.
I have many apps on Heroku, all running Rails, and mostly running on Unicorn with three or four workers. Most of the apps I've seen pass me by use no more than 150MB per worker, and there's a fair amount of work going on in many of them with image processing and the like.
512MB for a single application sounds incredibly high to me.
EDIT: After looking at the docs, it seems like 512MB isn't even a hard limit: https://devcenter.heroku.com/articles/dynos#memory-behavior
Atwood's new Discourse thingy recommends 1GB of RAM:
http://www.discourse.org/faq/
"We also recommend a minimum 1 Gb RAM to host Discourse, though it may work with slightly less."
Guess it depends.
Note that that's their recommendation for a single VPS that includes the postgres & redis servers on it as well, not just the Rails stack.
Reading things like 512MB isn't enough for more than one request at a time, and the performance of that one request looking terrible even though it's obviously got an entire VM dedicated to it...
What are (edit:) Rails developers getting in exchange for these enormous penalties that makes it worth choosing?
Most Rails apps use nothing like that amount of memory; the norm is more like 80-150MB. Various factors affect how much memory you use, and of course if your processes are leaking memory they can easily grow over time and hit any limit. Rails itself takes up around 30MB, so this is all about the specific app code. Another common problem is loading lots of records into memory (say, fetching all your user records at once), which allocates, but then doesn't free, lots of memory.

Personally I find Passenger handles this perfectly well out of the box without having to worry too much about memory usage, routing or other issues, but it does require keeping an eye on the app code as the app grows and fixing any issues that come up with memory usage or response time. Those are not problems specific to Rails.
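On the "fetching all your user records at once" trap: the standard fix is to stream the result set in fixed-size batches, which is what ActiveRecord's find_each/find_in_batches does for you. A standalone sketch of the shape of the pattern, with a plain Range standing in for the table so it runs without Rails:

```ruby
# Stream a large result set in fixed-size batches instead of loading it all
# at once -- the pattern ActiveRecord's find_each/find_in_batches provides.
# A plain Range stands in for the database table so this runs anywhere.
ids = (1..10_000)

batches    = 0
peak_batch = 0
ids.each_slice(1_000) do |batch|
  batches    += 1
  peak_batch = [peak_batch, batch.size].max
  # ...do per-record work here; the batch can be garbage collected afterwards
end

puts "#{batches} batches, at most #{peak_batch} records resident at a time"
```

Peak residency stays at one batch (here 1,000 records) no matter how large the full set grows, which is exactly the property that keeps a worker from ballooning past the dyno limit.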
Without knowing the specifics it's hard to say for sure, but I really think RG should try a comparison with running their own real VM (not a web worker on heroku) and see how well it runs. If they'd done that they'd probably find and fix the reasons that their processes are taking so long to respond and taking up such a huge amount of memory, because they'd feel more ownership of those problems, instead of playing a blame game with heroku.
This is not rocket science but it is a series of trade offs and heroku seem to have optimised for short running processes which don't take up lots of memory - many web apps run that way and would be happiest with random routing. Yes heroku could do better but at some point you have to take responsibility for your own ops instead of expecting some service to abstract away all the hard stuff, particularly if you're seeing performance issues and have a busy site. The amount they're paying heroku would easily pay for far more vps than they need.
So in summary, Heroku is not for everyone, and rails isn't really the problem here, so there are no enormous penalties for using it, just the sort of problems you see running any web app.
I find it completely astounding that 80+ MB of memory is required to run these Ruby web apps.
I remember doing CGI development in Perl back in the 1990s. We were lucky if our web servers had 32 MB of physical RAM, yet we could easily handle many requests per second to our CGI scripts with a single server. I don't think that the apps then were all that different from what we have today. They still had to interact with databases, perform string manipulation and other logic, and generate and emit HTML.
So it just seems really bad to me that Ruby on Rails requires so much more memory for doing basically the same task. Something is seriously wrong.
I find it completely astounding that 80+ MB of memory is required to run these Ruby web apps.
Because memory is cheap nowadays, people use more of it, in the same way that most desktop OSs now couldn't boot on a 32MB machine, and often require something like 2GB of RAM just to function acceptably. Like money inflation, sometimes this is hard to accept :)
Of course Rails comes with a whole load of convenience code built in, which is loaded for each process and not always shared, people use gems, which are also loaded, people add apps on top, and people use in-memory caches etc. Those figures I quoted are for real world apps which pull in quite a lot of other code, though they are just taken from top/passenger-memory-stats, so take that with a pinch of salt. People use all that code because it's easier than reinventing the world each time, and developer time costs more than memory over the long term as sites are developed.
If you want to cut things right down, hello world in Rails is around 40MB. Sinatra (another Rack-based Ruby framework) does less and consumes about 20MB per process. A bare-bones script doing direct DB access and some string manipulation (similar to those you were running back in the 90s) would probably take far less again and fit easily on your 32MB server; or of course Perl is perfectly adequate too and might take even less (sorry, no idea there) if you don't use a framework.
I suspect for most frameworks that you might use in perl though you'd see similar resource usage nowadays, simply because the resources are there and there is less cost for memory, and more cost for development time. Would be interesting to see figures for other frameworks/platforms which do similar work to Rails, as many are probably better on this front. Revel in Go for example takes around 5MB for a simple app, but it does far less at present - I suspect it'd remain far better for memory and performance though, even with extras like an ORM.
Rails developers
What kind of bullshit is this? That's 512MB of shared resource; you decide how many requests it actually is.
Usually a larger Rails app can do 2-3 requests on a dyno. Just configure Unicorn workers to that and set it in your Procfile. This has been known since 2011 (a week after Cedar was announced).
Just configure Unicorn workers to that
Did you bother to read the article before spouting off half-cocked? RG apparently can't even run Unicorn because they don't have enough available memory.
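For apps that do fit, the sizing arithmetic is simply dyno RAM divided by per-worker footprint. A sketch of what that looks like as a config/unicorn.rb, assuming roughly 150MB per worker (a figure mentioned elsewhere in this thread) on a 512MB dyno:

```ruby
# config/unicorn.rb -- a sketch, not a drop-in config.
# At ~150MB RSS per worker, 3 workers fit in a 512MB dyno with headroom.
# An app averaging 220MB+ per process fits 1-2 at most, and one spiking
# past 350MB can't safely run more than a single worker.
worker_processes 3
timeout 30
preload_app true

before_fork do |server, worker|
  # Drop inherited connections so each worker opens its own after the fork
  ActiveRecord::Base.connection.disconnect! if defined?(ActiveRecord::Base)
end

after_fork do |server, worker|
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord::Base)
end
```

A matching Procfile entry would be something like `web: bundle exec unicorn -p $PORT -c ./config/unicorn.rb`. RG's complaint, per the quote above, is precisely that their per-process footprint left no headroom for even a second worker.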
Speed and ease of development, mostly.
The complaints about what amounts to essentially support-contract extortion are something that I've personally experienced.
They were literally ignoring our repeated customer service tickets pleading for assistance or a phone call or something. We were paying them hundreds of dollars per month at the time.
When we finally got through, the only people we could get ahold of were salesmen. Essentially we were made to believe that only with a $1000/mo support contract would we receive customer support.
FWIW, our issue was frequent network timeouts to other EC2 services. They did eventually resolve those after months, and never did they assist us.
Heroku's platform is a significant accelerator of development for a startup. Using the platform has enabled us to do things faster and better than we'd otherwise be able to do them for the money and time we've invested.
That being said, I look forward to the day they have a true/viable competitor and are forced to compete on service. I'm extremely bitter towards them at the moment as a result of my customer support torture experience.
Yes, I got bitten by their lack of customer support a couple of weeks ago. I did a release and the Rails asset pipeline stopped precompiling the resources. I'd tested in staging, so this came as a bit of a surprise. I promptly rolled back to the previous release (it had been working fine for days) only to find that it was now broken as well.

With my production app now broken, I fired off a request for support. At this time we were running 8 dynos and 3 workers (not to mention a bunch of addons). This was also Saturday afternoon, which turned out to be a bit of a problem: I received an auto-response saying that support was only available Monday to Friday! Paying the premium rates for Heroku and not receiving support for a production failure really was a bitter pill to swallow. We're moving fast at the moment and don't have time to switch, but when we do we will certainly be looking at the options.
Nah, I think Heroku is pretty principled. There's no amount of money you can pay them to get working load balancing or multi-region reliability.
Rap Genius gets a (YC) tag, but Heroku don't?
I've always wondered whether the cut-off is time- or success-based. Maybe pg should write a Boolean return function for that. :P
Big props to Rap Genius for explaining the problem so plainly in the article. Unfortunately, many people of prominence in tech aren't even capable of talking about what they do to laymen.
RE: YC Tag - I think it is because Heroku was acquired.
Dropbox don't get the treatment either. It's not that I mind, it's just funny to see the irregularity of labelling.
Probably because people forget how many companies YC fostered.
Or perhaps Heroku and Dropbox are famous enough that posters assume people already know they were fostered by YC and don't tag accordingly.
This entire thing against Heroku is so disingenuous... The fact that New Relic didn't expose these metrics is not great, but has very little to do with the Rap Genius team not knowing about the metric.
Apparently, the fact that requests can be queued at Dyno level was common public knowledge back in 2011! Here's a quote from Stackoverflow answer:
"Your best indication if you need more dynos (aka processes on Cedar) is your heroku logs. Make sure you upgrade to expanded logging (it's free) so that you can tail your log.
"You are looking for the heroku.router entries and the value you are most interested in is the queue value - if this is constantly more than 0 then it's a good sign you need to add more dynos. Essentially this means that there are more requests coming in than your process can handle, so they are being queued. If they are queued too long without returning any data they will be timed out."
Source: http://stackoverflow.com/a/8428998/276328
When you use a PaaS, it doesn't mean you don't need to be serious about it and completely forget about all technical aspects. Granted, it should have been included with New Relic from day one, but hardly justifies such a direct and persistent attack on Heroku.
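To be fair, watching that queue value is easy to automate. A sketch of the check (the log line below is constructed with made-up values in the router's field layout; a real check would scan `heroku logs --tail` output rather than a hard-coded string):

```ruby
# Pull the queue= value out of a heroku[router] log line and flag queueing.
# The field layout follows the router lines quoted in this thread; the
# specific values here are invented for illustration.
def router_queue(line)
  m = line.match(/\bqueue=(\d+)/)
  m && m[1].to_i
end

line = 'heroku[router]: at=info method=GET path=/ dyno=web.4 ' \
       'queue=7 wait=240ms connect=3ms service=366ms status=200 bytes=25582'
q = router_queue(line)
warn "requests are queueing (queue=#{q})" if q && q > 0
```

Of course, as the article points out, this only helps if the logged value is actually truthful.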
From the article, it sounds like they were well aware of the logs & queue values, but they were misleading:
Their logs are STILL incorrect. Here's a sample line:

2013-03-02T15:41:24+00:00 heroku[router]: at=info method=GET path=/Asap-rocky-pretty-flacko-lyrics host=rapgenius.com fwd="157.55.33.98" dyno=web.234 queue=0 wait=0ms connect=3ms service=366ms status=200 bytes=25582

Those queue and wait parameters will always read 0, even if the actual value is 20000ms. And this has been the case for years.

Interesting... just ran my own test, and it looks like I jumped to conclusions too fast. New Relic and the queue data do not seem to match, so I have some more reading to do.
Here's a quote from Stackoverflow answer
I tend to read (and trust) official documentation before Stack Overflow. I use Stack Overflow and it is a great tool and all, but it can be really hit or miss. It doesn't cover every corner of every technology, and unless the answer is available somewhere on the internet, or the person answering has first-hand experience, it can lead to misleading, wishy-washy answers.
Ultimately, by pointing to SO you're lowering the expectations of a paid service from "the documentation reflects the product" all the way down to "users should read everything googleable about the product they're using, and trust that OVER the official docs, including mailing list posts from 2011 and a Stack Overflow question that asks a different question than the one you're asking".
The problem wasn't that queuing delay was impossible to detect. The problem was that the documentation described a specific load balancing setup that would have guaranteed better performance per dyno, and that setup was not in fact what was being delivered. It was clearly a material misrepresentation, and in any other service context would constitute a deceptive trade practice. That Heroku is being defended at all is a testament to the goodwill they've built up in the tech community, but it doesn't change the fact that they misrepresented their service, even if it was negligence rather than malice.
Why does Lehman say Heroku is "one of a kind in the world"? Isn't Cloud Foundry equivalent? http://www.quora.com/What-are-the-main-differences-between-C...
I'm astounded at the number of "$60k hires a good sysadmin and some EC2 resources" comments. You guys clearly don't understand exactly what Heroku (or a similar service) offers - providing it works.
There's a concept called the Bus Factor: the number of people who would have to be hit by a bus (and made otherwise unusable) for your business to be completely derailed.
With $60k spent on a single sysadmin and an army of EC2 instances, that's a pretty effing small bus factor: 1. So that one guy gets taken out of action, and they're more or less toast? Yeah, no. Heroku gives them a massive bus factor for perhaps a little bit more money than it would take to do it on the cheap themselves. It's a cheap way to avert risk.
They're probably at the size now where they could handle taking it in-house, but you've still then got to factor in hiring, developing the procedures for ops inhouse etc., and migrating. It's not easy to just flip the switch.
In any case, Heroku's behaviour is pretty shoddy. Though, knowing how much of a pain documentation is, I'm not surprised. I don't think they realised just how bad the change from intelligent to random routing actually was - and didn't treat it as such. This is giving them benefit of doubt though, because the other option is that they didn't publicise it precisely because they knew how bad it is. Scary thought.
I think it's obvious that Rap Genius would be happy with a "I see how its a problem, let us fix it" quote from Heroku - just acknowledging that there is an underlying problem and that there is a future on the platform.
This is the tech world equivalent of tabloids. Please don't promote this mindless back and forth. If you have a problem with Heroku, leave and go to one of the other providers; if you don't, stay and push them to fix this problem. Either way, stop pretending this is some huge event that we must mindlessly obsess over.
Indeed, especially considering it's painfully obvious that the problem isn't on Heroku's side but rather on their app's dismal performance. You should be able to easily do a couple of dozen requests per second; this is the kind of performance we're getting out of a single Heroku dyno on a dynamic page with no caching:
Edit: formatting.

$ ab -n 1000 -c 20 https://*****-staging.herokuapp.com/**********
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking *****-staging.herokuapp.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests

Server Software:
Server Hostname:        *****-staging.herokuapp.com
Server Port:            443
SSL/TLS Protocol:       TLSv1/SSLv3,AES256-SHA,2048,256

Document Path:          /**********
Document Length:        9670 bytes

Concurrency Level:      20
Time taken for tests:   7.130 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      10034000 bytes
HTML transferred:       9670000 bytes
Requests per second:    140.25 [#/sec] (mean)
Time per request:       142.606 [ms] (mean)
Time per request:       7.130 [ms] (mean, across all concurrent requests)
Transfer rate:          1374.25 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:       55   59  31.7     58    1057
Processing:    37   82  43.8     66     308
Waiting:       35   74  42.7     57     298
Total:         92  141  53.4    124    1096

Percentage of the requests served within a certain time (ms)
  50%    124
  66%    138
  75%    153
  80%    166
  90%    199
  95%    239
  98%    282
  99%    301
 100%   1096 (longest request)

> this is the kind of performance we're getting out of a single Heroku dyno on a dynamic page with no caching
If you read the original article[0], you would know that this is a problem that only affects apps with a large number of dynos.
I have not done queuing theory in a long time, but my initial sense is that the math on this one is a generalization of the birthday problem [1], which is Wiki-notable on the sole basis that the probability of sharing a birthday (or, in our case, the probability of queueing a request) is far, far higher than ordinary people anticipate for N above 23. Assuming I've captured the essence of the problem correctly, you would see a sharp drop in performance when you start to saturate around 20-30 dynos.
Given that there's an entire Wikipedia article about these functions on the sole basis that their behavior is nonintuitive, I think it is pretty fair to give RapGenius a pass for being surprised by the math as well.
[0] http://rapgenius.com/James-somers-herokus-ugly-secret-lyrics [1] http://en.wikipedia.org/wiki/Birthday_problem
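That birthday-problem intuition is easy to make concrete. If each in-flight request lands on a uniformly random dyno, the chance that at least two of k simultaneous requests collide on the same single-threaded dyno out of d is 1 - prod(1 - i/d), exactly the birthday calculation. A small sketch (the dyno and request counts are illustrative, not RapGenius's real numbers):

```ruby
# Birthday-problem view of random routing: probability that at least two of
# `in_flight` simultaneous requests land on the same one of `dynos`
# single-threaded dynos.
def collision_probability(dynos, in_flight)
  p_all_distinct = (0...in_flight).inject(1.0) { |p, i| p * (1.0 - i.to_f / dynos) }
  1.0 - p_all_distinct
end

# With 50 dynos, just 10 simultaneous requests already collide
# (i.e. someone queues behind someone else) more often than not:
puts collision_probability(50, 10).round(3)   # => 0.618
```

As with birthdays, the crossover comes much earlier than intuition suggests, which fits the guess above about performance dropping sharply once you saturate a few dozen dynos.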
I did read the original article. The problem is that their stack is not concurrent.
In a non-concurrent web application stack (like Rails), one request is processed at a time and further requests to the same node are queued. This means that if some request takes five seconds to answer, everybody that is queued on that node after that long request has to wait until the first request is fulfilled. That's the behavior they're seeing.
In a multithreaded or reactive web stack, other requests will get processed alongside the long request and, guess what, the problem doesn't happen unless all worker threads are processing long requests because the short requests will get processed alongside the long one by the other workers.
Assuming your stack has, say, 20 worker threads, the probability of your random load balancer overloading your node with 60 long requests given a large enough pool is small, assuming long requests are a small fraction of your load. If your concurrency level is 1, the probability of your node getting overwhelmed by long requests is much higher.
You can see it this way: if you have a stack that can only process one request at a time, the probability of that one request processor getting backlogged is like getting three heads in a row on an unbiased coin. If you have twenty request processors, the node is only backlogged if all twenty of them are, which is like getting three heads in a row twenty times over. Much less likely to happen.
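That coin-flip argument can be checked with a quick seeded simulation: route one request per tick to a random node and count how often a request lands on a node whose workers are all busy, once with 1 worker per node and once with 20. All numbers here (50 nodes, a 10% share of slow requests, the durations) are made up for illustration:

```ruby
# Toy simulation: one request arrives per tick and is routed to a random node.
# A request "waits" if every worker on its node is already busy when it lands.
# Node/worker/request counts and the fast/slow mix are illustrative only.
def simulate(nodes:, workers_per_node:, requests:, seed:)
  rng  = Random.new(seed)
  busy = Array.new(nodes) { [] }   # per-node list of in-flight finish times
  waited = 0
  requests.times do |t|
    node = rng.rand(nodes)
    busy[node].reject! { |finish| finish <= t }    # retire completed work
    waited += 1 if busy[node].size >= workers_per_node
    duration = rng.rand < 0.1 ? 50 : 2             # 10% slow requests
    busy[node] << t + duration                     # (queue delay itself ignored)
  end
  waited.to_f / requests
end

single_threaded = simulate(nodes: 50, workers_per_node: 1,  requests: 20_000, seed: 42)
twenty_workers  = simulate(nodes: 50, workers_per_node: 20, requests: 20_000, seed: 42)
puts format("waited: %.1f%% vs %.1f%%", single_threaded * 100, twenty_workers * 100)
```

With one worker per node, a noticeable fraction of requests land behind in-flight work; with twenty workers per node, the same request stream essentially never queues.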
They were told to run Unicorn, which from my understanding just forks the Ruby interpreter a couple of times to run in parallel. They decided not to (or were unable to).
They decided instead to whine about the problem and ask Heroku to build some magic load balancer that would solve all their problems. Even if they did have a load balancer that did least-conns, all of Heroku's traffic does not go through a single load balancer, meaning that separate load balancers could, through bad luck, allocate their requests to the same unfortunate node. [1]
What they did is amateurish; instead of looking at the problem and fixing it by either multithreading their code or switching away from RoR, they blamed their vendors, just like beginning programmers blame their bugs on the compiler or the libraries they use. When Twitter needed to scale, they moved some of their stuff away from Rails to Scala; Facebook wrote HipHop, their PHP-to-C++ transpiler; etc.
Was Heroku completely in the clear? No. Their documentation was misleading and I believe they've admitted that. Was it a problem that New Relic didn't show all the metrics needed to isolate the performance issue? Yes.
We'll see how this whole story unfolds, but from my perspective, the more of a stink RapGenius raises, the more amateurish they look.
I'm not Heroku's biggest fan, and haven't used it for more than a couple of one-off fiddles to play with the platform.
But my sympathy is going to them, because what I see coming from Rap Genius looks like a classic blame game. So a vendor's documentation was unclear and your server sucked publicly for some time? Shameful. You didn't know about it because you expected your vendors to give you extra hand-holding? That's really rough. Instead of fixing the issues and moving on, you make it the one thing that everyone thinks about when your company is mentioned... that might not be in your best long-term interests.
After this, I would be hesitant to enter into any sort of relations with Rap Genius, and I'm not that sure of what they do or what their product is.
I'm having a hard time understanding your justification for your sympathy going to Heroku.
RG is paying for PaaS from Heroku based on documentation, sales pitches, etc. They're also paying good money for the tools necessary to make business decisions based on data collected from that PaaS. Just given the realm of customer service, why wouldn't you expect "hand-holding" from your vendor? Why is it unreasonable to have that expectation? Why is it acceptable for your vendor's fallback response to be "optimize your web stack"? How do you expect them to "fix" this problem without the vendor's involvement? What did you expect them to do, change platforms? How are they supposed to "move on" when the issue hasn't been resolved?
Have we gotten so far away from customer service with the likes of Google, that we don't even know what that means anymore? Are we to settle for mediocrity from any PaaS because our expectations are just too high?
To take an analogy from another field, this is like a cinematographer getting bent out of shape and angry at the camera leasing company because he used a film stock his camera wasn't correctly calibrated for. From what I've read about it, it sounds like the setup on the servers had some issues that made the queue time issue into a problem where it wouldn't necessarily be for another, or even most other, customers. If you are, or want to be, known as a technical company, that means you have to take responsibility for your mistakes, even if your mistake was thinking that the vendor was going to catch and fix all the mistakes and bugs in the service that they are providing to you. Building web services that are performant, stable and economical is hard work; you can't take your eye off the ball, and you can't just blindly hope that vendors will take care of your shit for you. You have to know what you're buying, and what you should be getting.
And if you're going to go public with your complaints, it's best to do so in an understated, fact-based manner. In this case, Rap Genius comes off like a guy screaming at the waiter in a fancy restaurant. They may be displeased, they may even be right that the choucroute en sel is just salted cabbage; but lots of people around them think they're making an ass of themselves.
We were promised flying cars and got online Rap lyrics instead.
Here's the other side of the story, from Heroku: http://venturebeat.com/2013/02/28/heroku-chief-opens-up-abou...
Just curious: after all this mess, why didn't Rap Genius recommend Engine Yard (a Heroku competitor)? Is it because they had similar issues there too, or did they simply not consider switching to a different provider altogether?
Seems like it would just muddy the waters further if they recommended someone else.
Did they seriously sell a gem, New Relic, as a diagnostic tool that flat-out makes up queuing and response latency numbers on requests to their platform? If this is true then hell yes, they need to refund all their customers!
New Relic is a third party tool that Heroku resells. The numbers aren't made up; they are measured, but in the wrong place. The result is still wrong numbers, but it's not obvious where to pin the blame.
So what happens when Heroku says "Ok, fine, we can't give you the service you want, please download any data you want to keep and we'll re-allocate those resources to our other customers in 60 or 90 days." ?
This has taken on the patina of a really huge fight between operations and engineering, with nobody to step in and say "Hey, we both want to make progress here, let's see what we can do." There is no common point of contact here, sadly.
What is the end goal? One of these companies being out of business? What? It's pretty clear that Heroku doesn't have any ideas on how to implement routing the way Rap Genius believed it worked; they even said as much. So what is the next step?
For $60,000 per month they can't create a mode where all your dynos are behind a single HAProxy with "intelligent" least-connections load balancing?
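For what it's worth, least-connections balancing is a single directive in stock HAProxy. A minimal, hypothetical config sketch (the backend name, server names, addresses, and ports here are invented, and this ignores the multi-router scale Heroku actually operates at):

```
frontend web
    bind *:80
    default_backend dynos

backend dynos
    # "Intelligent" routing in the sense discussed here: each request
    # goes to the server currently holding the fewest open connections.
    balance leastconn
    server dyno1 10.0.0.1:5000 check
    server dyno2 10.0.0.2:5000 check
```

The hard part isn't the directive; it's that a single HAProxy instance with a global view of connection counts becomes a bottleneck and single point of failure at Heroku's scale, which is presumably why they moved to many independent routers.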
That's what they said; now I'm trying to find it again. When they talked about changing Bamboo, they said, "we can't get this to scale, so we switched paths; sorry we didn't document it well."
Heroku should make this right if they want long term success.
oh snap! R to the G startin some beef! when's the freestyle rap battle going down?
I'm sorry, but Tom Lehman sounds like a real dick to me in this interview. Heroku fucked up royally, sure, but why does RapGenius have to keep bashing them even after they started fixing things?
What did they fix? They've sidestepped any real solution to the root problem. As such, the only thing they've "fixed" is New Relic, by making it report what is actually happening.
1. They acknowledged the problem
2. They wrote several blog posts explaining what happened and what is going to happen now (fixing) and in the future (more fixing)
3. They fixed their documentation
4. They helped a third party service to adapt their offering to better help their customers (NewRelic)
5. They offered their advice for better solutions for affected customers (Unicorn)
This sounds a lot like fixing to me.
And judging from what they've done until now, this probably won't be the end of it. So why not just talk to them directly and see if it's enough for you, and if not, just go somewhere else?
We just seem to have different definitions of "fix". Fix, to me, implies the issue goes away. Are 1, 2, and 3 important? Yes. 4 should never have been an issue to begin with. And 5 is a non-solution given that simply adding more lines of execution does not address the root problem.
In no way have they solved the actual issue (a poor queuing strategy). And so even if you now know that you're getting awful performance due to queuing and you even try to get a multi-threaded strategy going per their suggestion, you will see the exact same issue at scale. That is not a fix.
Their stance on actually implementing a strategy that removes the root issue has been one of silence. Suggesting that "this probably won't be the end of it" isn't useful if you're running a business that relies on Heroku. If that isn't the end of it, then they should be far more communicative about the steps they're taking. Given their blog posts, we have no evidence that further solutions to this problem are being worked on or that they even acknowledge it's something they should fix.
So no, I do not agree with you that that is a lot of "fixing".
> And 5 is a non-solution given that simply adding more lines of execution does not address the root problem.
Actually more threads of execution does solve the problem. The difference with just doubling the number of dynos is that on a single dyno requests can be routed intelligently. The reason why random routing sucks is that request processing times have a fat tailed distribution: there is a small but still significant chance that a request takes really long. If you have that request routed to a random single threaded dyno, then all further requests routed to that dyno have to wait very long before they can be processed. If however you had multiple threads of execution on the dyno, the other requests would simply go to the other thread of execution. So now there would only be blocking if a single dyno gets N really long requests at roughly the same time, where N is the number of concurrent threads the dyno is running. The probability of getting N expensive requests to the same dyno at approximately the same time decreases very fast with increasing N.
Hand waving ahead! Let's say the probability of an expensive request blocking a dyno is p = 2%. Then if you double the number of dynos the probability of blocking a dyno is now p/2 = 1%. If however you have two execution threads on each dyno, the probability of blocking a dyno is now p^2 = 0.04%. If you have 10 execution threads it is p^10, which is very small indeed.
Here is a paper about it which makes that intuition precise and shows that even N=2 is a massive improvement over N=1: http://www.eecs.harvard.edu/~michaelm/postscripts/handbook20...
The problem is that this only works if each concurrent process of your application doesn't use too much memory, since the available memory on one dyno is quite low. For many applications you can't easily have multiple threads of execution on one dyno. The real solution is to have some form of intelligent routing. As the hand waving and the paper above shows, you can make groups of dynos, and then the main router routes to a random group, and within each group requests are routed intelligently. You can take the size of a group to be a small constant, say 10 dynos. So there shouldn't be any scalability problems with this routing approach. If you take the group size small enough, you could even run each group of dynos on a single physical machine, which would make intelligent routing among them even simpler.
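The hand-waving above is easy to check with a toy Monte Carlo simulation. This is a sketch under simplified assumptions (routing is random, each thread on the chosen dyno is independently busy with a slow request with probability p_slow, and a request blocks only if every thread is busy); it is not a model of Heroku's actual router:

```python
import random

def blocked_fraction(threads_per_dyno, num_requests=100_000, p_slow=0.02):
    """Fraction of randomly routed requests that land on a blocked dyno.

    Toy model: each of the N threads on the chosen dyno is independently
    busy with a slow request with probability p_slow, so a new request
    blocks with probability p_slow**N.
    """
    blocked = sum(
        all(random.random() < p_slow for _ in range(threads_per_dyno))
        for _ in range(num_requests)
    )
    return blocked / num_requests

random.seed(0)
# Blocking rate stays near p_slow for N=1 but drops to ~p_slow**2 for N=2.
print(blocked_fraction(1))  # close to 0.02
print(blocked_fraction(2))  # close to 0.0004
```

Even N=2 cuts the blocking rate by roughly a factor of fifty in this model, which matches the intuition (and the paper's result) that a little concurrency per dyno goes a long way.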
This post should be stickied at the top of every Heroku queueing thread. People keep acting like the "intelligent routing" system is trivial to build and has no overhead, both of which are patently false. It's clear that they can't go back to the old method with their newer (since 2011) architecture, so the solution is for apps to fix their own performance issues.
What I said pretty much implies the opposite, so you may want to retract your endorsement ;-) There are various solutions to this problem but almost all involve some action on Heroku's part.
We have, indeed.
I never expected them to completely rebuild their service because some customers (a very small minority, I assume) aren't totally happy and satisfied with their product. That clearly sucks for the affected people.
It's a reason for them to leave the product and platform and go somewhere else, where the problem is not an integral part of the product. But it's not a reason to be a dick.
I'm not so quick to think that Heroku can't improve their routing. www.rapgenius.com resolves to four IP addresses/routers; why not one or two?
> 4. They helped a third party service to adapt their offering to better help their customers (NewRelic)
Unless I misunderstand the situation, NewRelic's Heroku reporting isn't some one-sided third-party service but rather something that at least seems to be jointly produced by Heroku and NewRelic.
NewRelic can't report something that isn't offered up and it would seem to me that Heroku needs to deliberately expose metrics to the NewRelic plugin for it to be able to pick them up.
As it seems to be that these queue times weren't reported anywhere developer accessible it also stands to reason that they weren't exposed to NewRelic.
So no, Heroku didn't fix some third-party service; they fixed their own service (in this regard).
I'm not entirely sure whether the headers the new version of the plugin uses were available before, but it sounded like they were. NR wasn't aware that the one they were using didn't include the queueing time before the dynos, and Heroku has now helped them fix that.
So yeah, probably Heroku fixed their part and made sure NewRelic reflected that.
How exactly has Heroku fixed things? I think we can summarize Heroku's response to the whole affair as "oops! we got caught... sorry!"
Especially the remarks about the costs and alleged fails of NewRelic seem totally wrong to me.
As a very happy NewRelic customer, I can say they did exactly what they advertise: Help monitor the application performance in the server (!). The queueing that now seems to be a problem of Heroku doesn't happen in the server that is processing the request, so by default it can't show the time needed.
Actually I'm quite sure that using one part of NewRelic, RUM (real user monitoring), should have shown the problem quite obviously. It shows how long a user had to wait for the request to complete, including DNS lookup and network time. So if users waited longer for responses than the backend time indicated, every developer should have taken this as a hint to investigate further.
Well, even just using the application should have been enough to know that something is wrong when NR reports 250ms of backend time but the page needs at least 1200ms to return its first byte to the customer.
Errrm, unless I'm entirely mistaken the problem is that the queueing does happen in the server that's processing the request, and New Relic just doesn't report it.
Well, as I understood it, the queueing happens in some kind of load balancer that is responsible for routing requests between the different servers ("dynos" in Heroku speak) that handle them. New Relic is a plugin for your server and hooks into Apache and PHP (or in this case Rails) to learn how long it takes to process requests (and retrieve data from the database and/or a cache). This means, to me, that queueing is strictly out of the scope of what NewRelic normally does.
However it's great that they and Heroku now found a way to report the correct queueing time. As far as I understood it, they use a special header added by Heroku to calculate the time themselves and report it in their dashboard.
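As a sketch of how that header-based calculation could work (assuming, as Heroku documents for its X-Request-Start header, a router-stamped Unix timestamp in milliseconds; the function name here is invented, and a real implementation also has to contend with clock skew between router and dyno):

```python
import time

def queue_time_ms(x_request_start: str) -> float:
    """Rough queue-time estimate from a router-stamped request header.

    Assumes the header value is a Unix timestamp in milliseconds taken
    when the router first saw the request. Clock skew between the router
    and the dyno can push the difference negative, so we clamp to zero.
    """
    return max(0.0, time.time() * 1000 - float(x_request_start))

# Simulate a request that the router stamped 120 ms ago:
stamped = str(time.time() * 1000 - 120)
print(round(queue_time_ms(stamped)))  # roughly 120
```

The app (or the monitoring plugin inside it) compares the router's timestamp against its own clock on arrival; everything in between is time spent queued and in transit, which is exactly the component the old dashboards were missing.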
I would be pretty pissed if I had sunk tens of thousands of dollars and countless hours chasing ghosts. If you're a startup every dollar and every hour lost is especially costly. If Rap Genius ends up going under from running out of money it's impossible to say that this Heroku nonsense isn't at least partially to blame. If Heroku didn't give them the run-around they would have jumped onto EC2 and this problem and the costs it caused would have been completely avoided.
Tom is from New York. He's also a founder. Both roles typically imply/require a certain amount of dickishness.
Thing is, somebody had to take Heroku to task over this, and until they fix the problem somebody has to keep taking them to task.
I worked in the office beside Tom's for a year (pre-Rap Genius). He's a sharp guy. More importantly, he's right. I don't think being nice has any relevance to Rap Genius' bottom line.
If I have a problem with my ISP or my car dealer, nobody bats an eye at a blog post or an online complaint about the service.
But if I'm a customer of an admired former startup to whom I pay hundreds of thousands of dollars a month, I'm not allowed to go public with my complaints when I--and maybe hundreds of others--have been deceived and have suffered intentionally worse service than what I was promised?
I find the "enforced positiveness/optimism" of the startup community very disheartening. The essence of engineering is honesty (preferably quantified) about capabilities and limitations of systems. In this case, a former startup owned by a public company deceived their customers and then papered over (my impression) a valid, quantitatively-documented customer complaint once it became public.
Tom should be commended for speaking out. If he's right, dozens of startups have spent far more of their precious and limited capital on excess dynos and monitoring tools that could have been better spent elsewhere. I can't imagine a better service to the startup community than making this sort of thing public.
Are you saying being from New York implies "dickishness"?
It's a self-preservation thing. Many New Yorkers are nice by default, but don't you dare fuck with us.
Or what, exactly? I recently moved to NYC from Iowa and I haven't noticed much of a difference except more aggressive driving.
This is exactly what I mean. Thanks.
We're just very direct. There's not much value placed on unnecessary politeness or platitudes. The biggest social faux pas to a New Yorker is wasting our time.
Some people see that as us being rude and others actually appreciate it.