Measure Anything, Measure Everything (2011)
codeascraft.etsy.com
Shameless plug: For those of you using Celery, you can jump in rather quickly with this decorator I made that increments task counters and sends how long each task took to statsd:
http://www.charleshooper.net/blog/painless-instrumentation-o...
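For readers who can't follow the link, here is a minimal sketch of the general idea: a decorator that bumps a counter and reports a duration for every call. It is not the decorator from the linked post. The names `instrument` and `send_metric`, the metric suffixes `.count` and `.duration`, and the address 127.0.0.1:8125 are assumptions made for illustration.

```python
import functools
import socket
import time

# Assumed address: StatsD conventionally listens for UDP on port 8125.
STATSD_ADDR = ("127.0.0.1", 8125)
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)


def send_metric(line):
    """Fire-and-forget one StatsD line over UDP; never let metrics break a task."""
    try:
        _sock.sendto(line.encode("ascii"), STATSD_ADDR)
    except OSError:
        pass


def instrument(name, send=send_metric):
    """Hypothetical decorator: count each call and report its duration in ms."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_ms = int((time.perf_counter() - start) * 1000)
                send(f"{name}.count:1|c")                 # counter
                send(f"{name}.duration:{elapsed_ms}|ms")  # timer
        return wrapper
    return decorator


@instrument("tasks.add")
def add(a, b):
    return a + b
```

With Celery, a decorator like this would typically sit underneath the task decorator so that the wrapped function is what gets registered as the task.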
"That which is measured improves."
Be wary of measuring - and hence improving - the wrong thing.
Sometimes you optimize for the wrong metric. The classic example is measuring programmer output by lines of code. There are many more subtle ways this can manifest, though.
We know that measuring programming output by lines of code is bad because we measured that measure.
So, measure everything, including your measures.
When you have a tape measure, everything is something which must be measured.
Is it just me, or is Graphite a HUGE pain in the ass to set up? Is there a better option for these graphs? I really want to use statsd in a few projects, but it looks like I'm going to have to set aside 8 hours to figure out how Graphite works (and I am a Django developer).
I set up Graphite+StatsD for a project back in the spring of 2011, it was a ton of trouble. StatsD was easy, but Graphite was really hard. It was a challenge to find up-to-date documentation where big sections weren't marked simply "TODO". I also never solved a bug where my counters were getting graphed as tiny numbers like 0.02 rather than 1, despite repeatedly wiping the box and experimenting with different config settings. Things seemed to be a little better about 6 months later, but I still couldn't solve the 0.02 problem. I love the simplicity of StatsD, and I liked how Graphite let me use the URL API to build very useful dashboards for both technical and business metrics, but the setup was a big hassle. I spent days fighting with it, a big cost for a startup. I'd use StatsD again, but with a different database/frontend, probably something hosted.
If you're looking to use Graphite but don't want to manage it yourself then I highly recommend hostedgraphite.com. I and many friends at other companies use them and are very happy with the service. They're smart guys.
If you don't want to set up graphite and don't mind paying for a service, I've really enjoyed using Scout. (http://scoutapp.com) Cron jobs report to their app every 3 minutes and they have very nice ad-hoc graphs. They also have some indispensable configurable triggers that have saved us multiple times.
So I'm not being snarky in the least bit. Seriously. I'm just sharing.
I got graphite going on an Ubuntu installed on my macbook while in the air between SFO and ATL.
Literally, discovered it, downloaded it, and graphing metrics before we landed.
Now, that was, I think, in 2010. The last time I installed Graphite was a few months ago, and it seems like they have split the project up a bit. It took a bit longer.
NB: I'm neither a Python nor a Django 'guy'.
I've been using StumbleUpon's OpenTSDB (http://opentsdb.net) with much success, and it looks like StatsD could easily be modified to work with it. The biggest pain with OpenTSDB I've had is that it requires an HBase/Hadoop backend, but a standalone, single-node install isn't terrible to set up.
There are other graphing frontends now, but also you can check out companies like Librato who have statsd-compatible backends and do all the graphing and storage for you.
I was inspired by this post when I first read it a few months ago. Since then, I've used a modified version of StatsD to send data to an in-house realtime graphing engine. A lot of our tools are PHP backends, so it was super convenient to be able to drop the class in and start measuring things.
For anyone interested, I wrote a node.js process that takes arbitrary statsd-compliant data points and serves a socket.io-enabled front end for 'zero-config', realtime graphing.
We've found it useful internally for taking quick measurements on various projects. I was going to productize or open source the whole thing, but then life got in the way. Maybe it will see the light of day someday.
Does anyone here use this?
Yes, at IFTTT we use this. I've been pleasantly surprised at the amount of traffic statsd can handle, even without sampling.
We also believe in measuring everything you can. We're interacting with many APIs across many boxes. Statsd + graphite are the tools we use to understand what's happening at runtime.
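For anyone wondering what the sampling mentioned above looks like on the wire: the client sends only a fraction of events and tags each one with the sample rate so the server can scale the count back up. A rough sketch, where `sampled_counter` and the metric name are made up for illustration:

```python
import random


def sampled_counter(name, rate, rng=random.random):
    """Return a StatsD counter line for roughly `rate` of calls, else None.

    The trailing "|@rate" tells the server which fraction of events
    this packet stands in for.
    """
    if rng() >= rate:
        return None  # this event was not sampled; send nothing
    return f"{name}:1|c|@{rate}"


# About one call in ten produces a packet such as "api.calls:1|c|@0.1".
line = sampled_counter("api.calls", 0.1)
```

Skipping the packet on the client side is what saves the traffic; the rate suffix only exists so the totals still come out roughly right.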
Graphite has a lot of warts, but it's really powerful once you get used to it. There are plenty of pretty interfaces you can put over graphite, but nothing really matches it for ease of ad-hoc queries.
Typically I'll use graphite to view ad-hoc metrics and build reports. When I find I'm repeatedly viewing a particular graphite report then I'll "hard-code" it in gdash [1] for the rest of the team.
We use this combo to track thousands of separate metrics and we've been pretty happy with it so far.
When I was independently contracting as a systems engineer I convinced clients to invest some time into this whenever I could. For example, I had one client with a set of content aggregators (read: web crawlers) that grew and shrunk dynamically with no centralized logging. When I eventually convinced them it was worth the time, I fixed a variety of their logging issues and also introduced statsd and graphite.
Implementation was easy. statsd is pretty simple to deploy and graphite wasn't too difficult either. To add statsd reporting to your code, it's essentially one line to create the statsd socket, another line of code to declare each timer or counter, and another one to increment. I think more time was spent determining what name to give each metric than it was implementing it in this project.
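To give a feel for how little code that is, here is a bare-bones sketch of a client. The class name `StatsClient`, the prefix, and the metric names are assumptions for illustration; in a real project you would more likely reach for an existing statsd client library.

```python
import socket


class StatsClient:
    """Tiny illustrative StatsD client (counters and timers only)."""

    def __init__(self, host="127.0.0.1", port=8125, prefix=""):
        # Assumed defaults: a StatsD daemon on localhost, UDP port 8125.
        self.addr = (host, port)
        self.prefix = prefix
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def _format(self, name, value, kind):
        full = f"{self.prefix}.{name}" if self.prefix else name
        return f"{full}:{value}|{kind}"

    def _send(self, line):
        try:
            self.sock.sendto(line.encode("ascii"), self.addr)
        except OSError:
            pass  # metrics are best-effort; never take the app down

    def incr(self, name, count=1):
        self._send(self._format(name, count, "c"))

    def timing(self, name, ms):
        self._send(self._format(name, int(ms), "ms"))


stats = StatsClient(prefix="crawler")   # create the socket
stats.incr("pages_fetched")             # bump a counter
stats.timing("fetch_time", 182)         # report a duration in milliseconds
```

As the comment says, picking names like `crawler.pages_fetched` tends to take longer than writing the calls.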
Now that I'm at dotCloud, I'm working with a much larger distributed system and we use it here also. We liked it enough to build some statsd hooks onto our RPC layer we use for just about everything. Now every time a component makes a remote procedure call, a counter for that call is incremented and the response time is sent to statsd. It's been very useful for troubleshooting odd behaviors and correlating events across the platform.
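A rough illustration of what hooking an RPC layer can look like (this is not dotCloud's code): a proxy that wraps any client object and emits a counter plus a response time for each method call. `InstrumentedProxy`, the `rpc` prefix, and the `send` callable are all hypothetical names.

```python
import time


class InstrumentedProxy:
    """Wrap an RPC client so every method call emits a counter and a timer."""

    def __init__(self, target, send, prefix="rpc"):
        self._target = target
        self._send = send      # any callable that ships one StatsD line
        self._prefix = prefix

    def __getattr__(self, method):
        # Only reached for names not found on the proxy itself.
        attr = getattr(self._target, method)
        if not callable(attr):
            return attr

        def call(*args, **kwargs):
            start = time.perf_counter()
            try:
                return attr(*args, **kwargs)
            finally:
                ms = int((time.perf_counter() - start) * 1000)
                self._send(f"{self._prefix}.{method}.calls:1|c")
                self._send(f"{self._prefix}.{method}.response_time:{ms}|ms")

        return call
```

Because the metrics are emitted in a `finally` block, failed calls are counted and timed too, which is usually what you want when chasing odd behavior.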
As people who work with complex distributed systems, we can't know exactly what they're doing. We'll think we know, and sometimes we'll be close. Other times we'll think we know, and then we'll wake up at 2AM because something failed horribly. By being able to monitor the system's behavior (sometimes in gross detail), we can get a little closer to knowing what's really going on.
We use StatsD at SeatGeek (and at my previous job as well) to track as much as possible. In general we try to time each call to an external service and use counters for any exceptions with those services.
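One way to express that pattern, sketched as a context manager (not SeatGeek's actual code): time the block, and bump an error counter if it raises. `timed_call`, the `.errors` and `.time` suffixes, and the `send` callable are assumptions for illustration.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_call(service, send):
    """Time a block that talks to an external service; count exceptions."""
    start = time.perf_counter()
    try:
        yield
    except Exception:
        send(f"{service}.errors:1|c")  # counter for failed calls
        raise                          # the caller still sees the error
    finally:
        ms = int((time.perf_counter() - start) * 1000)
        send(f"{service}.time:{ms}|ms")


# Usage: collect the lines in a list here; in production `send` would
# write them to a StatsD socket instead.
lines = []
with timed_call("geocoder", lines.append):
    pass  # the call to the external service would go here
```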
We use StatsD and Graphite a ton at lonelyplanet.com: http://bit.ly/lp-perf-and-metrics
Why are there a negative number of cups of coffee remaining?
That's actually explained in the comments, but the short answer is that someone left the coffee pot off of the scale. My guess is that the scale is zeroed with the empty coffee pot on it.
IMHO, graphite is a PITA to set up, and the documentation is horrible.
If you're a Redis person, 37signals built a StatsD compatible version using EventMachine which stores data in Redis.