This is not an incredibly complex thing to do, but I recently published a few graphs of reddit commenting statistics on Twitter and people asked how I did it. First, here are the graphs:
There are many easy ways to analyze reddit comments, but I chose one that’s a little less efficient: writing two sets of scripts, one to analyze the comments and one to visualize them. This way I can use a really fast language for the analysis (Go) and something more traditional for the visualization (JavaScript with D3). I’m a little new to both, having written and modified existing code in both languages, but I knew I’d learn a lot about new projects if I used both of them.
The gist of things is this:
- Download all the reddit comments for the last year
- Iterate on the analysis, eventually outputting a JSON file
- Visualize the JSON in D3
- Take notes along the way
Analysis
You can download all the comments at pushshift.io. If you download a lot, I suggest you make a donation to them. I downloaded all of 2016, and every month I get the previous month’s new 2017 comments.
I will say here that you can probably use BigQuery to do this analysis; pushshift has a great tutorial on it. I could have done that, or set up a Hive instance and queried the data, or done any number of other things, but for flexibility and an easy introduction to the data, I use Go.
In Go, I scan every line of every file and run my totals or sums or whatever. It’s all custom code. I dabbled with converting everything to protobuf (notes here, here, here), but that didn’t offer any benefit in speed, and the space savings were underwhelming. I also dabbled in concurrency, but this is a fairly linear process and I don’t think the speedup would be worth the dev time. Since my analysis focuses on a single subreddit, I should have extracted all the comments for that subreddit and just parsed those files, but I’m a procrastinator so I didn’t do that.
I currently run one “analysis” (extraction) function and two aggregation functions: one to count the number of unique commenters per day, and a second to count the number of comments authored by “[deleted]”, which is a proxy for removed comments. These are simple counts or sets that are stored in a map keyed by date and incremented or added to when needed. (Code to extract data here, aggregation here.)
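The two aggregations could be sketched roughly like this: a map from date to a set of authors, and a map from date to a counter. The type and method names are hypothetical, the timestamp is assumed to be Unix seconds, and leaving “[deleted]” out of the unique-commenter set is an assumption rather than something the linked code is known to do.

```go
package main

import (
	"fmt"
	"time"
)

// Comment holds only the two fields these aggregations need.
// The field names are illustrative.
type Comment struct {
	Author     string
	CreatedUTC int64 // Unix seconds (assumed)
}

// DailyStats keeps one author set and one deleted counter per day.
type DailyStats struct {
	authors map[string]map[string]struct{} // "2006-01-02" -> set of authors
	deleted map[string]int                 // "2006-01-02" -> "[deleted]" count
}

func NewDailyStats() *DailyStats {
	return &DailyStats{
		authors: make(map[string]map[string]struct{}),
		deleted: make(map[string]int),
	}
}

// Add files one comment under its UTC calendar day.
func (s *DailyStats) Add(c Comment) {
	day := time.Unix(c.CreatedUTC, 0).UTC().Format("2006-01-02")
	if c.Author == "[deleted]" {
		s.deleted[day]++
		return // assumption: "[deleted]" is not counted as a commenter
	}
	if s.authors[day] == nil {
		s.authors[day] = make(map[string]struct{})
	}
	s.authors[day][c.Author] = struct{}{}
}

// UniqueCommenters returns the size of the author set for a day.
func (s *DailyStats) UniqueCommenters(day string) int { return len(s.authors[day]) }

// Deleted returns the number of "[deleted]" comments seen for a day.
func (s *DailyStats) Deleted(day string) int { return s.deleted[day] }

func main() {
	s := NewDailyStats()
	for _, c := range []Comment{
		{"alice", 1483228800},     // 2017-01-01 00:00:00 UTC
		{"alice", 1483232400},     // same day, one hour later
		{"[deleted]", 1483232400}, // same day
	} {
		s.Add(c)
	}
	fmt.Println(s.UniqueCommenters("2017-01-01"), s.Deleted("2017-01-01"))
}
```

Using an empty struct as the set value keeps the per-author cost to the map key itself.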
One fun trick I use is to see if the bytes of a line contain the subreddit ID code before I parse the JSON. This means I don’t parse most of the JSON lines in the input, which would take a lot of time. (Here’s the code.)
The output from my Go program is very simple: just a JSON file that I can then feed to the visualization side of things.
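A sketch of that byte-level pre-filter, under a few assumptions: each line is one JSON comment with `subreddit_id` and `author` fields, and `t5_example` is a made-up ID. The function name is hypothetical. Since the ID could also appear inside a comment body, this version re-checks the parsed field after the cheap test passes.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Comment declares only the fields this sketch reads.
type Comment struct {
	SubredditID string `json:"subreddit_id"`
	Author      string `json:"author"`
}

// parseIfSubreddit skips json.Unmarshal entirely unless the raw line
// contains the wanted ID. The bool reports whether the parsed comment
// really belongs to that subreddit.
func parseIfSubreddit(line, id []byte) (Comment, bool, error) {
	if !bytes.Contains(line, id) {
		return Comment{}, false, nil // cheap rejection, nothing parsed
	}
	var c Comment
	if err := json.Unmarshal(line, &c); err != nil {
		return Comment{}, false, err
	}
	// The ID may have matched text in the body, so confirm the field.
	return c, c.SubredditID == string(id), nil
}

func main() {
	id := []byte("t5_example") // hypothetical subreddit ID
	lines := [][]byte{
		[]byte(`{"subreddit_id":"t5_example","author":"alice"}`),
		[]byte(`{"subreddit_id":"t5_other","author":"bob"}`),
	}
	for _, l := range lines {
		c, ok, err := parseIfSubreddit(l, id)
		if err != nil {
			panic(err)
		}
		fmt.Println(ok, c.Author)
	}
}
```

The trade-off is that lines rejected by the byte check are never validated as JSON, which is fine when the goal is only to count comments from one subreddit.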
Visualization
I chose to use D3 since it’s very well known. I also wanted to do more with a “real” JavaScript project, including a build system, a test server, and maybe hot reloading. For all the complaining you hear online about JS frameworks and how fast they change, they are a real upgrade from my previous world of hacked-together scripts, manual refreshing and ad hoc file serving.
I started with a JS/D3/Python dev server setup and have migrated to WebPack, Babel, ES2015 and NPM. My top two priorities were a fast build system (fast in the human sense of fewer commands and more automation) and the ability to break up my files into many modules. Setting everything up took a while, but it was worth it. (Help on that here and here.)
D3 is tough, but Mike Bostock is a superhero and basically provides an example for everything. I searched for “bar chart” and found an example, which I studied and adapted into partially new code. Thanks dude!
From there, it was easy: a little bit of JS and a few lessons in D3, and you can see the charts are generally pretty simple.
Please feel free to fork or change or submit a PR to either of the repositories!
As a bonus, my latest graph since I started writing this piece: