All the open source code in GitHub now shared within BigQuery: Analyze all the code!

Felipe Hoffa

All the open source code in GitHub is now available in BigQuery. Go ahead, analyze it all. In this post you’ll find the related resources I know of so far:

Update: I know I said all, but it’s not all. I’m keeping answers to questions about what is included, and to other questions, up to date at github.com/fhoffa/analyzing_github.

The pipeline mirrors only part of GitHub:

  • Only projects that have a clear open source license are included.
  • Forks and un-notable projects are not included.
  • Nevertheless, it represents terabytes of code.

Official sources:

In-depth analysis

I’m waiting for your contributions — I will add them here:

A series of posts by Robert Kozikowski:

Tips

  • Don’t analyze the main [bigquery-public-data:github_repos.contents] table directly: at 1.5 TB, a single full scan will use up more than your monthly free terabyte. Instead, use the official [bigquery-public-data:github_repos.sample_contents] extract (~23 GB), or one of the full language tables I left at [fh-bigquery:github_extracts.contents_*].
  • How about doing a JOIN between this new dataset and the GitHub Archive to find the most starred files and their patterns? Sample code soon, but see how I played with GitHub stars and Hacker News previously.
  • I’m pretty excited about getting author and committer timezones. We’ll be able to perform some regional analysis here.
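
To make the first tip concrete, here is a minimal Python sketch of the free-tier arithmetic, plus a query string aimed at the sample table. It is an illustration under stated assumptions, not official sample code: the sizes are the approximate figures quoted above, the free terabyte is rounded to 1000 GB, the table IDs are written in standard SQL form (dots and backticks instead of the bracketed legacy form), and the column names `sample_repo_name`, `sample_path`, and `content` are assumed rather than taken from this post. The helper names are hypothetical.

```python
# Rough free-tier arithmetic for the tables named in the tips above.
# Sizes are the approximate figures quoted in the post, not live measurements.
FREE_TIER_GB = 1000  # "monthly free terabyte", rounded to 1000 GB (assumption)

TABLE_SIZE_GB = {
    "bigquery-public-data.github_repos.contents": 1500,       # ~1.5 TB
    "bigquery-public-data.github_repos.sample_contents": 23,  # ~23 GB
}


def full_scans_in_free_tier(table: str) -> int:
    """Number of complete worst-case scans of `table` that fit in the free tier."""
    return FREE_TIER_GB // TABLE_SIZE_GB[table]


def build_sample_query(table: str, limit: int = 10) -> str:
    """Build a standard SQL query string; the column names are assumptions."""
    return (
        "SELECT sample_repo_name, sample_path\n"
        f"FROM `{table}`\n"
        "WHERE content LIKE '%TODO%'\n"
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    for name in TABLE_SIZE_GB:
        print(name, "full scans per month:", full_scans_in_free_tier(name))
    print(build_sample_query("bigquery-public-data.github_repos.sample_contents"))
```

With these figures, not even one full scan of the main table fits in the free tier, while the sample extract can be scanned in full about 43 times. The script only prints the query text, so you can paste it into the BigQuery console with standard SQL enabled and check the estimated bytes processed before running it.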

Visualizations
