All the open source code in GitHub now shared within BigQuery: Analyze all the code!

All the open source code in GitHub is now available in BigQuery. Go ahead, analyze it all. In this post you’ll find the related resources I know of so far:

Update: I know I said all — but it’s not all. I’m updating the answers to these and other questions at github.com/fhoffa/analyzing_github.

The pipeline mirrors code from:

Projects that have a clear open source license.
Forks and un-notable projects are not included.
Nevertheless, the dataset still represents terabytes of code.

Official sources:

In-depth analysis

Read Francesc’s step-by-step guide to analyze Go code. Use these patterns for any other language too :).
Run a full JavaScript static code analyzer within a SQL query: Running JSHint inside BigQuery.
Java imports: Most used Java imports, from 2013 to 2016.
Top Angular directives.
Tabs or spaces (the holy wars).
SQL commas — leading or trailing?

I’m waiting for your contributions — I will add them here:

One hour after the dataset announcement, @thomasdarimont was able to find all the Java projects that declare a certain dependency.
“Popular Java projects on GitHub that could use some help” (analyzed using BigQuery and Dataflow).
“What can we learn from million lines of Groovy code on Github?”.
Filippo Valsorda “Analyzing Go Vendoring with BigQuery”.
Go project uses BigQuery stats to guide design decisions, more than once.
David Gageot analyzes 281,212 Docker projects.
uses R to cluster R packages.
compares most popular gems according to Rubygems.org download data vs GitHub gem calls.
looks at the most popular npm packages and trending keywords. performs a similar analysis. Sergey follows up with a deeper assessment on why almost empty packages duplicate all over GitHub. also analyzes Angular vs React messages.
Brent Shaffer analyzes PHP code and libraries — also test coverage for different languages.
A full rundown by “Yet another analysis of Github data with Google BigQuery”.
informs the travis-ci team on the counts for Node versions tested.
reviews 779,236 Java Logging Statements, 1,313 GitHub Repositories to determine “ERROR, WARN or FATAL”?
“Naming conventions in Python import statements”. Then “Naming conventions in Python def function()”.
“Analyzing half a million Gradle build files — Guillaume Laforge’s Blog”, 2017 “Gradle vs Maven and Gradle in Kotlin or Groovy”
@anvaka “analyzed ~2TB of code to build an index of the most common words in programming languages”. Cool visualizations, full code on GitHub, and a lot of comments on reddit.
comes back, linking code to StackOverflow.
finds all kinds of metrics for Puppet.
tells us how Googlers used BigQuery and GitHub to patch thousands of vulnerable projects (HN).
found the top imports in Jupyter (.ipynb) notebooks.
went for the top Clojure libraries.
went searching for Stack Overflow code that shows up in GitHub projects.
found all the constant regular expressions in Go — to improve Go’s regex capabilities (article).
Matt Warren analysing C# code on GitHub with BigQuery.
“State of npm scripts” (queries).

A series of posts by Robert Kozikowski:

Advanced GitHub search with BigQuery.
Top emacs packages used in GitHub repos.
Visualizing relationships between python packages.

Tips

Don’t analyze the main [bigquery-public-data:github_repos.contents] table: at 1.5 TB, a single full scan will use up more than your monthly free terabyte. Instead, use the official [bigquery-public-data:github_repos.sample_contents] extract (~23 GB), or one of the full language tables I left at [fh-bigquery:github_extracts.contents_*].
How about doing a JOIN between this new dataset and the GitHub Archive to find the most starred files and their patterns? Sample code soon, but see how I played with GitHub stars and Hacker News previously.
I’m pretty excited about getting author and committer timezones. We’ll be able to perform some regional analysis here.
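
To make the first tip concrete, here is a rough back-of-the-envelope sketch in Python. It only uses the sizes quoted above (1.5 TB, ~23 GB, 1 free TB per month) and treats a terabyte as 10^12 bytes, which is my assumption. The query string is illustrative too: it uses standard SQL table notation (backticks and dots rather than the legacy [project:dataset.table] form), and the column name `sample_path` is taken from my recollection of the public schema, so check it in the BigQuery UI before relying on it.

```python
# Sizes as quoted in the tips above; decimal units are an assumption.
TERABYTE = 10**12
GIGABYTE = 10**9

FREE_MONTHLY_BYTES = 1 * TERABYTE

# Approximate full-table sizes. BigQuery bills by the columns a query reads,
# so these are upper bounds for a single query, not exact costs.
TABLE_BYTES = {
    "bigquery-public-data.github_repos.contents": int(1.5 * TERABYTE),
    "bigquery-public-data.github_repos.sample_contents": 23 * GIGABYTE,
}


def full_scans_within_free_tier(table: str) -> int:
    """Return how many full scans of `table` fit in the free monthly quota."""
    return FREE_MONTHLY_BYTES // TABLE_BYTES[table]


# Illustrative query against the small sample table (hypothetical example:
# count sampled files whose path ends with a given suffix). @suffix is a
# named query parameter to be supplied by whichever client runs the query.
SAMPLE_QUERY = """
SELECT COUNT(*) AS files
FROM `bigquery-public-data.github_repos.sample_contents`
WHERE ENDS_WITH(sample_path, @suffix)
"""

if __name__ == "__main__":
    for name in TABLE_BYTES:
        print(name, "full scans per free month:", full_scans_within_free_tier(name))
```

If you run queries through the `google-cloud-bigquery` Python client, a dry run (`QueryJobConfig(dry_run=True)`, then reading `total_bytes_processed` on the job) should tell you how many bytes a query would scan before you spend any quota on it.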

Visualizations

Google Data Studio 360 dashboard (previous post about Data Studio).
