Static analysis at GitHub
cacm.acm.orgI contributed to a similar system used within Google (partially open source at kythe.io), that took the very different approach of integrating with the language-native toolchain for each language.
As this article describes, doing this requires per-language integrations and also effectively being able to "run the build" for any given code (because e.g. the C++ header search path can vary on a per-source-file basis), which is untenable for a codebase as large and varied as GitHub's. However, if you can make it work, you get the benefit of having the compiler's understanding of the semantics of the code, which is especially finicky in complex languages like C++ or, say, Rust.
For example, if you look at this[1] method call it refers to a symbol generated by a chain of macros, but the browser is still able to point you at the definition of it.
It's an interesting tradeoff to make: the GitHub approach likely doesn't handle corner cases like the above but it makes up for it in broad applicability and performance. I recall an IDE developer once telling me they made a similar tradeoff in code completion, in that it's better DX to pop up completions quickly even if they're "only" 99% correct.
(To be clear, I absolutely think the approach taken in the article was the right one for the domain they're working in, I was just contrasting it against my experience in a similar problem where we took a very different approach.)
[1] https://source.chromium.org/chromium/chromium/src/+/main:v8/...
Note that this article describes our implementation of “search-based” or “ctags-like” Code Navigation, which definitely has the imprecision that you describe. We've also been working over the previous ~year on a framework called Stack Graphs [1,2,3], which lets us tackle “precise” Code Navigation while still having the zero-config and incremental aspects that are described in the paper.
The build-based approach that you describe is also used by the Language Server Protocol (LSP) ecosystem. You've summarized the tradeoffs quite well! I've described a bit more about why we decided against a build-based/LSP approach here [4]. One of the biggest deciding factors is that at our scale, incremental processing is an absolute necessity, not a nice-to-have.
[1] https://github.blog/2021-12-09-introducing-stack-graphs/
[2] https://dcreager.net/talks/2021-strange-loop/
I read about stack graphs before, it sounds interesting!
I think they help, but ultimately I expect you need a compiler solve the absolute madness of the totality of C++. For example I think getting argument-dependent lookup right in the presence of 'auto' requires type information? And there are other categories of things (like header search paths) where I think you are forced to involve the build system too.
Yup, it is probably fair to say that C++ accounts for like 50% of the complexity of Kythe at Google. Or certainly it feels like it.
And it is also worth noting that Kythe goes a bit deeper than what LSP can accomplish. In particular Kythe is built around a sort of two-layer graph, where it separates the physical code/line representation from a more abstract semantic graph. This allows us to accomplish some things that are very difficult to do in LSP.
Finally, Kythe at least internally has a big reliance on a unified build system (Blaze, or Bazel). It becomes rapidly more difficult to do when you have to hook in N different build systems up front, which is why search-based references are so appealing. Build integration is hard.
Has Tree Sitter been useful to projects like this? Does it have promise to be useful in the future? It seems to be gaining a lot of adoption among Neovim users and plugin developers, but not really anywhere else. I'm curious if that's because of lack of familiarity, or because it's technically deficient somehow.
In short, yes, very much so! Tree-sitter is what we're using under the covers to parse all of the languages that we support. The ctags-like symbol extraction described in the paper comes straight from tree-sitter, too. [1]
[1] https://tree-sitter.github.io/tree-sitter/code-navigation-sy...
Sourcegraph CTO here. It's interesting to read about GitHub's approach and how it contrasts with the approach we've taken at Sourcegraph. One of the key tradeoffs the article highlights is GitHub's decision to take the "shallow-but-wide" approach to code navigation, which has enabled them to provide some level of code navigation for most open-source repositories on GitHub, but at the expense of precision/accuracy (i.e., the system can't necessarily differentiate between different symbols with the same name).
Sourcegraph decided early on to take the opposite approach, favoring precision and accuracy over supporting every public codebase. Part of the reason why is that we aren't a code host that hosts millions of open-source repositories, so we didn't feel the need to support all of those at once. Another big reason is we heard from our users and customers that code navigation accuracy was critical for exploring their private code and enabling them to stay in flow (inaccurate results would break the train of thought because you'd have to actively think about how to navigate to the referenced symbol). We actually built out a language-agnostic search-based code navigation, but increasingly user feedback has driven us to adopt a more precise model, based at first on our own protocol (https://srclib.org) and also the LSIF protocol open-sourced by Microsoft that now enables code navigation for many popular editor extensions.
This is not to say that GitHub's approach is wrong, but more to say that it's interesting how different goals and constraints have led to systems that are quite different despite tackling the same general problem. GitHub aiming to provide some level of navigation to every repository on GitHub, and Sourcegraph aiming to provide best-in-class navigation for private codebases and dependencies.
(Btw, hats off to the GitHub team for open-sourcing tree-sitter, a great library which we've incorporated into parts of our stack. We actually hosted the creator of tree-sitter, Max Brunsfeld, on our podcast awhile back and it was a really fun and insightful conversation if people are interested in hearing some of the backstory of tree-sitter: https://about.sourcegraph.com/podcast/max-brunsfeld.)
you should also highlight that your solution has a significant downside: you have to generate the definitions yourself in order for them to be displayed in source graph
this means you’d have to setup a CI/CD to rebuild index each time you make changes to the code + host it on your own
what’s great in GitHub’s approach is they take that obligation away from customers
I didn't have to do anything and sourcegraph indexed my project: https://sourcegraph.com/github.com/josephmate/OdinCodeBrowse...
Are you taking about a self hosted or private repo? In that case, I am not familiar.
yes, they used "search-based indexing" for your repository
i was talking about LSIF, which enables autocomplete, go to definitions, finding references but at expense of generating your own index
Actually that’s not quite right. We offer auto-indexing now for some languages with more on the way. If neither auto-indexing nor manual LSIF generation are an option, we fall back to our search-based code navigation, which, similar to GitHub’s implementation, trades off accuracy for zero configuration (GitHub’s approach is parser-based while we use a combo of parsing, search, and heuristics that eliminates the need for an index entirely in some cases). This is all documented here for those curious to learn more: https://docs.sourcegraph.com/code_intelligence.
GitHub released a great java parser that I think is related to this work https://github.com/javaparser/javaparser
I'm also using that parser for a side project where developers can cross link their source code and host them statically: https://github.com/josephmate/OdinCodeBrowser#readme
Short (~5 min) video summary, if that's your thing: https://youtu.be/YoOFJApmPKc
This article more about parsing at scale than static analysis at scale.
Parsing is definitely a big part of it, and it's a fair point that for search-based Code Navigation, we don't have to do any real heavy lifting on the analysis side. That said, I think the article describes our non-functional requirements well (zero-config, incremental, language-agnostic). It's those non-functional requirements which are most important for the “at scale” part. I'd go so far as to suggest that any static analysis implementation that can't meet those requirements would be nigh impossible for us to roll out across the entire GitHub corpus.
tree-sitter is a phenomenal project
i’m doing a git-related project myself and use it to generate symbols for source code
if you’re into it too, i recommend also checking out LSIF: https://github.com/Microsoft/language-server-protocol/blob/m...
This is Big Code
figure 2 is repeated for some reason?
That's odd. The error was not present in the original publication of this article in ACM's Queue magazine: https://queue.acm.org/detail.cfm?id=3487022