Show HN: Visualizing a Codebase
octo.github.com
I explored an alternative way to view codebases to the typical folder/file list, showing a bird's-eye-view of its structure.
https://octo.github.com/projects/repo-visualization

We always look at our code in a file/folder list - I explored an alternative way to view codebases, showing a bird's-eye-view of its structure. This write-up walks through the motivations, ways to use the visualization, and potential future directions (there are many!). There's also an interactive tool to check out your own repos and a GitHub Action if you want to integrate a diagram into a README.

I honestly don't understand the point of this. Is it just a visualization of the directory structure? I expect code visualization to more or less ignore the file structure and focus on semantic analysis. For example, show the major components of the system and how they interact. Perhaps the major components are represented by some kind of a module, or a collection of modules. I don't have any concrete ideas. But I was expecting something in this vicinity.

ah, but is there a generalizable way to do this? Every code base has files and directories.

Personally I find it interesting as tools like these have been helping me to understand the team's and delivery dynamics when I'm joining new dev-teams. This particular one is great, as it reminds me of a book that I read a while back: Your Code as a Crime Scene [1] by Adam Tornhill. Adam is trying to explain something similar, but takes the whole concept to the next level by explaining how tech debt and hidden coupling can be discovered using the git history and similar file structure visualisations.

[1] https://pragprog.com/titles/atcrime/your-code-as-a-crime-sce...

me too. I have a set of git history analysis scripts that I often use when I'm joining a new team. they're here: https://github.com/gilesbowkett/rewind but they're a bit stale at the moment. one major weakness they have is that they work on a per-repo basis.
no problem at all for monorepos, but for a company with a lot of repos, it'd make sense to use the GitHub GraphQL API to find out which repos see the most activity.

This is so cool. I really appreciate how they added the "Search for a file" and "Excluded paths" to the demo. Makes it a lot more useful while still so simple to use. Edit: the more I play with it the more I like it. Also just noticed their feature to deep link to repos (example: https://octo-repo-visualization.vercel.app/?repo=owid%2Fowid...). The future directions they mention also sound really exciting. Seeing files that cause a lot of CI failures, files by # of authors, files by # of changes, all that stuff would be really cool.

My favorite tool for visualizing a codebase is Gource. Here's a 1 minute visualization of the Linux Kernel repository from 1991-2015 https://twitter.com/mattrickard/status/1423366779590430721

I was about to also mention Gource. Here's a render I made in December, "1500 Commits (39 months) in 1 and a half minutes": https://www.youtube.com/watch?v=szhYhph4C2U And here's another one I made a few years ago, "100 Commits (18 months) in 1 minute": https://www.youtube.com/watch?v=aIN4mOdL0d0

I used Gource to visualize a hackathon, we had everyone start from a templated repo so it was really easy to show the entirety of participants' coding activity (albeit at a macro level)

I really really want to like these sorts of visualizations. But they just fall flat on me. The "you can see really quickly..." text is scrolling by and I'm like... "Nope, that picture still means nothing to me." :(. It starts highlighting different parts and I'm completely at a loss on what is highlighted. I do think this can be very effective once I'm trained on it. Such that I plan to play with it. But I just don't visually think of programs in anything close to this manner. Anyone know of studies that explore how we think of our programs?
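
For anyone who wants to try the "files by # of authors, files by # of changes" idea mentioned above on their own repo: a minimal sketch, assuming plain `git log` output is a good enough data source. The `@@` author marker and the function names are my own choices for illustration, not anything from the write-up or the tool.

```python
import subprocess
from collections import Counter, defaultdict

# Hypothetical marker for this sketch: it tags author lines so they can be
# told apart from file paths. A path that itself starts with "@@" would be
# misread as an author.
AUTHOR_MARK = "@@"


def read_log(repo_path="."):
    """Run git log so each commit prints '@@<author>' followed by its files."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", "--name-only",
         f"--pretty=format:{AUTHOR_MARK}%an"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_log(log_text):
    """Return (changes per file, distinct author count per file)."""
    changes = Counter()
    authors = defaultdict(set)
    current_author = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines separate commits
        if line.startswith(AUTHOR_MARK):
            current_author = line[len(AUTHOR_MARK):]
        elif current_author is not None:
            changes[line] += 1
            authors[line].add(current_author)
    author_counts = {path: len(names) for path, names in authors.items()}
    return changes, author_counts


# Usage (inside a git checkout):
#   changes, author_counts = parse_log(read_log("."))
#   for path, n in changes.most_common(10):
#       print(n, author_counts[path], path)
```

Renamed files show up under both old and new paths here, so treat the counts as a rough signal rather than an exact history.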
Anecdotally, I don’t visualize my code, however there is a sense of distance and depth between things (in the functional sense, not the lexical sense). I feel like this visualization goes a long way to showing the distance I feel when working on the code, but that’s only because the visualization captures the lexical distance and we generally group functionally close things together.

I don't have a sense of depth at all. I do have some concept of linearity of the control flow. But, it is a shallow knowledge that I couldn't visualize or even describe using visual terms. I also confess that I think my best understanding from code is not at all linear in the textual representation, either. Sure, small code fragments are easy to reason about in a top down, left right fashion. But most of my understanding is a lot more disconnected than that. Frustrating, as I do like pure functions, when I can get them. But... I also like knowing what the "metaphor" for a program is such that I can also reason about what it is doing. This is best termed for me, ironically, in a graphical sense. Seeing a cube that is rotated 45 degrees is easier to understand in a Logo like language where the directions are the same, except for the initial condition. (I may just have a soft spot for Turtle Graphics. :D)

Interesting! My sense of “depth” is highly correlated to how “foundational” the code is. Meaning that changing “deep” code will require changing lots of things to keep things working or may break unrelated things far away (sorta like brittleness).

This makes some sense. That said, I also view it similar to how you would consider a keystone on an arch. That is, it may be something at the surface, not necessarily down below. (Which gives the tell that I view depth as down. I'm realizing as I type this that that is not necessarily true.)

I worked on something like this a few years ago, only in VR so you could walk around the visualization and use your spatial recognition abilities in 3D.
One part we struggled with was evolving the visualization with the codebase. I see in the demos at the bottom that small changes to the codebase can have a large impact on the visualization (unless I'm missing something), making it difficult to treat the visualization as a fingerprint over time. I wonder if there are plans to address this.

This is an area I'm very interested in, happy to chat about it any time. My big bet (10% confident it's correct and will be world changing), is that having languages that rely on spatial position of tokens in a 2D or 3D grid will be a big leap forward and make 3D visualizations quite natural. I think 3-D visualizations of 1-D languages (all our current programming languages are 1-D) will not be so helpful, as you will be looking at transformations, not the actual code as it exists. If anyone is intrigued and wants to write a function that takes as input a parsed Tree Notation program (https://jtree.treenotation.org/designer/) and outputs a https://www.mecabricks.com/ file, get in touch! I could even fund something like that, if needed. Such a function would then be able to generate a LEGO version of any program written in a Tree Language. From there, I think there could be interesting discoveries to be made related to future version control systems and collaborative editing algos (I think you could beat CRDT/OT/et al).

I have also been thinking about building something like this. I believe there should be a presentation layer separate from the code. Give people tools to build an empty building (or city) via floorplans; allow them to create pleasing spaces in VR. Then the actual code machinery (literally represented as machines) is built inside the various rooms. My hope is that organizational/bikeshed concerns are funneled into the architectural/interior decoration layer, whereas engineering is housed in those 3D spaces relatively undisturbed.
The needs of the code would exert influence on the architecture as new spaces would be required for new functions, or rooms would be rearranged to better reflect dependencies and coupling.

> I think 3-D visualizations of 1-D languages (all our current programming languages are 1-D) will not be so helpful, as you will be looking at transformations, not the actual code as it exists.

This is the part that I disagree with. The source code is linear, sure, but the structure that ties the different components in a codebase together is high-dimensional. For example, the symbol usage graph is really useful to visualize in 3D, as is the general package structure. And most of these visualizations we see don't even touch on runtime analysis like I was working on! There, seeing different types of data arranged either by where the data is coming from in the source code or where the data is coming from at runtime (analogous to lexical vs. dynamic scoping) is a visualization that benefits from 3D.

Yes, I'd agree with you. My wording was poor. I should have said something like "will be helpful, but far more complex to build and use, as they will have to rely on abstract transformations, rather than the natural shape of the code". I could still be wrong, and really need to get those experiments going again! Thanks for pushing back on that and causing me to think harder. P.S. I also think runtime analysis is an even more complex problem, and applaud you for attacking it!

100% agreed. The translation from source code to 3D visualization was ultimately based on choices we made, not on the inherent structure that already was put in place by the developer. The output was still valuable, but when the developer has a say in the output, that's when so much more power would be unlocked!

That sounds interesting, can you give more details about that?
I work on Vircadia (https://vircadia.com/) and have been thinking for a while that it would be cool to have in-world visualizations of things like the project's structure and github activity. It's a big one, so perhaps the right 3D representation would make the project's structure more understandable at a glance.

I haven't worked on the project in a while, so I can't comment too much on the current status. But I'm happy to talk about what I learned, the challenges I faced, etc. if you have any questions about those. The multiplayer aspect is immensely valuable. Having multiple developers walk around the same codebase together, discussing the architecture and the code, is helpful for debugging, onboarding, brainstorming, all the different ways we collaborate as developers. This kind of visualization is not a solo activity, so I can see where Vircadia would come in.

I have no idea of which project you're talking about, so I think it would help if you explained what it looked like. Do you have any screenshots? How do you navigate it? How do you search for text in VR? What do you visualize exactly? Is it a static or a dynamic display? Can you drill down to actual lines of code? How do you deal with large projects? Vircadia has 400K LOC and 11K files.

Apologies for the hot take, but imo GitHub has been really knocking it out of the park with terrible ideas lately (remember how everyone fell all over themselves during the Copilot release?). This is an absolutely worthless visualization that only impresses those that haven't heavily worked with visualizations. A few points right off the bat: I don't mean to be overly negative, but it's just not a good visualization and no one will ever seriously use this.

This is a Show HN post. While you have valid criticisms (small labels, using filenames as labels produces lots of package.json, etc.), the way you shared it certainly violates the site guidelines (“Be kind”).
You knew you were being harsh and let your emotional response get to you. But, you should remind yourself that a person was on the other side of this post, and she cared enough to share it. Even if you feel the visualization is unacceptably bad, you should seek to find a way to provide constructive criticism. You’ve got the beginnings of actionable feedback, it’s just covered in invective language (though directed at the work not the person, so that’s something!).

Bravo! You need to be there on every one of my Show HNs in the distant past. And so many others'--Where were you shepherding the gnawing HN hordes, slouching toward Bethlehem to be born?

Way back when I was in school and much less confident than I am now, it was actually comments like yours that hurt the most. What you're essentially saying is "I know what was said is true and I don't dispute that, but you shouldn't have said it". For me this was like a second blow. Criticism always hurts, but at least the first guy thought I was tough enough to take it. You, on the other hand, not only agree with the first guy, you think I'm so feeble that I need a "nice" person like you to shield me. When people stick their neck out they will sometimes get hurt. This is totally normal and a natural way of learning and developing. Safe spaces don't make successful people. The comment you are replying to didn't even seem emotional to me and certainly didn't attack the person. It was just honest feedback.

You are reading way too much into the post you responded to. At no point did they agree that the OP is speaking the truth. They said to rephrase it to not just be negativity. But constructive too. You then bring up safe spaces which is irrelevant. It appears you had an agenda from the get go and are attempting to create a narrative here.

> They said to rephrase it to not just be negativity. But constructive too.

On the one hand I agree with you and I personally always try to be positive whenever possible.
But nobody owes you constructive criticism and sometimes it's just not possible to think of anything positive. An attitude of "no negativity here" will just create a culture of yes men which isn't actually constructive at all. I have no "agenda". I have opinions and I have experiences. I expressed how I feel because I assume the commenter wanted to do good and they'd like to know that it's actually harmful to some people, like me. I suppose I also can't believe someone would feel the need to come to the rescue for the OP (as if they can't defend themselves) after such a harmless post. There's no personal attacks, no strong language, nothing. The assumption seems to be the OP is too soft to take it, which I can't help but think has sexist underpinnings.

i used agenda in a way of everyone having an agenda to have people see their perspective as a good one if not correct one. You quoted me saying “not to just be negative” then went on to talking about “an attitude of no negativity here”. Those two aren’t the same thing. The above and especially your last paragraph appear to again repeat your pattern of having a typical “anti-woke”* agenda. Along with strong assumptions you build into innocuous situations. It seems like trolling because of how wild the last line is. Which gender is it even sexist against? I don’t know OP’s gender. Why would this ever be the assumption? It’s so far out of right field.

* I don’t care for a handful of things from either side of the aisle myself. But there’s a difference between that and trying to shoe horn anti-woke stuff into every convo.

Being right doesn’t excuse you from being kind. You can critique something in a constructive way. This isn’t about safe spaces, it’s about keeping communication civil.

> but you shouldn't have said it

Isn’t quite right. “You shouldn’t have said it like that”. I totally agree that criticism is necessary, and that pandering doesn’t help people. Being constructive is just more productive.
If the goal is to get the person to change what they’re doing, be constructive. For example, “instead of repeating ‘package.json’, the labels should be semantic (e.g., < package name >)” would have been constructive.

I don't think it's useful in its current form (except perhaps for newcomers to a project to have an idea of the file structure at a glance). But I also don't think they're trying to present this as a new killer feature they've been working on for years. I'm pretty sure this is just an experiment/exploration done by a few people over a few months to see what they found, then they presented their results.

Spot on! It's a space I've been wanting to explore, and so have been tinkering with it this last week while we're in between larger projects. I definitely don't see it being a product in its current form, but wanted to share the exploration! Although I will say that the more I've been visualizing different repos, the more convinced I am that it's more useful than is easy to convey.

Have you looked at CodeScene at all?

I hadn't seen it until today - looks like a useful product!

https://news.ycombinator.com/newsguidelines.html

> Be kind. Don't be snarky. Have curious conversation; don't cross-examine. Please don't fulminate. Please don't sneer, including at the rest of the community.

> When disagreeing, please reply to the argument instead of calling names. "That is idiotic; 1 + 1 is 2, not 3" can be shortened to "1 + 1 is 2, not 3."

> Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.

The guy literally says he timeboxed his exploration on this experiment. I think it's pretty cool and would love to have the option to navigate repositories like this.

Author is a woman.

Sex is irrelevant.

It’s generally considered polite to refer to people the way they want to be referred to. It’s relevant.

It's absolutely relevant and was my mistake.
I would have corrected it if I had seen it in time – apologies to the author.

...for the HNers who love to talk about how sexism or implicit bias doesn't exist in tech...

Any people doing any automatic assuming likely also assumed the person is white, not homeless, well dressed, no mental health issues, and so on. It’s how things work. There’s so few people who actually feel the way you’re describing things. Being sexist and having implicit biases is not the same thing. The OP did apologize for getting the gender wrong so happy ending there.

While I wholeheartedly agree, I'd argue there is value in having R&D teams tinkering around with different ideas. They may churn out duds, but theoretically they will produce something valuable eventually.

It's probably useless - but it would make for a killer powerpoint slide at your next meeting, wouldn't it?

I admire the people who work on pet projects like this one and share them with the world despite getting criticism that is just barely constructive.

These are merely design issues- not things that stem from it being a terrible idea. It's a good idea that just needs a v2 design, there is no need to go on a diatribe about unrelated events, this was clearly just someone taking an idea they thought could be valuable and sharing it with others in the hopes they might gain value from it also. I have needed quick visual fingerprints of repos for a very long time, even just a crude outlay of the filesizes was what I needed but this provides even more. I just don't like the tone of this criticism, and nobody would ever do Show HN's ever again if "it's just not a good visualization and no one will ever seriously use this" was the standard for commentary.

As a tool for exploring a repo it does have some flaws in navigating, but as a tool for comparing the complexity of two repos it looks very useful. It's immediately obvious where the depth, complexity and 'weight' of something lies. That's useful.
Plus, even if the result isn't perfect, the fact people are exploring alternatives to a tree structure is great, because trees suck for anything that's broad and deep, especially in a language you're new to that doesn't have familiar patterns.

I tried with https://github.com/racket/racket and for some reason it puts all the content of the subfolder "racket"/"src" in a vertical strip near the middle of the circle instead of spreading the parts evenly. How is each part arranged?

interesting! I'm seeing them grouped in the middle. It's a tricky layout, using d3.js's circle packing algorithm, then recursively using a force layout to relax each folder's contents.

It's a strange directory because it mixes the old and the new implementation of Racket. But IIUC you are not looking at the internal content of the files, only the size. Perhaps you can file a bug report in the d3.js package. (I'm not sure it's a bug, or just a very weird case.) In the part that says "direct link to your own repository" you can add a link icon for people (like me) that don't read all the text. (I was going to suggest that as a feature, but looking again I found it.)

I also added it after your first comment so might have snuck it by you ;)
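
To make the "only the size" point above concrete: a circle-packing layout just needs a nested tree where every leaf carries a number. Here is a minimal sketch of building that tree from a directory, assuming file size in bytes is the metric. The `size_tree` name and the dict shape are assumptions for illustration, not how the tool itself is implemented.

```python
import os


def size_tree(root):
    """Build a nested dict for a directory: files carry their byte size,
    folders carry the sum of everything beneath them."""
    node = {
        "name": os.path.basename(os.path.abspath(root)),
        "size": 0,
        "children": [],
    }
    with os.scandir(root) as entries:
        # Sort by name so the output is stable between runs.
        for entry in sorted(entries, key=lambda e: e.name):
            if entry.is_dir(follow_symlinks=False):
                child = size_tree(entry.path)
            elif entry.is_file(follow_symlinks=False):
                size = entry.stat(follow_symlinks=False).st_size
                child = {"name": entry.name, "size": size}
            else:
                continue  # skip symlinks and special files
            node["children"].append(child)
            node["size"] += child["size"]
    return node


# Usage:
#   import json
#   print(json.dumps(size_tree("."), indent=2))
```

A nested `children` structure like this is the usual input for hierarchy layouts, whether circle packs or treemaps. Swapping bytes for another metric such as line count only changes the leaf value.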
I'm curious what it is that you're seeing, if you have the chance to leave an image!

Image with bad editing here: https://imgur.com/a/XxkzWZw It would be nice that if I have chosen a repo above, the link includes that repo instead of the default one.

Took a look at my own codebase, which is 99% Rust. All gray, I'm guessing Rust isn't currently a recognized file type? Either way, very nice! I currently use the "dirtree" tool (https://github.com/emad-elsaid/dirtree) to generate diagrams like this of my codebase for documentation: https://github.com/WhiteBeamSec/WhiteBeam/wiki/Code-layout The "eralchemy" tool (https://github.com/Alexis-benoist/eralchemy) is also excellent at visualizing SQL databases: https://github.com/WhiteBeamSec/WhiteBeam/wiki/SQL-schema

.ml and .mli files are not recognized either. Still, it's a fun tool.

Oddly enough, the page partially loads, hangs, crashes the tab and attempts to reload, hangs, and then crashes the entire running mobile chrome instance on my phone. I don't think I've ever seen that before! I'm guessing the page is just memory heavy and android 11's memory manager can't figure out how to deal with it. ( chrome mobile, pixel 3 xl, android 11 )

100% cpu for me, browser tab uses 3+ GB ram, chromium hangs then crashes

oh no! My best guess would be that there are many element-heavy SVGs on the page. I was hoping for better performance, since the files will almost definitely be smaller than large raster images, but there are likely older browsers that have trouble with that. I'll look into it, thanks for the report!

Android 11: Firefox, Chrome and viewing in HN app all crash

Same, crashed rather recent phone as well

That's the same visualisation used by CodeScene (https://codescene.com/how-it-works/), but there is a more elaborate one that represents affinity (number of connections) as distance to create something like a geographical map: https://homepages.ecs.vuw.ac.nz/~craig/publications/vissoft2...

Very fun!
Would a similar visualization work for showing the insides of a go binary? It would be super cool to have a way to visualize how different modules add bloat in size (and may pull in other bloaty modules as well)

This is cool but using rectangles instead of circles would help this visualization. Circles waste real estate and are not friendly to labels (e.g. curved text that is harder to read)

There are definitely many trade-offs with using a circle pack layout - I snuck a bit of the reasoning into the collapsed section halfway through the write-up! Overall, this layout worked best for me, with the nesting feeling very natural and the circles feeling very "atomic". But big picture, this write-up isn't tied to the current visualization! It's more focused on exploring _how_ a visual representation could help our understanding of codebases. There are tons of jumping off points, including different vizes!

Great idea & execution, I've definitely thought about building something similar and have tried Gource once before. However, I must add a +1 to the parent commenter re: rectangles. Circle packing is pretty, but a proper rectangular tree map interface gives you everything the circle map does with better labels, less wasted space, and less chance of a funky circle layout from the force/relaxing algorithm. This would be great for a presentation where I'm trying to make our codebase look flashy and cool. But for actually getting work done, I'll take a tree map any day. Check out the mac app GrandPerspective if you haven't seen it - great use of tree maps for cleaning up hard drive space.

I agree, I like how JS bundle analyzers do it: https://twitter.com/XCSme/status/1410608222872891399

Agree on this. Rectangles can also be aligned and given a logical order. It’s probably less nice to the eye than circles, but more useful.

I found this useful on my project. I realized I have many 'dusts' files in directories.
Tiny little guys just like grains of sand nestling among the larger circles, looking to be useful.

Beautiful structure and images! I love seeing my beautiful work in this beautiful format. It really brings out the beauty! :)

Very cool! Reminds me a little bit of gource. This is actually something I'd consider adding to all of my big projects. It really does help with discoverability.

Random nitpick: the issue with color coding files is that you may have many different file types leading to colors that overlap. Case in point in the author's create-react-app example: in one of the scrolling "comment boxes", the author calls out that the "tasks/" folder is mainly CSS files which made me raise an eyebrow...why would a tasks folder be mainly CSS files? -- and upon closer inspection of the colored legend .sh files are a VERY similar green. Just to satisfy my curiosity I visited the repo and sure enough, it was just .sh files, without a single .css file. It makes me doubt the experience of the author...how can a folder called tasks/ (in any repo) be .css files?

This is cool. I remember using “Understand for C++” that does something like this, a full source code graph visualization - function flow etc. This of course starts as a folder visualization, but I see the value - seeing the big picture

Did something similar some time ago: http://quantifiedcode.github.io/code-is-beautiful/

I take it the domain quantifiedcode is not active? It was taken over by LinuxSec.

Yeah the startup is defunct since a long time.

The best tool I found for exploring code bases and navigating large projects is https://www.sourcetrail.com/.

Looks like that’s pretty zoomed in compared to this?

That's a great visual presentation, but not really an innovation. The CodeScene tool has that built in together with a set of deep analyses on top (see https://codescene.com/).
There are several public conference talks that cover this visualization and related use cases: https://www.youtube.com/watch?v=fl4aZ2KXBsQ

Shameless plug, but this was also showing the content of the file!
https://github.com/facebook/pfff/wiki/CodeMap There are lots of examples of OSS projects visualized here:
https://github.com/facebookarchive/pfff/wiki/Examples

I find that it's useful not just to look at the current contents of a codebase, but how it has evolved over time. For example, after being onboarded, this lets me see where most of the current effort on a codebase is concentrated and what the biggest recent changes have been. For this, I believe that Gource is a lovely tool, which you can just point at a Git repository and it will visualize it: https://gource.io/

Nice implementation. I especially like the curved directory titles.

I like the quick insights I can gain from this! Very promising.

It's very basic in its current implementation but I see a lot of potential, especially around the "how files are linked" part.

It's a nice bird's eye view. One thing I'd like is for there to be multiple metrics to use for the size of packages e.g. lines of code, number of files, number of methods etc. That way you can make sense of what are the heavyweight parts of the codebase.

Weird seeing this as a Show HN. That said: since MS and GitHub are the same company… one of the things I really want is to be able to opt in to reference/search into dependencies. I don’t need visualization, I need “yes show me node_modules/*/*.js when normally I wouldn’t want that.” I use a VSCode extension that does this in the file browser, but I want it across everything that determines whether something is hidden.

Perhaps it's not fair, but the first repo I thought of trying, aws/aws-cli caused it to freeze my browser's tab. When it finally unfroze, I'm presented with a few large circles and way too many tiny dots to be useful. Guess there's an upper limit on the size of the repo, or perhaps it's more geared to different "shapes" of layouts.

Is this site critically slow in firefox for anyone else or just me? It's running at around 1fps...

Thank you very much for the article! Everything is available, very useful information!

Very interesting and illuminating!
Please continue on this adventure, you're onto something great!

So cool and useful!

Is there anything like this, but for visualising the connections between git commits, files, GitHub issues, maybe even classes etc...? like a knowledge graph or something?

if yes, I'm working on a similar concept (collecting facts and interdependencies among those from various sources: code elements, git, github issues, etc)
let's chat if you are interested to brainstorm about it
https://www.linkedin.com/in/alexbit/
https://calendly.com/alex-from-intuita/15min

I got half way down the page to a full screen graph and it was impossible to scroll further. Can we just not do these cute UI gimmicks?

How is this any different from just showing the fully expanded tree of the folder structure?

Now if we can just click on those circles and start coding.

I can't access it, HN hug of death?

Hmm, it should be loading! What OS/browser are you on?

I prefer tree.
Case in point: look at the `paperjs/paper.js` example they themselves provide. There's a big circle called "packages" and inside that circle, two smaller circles that all contain the exact same files: "package.json," "index.js," and "README.md" -- how is this insightful in any way? I need to go to the repo to actually see that one of the folders is called "paper-jsdom" and the other one "paper-jsdom-canvas." The visualization literally confuses me more than just looking at the repo.
- Labels are way too small, so you'll need to zoom in...
- ...but if you zoom in, you'll need to pan...
- ...and if you need to pan, you lose context
- Hovering over "connected files" is just a jumbled mess