Google Is 2B Lines of Code, All in One Place

wired.com

473 points by sk2code 10 years ago · 346 comments

antics 10 years ago

Just because people are talking about it: I work at MSFT, and the numbers Wired quotes for the lines of code in Windows are not even close to being correct. Not even in the same order of magnitude.

Their source claims that Windows XP has ~45 million lines of code. But that was 14 years ago. The last time Windows was even in the same order of magnitude as 50 million LOC was in the Windows Vista timeframe.

EDIT: And, remember: that's for _one_ product, not multiple products. So an all-flavors build of Windows churns through a _lot_ of data to get something working.

(Parenthetical: the Windows build system is correspondingly complex, too. I'll save the story for another day, but to give you an idea of how intense it is, in a typical _day_, the amount of data that gets sent over the network in the Windows build system is a single-digit _multiple_ of the entire Netflix movie catalog. Hats off to those engineers, Windows is really hard work.)

  • rakoo 10 years ago

    > the numbers Wired quotes for the lines of code in Windows are not even close to being correct

    OTOH you can't blame them for being incorrect if you (as in, Microsoft, not you personally) are being so secretive about the figures. I'm pretty sure everyone would love to see how Microsoft works internally, especially now that you teased us with that Windows build system.

    • dsjoerg 10 years ago

      Yes you can blame them for being incorrect. If they don't have a correct and relatively up-to-date figure, they should be clear about that. "Microsoft declined to comment on how many lines of code Windows has now" or "Windows XP used 45 million lines of code, but that was 14 years ago so it's not a very good comparison to anything".

      • highmastdon 10 years ago

        They're not only incorrect but also stating facts out of context. Furthermore, Microsoft doesn't have any obligation to expose their real numbers if they don't want to.

        If you need to cite numbers, at least give the context instead of quoting some figure out of the blue from 15 years ago and assuming it's still the same Windows.

        • guardian5x 10 years ago

          This makes me think of a parallel world where all the big IT companies show off their LOC regularly, starting a "LOC war", as if it would mean anything of value.

  • hanwenn 10 years ago

    The reason Google developed Piper was basically that Perforce couldn't scale beyond a single machine, and our repo was getting too large to even be usable on the biggest, beefiest x86 server that money could buy (single machines with terabytes of ram and zillions of cores).

    If Microsoft has close to the same amount of code in a single repository, then they must have also written their own version control service that runs on more than one machine.

    The last rumor I heard is that Microsoft bought a license to the Perforce source code and created their own flavor to host internal code ("Source Depot"?), which presumably still runs on a single machine.

  • ohitsdom 10 years ago

    > single-digit _multiple_ of the entire Netflix movie catalog

    Strange unit of comparison, although I may start using it.

    • RyJones 10 years ago

      Facebook gets a Flickr worth of photos every few days.

      • toomuchtodo 10 years ago

        As someone who subscribes to ArchiveTeam's philosophy, it's going to be a dark day when the time comes to scrape Facebook before it goes under with that much data behind the scenes.

        • zeckalpha 10 years ago

            It's already backed up to Blu-ray. They'll just hand it over.

          • ivank 10 years ago

            Yeah, right. How, and to whom exactly, would they hand that over, given that there are privacy settings on the photos that people expect to be respected?

            • BtM909 10 years ago

              Sarcasm son....

              • ivank 10 years ago

                You'd be surprised at how many people assume these bigcos are open to doing the right thing when they're shutting down.

                • dx211 10 years ago

                  Hey, I've still got my complimentary MySpace zip file kicking around somewhere.

                  • jsmeaton 10 years ago

                    I had a hard drive crash on me with all of my photos some years back and my "backup" strategy failed. Dumping a myspace backup got me some of my most precious photos back. Thanks MySpace!

                  • ArekDymalski 10 years ago

                    Even more. Most people don't even realize that things like FB aren't eternal.

                • nl 10 years ago

                  It isn't even clear what "the right thing" is here.

      • fao_ 10 years ago

        I wonder if this system could be used for 'burns'.

        Such as: "Hacker News has an Ask dot com userbase of number of good posters" (This obviously does not include me ;))

      • limeyx 10 years ago

        How many VW Bugs is that ?

    • protomyth 10 years ago

      Well, I remember when Library of Congress and encyclopedias were used as units of measure. I would guess Netflix is the new stack of media.

    • euske 10 years ago

      Uh, how much is it actually? (Simple searching didn't seem to get me the answer.)

      We should have a list of these things.

    • ghgr 10 years ago

      That's the same as saying "the same order of magnitude"

      • theoh 10 years ago

        But while "1" and "9" are the same order of magnitude, so are "1" and "0.1". To say something is within an order of magnitude of netflix, you cover a possible range from 0.1 to 10 times the Netflix catalog, a range which is ten times bigger if you consider the minimum and maximum sizes...

        • bostonpete 10 years ago

          Um, 0.1 and 1 are not considered the same order of magnitude.

          • theoh 10 years ago

            I was abbreviating. We're talking about an open end of the interval right?

        • noobermin 10 years ago

          He didn't say "within an order of magnitude", he said, "the same order of magnitude." There is a difference.

          • theoh 10 years ago

            So you think he meant within a factor of 3 and 1/3? That's not what the original netflix comment said.

            • noobermin 10 years ago

              "The same order of magnitude" means a multiplicative constant of a number from [1,10). 1/3 is decreasing by an order of magnitude. That is the typical usage in the sciences. Anyway, it's splitting hairs over something even less worthy of splitting hairs over.

        • dragonwriter 10 years ago

          > But while "1" and "9" are the same order of magnitude, so are "1" and "0.1". To say something is within an order of magnitude of netflix, you cover a possible range from 0.1 to 10 times the Netflix catalog

          "is same order of magnitude as" is not a transitive relationship.

        • vorg 10 years ago

          You seem to live in a base-tenny world. Many nerdies train themselves to do hex mental math in kindergarten, and the geekies do it base-36 because those extra letters "are there". So perhaps a 256 to 1296 multiplication ratio (those 2 numbers there are in base 10).

  • Pxtl 10 years ago

    I assume they don't use msbuild, because they don't completely hate themselves.

  • 4ad 10 years ago

    So, how many lines of code does it have?

    • antics 10 years ago

      I can't say. I work here, but I don't speak for the company.

      • robotresearcher 10 years ago

        Wouldn't it be nice if by working there you felt empowered to speak for the company? The company is you and your colleagues.

        • qnaal 10 years ago

          I'd bet he has a lot of colleagues who are better qualified than you to question what he chooses to reveal about their collective private intellectual property.

      • koide 10 years ago

        And why is that number a corporate secret?

        • antics 10 years ago

          I doubt it is, and I don't think I'd get in trouble for sharing it.

          But, it's not my job to decide whether or not that information should be shared, because it's not my job to speak for the company.

          I'm having a hard time understanding why people think this is not a reasonable position.

          • robotresearcher 10 years ago

            Because in a room full of interested people you said "I know something you don't." and the "ner ner ne-ner ner" was implied.

        • cbpy 10 years ago

          If combined with other data, it could show how efficient (or inefficient) programmers are.

      • 4ad 10 years ago

        Well, you just said what it doesn't have, so I guess you speak for the company after all.

        Anyway, let me guess. Judging by how the size of all binaries shipped with Windows varied between releases, I'd be inclined to think Windows 10 does not have significantly more lines of code than Windows Vista.

        So I'd guess at most 100 million lines of code?

        • antics 10 years ago

          You are of course free to consult your employer and draw your line however you like.

          For me, I'm comfortable saying that I don't speak for the company and leaving the numbers within an order of magnitude. When it becomes my job to decide which numbers are and aren't fit to talk about publicly, I'm happy to update you.

        • fengqu 10 years ago

          Google

        • fengqu 10 years ago

          Microsoft

    • RandallBrown 10 years ago

      It's a pretty hard number to come up with. Most employees only have access to a small fraction of the codebase. Even if you had access to all of it, it's hard to say what actually counts as Windows and what doesn't.

  • dx211 10 years ago

    And they probably didn't count all of the test crap and IDW tools that nobody's used since Bill Gates was there, but still get built every time.

  • Datsundere 10 years ago

    Would you rather work on the linux kernel instead of windows?

  • est 10 years ago

    I once torrented a stripped-down Windows XP + Office 2003 rip that came to about 120MB in total. The basic functionality worked great.

dekhn 10 years ago

I'm a google software engineer and it's nice to see this public article about our software control system. I think it has plusses and minuses, but one thing I'll say is that when you're in the coding flow, working on a single code base with thousands of engineers can be an intensely awesome experience.

Part of my job (although it's not listed as a responsibility) is updating a few key scientific Python packages. When I do this, I get immediate feedback on which tests get broken, and I fix those problems for other teams alongside my upgrades. This sort of continuous integration has completely changed how I view modern software development and testing.

  • nevir 10 years ago

    Being able to make sweeping changes to a shared piece of code, and ensure that everyone's up to date (Hi, Rosie!) and not broken by your change (yay TAP train!) is phenomenal as well.

    • QuercusMax 10 years ago

      I just submitted a change today that affected 266,259 different build targets within google3. I was slightly terrified to make a change of this magnitude, having been at Google for less than three months, but I ran all the affected tests multiple times and it doesn't seem to have broken anything. (Fingers crossed!)

      Working on the Google codebase is pretty awesome. This week I've made changes/improvements to libraries owned by three different teams (not counting my own), in C++ and Python, when my main project is all in Java. It's super fun. The code search tool is great - it's ridiculously fast, and makes navigating through the codebase very easy.

    • Touche 10 years ago

      How is this a side effect of it being in the same repository?

      • nevir 10 years ago

        Say I make a change to a commonly used library (let's say deprecating a function, and replacing it with another):

        * I can see literally _every use_ of the old function.

        * I can run the tests for everyone who uses that function.

        * * this is automated; the build/test tooling can figure out the transitive set of build/test targets that are affected by such a change.

        * I can (relatively) easily update _every use_ of the deprecated call with the new hotness

        * I can do that all within the same commit (or set of commits, realistically)

        ---

        None of this is impossible with multiple repos, it's just a lot more difficult to coordinate
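
        A minimal sketch of that pattern in Python (library and function names are made up): the deprecated function becomes a thin shim over the new one, and because everything lives in one repo, the same change (or follow-ups in the same chain) can rewrite every call site that code search finds, with the affected tests running before submit.

            # stringutil.py -- hypothetical shared library in the monorepo
            import warnings

            def shorten(text: str, limit: int = 80) -> str:
                """New API: truncate text to at most `limit` characters."""
                return text if len(text) <= limit else text[: limit - 3] + "..."

            def truncate(text: str, limit: int = 80) -> str:
                """Deprecated: use shorten() instead. Kept so existing callers keep building."""
                warnings.warn("truncate() is deprecated; use shorten()", DeprecationWarning)
                return shorten(text, limit)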

        • luckydude 10 years ago

          We've been really crappy at marketing this but we have a multiple repos answer that is semantically close to identical to a monolithic repo. Any work flow or command line works identically in both (there are a few corner cases that don't work but almost 100% of the stuff works like you'd expect).

          It's sort of like a partial clone, you can clone any subset you want, work on it, add in other parts, the tool takes care of making sure the stuff you add in is lined up (if you cloned a week ago and you add in another repo, it's rolled back so it matches time/commit wise).

          If you want to search for something it's the same command in a collection of repos as it is in a monolithic repo:

                  bk -U grep refactor_me
          
          but truth in advertising, that searches only what you have populated. The google answer is "ship grep to the data" so they'll search everything. Google wins if you have their datacenter. We win on UI, it's the same everywhere.

          Our design is much more tightly coupled than Git's submodules. We manage the subrepos so that they are in sync just like files are in sync. What I mean by that is if you have two files modified in the same commit, you can't pull that commit and only get one of the files updated, they are both updated because that's what happened in that commit. We've provided the same semantics for collections of repositories. Git doesn't, getting those semantics is an exercise left to the user.

          We get better performance because you can clone as little as one repository, what we call the product. It would be easy and fun to put a Unix distro in this system and have the top repo just have a makefile and you type make X11 and it populates the compiler, libc, the X11 sources, and builds them.

          It's commercial so maybe that's uninteresting but if you want submodules that work and scale in performance, we might be worth a look (sorry for the marketing, if that's against the rules here then I'll learn how to delete a post).

          Read more here, comments welcome, it's beta copy:

          http://www.bitkeeper.com/nested

        • dekhn 10 years ago

          Technically, you can't see every use of a function because dynamic dispatch mechanisms aren't available at source code time. That's why you would normally run tests of all your library's dependents (having this dependency graph is one of the most important parts of the piper/blaze system).
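
          A contrived Python sketch of why a textual scan isn't enough (names made up): a call that goes through dynamic dispatch never mentions the function's name at the call site, so you fall back on running the tests of everything in the dependency graph.

              HANDLERS = {"truncate": lambda text: text[:80]}

              def call_by_name(name: str, text: str) -> str:
                  # A code search for "truncate(" never sees this call site:
                  # the function is chosen at runtime from a string key.
                  return HANDLERS[name](text)

              print(call_by_name("truncate", "some very long line of text ..."))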

        • omouse 10 years ago

          Yep, this is the key factor, you get to shorten your deprecation cycle to little or none. In contrast, with multiple repos, even for small projects, you will get a few days up to a month of lag time.

      • kyrra 10 years ago

        All dependencies between directories are specified as part of the bazel[0] BUILD files. Since dependencies between files/packages are all laid out in BUILD files, you can have a graph of all dependencies between all packages. So if you change something in one package, it can figure out all packages that should be affected (see the sketch below).

        Bazel test[1] doesn't provide dependency testing as far as I know, but it creates the framework to support doing it.

        [0] http://bazel.io/docs/build-encyclopedia.html

        [1] http://bazel.io/docs/test-encyclopedia.html
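
        A hedged sketch of what that looks like with open-source Bazel (paths and target names are made up): each package declares its deps in its BUILD file, and a reverse-dependency query over that graph yields everything that could be affected by a change.

            # //base/strings/BUILD (hypothetical)
            py_library(
                name = "strings",
                srcs = ["strings.py"],
            )

            # //mail/frontend/BUILD (hypothetical) -- depends on the library above
            py_test(
                name = "frontend_test",
                srcs = ["frontend_test.py"],
                deps = ["//base/strings"],
            )

        Given that graph, something like the following finds and runs every test downstream of a change to //base/strings:

            bazel test $(bazel query 'tests(rdeps(//..., //base/strings))')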

      • wtetzner 10 years ago

        If it's in the same repository, and you update a library, you can update all of the uses of that library in the same commit.

      • dekhn 10 years ago

        It's a form of efficiency due to vertical integration. I agree, as others mentioned, that it would be possible to do this without a single code base, but it would be very unwieldy.

  • amelius 10 years ago

    Could you tell something about the level of documentation? For instance, do you have to write a paragraph of documentation for every function that you add to the system? How about adding preconditions and postconditions, and other invariants?

    Also, is the code that you add to the repository always inspected by other people? Is that done systematically?

    • lrem 10 years ago

      It is mandatory that each code change is inspected for correctness and language style, and approved by a code owner (all three roles may be the same person, or require multiple people, depending on the situation).

      • amelius 10 years ago

        But I guess the original programmer can never be the same person as those other 3 roles? :)

        • mvgoogler 10 years ago

          You always need to get at least _one_ other engineer to review your code.

          OWNER and readability approvals may require additional reviewers but not always.

        • lrem 10 years ago

          Depends if his co-workers lock their screens ;)

    • mvgoogler 10 years ago

      > Could you tell something about the level of documentation? For instance, do you have to write a paragraph of documentation for every function that you add to the system? How about adding preconditions and postconditions, and other invariants?

      It varies a lot by project and by how widely used the code is.

      There are few (if any) strictly enforced rules for documentation or invariant checking. You basically have to convince at least one other engineer that what you have is sufficient.

      The documentation is _generally_ pretty decent for core libraries, but sometimes you just have to read the code.

  • tajen 10 years ago

    If I reconcile this with the latest news: the ranking algorithm is in one repo, Chrome and Android in another, and all the rest is in Piper? What is all the rest? Gmail, Calendar, the defunct Reader, the car, Maps, and OnHub... How does that fit with Alphabet? Is Piper a cross-functional initiative of Alphabet now?

    • kyrra 10 years ago

      Look at the companies that are non-Google entities (the ones under Alphabet that aren't under Google):

      * Google X - moonshot projects that aren't software centric.

      * Google Fiber - mainly an infrastructure setup. May share some stuff in Piper.

      * Google Ventures - investment arm, not code related.

      * Google Capital - more investment stuff.

      * Calico - R&D for biotech.

      Anything that would be software centric will probably still live under Google inc. As well, you are forgetting so many products that Google has. There are various lists of them out there [0][1].

      [0] http://www.google.com/about/products/

      [1] https://en.wikipedia.org/wiki/List_of_Google_products

      • QuercusMax 10 years ago

        Don't forget Google Life Sciences -- we have lots of software. (The team I'm on is 100% software, and we're around 25 people split among 3 sub-teams.) All our code is in piper, and depends heavily on the rest of the google infrastructure.

        • ocdtrekkie 10 years ago

          Is Google going to split this out of their infrastructure, or is this whole separation between "Alphabet" companies and "Google" really just a marketing ploy for no change at all?

          • QuercusMax 10 years ago

            At least as far as Google Life Sciences is concerned, there are multiple reasons that are known internally (at least within GLS) that I cannot comment on at this time.

            An astute observer may be able to come up with some plausible reasons (at least for GLS) as to why such an arrangement would be desirable.

            As far as splitting up the infrastructure, we're still going to depend on Google infrastructure and use their services where it makes sense.

    • packetslave 10 years ago

      The ranking algorithm (and other restricted stuff) is in Piper, but locked down as to who can browse it, check it out, etc. There's some magic in the build system that lets you compile/link against it without needing to see it.

      • thrownaway2424 10 years ago

        There is considerable magic of that kind, actually, because the build system has to be able to prevent insufficiently-entitled engineers from checking out the protected source code, while still allowing a program that depends on the protected code to be built, linked, and debugged.
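
        For a rough idea of how that can work in general (this is plain open-source Bazel with made-up names, not necessarily how the internal system does it): the restricted package can expose only a header and a prebuilt archive, so dependents compile and link against it without ever having read access to the sources behind it.

            # //search/ranking/BUILD (hypothetical restricted package)
            cc_import(
                name = "ranking",
                hdrs = ["ranking_api.h"],          # the visible interface
                static_library = "libranking.a",   # prebuilt; no source exposed
                visibility = ["//visibility:public"],
            )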

      • jamiechong 10 years ago

        Who writes the magic (Piper)? Are these engineers essentially super admins?

  • Strikingwolf 10 years ago

    It sounds like a really awesome workflow, although parts of it really only work on very large teams. I bet the version control system is very cool, though. Hopefully it gets a GitHub equivalent and has better branching than Mercurial.

sytse 10 years ago

So a monolithic codebase makes it easier to make an organization-wide change. Microservices make it easier to have people work and ship in independent teams. The interesting thing is that you can have microservices with a monolithic codebase (as Google and Facebook are composed of many services). But you can also have a monolithic service with many codebases (like our GitLab, which uses 800+ gems that live in separate codebases). And of course you can have a monolithic codebase with a monolithic service (a simple PHP app). And you can have microservices with diverse codebases (like all the hipsters are doing).

I'm wondering if microservices force you to coordinate via the codebase, just like using many codebases forces you to coordinate via the monolithic service. Does the coordination have to happen somewhere? I wonder if early adopters of microservices in many codebases (SoundCloud) are experiencing coordination problems when trying to change services.

  • durin42 10 years ago

    Google has tons of services internally that talk via RPC. The monolithic repo means that it's much easier to hunt down and find people that are (say) using an outdated RPC method and help them fix their code.

    (Just one example of how it's useful even when things are mostly services.)

    • hebecb 10 years ago

      To add to your point: let's differentiate between a monolithic REPOSITORY vs a monolithic CODE BASE. Having all the code in a monolithic REPOSITORY allows developers to more efficiently explore and transform the code with a better understanding of the breadth of their impact.

      Whether the CODE BASE is monolithic or not is orthogonal to the repository's nature. I was at G for a couple of years, and I'd say they've done an OK job of breaking things up into libraries and services. Certainly some interfaces have done a better or worse job of setting the code up for open sourcing, but because of the nature of the repository, large-scale refactoring is more efficiently accomplished.

    • Touche 10 years ago

      This discourages you from ever making breaking changes to an API. On the face of it that sounds good but sometimes you do have to make breaking changes. My guess is that there are many duplicate projects in the Google code base for when a breaking change is needed. This is a way to sidestep the problem.

      • tuckerman 10 years ago

        I think sometimes it's actually the opposite. I was able to make a breaking change to an API and update all of the callers to use the new one in a single commit.

        Tests would run on all the clients and, since in my workspace the server was updated simultaneously, I could be more sure it would work.

        • Touche 10 years ago

          That's fine for small breaking changes like an API being renamed, but sometimes breaking changes require actually refactoring code, which is hard or impossible to do without intimate knowledge of a codebase.

          Think python 2->3 or Angular 1->2. These types of changes do happen, and I bet they happen at Google. I don't think anyone is rewriting a downstream app when they make these changes. Most likely they are doing something like forking the library and renaming it, which is just another form of versioning.

      • rictic 10 years ago

        As someone who's responsible for migrating the users of a large breaking change (~300 projects), it does discourage you from making a breaking change. I'm not convinced that this is a bad thing though. Breaking changes can be both expensive and tempting. By making the author of such a change feel the cost of it you prevent the migration work from becoming an externality.

        It also encourages writing tools for automating this stuff. A big part of the motivation for polyup (https://github.com/PolymerLabs/polyup) was the fact that we'd be responsible for getting people to upgrade.

  • scrollaway 10 years ago

    I'd be interested in an analysis of the meta-differences between those, but I think one of the main ones is that a monolithic codebase makes it massively harder to open source components. I'm sure that, working on GitLab, you can agree with that - if GitLab were a massive, monolithic codebase and you wanted to open source specific parts of it, it'd be a huge pain.

    • sytse 10 years ago

      Indeed, if you have a monolithic codebase, releasing your code for other people to use is much harder. In git it is possible to create a separate repository for it (using subtree and filter-branch). But the harder part is versioning and release management. At Google they avoid the cost of versioning their services and the release management around that, but this makes it really hard to have the outside world use their code. Maybe this is hard anyway, since their code probably depends on a lot of services common to Google (GFS, BigTable). This is probably why they can't release Borg/Omega but have to make Kubernetes instead.

  • sytse 10 years ago

    I think that advantages mentioned in the presentation about the monolithic codebase can be achieved if you have one source code server that is open to everyone (for example GitLab with most projects set to internal). Some of the tools will be easier to write for one repository than iterating over many, but that seems solvable.

    The biggest advantage seems to be that when you are an author of a dependency you can propose upgrades to all services that use your application. It is not clear to me but it seems that for small changes you can just force that change on the code owners. This ensures that the dependency author incurs the cost of a change (as is done for API changes in the Linux kernel) and that you do not need to version the API of the dependency.

    Interestingly, Google recently started marking APIs private by default. So they are moving in the direction of explicit API management.

    As soon as you work with people that are outside your control (as is common in open source) you would need to version the API as well in my opinion.

ChuckMcM 10 years ago

I will say that I saw and experienced many things that changed my definition of 'large' at Google, but the most amazing was the source code control / code review / build system that kept it all together.

The bad news was that it allowed people to say "I've just changed the API to <x> to support the <y> initiative; code released after this commit will need to be updated" and have that affect hundreds of projects. But at the same time, the project teams could adapt very quickly, with the orb on their desk telling them that their integration and unit tests were passing.

I thought to myself, if there is ever a distributed world wide operating system / environment, it is going to look something like that.

  • RogerL 10 years ago

    Can you (or somebody) explain how that works in practice?

    Here I am, ready to deploy some new feature to Gmail. In the meantime, I'm getting a steady stream of API changes. Can I build and release Gmail from a specific revision number, and only incorporate the changes when I am ready, or are all releases essentially off the tip of the tree?

    I don't need specifics, just the general idea. Where we work, basically every project lives in its own branch, which makes it essentially impossible to synchronize changes. Things have to get merged to trunk, then pulled into the appropriate branch, and I don't like it at all.

    Do you have version numbers for libraries/components (project X uses version 1.5.4 of Y, 32.4.18 of Z, and so on)? Do you pull by revision number, or are you all on tip?

    It'd be very interesting to hear how you manage this.

    • wsetchell 10 years ago

      In theory: Everyone works from head, so there is only one version of everything.

      When you write a feature, you follow these steps.

      1. Write code and submit to the main (and only) branch. Hide your feature behind a flag.

      2. Releases happen from head (with no regard to your feature).

      3. When you want to enable your feature, you flip a flag to turn it on.

      In practice: Things are messier.
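
      The flag part often looks something like this minimal sketch using the open-source absl flags library (the flag name and feature are made up):

          from absl import app, flags

          FLAGS = flags.FLAGS
          flags.DEFINE_bool("enable_new_compose", False,
                            "Serve the new compose UI instead of the old one.")

          def handle_request() -> str:
              # The new code path ships dark; flipping the flag turns the
              # feature on without cutting a new release.
              if FLAGS.enable_new_compose:
                  return "new compose UI"
              return "old compose UI"

          def main(argv):
              del argv  # unused
              print(handle_request())

          if __name__ == "__main__":
              app.run(main)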

    • Nitramp 10 years ago

      Yes, you sync your workspace to a particular revision and work within that. Syncing is a manually triggered process. And generally speaking, people changing APIs are required to find and fix all users (which isn't that hard with the infrastructure available). There is still some churn created by the constant stream of changes, but it's usually not at the changing-APIs level.

    • pacala 10 years ago

      Every feature has a flag that you can turn on/off to your heart's content. Releases are cut at head. No need to use the version control hammer to screw all the zippers.

      • randyrand 10 years ago

        Would that be analogous to every feature being its own git branch, where you choose which branches to merge when building?

        How are conflicts managed?

        • pacala 10 years ago

          Every feature is in HEAD, behind a flag. Precisely because managing conflicts cannot be automated, relying on branching is unwise. With HEAD based development, you manually solve conflicts exactly once, when submitting to HEAD, then you never have to think about them again.

          Local branches for development, a single HEAD for merging code.

        • electrum 10 years ago

          No, features are configurable at runtime. This allows A/B testing or slowly rolling out a new feature. Depending on the type of change, you might test it on 1% of users, 1% of servers, etc.

  • devit 10 years ago

    The solution to the excessive API change problem is to force whoever changes the API to fix all the consumers himself before the change is accepted.

    The Linux kernel generally uses this policy for internal APIs for example.

    • Lewisham 10 years ago

      We do, mostly. Because Piper is a global repository, we have systems to do global safe refactors, and do so often. If the API changes drastically, there's usually a lengthy deprecation period before the API is switched over.

    • stock_toaster 10 years ago

        > The solution to the excessive API change problem is to force
        > whoever changes the API to fix all the consumers himself
        > before the change is accepted.
      
      This doesn't seem scalable. Let's consider the case of one api endpoint being changed by one developer, to add a new param to a function call. Further assume that this impacts hundreds of projects.

      Does it really make sense to make one developer update those hundreds of projects? Not only will it take forever to finish (possibly never, if new consumers of this API keep coming online), but the developer of the core API may not have any experience with the impacted consumers. I think the end result of this policy would be that nothing, once written, would ever get updated, and new APIs would just be added all the time (API explosion).

      • XorNot 10 years ago

        Google's product explosions / surprise deprecations possibly hint that this is what happens? Changing the API becomes cumbersome, so you just make a new product with a new API to do an end-run around the requirement...

      • thrownaway2424 10 years ago

        It maybe isn't scalable, but that's part of the benefit. If you want to make a change to a widely-used API, it's going to be a lot of work, and it's not going to be a lot of work for the users of the API, it's going to be a lot of work for _you_ because _you_ are required to do it yourself. This prevents a lot of API churn unless the benefit is clear and sufficiently large.

        If it was any other way you'd rapidly reach a useless equilibrium where random engineers were demanding that thousands of other engineers fulfill unfunded mandates for what might turn out to be negligible benefits.

        • philwelch 10 years ago

          That's one extreme. Another extreme is that you have the API versioning from hell, where you can never get rid of technical debt because any and all API changes will break someone, somewhere, who has no reason to migrate, so you're left keeping ancient code on life support indefinitely.

      • masterj 10 years ago

        It's scalable with the right tools. If you can write a transform at the level of the AST that will make the change for you, you can do it in one commit. FB has written about this: https://medium.com/@cpojer/effective-javascript-codemods-5a6...

        Not that it's a silver bullet, but it can make a lot of these cases non-issues.
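
        A toy sketch of the idea using Python's standard ast module (real codemod tools are far more robust; names are made up, and ast.unparse needs Python 3.9+): mechanically rewrite every call of an old function to the new one, so a repo-wide change becomes a script run plus review rather than thousands of hand edits.

            import ast

            class RenameCall(ast.NodeTransformer):
                """Rewrites calls to old_name(...) into new_name(...)."""

                def __init__(self, old_name: str, new_name: str):
                    self.old_name, self.new_name = old_name, new_name

                def visit_Call(self, node: ast.Call) -> ast.Call:
                    self.generic_visit(node)
                    if isinstance(node.func, ast.Name) and node.func.id == self.old_name:
                        node.func = ast.copy_location(
                            ast.Name(id=self.new_name, ctx=ast.Load()), node.func)
                    return node

            source = "x = truncate(msg, 80)\n"
            tree = RenameCall("truncate", "shorten").visit(ast.parse(source))
            print(ast.unparse(ast.fix_missing_locations(tree)))  # x = shorten(msg, 80)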

      • nulltype 10 years ago

        It seems bad but the alternative (wait forever for groups that don't understand the change and may not even exist anymore to update their code) seems worse.

      • zenbowman 10 years ago

        If all your RPCs take a single parameter, this isn't a problem provided you use universal defaults.

    • revelation 10 years ago

      I guess that works for the Linux kernel, but I would presume that for a large distributed operation like Google it would be much better to simply deprecate/version APIs and have the project teams update to a deadline.

      I mean, it's presumably impossible to have a single computer running a single OS build all of the Google software and run the testing.

      • teraflop 10 years ago

        Google is rumored to have an extremely powerful distributed compilation farm. It wouldn't surprise me if a single developer could make a change that affects the entire codebase and test it themselves.

    • Touche 10 years ago

      > The solution to the excessive API change problem is to force whoever changes the API to fix all the consumers himself before the change is accepted.

      Having people unaware of a project's purpose making changes to its code sounds like a nightmare to me.

      • sowbug 10 years ago

        When that's the cultural norm, people adjust accordingly. Tools include liberal use of assertions, defensive tests, and most important, code reviews that catch and remove quirkiness.

        It's a nice environment to work in. In addition to hastening Noogler onboarding, it also increases employee retention. If you are an expert in your project's codebase but get burned out, you can easily transfer to another project and be almost immediately productive.

        Obviously, there's domain-specific knowledge that doesn't transfer easily or quickly from project to project. But that's quite different from self-inflicted code fragility; one's an asset and the other's a liability.

    • trollian 10 years ago

      This makes it harder to change APIs. Which has advantages and disadvantages.

  • comrh 10 years ago

    > With the orb on their desk telling them that their integration and unit tests were passing.

    Any more info on this? Is it in-house hardware?

    • thatoneguy 10 years ago

      IIRC, it was built on a hacked up Ambient Orb

      http://ambientdevices.myshopify.com/products/stock-orb

      • shostack 10 years ago

        Wow, $150? Shocked there isn't a cheaper USB version that is more easily hackable. I mean, is it really more than color-changing LEDs in a plastic sphere?

        • andor 10 years ago

          blink(1) is a small USB light to give you glanceable notice of anything on your computer or the internet. blink(1) makes it easy to connect data sources in the cloud or on your computer to a full-color RGB LED so you can know what’s happening

          https://blink1.thingm.com/

          • shostack 10 years ago

            Thanks for the link--looks awesome, wish the form factor were a bit larger and not so directly attached to the computer. I'll bet I could create an extension cable for it and rip off the housing to make the LEDs allow more light through, then put it in a translucent plastic sphere.

        • PhantomGremlin 10 years ago

          It's also a perpetual? subscription to a

             nationwide wireless network
          
          to obtain e.g. the latest stock prices.

    • ChuckMcM 10 years ago

      It started with a project that used the Ambient Orb to show test status. They were expensive, which led to a contest to make the cheapest one: basically an ATtiny with a software USB HID stack tied to an RGB LED. Henner Zeller won that contest, and I think maybe 5000 gOrbs were built.

    • travelton 10 years ago

      I built something similar as a hack project... Closet dome light fixture turned upside down. Rip out the innards. Install RGB LED. Connect LED up to Arduino and pull information from an API. Adjust LED based on your conditions. I spent around $75, but that was years ago, probably far cheaper now.

  • Strikingwolf 10 years ago

    That's actually a good question. What would a worldwide environment look like? Currently we have the web, but that isn't really a worldwide environment in this sense. I think this could be one way it would look, but more likely a combination of this and a way to split everything back down into parts without breaking the system. However, I've never worked at G, so I can't know if they have a method for this.

sshumaker 10 years ago

Xoogler here. There were tons of benefits to Google's approach, but they were only viable with crazy amounts of tooling (code search, our own version control system, the aforementioned CitC, distributed builds that reused intermediate build objects, our own BUILD language, specialized code review tools, etc).

I'd say the major downside was that this approach basically required a 'work only in HEAD' model, since the tooling around branches was pretty subpar (more like the Perforce model, where branches are second-class citizens). You could deploy from a branch but they were basically just cut from HEAD immediately prior to a release.

This approach works pretty well for backend services that can be pushed frequently and often, but is a bit of a mismatch for mobile apps, where you want to have more carefully controlled, manually tested releases given the turnaround time if you screw something up (especially since UI is really inefficient to write useful automated tests around). It's also hard to collaborate on long-term features within a shipping codebase, which hurts exploration and prototyping.

  • nulltype 10 years ago

    Could you elaborate how the single repo model causes that thing you said in the last sentence?

ksk 10 years ago

It's interesting that they compare LoC with Windows. I suppose this article wants us to be amazed at those numbers. However, my experience with Google's products indicates a gradual decline in performance and a simultaneous gradual increase in memory bloat (Maps, Gmail, Chrome, Android). Which, ironically, FWIW, hasn't been the case with Windows. I have noticed zero difference in performance going from Windows 7 to 8 to 10.

  • branchless 10 years ago

    I'd have to disagree with this. First, the baseline: Windows is very slow. Second, I found later versions slower. Third (and most maddening), every version of Windows I've ever used has gotten slower over time (even without installing new software, and with defragmenting).

    • sz4kerto 10 years ago

      Windows is slow? Compared to what? In what task? Running a game? Boot time? Opening Firefox?

      I have problems with Windows, but it's the fastest desktop OS, I think, mostly because its graphics stack is way the best of all. Running a number crunching C code is exactly the same on Windows or Linux. (See all the benchmarks on the Internet.)

      • buffoon 10 years ago

        It's really not that fast. The filesystem is a total dog (MFT contention), to the point that manipulating lots of small files is up to two orders of magnitude slower than ext4. This is made bearable thanks to SSDs being on the market. Also, the amount of friction in getting stuff built, running, and maintained is detrimental to general productivity, meaning you piss execution time out of the window regularly just fixing stuff.

        Note: windows programmer for 19 years now. Only because of the cash.

        • ksk 10 years ago

          I can't say I'm surprised to see people eager to point out how Windows sucks. And sure, maybe it does. However, the fundamental point you're missing is that I don't think Windows was ever positioned as an OS designed for every single type of workload out there (notwithstanding marketing noise). Windows is a very general-purpose OS meant for general-purpose 'mainstream' things. Things that hundreds of millions of people might want to do. Specialty workloads are simply not what Microsoft is ever going to invest any significant amount of time in - unless they see some money there. In that sense, Windows would probably be a far better OS if users could modify it to suit their needs, but thems the breaks. Linux seems to fill that void for some.

          The disadvantage of NTFS which you point out, isn't because of a fuckup. It's not designed for your use case. You might even find Microsoft telling you that themselves here :- https://technet.microsoft.com/en-us/library/Cc938932.aspx

          As to your point about productivity, I can't comment without knowing specifics. As a primarily C++ programmer, I haven't run into any Windows showstoppers that prevented me from shipping. I have run into showstoppers with their dev tools, but I see them as separate from the OS.

          • archimedespi 10 years ago

            At least you don't ever really need to defragment ext4, unlike NTFS.

            • thetruthseeker1 10 years ago

              Can you solve all the problems in ext4 that NTFS claims to solve? No, you can't. I am not saying either of the systems is perfect, nor that either of them is horrible. They are perfect for the use cases they are designed for. If somebody had a file layout that was unusual for NTFS (say, lots of small files), I think it was their mistake to treat it as a black box.

              • buffoon 10 years ago

                The problem is that a large number of small files is a very common use case. Even Windows itself consists of lots of small files i.e. the source code and WinSxS.

                It should handle general scenarios consistently. We've had a few minor versions of NTFS and now ReFS. ReFS should solve this but it doesn't as it's a copy and paste of the NTFS code initially rather than a complete reengineering effort.

            • RaleyField 10 years ago

              You don't need to defragment NTFS either; the defragmenter runs in the background. Defragmentation is more an issue of the driver than of the fs itself. I suspect ext4 does some form of defragmentation itself, otherwise I can imagine edge cases that could ruin its day.

        • NeutronBoy 10 years ago

            Note: windows programmer for 19 years now. Only because of the cash.
          
          You know why there's cash? Because Windows works for a lot of people.

          • buffoon 10 years ago

            I disagree. It isn't that it works well, but merely that it is there. In fact, for the majority of Windows networks, both corporate and small business, I can safely say that it barely works and is usually a mismanaged, unpatched mess, or filled with crapware. Occasionally there's a nice tight network (our operations guys run a tight ship, which is cool), but the general case is a pit of incompetence.

            The mantra among the consultants I've met in the UK is: if you're charging by the hour, do it in .NET on Windows. If you're charging a fixed rate, use Linux and Python.

            I'm not suggesting there is anything better for an end user but I'm pointing out that it doesn't work well enough.

            I still use it however and have a fondness. The accumulated knowledge of fixes is incredibly valuable.

            • NeutronBoy 10 years ago

                In fact, for the majority of Windows networks, both corporate and small business, I can safely say that it barely works and is usually a mismanaged, unpatched mess, or filled with crapware.
              
              And yet, no vendor can hold a candle to Active Directory, which is the single best thing about running Windows in an enterprise.

              You literally cannot manage SSO, patching, and config management in a non-Windows environment for more than a few hundred machines without the right tools. Shell scripts and Chef aren't going to cut it when you have 20,000 laptops to take care of.

              • buffoon 10 years ago

                You're 100% right. Sort of. I've built many an AD forest in my years, and it is certainly powerful. However, when your laptops drop off the domain after a month or so, things get hairy. Add that to numerous other problems, such as an entire network of 5000 machines with TrustedInstaller.exe hanging at 100% CPU, etc., and it's not all smooth. I've got my fair share of horror stories there.

                One of our enterprise clients has just bought 500 Chromebooks, and we integrated OpenID into our application: no one has to deal with AD, SSO is sorted, and there's zero management overhead. I really like this solution. If someone could build a standalone product of equal quality, it would destroy Microsoft overnight. Their ops team is 4 people, and two of them are network people who keep the pipes (and APs) working.

                And of course there is FreeIPA and PolicyKit as a contender but that's not really there yet.

                • ocdtrekkie 10 years ago

                  I'd really like to know how those 500 Chromebooks are doing in a year.

                  • buffoon 10 years ago

                    Well it's been two months and I haven't heard any complaining other than one ActiveX based drum scanning thing doesn't work. Turns out the scanner's direct descendant has a web API so we're integrating that.

                    To be fair they are financial point of sale machines so it's all process driven and they're not general purpose computers.

                    • ocdtrekkie 10 years ago

                      I'm just saying I'd like to see how it goes long-term. Particularly in terms of TCO. Google, AFAIK, doesn't have any vendors making Toughbook-style Chrome OS machines yet.

        • RaleyField 10 years ago

          > orders of magnitude slower than ext4.

          Haven't tested the ext4 and NTFS drivers directly against each other, but a useful trick if you ever need to copy millions of small files from NTFS is to mount it on Linux, because the Linux driver can work with it way, way faster than the Windows one.

      • deathanatos 10 years ago

        > Windows is slow? Compared to what? In what task?

        I dual booted a laptop for a while with Vista. (I can't speak to anything later, because I use Linux now, and haven't looked back, so take the appropriate grain of salt.) So with Vista / Gentoo on exactly the same hardware (a Lenovo T61):

        - boot time on Linux was orders of magnitude faster

        - WiFi AP connect was significantly faster[1], esp. on resuming from suspend-to-RAM

        - Windows had a tendency to swap things out if they weren't in use, and had to swap like crazy if you paged back to a program you hadn't used in a while; Linux, by comparison, will only swap if required to due to memory pressure.

        [1] i.e., WiFi was reconnected before I could unlock the screen. No other OS I've had has been able to do this, and it's bliss.

        > mostly because its graphics stack is way the best of all.

        Riiiight. The T61 had an nvidia in it, and it was fairly decent; drivers were decent between the two OSs, and performance on each was about on par with the other. (I used the proprietary drivers; nouveau performed unacceptably bad — bear in mind this was 7 years ago.)

        > Running a number crunching C code is exactly the same on Windows or Linux. (See all the benchmarks on the Internet.)

        This I will agree with; but what do you do after the number crunching? It's the scaffolding around the program that mattered to me: Linux has a real shell, with real tools. I can accomplish the odd task here or there. But yes, running a "number crunching C code" will perform about equally: you're really only testing the processor, maybe the memory — crucially, the hardware, not the OS.

        • branchless 10 years ago

          Not only is boot faster on Linux; when Linux boots, it's actually ready. When the Windows login screen comes up, you log in and then have to wait again.

          • deathanatos 10 years ago

            Well, in all fairness, isn't this true of Linux too? At login, only lightdm (or whatever) is ready; we've not yet loaded the WM, mate-panel, other random stuff, etc.

        • RaleyField 10 years ago

          > I dual booted a laptop for a while with Vista.

          To be fair Vista was the slowest NT 6+ OS, especially booting is way faster on Windows 8+.

        • izzydata 10 years ago

          To be fair, comparing vista to anything is hugely irrelevant at this point.

    • ksk 10 years ago

      I know that people have had experiences similar to yours. It's fine to disagree, but AFAIK pretty much all benchmarks show that there is no noticeable difference in performance from 7 to 8 to 10 and this matches with my own experience. I refuse to upgrade unless I get similar or better performance. But then again, I'm not really interested in researching every single benchmark. Windows is fast, stays fast, and that's pretty much all I care about.

      • trhway 10 years ago

        >Windows is fast, stays fast, and that's pretty much all I care about.

        are you in CO?

        • ksk 10 years ago

          what?

          • scott_karana 10 years ago

            I suspect he was trying to imply that you're high, since cannabis is legal in Colorado ("CO").

            Hence why his post is being downvoted.

            • ksk 10 years ago

              Yeah, I thought they were hinting at that, but its legal here in WA too.

    • ocdtrekkie 10 years ago

      8.1 and 10 run incredibly well even on very old hardware. I will agree a given Windows install may feel slower over time, and it makes sense to rebuild the PC occasionally, though that may, again, be less so with 8.1 and 10.

      • branchless 10 years ago

        Or if you are a non-techie it means forced upgrade is built into your product.

        • ocdtrekkie 10 years ago

          I think Microsoft has tried to address this with 8/10, which have a "refresh" feature, which tries to clean out everything besides Modern apps and your personal data.

          • branchless 10 years ago

            Glad they are addressing a fundamental flaw in windows albeit not in the root cause (the need for a "refresh").

            • ocdtrekkie 10 years ago

              I think it's gotten better over time, but non-technical users have an incredible willingness to push buttons they shouldn't. Unfortunately, your options are to either cripple your product's functionality (Chromebook) or accept that users have the power to do things that they shouldn't. I vastly prefer the latter, and I think the latter is better for consumers, in fact, as well. But to each their own.

    • ZanyProgrammer 10 years ago

      People defrag modern SSDs?

      • hirsin 10 years ago

        They better not. Doesn't improve anything and certainly wears it down. Hopefully the defrag tool recognizes an SSD and refuses to do it though.

lighthawk 10 years ago

"The two internet giants (Google and Facebook) are working on an open source version control system that anyone can use to juggle code on a massive scale. It’s based on an existing system called Mercurial. “We’re attempting to see if we can scale Mercurial to the size of the Google repository,” Potvin says, indicating that Google is working hand-in-hand with programming guru Bryan O’Sullivan and others who help oversee coding work at Facebook."

Why Mercurial instead of Git?

  • urda 10 years ago

    Because Google and Facebook are using Mercurial over Git internally.

    Edit: And for those who are just shocked that git isn't the answer:

    Facebook: https://code.facebook.com/posts/218678814984400/scaling-merc...

    Google: http://www.primordia.com/blog/2010/01/23/why-google-uses-mer...

    • Lewisham 10 years ago

      Well, Piper conforms to the Perforce API-ish, and Android and Chrome are both on Git.

      Mercurial was pushed internally as being the "better" (for some dimension of better) between it and Git back in 2010, but I think even the most hardline Mercurial fans have realized that in order to meet developers in the middle in 2015, we need to use Git for our open-source releases. We have a large investment in Gerrit [1] and Github [2] now.

      So the Mercurial comment is probably entirely based on scaling and replacement for the Piper Perforce API, rather than anything externally facing.

      [1] https://www.gerritcodereview.com/ [2] https://github.com/google

      • cap_theorem 10 years ago

        Though both Android and Chromium still build in a way more similar to that for a monolithic repo. They use repo and depot_tools, respectively, as abstractions on top of Git in order to clone and manage all their Git repositories together as if they were a single large repository.

    • cmrdporcupine 10 years ago

      I don't think I'm leaking anything by saying that that article about Google is incorrect. Mercurial isn't used at Google, not anywhere I've seen. Probably that article has something to do with the now-defunct Google Code site, which did in fact use Mercurial.

      • jordigh 10 years ago

        Google devs are working (with some help from Facebook) on getting hg into Google, but they're not there yet. Look at all the work that Google has been putting into hg. It's not purely out of the goodness of their hearts:

        https://selenic.com/hg/log?rev=@google.com&revcount=200

        I always see a lot of Google and Facebook at the Mercurial sprints.

        • cmrdporcupine 10 years ago

          I can't comment on what Google might be doing in the future with RCS. Well, I can't comment but also don't know. I personally like hg.

          But if I were to take a poll of the informed and opinionated engineers sitting around me, they would almost all likely take git over hg.

          In any case, git is in actual widespread use _now_ at Google, for Android and Chrome, and other open source stuff (some of which was moved recently from code.google.com to github).

    • spectral321 10 years ago

      Google isn't using Mercurial at any significant scale, but there is work being done to see if it can be made to work as a client for Piper.

    • jsolson 10 years ago

      There may be Mercurial plans for the future, but to date I've never encountered it at Google. Lots of people use git, though, including folks outside of Chrome and Android who are working against Piper.

    • thatoneguy 10 years ago

      I was at Google for a long time and it was almost entirely Perforce (some Googlers used git) until it was Piper (which is very similar to Perforce). L&S used CVS at Stanford for the original Backrub/Google code and for some time after the establishment of Google, Inc.

      • farresito 10 years ago

        Is Piper a wrapper around Perforce or built from scratch? What is it exactly, if you don't mind me asking?

        • thatoneguy 10 years ago

          It was built from scratch as a replacement for Perforce, because Perforce is expensive and the G used wrappers around it anyway.

          The commands are still basically the same as Google's wrappers around Perforce and the learning curve was non-existent for someone who was used to using Perforce at Google.

    • DannyBee 10 years ago

    There are also a massive number of git repos internal to Google as well.

    One thing mentioned in the paper, but not mentioned here, is that there are teams that live in Git, such as Android and Chrome, but they are not monolithic.

    • foobar2020 10 years ago

      How do you know that?

    • harveynick 10 years ago

      That's not true of Google. At least, not on any of the teams I've worked on. I use a Git wrapper around Piper.

  • sghi 10 years ago

    I believe that Facebook ran into problems scaling Git, outlined here - https://code.facebook.com/posts/218678814984400/scaling-merc...

    • rakoo 10 years ago

      It looks to me like Mercurial is easier to hack on than Git (clean Python vs. C), and in their initial state neither Mercurial nor Git was up to the task.
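
      To illustrate the "easier to hack on" point: a Mercurial extension is just a Python module that hg loads. A minimal sketch (the registration API has shifted between hg versions, so treat this as indicative rather than exact):

        # hello.py -- enable via an hgrc entry: [extensions] hello = /path/to/hello.py
        from mercurial import registrar

        cmdtable = {}
        command = registrar.command(cmdtable)

        @command(b'hello', [], b'hg hello')
        def hello(ui, repo, **opts):
            """say hello (demo command)"""
            ui.write(b'hello from a tiny extension\n')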

  • georgemcbay 10 years ago

    I've used git on repos far less than 2 billion lines of code and it was kind of a nightmare to clone and do basic local checkouts and normal everyday stuff. Out of the box, git's design is not well suited to large repos.

    I know there are various workarounds for dealing with large repos but having had some experience using git on big-ish projects I can certainly understand some possible reasons why it wasn't their first choice given the size of their codebase.

  • cynicalkane 10 years ago

    I don't work for FB, but word through the grapevine is that FB wanted to use Git, but the Git folks refused to talk to FB about making it scale. By contrast, hg was happy to have them as users and contributors.

  • grouseway 10 years ago

    Why not IBM? Why not Microsoft? Is that what the HN crowd has turned into? You now have to justify the non mainstream solution like a scared mega-corp career drone who will only go with the safest bet?

    Git isn't so awesome that it's inconceivable that people would be willing to use something else.

    • scrollaway 10 years ago

      > Is that what the HN crowd has turned into?

      It's a legitimate question, damn it. Especially if the chosen solution is not mainstream.

      You don't pick non-mainstream solutions "just because". You do pick mainstream solutions "just because". As in, "just because" they are tried and true, "just because" it's familiar to devs, etc.

      So let me ask the reverse question: Is that what the HN crowd [sic] has turned into? You can no longer ask the reasons behind your tech choices?

    • bigtunacan 10 years ago

      Personally; I wondered the same thing as the parent, why not Git? For me, this thought was not in a, "How could they possibly not choose Git!?", but more along the lines of, "There are a lot of smart people working at Google; I wonder what technical aspects of Mercurial might make it more suited to such a large code repository than Git?"

      • malkia 10 years ago

        I'm at Google now, but I'll share my opinion of Perforce from when I was at Treyarch - Perforce is simply very easy for anyone to use - from software engineer to game designer, scripter, artist, producer, technical writer, linguist, translator, etc. You can announce your changes by simply saying "it's submitted in CL 12345678 - grab it" and you are done. Eases the communication.

        Git and HG are not. Not that they can't be learned, but the learning curve is much higher.

        Piper and CitC (mentioned in Rachel Potvin's video) are even more advanced: I can work from my desktop machine, then go and continue from my laptop, or open and edit files directly from Critique in an internal web-based editor.

        Key to all this - make it accessible through the web! Anyone can then work from almost any machine. But at the same time the sources are accessible in the file system - even more awesome - so I can use Emacs and whatever other tools are there.

    • lighthawk 10 years ago

      > Git isn't so awesome that it's inconceivable that people would be willing to use something else.

      Inconceivable? No, I just asked why. Git to me is a great solution for source control, so I want to understand its deficiencies for handling large amounts of code, especially considering it is used for managing the Linux source and has handled everything I've used on GitHub quite well.

      • spectral321 10 years ago

        Git is fine at scaling horizontally (many repositories), so things like GitHub work well. Neither Git nor Mercurial does well at scaling to a single massive repository; getting Mercurial to do so was easier because we can implement it as out-of-core extensions (see http://bitbucket.org/Google/narrowhg, for example).

        There's also internal-only things to deal with, such as the CitC integration. Because of some of the design decisions behind CitC, storing a typical .git or a .hg directory in CitC is essentially impossible. Mercurial's .hg directory is intentionally a black-box - you interact with the repo using the hg tool. The .git directory can be seen as an API - there are at least three implementations that matter for our purposes, and if we change something in the .git directory, we lose the editor/IDE integration powered by the ones we didn't fix (or we did fix, but they haven't been released yet; some of those are linked into products that have a commercial release cycle).

k33n 10 years ago

Comparing "Google" to Windows isn't really a fair comparison. I'm sure all of the code that represents products that Microsoft has in the wild far exceeds 2B lines.

  • DannyBee 10 years ago

    Note that this is just the monolithic repository. Google also has other non-piper repositories containing hundreds of millions of lines too :P

    For example, android and chrome are git based.

    Note also that when codesearch used to crawl and index the world's code, it was not actually that large. It used to download and index tarballs, svn and cvs repositories, etc.

    All told, the amount of code in the world that it could find on the internet a few years ago was < 10b lines, after deduplication/etc.

    So while you may be right or wrong, I don't think you're as obviously right as you seem to think.

    • ocdtrekkie 10 years ago

      I'm still trying to figure out why having everything dumped in one big pile is something worth bragging about. I'd far rather have code sorted well into proper repositories.

      • DannyBee 10 years ago

        I don't have a strong opinion, but let me turn this around.

        Assume the code is sorted well (it is). Why does breaking them into "proper" repositories help anything?

        What does "proper" even mean when tons of stuff is shared?

        You act as if there is some obvious split that if they had "done it the right way" they would have done. That is 100% not obvious to me.

        How do you demarcate these lines, maintain things across these boundaries, etc.

      • BooneJS 10 years ago

        So much is shared, though, right? Which is why Android is sorted into proper repositories but still has the 'repo' front-end wrapper to make sure you're getting the right versions of everything you need.

        If I wanted to change something fundamental, like I found a 10% speedup in Protobuf wire decode by changing the message slightly, there are likely very many services that all need it.

        Everyone at Google operates on HEAD. You're not allowed to break HEAD, and pre-submit/post-submit bots ensure you don't and will block your submit.

        • ocdtrekkie 10 years ago

          From my perspective, at least, this design seems to explain why Google websites are so frequently broken in the ways different services integrate. Because Googlers edit shared resources that affect products they don't personally work on, and they just trust automated tests, which almost certainly miss a lot of the edge cases I encounter.

          I admit that I'm not an expert at large software development, but this seems to nearly fully explain Google's declining code quality.

          • DannyBee 10 years ago

            "I have literally no idea what i am talking about, but here's something that i believe fully explains every problem i have ever encountered" :)

  • guelo 10 years ago

    Agree especially since Google's repo contains their version of almost the entire Microsoft Office suite.

  • nickpsecurity 10 years ago

    I also agree especially as this includes the whole Google software ecosystem and Microsoft has their own ecosystem. Microsoft's whole ecosystem of products that work together and run on Windows is much larger than Windows itself. Might not be as monolithically developed, though.

  • bluedino 10 years ago

    Windows, Office, things like .NET, other sites like MSN and Bing...

yongjik 10 years ago

One humorous side-effect of having all that code viewable (and searchable!) by everyone was that the codebase contained whatever typo, error, or mistake you could think of (and convert into a regular expression).

I remember seeing an internal page with dozens of links for humorous searches like "interger", "funciton", or "([A-Z][a-z]+){7,} lang:java"...

  • wetmore 10 years ago

    > "([A-Z][a-z]+){7,} lang:java

    Yeah this one was my favorite of the code search examples, there are some really good ones in there.

    • cag_ii 10 years ago

      Can you explain this? It looks to me like a regexp that searches Java source for words 7+ characters that start with a capital letter?

      • yongjik 10 years ago

        It searches for CamelCase identifiers that are made of seven or more "terms", where each term is a capital letter followed by one or more lowercase letters.

        E.g., ProjectPotatoLoginPageBuilderFactoryObserver.

        (Disclaimer: I just made it up. Not an actual Google project name.)

        "lang: java" is not a part of regexp; just a Google code search extension that searches for Java.

      • robryk 10 years ago

        This searches for camelcase identifiers with at least 7 words.

  • bubersson 10 years ago

    I encourage everyone to search in their codebase for "1204", "521" and "265" :)

  • nandhp 10 years ago

    And then you killed off the public version and you keep that fun (and useful) toy to yourself.

    (But as great as Google Code Search was, my grudge is because of Reader.)

low_battery 10 years ago

Direct link to talk (The Motivation for a Monolithic Codebase ):

https://www.youtube.com/watch?v=W71BTkUbdqE

kazinator 10 years ago

I am unable to believe that Google has 2B lines of original code written from scratch at Google.

Maybe they are counting everything they use. Somewhere among those 2B lines is all the source code for Emacs, Bash, the Linux kernel, every single third-party lib used for any purpose, whether patched with Google modifications or not, every utility, and so on.

Maybe this is a "Google Search two billion" rather than a conventional, arithmetic two billion. You know, like when the Google engine tells you "there are about 10,500,000 results (0.135 seconds)", but when you go through the entire list, it's confirmed to be just a few hundred.

  • roxmon 10 years ago

    Google has been around for 17 years and employs roughly 10,000+ software developers. I think it's reasonable to assume that the 2B LOC metric is accurate...

    • hk__2 10 years ago

      Windows has been around for 35 years and Microsoft had 61,000+ employees (ok, that’s not only software developers and they don’t work only on Windows) in 2005; and it’s only ~50M LOC. I don’t think the number of years + developers really shows much; you don’t write new code every day.

      • scott_s 10 years ago

        You pointed it out yourself, but I think you underestimated its importance: Microsoft works on many other things. Office, XBox, Windows Phone, Exchange, SQL Server, .Net, etc. I suspect Microsoft's total line count is similar to Google's. The difference, however, is that it's not one codebase.

  • sp332 10 years ago

    Yes, that is counting everything. It's "the software needed to run all of Google’s Internet services" (so probably not Emacs, but the other stuff). But it's all in the repo, and it all has to be maintained.

  • hokkos 10 years ago

    > Google engineers modify 15 million lines of code across 250,000 files each week
    > Google’s 25,000 engineers

    So employees modify ~120 lines/day. If we imagine linear growth in headcount over 17 years up to 25K coders, with 250 work days a year they add around 6 more coders each work day, for about 55M man-days, so around 6.3G LOC modified. But modified != added, so I wouldn't believe this is all their own lines.
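
    The same back-of-envelope estimate in Python, using the article's figures plus the linear-growth assumption above:

      lines_per_week = 15_000_000
      engineers_now = 25_000
      years, workdays_per_year = 17, 250

      loc_per_day = lines_per_week / engineers_now / 5      # ~120 lines modified per engineer-day
      avg_engineers = engineers_now / 2                      # linear ramp from ~0 to 25K
      man_days = avg_engineers * years * workdays_per_year   # ~53M engineer-days
      total_modified = man_days * loc_per_day                # ~6.4 billion lines modified (not added)

      print(round(loc_per_day), round(man_days / 1e6), round(total_modified / 1e9, 1))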

  • hanwenn 10 years ago

    Most of the non-Google code is stored in a special subdirectory, and AFAICT it's less than 10% of the total.

hellbanner 10 years ago

"LGTM is google speak for Looks good to me" - actually common outside of Google.

a3n 10 years ago

In the spirit of "You didn't build that," I wonder how many lines of code comprise the binaries that Google binaries run on? Windows, Linux, network stacks, Mercurial, etc, etc.

I also wonder if there's a circular relationship anywhere in there.

  • Splines 10 years ago

    It's turtles all the way down, and also includes all the hardware and people.

sytse 10 years ago

The CitC filesystem is very interesting. This is local changes overlaid on top of the full Piper repository. Commits are similar to snapshots of the filesystem. Sounds similar to https://github.com/presslabs/gitfs

makecheck 10 years ago

I really wish there was a tendency to track all change/activity and not just total size; maybe like the graphs on GitHub. Removing things is key for maintenance and frankly if they haven't removed a few million lines in the process of adding millions more, they have a problem.

Having a massive code base isn't a badge of honor. Unfortunately in many organizations, people are so sidetracked on the next thing that they almost never receive license to trim some fat from the repository (and this applies to all things: code, tests, documentation and more).

It also means almost nothing as a measurement. Even if you believe for a moment that a "line" is reasonably accurate (and it's tricky to come up with other measures), we have no way of knowing if they're measuring lots of copy/pasted duplicate code, massive comments, poorly-designed algorithms or other bloat.

  • nhaehnle 10 years ago

    The article claims 2 billion lines of code across 25000 engineers, which boils down to 80k lines of code per engineer. I'm not sure what to think about that.

    It seems to be in a reasonable order of magnitude for C++/Java-type languages compared to projects that I have seen, but it does imply a significant chunk of code that is not actively being worked on for a long time (which is not necessarily a bad thing - don't change a running system and all that).

  • dekhn 10 years ago

    Although I agree that line counting is a silly exercise much of the time, the talk did cover change activity as well as total size.

    With regard to copy/pasted duplicate code and massive comments, we do have ways of knowing that as both of those are easily computable. Duplicate code can be matched using hashes and comments are delimited, making their measurement easy.

brozak 10 years ago

The comparison of Windows to all of Google's services is pointless and misleading.

It's like comparing the weight of a monster truck and the total weight of all the cars at a dealership...

temuze 10 years ago

Assuming these numbers are right...

(15 million lines of code changed a week) / (25,000 engineers) = 600 LOC per engineer per week

Is ~120 LOC per engineer per workday normal at other companies?

  • _delirium 10 years ago

    Elsewhere in this thread it's mentioned that Google makes use of large-scale, automated refactoring tools: http://research.google.com/pubs/pub41342.html

    Would be interesting to know what percentage of the total LoC touched are typically from that kind of automated refactor. Depending on the codebase, you can touch a ton of lines of code in a very small amount of time with those tools.
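
    As a toy sketch of the codemod idea (not the tooling described in the linked paper): even a simple mechanical rename applied across a whole tree touches a large number of lines with almost no per-line human effort. The file pattern and names here are hypothetical.

      import pathlib
      import re

      def rename_symbol(root, old, new):
          """Replace whole-word occurrences of `old` with `new` under `root`; return count."""
          replacements = 0
          for path in pathlib.Path(root).rglob("*.java"):   # hypothetical target tree
              text = path.read_text()
              new_text, n = re.subn(rf"\b{re.escape(old)}\b", new, text)
              if n:
                  path.write_text(new_text)
                  replacements += n
          return replacements

      # e.g. rename_symbol("src", "OldWidgetFactory", "WidgetFactory")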

  • ajg360 10 years ago

    I write between 400-600 lines of code a day where I work... I feel that 120 LOC a day is on the smaller side (of what I'm used to, anyway).

  • xur17 10 years ago

    It really depends on what you're writing. Lower level c / c++, doubtful. Python, javascript, java, etc, yeah, it's believable.

melling 10 years ago

I imagine that there's a lot of Java and C++. I do like Go but it makes you wonder if a more expressive language that requires a fraction of the code would be helpful. Maybe Steve Yegge will see Lisp at Google after all.

jakub_g 10 years ago

Some questions that immediately come to my mind:

- What is the disk size of a shallow clone of a repo (without history)?

- Can each developer actually clone the whole thing, or do you do a partial checkout?

- Does the VCS support a checkout of a subfolder (AFAIK mercurial, same as git, does not support it)?

- How long does it take to clone the repo / update the repo in the morning?

Since people are talking about huge across-repo refactorings, I guess it must be possible to clone the whole thing.

Facebook faces similar issues as Google with scaling so they wrote some mercurial extensions, e.g. for cloning only metadata instead of whole contents of each commit [1]. Would be interesting to know what Google exactly modified in hg.

[1] https://code.facebook.com/posts/218678814984400/scaling-merc...

  • thrownaway2424 10 years ago

    Most of your questions don't apply to the system described in this article. You do not clone the repository, you merely chdir into a vfs that is backed by a consistent view of the repository at a point in time, which view is served from a large distributed service that lives in Google datacenters alongside other Google services like Search, Maps, and Gmail. Because it is enormous and nobody clones it, it is also true that nobody partially clones it. You do not "checkout a subfolder" either.

    Your last point is the only one that applies. If you want your view to advance from revision 123 to revision 125 it takes about a second to do so. If you have pending (not yet submitted) changes in your client, they might have to be merged with other changes, which can take a bit longer. If you have a really huge pending change, and your client is way behind HEAD, it might take a few tens of seconds to merge everything.

  • bruckie 10 years ago

    Most of these questions are answered in the talk. The tl;dr is that you don't clone or check out anything at all: instead, you use CitC to create a workspace, and the entire repository is magically available to you to view or edit.

    This model precludes offline work, of course. But that's not much of a problem in practice.

    • jakub_g 10 years ago

      I did not follow the links in wired article, and didn't realize there was a link to a youtube talk. Thanks for tl;dr, need to watch the video!

  • lrem 10 years ago

    In practice: none of these operations take long enough to tempt you into alt-tabbing to cat videos.

therealmarv 10 years ago

What? This surpasses the complexity of the mouse genome. See this chart for comparison: http://www.informationisbeautiful.net/visualizations/million...

Strikingwolf 10 years ago

Really interesting article. Sounds like a great solution to the submodule problem in git. Definitely worth looking at. Thanks for posting, OP.

IMO this system would best be suited for large companies, but I could see the VCS that they are developing being used by anyone if it gets a github-esque website.

ilurkedhere 10 years ago

Yeah, but it's only like ~200 lines rewritten in Lisp.

  • juhq 10 years ago

    A serious question about Lisp and Google: is Lisp used within Google, and if so, in what projects and why?

Apocryphon 10 years ago

Looks like someone's going to have to update this: http://www.informationisbeautiful.net/visualizations/million...

buro9 10 years ago

This hurts just thinking about what the build, test and deploy systems must look like.

  • jsolson 10 years ago

    Well, for build, take a look at Bazel, though imagine it attached to a cluster of machines that can all read from Piper.
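
    For a flavor of what that looks like, here is a minimal BUILD file in Bazel's public Starlark syntax (the target and file names are made up): every target declares its sources and dependencies, which is what lets the build system work out exactly what a given change affects.

      cc_library(
          name = "stringutil",
          srcs = ["stringutil.cc"],
          hdrs = ["stringutil.h"],
      )

      cc_binary(
          name = "server",
          srcs = ["main.cc"],
          deps = [":stringutil"],  # cross-package deps look like "//path/to/pkg:target"
      )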

michaelwww 10 years ago

For those interested, the source analyzer Steve Yegge was working on called GROK has been renamed Kythe. I don't know how useful it turned out to be for those 2B LOC. http://www.kythe.io/docs/kythe-overview.html

Steve Yegge, from Google, talks about the GROK Project - Large-Scale, Cross-Language source analysis. [2012] https://www.youtube.com/watch?v=KTJs-0EInW8

Locke1689 10 years ago

What I'd like to know and no one seems to mention:

What's the experience like for teams not running a Google service and instead interacting with external users and contributors, e.g. the Go compiler or Chrome?

  • bruckie 10 years ago

    Many larger external projects are hosted in other repositories (Chrome and Android are well-known examples).

    Smaller stuff (like, say, tcmalloc or protocol buffers) is usually hosted in Piper and then mirrored (sometimes bidirectionally) to an external repository (usually GitHub these days).

    • Locke1689 10 years ago

      Thanks, but I guess I was asking more about how this affects the other development characteristics described. You still have to deal with the massive repository and infrastructure, but if you're Go, for example, and you want to change an API 1) you can't see the consumers because many or most won't be Google-internal, and 2) even if you could see them, you can't change them. Even the build/test/deploy systems are somewhat compromised because you can't rely on all builders of your components being Google employees and having access to those resources.

      So in these scenarios, what does Google's infrastructure buy you, if anything? And if it doesn't buy you anything, how does that influence Google culture? Are teams less willing to do real open development due to infrastructure blockage?

      • skybrian 10 years ago

        Working with multiple source control systems, multiple issue trackers, and multiple build systems has its challenges.

        It's true that you don't know about all callers if you're working on open source software. There's no magic there; you need to think about backward compatibility. (On the other hand, if it's a library, your open source users can usually choose to delay upgrading until they're ready, so you can deprecate things.)

        The main advantage for an open source project is that, though you don't know about all callers, you still have a pretty large (though biased) sample of them. If you want to know how people typically use your API's, it's pretty useful. Running all the internal tests (not just your own, but other people's apps and libraries) will find bugs that you wouldn't find otherwise.

        There were changes I wouldn't have been confident making to GWT without those tests, and bugs that open source users never saw in stable releases because of them. On the other hand, there were also changes I didn't make at all because I couldn't figure out how to safely upgrade Google, or it didn't seem worth it.

      • breakingcups 10 years ago

        I've read that in the case of Go, Googlers develop in the public repo of Go, which is periodically synced back into the Google tree.

        You can definitely see that Google uses a completely different build system than mainstream Go by the state of the mainstream Go build system though.

727374 10 years ago

Really? This article sounds very oversimplified, but I haven't worked at Google so I wouldn't know. I'm assuming if you want to change some much-depended-on library, there's a way to bump the version number so you don't hose all your downstream users. That's the way it worked at Amazon, at least. Also, I wonder why the people in the story think Google's codebase is larger than that of other tech giants, not that it really matters.

  • jsolson 10 years ago

    Google mostly works at HEAD. Very little is versioned, and branches are almost unheard of.

    In general you change the much depended on library and all of its consumers (probably over time in multiple changes, but you can do it in one go if it really needs to be a single giant change).

  • rictic 10 years ago

    It's incumbent upon the person updating the library to get all users migrated to the new one. There are a few strategies for doing this though, including temporarily having two versions of the library.

    There are also tools for making large scale changes safely and quickly.

  • devinj 10 years ago

    The whole point of one big repository is being able to avoid versioning and always work at head.

  • zBard 10 years ago

    Last I heard Google is still on Java 7 precisely because of this, although that might have changed. It's fun seeing the different theologies at Amazon and Google - I remember Yegge's famous platform rant, and he highlighted the Amazon versioned-library system as something which it did better than Google.

sandGorgon 10 years ago

What are the best practices to follow in a single-repo-multiple-projects world? Some people recommend git submodule, others recommend subtree.

How do you guys manage alerts and messages - does every developer get a commit notification, or is there a way to filter out messages based upon submodule?

How does branching and merging work?

I'm wondering what processes are used by non-Google/FB teams to help them be more productive in a monolithic repo world.

  • cmrdporcupine 10 years ago

    Generally branching isn't really a thing at Google. Work is done at the code review level per change list ("CL"). Most changes happen through incremental submission of reviewed CLs, not by merging in feature branches. Every CL must run the gauntlet of code review, as well as can not usually be submitted without passing tests. There are rare cases where branching is used, but not commonly.

    As for notifications, the CL has a list of reviewers and subscribers. If you want to see code changing, you watch those CLs. Most projects have a list where all submitted CLs go.

    • sandGorgon 10 years ago

      Can you explain this a little more - what is a CL vs a changeset...and what do you mean by watching changelists. It sounds like you're subscribing to specific commits...but I'm talking about more at a project/directory level within the monolithic repo.

      • devinj 10 years ago

        Changelist is Perforce-speak for changeset. Because each CL gets an individual review before applying to the codebase, there is no merge process -- there is no branching/merging.

        There is a solution for project/directory-level CC / review requirements. I didn't see it discussed in the talk, though.

  • ajross 10 years ago

    FWIW: git submodules are not a single repo by definition. It's just a way to automate the checkout of specifically-versioned external projects without requiring hackery like packing tarballs into the project source. It has its uses, but it's definitely not what they're talking about here.

    • luckydude 10 years ago

      Agree 100%. Git submodules are for tracking other stuff, not for doing dev on that other stuff.

      If you would like to see how things would work with submodules that behaved just like files behave (full distributed workflow) we've got a (unfortunately commercial) solution here:

      http://www.bitkeeper.com/nested

nemesisrobot 10 years ago

The comparison between the total LOC across all of Google's products and just one of Microsoft's is a bit unfair.

h1fra 10 years ago

The comparison with Windows is really just there to give the casual reader something to compare against, and it's not a very good one. An OS is a huge project, but Google has hundreds of different projects, APIs, libraries, frameworks... Even Unix, with an "unlimited" supply of developers, does not reach that point.

dchichkov 10 years ago

I remember somebody wise had said once: "Every line of code is a constraint working against you."

jfkw 10 years ago

How do the monolithic repository companies handle dependencies on external source code?

Are libraries and large projects e.g. RDBMS generally vendored/forked into the monolithic repositories, regardless of whether the initial intent is to make significant changes?

  • jpollock 10 years ago

    There's typically a subdirectory called third_party, with subdirectories for each vendor, product and version. If the team is smart, they will also enact a rule saying "only one version". If you're really, really smart, local changes are kept as a set of patches, keeping them separate from the imported tar file.

    So, for source deliveries:

      third_party/apache/httpd/2.4/release.tgz
                                  /patch.tgz
                                  /Makefile (or other config)
      third_party/apache/httpd/2.2/release.tgz
                                  /patch.tgz
      ...

breatheoften 10 years ago

Are the sources of Piper and the build tools also in the mono repo, and also developed/deployed off the head branch? Seems like a random engineer could royally fubar things if they broke a service which the build system depends on ...

  • thrownaway2424 10 years ago

    You said "developed/deployed" as if it were the same thing. Even if you somehow checked in the giant flaw, bypassing all code review and automated testing, it's not like that would suddenly appear in production. Google isn't some PHP hack where you just copy a tarball to The Server. Binaries of even slightly important systems typically go through many stages of deployment, first into unimportant test systems, then usually very, very slowly into production with lots of instrumentation and of course, quick and easy methods of rolling back to the previous release.

    • breatheoften 10 years ago

      I see - it was something of a half-baked thought, but in my defense I wasn't trying to suggest that I thought head was automatically deployed to production ... Deployed to testing round 1 ... N is still a "deployment", isn't it ...? The shared boilerplate for how that magic works in a scalable way for so many different projects must be quite complex and itself hard to test ...

  • QuercusMax 10 years ago

    Everything has to go through pre-submit checks before it makes it to HEAD. And if you get it past those and it starts breaking stuff, there are robots that will automatically roll back your change if it breaks enough stuff.

dblotsky 10 years ago

Even if the numbers are off, the assumption that 40M lines of code take less effort to write than 2B lines of code commits the fallacy that effort is proportional to number of lines of code. Come on, Wired, you can do better.

amelius 10 years ago

Is this article saying that all developer employees have access to the "holy" search algorithm internals? I can hardly believe that to be true, given the fact that SEO is a complete industry.

  • enf 10 years ago

    Once upon a time it was all in one repository. Shortly after I started there in late 2005, the "HIP" source code (high-value intellectual property, I think it stood for) was moved to its own source tree, with only precompiled binaries available to the rest of the company.

    Looks like there is a Quora question that mentions this too: https://www.quora.com/How-many-Google-employees-can-read-acc...

  • jsolson 10 years ago

    It is not saying that.

    FTA:

    > There are limitations this system. Potvin says certain highly sensitive code—stuff akin to the Google’s PageRank search algorithm—resides in separate repositories only available to specific employees.

    The vast majority of code is visible to everyone, though.

  • shampine 10 years ago

    No, it specifically says the opposite:

    "Potvin says certain highly sensitive code—stuff akin to the Google’s PageRank search algorithm—resides in separate repositories only available to specific employees."

known 10 years ago

How frequently does Google do code refactoring? https://en.wikipedia.org/wiki/Code_refactoring

rbinv 10 years ago

Those are mind-boggling numbers.

Although I kind of doubt that "almost every" engineer has access to the entire repo, especially when it comes to the search ranking stuff.

  • Lewisham 10 years ago

    FWIW, apart from the previously mentioned sensitive stuff, we give engineering interns the same level of access we give full-time engineers. We keep things open because it makes things faster; we have an excellent code search tool that's great for navigating through the Piper repo (e.g. finding subclasses, finding uses of an API) which really speeds up dev time.

    When we're not talking about the sensitive stuff, there's not much magic to what many engineers write every day, it's the same "glue technology X to technology Y" stuff you see everywhere, so I don't think there's any value to hiding that in the name of secrecy.

    • petra 10 years ago

      I thought Google's search advantage really speeds up software development. I wonder if they use the same principle across other engineering disciplines at Alphabet - creating unique knowledge tools - and what those look like.

    • maximilianburke 10 years ago

      How are changes that affect sensitive code handled? Are the owners of that code on the hook for making any API updates that the person pushing the change can't make?

      • Lewisham 10 years ago

        Having never worked on the secret sauce, I honestly don't know. There is a small team of people who tend to do many of the global refactors, I might expect that they are given special permission.

  • dfc 10 years ago

    Did you read the article? "There are limitations this system. Potvin says certain highly sensitive code—stuff akin to the Google’s PageRank search algorithm—resides in separate repositories only available to specific employees. "

  • packetslave 10 years ago

    I'm a random engineer and I have access to probably 99% of those 2B lines of code. The restricted stuff is a pretty tiny subset of the whole.

  • foobar2020 10 years ago

    > There are limitations this system. Potvin says certain highly sensitive code—stuff akin to the Google’s PageRank search algorithm—resides in separate repositories only available to specific employees.

    > (...) all 2 billion lines sit in a single code repository available to all 25,000 Google engineers.

  • qq66 10 years ago

    Why wouldn't almost every engineer have access to almost the entire repo? Most of Google's code is only relevant to another company working at Google scale, such as Facebook, Microsoft, Apple, etc. These are the companies with deep pockets that are willing to spend lots of money to acquire technology that will help them compete with Google. But none of these companies will buy code that's been stolen from Google.

    The most famous corporate trade secret, the Coke formula, was stolen by two employees who attempted to sell it to Pepsi. Pepsi alerted Coke, the companies worked together to bring in the FBI, and both employees went to prison: http://www.cnn.com/2007/LAW/05/23/coca.cola.sentencing/

  • kcorbitt 10 years ago

    From the article:

    "There are limitations this system. Potvin says certain highly sensitive code—stuff akin to the Google’s PageRank search algorithm—resides in separate repositories only available to specific employees."

  • afandian 10 years ago

    > Potvin says certain highly sensitive code—stuff akin to the Google’s PageRank search algorithm—resides in separate repositories only available to specific employees.

dblock 10 years ago

A giant repo works for Google, and works for Facebook, and Microsoft, but it's bad for the development community at large.

If you start centralizing your development you’re killing any type of collaboration with the outside world and discouraging such collaboration between your own teams.

http://code.dblock.org/2014/04/28/why-one-giant-source-contr...

wedesoft 10 years ago

With 2 billion lines of code I would consider the problem of developers stepping on each other's toes essentially solved.

rbanffy 10 years ago

What I find most distressing is that their Python code indents with two spaces... This is so wrong, Google.

izzydata 10 years ago

If they were to recompile all of it on a standard desktop PC how long would it take? A week?

sa2015 10 years ago

I wonder how close the "piper" system is to the code.google.com project.

  • DannyBee 10 years ago

    I worked on code.google.com; I can tell you they are 100% unrelated.

    Piper grew out of a need to scale the source control system the initial internal repositories were using.

    code.google.com was a completely separate thing supporting completely different version control models, and a very different scale (a very large number of small repositories, vs. a very small number of very large repositories).

  • a1k0n 10 years ago

    IIRC, Piper is a reimplementation of the perforce backend, in order to handle the code size and the sheer number of "changelists" submitted per second. Nothing to do with code.google.com.

  • spectral321 10 years ago

    They are unrelated. :)

MrBra 10 years ago

Am I the only one who initially read 28 instead of 2B ? :)

wellsjohnston 10 years ago

What is a "line of code"? out of the 2b lines of code google has, how much of it was auto-generated? how many of those lines are config files? This is a very silly article that has little to no value.

therealmarv 10 years ago

So they do not suffer from git submodules, I guess.

wgpshashank 10 years ago

Cool. How much of it is front end and how much back end?

rosege 10 years ago

How many lines is duckduckgo? :-)

  • creshal 10 years ago

    Can't be that many, given they outsource the actual search engine to third parties.

nootropicdesign 10 years ago

OMG it's all in one file? OMG OMG it's all on ONE LINE????!!!

Sven7 10 years ago

Now I know why my google plus page takes half a day to load.

aikah 10 years ago

lol git clone http://urlto.google.codebase.git ...

I wonder how much time it takes to clone the repo, provided they use git.

kuschku 10 years ago

This explains quite some things.

Still, this is not a very forward-thinking solution. Building and combining microservices – effectively UNIX philosophy applied to the web – is the most effective way to make progress.

EDIT: Seems like I misunderstood the article – from the way I read it, it sounded like Google has a monolithic codebase, with heavily dependent products, deployed monolithically. As zaphar mentioned, it turns out this is just bad phrasing in the article and me misunderstanding that phrasing.

I take everything back I said and claim the opposite.

  • thomashabets2 10 years ago

    That's why Google is so unsuccessful at scaling technical solutions, unlike you they're not forward-thinking.

    • kuschku 10 years ago

      No, it’s not that they are unsuccessful, it’s that they are unable to maintain it properly. Already today they have tons of open security issues.

      Or think about April 1st, when they set an Access-Control-Location: * header on google.com because someone wrote the com.google Easter egg.

      Read the post from the SoundCloud dude from yesterday to find out how to do software management properly (hint: modularization is everything)

      • cwyers 10 years ago

        Yes, I'm sure Google has a lot to learn from SoundCloud about how to deploy software at scale, like that time NASA got advice from Estes Industries on how to launch rockets.

      • captn3m0 10 years ago

        Do you have a source for the easter egg security issue? Haven't read about it anywhere, and can't seem to find anything either.

      • t0mk 10 years ago

        link to the post?

        Neither HN search nor Google search show anything to "modularization is everything".

      • bitmapbrother 10 years ago

        Why don't you list those "tons of open security issues" ?

  • zaphar 10 years ago

    Google runs practically everything internally as services. Nothing about the code repository makes it impossible to run microservices. Where did you get the idea that google runs a single monolithic app for everything?

    • kuschku 10 years ago

      The article claimed the code, and the way it's run, is a monolithically developed and deployed product.

      If that’s not the case, I apologize for misunderstanding it.

      But if it was the case, I wanted to state that it might not be wise, for the same reasons as this thread mentioned https://news.ycombinator.com/item?id=10195423

      EDIT: Thanks for telling me, though! Always nice to be proven wrong, as at least I learnt something today :D

      • Filligree 10 years ago

        It's not the case.

        You should think of Piper as a single filesystem which permits atomic multi-file edits. And that's about it; there's nothing in that which forces any particular release structure on you.

      • spectral321 10 years ago

        The talk did mention (briefly) that monolithic codebase and monolithic binaries/software aren't strictly related. It's likely that monolithic software is easier in a monolithic codebase, but I don't think that microservices are harder in a monolithic codebase. Yes, we tend to statically link, but that's for library dependencies (i.e. things like the protocol buffers libraries/definitions). I don't work at all in this area, so this is a guess, but I imagine that it's extremely rare for unrelated teams to have application logic linked into the same binary - just making a release would involve coordination from so many teams :)
