How a fix in Go 1.9 sped up our Gitaly service by 30x

about.gitlab.com

252 points by rfks 8 years ago · 71 comments

Animats 8 years ago

Each Gitaly server instance was fork/exec'ing Git processes about 20 times per second so we seemed to finally have a very promising lead.

What's really wrong here is that they're apparently spawning processes like crazy. Do they spawn a new process for each API call? That's like running CGI programs under Apache, like it's 1995.

  • revelation 8 years ago

    It's even worse. Gitaly is a program that takes loosely-validated externally-triggered requests and turns them into Git command lines to be exec()ed. So every API request transmutes its input into one or more Git command lines that are exec()ed, each one invoking fork() on the main massively-parallel Gitaly process (well, used to anyway).

    It's like a terrible China router firmware, without the C. Bonus points for every straightforward way of running a throwaway command on Linux invoking fork().

    I guess it's a good thing because it sets us up for another blog post once they learn of the latency gains to be had when you are not creating new processes on API requests. Hell, when someone starts looking into how this Git thing works, we might be in for a whole series.

    • arghwhat 8 years ago

      I suspect you are misunderstanding "exec" as "shell". China router firmwares call the shell.

      Putting arbitrary input into a shell is dangerous, as missed escaping can result in control of the shell.

      When you call exec yourself, however, you pass the individual arguments as a NULL-terminated list of strings (char*). There is no shell to abuse. Calling a process this way is about as safe as calling a function that takes strings for arguments. The function can still have vulnerabilities, but the act of calling it is safe.
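
      A minimal Go sketch of the distinction (the hostile "ref name" below is invented for illustration): os/exec hands each argument to execve(2) as its own argv element, so shell metacharacters are inert data.

```go
package main

import (
    "fmt"
    "os/exec"
    "strings"
)

// runSafe spawns a child the fork/exec way: each argument is one element
// of argv, passed to execve(2) verbatim. No shell ever parses the string.
func runSafe(arg string) string {
    out, err := exec.Command("echo", arg).Output()
    if err != nil {
        panic(err)
    }
    return strings.TrimSpace(string(out))
}

func main() {
    // With a shell ("sh -c ..."), this input would execute commands.
    // As an argv element, it is just bytes echoed back.
    hostile := "$(rm -rf /); echo pwned"
    fmt.Println(runSafe(hostile))
}
```

      Interpolating the same string into `exec.Command("sh", "-c", "git show-ref "+ref)` would reintroduce exactly the injection risk this comment describes.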

    • Animats 8 years ago

      Do they just do that for commands that make changes, or do they do it for pure read commands as well? Most of the volume is in reads, especially since so many build systems now read directly from GitHub.

      • revelation 8 years ago

        Here is the "reference exists" that the blog post alludes to:

        https://gitlab.com/gitlab-org/gitaly/blob/master/internal/se...

        I'm not an expert in this abomination, but it looks all the world like invoking "git show-ref --verify".

        (Of course a Git ref is usually just a file in .git with a SHA1 in it. They don't care about the SHA1, so really they are launching a Git process for a file exists operation. This used to take them 400ms, but now it's only 100ms!)

        • chriscool 8 years ago

          Git refs can be packed (see https://git-scm.com/docs/git-pack-refs) so it's not just about checking if a file exists.

        • lbotos 8 years ago

          Curious, how would you do it?

          (disclaimer: I work at GitLab, not on this project though)

          • revelation 8 years ago

            Someone mentioned libgit2, and that is a good first-step fix. There is no need to launch a separate Git process when you can call high-level libgit2 functions yourself. That alone eliminates this problem entirely, along with all the other ickiness that comes with running command-line tools programmatically.

            But ultimately this is the main Git interface for the remainder of the site, and it is apparently already sharded, so it only has to deal with a limited number of repositories. You can use libgit2 at a low enough level that you can just keep and mutate repository state in memory. Something like a "ref exists" check should be just a hash-table lookup, and there are a bunch of other commands where gains can be had when you are not starting from scratch on every API call.

            (This is what GitHub is doing. They started out with Grit, which was some parts of Git reimplemented in Ruby, launching git for the heavyweight stuff. Nowadays they use Rugged, the Ruby bindings for libgit2.)

          • rakoo 8 years ago

            If forking to a process is not good enough, I'd use a native, read-only library such as https://github.com/speedata/gogit

          • cstrahan 8 years ago

            In their case, why not stat the path directly?

            • masklinn 8 years ago

              Packs, most likely. On my main $work local repo, I have 27 refs and 22240 packed refs.

          • Animats 8 years ago

            Use something like boring old FCGI. FCGI spawns off N workers, which process one request at a time. If a worker crashes, it's restarted. Workers are usually restarted every once in a while to deal with any memory leaks.

            Read-only requests with no security implications get done in the worker process, which has read permissions for public files. More complex requests spawn a Git client program.

            It would be so uncool, though.

  • dchest 8 years ago

    Nothing wrong with it. In fact, I wish spawning processes was more common. It's beneficial for security.

    • sreque 8 years ago

      Spawning a process, even if it's just vfork() followed by exec(), is very expensive compared to spawning a thread; and even then, most apps create threads on startup and cache them for the lifetime of the application. Spawning a process 20 times a second honestly isn't terrible, but even hundreds per second can be noticeably slow IIRC.

      • azinman2 8 years ago

        I thought Linux treated forks and processes the same, whereas the former is just a special case?

        • saghm 8 years ago

          `fork` is how you create a new process on Linux. Threads, which GP was talking about, are different, especially Go threads, which are not the same as OS threads.

          • ereyes01 8 years ago

            I believe GP was referring to the `clone` system call, which both fork and pthread_create use under the covers. The difference is what happens to the memory-resident data: the forked child shares the address space of the parent until either one writes to a page, at which point the page is copied so that each process gets its own unique copy. This is known as copy-on-write (COW). Pthreads, on the other hand, outright share the process's address space and must implement their own synchronization via locks or whatever.

            Obviously, the performance of fork vs. pthread_create could differ dramatically depending on what the program does.

            Goroutines are a layer of abstraction above this. They might run on different threads concurrently; the Go runtime controls what happens here, and it may differ across architectures/OSes. If you break into a running Go program on Linux with gdb, there are definitely a bunch of pthreads running: maybe for goroutines, probably for garbage collection, and other stuff. (If you want to actually debug Go code, you should of course use something like Delve.)

            http://man7.org/linux/man-pages/man2/clone.2.html

    • kjksf 8 years ago

      Process isolation is good for security.

      Parsing text data in an ad-hoc, non-standardized, undocumented format is really bad for security.

      Just spawning a process creates as many security problems as it solves.

      If it were done right, it would look like Chrome's architecture, where untrusted, isolated processes do the dangerous work but communicate with a trusted process via a well-defined IPC protocol.

      • pjmlp 8 years ago

        Yes, but there is no need to spawn them all the time.

        A parser service daemon, or a pool of them can be used instead, getting requests from the main application process.

    • mschuster91 8 years ago

      > It's beneficial for security

      ... and for RAM usage. Java applications all have a tendency to bloat the longer you keep them running.

      • paulfurtado 8 years ago

        It's not bloat or memory leaks per se; the JVM just does not return memory to the OS after it is freed. To limit its memory usage, tune the heap size. To fully allocate the heap on startup for consistent usage, use -XX:+AlwaysPreTouch.

        • chinhodado 8 years ago

          Last time I checked, I still couldn't control the max heap free ratio, because apparently that option/flag just doesn't work with Java 8's default GC.

        • axaxs 8 years ago

          Java has always been 'use memory to increase speed...sometimes'. You can tune it some, sure, but that's what it's known to do.

        • nurettin 8 years ago

          Back when I tried running a JRuby application on 800 MB of RAM, it bloated, then started throwing "OutOfMemoryError"s and "Insufficient Class Space" or something similar. Apparently JRuby was generating too many new types at runtime to accommodate the Rails framework. The garbage collector was pretty garbage at its job back in 2011.

          • readittwice 8 years ago

            AFAIR JRuby creates a Java class for each Ruby method. Classes were part of the Permanent Generation, so sometimes this generation was too small, especially when using Rails. It was enough to just increase the size of this generation with -XX:MaxPermSize (-Xmx doesn't increase the permanent generation's size).

            The Permanent Generation was named that way since objects in there were never collected. For most applications this isn't really a problem. Running JRuby+Rails just allocated a lot of classes in this generation, so the default size was too small. But still, the permanent generation was quite small compared to the heap size.

            I wouldn't really call the GC bad because of this; IMHO they were already quite good back then. And in Java 8 the permanent generation was replaced with the Metaspace: objects in there can be freed, and the Metaspace can be expanded at runtime, so it's less likely you'll get these OOM errors for the permanent generation.

            • nurettin 8 years ago

              That was my experience as well. Increase maxpermsize and reboot every night.

      • arghwhat 8 years ago

        That's mostly just a Java thing, though. Java has a weird relationship with memory.

  • CapacitorSet 8 years ago

    I wonder how much it would speed up if they were using libgit2 directly.

    • jacobvosmaer 8 years ago

      I'm on the Gitaly team.

      As zegerjan wrote, Gitaly is a Go/Ruby hybrid.

      The main Go process doesn't use libgit2 (for now) because we didn't want to have to deal with cgo. We already know how to deal with C extensions in Ruby, and we have a lot of existing Ruby application code that uses libgit2, so we still use it there. And that code works fine so I don't see us removing it.

      In practice, sometimes spawning a Git process is faster than using libgit2, so why not do that? Also, for parts of our workload (handling Git push/pull operations), spawning a one-off process (git-upload-pack) is the most boring, tried-and-true approach.

    • sytse 8 years ago

      I'm at GitLab but not on the Gitaly team. I think we are using libgit2 but that it doesn't contain all the calls we need.

      • CapacitorSet 8 years ago

        Reading through an old issue (can't link right now, am on mobile), it seems that the main reason for not using libgit2 in Gitaly is performance, since it would read too many unused files.

    • zegerjan 8 years ago

      The Gitaly server has a Ruby component, but also a Go component. The Ruby server uses Rugged[1] and Gollum-lib[2] which both use libgit2.

      The Go component doesn't have libgit2 bindings yet, although we're looking into adding that later. That, or maybe go-git[3]. But for now Gitaly is mainly focused on migrating all Git calls out of the Rails monolith; not introducing a new component now reduces this project's risks.

      [1]: https://github.com/libgit2/rugged/
      [2]: https://gitlab.slack.com/archives/C027N716H/p151695430400026...
      [3]: https://github.com/src-d/go-git/

    • masklinn 8 years ago

      Assuming they're currently pure git, binding to libgit2 would require using cgo, which could have far-ranging (negative) consequences.

    • chriscool 8 years ago

      libgit2 has had bugs related to file locking (for example when you make some ref changes while garbage collecting) and libgit2 does not implement all the git features, so you cannot do everything using libgit2 anyway.

  • paulddraper 8 years ago

    I am tired of this "processes are expensive" bullcrap. (At least for Linux.)

        $ time seq 1000 | while read; do sleep 0 & done
    
        real        0m0.185s
        user        0m0.546s
        sys         0m0.265s
    
    That's less than .2ms to start a process.

    Processes give you operational control (CPU, memory, permissions, isolation, monitoring) that other constructs simply cannot. Decades ago, when we had far slower computers, people were doing process-oriented development and forking as if it were okay (CGI, make, git).

    Somehow, separate processes came to be avoided like the plague, when in reality, they are probably the smallest resource "waste" in 99% of systems.

    • arghwhat 8 years ago

      This is a terrible microbenchmark.

      First of all, you're only benchmarking the time it takes for fork(2) to return in the parent subshell, nothing else. The new processes don't exist yet at that point, and certainly haven't exec'd (which tends to be why you're forking).

      Second, you're not measuring the cost at all. The forked children will, at some point, start executing on other CPUs, which includes finishing setup and running exec, and that takes time. The cost is the total number of cycles spent before the child is executing the intended code.

      Fork is damn expensive, but whether it's too expensive depends on the use case and the cost of expanding hardware.

      Fork time scales with the virtual memory of the forking process, and you're forking from a fresh subshell that has hardly anything allocated. It's even mentioned in the linked post that their issue stemmed from this (specifically, fork lock contention spiking as fork time increased).

      • paulddraper 8 years ago

        (1) The benchmark measured the point of discussion.

        (2) Even without asynchronicity (which Go is heralded for), processes take <2ms to start and stop. Not nothing, but certainly something you could do hundreds of times a second.

            $ time seq 1000 | while read; do sleep 0; done
        
            real        0m1.644s
            user        0m1.065s
            sys         0m0.672s
        • arghwhat 8 years ago

          1. No, the point was that fork(2)+exec(3)/spawning processes is an expensive way to run code, not how long it takes for the parent to be able to do something else.

          2. Your new benchmark is better. However, it is still a useless microbenchmark, as it is an unrealistic best-case scenario. Your spawn of sleep is happening within a fresh subshell started by the pipe you made. fork(2) depends on things like VMM size and open file descriptors of the parent process, and your subshell basically has nothing at all. A real application likely holds at least a few gigabytes of virtual memory (more likely tens of gigabytes—note that virtual memory isn't the same as resident memory), which will make fork(2) take much longer, split between parent and child.

          I suspect you might be confusing asynchronicity with concurrency or parallelism. Go is heralded for concurrency, sometimes in the form of parallelism, but not asynchronicity. Concurrency does not have any positive effect on execution time or cost. Parallelism can reduce execution time, but does not decrease execution cost; it simply throws more hardware at the problem.

          In fact, Go is a worse-than-average language to call fork(2) in, due to it running fork(2) under a global lock. This is mentioned in the linked article. The lock contention caused by fork(2) execution time as memory consumption increased was what made the process unresponsive.

          However, as I also said, whether fork is too expensive depends on the use-case.

          • paulddraper 8 years ago

            > Each Gitaly server instance was fork/exec'ing Git processes about 20 times per second

            > What's really wrong here is that they're apparently spawning processes like crazy.

            Sounds like it depends on the use-case, rather than a blanket "two dozen processes per second is clearly absurd".

            • arghwhat 8 years ago

              Definitely. While fork(2) is expensive, a price is useless without also knowing the budget, and how expensive it is depends on the environment.

              However, the problem in the posted article was indeed that spawning Git processes 20 times a second in that specific Go application was too much, and the fix was that Go replaced fork(2) with posix_spawn(3).

_urga 8 years ago

We currently have the same problem in Node, where fork is still called synchronously from the event loop instead of asynchronously from the thread pool.

Calling exec() or spawn() in Node is therefore not asynchronous and can block your event loop for hundreds of milliseconds or even seconds as RSS increases.

https://github.com/nodejs/node/issues/14917

jsiepkes 8 years ago

"Recompiling with Go 1.9 solved the problem, thanks to the switch to posix_spawn"

I never understood why so many people use fork() instead of posix_spawn(). For example, OpenJDK (Java) also uses fork() as the default for starting a process, which leads to interesting results on an OS that doesn't do memory overcommitting, like Solaris: since the process briefly doubles in memory use with fork(), your process will die with an out-of-memory error.

  • nitwit005 8 years ago

    I thought of creating a fix myself way back, and the issue was that Go makes system calls directly; you basically have to re-implement posix_spawn in Go. If you look at their change, it includes updates to architecture-specific files, and the fix only seems to apply on CPUs that report as amd64.

    • ithkuil 8 years ago

      I must say I didn't go look at the sources of the patch, but what you say sounds so odd that I'll take a chance and suggest the likely source of confusion: in Go, "amd64" is, for historical reasons, the name of the architecture more neutrally known as "x86_64". That is, the fix doesn't only work on AMD chips, or on CPUs that claim/report a specific model/maker.

      Low level syscall ABI is architecture dependent.

      • laumars 8 years ago

        AMD64 refers to both Intel and AMD chips that are 64-bit x86. While you're right that the term "x86_64" is also in common use, AMD64 is actually the more standard name (as well as the term specifically used in the Go ecosystem, e.g. in the $GOARCH env var and in build parameters for cross-platform sources).

        Further to that point, I didn't detect any confusion from others in this thread that AMD64 excluded Intel chips. Where they were talking about AMD64-specific code, they were saying that Go code targeting other architectures (e.g. arm, mips, s390x and ppc, to name a few; Go supports an impressive number of architectures[1]) would still use its respective fork() code rather than this new fix.

        [1] https://golang.org/doc/install/source#introduction

      • linkregister 8 years ago

        amd64 is the original name of the instruction set. Intel did beat AMD to a 64-bit instruction set, with that of the Itanium processors, IA-64, but Itanium had performance issues and lots of errata. Most importantly, IA-64 was not natively backwards-compatible with x86 instructions. amd64 became the standard.

        x86_64 is a common name for the amd64 architecture, and is a way to describe both the AMD and Intel implementations. In my opinion, amd64 is a less ambiguous name and is more historically accurate.

        https://en.m.wikipedia.org/wiki/X86-64

        Yes, I am aware that my point is undercut by the fact that the article title is x86-64, but I stand by my statement.

        • cesarb 8 years ago

          It's not just the article title. Follow the two footnotes in the "History" section of that article, to the press releases from AMD announcing the new ISA. They consistently call it "AMD x86-64" or "AMD's x86-64" or just "x86-64". The oldest snapshot I could find of the x86-64 web site (https://web.archive.org/web/20000817014037/http://www.x86-64...) also calls it x86-64. The most recent snapshot of that site, however, calls it AMD64; it seems to have changed sometime in the middle of April 2003.

          That is, both x86-64 and AMD64 are historically accurate (2003 was early enough in the ISA's lifetime), but x86-64 is the earlier name.

    • AnIdiotOnTheNet 8 years ago

      Go uses system calls directly because the alternative in UNIX land is linking with libc.

  • revelation 8 years ago

    Because every straightforward way of running an external command on Unix involves fork(). So someone wrote that API not thinking much of it.

    Then, shock horror, they realize running a throwaway command fork()s the main process. But now everyone is too angsty to change it because someone out there might rely on the environment-copy functionality, even when they shouldn't.

  • dboreham 8 years ago

    Because decades of written material about Unix says fork() is really cool (even though it isn't)?

    • f2f 8 years ago

      fork was alright before people tacked multiples of cruft like threads and whatnot onto commercial unixes and they became mainstream. the current problem is that you don't want to have to copy all file descriptors if all you're going to do is call "exec" and reduce them to three: in, out, err.

      for example, here's the caveats section from the macOS fork man page:

           There are limits to what you can do in the child process.  To be totally safe you should restrict
           yourself to only executing async-signal safe operations until such time as one of the exec
           functions is called.  All APIs, including global data symbols, in any framework or library should
           be assumed to be unsafe after a fork() unless explicitly documented to be safe or async-signal
           safe.  If you need to use these frameworks in the child process, you must exec.  In this
           situation it is reasonable to exec yourself.
      
      That spells defeat :)

      Earlier in the game, copy-on-write had to be created for the same reasons.

      • cstrahan 8 years ago

        > the current problem is that you don't want to have to copy all file descriptors if all you're going to do is call "exec" and reduce them to three: in, out, err.

        To be clear, exec does not necessarily close all but the first three fds -- by default all fds will be inherited. However, you can set the close-on-exec flag on each individual fd (in fact, that's what the Go stdlib does behind the scenes).

        Search for FD_CLOEXEC in fcntl(2) and open(2) and you'll see what I'm referring to.

        http://man7.org/linux/man-pages/man2/fcntl.2.html

        http://man7.org/linux/man-pages/man2/open.2.html

  • ploxiln 8 years ago

    fork() is a pretty simple way to modify the environment for a process you're about to spawn: after fork(), the child can modify its own environment using various orthogonal system calls, e.g. to redirect stdout/stderr or drop permissions, and then exec the target executable.

    Threads throw a wrench in things, but fork() existed for decades before threads. O_CLOEXEC etc. help, and lots of command-line utilities don't use threads.

    fork() isn't the fastest way - but in many situations it's not a problem, it's just convenient. In that respect it's somewhat like using python when you could have used go.

    • jacobvosmaer 8 years ago

      Another nice example is changing the working directory for the new process. With fork+exec, you can do a chdir after fork but before exec. With posix_spawn you're stuck with the working directory of the parent.

  • khc 8 years ago

    because posix_spawn() on Linux often calls fork(). I just looked at the manpage now, and it says that under some conditions it'll call vfork() instead, but I don't remember that being the case when I last looked at this (6-7 years ago?)

    • noselasd 8 years ago

      Nowadays, posix_spawn() calls clone() on Linux with the CLONE_VM flag, behaving much like vfork() as far as I can tell.

      That means the child and parent process share memory (until exec() is performed).

      Especially if the parent process is multi-threaded, this avoids a whole lot of page faults that would occur with fork() when another thread touches memory, possibly triggering a lot of copy-on-writes in the time window between the fork() and the child's exec().

      Code: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/uni...

kevincox 8 years ago

It's often useful to remember the other point of view when you see huge performance differences.

> A bug in Go <1.9 was causing a 30x slowdown in our Gitaly service.

empath75 8 years ago

Related post from a few weeks ago:

Fork is not my favorite syscall:

https://news.ycombinator.com/item?id=16068305

zebra9978 8 years ago

Are you guys planning to migrate gitlab to golang ? I think the biggest feature that everyone wants is better performance.

Is the migration path that tough ?

  • connorshea 8 years ago

    There are no plans to migrate all of GitLab to Go; the main Rails app is going to stay a Rails app for the foreseeable future. There are a few reasons for this: for one, it'd be a huge project, but also Rails is working well for us and is great for our pace of feature development.

    We are working on moving the Git layer to Gitaly[0], which is written in Go (and is what this blog post is about). It was one of our major bottlenecks, and we've seen a lot of benefit from having made the switch. It's not done yet, but a lot of the application's calls to git now go through Gitaly.

    [0]: https://gitlab.com/gitlab-org/gitaly

  • romanovcode 8 years ago

    They have so many features that I don't see it happening ever.

    • carussell 8 years ago

      They wouldn't need to stop the world and do a full rewrite. It would be feasible if they stopped writing new components in Ruby and began replacing the existing parts piecemeal.

tuna 8 years ago

Pretty sure that using the stdlib and limiting shelling out from Go would help performance. Case in point, forking "du":

https://gitlab.com/gitlab-org/gitaly/blob/master/internal/se...

stonewhite 8 years ago

> Having solid application monitoring in place allowed us to detect this issue, and start investigating it, far earlier than we otherwise would have been able to.

Yet apparently nobody caught or investigated the latency spike after the previous deployment.

haikuginger 8 years ago

The article linked to a set of posix-spawn benchmarks[1] that seemed to indicate that while fork/exec time scaled linearly with resident memory on Linux, it did not scale at all on macOS.

First of all, I was somewhat confused by that due to the availability of copy-on-write; I wouldn't have expected fork/exec time to scale up that way.

Second, I was surprised that there wasn't an attempt to explain the behavior difference between the two systems. Can someone familiar with either or both point towards an explanation for why that's the case? It seems very odd.

[1]: https://github.com/rtomayko/posix-spawn#benchmarks
