Conservative GC: Is It Really That Bad?

excelsiorjet.com

85 points by dleskov 8 years ago · 47 comments

barrkel 8 years ago

The problematic code in the article is, AFAICT, using an object finalizer to free manually allocated memory; such approaches seldom work well, even with precise GCs.

Thread stacks are effectively manually allocated blocks of memory. You create a thread, which allocates the stack, and as long as the thread lives, the stack is kept alive - it's self-sustaining. The thread must die by explicit programmatic action, which in turn will free its allocated block of stack memory.

Using finalizers at all is usually an anti-pattern in a GC world. The presence of finalizers is a very strong hint that the GC is being used to manage resources other than memory, something that GC is a poor fit for, because other resources almost certainly have no necessary correlation with memory pressure; and GCs usually only monitor GC heap memory pressure.

That's not to say that there aren't plenty of edge cases where you can end up with lots of false roots that artificially lengthen object lifetimes with a conservative GC. Putting a thread stack in your cycle of object references and relying on GC pressure to break the cycle isn't a strongly motivating one to my mind, though.

  • dbg_nsk 8 years ago

    Absolutely agree with you about finalizers! However, please note that this "threadReaper" code is from a JDK class, so the problem can appear in any application that merely uses the Timer class.

    Of course, there are many other examples of false roots, but this particular class caused unexpected OOMs in several of our clients' applications, so we made this small sample and used it for sanity checking while implementing the precise GC (and then mentioned it in the post).
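
    For reference, the idiom looks roughly like the sketch below. This is my simplified reconstruction, not the actual JDK source: the worker thread references only a shared queue, never the timer object itself, so the finalizer of a small helper object can shut the thread down once the timer becomes unreachable.

        // Simplified sketch of the java.util.Timer "threadReaper" idiom (my
        // reconstruction, not the actual JDK source). The worker thread references
        // only the shared queue, never SimpleTimer itself, so once a SimpleTimer
        // becomes unreachable the reaper's finalizer can tell the worker to exit.
        // A conservative GC that mistakes a stale stack word for a reference to the
        // SimpleTimer keeps the thread, and its stack, alive indefinitely.
        final class SimpleTimer {

            // Shared between the owner and the worker thread.
            static final class TaskQueue {
                boolean newTasksMayBeScheduled = true;
                // ... pending tasks would live here ...
            }

            private final TaskQueue queue = new TaskQueue();
            private final Thread worker;

            // Finalized only when the SimpleTimer is unreachable; shuts the worker down.
            private final Object threadReaper = new Object() {
                @Override protected void finalize() {
                    synchronized (queue) {
                        queue.newTasksMayBeScheduled = false;
                        queue.notify();
                    }
                }
            };

            SimpleTimer() {
                TaskQueue q = queue;              // the worker captures the queue only
                worker = new Thread(() -> {
                    synchronized (q) {
                        while (q.newTasksMayBeScheduled) {
                            try { q.wait(1000); } catch (InterruptedException e) { return; }
                            // ... run tasks that have become due ...
                        }
                    }
                }, "timer-worker");
                worker.setDaemon(true);
                worker.start();
            }
        }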

    • yxhuvud 8 years ago

      I take it you are one of the people behind the article? Did you ever see the papers about a conservative variant of Immix, which would be both compacting and conservative?

      • dbg_nsk 8 years ago

        Yes, I wrote this article.

        No, I haven't read those papers before, but after your comment I took a look at one of them... and it is very interesting! Thanks for mentioning it.

        But, if I understood correctly, the excess memory retention is still a problem in conservative immix?

        • yxhuvud 8 years ago

          It is mentioned in one of the papers (or possibly the doctoral thesis, which is nice as it gives an overview of the whole chain of improvements) that it can still happen, but that it is much better compared to what they benchmarked it against (Boehm, IIRC). Hard to tell if it delivers what it promises, though.

          • dbg_nsk 8 years ago

            It would be very interesting to run our sample with Timers on this GC. In that paper I found a link to their implementation, but unfortunately it is unavailable.

            • yxhuvud 8 years ago

              The link is dead, but the code is on GitHub somewhere. Dunno where I found the link to it, but I am away from my computer for a few days so I can't really look it up for you.

    • barrkel 8 years ago

      I didn't know it was from the JDK, interesting.

  • kllrnohj 8 years ago

    Your argument against finalizers completely ignores that FFI is a thing. Finalizers can be managing things that are memory, just not memory that the GC allocated because it is coming from a foreign system.

    You also have to use them as a safeguard against programmers who, not being used to manually managing scope, fail to do so. In some cases it's also just not a practical expectation, as the language is structured entirely around the idea that you aren't supposed to be manually managing scope. An open file that weaves its way throughout an application as it's a constant read/write backing for something? That can often be non-trivial to figure out when to call close() on it exactly, and by the time you've written such a system you've just re-invented manual reference counting (error prone) or a garbage collector of some sort anyway, and should have just used a finalizer.

    Calling finalizers an anti-pattern only works to the extent that you can ban everything not controlled by GC'd world. Which would be great, except that you can't even do "hello world" with that constraint.

    • barrkel 8 years ago

      > Your argument against finalizers completely ignores that FFI is a thing

      Nope. Memory that is indirectly allocated via FFI is not normally[0] accounted for by GC memory pressure and so it should be managed explicitly, not using finalizers. That memory is invisible to the GC. It won't know to collect it. It won't know when the foreign heap has allocated too much, and it won't know to run a more expensive GC collection to try harder when foreign space gets tight.

      (If you have a relatively infinite amount of RAM, or your FFI object sizes are a constant factor of the size of their GC world counterparts and you account for this in your GC max heap size, you may get away with it. But these constraints aren't typical.)
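
      To make "manage it explicitly" concrete, here is a minimal sketch (my own illustration, assuming hypothetical JNI bindings; nothing here is from the comment): the foreign allocation is owned by a wrapper whose close() releases it deterministically, so its lifetime never depends on GC heap pressure.

          // A sketch of explicit management of foreign memory (my illustration, not
          // code from the comment). nativeAlloc/nativeFree are hypothetical JNI
          // bindings to malloc/free. The point is that release happens in close(),
          // deterministically, rather than in a finalizer driven by GC heap pressure
          // that cannot see this memory. (Newer JDKs offer java.lang.foreign.Arena
          // for the same explicit-lifetime style.)
          final class NativeBuffer implements AutoCloseable {
              private static native long nativeAlloc(long size); // hypothetical JNI binding
              private static native void nativeFree(long addr);  // hypothetical JNI binding

              private long addr;                                  // 0 means "already freed"

              NativeBuffer(long size) {
                  addr = nativeAlloc(size);
              }

              long address() {
                  if (addr == 0) throw new IllegalStateException("closed");
                  return addr;
              }

              @Override
              public void close() {
                  if (addr != 0) {
                      nativeFree(addr);                           // deterministic release
                      addr = 0;
                  }
              }
          }

          // Usage:  try (NativeBuffer buf = new NativeBuffer(4096)) { ... buf.address() ... }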

      > That can often be non-trivial to figure out when to call close() on it exactly, and by the time you've written such a system you've just re-invented manual reference counting (error prone) or a garbage collector of some sort anyway, and should have just used a finalizer.

      You're right that it can be hard to figure these things out. But figure them out you must, for any long-lived program using scarce resources, or you're just creating a problem for the future.

      Working these things out is not actually rocket science. It's what we did in the days before GC, and it was actually tedious more than difficult, because the best way to do it meant dogmatically following certain idioms when programming, being rigorous in your error handling and consciously aware of ownership semantics as data flowed around the program.

      We even developed a bunch of strategies to make it simpler, e.g. arena allocation and stack marker allocation. You can adapt these approaches for deterministic resource disposal in GC environments too (e.g. keep a list of things to dispose, or markers on a stack of things to dispose).
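
      And a hedged sketch of the "keep a list of things to dispose" idea (again my own illustration, nothing specific from the comment): a scope that closes everything registered with it, in reverse order, when it ends, giving arena-like deterministic disposal inside a GC'd language.

          // A sketch of deterministic, arena-style resource disposal in a GC'd
          // language (my illustration): resources are tracked as they are acquired
          // and closed in reverse order when the scope ends, independent of GC.
          import java.util.ArrayDeque;
          import java.util.Deque;

          final class DisposeScope implements AutoCloseable {
              private final Deque<AutoCloseable> owned = new ArrayDeque<>();

              // Register a resource; the scope now owns its disposal.
              <T extends AutoCloseable> T track(T resource) {
                  owned.push(resource);
                  return resource;
              }

              @Override
              public void close() throws Exception {
                  Exception first = null;
                  while (!owned.isEmpty()) {                 // reverse acquisition order
                      try {
                          owned.pop().close();
                      } catch (Exception e) {
                          if (first == null) first = e; else first.addSuppressed(e);
                      }
                  }
                  if (first != null) throw first;
              }
          }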

      The biggest wins from GC are from two effects: memory safety and expression-oriented programming. Memory safety means you never get dangling pointers or dynamic type errors from reused memory, a major increase in reliability, as well as making certain types of lock-free programming much easier. Expression-oriented programming means you can safely and easily write functions that take complex values and return complex values without thinking too hard about the ownership of these complex values. This in turn lets you program in a functional style that is much harder without GC.

      What GC doesn't give you is a world free of non-functional requirements. You still need to know about memory allocation of your big algorithms and program overall; you need to know where big object graphs get rooted for longer periods of time before dying (the middle age problem[1]), and you need to track the ownership of your resources rigorously, or you will run into resource exhaustion and non-deterministic failure modes - some of the worst kinds of failures.

      [0] https://docs.microsoft.com/en-us/dotnet/api/system.gc.addmem...

      [1] https://blogs.msdn.microsoft.com/ricom/2003/12/04/mid-life-c...

      • kllrnohj 8 years ago

        > Nope. Memory that is indirectly allocated via FFI is not normally[0] accounted for by GC memory pressure and so it should be managed explicitly, not using finalizers. That memory is invisible to the GC. It won't know to collect it. It won't know when the foreign heap has allocated too much, and it won't know to run a more expensive GC collection to try harder when foreign space gets tight.

        I'd call that a bug in the GC's design and a problem with its particular implementation of finalizers. Not a problem with the concept of finalizers, and indeed not a problem with all GCs/finalizer systems, as you linked. Android's runtime also allows for finalizers to pressure the heap to avoid this problem: https://android.googlesource.com/platform/libcore/+/master/l...

        > Working these things out is not actually rocket science. It's what we did in the days before GC

        Non-GC languages have facilities to help with this. GC'd languages don't.

        The presence of the GC makes this problem worse by design, so you can't just pretend that the GC isn't involved here. It is. It changed the design of the language. It makes manual tracking harder than it otherwise would be. Therefore it is part of the GC's responsibility to help with the problem it caused.

  • garethrees 8 years ago

    This is right: finalizers are difficult to combine with garbage collection for several reasons. As you say, there's no guarantee that the garbage collector will find the finalizable object in a timely manner. Additionally, if it's a conservative collector, then the finalizable object may be pinned by an ambiguous reference and the collector will be unable to free it. In multi-threaded programs the collector runs asynchronously from the point of view of the code, so if the garbage collector calls the finalizer then the finalization code may not be able to depend on invariants of data structures. It is hard to write correct finalization code under these conditions.

    A 2002 technical report by Hans Boehm discusses these difficulties. http://www.hpl.hp.com/techreports/2002/HPL-2002-335.html

  • kazinator 8 years ago

    By your reasoning, everything is manually allocated. When we call (cons 1 2) in Lisp, that's a manually allocated cons. Most objects come into life due to some "manual" construction in the program!

    > The thread must die by explicit programmatic action, which in turn will free its allocated block of stack memory.

    Simply, no. A thread can recurse through some function activations and then hit an exit statement which terminates it without destroying those activations. Other threads can have pointers into that thread's stack, so the stack must not be reclaimed until they let go.

    For instance, a thread allocates a reply box on the stack and calls some message passing API to send a message, whereby it registers the reply box. That API now has a pointer into the thread's stack. Suppose the thread dies before unregistering the reply box. If that can happen, the stack has to stick around. When the API tries to reply to the thread, and finds that the thread is dead, it can dequeue the reply box at that time; then the stack becomes unreachable. If the stack is reclaimed before that, this messaging API will be traversing bad pointers through the registered message box.

    > because other resources almost certainly have no necessary correlation with memory pressure

    This is true and there is a way to regard finalization as decoupled from GC.

    I have experience implementing an object system in which finalization is treated similarly to C++ destructors.

    Objects can expect to have their finalizers explicitly invoked even when they are still live, long before they have become garbage.

    One situation in which that happens is if an exception is thrown during the construction of an object.

    I also have scoped reclamation construct with-objects which implements RAII. For instance:

        (with-objects ((foo (new foo ...))
                       (bar (new bar ...)))
          ...)
    
    The finalizers of foo and bar are invoked when with-objects terminates. Additionally, if foo throws during its construction, its finalizer is called, and if bar throws during its construction, both are called.

    Now if those finalizers are not invoked by the time those objects become garbage, then GC will invoke them. Basically a finalizer is invoked and then removed, so if called early, then GC doesn't call it any more.

    Situations in which it's undesirable or infeasible to use with-objects are covered by the call-finalizers API: a one-argument function which takes an object, and calls and removes its finalizer, allowing for completely manual finalization.

    Of course, this is basically a logical extension of how with-open-file works in Common Lisp, formalized into the object system, integrated with finalizers.

    There is a tradeoff: timely reclamation versus the risk created by the possibility of premature reclamation. It's easy to make objects which can be safely used after their finalizer has been called; the risk isn't that of a dangling pointer that will blow up the show. Still, things can malfunction when objects "hollowed out" by finalization are relied upon to still be functional.

    • foodzee 8 years ago

      > Other threads can have pointers into that thread's stack, so the stack must not be reclaimed until they let go.

      > For instance, a thread allocates a reply box on the stack and calls some message passing API to send a message, whereby it registers the reply box. That API now has a pointer into the thread's stack. Suppose the thread dies before unregistering the reply box. If that can happen, the stack has to stick around.

      OMG, can such a thing really happen and be correct in some language? The thing is, the stack is most often used as method-local storage for that method's variables and objects that do not escape its scope and thus can be allocated on the stack instead of the heap.

      Basically, what happens with the stack in a language like C or Java when, in thread T, some method A calls another method B is the following:

        T.stack.push(return address)
        T.stack.allocateFrame(B)
      
      
      Then, when B has done everything it wanted to, it clears its frame from the stack and returns via the saved address into A. Similarly, when an exception is thrown, it crawls up the stack frame by frame, clearing them out until it finds the handler.

      With that general scheme in mind, I see some contradictions in your example:

      1. How exactly can a thread correctly die in such a way that the frame of the method that allocated the reply box on the stack is not cleaned up first?

      2. Why would the method even allocate a shared data structure on its own stack (which it must clear on returning to the caller) and not on the heap?

      3. If it did so because it blocks while waiting for the response to appear in the stack-allocated placeholder, then why would anyone consider a thread which is actively waiting for something to be dead and try to reclaim its stack?

      • kazinator 8 years ago

        > How exactly can a thread correctly die in such a way that the frame of the method that allocated the reply box on the stack is not cleaned up first?

        The thread could die incorrectly in that way.

        > Why would the method even allocate a shared data structure on its own stack

        Done for efficiency, or when it's not desirable to have to check for and recover from a failure to allocate such a structure. E.g. the DECLARE_WAITQUEUE macro in the Linux kernel:

        https://elixir.bootlin.com/linux/v4.3/source/include/linux/w...

        The way blocking is implemented in Linux is that tasks declare wait queue nodes on the stack, register them into a queue, then change themselves to a sleep state and call the scheduler.

        > why would anyone consider a thread which is actively waiting for something to be dead and try to reclaim its stack?

        What reclaims the stack isn't just "anyone"; it's the garbage collector. It could plausibly go like this: suppose the messaging API is responsible for dequeuing the waiter. The messaging API walks the queue, depositing replies into the reply boxes and dequeueing them (without caring whether the associated threads are alive or dead).

        When it dequeues the dead thread's reply box, that stack then becomes unreachable. Now it is eligible for reclamation.

aidenn0 8 years ago

Interestingly enough, SBCL on x86/x64 has a conservative, but moving, GC. It can know some, but not all roots precisely, so it pins any objects that are reachable through conservative roots.

Its earlier implementations were on RISC chips that had 24 or more GPRs, so the implementation was simple: two stacks, and divide the local registers in half for boxed and unboxed values. This obviously didn't work when porting to x86, which had far fewer registers.

The ARM port, I believe, uses the non-conservative approach, despite having one fewer register than x64 (the x64 port was derived from the x86 port so uses the same register assignments).

  • dmm 8 years ago

    FWIW Clozure CL has a precise gc on x86/x64. The register usage is detailed here: https://ccl.clozure.com/docs/ccl.html#register-and-stack-usa...

  • rwmj 8 years ago

    > divide the local registers in half for boxed and unboxed values

    Fondly remembers the separate address and data registers on 68000. Why didn't they go back to this approach for x86_64 (16 registers), now that no one really cares about 32 bit x86?

    • aidenn0 8 years ago

      The conservative GC approach has worked well enough in practice that nobody is going to do the work. Also, there is a performance tradeoff in non-allocating code: sometimes you need more unboxed registers, other times you need more boxed registers, so with only ~6 of each[1] you will run into register pressure.

      1: 2 stacks means 2 stack pointers and 2 frame pointers, leaving only 12 registers for values; it's also possible that the SBCL ABI uses a global register for something else as well, which would leave only 11. PowerPC is a really luxurious platform in which you have 32 GPRs, so even if you use 8 GPRs for various bookkeeping purposes that leaves 24 remaining, which is enough for pretty much everyone.

dmytrish 8 years ago

I am not an expert in garbage collection techniques, but this article does not even mention locality of reference (copying GCs improve locality on each compaction) and how many cache misses are introduced by increased fragmentation. Are there any benchmarks on this?

  • aidenn0 8 years ago

    On SBCL the bigger win for using a copying collector isn't the locality of reference (which helps with some loads, but hurts with others), but rather the fact that you can make an allocation be about two instructions in the non-GC case (pointer increment plus a bounds check).

    I hadn't spent a lot of time thinking about how much faster this is than malloc/free until a question came up the other day here on HN to the effect of "why would anyone dynamically allocate an object that is smaller than a cache line?" In Lisp a commonly allocated structure is a CONS cell, which is two pointers in size and often smaller than a cache line. It would be very wasteful to do a malloc/free of 8 (or 16) bytes in C, but throughput is approximately identical to stack allocating them with SBCL's allocator.
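
    Roughly speaking, the fast path looks like the sketch below (my illustration, not SBCL's actual allocator): a pointer bump plus a bounds check, with the slow path deferring to the GC.

        // A rough sketch of bump-pointer allocation (my illustration, not SBCL's
        // actual code): the common case is an increment plus a bounds check; the
        // uncommon case hands control to the collector.
        final class BumpAllocator {
            private final long[] nursery;   // stand-in for a contiguous nursery region
            private int free;               // index of the next free word

            BumpAllocator(int words) { nursery = new long[words]; }

            // Returns the start index of `words` freshly allocated words,
            // or -1 to signal that a GC (and retry) is needed.
            int allocate(int words) {
                int p = free;
                if (p + words > nursery.length) return -1;  // bounds check
                free = p + words;                            // pointer bump
                return p;
            }
        }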

    • bjourne 8 years ago

      It never gets as fast as stack allocating variables. Consider a loop:

        while (1) { Integer n = new Integer(...); }   // vs.  while (1) { int n = ...; }

      A C compiler would allocate the n variable on the stack, making the "allocation" completely free. But in a GC:ed language, the n variable would be bump allocated once every loop. That wouldn't in itself be so costly, but every so often, a GC cycle would be needed as the garbage accumulates. Furthermore, in C the address of the n variable stays in place while in a GC:ed language it moves a bit for each loop.

      That is why escape analysis is a fruitful optimization. It takes heap allocated objects and attempts to stack allocate them, similar to how a C compiler would do it.
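
      A toy example of what escape analysis buys (my illustration, not tied to any particular JIT): because the Point below never escapes the method, the compiler may scalar-replace it instead of allocating it on the heap.

          // Toy illustration (my example): `p` never escapes distSq(), so an
          // escape-analysing compiler may scalar-replace it, effectively turning
          // the allocation into two stack-allocated ints.
          final class EscapeDemo {
              static final class Point {
                  final int x, y;
                  Point(int x, int y) { this.x = x; this.y = y; }
              }

              static int distSq(int x, int y) {
                  Point p = new Point(x, y);   // candidate for scalar replacement
                  return p.x * p.x + p.y * p.y;
              }
          }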

      • aidenn0 8 years ago

        To be fair, I said "approximately". The cost of a nursery GC in SBCL is O(n+m), where n is the number of live objects in the nursery and m is the number of GC roots. If this code is all you are running then n will be 1, so it's just scanning the roots, which only happens once you have allocated enough integers to fill the nursery, so the cost gets amortized across thousands of loop iterations.

      • ScottBurson 8 years ago

        > But in a GC:ed language, the n variable would be bump allocated once every loop.

        Citation needed. I don't know of any GCed language that heap-allocates local variables in this way (well, maybe SML/NJ does, but I doubt it). Certainly not Java or any Common Lisp I've ever used.

        But I agree with your larger point: heap allocation is never quite as fast as stack allocation, once you factor in the additional GC load. I don't actually know how close it gets with modern collectors; would love to see some numbers.

        • bjourne 8 years ago

          C# does and Java did before the escape analysis optimization became default. You can find numbers in this old article from 1999: https://www.cc.gatech.edu/~harrold/6340/cs6340_fall2009/Read...

          "and the overall execution time reduction ranges from 2% to 23% (with a median of 7%) on a 333 MHz PowerPC workstation with 128 MB memory."

          The benchmark is 20 years old so it is kind of out of date. I don't know of any modern benchmarks. I suspect that the difference would be much bigger nowadays because programmers don't avoid allocating small local objects as much.

          • ScottBurson 8 years ago

            Oh, now I understand what you're saying. I got thrown off when you said "the address of the variable n [...] moves a bit for each loop". This isn't quite right; it's not the address of the variable n itself that changes, it's the address of the allocated object that it points to.

  • munificent 8 years ago

    I'm no expert either, but some of the literature I've read says that copying GCs aren't a panacea when it comes to improving locality. For some object graphs, they help, for others they actually reorder the objects in ways that make locality worse.

    Consider, for example, a copying GC that copies using a depth-first traversal of the object graph and then running it over a tree that your program always processes in breadth-first order.

  • dbg_nsk 8 years ago

    Yes, an excellent point. Unfortunately, we do not have such specific benchmarks at the moment (only general benchmarks on the performance of GC), but I guess we should add them.

    However, please note that our conservative GC was also a copying (though not fully copying) collector, and it also improved locality of reference. So the impact on performance is not as large as it would be with a GC that does not provide any compaction at all.

bjourne 8 years ago

A safepoint in x86 is nothing more than the instruction mov [rip+0x1234], eax. That shouldn't cause a major slowdown? Also, safepoints are useful for features other than gc. For example, you can inspect a running thread's callstack. That is useful when debugging and when objectifying a thread's state.

Stack maps can be made a bit smaller by pushing and popping all registers from the stack during gc. That way, you only need to store the values of stack locations in them and not of individual registers.

Btw, the article is really good. This is the kind of stuff I keep coming back to HN for!

  • dbg_nsk 8 years ago

    Right, but even a single instruction placed on every backward branch and at the epilogue of every method causes a noticeable impact on performance. This is especially important in highly optimized code, and that's why optimizing compilers like HotSpot C2 (and JET as well) try to remove as many safepoints as possible. Sometimes this even causes trouble, as in this case:

    https://bugs.openjdk.java.net/browse/JDK-5014723

    Good point about inspecting a thread's call stack. Indeed, with the conservative GC we had a problem: many popular profilers were incompatible with JET because they inspect threads at safepoints and we had none. When we implemented the precise GC, this problem disappeared, which was an additional benefit for us. However, there are alternative ways to gather a thread's call stack outside of safepoints. We use them to avoid safepoint bias in our profiler. You can read more about it here:

    https://www.excelsiorjet.com/blog/articles/portable-profiler...

    ----

    Thanks for the kind words! I'm glad that you liked the post!

naasking 8 years ago

Conservative GC would probably work well enough for the JVM because there are no value types or inline arrays, whose contents more easily masquerade as roots; e.g. a random sequence of bytes, as used in crypto or hashing, would yield a lot of false positives.

By comparison, the CLR is a much worse fit, because value types and stack/inline/fixed arrays mean false positives would be much higher for some applications.

  • dbg_nsk 8 years ago

    You are right, such things as value types or inline arrays have not been introduced into the Java language (yet), but the JVM can still allocate objects, including arrays, on the stack if those objects do not escape to the heap. Of course, not all objects fit this condition, but the problem remains.

SteveJS 8 years ago

The Chakra JavaScript engine uses a conservative generational mark-and-sweep collector with many phases running in parallel with code execution. It looks like Chakra is now on GitHub (and with an MIT license). In Chakra the GC is called a 'Recycler', which can throw one for a loop when searching for the GC implementation.

willvarfar 8 years ago

This is just off the top of my head, but it made me wonder: are there any VMs that put a stack map header of some sort as a literal in the stack?

E.g. for each frame the compiler orders roots first, then the other (primitive) slots. Then, as you enter the frame, write the number of roots to the stack. When the GC walks the stack it can see precisely which slots are roots.
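
Something like the following sketch is what I have in mind (my own illustration of the idea, not an existing VM): each frame records how many of its leading slots are roots, so the collector can walk frames precisely without separate stack-map tables.

    // A sketch of the "root count in the frame" idea (my illustration, not an
    // existing VM): the compiler orders reference slots first, writes their count
    // at frame entry, and the GC walks frames reading only those slots.
    final class Frame {
        final int rootCount;     // written once when the frame is entered
        final Object[] refs;     // slots 0..rootCount-1 are roots
        final long[] prims;      // unboxed values, never scanned

        Frame(int rootCount, int refSlots, int primSlots) {
            this.rootCount = rootCount;
            this.refs = new Object[refSlots];
            this.prims = new long[primSlots];
        }
    }

    final class ShadowStack {
        private final java.util.ArrayDeque<Frame> frames = new java.util.ArrayDeque<>();

        void push(Frame f) { frames.push(f); }
        void pop()         { frames.pop(); }

        // The collector sees precisely which slots are roots, no conservatism needed.
        void visitRoots(java.util.function.Consumer<Object> visit) {
            for (Frame f : frames) {
                for (int i = 0; i < f.rootCount; i++) {
                    Object root = f.refs[i];
                    if (root != null) visit.accept(root);
                }
            }
        }
    }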

  • barrkel 8 years ago

    It can be done. In fact something very close to that is done in the Delphi compiler.

    Delphi has a variety of reference counted types. Originally just strings, but later dynamic arrays and COM-style interfaces, all use automatic reference counting. Assignment and copying of these types are done via runtime calls, not just for variables of these types, but also structures and static arrays, recursively, that contain these types.

    When allocating locals of these types, and of types that contain these types, the compiler also writes out an RTTI structure describing the stack layout, a bit like the activation record was an actual Pascal record. This RTTI is used to correctly decrement references when the stack is popped in normal case, or during exception unwinding.

    The RTTI scheme is smart enough to encode several variables of the same type as being like a static array of that type, etc.

    All this doesn't help with liveness, of course, which will still be a problem for the code presented in the article. In fact the efficiency of the encoding is in direct opposition to liveness; encoding things as contiguous blocks of pointers will most likely artificially extend their lifetime to the whole call frame.

  • anon1851968 8 years ago

    That could be done, but generally precise collectors keep a fixed set of tables outside the stack; that way they only need to figure out which method frames are on the stack and then consult those tables. That incurs fewer writes.

  • munificent 8 years ago

    The paper "Accurate Garbage Collection in an Uncooperative Environment" goes over a roughly similar technique for compiling a GC'd language to C in a way that lets you easily find the roots, because managed objects are stuffed into structs on the C stack in a certain way.

    It's a really really neat paper that I've been itching to implement for a while.

  • dbg_nsk 8 years ago

    The problem is that it all depends on the liveness of the variables. The same value on the stack can be a root at the beginning of the method, while the corresponding variable is still alive, and become useless later, once the variable is dead. So you still need to know the location of each live reference at every safepoint (and that means several stack maps for each method).
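
    Concretely, the data ends up looking something like this sketch (my illustration, not JET's actual structures): one liveness record per safepoint, because the same slot can be a root at one safepoint and dead at the next.

        // A sketch of per-safepoint stack maps (my illustration, not JET's actual
        // data structures): for each safepoint in a method, record which frame
        // slots hold live references at that point.
        import java.util.HashMap;
        import java.util.Map;

        final class MethodStackMaps {
            // safepoint offset within the method -> slots holding live references there
            private final Map<Integer, boolean[]> liveRefsBySafepoint = new HashMap<>();

            void record(int safepointOffset, boolean[] liveRefSlots) {
                liveRefsBySafepoint.put(safepointOffset, liveRefSlots.clone());
            }

            // Consulted by the GC when it finds a frame stopped at this safepoint.
            boolean[] liveReferenceSlots(int safepointOffset) {
                return liveRefsBySafepoint.get(safepointOffset);
            }
        }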

    • willvarfar 8 years ago

      You can zero them when they are not live.

      • dbg_nsk 8 years ago

        Yeah, we've tried that. It was one of our attempts at improving the conservative GC (no dead values on the stack => no false roots). Unfortunately, it causes noticeable performance degradation, so it is easier to consult the stack maps about the liveness of each variable.

  • Garfgon 8 years ago

    Could you use debug information for this? I think the stack frame layout should already be included there.

le-mark 8 years ago

So does this analysis extend to libgc/Boehm's GC, since it's conservative as well?

PaulHoule 8 years ago

The pain from conservative GC depends on how much of your address space you are using.

In the 32-bit age, you ran into problems more and more as your heap approached the GB range. At some point the probability that you end up with a false root that keeps a lot of garbage alive goes to 1.

In the 64-bit age we get a respite, although many systems don't really use 64-bit pointers.

kazinator 8 years ago

A hybrid is possible and useful in some circumstances: a conservative scan for roots, which point into a precisely traced heap.

jacksmith21006 8 years ago

This is something that I just love about Go. Get rid of the GC stalling like you have with Java.
