Reference count, don't garbage collect
This debate has gone round and round for decades. There are no hard lines; this is about performance tradeoffs, and always will be.
Perhaps the biggest misconception about reference counting is that people believe it avoids GC pauses. That's not true. Essentially, whereas tracing GC has pauses while tracing live data, reference counting has pauses while tracing garbage.
Reference counting is really just another kind of GC. I'd highly recommend perusing this paper for more details: A Unifying Theory of Garbage Collection. https://web.eecs.umich.edu/~weimerw/2012-4610/reading/bacon-...
One of the biggest issues with reference counting, from a performance perspective, is that it turns reads into writes: if you read a heap-object out of a data structure, you have to increment the object's reference count. Modern memory architectures have much higher read bandwidth than write bandwidth, so reference counting typically has much lower throughput than tracing GC does.
I am the maintainer of a very high-performance JIT compiler for a Haskell-like rules programming language used by large enterprises around the world. It uses reference counting + a global optimisation step to reduce the reference count updates to an absolute minimum. The result is compiled code that runs faster than C++ code carefully hand optimised by C++ experts over a 10 year period. There are zero GC pauses. Unless you claim that a C++ alloc/free call is “garbage collection”. Which is not common terminology. It also (BTW) scales linearly the more cores you throw at it.
You've gone from claiming reference-counting is faster than tracing GC to claiming it's even faster than hand optimized C++, which is quite honestly unbelievable - whatever the reference counting algorithm is doing can be emulated by the hand-optimised C++ code so that's just literally impossible. But anyway, it's a completely fruitless discussion here unless you provide data that we can look at and scrutinize. OP hasn't provided any. You haven't provided any (and I do believe you may think you're right, but I've been in the position of being very confident of something just to be proven completely wrong by giving all my data to others to scrutinize... it's disheartening but necessary to get to the bottom of what's real). It's like the V language saying it can do memory management magically and it's much faster than Rust or whatever when they don't even have a working system yet.
> to claiming it's even faster than hand optimized C++, which is quite honestly unbelievable - whatever the reference counting algorithm is doing can be emulated by the hand-optimised C++ code so that's just literally impossible.
There is nuance here. They claimed that their project is faster than a specific hand optimized project. Not faster than a theoretical peak performance c++ program.
I've run into similar situations where python reimplementations are faster than java because python is easier to change and fix up algorithms in. And %timeit in the ipython shell is way easier than the black magic involved in profiling and benchmarking java.
You also have people on rust subreddit or discourse asking for optimization help when their rust is not as fast as a Go example they wrote. Often you get buffered IO going and it's on par. But to melt faces like ripgrep and friends you often need to drop pretences and work on Vec<u8>s.
> And %timeit in the ipython shell is way easier than the black magic involved in profiling and benchmarking java
Unfair criticism... first, Java has had a REPL for several years and you can time stuff like in Python as easily... second, profiling tools in Java are some of the best available, and are not black magic... quite simple to use, just attach them to the running process and hit "profile".
With that said: yes, I've also seen Java programs that run faster than the Rust or C counterpart. I suspect what OP saw falls into this category: a rare example that you take as a rule (maybe I misinterpreted the claim, I admit, but it does sound OP meant his Haskell program and, I assume, others which you write in the same style, cannot be beaten by the equivalent C++).
> Unfair criticism... first, Java has had a REPL for several years and you can time stuff like in Python as easily... second, profiling tools in Java are some of the best available, and are not black magic... quite simple to use, just attach them to the running process and hit "profile".
You're right that the java tooling is powerful. But getting numbers out of the JVM is not the black magic. The dark arts are setting up an experiment, managing the JIT warm-up, and interpreting the results. In my experience it's just hell trying to turn those numbers into a convincing argument that we have confidence in our performance. Concrete example: convincing openjdk11 to use AES instructions vs conscrypt using AES out of the box with no warm-up...
The whole song and dance means that we just don't take it as a priority because it's a tiring sink of effort. On the other hand %timeit is so easy and remains consistent that you can use it with unit tests and offer algorithm fixes in PRs.
The code is not open source unfortunately so I can’t provide evidence. But I have nothing to sell or promote (unlike the V language guys who have a clear motivation to inflate what they are selling). What the compiler does differently is that it does a whole program optimisation pass where it minimises the code manipulating the reference counters to an absolute minimum. This is something you can’t do in a C++ compiler. Which is one of the reasons why the resulting machine code is faster. Also, I am using LLVM as the backend. So I am getting the same low-level optimisations that C++ compilers get. However the most important optimisations are done before lowering the code to LLVM. Function specialisation for example is huge (generating multiple versions of the same function with the arguments inlined and optimised away). I am doing quite a few high level optimisations like that before even lowering it to LLVM.
> whatever the reference counting algorithm is doing can be emulated by the hand-optimised C++ code so that's just literally impossible.
shared_ptr and unique_ptr have pretty significant overhead and are common practice, even for optimized codebases, so I wouldn't say it's impossible at all.
You are 100% right. At some point I measured smart_ptr to be 25x slower than raw pointers. The compiler I am maintaining is not using C++ style smart pointers. It is using a global whole program optimisation pass to reduce reference manipulation to a minimum. Basically what a world class C++ programmer with years of experience optimising performance would do. It is just done automatically.
unique_ptr is nearly overhead free compared to a bare pointer (except for argument passing... because of ABI concerns).
shared_ptr is expensive and easy to build leaks with, so a lot of code bases avoid it where possible. Though it's only expensive when you copy it, moving it is ~free.
This is just a more detailed version of my point lol you just listed the reasons they are not free.
What happens in your language when a linked list is freed? Doesn't running its destructor (or its equivalent) take a linear amount of time relative to the length of the list?
The compiler uses arrays not linked lists. One of the big mistakes that other functional compilers make (IMHO) is that they use linked lists. It is a huge performance problem. There is a reason why high-performance software written in C++ and C always uses arrays and not linked lists. Memory access patterns are the #1 thing to optimise for on modern CPUs.
So, as I understand it, you avoid pauses by avoiding data structures with long chains of pointers. The same will work equally well in a language with a GC. It's not that reference counting itself doesn't result in pauses; it's that the user is responsible for choosing data structures that don't result in pauses. Which I think is the only way when you care about performance, no matter whether you use manual memory management, reference counting or a tracing GC, so that's by no means a criticism of you or your language, I think it's very sensible. But I think that describing it as a no-pause memory management mechanism is a mischaracterisation.
So according to your logic no pause GC is impossible? In that case my claim is that it works as fast as what an expert in C/C++ optimisations would hand craft the compiled code to do by carefully minimising allocs/frees. This is what AAA games do to minimise stuttering. You can achieve the same with GC languages like Java. However you then have to manually keep a pool of objects and manually “alloc/free” objects from that pool. That is what games written in GC languages do to avoid random GC pauses.
No-pause GC is possible, when we're talking about concurrent GCs, but it shouldn't be mistaken for a "no-overhead GC". And reference counting is not a no-pause GC either, as long as the decrements aren't offloaded to a separate thread and instead block in a destructor that triggers a cascade of other destructors. It's true that the application developer can avoid pauses even then, but that's a different claim.
From what I've been told, C++ games do another thing and just use arena allocators, which make allocations cheap (just an integer addition, provided you know a reasonable upper bound on the size of an arena, and if you don't, you may reserve a large piece of virtual memory, and commit it later in reasonably small, but also not too large chunks) and frees even cheaper (either reset the integer, so that the arena can be reused, or a single syscall to unmap the whole arena in one go). That's a different strategy than making a lot of little allocs and frees, and then trying to minimize them by reusing objects, which is also quite hairy.
My guess is that this could be done concurrently and/or in parallel.
It still takes time linear in the length, but dead nodes are by definition stable, so it does not really matter when you free them.
So just like with tracing GCs. You do have a pause, which you may avoid with parallelism, or smear out incrementally. If you do it in parallel, your reference counting needs to be atomic, which adds further overhead. You still need to perform all the reference counting operations for each element of the list, not just free them, because part of the list may be shared. E.g. in SML:
    fun replace_head_with_1 (_::xs) = 1::xs;
    val a = [2, 3, 4];
    val b = replace_head_with_1 a;

Now a and b share the tail. And a linked list is only a simple demonstration of a chain of pointers. It occurs spontaneously outside of containers, when you just write e.g. classes which have objects of other classes as their fields. In Haskell that would be records, or just an algebraic data type. Or just closures.
The global optimization step is often what people commonly refer to as "garbage collection." Putting it inside a framework to RC as few times as possible is pretty cool.
However, I doubt the efficacy of your C++ experts: most of the people I know who write C++ are actually really bad at optimizing code. They mostly use it for legacy reasons. If you get a team of experienced (and expensive) systems programmers, you will likely get a slightly better result than your GC algorithm.
The C++ code I am referring to was hand optimised and maintained for 10+ years before the compiler I am maintaining was fast enough to take over. I never looked at the C++ code myself. I simply used it as the milestone to beat. When I took over the maintenance of the compiler it was 25x slower than the C++ code. It took a lot of work to finally make it faster. I also compared the output of the compiler against the C++ code, and found quite a few bugs in the C++ code that way, by the way.
> However, I doubt the efficacy of your C++ experts: most of the people I know who write C++ are actually really bad at optimizing code.
Which is a very good reason to develop an optimized GC algorithm, the domain experts can crank out code without having to optimize every single memory (de)allocation which sounds like a waste of their time.
It’s funny, people don’t usually doubt that a modern compiler can do a better optimization job than an expert but add a memory management algorithm and that’s a bridge too far.
Yep you are absolutely right. The business rules domain experts using the language are way more productive using a pure functional language that takes care of all the nitty gritty performance and memory issues for them. The code they write is 100% focused on solving customer problems.
Personally, I think non-GC languages are very much overused. If you are writing business logic, and not counting cycles (even worse - if you are never planning to count cycles), you should probably not use C++ or Rust.
However, a GC is a lot slower than manual memory management, which contrasts with the fact that most compiler activities are actually pretty low in overhead (now - it didn't used to be this way). Really, the only cost overhead left is the abstraction mismatch, and that is not too bad, when you compare to how bad humans are at writing assembly.
That said, this case looks like one where the C++ experts spent very little time optimizing (mostly writing business logic), and probably made a very poor choice of tools.
> However, a GC is a lot slower than manual memory management, which contrasts with the fact that most compiler activities are actually pretty low in overhead (now - it didn't used to be this way). Really, the only cost overhead left is the abstraction mismatch, and that is not too bad, when you compare to how bad humans are at writing assembly.
There is a triangle of GC performance: throughput, latency (i.e. pause length), and memory overhead. Manual memory management will often be slower (in the throughput sense) than a throughput-tuned GC because:
1. Manual memory management typically precludes moving live data
2. Manual memory management often frees data as soon as it is dead
GC will often have faster allocations than manual memory management because #1 makes it possible to just use a pointer-increment for allocation. GC will often have faster freeing of data because of #2; in particular using a nursery with Cheney's algorithm makes it O(1) to free an arbitrary amount of data.
Where a throughput optimized GC falls down is in that any code that allocates may have an unpredictable amount of delay.
Also note that for video games, both typical GC and malloc/free are often too slow for per-frame data, so arena allocators are used, which sidestep #2, and allow a pointer-increment allocation without needing #1. This is specifically because there are a lot of objects with exactly the same bounds on their lifetime. Special-purpose algorithms will almost always trump general-purpose algorithms when run on the workload they are optimized for.
What I understood from their comment (which may not be correct) is the following. Say you have something like this:
    extern void foo(T *p); // some arbitrary function

    void bar1(bool cond) {
        ...
        std::unique_ptr<T, your_deleter> p(new T());
        if (cond) { return foo(p.release()); }
        ...
    }

This requires the compiler to call your_deleter::operator() regardless of whether cond is true or false, even though it's unnecessary (and can thus be slower) in the case where cond is true. Moreover, the obvious way to avoid it is to write it "C-style":

    void bar2(bool cond) {
        ...
        auto p = new T();
        if (cond) { return foo(p); }
        ...
        your_deleter()(p);
    }

which can end up being faster when cond is true. But this isn't something an expert would generally want to do, as now the C++ code becomes unidiomatic, fragile, and unmaintainable. In an ideal world, though, you could have an optimizer smart enough to do that transformation automatically. C++ compilers already do that in trivial cases, but they can't do it in general. My impression is that their Haskell compiler exploits the internal knowledge of what your_deleter does (i.e. reference counting) in order to optimize the code in various ways, like optimizing out such code, consolidating refcount updates, etc. And if I understand this correctly, there's no surprise at all that it can be faster than idiomatic C++ code written even by experts.
The question for me isn't the expertise of their programmers. Perhaps in their case they genuinely do need to have lots of objects on the heap, have (say) tight loops where they (for whatever reason) nevertheless cannot avoid the heap allocations, and don't have much of a use for finalizers besides freeing memory. In which case, I'm not surprised their solution clearly delivers better results than the C++ equivalent. The question from me, instead, is how well they think that generalizes, such as to (a) well-written Haskell programs in general, (b) well-written C++ programs in general, and/or (c) other domains. It would be one thing if their solution delivers better results in Haskell than C++ for their use case; it would be another thing if they could claim their solution delivers better results in Haskell than C++ for most use cases.
Yep you are quite right. The optimiser basically does what an experienced C++ optimisation expert would (tediously) do by hand: it carefully reduces the number of reference counter modifications to an absolute minimum by symbolically evaluating the full execution flow of the code. It is something that is doable in a pure functional language like this one. I am not sure if it is possible to do the same in C++ (because of the aliasing rules).
std::unique_ptr::release() resets the ptr to nullptr, which will be the first thing the deleter will check for and bail out.
This is a non-issue.
> which will be the first thing the deleter will check for and bail out.
The point is that check itself is an extra instruction (or two, rather) that would otherwise be skipped entirely.
I'm not saying this commonly makes a difference. I'm just saying this might be something that does make a difference for them in their particular use case.
Also note that I was trying to describe the general phenomenon with a simple example, but this obviously isn't limited to std::unique_ptr.
Just to clarify: it is not a framework. It is a whole program optimisation pass done by the compiler before lowering the intermediate code to LLVM and then generating machine code.
How do you collect cycles without a pause?
There are solutions to this one. Real time garbage collectors are a thing. You just might not be able to collect the cycle in one GC run.
Or you can do what Erlang does: Erlang has neither mutation nor laziness, so you can't create cycles. The GC also lays out objects in topological order in memory, so that you can detect garbage without tracing every live object.
The comment claimed "There are zero GC pauses." Tbh, it's not clear how to interpret that. But real time simply means the pauses are guaranteed not to cross the realtime response budget. You still need to schedule the collection while the mutator is paused.
Where can I read about Erlang’s magic?
Google suggests https://stackoverflow.com/questions/10221907/garbage-collect...
By requiring non-cyclic data structures or providing "weak" references. It's actually pretty easy to write most code without cycles. I program a lot in Nim and generally compile lots of programs with ARC and no cycle collector without issue.
Yep. The language doesnt allow data cycles. Nobody ever complained about that or even noticed it.
That seems a little hard to believe. I can't think of many complex pieces of software I've worked on that didn't have such cycles (e.g. any sort of tree structure where parents need to know about children and children need to know about parents - how does Nim handle that?)
Nim does allow cycles and the new GC is non-atomic RC + cycle collector.
Though I’ve been hacking on a UI framework written in Nim. Its ui nodes form a tree structure that only tracks parent => child. When it needs the parent info it makes a stack of parents during processing.
Essentially it moves the cycle collector / GC work to extra work during processing. But it’s cheaper since you’re already accessing the memory.
You typically use a weak pointer from child to parent in those situations. So there are no ownership cycles. Languages without garbage collectors (C/C++) pretty much require you to not have ownership cycles or your code will crash during cleanup. Unless you add specific code to detect it and stop it from happening. So I assume you mostly work in GC languages and not C/C++?
We were talking about Nim, which is language I know next to nothing about (but very familiar with C/C++, Swift - which has weak pointers, as well as GC languages like JavaScript/Java/C# etc.)
Not sure what you mean but Nim allows data cycles.
I just prefer to write cycle free code when I can and turn off the cycle collector. Though Nim’s cycle collector performs well and does some tricks using the RCs.
The language has zero data cycles by design. It turns out that we didn’t need it. Thinking about it, I can’t remember even a single situation in my 30+ years career where I needed a data structure with data cycles. There are usually better ways to do it in my experience.
I simply don't do cycles. Software is simpler without them.
Agree. I can’t even come up with an example that couldn’t be done better without them.
I'm gonna take a guess they dedicate a core to GC (or something along these lines).
No, each thread does its own alloc/free. I reduce the locking to a minimum by using per-thread caching and memoisation. I had to implement that to make the code scale linearly with the number of cores. Without per-thread caching it would only scale to about 3 cores. Now it scales very cleanly. Which is nice.
That's not enough. Imagine a circular linked list. GC looks at an outside pointer to item A in it. Now another thread removes A from the list and moves the outside pointer to the next item. After that, GC would see all items in the list not referenced by anything (apart from the cycle).
Yeah, I didn't say it was sufficient. I just said it's probably one of the things they do.
free() calls that have to run for a data-dependent amount of time are more or less equivalent to GC pauses (assuming a concurrent GC that doesn't need to stop the world, like Java's). The most typical example is free()-ing a linked list, which takes O(n) free() calls with a simple RC mechanism.
If you are assuming GC does not need to stop the world, you can also assume that the freeing of memory (including the O(n) calls to free()) will not be done in a critical path; all memory could be handed off to a separate dedicated thread that actually calls free(). Or in a RPC or HTTP server all memory can be freed after the request has been served.
It's very easy to make a reference counting scheme not stop the world. It's a bit more difficult to make a GC implementation so.
> all memory could be handed off to a separate dedicated thread that actually calls free()
That only works when you have infinite memory (or infinite CPU resources).
> It's very easy to make a reference counting scheme not stop the world. It's a bit more difficult to make a GC implementation so.
Only if by "not stopping the world" you mean your previous suggestion (leaking unbounded amount of memory to free() everything at some later point). When your memory is bounded, you will eventually have to stop/crash once you have run out of it.
AFAIK, the best modern state of art garbage collectors have stop-the-world pauses, proportional to size of root set and/or thread count. I'd love to see an RC implementation, that does not have stop-the-world pauses at all, but that sounds as audacious as claims of perpetual motion machine.
I don't think the gp is describing the null garbage collector, rather one that can work incrementally and concurrently with the program doing its thing. Whereas a concurrent tracing gc has the problem of data changing while it runs and so special care has to be taken not to corrupt memory.
I don't think you understood my point. So let me give you a more concrete example.
The Linux kernel uses RCU to manage data structures that are concurrently read and updated. When an old value (after an update) is no longer needed, a thread can either:
(a) block for a while to make sure no other thread is still using the old value (via synchronize_rcu()) and then free the memory (see https://www.kernel.org/doc/html/latest/core-api/kernel-api.h...)
(b) if the thread cannot block, it will use call_rcu to register a callback to free the memory at a later time (https://www.kernel.org/doc/html/latest/core-api/kernel-api.h...). That callback generally runs in some other thread to do the cleanup.
Now, moving the concepts to user space, a typical user space implementation will just launch a dedicated thread to free all memory that is no longer needed by any RCU data structures.
> all memory could be handed off to a separate dedicated thread that actually calls free().
You could go a bit further and have multiple concurrent threads mark the references, then sweep them up in a separate thread, too. Some sort of concurrent sweep and mark reference count system.
With concurrency you need atomics or locks to synchronize reference counts between cores, unless you can somehow guarantee that ref counts only decrement and can never increment.
The compiler uses arrays for lists instead of linked lists. It is a performance improvement compared with how functional compilers usually do it. And the language is designed in such a way that rule writers manipulate whole lists at a time instead of one element at a time. A bit like array languages. It changes how the code is written compared with traditional functional code.
> There are zero GC pauses. Unless you claim that a C++ alloc/feee call is “garbage collection”.
Alloc/free can introduce arbitrary pauses last I checked, so yes, there are pauses. Any time doing book keeping for resources rather than running your code counts as GC time.
> Any time doing book keeping for resources rather than running your code counts as GC time.
Perhaps a nitpick: memory management time, yes, but not GC time.
alloc/free is manual memory management, not garbage collection.
If you're using alloc/free in your GC, which is what was being implied, then that counts as GC time.
On any OS which is not hard realtime, there could be arbitrary pauses with any syscall. This is just nitpicking.
Nitpicking: arbitrary pauses can occur even without syscalls, when the OS preempts the program.
More nitpicking: on x86-64, SMI interrupts can cause arbitrary pauses even without any software control involved. Hard realtime on x86-64 is not possible.
More nitpicking: Your computer might turn off, cosmic rays might fry your RAM/CPU, Capital G God could reach down and pause the system, there's a universal quantum pause every 5.391247 × 10^-44 seconds so that the universe can reboot, etc etc etc
Orrrrr, GC pause just means pauses caused by the GC as part of its implementation's work to manage memory.
This is not nitpicking. The OP was implying they were using alloc/free in their GC. Those calls can be doing all sorts of book keeping behind the scenes, including cascading destructor calls which could free a chain of objects the size of live memory. Suffice it to say, that would pause.
I'm not sure what's being asserted here, could you explain more? This sounds like you're describing a non-stopping GC, and it's well understood that reference counting is garbage collection. I'm not sure how the rest applies; you're correct, it is possible to write software with just malloc and free.
Re: "it's well understood reference counting is garbage collection". I think this might just be a terminology thing.
There appear to be two ways the terms are categorized:
1. "reference counting" and "garbage collection" are two types of automatic memory management/reclamation.
2. "reference counting" and "tracing garbage collection" are two types of garbage collection.
I think mbrodersen is using #1.
(Back in the 90s and 2000s I feel like #1 was much more prevalent. But #2 seems to have gained popularity since then. My theory is that whoever started writing the Wikipedia content for this stuff picked #2.)
I think it's more of an industry vs. academia kind of thing. In the mid-90's before Java was released automatic memory management was mostly unknown to the unwashed masses. So garbage collection became synonymous to whatever Java was doing. However, The Garbage Collection Handbook originally published in 1996 defines garbage collection as: "All garbage collection schemes are based on one of four fundamental approaches; mark-sweep collection, copying collection, mark-compact collection or reference counting." So that's an example of terminology #2.
I think this is actually common in the literature, it has nothing to do with Wikipedia. Consider that Python is commonly described as a GC language, though it has always mostly relied on automatic reference counting to free objects (it does have a tracing GC as well to handle cycles, but I'm not sure if it always did).
I would argue that what matters is the observable behavior. In Python, regardless of how the actual cleanup is distributed between ARC and GC, the behavior that programmer sees is that all unused memory gets cleaned up eventually. So, it makes sense to group it with languages that provide the same guarantee.
Yep. The compiler generates code that works the same way that an experienced C programmer would hand optimise alloc/free calls and manually keep track of shared data by increasing/reducing a reference counter. There is no separate mark/sweep or whatever step.
I never understand why people refer to reference counting as garbage collection. It waters down the term so much it becomes essentially useless. Because under that assumptions basically every language has garbage collection in some shape or form.
Because we have studied the literature instead of going by what some random dude says over coffee.
Nice appeal to authority instead of trying to engage in a meaningful way.
When one is open to learning, the best path to enlightenment is to use the same books as the Wizards on the tower of knowledge.
C doesn't have garbage collection.
Furthermore, pretty much every language has I/O, but that doesn't make the term useless.
They do it because they serve the same purpose, though with different trade-offs.
Note that this mostly reflects automatic reference counting, like you have in Python or Swift, not so much manual reference counting like std::shared_ptr or Rc.
Pics or gtfo.
Yeah, GP's claim is completely bonkers. And they haven't provided any links or data to back up their nonsensical claim.
Yep the code is not open source so I can’t provide that. However I am happy to explain exactly how it works down to the lowest level details. Perhaps somebody gets inspired to implement it themselves as open source? That would be neat.
For a unifying term I prefer Automatic Memory Management.
One reason is that GC is already universally used to mean only tracing garbage collection, and trying to defend its wider meaning is a pointless uphill battle.
Another is that it suits the job much better, because not every AMM technique works by producing garbage and then collecting it, you know.
If you want to get even more precise, call it automatic dynamic memory management. Automatic static memory management would be something like Rust's scope-based memory reclamation via ownership.
Great to confuse more people instead of properly learning about affine type systems.
There's no confusion here? This is a perfectly natural classification, that satisfies many people's intuition about Rust as having "compile-time GC". If anything, bringing up affine type systems is more liable to confuse people in this context; just because a language has an affine type system doesn't necessarily mean it needs to use it for memory management.
> not every AMM technique works by producing garbage then collecting it
And the confusing thing is that garbage collection (GC) doesn’t collect garbage, while reference counting (RC) does.
GC doesn’t look at every object to decide whether it’s garbage (how would it determine nothing points at it?); it collects the live objects, then discards all the non-live objects as garbage.
RC determines an object has become garbage when its reference count goes to zero, and then collects it.
That difference is also one way GC can be better than RC: if there are L live objects and G garbage objects, GC has to visit L objects, while RC has to visit G. Even if GC spends more time per object visited than RC, it can still come out faster if L ≪ G.
That also means that GC gets faster if you give it more memory, so that it runs with a smaller L/G ratio. (The large speed differences in modern memory/cache hierarchies make that not quite true, but I think it still holds, ballpark.)
What are these other techniques and what can a technically-literate newcomer like myself read to get acquainted?
I just keep feeding them the respective CS literature.
One great example would be a C++ program that runs fast, and then just spends tens of seconds doing “nothing” while it deallocates shared pointers’ huge object graphs at the end. They really are two sides of the same coin, with tracing GCs being actually correct (you need cycle detection for a correct RC implementation) and having much better throughput. It’s not an accident that runtimes with thousands of dev hours behind them polish their GC solutions instead of using a much more trivial RC.
I don't know what the current state of the art is, but at one point the answer to GC in a realtime environment was to amortize free() across malloc(). Each allocation would clear up to 10 elements from a queue of freeable memory locations. That gives a reasonably tight upper bound on worst-case alloc time, and most workflows converge on a garbage-free heap. A big malloc after small frees might still blow your deadlines, but big allocations after bootstrapping are generally frowned upon in realtime applications, so that's as much a social problem as a technical one.
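A minimal sketch of that amortization scheme, in Rust for illustration (the `Pool` type and its API are made up here, not a real allocator interface):

```rust
use std::collections::VecDeque;

// Freeing is deferred onto a queue, and each allocation drops at most
// BATCH queued objects, bounding worst-case allocation time.
const BATCH: usize = 10;

struct Pool {
    pending_free: VecDeque<Vec<u8>>, // garbage awaiting release
}

impl Pool {
    fn new() -> Self {
        Pool { pending_free: VecDeque::new() }
    }

    // "free": queue the object instead of releasing it immediately.
    fn release(&mut self, obj: Vec<u8>) {
        self.pending_free.push_back(obj);
    }

    // "malloc": pay down a bounded amount of deferred frees, then allocate.
    fn alloc(&mut self, size: usize) -> Vec<u8> {
        for _ in 0..BATCH {
            match self.pending_free.pop_front() {
                Some(garbage) => drop(garbage), // buffer actually freed here
                None => break,
            }
        }
        vec![0u8; size]
    }
}

fn main() {
    let mut pool = Pool::new();
    for _ in 0..25 {
        pool.release(vec![0u8; 16]);
    }
    let _a = pool.alloc(8); // drops 10 queued objects
    assert_eq!(pool.pending_free.len(), 15);
    let _b = pool.alloc(8); // drops 10 more
    assert_eq!(pool.pending_free.len(), 5);
    println!("ok");
}
```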
This is normal today inside malloc implementations.
I think that’s been normal in malloc for some time. I maintain that we should all thank Java, and the hype train around it, for forcing operating systems to implement more robust threading support. Everyone wanted to be able to say they did Java well, and nearly every OS had capital ‘I’ Issues, including Solaris. Boy I bet those closed room arguments were epic.
Where GCed languages pulled ahead for a while, particularly Java, was that they had mature concurrent memory allocation before most OSes did, so many people found on early two and four core hardware that malloc started to show up as a bottleneck.
It's not marketed as a GC, but exit(2) is fast and effective when used as one.
I wouldn't call it a GC. It's more like an arena allocator, with the arenas managed by the OS :)
Now you've pushed the job of cleaning up page tables to the OS.
But that’s extremely easy - you go through a linear list once and free them all. You don’t even need to bring the contents from main memory to do it.
(Exception is shared pages, not used for heap memory often.)
It's definitely easier, because you don't have interior pointers. But it's still work :)
It's optimized for that, so just let it do its job.
https://devblogs.microsoft.com/oldnewthing/20120105-00/?p=86...
The OS would have that job anyway.
Fun fact: if you don't do anything important in the destructors, you can avoid that delay by intentionally leaking the memory. The OS will clean it up when the program exits, and it does a better job since it frees the pages rather than looking at your objects one by one.
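In Rust terms, a sketch of this trick might use `Box::leak`, which deliberately skips the destructor; this is only safe when destructors do nothing important (no flushing, no remote cleanup):

```rust
// Intentionally leak at shutdown: Box::leak skips the destructor, so
// exit pays nothing per object; the OS reclaims the pages wholesale
// when the process terminates.
fn leak_at_shutdown(n: u64) -> &'static Vec<u64> {
    let big: Box<Vec<u64>> = Box::new((0..n).collect());
    Box::leak(big) // destructor will never run; OS frees pages on exit
}

fn main() {
    let leaked = leak_at_shutdown(1_000);
    println!("{}", leaked.len());
}
```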
Most GCs do exactly that -- they only "work" when absolutely necessary, and their heuristics says that they are getting behind the created garbage. If the program exits shortly after it will just leak the memory.
The problem with that in the case of C++ is that you likely only want to leak things used in the end from the main thread, but not "recursively" - the distinction is hard to do.
It actually wouldn't surprise me if that's the main reason GC often outperforms ARC in real-life software - simply because it only bothers doing GC once memory usage is high and there's a known need to allocate further memory. But couldn't ARC do the same thing in principle - i.e. only bother incrementing/decrementing reference counts when there's likely to be a need to reclaim memory? Even an infinitely running service can just accept that it will grow its memory footprint up to a certain threshold and only bother with memory management at that point (I actually suspect SQL Server works exactly like this!). For most short-running tools, no MM is required - the OS handles everything.
As mentioned elsewhere in the thread, the actual reclamation can be done asynchronously, but the most problematic part, namely decrementing a counter atomically, can't just be done later willy-nilly.
Saying that both have pauses is a false equivalence to me.
It overlooks the difference in how likely this can occur (without large enough object graphs freed from the top it may never be an issue), when this occurs (any time vs on cleanup that may not be latency sensitive), and how much control the programmer has over RC costs (determinism allows to profile this and apply mitigations).
RC with borrow checking can avoid a lot of refcount increments.
Tracking GC typically needs write barriers, so it’s not free either.
> and how much control the programmer has over RC costs (determinism allows to profile this and apply mitigations).
I fail to see how it would be deterministic in a highly dynamic program. Like, imagine a game where the user can drag'n'drop different things onto a "parent" object. Observability is imo an entirely different axis.
> RC with borrow checking can avoid a lot of refcount increments.
That's the same thing as escape analysis - with language support, many, many objects could be effectively "removed" from the GC's purview, decreasing load and greatly improving performance. It is a language-level feature, not inherent in the form of GC we do (RC vs tracing).
> Like, imagine a game for example where the user can drag'n'drop different things to a "parent" object.
Where is the reference count going to 0 here? Presumably the object you dragged from one parent to another parent just swapped the reference and no ref count ever went to 0 which would have triggered a free.
I'm guessing OP meant deterministic as in, "oh when I profile my code and it hits that big lag spike, this occurs in my particle system. Let's swap that to a fixed size static memory allocation instead of RC". Whereas with GC you say, "oh my program has a big lag spike. I wish I could tell why that is, but the code executing at that time may or may not have been the reason GC was collecting then". The only experience I've had trying to get accurate profiling data on memory usage in Java/C# was aggravating to say the least.
Also, most games these days (and probably since the beginning of gaming) don't allocate and free stuff randomly. They're most likely using RC on big blocks of memory that are shared among several other game objects. Or RC on several fixed size pools that get used for different components like in an ECS. Or they use RC for things like long living assets that have difficult lifetime management.
Honestly, RC and garbage collection are unnecessary in most instances. Most of the time all you need is a unique pointer because you know exactly where the ownership of the object lies. For the edge cases, RC works fine.
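A sketch of that division of labor in Rust (with a made-up `Texture` type): `Box` for the common single-owner case, `Rc` reserved for the minority with genuinely shared lifetimes.

```rust
use std::rc::Rc;

struct Texture {
    bytes: Vec<u8>,
}

// Unique ownership: one owner, freed deterministically at end of scope.
fn unique_owner() -> usize {
    let t = Box::new(Texture { bytes: vec![0; 64] });
    t.bytes.len()
} // `t` freed here; no refcount traffic at all

// Shared ownership only for the tricky cases: the texture is freed
// when the last Rc handle drops and the count reaches zero.
fn shared_count() -> usize {
    let shared = Rc::new(Texture { bytes: vec![0; 64] });
    let alias = Rc::clone(&shared); // refcount: 2
    Rc::strong_count(&alias)
} // both handles drop; texture freed when the count hits 0

fn main() {
    println!("{} {}", unique_owner(), shared_count());
}
```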
I guess my biggest complaint is, why rely on a magical algorithm that you have no control over? I've had so many "fun" times trying to force GC to collect at specific points. It's really nice when you can say, "at this point in the program, this ref count will go to 0 and it will free and I don't have to worry about it lagging anything". Rather than "please GC.collect(), actually do something when I call you right here and don't wait another 5 minutes and ignore my pleading".
> I've had so many "fun" times trying to force GC to collect at specific points
Because that's not the type of application for which a non-deterministic GC is suited. The vast majority of applications don't need that determinism. Right tool for the job.
Even in a highly dynamic program you can know all the places where releases happen. If you find a slow case, you can reproduce it, and mitigate (e.g. send object to a background thread to avoid lag on the ui thread). This is different from tracing GC that may fire on any allocation, on a timer, and workload depends on heuristics and other things in the program.
Escape analysis is hard, and GC is typically used to free programmers from assisting it with extra syntax or unsafety. This is why GC languages tend to use generational GC instead of having precise analysis.
If you use refcounted pointers for everything then you'd be better off with a proper GC. But at least in the programs I see, 99% of objects are not refcounted, and that is reserved for a tiny minority of objects with especially tricky lifetimes.
This is the key.
Using std::shared_ptr in a performance-sensitive context (e.g. after startup completes) is code smell.
Using pointers as important elements of a data structure, such that cycles are possible at all, is itself code smell. A general graph is usually better kept as elements in a vector, deque, or even hash table, compactly, with indices instead of pointers, and favoring keeping elements used near one another in the same cache line. Overuse of pointers to organize data tends to pointer-chasing, among the slowest of operations on modern systems.
Typical GC passes consist of little else but pointer chasing.
But the original article is completely, laughably wrong about one thing: an atomic increment or decrement is a remarkably slow operation on modern hardware, second only to pointer chasing.
Systems are made fast by avoiding expensive operations not dictated by the actual problem. Reference counting, or any other sort of GC, counts as overhead: wasting time on secondary activity in preference to making forward progress on the actual reason for the computation.
Almost invariably neglected or concealed in promotion of non-RC GC schemes is overhead imposed by touching large parts of otherwise idle data, cycling it all through CPU caches. This overhead is hard to see in profiles, because it is imposed incrementally throughout the runtime, showing up as 200-cycle pauses waiting on memory bus transactions that could have been satisfied from cache if caches had not been trashed.
If a core is devoted to GC, then sweeps would seem to cycle everything through just that core's cache, avoiding trashing other cores' caches. But the L3 cache used by that core is typically shared with 3 or 7 other cores', so it is hard to isolate that activity to one core without wastefully idling those others. Furthermore, that memory bus activity competes with algorithmic use of the bus, slowing those operations.
Another way GC-dependence slows programs is by making it harder, or even impossible, to localize cost to specific operations, so that reasoning about performance becomes arbitrarily hard. You lose the ability to count and thus minimize expensive operations, because the cost is dispersed throughout everything else.
> But the original article is completely, laughably wrong about one thing: an atomic increment or decrement is a remarkably slow operation on modern hardware, second only to pointer chasing.
Very true. No matter how one looks at it, GC imposes a non-trivial overhead. If I remember correctly, wasn’t there a study that demonstrated an average Swift application spending 40% of its time on RC?
But at the same time hardware can be improved. Apple CPUs are an obvious example. Uncontended atomics are extremely cheap and the memory prefetcher can detect and prefetch pointer chases.
Nowadays good compilers can handle many of these non-tricky lifetimes statically, too.
> One of the biggest issues with reference counting, from a performance perspective, is that it turns reads into writes: if you read a heap-object out of a data structure, you have to increment the object's reference count.
This is one of the biggest misconceptions about RC. You don't need to increase the refcount just to read the referred data, because you already hold a reference whose count was incremented when it was handed down to you. That semantic is captured very well by Rust's Arc type: the count is incremented when the Arc is cloned and decremented when the cloned Arc is dropped. But you can still get a regular reference to the data, since the compiler can enforce the borrow, ownership, and lifetime rules locally.
For example, in a web server, you might have the app’s config behind an Arc. It gets cloned for each request (thus rc inc’d), read a lot during the req, then dropped (thus rc dec’d) at the end of the handler.
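That pattern might look roughly like this in Rust (the `Config` type and handler are hypothetical): one refcount increment per request when the `Arc` is cloned, plain borrows for every read inside the handler, one decrement when it drops.

```rust
use std::sync::Arc;
use std::thread;

struct Config {
    name: String,
}

// One refcount increment happened when the Arc was cloned for this
// "request"; every read inside goes through a plain &Config with no
// further refcount traffic.
fn handle_request(cfg: Arc<Config>) -> usize {
    cfg.name.len()
} // `cfg` dropped here: one refcount decrement per request

fn main() {
    let config = Arc::new(Config { name: String::from("server") });
    let mut handles = vec![];
    for _ in 0..4 {
        let cfg = Arc::clone(&config); // rc inc'd once per request
        handles.push(thread::spawn(move || handle_request(cfg)));
    }
    for h in handles {
        assert_eq!(h.join().unwrap(), 6);
    }
    assert_eq!(Arc::strong_count(&config), 1); // all clones dropped
    println!("ok");
}
```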
RC may turn reads into writes, but of course, GC ends up having to go through literally every piece of memory ever from bottom to top once in a while.
RC limits itself to modifying only relevant objects, whereas GC reads all objects in a super cache-unfriendly way. Yes, an atomic read-modify-write is worse than a read, but it's not worse than a linked-list traversal of all of memory all the time.
And of course, not all kinds of object lend themselves to garbage collection - for instance, file descriptors, since you can't guarantee when or if they'll ever close. So you have to build your own reference counting system on top of your garbage collected system to handle these edge cases.
There's trade-offs, yes, but the trade-off is simply that garbage collected languages refuse to provide the compiler and the runtime all the information they need to know in order to do their jobs - and a massive 30 year long effort kicked off to build a Rube Goldberg machine for closing that knowledge gap.
> RC may turn reads into writes, but of course, GC ends up having to go through literally every piece of memory ever from time to time.
Depends on the GC algorithm used. Various GC algorithms only trace reachable objects, not unreachable ones.
Reference counting does the opposite, more or less. When you deallocate something, it's tracing unreachable objects.
One of the problems with this is that reference counting touches all the memory right before you're done with it.
> And of course, not all kinds of object lend themselves to garbage collection - for instance, file descriptors, since you can't guarantee when or if they'll ever close. So you have to build your own reference counting system on top of your garbage collected system.
This is not a typical solution.
Java threw finalizers into the mix and everyone overused them at first before they realized that finalizers suck. This is bad enough that, in response to "too many files open" in your Java program, you might invoke the GC. Other languages designed since then typically use some kind of scoped system for closing file descriptors. This includes C# and Go.
Garbage collection does not need to be used to collect all objects.
> There's trade-offs, yes, but the trade-off is simply that garbage collected languages refuse to provide the compiler and the runtime all the information they need to know in order to do their jobs - and a massive 30 year long rube goldberg machine was built around closing that gap.
When I hear rhetoric like this, all I think is, "Oh, this person really hates GC, and thinks everyone else should hate GC."
Embedded in this statement are usually some assumptions which should be challenged. For example, "memory should be freed as soon as it is no longer needed".
> When I hear rhetoric like this, all I think is, "Oh, this person really hates GC, and thinks everyone else should hate GC."
Yeah, I think it's an inelegant, brute-force solution to a language problem - and that we continue to throw good money after bad improving it.
We should be investing in removing the need for GC through smarter compilers and through languages that allow us to better express our intent - and our object graph. Instead, we continue to improve on a giant linked-list traversal through all of live memory to try and infer the object graph at runtime. Without sufficient information to do it efficiently. The fundamental issue is that we don't have languages that express our goals in a way that allows memory management to be elided efficiently.
That doesn't mean I have some particular affinity for reference counting, though. It has its own issues as you rightly point out. I prefer it for its determinism, nothing more.
> Yeah, I think it's an inelegant, brute-force solution to a language problem - and that we continue to throw good money after bad improving it.
Having studied GC, implemented GC, and used it extensively (either as a dev or someone in operations) I'd say that there's just a lot of people out there who don't understand it. That's why people come to the wrong conclusion that it's somehow "inelegant" or "brute-force", when it is definitely neither of those.
There's also a lot of people who formed their opinions on GC from, say, what the JVM was like in the 1990s.
And there's a lot of people who only have a vague idea of how GC works in theory, but no knowledge of deeper theory and no practical knowledge of how real-world GC algorithms work.
> And there's a lot of people who only have a vague idea of how GC works in theory
I agree. To better understand garbage collection, nothing better than implementation. This article is such a joy to read:
https://journal.stuffwithstuff.com/2013/12/08/babys-first-ga...
Even the best "real world" garbage collectors pause the program for on the order of a hundred milliseconds, no? I was recently reading some blog posts of the V8 JS engine team.
There is nothing to understand or not understand there, it's hard data.
And yet GC haters keep ignoring the hard data about languages like D, the Oberon family, Modula-3, Nim, .NET, and plenty of others where, alongside the tracing GC, there are value types, stack and static global memory, manual memory allocation primitives, and unsafe code blocks to perform as much C-like coding as performance junkies might feel like doing.
Good point. Way too many people act like GC means an unremovable curse, versus it can be another option, among various memory management options of a language.
> Even the best "real world" garbage collectors pause the program for order of magnitude a hundred milliseconds, no?
If you think GC pauses are measured in milliseconds, and not microseconds, then you're reading horribly outdated material. You're off by three orders of magnitude for an off-the-shelf GC with no tuning.
The best real world garbage collectors are like ZGC, Shenandoah or C4 which don't pause your program at all. They are fully concurrent.
Even if you go for a GC that is designed to balance throughput and latency, like G1, you can still configure what pause times it should target and those can easily be ~10-20msec if you want, with the vast majority of pauses being far less than that (less than 3 msec).
"I was recently reading some blog posts of the V8 JS engine team. There is nothing to understand or not understand there"
Look, if your knowledge of GC comes from reading V8 blog posts then you're proving the grandparent's point pretty nicely. You leaped from an extremely basic and remote understanding of GC to "there's nothing to understand or not understand here".
Serious question: if these garbage collectors are so good, why aren't they widely used? My guess is that they have downsides such as low throughput, high CPU usage, high memory usage, etc. You can't import a JVM GC into V8, to stay with my example, but you can reimplement the ideas if you have quasi-infinite money like Google.
For ZGC/Shenandoah because they're new, and they're new because they're extremely hard to implement well. For C4 because it is expensive and requires kernel patches. Also there isn't a whole lot of need for them in many use cases. Web servers for example have far bigger latency problems than GC, normally. Pauseless GC was historically driven by the HFT/finance sector for that reason.
Also yes, pauseless GC tends to have higher overheads than GC that pauses for longer. Whether that matters or not depends a lot on the use cases and actual size of the overheads. For example ZGC is pauseless but not yet generational. Generational ZGC when it launches will improve throughput significantly.
Google haven't done pauseless GC for V8. I don't know why not because they have done one for Android. ART uses a fully concurrent collector iirc. At any rate, although you can't import a JVM GC to V8, you can run JavaScript on the JVM using GraalJS and use the GCs that way. Though I don't recall off hand if Graal supports ZGC yet. There's no deep reason why it couldn't.
Thank you. I didn't know that one GC that I regularly use (as a user) - the one in Android - is so advanced these days. Interesting!
ART is a pretty astoundingly advanced JVM which gets nearly no publicity. It's a real shame. I bet a properly supported desktop version would be quite competitive with HotSpot!
https://source.android.com/devices/tech/dalvik/gc-debug#art_...
It also does mixed AOT / JITC, amongst other tricks.
But reference counting is a GC-algorithm -- that does exactly the same thing: tries to infer the object graph at runtime.
I really don't think it is a language problem; for several algorithms one may want to write, GC is a necessity, as object lifetime is not decidable at compile time. And it's not just some special parallel algorithm - I believe most business problems fall into this category. That's why we have basically "hacks" like RC, shared_ptr, etc. in low-level languages, to let us fake a partial GC (not collecting cycles).
On the plus side of GC, it has been one of the biggest productivity boosters the field has seen since the first non-assembly PL.
I wouldn't call reference counting a hack.
The paper mention elsewhere in the comments, https://web.eecs.umich.edu/~weimerw/2012-4610/reading/bacon-... "A Unifying Theory of Garbage Collection" put this in context.
For example, a generational garbage collection essentially employs a hybrid approach between tracing and reference counting.
Similarly, depending on your system's trade-offs, reference counting might be better. Eg tracing garbage collection works really great for main memory, but for filesystems we typically use reference counting to decide whether to retire an inode. (The paper explains this in more detail.)
I meant hack as in: languages meant primarily for manual memory management can still have RC (Rust, C++), but that will only be a conservative GC algorithm due to no cycle detection. My use of hack here was only to denote that library-only RC, while possible, may not be the best approach and is used as an escape hatch in rare cases.
Oh, ok. That makes sense. Though I would count the Boehm Garbage Collector as a hack then, too?
Apropos hacks, have a look at this memory management 'thing' that doesn't move, but manages to compact: https://github.com/plasma-umass/Mesh
> RC limits itself to modifying only relevant objects, whereas GC reads all objects in a super cache-unfriendly way. Yes, an atomic read-modify-write is worse than a read, but it's not worse than a linked-list traversal of all of memory all the time.
All tracing GC algorithms scan only live memory, and they typically do so in an array-like scan (writing some bits in the object header when a pointer to that object is discovered), not in linked-list order.
> RC may turn reads into writes, but of course, GC ends up having to go through literally every piece of memory ever from bottom to top once in a while.
> RC limits itself to modifying only relevant objects, whereas GC reads all objects in a super cache-unfriendly way. Yes, an atomic read-modify-write is worse than a read, but it's not worse than a linked-list traversal of all of memory all the time.
This is patently untrue. Contemporary GCs have had card marking/scanning for 10+ years now.
Pointer chasing is expensive because fetching a random location defeats locality, and therefore the caches and the CPU's stream detection.
But a compacting GC copies the data it's scanned into a contiguous stream, dramatically improving locality, cache utility and stream detection. And this affects not only subsequent GCs but also the application itself, which may traverse its object graph far more often than the GC does.
Most GC implementations that I can think of know which bits in memory are pointers and which aren't, and only scan the pointers.
> Essentially, whereas tracing GC has pauses while tracing live data, reference counting has pauses while tracing garbage.
This is pretty trivial to avoid. When your thread finds itself freeing a big chain of garbage objects, you can have it stop at any arbitrary point and resume normal work, and find some way to schedule the work of freeing the rest of the chain (e.g. on another thread). It's much more complex and expensive to do this for tracing live data, because then you need to manage the scenario where the user program is modifying the graph of live objects while the GC is tracing it, using a write or read barrier; whereas for garbage, by definition you know the user can't touch the data, so a simple list of "objects to be freed" suffices.
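A rough sketch of that handoff in Rust (the `drop_in_background` helper is hypothetical): once a graph is known to be garbage, ownership is sent over a channel and the actual teardown runs on a cleanup thread, off the hot path.

```rust
use std::sync::mpsc;
use std::thread;

// Instead of freeing a large object graph inline, send ownership to a
// cleanup thread; by definition no one else can touch garbage, so a
// plain channel of "objects to be freed" suffices (no barriers needed).
// Returns the number of buffers the cleaner dropped.
fn drop_in_background(graph: Vec<Vec<u8>>) -> usize {
    let (tx, rx) = mpsc::channel::<Vec<Vec<u8>>>();

    // Cleanup thread: receives doomed graphs and drops them.
    let cleaner = thread::spawn(move || {
        let mut dropped = 0;
        for g in rx {
            dropped += g.len();
            drop(g); // deallocation happens here, off the hot thread
        }
        dropped
    });

    tx.send(graph).expect("cleaner alive"); // O(1) handoff on the hot path
    drop(tx); // close the channel so the cleaner exits
    cleaner.join().unwrap()
}

fn main() {
    let big: Vec<Vec<u8>> = (0..1_000).map(|_| vec![0u8; 256]).collect();
    println!("{}", drop_in_background(big));
}
```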
"Reads become writes" (indeed, they become atomic read-modify-writes when multiple threads might be refcounting simultaneously) is a problem, though.
> It's much more complex and expensive to do this for tracing live data
But this is what happens in a modern state-of-the-art tracing GC implementation, isn't it?
And so it becomes a simple tracing GC implementation, while one keeps calling it RC to feel good.
> One of the biggest issues with reference counting, from a performance perspective, is that it turns reads into writes: if you read a heap-object out of a data structure, you have to increment the object's reference count.
You have to do this if you want deterministic deallocation, because your holding a read-only reference to that object might be exactly what keeps it around for longer. So you need to track that.
(Deterministic deallocation also means having to recursively free unreachable objects. That's often described as an arbitrary "pause" behavior in RC systems, but it's actually inherent in the requirement for deterministic behavior. If you don't care about determinism for some class of objects, you can amortize that pause by sending them to a separate cleanup thread.)
Pausing a single thread to release memory is not what is considered a "pause". Even if pausing a single thread could be a problem, you can trivially offload releasing to another thread. A pause is when all application threads get paused. Hence, reference counting does not have a problem of pauses, and tracing often does.
What is your read on the lack of discussion of escape analysis?
My own read on this is that it blurs the line with deferred collection/counting, because you could either use it to complement deferral making it cheaper, or avoid deferral because you're getting enough of the benefits of deferral by proving objects dead instead of discovering that they are dead.
Likewise, it means that heap allocation on a tracing GC never took place and the object was allocated on the stack or, if small enough, in registers.
One thing that is particularly strange in C# is that objects can respond to events/delegates after they have gone out of scope. It can be quite a while before the object is actually collected - especially if it ends up on the large object (85K+) heap. This seems like an incredibly leaky abstraction to me. The whole "using" concept is a bit of an abomination as well - though getting better.
The issue with GC is it is a fluid implementation detail that is often necessary to understand deeply.
That doesn't sound quite right. When an object attaches a handler to an event, that creates a reference from the event source to the event listener. Until the handler is detached, the event listener shouldn't be collected by the GC, since it is technically still alive.
This led to one of the more entertaining C# memory leak stories, where Princeton's entry to the DARPA Grand Challenge ended up failing because every frame they detected obstacles, created a class for each, and subscribed each obstacle object to an event. They missed that the event subscription was keeping those objects alive, and every piece of tumbleweed in the desert helped leak memory until the car just stopped! https://www.codeproject.com/Articles/21253/If-Only-We-d-Used...
I don't get that part. If it can receive an event/delegate, the object must be referenced by another object which invokes the event. If that is the case, how would it be eligible for GC at all?
Consider an object referring to a network connection, say to a database. The object may be eligible for garbage collection but still have to respond to events from the network.
When it finally gets destroyed, its Destructor method would be called. At which point the thing at the other end of the network is told that it was talking to a zombie.
Note that network connections can be expensive for the other end, so this is a horrible design. We put a lot of work in to reliably get rid of connections when not needed. But you still need the fallback of being able to correctly handle programmer oversights.
Yeah, you can't apply C++-style resource management (specifically, calling close() in destructors or expecting cascading destructors to trigger in sequence) to GC languages. Object lifetime in a GC language does not generally align with desired resource lifetimes. You have to be explicit. This often means creating your own public close methods when wrapping a resource in a class.
One far-fetched yet simple example would be an observer/observable pair, where the observable mutates observed properties inside its finalizer. The observable will usually contain a reference to the observer, but not the other way around. So when the observable is dead but the observer (optionally) still alive, the observable's finalizer will run and send a message to the observer.
When our observer is also dead (so the pair is out of scope) it will be a dead object receiving events.
It's a "feature" on the language, like others said below.
The codebase I work with has had many pathological crashes due to this behavior.
So basically in C#, when you use += to subscribe to events in a big system where the lifetimes of objects are independent of each other, you're back to a C/C++ mindset: you have to check that there's a matching -= call when the subscribing object is about to go out of scope. Otherwise you get random crashes, with events delivered to an object that should have been dead.
This is one of the reasons I don't like "event" and += in C#. It's a leaky abstraction, like you said.
There's WeakEventManager [0] but that's available only in "classic" dotnet framework (and in "new" dotnet but only if you're targeting Windows) since it lives in the WPF namespace. It can be used outside of it, but you still take a dependency on System.Windows.
There are some other bespoke solutions too.
There's an open issue on the dotnet repo to add a weak event manager to the standard libs [1]. It's very well worth reading through it, it also has links to the other bespoke solutions available.
[0] https://docs.microsoft.com/en-us/dotnet/api/system.windows.w...
On the other hand such behavior can be a blessing in some situations. Maybe I do just want to hang an object off of some pubsub without having to decide the one true "owner" of the object.
If you're used to objects being destructed when they go out of scope ala C++ then yeah adapting to the lifecycle of objects in Java/C# takes some doing. But I think there's benefit to be had.
Considering that C# has a major role in desktop development, and as a result interacts a lot with platform APIs and objects, these kinds of weird behaviors arising from conflicting ideas about object lifetimes happen a lot - it's weird they chose a GC for the language.
> reference counting has pauses while tracing garbage.
Which pauses do you mean?
Reference counting is not free, but there are no long pauses (long compared to tracing GC - e.g. in the JVM, under certain workloads, you can get 100ms pauses).
It's not always a tradeoff:
Nim switched from GC to RC and it even increased performance.
I disagree with the statement that modern memory architectures have much higher read bandwidth vs write bandwidth.
Benchmarks show they are within 30 percent of each other: https://www.techspot.com/images2/news/bigimage/2021/03/2021-...
While perhaps true, what needs to be compared here is read vs. read + write, no? Just writing isn't enough. And then we're at a factor above 2, assuming no thread contention. If there is contention, it can be a lot higher.
While "write bandwidth" is probably not the right term, writes are more expensive because you need to update caches. If you forked you might need to copy-on-write the page before you can write to it.
> Perhaps the biggest misconception about reference counting is that people believe it avoids GC pauses. That's not true.
When I can easily replace the deallocator (thus excluding most non-RC production GCs), I can (re)write the code to avoid GC pauses (e.g. by amortizing deallocation, destructors, etc. over several frames, perhaps by returning ownership of some type and its allocations to the type's originating thread, reducing contention while I'm at it!). I have done this a few times. By "coincidence", garbage generation storms causing noticeable delays are surprisingly uncommon IME.
As programs scale up and consume more memory, "live data" outscales "garbage" - clever generational optimizations aside, I'd argue the former gets expensive more quickly, and is harder to mitigate.
It's also been my experience that tracing or profiling any 'ole bog-standard refcounted system to find performance problems is far easier and more straightforward than dealing with the utter vulgarity of deferred, ambiguously scheduled, likely-on-a-different-thread, frequently opaque garbage collection, as found in non-refcounted garbage collection systems.
So, at best, you're technically correct here - which, to be fair, is the best kind of correct. But in practice, I think it's no coincidence that refcounting systems tend to automatically and implicitly amortize their costs and avoid GC storms in just about every workload I've ever touched, and at bare minimum, reference counting avoids GC pauses... in the code I've written... by allowing me easier opportunities to fix them when they do occur. Indirectly causal rather than directly causal.
> if you read a heap-object out of a data structure, you have to increment the object's reference count.
This isn't universal. Merely accessing and dereferencing a shared_ptr in C++ won't touch the refcount, for example - you need to copy the shared_ptr to cause that. Rust's Arc/Rc need to be clone()d to touch the refcount, and the borrow checker reduces much of the need to do such a thing defensively, "in case the heap object is removed out from under me".
Of course, it can be a problem if you bake refcounting directly into language semantics and constantly churn refcounts for basic stack variables while failing to optimize said churn away. There's a reason why many GCed languages don't use reference counting to optimize the common "no cycles" case, after all - often, someone tried it out as an obvious and low hanging "optimization", and found it was a pessimization that made overall performance worse!
And even without being baked into the language, there are of course niches where heavy manipulation of long-term storage of references will be a thing, or cases where the garbage collected version can become lock-free in a context where such things actually matter - so I'll 100% agree with you on this:
> There are no hard lines; this is about performance tradeoffs, and always will be.
From what I can see, the myth that needs to be debunked isn't that garbage collection is super fast and easy with no consequences, it's the myth that garbage collection always automatically means your program is going to be spending 80% of its time doing it and freezing for a second every five seconds the instant you use a language with garbage collection. I see far more "I'm writing a web server that's going to handle six requests every hour but I'm afraid the garbage collector is going to trash my performance" than people who believe it's magically free.
It's just another engineering decision. On modern systems, and especially with any runtime that can do the majority of the GC threaded and on an otherwise-unused core, you need to have some pretty serious performance requirements for GC to ever get to being your biggest problem. You should almost always know when you're setting out to write such a system, and then, sure, think about the GC strategy and its costs. However for the vast bulk of programs the correct solution is to spend on the order of 10 seconds thinking about it and realizing that the performance costs of any memory management solution are trivial and irrelevant and the only issue in the conversation is what benefits you get from the various options and what the non-performance costs are.
It is in some sense as big a mistake (proportional to the program size) to write every little program like it's a AAA game as it is to write a AAA game as if it's just some tiny little project, but by the sheer overwhelming preponderance of programming problems that are less complicated than AAA games, the former happens overwhelmingly more often than the latter.
Edit: I can be specific. I just greased up one of my production systems with Go memstats. It periodically scans XML files via network requests and parses them with a parser that cross-links parents, siblings, and children using pointers and then runs a lot of XPath on them, so, it's kinda pessimal behavior for a GC. I tortured it far out of its normal CPU range by calling the "give me all your data" JSON dump 100 times. I've clicked around on the website it serves to put load on it a good 10x what it would normally see in an hour, minimum. In 15 minutes of this way-above-normal use, it has so far paused my program for 14.6 milliseconds total. If you straight-up added 14.6 milliseconds of latency to every page it scanned, every processing operation, and every web page I loaded, I literally wouldn't be able to notice, and of course that's not what actually happened. Every second worrying about GC on this system would be wasted.
The biggest GC issues I’ve personally seen manifest were arrays of historical data that grew to tens of millions of entries and due to array storage in .net, the array was placed in the large object heap. Swapping to a linked list actually fixed the issue and the team lived to fight another day.
Like a lot of premature optimization, it isn’t a problem until it is… but solutions aren’t unattainable.
I still mostly remember the day a coworker convinced me that object pooling was dead because it tears the mature generation a new one over and over.
It's nice when the runtime solves a problem you've had to solve yourself, but it also takes a bit of your fun away, even if your coworkers are relieved not to have to deal with 'clever' code anymore.
Yes; I have a friend who is part of a small team that wrote a very successful stock market trading gateway in Java. Turns out the JVM's GC can be tuned in very specific ways based on your needs. And there are ways to avoid having to do JVM GC in critical areas of the code as well.
> And there are ways to avoid having to do JVM GC in critical areas of the code as well.
Yeah, you allocate a large pool of objects up front and manually reference count them. Every high-performance Java application I've seen ends up doing this. But isn't that an argument for reference counting?
>Yeah, you allocate a large pool of objects up front and manually reference count them. Every high-performance Java application I've seen ends up doing this
Not sure if it's still relevant, though.
One popular physics library years ago went as far as instrumenting compiled bytecode to turn all vector/matrix allocations into fetches of preallocated objects from a pool, because a simple math operation could allocate tens or hundreds of vector/matrix objects and GC was slow - but then in newer versions they removed it, because Java's GC had become fast enough.
It is an argument that Java can do arenas like some other languages, while providing the productivity of a tracing GC for the rest of the non critical path code.
Generally, Java has made a huge investment in the garbage collector over a long period of time to address the problems that people have in some use cases. JDK 17 is much better than JDK 8. If you were writing a GC from scratch, you would not do as well.
To be fair, they definitely got into a rut in the JDK 6-7 timeframe. I maintain it's no accident that memcache came into its own during this period. That was a major pain point, and going out-of-process shouldn't have been necessary.
I always figured memcached gained popularity because it let your monolithic LAMP-like-stack scale better, not because of anything about Java. Just wedge it between your N stateless application servers and MySQL and you could handle 10x more users.
At least, that's how most people I knew back in 2006ish were using it.
I think each ecosystem had a very different experience with memcached adoption. In Java it was really limiting to do in process cache, and that limitation hit a wall while memcached usage was starting to spread. I assume Rails and maybe Python had similar problems, because all of the papers about GC techniques up to about that point were talking about Java. They got a lot of good stuff first or second.
What's common, though, is that hard drives were not getting faster while network hardware was hitting its stride. Full-bandwidth (port speed x port count) routers and multi-NIC servers had recently become ubiquitous.
I had just come off a carrier grade software project when memcached finally hit my radar, and we had only spec’ed 3-5 servers for the web tier. That was still enough local cache hit rate to keep us running relatively smoothly. Or at least once we got done being honorary QA members for F5. We had lots of problems on that project with data modeling and so we actually were using too much caching vs precalculation but that’s a different story.
There's several exchanges and clearing platforms running Java[1], although I'm not sure how many are still around after Nasdaq's hostile takeover.
The JVM GC’s are absolutely insanely good. G1 can sustain loads with heap sizes well into TERAbytes.
From what I heard, those guys write allocation-free Java code in critical paths. Nothing allocated, nothing to collect.
> it's the myth that garbage collection always automatically means your program is going to be spending 80% of its time doing it and freezing for a second every five seconds the instant you use a language with garbage collection.
Not 80%, but still annoying enough to dump it: https://discord.com/blog/why-discord-is-switching-from-go-to...
Magpie developers would use any excuse to move on.
It is less boring than building up the skills to fix the plane in mid-flight.
> Magpie developers
That was a new one for me. I sort of understood the reference but the Jeff Atwood post [0] where it's introduced is as relevant today as it was when this was published in 2008.
Such a claim really needs hard data to back it up. Reference counting can be very expensive, especially if the refcount update is an atomic operation. It's hard to capture in profiling tools because the performance overhead is smeared all over the code base instead of centralized in a few hot spots, so most of the time you don't actually know how much performance you're losing because of refcounting overhead.
The most performant approach is still manual memory management with specialized allocators tuned for specific situations, and then still only use memory allocation when actually needed.
This. Exactly this.
Garbage collection has a huge, and generally entirely unappreciated win when it comes to threaded code. As with most things, there are tradeoffs, but every reference counting implementation that I've used has turned any concurrent access to shared memory into a huge bottleneck.
> The most performant approach is still manual memory management with specialized allocators tuned for specific situations, and then still only use memory allocation when actually needed.
RAII gets you a lot of the way there.
> Basically, you attach the reference to the object graph once, and then free it when you're done with it.
So reference counting works by the programmer knowing the lifetime of each object, allowing them to only increment/decrement the refcount once, and trusting that the raw uncounted pointers they use elsewhere are always valid? There's another word we have for this: manual memory management. It's unsafe and unergonomic, and it's pretty telling that the author needs this pattern to make RC appear competitive. It's because actually doing reference counting safely is really expensive.
> If GC is so good, why wouldn't Python just garbage collect everything, which they already did once and could trivially do, instead of going through the hassle of implementing reference counting for everything but the one case I mentioned?
Because they've made reference counting a part of their C extension API and ABI. If they wanted to use a GC, they'd instead need a very different API, and then migrate all the modules to the new API. (I.e. a way for those native extension to register/unregister memory addresses containing pointers to Python objects for the GC to see.)
Early on the deterministic deallocation given by reference counting would also have been treated by programmers as a language level feature, making it so that a migration would have broken working code. But I don't think that was ever actually guaranteed in the language spec, and anyway this was not carried over to various alternative Python implementations.
Yes.
In Python reference counting precedes tracing garbage collection.
So they didn't 'go through the hassle of implementing reference counting' after they already had tracing garbage collection. Instead they went through the hassle of implementing tracing garbage collection after they already had reference counting.
(And as you say for backwards compatibility reasons, they can't get rid of reference counting.)
It should be noted that the Python language spec explicitly says that refcounting, and the resulting deterministic cleanup, is an implementation detail of CPython, and not part of the Python language proper. Precisely so that other implementations like Jython or IronPython could just use GC without refcounting.
https://docs.python.org/3/reference/datamodel.html#objects-v...
So, idiomatic Python does not rely on this, and uses with-statements for deterministic cleanup.
But, of course, as with any language, there's plenty of non-idiomatic Python out in the wild.
Yes, I know that I was referring to a detail of CPython when I said 'Python' in general. Alas, in practice Python really is CPython, more or less.
The biggest problem is that reference counting is baked into how CPython's C-modules work.
Details aside, using Python to make an argument about performance is a pretty bold move.
While Python is not high performance overall, people have spent a ton of time optimizing the memory management. So while it's not definitive proof of reference counting being slow, it seems like a fair initial question.
> So reference counting works by the programmer knowing the lifetime of each object allowing them to only increment / decrement the refcount once, and trusting that the raw uncounted pointers they use elsewhere are always valid? There's another word we have for this: manual memory management.
Programmer, or compiler. In the latter case it's automatic reference counting, which I've never heard called "manual memory management".
Reference counting is garbage collection, just a different strategy - and all these strategies tend to blur to the same methods eventually, eventually offering a latency-optimized GC or a throughput-optimized GC.
Swift is inferior here because it uses reference counting GC without much work towards mitigating its drawbacks like cycles (judging by some recent posts, some of its fans apparently aren't even aware RC has drawbacks), while more established GC languages had much more time to mitigate their GC drawbacks - e.g. Java's ZGC mitigates latency by being concurrent.
Maybe Apple just hires mediocre developers (I certainly have lots of complaints about their software and UI issues, but I would probably suspect management/priorities/schedules rather than the technical staff) but they implemented GC and Automatic Reference Counting for ObjC and found that the latter resulted in better and more consistent responsiveness, which is probably what most users care about in apps. (Apps still get compressed or paged out which can lead to annoying pauses.)
My anecdata indicate that Java apps are not as responsive as ObjC/Swift for the most part.
https://github.com/ixy-languages/ixy-languages
The real reason why a tracing GC was a failure in Objective-C was due to the interoperability with the underlying C semantics, where anything goes.
The implementation was never stable enough beyond toy examples.
Naturally automating the Cocoa release/retain calls made more sense, given the constraints.
In typical Apple fashion they pivoted into it, gave the algorithm a fancy name, and then in a you're holding it wrong style message, sold their plan B as the best way in the world to manage memory.
When Swift came around, having the need to easily interop with the Objective-C ecosystem naturally meant to keep the same approach, otherwise they would need the same machinery that .NET uses (RCW/CCW) to interop with COM AddRef/Release.
What Apple has is excellent marketing.
> What Apple has is excellent marketing.
Whoa! Marketing? Interop with C is a MASSIVE use case. And iOS is A LOT faster and more responsive than Android. That, right there, is proof that can't be denied.
I do both, and the speed of Apple's approach and the simplicity of bridging the C ABI are not a joke.
The android/iOS comparison has many more factors than GC type. I suspect the main difference is processor type - Qualcomm ARM has been very disappointing so far, and GC type doesn't even come into it.
Certainly, but Apple code also runs on x86.
My main point is actually that Apple didn't pick this route for the fun of it. It HAS to be performant on mobile devices (which don't have the luxury of a GC that eats RAM), and that need represents A LOT OF MONEY. You could say every step along the way, including making their own processors, is directly or indirectly related to that need.
P.S.: I don't see RC vs. GC as enemies, but as faces of the same coin. ARC is just a way to apply some GC lessons to naive RC. I think naive RC is truly worse than GC, but smart RC is totally fine, and I consider it the best of both in the general case...
Performance, right => https://github.com/ixy-languages/ixy-languages
ARC is a perfectly respectable choice by itself. What's annoying me is ignoring the improvements other GC strategies and other languages have made in the last 20 years. There's a certain section of Apple users that thinks that every choice Apple makes is the only good way to do things.
And RC might be the correct choice for their devices (due to them not having much RAM, and latency being more important than throughput). But that is not true everywhere, I would even say that a good tracing GC is a better default.
Your first paragraph is only true if malloc/free counts as GC. I have seen people try to claim that free() is just manual GC. You can say that if you want, but it renders the terminology meaningless.
Any true GC strategy (== one that collects cycles) will fundamentally touch and allocate more memory than malloc/free, whereas reference counting is pretty close to malloc/free performance; it doesn't need to touch any memory not involved. At OS scale, that's a huge performance advantage. You can scope down the memory involved in GC using advanced, modern GC techniques, but you're still going to be behind malloc/free in overall efficiency - cache efficiency, memory maintenance overhead, and additional memory required for bookkeeping. And reference counting will be pretty darn close to malloc/free.
Chapter 5, https://gchandbook.org/
It puts that "extra memory" inside the very objects it tracks with a... reference count.
What do you mean? Potential reference cycles are a compile time error in Swift. That’s the whole point of the @escaping annotation
So this[0] wasn't current despite being edited half a year ago? I guess I'm guilty of the same mistake the post had: criticizing without understanding the state of the art on the other side (at least I didn't miss it by 20 years and on an entire subfield of Computer Science).
[0] https://stackoverflow.com/questions/32262172/how-can-identif...
No, the SO post is accurate—it’s trivially easy to create strong reference cycles in Swift and @escaping annotation is only of limited use in detecting strong reference cycles. Though Swift also broadly pushes towards the use of value types for which creating reference cycles of any kind is impossible since they’re values, not references!
Putting both under the same term is meaningless.
Reference counting and Garbage Collection have one very clear difference: when the referenced objects are destroyed (not deallocated). In RC it happens when the count reaches zero; in GC it happens some time later. That difference is crucial for having, or not having, deterministic performance in your program.
Is the Pony language's GC a GC then? It runs at determinate times, namely when a behaviour (an actor's receive function, basically) finishes running... and because actors cannot share mutable state, and immutable values with more than one owner can be handled easily by a simple common parent actor, each actor's GC is completely independent of the others.
Things are rarely as clear cut as we would want to believe.
> Is the Pony language's GC a GC then? It runs at determinate times, namely when a behaviour (an actor's receive function, basically) finishes running...
It is. It might run when a behaviour finishes running, according to their docs.
> In GC it happens some time later.
Yes. In languages with destructors/finalizers called from the garbage collector, things can get very complicated. C# and Java have this problem.
Go avoids it by having scope-based "defer" rather than destructors.
Java has had scope-based defer since version 7, while C# has had it since version 4, with the improvement in version 8 to make it syntactically like defer.
With a bit of generics and lambdas, they can be made to even be more defer like.
Chapter 5, https://gchandbook.org/
For how strongly worded this article is, you'd think the author would provide some substance in their reasoning. Reference counting, even atomic, is quite expensive. Not only because it can invalidate the cache line, but depending on the architecture (looking at you x86), the memory model will deter reordering of instructions. On top of this, reference counting has a cascading effect, where one destructor causes another destructor to run, and so on. This chain of destructor calls is more or less comparable to a GC pause.
Not my experience at all. I maintain a very high-performance JIT compiler for a Haskell-like programming language used in production at large enterprises around the world, so I'm used to analysing performance very carefully. And reference counting is never the bottleneck. You might be right in theory, but not in practice.
Not sure about the characteristics of your workload, but here's a talk where a group wrote device drivers in several high-level languages, and measured their performance: https://media.ccc.de/v/35c3-9670-safe_and_secure_drivers_in_...
They found that the Swift version spent 76% of the time doing reference counting, even slower than Go, which spent 0.5% in the garbage collector.
Yeah, GP has no idea what they are talking about. They replied something similar on another thread without ever providing evidence for their claims. They even claimed their RC implementation in Haskell is faster than hand-crafted C++.
Gave me a chuckle.
And your reply gave me a chuckle :) The code is unfortunately not open source, so I can't provide the evidence. You believe what you want to believe, of course. However, I am happy to answer any questions on how it is implemented, down to the lowest-level nitty-gritty details.
All that tells me is that the Swift implementation is different from the one I implemented.
Well, it won’t be the bottleneck itself, but it has an overhead on basically every operation, which likely won’t show up during profiling.
Also, I fail to see the advantage of RC for a presumably mostly-immutable language - a tracing GC is even faster there, since there are no changes to the object graph after allocation, which makes a generational approach scale very well and run almost completely in parallel.
Nope. The compiler does a full-program optimisation, reducing the reference counting to an absolute minimum. There is close to zero overhead passing data around. It does not work like C++ shared_ptr; shared_ptr is slow.
> The compiler does a full program optimisation
That's the equivalent of an escape analysis I assume, which optimization exists for tracing GCs as well.
I wonder how Haskell's purity influences RC usage patterns. Are there tricks which aren't possible in an imperative language?
Haskell specifically is a poor choice of language here, because it creates cycles like crazy. (This is because it uses lazy evaluation, and programming patterns (design patterns?) built on lazy evaluation tend to use cyclical references. Strict FP languages might support your point better, but then OCaml doesn't work either, because it mutates heavily.)
Furthermore, it also allocates heavily: busy Haskell programs regularly allocate on the order of 1 GB/sec. However, this works out fine because the majority of those allocations are short-lived, hence become dead quickly, hence a GC that is designed to only touch live data (like Haskell's GC!) will handle that well.
It is a Haskell inspired language but not Haskell. The syntax is similar but it is eagerly evaluated and has no IO and no data cycles. It is used 100% for evaluating complex optimisation and business rules within a client application. The language was carefully crafted to maximise rules writer productivity and execution performance.
Ah, I was just responding to my parent, committing the sin of not remembering what they were replying to.
Interesting; with purity and eager evaluation, and presumably no mutually recursive bindings either, you should indeed get absence of cycles. I do wonder: does absence of mutually recursive bindings come back to bite you at some point? Or does the need for that just not arise in your domain?
Is there stuff that you feel is awkward to express in your language that would be fine in Haskell proper?
One reason you'd choose ref counting is its deterministic behavior; you lose that granularity with GC, even if you trigger a GC cleanup manually.
I see great reasons for both systems being useful, but both systems also bring their own warts.
Yes, ref counting affects cache and branch prediction, but GC is a whole subsystem running in parallel with your main code, constantly cleaning up after you. Which is best will always depend on the application.
Some languages lean more heavily one way than the other, too. Scripting with ref counting would be a nightmare, as would running a garbage collector on an 8-bit micro. Since the article's talking C & C++, a pro-ref-counting stance of course makes sense.
>One reason you'd choose ref counting is because it's deterministic behavior
Not sure it's entirely deterministic. A variable going out of scope can trigger deallocation of a large object graph, and it's not always clear just by looking at the code what will happen (especially if objects have destructors with side effects, your object graph is highly mutable, and your code is on a hot path). A common trick is to delay deallocation to a later time, but then again you can't be sure when your destructors will be run. Another issue is cycles: if your RC system has cycle detection, your program will behave differently depending on whether a cycle formed at runtime or not.
> trigger deallocation of a large object graph and it's not always clear by just looking at a code what will happen
If you can't understand what's happening when an object gets freed, it may be a sign that your code is too tightly coupled and/or becoming spaghetti. I've found that the more graph-like my data structures become, the more inadvertent complexity I'm adding. That's the whole reason we talk about data normalization, is to avoid these types of couplings.
Come on, object graphs are completely dynamic; no one can say when a variable will "go out of scope", unless we literally have a hello world.
Do you honestly claim that you know when deallocations happen in any codebase full of conditionals depending on outside effects (user input, network, etc)?
I mean... people program large developments in Rust with pretty minimal use of Arc. It's clearly possible to structure many programs this way (though they do not necessarily look like programs in other languages). It's important not to make extreme and definitive statements about what's realistic that are contradicted by a bunch of existing programs... in any case this has little bearing on the reference counting vs tracing GC performance thing. It also might be instructive to look at how Rust frameworks deal with your UI example: mostly, by keeping a vector of UI widgets and reusing the allocation over and over. Therefore, you shouldn't see a significant pause there (you could probably argue the pause comes when resizing the vector technically, but it is generally efficient enough that you're not going to notice it unless the vector is very very large).
Statically inserting code at scope exits is different from knowing when the deallocation will happen.
The borrow checker doesn't know when the scope will end; it only knows that however it ends, the invariants it cares about are upheld. You might call two entirely different code paths inside a method, and Rust will only execute the dealloc logic at the end, at a non-deterministic time. But maybe we're just using different terminology here.
Fun fact: Rust actually had a proposal at one point to execute drops entirely statically, with no runtime flags on the stack. It was decided against because people thought it would be too confusing as it would be hard to tell when the destructor would run, but for purely memory related destructors it would probably be acceptable.
I do see what you're trying to get at, I think, but it's also worth noting that the use of stuff like arenas and vectors to absorb the cost of the repeated reallocations goes a long way here towards making deallocation times predictable in practice (if not in theory). It is certainly the case that you can mostly reduce the deallocation overhead to ~ zero for any particular part of your Rust program without that much effort, unless you are writing an interpreter for a different language that expects GC semantics (at least, that's been my experience).
> Do you honestly claim that you know when deallocations happen in any codebase full of conditionals depending on outside effects (user input, network, etc)?
Yes. C programs have been doing this for over 40 years now. A leak free C program has an equivalent free for every malloc, which means they know exactly when everything gets allocated and freed.
That just means that every allocation has a pair that frees it - that’s different from knowing how many allocations happen, when they happen, and when the corresponding frees happen.
For a simple example, take a text editor (sure, you would likely allocate a much bigger buffer in practice) that allocates for each line of text an object, and adds the buffer’s pointer to a list to be freed - this freeing happens when the user closes the open text file window. While you do know that every allocation will be freed and know their relative order, you don’t know anything more specific - will the user open multiple such files first, close them in some random order, etc.
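The pattern above can be sketched in Rust (hypothetical `OpenFile` type, used purely for illustration): the alloc/free *pairing* is known statically, while the *moment* the frees run depends on user input.

```rust
// Sketch of the text-editor example: one heap allocation per line,
// all freed together when the user closes the file.
struct OpenFile {
    lines: Vec<String>, // each String is a separate heap allocation
}

impl OpenFile {
    fn open(text: &str) -> OpenFile {
        OpenFile { lines: text.lines().map(String::from).collect() }
    }

    // Dropping `self` frees every line buffer right here -- a known
    // point in the code, but reached at an input-dependent time.
    fn close(self) -> usize {
        let n = self.lines.len();
        n // `self` (and all line buffers) dropped at end of scope
    }
}

fn main() {
    let file = OpenFile::open("first line\nsecond line\nthird line");
    // ... user edits for an arbitrary amount of time ...
    let freed = file.close(); // all 3 line buffers released now
    println!("freed {} line buffers", freed);
}
```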
It's not different though. You asked:
> Do you honestly claim that you know when deallocations happen in any codebase full of conditionals depending on outside effects (user input, network, etc)?
And the answer to that is yes. You even admitted that here:
> While you do know that every allocation will be freed and know their relative order
This is a lot more specific than "the GC will free this memory at some indeterminate point that it deems acceptable".
Additionally, in your specific example:
> this freeing happens when the user closes the open text file window
So you can plan for that. You can pop up a saving screen if you run tests and realize that the deallocations take a bit of time. With GC, it's luck of the draw. I'm speaking from experience.
I wrote a tool that used Roslyn to do some transpiling of a custom format in our company. It was very important that it ran fast, since this algorithm was going to be run in a time-sensitive situation. And it had to free its memory as soon as it was done using it, since it was hot-swapping the DLLs and I needed to make sure I wasn't getting name collisions from DLLs that were waiting for the GC to run to fully unload the old DLLs. I tried so many different ways to tell the C# GC to collect the object tree that I knew was no longer necessary, but it was seemingly impossible.
Microsoft even has a page dedicated to debugging why an assembly won't unload and it reads:
> The difficult case is when the root is a static variable or a GC handle.[0]
Now you may say that this whole problem only arises because of my weird specific use case, which is true, but I couldn't change my requirements since those were hard requirements from my company. This could all easily be solved if a GC allowed you to define the concept of ownership. I had references hanging around that didn't matter because the object holding the references didn't own that memory.
All that to say, yes, you can know exactly when your memory will be deallocated in a language like C. In a language like C#, you're left to the whims of the GC, which can be a deal breaker in lots of cases.
What really gets me too, is languages like C# end up creating DI frameworks with the "novel" concept of lifetime requirements to make sure object lifetimes are properly scoped. What the heck? It's got a GC. Why did they go through all that trouble if the GC just cleans it up for you? If I have to think about object lifetimes, I may as well switch to a language that makes that explicit rather than a language that obfuscated it as much as possible.
[0]: https://docs.microsoft.com/en-us/dotnet/standard/assembly/un...
Well, you can always go a layer below and use a bytebuffer (or its C# equivalent). You can drop it at the end of the task in an instant.
But I don’t think that giving up the comfort/performance of a good GC is a good tradeoff in all the other cases. The same way Rust et alia can opt into some form of GC with (A)RC, GC languages can have escape hatches as well.
In most cases, yes, you should be able tell when deallocations are going to happen once you know the inputs.
But what if my application is say, a diagramming GUI where the user can create many nested items. When they delete a million items by removing a top level item, how are you going to avoid a pause if using single threaded synchronous RC? Per object determinism doesn't mean systemic determinism on a dynamic graph.
You're not going to avoid it. But you will know that it'll happen at that exact moment.
Whether that is actually important or not depends on the use case. Personally, I think that GC is plenty good enough for most GUI apps other than games, and allows for non-contorted modelling of said GUI (e.g. with backreferences where they make sense).
You say, "you will know that it'll happen at that exact moment". I'm curious what you mean by this.
Do you mean that the user will know? Well sure, that's the pain point to avoid in this case. Anyone who has tried to quit certain versions of various browsers after a long session with many tabs, etc. will know this pain when closing a window. Server side applications can have similar issues.
Or do you mean the code will "know"? That is, the code will need to predict, at runtime, that a code path will be expensive and choose a memory release strategy based on some criteria?
Or do you mean the designer of the code will know, and avoid RC before implementing?
Honest question, I'd like to understand your perspective. Thanks.
This was specifically a response to:
> you know when deallocations happen in any codebase full of conditionals depending on outside effects
What I'm saying is that the author of the code, and anyone else who can read and understand it, will know that, if the user does X, Y, and Z, it'll trigger a deeply nested release of an object graph that'll cause a period of non-responsiveness visible to the user.
Whether the author will consider this acceptable or not is a different question.
Deterministic means that running the same part of the program will take the same amount of time. Deallocating the same memory object graph is pretty much deterministic.
GC can throw a spanner in that by deciding that now is the time to do its thing.
>Deterministic means that running the same part of the program will take the same amount of time
The object graph can easily be different from run to run depending on user input, timing, etc. Say, in the first run, the graph has a large subgraph due to a runtime condition (a property is set to a certain value), and on the next run, the graph is more lightweight because the property was not updated. It will take different amounts of time to collect such graphs (especially if they have destructors). I don't see how GC is any different here, it also depends on input, timing, etc. It's true though that a programmer has less control of GC because the rules when it is triggered are not always 100% clear (implementation details of a particular runtime) but imho they are as deterministic as RC; RC is just a form of GC which has very simple rules which are eaiser to understand.
That is expected. With different input data (what you call a runtime condition) you get a different result. With the same input data the determinism holds for RC, but not for GC.
> Scripting with ref counting would be a nightmare, [...]
Why? Old version of Python used ref counting only, and Python still largely relies on reference counting (but has a GC to detect cycles).
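Why a cycle detector is needed on top of plain refcounting can be shown in Rust, whose `Rc` has no backing GC: a strong reference cycle simply leaks (a minimal sketch; the `Node` type is hypothetical).

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Two nodes pointing at each other with strong Rc references form a
// cycle: each keeps the other's refcount at >= 1, so the counts never
// reach zero and neither node is ever freed.
struct Node {
    other: RefCell<Option<Rc<Node>>>,
}

fn make_cycle() -> (Rc<Node>, Rc<Node>) {
    let a = Rc::new(Node { other: RefCell::new(None) });
    let b = Rc::new(Node { other: RefCell::new(None) });
    *a.other.borrow_mut() = Some(Rc::clone(&b));
    *b.other.borrow_mut() = Some(Rc::clone(&a));
    (a, b)
}

fn main() {
    let (a, b) = make_cycle();
    // Two strong counts each: the local binding plus the cycle edge.
    assert_eq!(Rc::strong_count(&a), 2);
    assert_eq!(Rc::strong_count(&b), 2);
    drop(a);
    drop(b);
    // Both nodes are now unreachable but still allocated. CPython's
    // cycle collector (or a Weak back-edge in Rust) exists for exactly
    // this case.
}
```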
Time to toot my own horn. I made a project comparing different types of garbage collectors (I still prefer the original terminology; both ref-counting and tracing garbage collection collect garbage, so they are both garbage collectors) a few years ago: https://github.com/bjourne/c-examples
Run ./waf configure build && ./build/tests/collectors/collectors and it will spit out benchmark results. On my machine (Phenom II X6 1090), they are as follows:
Copying Collector 8.9
Reference Counting Collector 21.9
Cycle-collecting Reference Counting Collector 28.7
Mark & Sweep Collector 10.1
Mark & Sweep (separate mark bits) Collector 9.6
Optimized Copying Collector 9.0
I.e. for total runtime it is not even close; tracing GC blows ref-counting out of the water. Other metrics such as number of pauses and maximum pause times may still tip the balance in favor of ref-counting, but those are much harder to measure. Though note the abysmal runtime of the cycle-collecting ref-counter. It suggests that cycle collection could introduce the exact same pause times ref-counting was supposed to eliminate. This is because in practice cycles are very difficult to track and collect efficiently. In any case, it clearly is about trade-offs; claiming tracing GC always beats ref-counting GC or vice versa is naive.
This is cool, thank you. Note that my comments are those of a layman, I don't consider myself an expert on these topics, but this gave me some thoughts. Happy to learn more, would love links to blogs/ papers where I can read more.
I would not be surprised to find that even a naive mark and sweep collector is faster than naive refcounting on some workloads. One obvious thing to consider is that the work is delayed, you can perform the sweeping 'as needed'. Even the marking doesn't have to run on any deterministic schedule.
The thing is that, from my naive perspective, run of the mill tracing collector algorithms are just way more advanced than your typical refcount. Most refcounting is just that - either an integer, atomic integer, or both, that gets incremented and decremented based on a number of operations applied to the underlying type. The naive approach has no delays.
Tracing GCs on the other hand, although perhaps not naive ones (could you link me info on the quickfit algorithm? I can not find anything online), might contain epochs that bump allocate in the majority of cases. That'll be particularly nice for benchmarks where allocations are likely very short lived and may actually never need to get to the mark/sweep phase. Your algorithm isn't really documented and I just really don't feel like looking at C right now.
Although naive refcounting is very common it's not the only game in town. Depending on the language you can group refcounts together - for example, imagine you have:
(assuming all fields are automatically refcounted)

    struct Foo {
        bar: Bar,
        baz: Baz,
    }
In theory, a "copy" of this type would involve 3 increments, possibly atomic increments. Each increment would also require a heap pointer dereference, and there would be no locality of those integers behind the pointers. That would be the trivial implementation.
But depending on the language you could actually flatten all of those down to 1 RC. This is language dependent, and it requires understanding how these values can be moved, referenced, etc, at compile time. You could also store all reference counts in tables associated with structures, such that you have locality when you want to read/write to multiple counters. The pointer dereference is going to be brutal so having locality there will be a nice win. I'd be curious to run your benchmarks through valgrind to see how much the refcount is just spending time on memory fetches that get invalidated in the cache immediately.
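The flattening idea can be made concrete in Rust, where both layouts are explicit (names hypothetical; this is a sketch of the trade-off, not of any particular compiler optimisation):

```rust
use std::rc::Rc;

struct Bar(u32);
struct Baz(u32);

// Naive layout: every field carries its own refcount, so "copying" a
// Foo touches separate counters behind separate heap pointers.
struct FooPerField {
    bar: Rc<Bar>,
    baz: Rc<Baz>,
}

// Flattened layout: Bar and Baz live inline inside one allocation, so
// the whole aggregate shares a single refcount and one increment.
struct FooFlat {
    bar: Bar,
    baz: Baz,
}

fn main() {
    let per_field = FooPerField { bar: Rc::new(Bar(1)), baz: Rc::new(Baz(2)) };
    // Copying this struct bumps two counters (three if Foo itself were
    // also behind an Rc), each a separate memory write...
    let copy = FooPerField {
        bar: Rc::clone(&per_field.bar),
        baz: Rc::clone(&per_field.baz),
    };
    assert_eq!(Rc::strong_count(&per_field.bar), 2);

    // ...whereas the flattened version bumps exactly one counter.
    let flat = Rc::new(FooFlat { bar: Bar(1), baz: Baz(2) });
    let flat_copy = Rc::clone(&flat);
    assert_eq!(Rc::strong_count(&flat), 2);
    drop((copy, flat_copy));
}
```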
Anyway, an example of a pretty slick refcounting GC is what Pony built: https://tutorial.ponylang.io/appendices/garbage-collection.h... https://www.ponylang.io/media/papers/OGC.pdf
Pony has different types for: 1. Local, Immutable 2. Local, Mutable 3. Shared, Immutable 4. Shared, Mutable
You can read the paper where they discuss how they track local variables vs shared variables, the implementation of counter tables, etc.
So I guess to summarize:
1. The results make sense, or as much sense as anything. I'd be interested in more details on the algorithms involved and your benchmark methodology.
2. "Naive" tracing GCs are actually pretty advanced, and advanced refcount implementations are pretty scarce.
Let me preface by stating that I'm no expert. I created the repo a few years ago when I was self-studying gc. Then I realized the rabbit hole was much deeper than I thought and retreated. The book I read is The Garbage Collection Handbook. Quite expensive but definitely worth its price.
Throughput-wise, it's hard to beat naive tracing gc. The algorithms are just too simple and they don't "interfere" with "normal operations" like ref counting does. Assuming the same allocation pattern (i.e no cheating by stack allocating objects), a tracing gc would likely (again, throughput-wise) beat manual memory management too. The additional benefit tracing gives you is easy heap compaction. Thus future pointer-chasing and memory allocations will be more efficient. With ref counting, compaction is harder.
True, you could delay sweeping, but ime, marking time dominates so you don't gain much. Even with a huge heap of several gigabytes, sweeping is just a linear scan from lowest to highest address.
Quick fit is a memory allocator, see: http://www.flounder.com/memory_allocation.htm Most gcs do not keep the heap contiguous so you need it in a layer below the gc. Quick fit is the algorithm almost everyone uses and it is very good for allocating many small objects of fixed sizes (8, 16, 32, etc.). It could be swapped out with malloc/free pairs instead, at the price of some performance.
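The segregated-free-list idea behind quick fit can be sketched like this (a simulation in Rust where arena offsets stand in for real pointers; this illustrates the scheme, it is not the allocator linked above):

```rust
// Quick-fit-style allocator sketch: one free list per small size class
// (8, 16, 32, 64 bytes). Allocating a common size is a pop; freeing is
// a push -- both O(1) with no searching.
struct QuickFit {
    arena: Vec<u8>,
    // free_lists[k] holds offsets of free blocks of size 8 << k
    free_lists: Vec<Vec<usize>>,
}

impl QuickFit {
    fn new() -> QuickFit {
        QuickFit { arena: Vec::new(), free_lists: vec![Vec::new(); 4] }
    }

    fn class_of(size: usize) -> usize {
        (0..4)
            .find(|&k| size <= 8 << k)
            .expect("larger sizes would go to a general fallback allocator")
    }

    fn alloc(&mut self, size: usize) -> usize {
        let k = QuickFit::class_of(size);
        if let Some(off) = self.free_lists[k].pop() {
            off // exact-fit reuse of a previously freed block
        } else {
            let off = self.arena.len(); // bump-allocate fresh space
            self.arena.resize(off + (8 << k), 0);
            off
        }
    }

    fn free(&mut self, off: usize, size: usize) {
        self.free_lists[QuickFit::class_of(size)].push(off);
    }
}

fn main() {
    let mut qf = QuickFit::new();
    let a = qf.alloc(16);
    qf.free(a, 16);
    let b = qf.alloc(12); // same 16-byte class: reuses the freed block
    assert_eq!(a, b);
}
```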
I have to disagree with naive tracing being advanced. My mark & sweep implementation is only about 50 lines and that includes comments: https://github.com/bjourne/c-examples/blob/master/libraries/... A copying collector isn't much more complicated. Neither is beyond the reach of most comp sci students. Yes, optimized multi-generational tracing collectors supporting concurrent and resumable tracing makes them very complicated. But the same is true of optimized ref counting schemes. :)
Pony looks very interesting. It looks like it is supposed to have less object churn than very dynamic languages like JavaScript which probably makes ref counting very suitable for it.
It's my theory that Java, unintentionally, did a lot of damage to P&L research. I write a lot of Rust, and while the borrow checker is great, I've come to really admire the work that was put in the Go GC even if it's not as fast as Java.
There is a whole generation of programmers that have come to equate GC with Java's 10-second pauses, or generics/typed variables with Java's implementation of them. Even the return to type systems (Sorbet, Python's typing, TypeScript) could be read as: typed languages are great, what we really hated was Java's verbose semantics.
I feel that every time we talk about Java, we should clarify between Java the language, Java the programming style, and Java the JVM.
As a language, Java's not too bad. It's a bit wordy, in bad need of some syntax sugar, but it's designed to be fairly straightforward and for the most part it does its job well. I don't need a degree in language theory to get started writing it.
Java the programming style, particularly enterprise, is a horrendous over-engineered mess that schools jam down the throats of students who don't know any better. It's designed to (and fails to) enforce a common style that can be written by armies of mediocre developers plodding along inside giant enterprise codebases, so that no matter who wrote the code, some other developer in another department can figure out how to call it.
Java the JVM is a pretty nifty beast. It's made tradeoffs that means it's not always suited for every use case, but put in its element it really shines. The modern GC algos give developers options based on the program's needs. It's currently struggling to overcome some historical decisions that while good back in the old days are now holding it back.
Personally I'm very biased towards Kotlin, which gives me the benefit of the JVM without the barf that is Java. It's not the fastest-executing language out there, but for me it's a perfect balance between development speed, ecosystem of battle-tested libraries, and competitive execution speed.
> There is a whole generation of programmers that have come to equate GC with Java's 10 second pauses
Anyone who has ever shipped a C# Unity game knows the pain that is the garbage collector. It’s effectively impossible to avoid frame hitches with the GC.
I’ve spent a LOT of time going way out of my way to avoid any and all garbage collects. Which somewhat defeats the purpose of using a GC-based language.
I definitely would say “GC used to be bad but now it’s good”. That tale has been spun for 30+ years at this point!
I don't know much about the C# garbage collector; and it's likely that garbage collectors are a bad fit for programs that have hard deadlines.
That said, it could also be a function of the same "problem" Java has in its design - Java by default boxes everything and so every memory allocation increases garbage collection pressure. Go, by using escape analysis and favoring stack allocations, doesn't have this problem and has had sub-millisecond pauses for years now.
You rarely hear people complain about Go's GC despite it being much less mature than the JVM's. But due to actual language design, Go's GC does much less work. I wouldn't say “GC used to be bad but now it’s good”, but rather that "the design of languages like C# and Java was too dependent on using the GC for memory allocations, and there are other GC languages out there that utilize the GC less, which can lead to greater performance".
You're right, and it's got more layers than that. C# does have value types, which are not boxed, and using them judiciously can avoid garbage. However, they are a more recent addition to the language (which started as a lame Java clone), and so the standard library tends to not know about them. Really trivial operations will allocate hundreds of bytes of garbage for no good reason. Example: iterating over a Dictionary. Or, IIRC, getting the current time. They've been cleaning up these functions to not create garbage over time, of course, but it's never fast enough for my taste and leads to some truly awful workarounds.
C# had value types and pointers from the very beginning. These are not a recent addition. The standard library does know about them. However, not until C# 2.0, which introduced generics, were collections able to avoid boxing value types.
There are some cases where allocations are made when they could have been avoided. Iterating over a dictionary creates a single IEnumerator object. Async methods, tuples, delegates, and lambda expressions also allocate memory as do literal strings. It is possible to have struct-based iterators and disposers. There are some recently added mitigations such as a ValueTask, ValueTuple, function pointers, ref structs, conversions of literals to read-only spans, that eliminate allocations.
DateTime is a value type and doesn't allocate memory. Getting the current time does not allocate memory.
With the recent additions to ref types and Span<>, C# provides a lot of type-safe ways to avoid garbage collections. You can always use pointers if need be.
And to put that in context, 2.0 was around 2003, iirc
It was 2005.
But even then, arrays of value types were available since C# 1.0 (2001).
Value types in .NET exist since forever, version 1.0.
Besides the runtime also does C++ since version 1.0, and any language can tap into it.
The JVM also does escape analysis. Java value types are in the works as well.
Don't mix Unity's implementation, an ageing one, with the language.
> what we really hated was Java's verbose semantics
I believe it's actually the opposite - Java has pretty simple, compact and well-defined semantics. Too simple and compact for comfort - a little syntactic sugar would have made the language a lot less verbose.
Java's fundamental problems are well beyond reach of any syntactic sugar.
Please explain these fundamental problems.
Generic type erasure and primitive boxing are two examples that cause daily headaches for many.
Generic specialization is being worked on.
Java (the JVM) doesn't really have 10 second pauses anymore. G1GC and ZGC and Shenandoah have been a thing for a while now.
G1 can do 10 second pauses. I observed them many, many times. It tries not to, but there are no guarantees.
Have you tried ZGC or Shenandoah?
> P&L research
What’s P&L?
> Java's verbose semantics
Does Java have verbose semantics? I think Java’s semantics are pretty neat and concise. Where’s the verbosity?
Not OP, but someone who has gotten paid to write Java for several years.
I would say that it isn't that Java's semantics are that verbose, it's that the way Java is traditionally written turns every line into three lines on your screen:

    public ItalicTextBox makeItalicTextBox(String actualTextIWantToBeItalic) {
        ItalicTextBox itb = italicTextBoxFactoryGenerator.generateFactory().buildItalicTextBox(actualTextIWantToBeItalic);
        return itb;
    }
I think this is actually the pernicious work of Martin's _Clean Code_ which trained a whole generation of Java coders to write nonsense one line functions with a million interior function calls like this, not anything forced by the Java language itself, but it makes for really ugly code, in my exceptionally humble but undoubtedly correct opinion.
The builder pattern with method chaining is unfortunate and should usually be replaced by named/optional parameters, in any sensible/modern programming language at least.
IDEs are an enabler for horriblyLongIdentifierNames because they enter them for you automatically without requiring you to type them on a keyboard.
Can you explain why you feel this way? I personally hate named and optional parameters, as they inevitably seem to drastically inflate the complexity of the function. I would much rather deal with a builder class that encapsulates doing something over calling a method with fifteen parameters, half of which are optional.
And god forbid you have optional bool args!
If I need to call a monster like that repeatedly, I'm likely to make my own wrapper that's simpler anyway - so why not start out that way.
One thing I like is that:

    foo(a=1, b=2, c=3)

is shorter (and, I would argue, clearer) than

    foo().a(1).b(2).c(3)

and requires only one method declaration and one method call. Also for constructors you don't have to worry about partial or out-of-order initialization. Is the above the same as

    foo().c(3).b(2).a(1)

? What if one call involves opening a file or allocating some resource that is used by another call?

One nice thing that method chaining does give you is being able to differentiate external and internal parameter names. Swift (for example) supports this with argument labels for doing things like

    insert(something, into: list, at: position)

rather than having to use the variable name:

    insert(something, list: list, pos: position)

Of course I wish that Swift supported ObjC/Smalltalk-like syntax so you could just do

    insert a into: list at: position

Note SwiftUI actually uses method chaining for some reason. It's OK but I would have been fine with named parameters.

But I do like Smalltalk's method chaining/cascade syntax:

    robot walkForward; wave; sitDown.
Java idioms are one contributing factor. There's some inherent wordiness too - for example, Hello World in Java is wordier than in most other programming languages.
That's syntax not semantics though.
agreed, I think everyone under the child comment of yours is talking about the verbosity coming from non-semantic stuff.
This is one of the reasons I left Java after 30 years. For those that wonder my day to day is now python and typescript.
Java is 27 years old though?
>What’s P&L?
I meant PL research as in programming language research. Not sure why my brain decided to put an & there.
>Does Java have verbose semantics? I think Java’s semantics are pretty neat and concise. Where’s the verbosity?
`FizzBuzzEnterpriseEdition` is a meme for a reason. That said, I'm sure Java idioms written today are a lot saner than they were around the time everyone decided to rewrite everything in Ruby. When I say Java here, I'm talking about an era of Java that probably no longer exists (the JVM also doesn't have 10-second pauses today either, and is widely considered to have the best garbage collectors ever built).
"Profit and loss"
Been working in fintech (HFT) for well over a decade. Never seen profit and loss abbreviated to P&L. It was always PnL
It’s gotten better over the years. Java originally did not have generics, then there was the factory/uml everything crowd. Recently modern Java has been evolving towards ergonomics with streams/loom/guice/Lombok etc.
However we still don’t have something like auto/let from cpp/rust.
Typical criticism of Java: outdated by several years...
Java has had "var" since Java 10 (current version is 18, with the version increasing every 6 months, so you're talking about Java from 4 years ago).
With new features every 6 months, yes, 4 years ago is an eternity in the Java world these days... 4 years in the future, Java will almost certainly have virtual threads (like Go's goroutines), full support for pattern matching on records (ADTs like OCaml's), true value types (Project Valhalla) and other features that keep it ahead of many languages in several respects.
Fair, my Java experience recently has been on a code base with jdk8/10 semantics. Super pumped for loom though.
> However we still don’t have something like auto/let from cpp/rust.
But things like auto are more verbose semantics, not less!
And isn’t var in Java like auto in C++?
There is type inference for local variables indeed, with var in Java for a few versions now (it has been available since Java 10).
Yes. And you could use Lombok to do the same before that.
(Though I'm one of the weirdos who likes Java in its explicitness, which is related directly to its verbosity. I like reading code where I can see what the local variable types are.)
(I believe it depends on the exact scenario. In cases where the return type is not obvious I also much prefer explicit types (eg. ConcrType a = someObj.someMethod() ), but I don’t find A a = new A() any more readable than the var version)
Agreed. I'm not sure that the shortcut existing in the language is worth the fact that other devs on my team will use it in the former case.
People have been managing garbage collection schedules for decades now. It's quite possible for many systems to have completely deterministic performance, with the allocation/deallocation performance made extremely fast, gc restricted to certain times or a known constant overhead, etc. Ironically, from a programming perspective it's incredibly easy in a language like Java to see exactly what allocates and bound those cases.
Conversely, it's also possible for reference counting to have perverse performance cases over a truly arbitrary reference graph with frequent increments and decrements. You're not just doing atomic inc/dec, you're traversing an arbitrary number of pointers on every reference update, and it can be remarkably difficult to avoid de/allocations in something like Python where there's not really a builtin notion of a primitive non-object type.
Generally speaking, memory de/allocation patterns are the issue, not the specific choice of reference counting vs gc.
Not only does the author ignore the huge progress in conventional garbage collected languages like Java, he also dismisses GC as inherently flawed despite the fact that the common strategy of only having one heap per application has nothing to do with garbage collection. In Pony each actor has its own isolated heap which means the garbage collector will only interrupt a tiny portion of the program for a much shorter period of time. Hence the concept of a stop the world pause is orthogonal to whether you have a GC or not. One could build a stop the world pause into an RC system through cycle detection if desired.
I’m out of my league so this may be dumb, but does any language or VM or whatnot have a combined system where each thread has its own heap, and you can talk by passing messages, but they also have a common heap for larger data that’s too expensive to pass around, but at the cost that you have to be much more careful with lifetimes or have to manage it manually or something?
In the Erlang VM, each Erlang "process" has its own garbage collected heap [1]. There's also an ETS module for more efficiently storing and sharing large amounts of data, but data stored in ETS is not automatically garbage collected [2].
[1] https://www.erlang.org/doc/apps/erts/garbagecollection [2] https://www.erlang.org/doc/man/ets.html
If you squint at it right you could say C works that way: you have many processes each with their own heap and they can pass messages, but if they want larger data that's too expensive to pass around you can use shared memory.
That’s just IPC, nothing inherent to C.
Sure, you can do this in a lot of languages. I gave C mainly because it's the "native language" of Unix.
I believe Nim works this way, among others.
I believe Erlang works this way.
Dart
> he also dismisses GC as inherently flawed
It's a compromise, on memory consumption and performance. Modern GCs are minimising the impact of those factors, but they still remain a part of the design.
RC is a performance compromise.
There's a lot of discussion of comparative performance, but most software isn't performance sensitive so it just doesn't matter. But there's another major facet: FFIs. The choice of memory management has huge implications for how you structure your FFI.
JavaScriptCore uses a conservative GC: the C stack is scanned, and any word which points at a heap object will act as a root. v8 is different, it uses a moving collector: references to heap objects are held behind a double-redirection so the GC may move them. Both collectors are highly tuned and extremely fast, but their FFIs look very different because of their choice of memory management.
Read and write barriers also come into play. If your GC strategy requires that reads/writes go through a barrier, then this affects your FFI. This is part of what sunk Apple's ObjC GC effort: there was just a lot of C/C++ code which manipulated references which was subtly broken under GC; the "rules" for the FFI became overbearing.
Java's JNI also illustrates this. See the restrictions around e.g. GetPrimitiveArrayCritical. It's hard to know if you're doing the right thing, especially bugs may only manifest if the GC runs which it might not in your test.
One of the under-appreciated virtues of RC is the interoperability ease. I know std::sort only rearranges, doesn't add or remove references, so I can just call it. But if my host language has a GC then std::sort may mess up the card marking and cause a live object to be prematurely collected; but it's hard to know for sure!
I agree that the API and interop with C/C++ is a huge issue and something I haven't seen good articles on.
But I was sort of put off from reference counting by working with Python extensions that leaked memory many years ago. It's so easy to forget a ref count operation. I don't have data, but I suspect it happens a lot in practice.
With tracing, you have to annotate stack roots (and global roots if you have them). To me that seems less error prone. You can overapproximate them and it doesn't really change much.
Moving is indeed a big pain, and I'm about to back out of it for Oil :-/
----
edit: I don't have any experience with Objective C, but I also think this comment is unsurprising, and honestly I would probably get it wrong too:
https://news.ycombinator.com/item?id=32283641
I feel like ref counting is more "littered all over your code" than GC is, which means there's more opportunity to get it wrong.
It’s a good point that it is a topic that doesn’t get enough coverage, but let’s just add that it has good solutions for most use cases: GCs can pin objects that might be used from some other language.
Reference counting can also take unpredictable hits if you release any large data structure. Whoever drops the last reference suddenly gets to sit through releasing the entire deep set of items (unless you can hand the release cascade off to a background thread).
I've never heard of a reference counting implementation that can handle memory compaction.
Every time you update a reference count, which is every time you touch any object, you have to write to that RAM, which means stealing the cache line from any other thread using it on any other processor. If you share large trees of data between threads, traversing that tree from different threads means the threads constantly fight each other, since there's no such thing as read-only memory under reference counting.
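CPython is a handy illustration of reads becoming writes: merely loading an object out of a container materializes a new strong reference, i.e. a write to the object's header (a sketch relying on CPython's `sys.getrefcount` semantics):

```python
import sys

obj = object()
container = [obj]

baseline = sys.getrefcount(obj)
alias = container[0]  # a pure "read" of the container still increments obj's
                      # refcount, i.e. writes to the object's memory
bumped = sys.getrefcount(obj)

assert bumped == baseline + 1
```

Under the single-threaded GIL this write is cheap; shared across cores, that same write is what dirties cache lines.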
When releasing something like a huge list in reference counting, how does the release avoid blowing the stack with recursive releasing? My guess is this is just a "don't use a large list whose release may blow the stack with recursive releasing" situation.
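For what it's worth, CPython, which refcounts everything, answers this with its internal "trashcan" mechanism: past a nesting threshold, deallocations are queued and processed iteratively rather than recursively. A rough demonstration that dropping a chain far deeper than the recursion limit doesn't blow the stack (the depth here is arbitrary):

```python
import sys

class Node:
    __slots__ = ("next",)
    def __init__(self, nxt):
        self.next = nxt

def drop_deep_chain(depth):
    head = None
    for _ in range(depth):
        head = Node(head)
    # One decref on `head` cascades through `depth` deallocations; CPython's
    # trashcan turns the cascade iterative, so it cannot blow the C stack.
    head = None
    return True

# Far deeper than the interpreter's own recursion limit:
assert drop_deep_chain(200 * sys.getrecursionlimit())
```

So it's solvable in the runtime, but the pause for the cascade itself is still paid by whoever drops the last reference.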
> I've never heard of a reference counting implementation that can handle memory compaction.
It's possible to add that in theory. But if you are tracing all your memory anyway so you can compact it, you typically might as well collect the garbage while you're at it.
But: you are in for a treat, someone implemented compaction for malloc/free. See https://github.com/plasma-umass/Mesh
They use virtual memory machinery as the necessary indirection to implement compaction, with neither changing any pointers nor reliably distinguishing pointers from integers.
> hits if you release any large data structures.
Well, that depends on how the RC is done. This is key to understand, because if you can control it, the RC becomes cheaper.
You can see it done this way in
http://sblom.github.io/openj-core/iojNoun.htm
i.e.: if instead of `[Rc(1), Rc(2)]` you do `Rc([1, 2])`, that works great.
How is that not the exact same for tracing GC?
Well, you don't need the GC. That is the point of this: RC can be applied piecemeal. That is something that can be exploited very well when writing an interpreter, for example (which is what array languages do under the hood).
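A rough CPython sketch of the piecemeal idea, using the stdlib `array` module as the "one count guards the whole container" case (the sizes below are CPython implementation details, and small ints are partly interned, so this only illustrates the shape of the win):

```python
import sys
from array import array

n = 1000
boxed = list(range(n))           # n separately refcounted int objects
unboxed = array("q", range(n))   # one refcounted object holding n raw 64-bit ints

boxed_total = sys.getsizeof(boxed) + sum(sys.getsizeof(i) for i in boxed)
unboxed_total = sys.getsizeof(unboxed)

# one header + one count vs. n headers + n counts:
assert unboxed_total < boxed_total
```

This is the array-language trick: refcount the container, keep the elements unboxed, and most count traffic disappears.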
Case 3.: You don’t “want” the constraints of Case 2, but are in practice forced into them due to a huge, poorly-educated developer base being incapable of writing correct refcounted code or even knowing what a weak pointer is.
When I worked at Facebook, which is structurally and politically incapable of building high-quality client software, I was on a small team of people tasked with making heroic technical fixes to keep the iOS app running despite literally hundreds of engineers working on the same binary incentivized to dump shoddy code to launch their product features that nobody would use as fast as possible (did you know that at one point you could order food through the Facebook app, and that a whole two digit number of people per day used this feature? Etc.)
Objective-C has ARC (automatic reference counting) — every pointer is a refcounted strong reference by default unless special annotations are used. What makes it worse is that large, deep hierarchies are common, making it easy to create reference cycles that leak huge amounts of memory.
For example, the view controller for a large and complicated page (referencing decoded bitmap images and other large objects) is the root of a large tree of sub-objects, some of which want to keep a reference to the root. Now imagine the user navigates away and the reference to the view controller goes away, but nothing in the tree is deallocated due to the backlinks — congratulations, you just leaked 10 MB of RAM!
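The same failure mode is easy to reproduce in CPython with the cycle collector switched off, which approximates a pure refcounting runtime; the class names below are made up, and the commented-out weak backlink is the standard fix:

```python
import gc
import weakref

class Child:
    def __init__(self, parent):
        self.parent = parent          # strong backlink: creates a cycle
                                      # fix: self.parent = weakref.ref(parent)

class Controller:
    def __init__(self):
        self.children = [Child(self) for _ in range(3)]

gc.disable()                          # simulate a pure refcounting runtime
root = Controller()
probe = weakref.ref(root)

del root                              # last *external* reference dropped...
leaked = probe() is not None          # ...but the backlinks keep the tree alive

gc.collect()                          # a tracing pass reclaims the cycle
reclaimed = probe() is None
gc.enable()
```

The `gc.collect()` call at the end is exactly the "backup tracer" role: refcounting alone never frees the cycle.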
It’s possible to do this correctly if you actually read the docs and understand what you’re doing, using tools like weak pointers, but when you have hundreds of developers, many of whom got their job either by transferring from an android team or by just memorizing pat answers to all the “Ninja” algorithms interview questions (practically all of which have leaked on Leetcode and various forums), you can be sure that enough of them will fail to do so to create major issues with OOMs.
To mitigate this, we created a “retain cycle detector” — basically a rudimentary tracing GC — that periodically traced the heap to detect these issues at runtime and phone home with a stack trace, which we would then automatically (based on git blame) triage to the offending team.
It was totally egregious undefined behavior, one thread tracing the heap with no synchronization with respect to the application threads that were mutating it, but the segfaults this UB caused were so much rarer than the crashes due to OOMs that it prevented that we decided to continue running it.
> It’s possible to do this correctly if you actually read the docs and understand what you’re doing, using tools like weak pointers, but when you have hundreds of developers, many of whom got their job either by transferring from an android team or by just memorizing pat answers to all the “Ninja” algorithms interview questions (practically all of which have leaked on Leetcode and various forums), you can be sure that enough of them will fail to do so to create major issues with OOMs.
This pretty much nails down what I imagine is the main difference between GC and ARC: with the former you sacrifice performance for ease of use, and with the latter you improve performance by placing some additional work on the programmers.
You can optimize reference counting: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.23...
With allocating as dead, you're basically turning it into a tracing collector for the young generation.
https://users.cecs.anu.edu.au/~steveb/pubs/papers/lxr-pldi-2... is the most recent publication in this lineage of high-performance RC systems.
A paper I quite enjoyed on automatic reference counting for pure, immutable functional programming: https://arxiv.org/abs/1908.05647
It can be quite "fast."
Indeed, and this line of research has continued to improve in follow-on work on Perceus in Koka: https://xnning.github.io/papers/perceus.pdf and https://www.microsoft.com/en-us/research/uploads/prod/2021/1...
Very cool stuff!
If you have pure, immutable and strict, you can't create cycles. (That's what Erlang does for example.) That makes a lot of memory management techniques much simpler. Both tracing garbage collection and reference counting.
If you have pure, immutable and lazy, you can get cycles. (That's Haskell.) This is almost as complicated for a GC as not having immutability.
This article is naive to the point of being flat-out wrong, since it makes extremely naive assumptions about how a garbage collector works. This is basically another C++-centric programmer saying that smart pointers work better than the Boehm GC -- which is completely true but also completely misleading.
I'm not saying that GC is always the best choice, but this article gets the most important argument wrong:
> 1. Updating reference counts is quite expensive.
>
> No, it isn't. It's an atomic increment, perhaps with overflow checks for small integer widths. This is about as minimal as you can get short of nothing at all.
Yes, it is. Even an atomic increment is a write to memory. That is not "about as minimal as you can get short of nothing at all".
Additionally, every modern GC does generational collection, so for the vast majority of objects, the GC literally does "nothing at all". No matter how little work it does, an RC solution has to do O(garbage) work, while a copying GC only does O(live data) work.
Now, that's not to say that GC is automatically better. There are trade-offs here. It depends on the workload, the amount of garbage being created, and the ratio of read to write operations.
The article says:
> I've already stated I'm not going to do benchmarks. I am aware of two orgs who've already run extensive and far-reaching experiments on this: Apple, for use in their mobile phones, and the Python project.
I can counterpoint that anecdata: Google extensively uses Java in high-performance systems, and invented a new GC-only language (Go) as a replacement for (their uses of) Python.
The right answer is to do benchmarks. Or even better yet, don't worry about this and just write your code! Outside of a vanishingly small number of specialized use cases, by the time GC vs RC becomes relevant in any meaningful way to your performance, you've already succeeded, and now you're dealing with scaling effects.
> [...] and invented a new GC-only language (Go) as a replacement for (their uses of) Python.
That's not true. Go was invented with the intention of replacing C++ at Google. That didn't really work out, and in practice Go became more of a replacement of Python for some applications at Google.
Also there are some indications that Go didn't gain traction necessarily on the merits of the language itself, but more on the starpower of its authors within Google.
(I mostly agree with the rest of what you wrote.)
>Even an atomic increment is a write to memory. That is not "about as minimal as you can get short of nothing at all".
An atomic reference count update may trigger a cache flush in other CPUs, or stall waiting for them to do so, so it's indeed not so minimal.
Definitely use reference counting, it's better! Now you're avoiding cyclic data structures and it sucks, so maybe just for a few objects we'll put them on a linked list, maybe mark them, definitely sweep from time to time to see if anything is unreachable. I'm told there's an algorithm by Boehm.
Well, ok, let's go whole hog, we're collecting garbage again, and it sucks, we get all these baby objects, let's try and optimize the GC: we can keep, I dunno, a count of references to new objects, do some allocation sinking to see if we can avoid making them, put the babies in an orphanage, hey look, RC is GC, QED.
It's interesting how we have come full circle from "Reference counting is the worst of two worlds [manual and GC] and will always be slower" to now "Well, we all know it's actually faster." in like 10 years.
Except that "refcounting is faster than a GC" is mostly a myth, both are equally bad if predictable performance matters.
Looking at metrics from Go's garbage collector, what else do you want? GC pauses are damn low, and you'll see numbers in the sub 500μs range.
If I needed hard-realtime, I would avoid allocation entirely.
Java's ZGC is the state-of-the-art on low-latency, high-throughput GCs, so it might be a better example.
Also, there are hard real-time JVMs -- but hard real-time really doesn't mean what people often believe. These systems are usually much slower - that's the price they pay for having a fixed upper bound on some latency requirement.
Yes, the Java stuff is better. I picked Go as an example because you get very low pause times with no configuration. It's really common for Java apps to be misconfigured, especially when people start throwing GC tuning options at them--people who often don't understand what those tuning options do. For example, "the pauses are too long, let's increase the heap size".
Yeah I agree. Just for anyone who might stumble upon this: if you go with the default G1GC, pretty much just leave it as is. If you experience some problem that you assume is due to the GC, you may try changing the single config option of target pause time, which will move the GC along the latency-throughput spectrum (latency and throughput being opposite ends of the same axis).
It's actually usually slower than both manual memory management and GC. It's only coming back now because people are finally learning how to make memory allocations large and rare.
This blog post is an answer to: "Tell me you haven't learned about cache coherence without telling me you haven't learned about cache coherence."
> Its actually usually slower than both manual memory management and GC
[citation needed]
You and the blog post are arguing opposite things, and neither of you has shown any evidence. I get that you're arguing that reference-counted objects are bigger (to store the reference count) and/or might use double indirection (depending on implementation), both of which are bad for caches. It's not a bad argument. But the counter-argument the blog post makes is persuasive as well: it's expensive to run a GC that scans the heap looking for loose objects, and reference counting does not need to do that. GC is also "stop-the-world", as well as unpredictable and jittery, in a way reference counting is not.
My instinct is that reference counting is actually faster (which matches my personal experience), but really, this is not an argument you can solve by arguing in the abstract, you need actual data and benchmarks.
The blog post about this is pretty much incoherent, and comes from a bad understanding of performance, atomic, GC algorithms, and reference counting.
The ONLY reason to reference count is when you need GC-like behavior on a few objects, but do not want to impose a GC on all objects. It is a very valuable tool in performance code not because it is fast, but because it allows you to make other things fast. Suggesting that you should reference count the world is patently ridiculous. This is the reason for Rust's Arc and C++ shared_ptr, not that they are faster than a GC.
The blog post completely brushes aside the costs of _atomic_ counter increments and decrements, calling them "just increments." The "atomic" is the key performance problem, not the increment.
Reference counting also makes objects larger so small objects cannot fit inside a machine register.
Modern GC algorithms are very efficient. They do not need to pause the world, and they do not need to be jittery. However, someone who doesn't understand how expensive atomic ops are probably wouldn't understand this either.
You're assuming RC requires atomics. Just like GCs have been improving so have RCs. There are several designs for modern non-atomic RCs with various solutions for passing data between threads. It generally improves performance significantly better than atomic RCs, for the cache reasons you point out.
Though yes reference counting does increase the size of small objects. I find the performance edge depends on the use case. In some scenarios tracing GCs can be faster than non-atomic RCs, but sometimes they're slower. Sometimes I swap between the two just to see. Generally though RCs tend to use less RAM which can be very valuable.
Without atomics, lock free algorithms are required, which without adequate CPU support are very hard to get right.
Lock free algorithms generally require atomics. The non-atomic RCs use other algorithms to eliminate or at least reduce atomics. The main detail is that the bulk of normal incr/decr's are non-atomic.
> Suggesting that you should reference count the world is patently ridiculous.
To be fair, Python reference counts the world. (And old Python only had reference counting; they added tracing garbage collection later.)
Python is not a fast language, of course. But neither is it a ridiculous language.
I think one reason Python has a GIL (and can't execute in parallel on multiple threads in a process or interpreter) is to ensure faster non-atomic reference counting doesn't cause data races and memory errors.
I disagree on the ridiculous part. Python (the language and the interpreter) is generally not considered to be an example of high-quality engineering.
Also, when you have a global interpreter lock, you don't have to do anything inside that interpreter with atomics. Reference counting would be blazing fast in most cases where it can be done without atomic ops.
The blog post is largely incoherent for several reasons.
1. It recommends 8 byte counters. This implies making every object in your programming language 8 bytes bigger which is pretty much a non-starter. Almost everyone that actually uses reference counting uses more like 2-4 bits.
2. Reference counting can have arbitrarily long pauses as well. If you decrease the reference count of an object to zero, you have to decrease the reference count of all referenced objects, which can do significant amounts of work (specifically, it will do a lot of work in the cases where regular GC does almost no work).
3. The blog states that atomic writes are "basically free", but that ignores the fact that in multi-threaded code, atomic writes can actually be fairly expensive, since each one requires communication between every thread. (This is why Python still has a GIL.)
4. Because no one uses 8 bytes for the count (since you don't want to double your memory consumption), you still need a GC anyway to deal with objects that get too many references.
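On point 4: small counters typically saturate rather than overflow, and saturated ("sticky") objects are then left to a backup tracing pass. A toy sketch of that policy (the 2-bit width and all names here are invented for illustration):

```python
class TinyRefCount:
    """Toy 2-bit reference count that saturates instead of overflowing."""
    MAX = (1 << 2) - 1          # hypothetical 2-bit field: counts 0..3

    def __init__(self):
        self.count = 1
        self.sticky = False     # once sticky, only a backup tracer can free it

    def incref(self):
        if self.sticky:
            return
        if self.count == self.MAX:
            self.sticky = True  # saturate: stop counting, defer to tracing GC
        else:
            self.count += 1

    def decref(self):
        """Returns True when the object can be freed immediately."""
        if self.sticky:
            return False        # the count is no longer trustworthy
        self.count -= 1
        return self.count == 0

rc = TinyRefCount()
for _ in range(10):
    rc.incref()                 # extra increments just saturate the counter
assert rc.sticky
assert rc.decref() is False     # a sticky object is never freed by decref alone
```

So the backup collector handles stuck counts and cycles in one pass; it's a space trade, not a correctness hole.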
> 2. Reference counting can have arbitrarily long pauses as well. If you decrease the reference count of an object to zero, you have to decrease the reference count of all referenced objects, which can do significant amounts of work (specifically, it will do a lot of work in the cases where regular GC does almost no work).
The saving grace for the reference counter is that this is deterministic. The pause is an exact function of the allocation/deallocation pattern, which is under the programmer's control. So the code can be written in such a way that it avoids big delays at inopportune times.
I'm not sure it's true -- just think about your browser displaying this page. Which DOM elem allocations are alive and which are not? The object graph depends on runtime data, and runtime data only. Some pattern may exist for a given program, but that is just as actionable under a tracing GC.
Implementing a runtime that runs other people's code is a special case of extreme unpredictability.
> Almost everyone that actually uses reference counting uses more like 2-4 bits.
Here is an incomplete list of languages which use 8 bytes for their reference count on 64-bit computers:
1. Rust, in Rc<T> and Arc<T>
2. C++, in std::shared_ptr<T>
3. Objective-C, in NSObject's retainCount
4. Swift, because of Objective-C
5. Python, where reference count is Py_ssize_t
These were literally the first languages I thought of; they all use 64-bit types. I would argue that since reference counting is much rarer than GC, these make up the bulk of reference-counting use in the real world. "Almost everyone", huh? It's a bit rich to accuse the author of being "almost incoherent" and then say this.
> Reference counting can have arbitrarily long pauses as well.
Deallocating can take arbitrarily long, but deallocation does not stop-the-world. It stops the current thread, which is a huge difference. Malloc can take arbitrarily long as well, it's not like it's wait-free.
In addition, the GC pauses in regular programming languages are frequently orders of magnitude longer than deallocation. You would have to deallocate an enormous tree of objects at the root for this to be an issue. And GCs have to do that as well, in addition to their regular stop-the-world pauses. This argument is just irrelevant.
> The blog states that atomic writes are "basically free", but that ignores the fact that in multi-threaded code, atomic writes can actually be fairly expensive since each one requires communication between every thread (This is why python still has a GIL)
First off, this is not why Python has a GIL, but let's leave that aside.
Atomic writes are more expensive than non-atomic ones, but they are not slow operations in the grand scheme of things. If you properly implement acquire-release semantics, they are not even that slow under high contention. Compare this to a GC which literally STOPS ALL THREADS, it's nothing.
> you still need a GC anyway to deal with objects that get too many references.
This is just silly. In languages which has both reference counting and traditional garbage collection (e.g. Python), they do it to avoid reference cycles, not because objects get "too many references". That is a ridiculous statement.
In fact! I just realized we do have data for which is more performant: this article describes how Instagram turned off GC for Python and just relied on reference counting. They gained a 10% increase in performance:
https://instagram-engineering.com/dismissing-python-garbage-...
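Worth spelling out what the Instagram configuration relies on: disabling the cycle collector does not disable reclamation, since acyclic garbage is still freed promptly by refcounting alone (a minimal CPython check):

```python
import gc
import weakref

gc.disable()                 # turn off the tracing cycle collector, Instagram-style

class Blob:
    pass

b = Blob()
probe = weakref.ref(b)
del b                        # acyclic object: refcount hits zero and it is freed
                             # immediately, no collector involved
freed = probe() is None

gc.enable()
assert freed
```

Only cyclic garbage leaks in this mode, which is why it's viable when cycles are rare. (Later CPython versions added `gc.freeze()` for the related copy-on-write problem.)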
I think it's still an open question if reference counting is always faster than GC, but I do not believe you have the technical expertise to evaluate such a claim. Your comment is four paragraphs long and riddled with factual errors. If you want to be convincing, show data that is better than that Instagram case study.
> First off, this is not why Python has a GIL, but let's leave that aside.
>
> Atomic writes are more expensive than non-atomic ones, but they are not slow operations in the grand scheme of things. If you properly implement acquire-release semantics, they are not even that slow under high contention. Compare this to a GC which literally STOPS ALL THREADS, it's nothing.
This is actually part of why Python still has the GIL. A Gilectomy was attempted, and multithreaded atomic refcounting made things a lot slower (scaling worse as threads were added), and even the other methods tried were not sufficient for performance.
> Deallocating can take arbitrarily long, but deallocation does not stop-the-world
And modern GCs only have to stop the current thread to check which thread-local objects are alive. The alive ones are moved to another generation, making the space reusable for free.
And atomic writes need synchronization which is crazy expensive, I honestly don’t get why you think it isn’t.
Also, just try writing rust/c++ code that relies entirely on RC vs Java in an object heavy workload - I really don’t think it is an open question in any shape or form.
> The alive ones are moved to another generation, making the space reusable for free.
It's pretty hilarious to me that you first say "they have to move it to another generation" and then you say "it's free!" It's like saying "I paid for my dinner, and now I get to eat it for free!"
Also: what do you think `free()` does? All modern memory allocators do this trick, keeping thread-local caches. This is not an advantage of GCs when reference counting does it as well.
Almost all modern GCs are stop-the-world in at least phases, and for good reason: stop-the-world GCs are higher performance. You pay in other ways for skipping that stop.
> And atomic writes need synchronization which is crazy expensive, I honestly don’t get why you think it isn’t.
Because I've actually benchmarked it: https://quick-bench.com/q/ISEetAHOohv-GaEuYR-7MajJgTc
18.5 nanoseconds fits under no reasonable definition of "crazy expensive", not when a regular increment clocks in at 5.9 nanoseconds. And there are extremely few situations where you increment a reference count more than, like, 5 times. It's just not an issue.
This is like cargo cult programming: you've been told these things and never tested them in the real world, and you have all these preconceived notions that just don't stand up to two minutes of scrutiny.
> Also, just try writing rust/c++ code that relies entirely on RC vs Java in an object heavy workload - I really don’t think it is an open question in any shape or form.
Yes, of course garbage collectors are easier to use than reference counting. Nobody has ever disputed this. That is the whole raison d'etre of garbage collectors. This is not what the discussion is about, it's about performance.
I'm done with this thread now, unless anybody can show me any actual data to back anything you say up. It's really tiring. I started this by saying "I don't actually know", and everyone replying to me has been so darn certain of everything they say while being outright incorrect about most things, and without any actual data to back up the rest.
> Because I've actually benchmarked it: https://quick-bench.com/q/ISEetAHOohv-GaEuYR-7MajJgTc
>
> 18.5 nanoseconds fits under no reasonable definition of "crazy expensive", not when a regular increment clocks in at 5.9 nanoseconds. And there is extremely few situations where you increment a reference count more than, like, 5 times. It's just not an issue.
Congratulations. You tested a construct meant for multicore/threading in a single threaded benchmark and then marvel at the low overhead.
Of course you will only start seeing the cost if there is actual contention, with multiple threads operating on the value simultaneously. See: https://travisdowns.github.io/blog/2020/07/06/concurrency-co...
You'd get very different results doing atomic operations on counters shared between multiple threads, and with other stuff to do that causes your CPU to serialise and run less out-of-order. That single-threaded benchmark is data, but it doesn’t appear awfully relevant to how naive RC barriers perform versus barriers for coalesced RC or tracing.
Indeed the issue with measuring barriers is that measuring the barrier doesn't suffice; one wants to measure how the barrier affects the rest of execution. This entails coming up with programs that are much less trivial than repeatedly incrementing a counter.
Moving to another generation can be done completely asynchronously, on another thread that would otherwise likely not be doing any useful work on modern, highly parallel hardware. `free` doesn't do much, but `malloc` does -- with the method I am talking about (TLABs in the JVM), you get allocations as fast as they get: nothing more than a NON-ATOMIC pointer bump. Meanwhile malloc has to find an empty space that can fit the object at hand.
> > Also, just try writing rust/c++ code that relies entirely on RC vs Java in an object heavy workload - I really don’t think it is an open question in any shape or form.
>
> Yes, of course garbage collectors are easier to use than reference counting. Nobody has ever disputed this. That is the whole raison d'etre of garbage collectors. This is not what the discussion is about, it's about performance.
I am talking about performance exactly. Java's GC will smoke the hell out of C++'s shared pointers and Rust's (A)Rc. No one said anything about productivity/ease of use.
And as mentioned by another commenter - your benchmark didn't take into account anything related to parallel execution, which would be the point.
From experience, your benchmark is not correctly estimating how long a non-atomic increment takes. Run 10,000 of them back to back inside each benchmark iteration (atomic and non-atomic), and then run uncontested.
You will see that the counter increment is about 2-5 cycles, a few hundred ps, and the atomic is on the order of 10 ns uncontended.
If you then introduce contention and a multi-socket setup, the atomic will slow down significantly. Only one thread can touch it at a time, so they have to take turns.
Even ignoring the fact that this isn't counting contention, 18 ns is still really high overhead, since it's an operation you have to do twice to read any object. A read from L1 cache is about 3 cycles (1-3 ns), so 36 ns to increment and decrement a counter is far from trivial.
Is your benchmark running concurrently? Or single-threaded?
> That is a ridiculous statement.
If you do use limited counts, you will need a backup tracing collector; and limited counts are appealing because they save space (and most objects tend to only have a few references [1]).
> In fact! I just realized we do have data for which is more performant: this article describes how Instagram turned of GC for Python and just relied on reference counting. They gained 10% increase in performance:
...because generations in the GC are represented as linked lists threaded through the objects, and disabling the GC avoids modifying those lists, which reduces the number of page faults, in turn because pages are shared copy-on-write between processes. That representation isn't the best idea already, and shared structure should be in the oldest generation anyway, as you don't want to collect it (which is what SBCL does for images, which are loaded copy-on-write).
[1] http://users.cecs.anu.edu.au/~steveb/pubs/papers/rc-ismm-201... page 4
Objective-C stores the retain count in spare bits of the object's isa pointer field (spilling to a side table on overflow). The hardware it runs on has been specially tuned to make atomic reference counting faster.
>C++, in std::shared_ptr<T>
In C++, every shared_ptr also allocates a separate "control block" in the heap, so performance is even worse than that.
std::shared_ptr is not meant to be used as the default memory management technique, hence why it's not that important.
But it does mean that C++ is not a good example here. I would argue that Rust isn't, either, since Arc is also an occasional opt-in there rather than the default.
Eh, the GC you are describing sounds like something out of 1994. Modern concurrent algorithms do not need to stop execution. This type of code may paradoxically be slower in benchmarks when there isn't any memory churn, as it necessarily introduces barrier instructions at key points to ensure the GC has a consistent view of the memory (although memory consistency is also a concern for reference counting).
Note that if you can afford stop the world pauses, GCs that use them are much more efficient. Parallel stop the world garbage collectors are by far the best for minimizing runtime, while concurrent GCs have higher overhead, but can guarantee no significant pauses.
Yeah, that's true, but it shouldn't be held against GC that it interrupts the execution, since it really doesn't have to.
With all due respect, why do you believe that your experience is meaningful compared to people writing actual industrial-scale runtimes, when RC is the hello world of GC algorithms? It’s not that they don’t know about it, it’s that they studied the topic countless times and found significant difference between the two approaches, to the point that no language considered fast uses RC (or if they do, they do so because it is a useful escape hatch to manual memory management that doesn’t need support from the runtime).
Hiring would certainly be a lot easier if more people were to make bold, completely wrong blog postings like these. I could immediately give my negative recommendation without the time and hassle of a phone interview.
I completely agree, but before we scare people away from blogging too much, I will say that my big problem with this post isn't the lack of knowledge, it's the willful ignorance and lack of humility. It's clear that the author doesn't really understand the position they are trying to argue against.
Has anyone done a good paper on how memory bank affinity for processors affects these costs?
See that last graph with memory overhead? Not everyone's definition of "better" allows for that.
He fails to mention that Apple added support for ref counting in silicon.
And often a GC will be able to use arena allocators before falling back to "proper" GC allocation, which will be a lot faster than ref counting everything.
And atomics can get very slow; I've had atomics show up regularly in the profiler.
For my project, the combination that works great so far: unbox all types, use arena allocators when the compiler can guarantee the value doesn't escape, use GC for data that changes often, and ref counting for data that hardly ever changes (luckily, cycles are not possible).
In practice, I don't see any reference counting approaches that do cycle detection.
I have an example from early in my career where I accidentally created a memory leak in Python from a cyclic reference between a closure and a function argument
https://stackoverflow.com/questions/54726363/python-not-dele...
Clearly this person hasn't seen how this works on NUMA CPUs. It's quite expensive to do these atomic inc/decs there, and even without NUMA, caches must be synced and flushed because of this.
Yeah, surprised no one is mentioning this. (A)RC is awesome... for flushing caches. :-(
I'm sorry, but this is a very poorly reasoned article that does not engage with any of the serious work that's been underway to get reference counting competitive with tracing GC. This is evident from the very first point:
> 1. Updating reference counts is quite expensive.
> No, it isn't. It's an atomic increment, perhaps with overflow checks for small integer widths. This is about as minimal as you can get short of nothing at all. The comparison is the ongoing garbage collection routine, which is going to be more expensive and occur unpredictably.
First off, updating the reference count invalidates the entire cache line in which the reference count lives. For naive reference counting (which I'm assuming the author is talking about since they give no indication they're familiar with anything else), this generally means invalidating the object's cache line (and with an atomic RMW, to boot, meaning you need a bus lock or LL/SC on most systems). So right away, you have created a potentially significant cacheline contention problem between readers in multiple threads, even though you didn't intend to actually write anything. RC Immix, for example, tries to mitigate this in many creative ways, including deferring the actual reference count updates and falling back to tracing for reclamation when the count gets too high (to avoid using too many bits in the header or creating too many spurious updates).
Secondly, you know what's cheaper than an atomic increment or decrement? Not doing anything at all. The vast majority of garbage in most production tracing garbage collectors (which are, with the exception of Go's, almost exclusively generational) dies young, and never needs to be updated, copied, or deallocated (so no calling destructors and no walking a tree of children, which usually involves slow pointer chasing). Even where the object itself doesn't die young, any temporary references to the object between collections don't have to do any work at all compared to just copying a raw pointer, C style. This and bump allocation (which the author also does not engage with) are the two biggest performance wins that tracing garbage collectors typically have over reference counting ones, and solutions like RC Immix must implement similar mechanisms to even become competitive. You don't even need to go into stuff like the potential benefits of compaction, or a reduction in garbage managing code on the hot path (which are more dubious and harder to show) to understand why tracing has some formidable theoretical advantages over reference counting!
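To make the bump-allocation point concrete, here is a toy sketch (not any particular collector's implementation) of why nursery allocation is so cheap: allocating is a pointer increment, and when most objects die young, a minor collection reclaims all of them at once by resetting that pointer after evacuating the few survivors.

```python
class BumpAllocator:
    """Toy nursery: allocation is just a pointer increment."""

    def __init__(self, size):
        self.heap = bytearray(size)
        self.top = 0

    def alloc(self, n):
        if self.top + n > len(self.heap):
            return None          # nursery full: time for a minor GC
        addr = self.top
        self.top += n            # the entire cost of an allocation
        return addr

    def reset(self):
        # After evacuating survivors elsewhere, every dead object in the
        # nursery is reclaimed in one step: no per-object work at all.
        self.top = 0

a = BumpAllocator(64)
assert a.alloc(16) == 0
assert a.alloc(16) == 16
a.reset()
assert a.alloc(8) == 0
```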
But what about in practice? Surely, the overhead of having to periodically run the tracing GC negates all these benefits? Well, bluntly--no, not even close. At least, not unless you care only about GC latency to the exclusion of everything else, or are using something fancier (like deferred RC). You can't reason backwards from "Rust and C++ are generally faster than languages with tracing GCs on optimized workloads" to conclude that reference counting is better than tracing GC--Rust and C++ both go out of their way to avoid using reference counting at all wherever possible.
None of this is secret information, incidentally. It is very easy to find. The fact that the author is apparently so incurious that they never once bothered to find out why academics talk about tracing GC's performance being superior--and the fact that they were so dismissive about it!--makes me pretty doubtful that people are going to find useful insights in the rest of the article, either.
GC researchers insist on conflating GC with all of automatic memory management. The public doesn't do this and neither does the article.
> Secondly, you know what's cheaper... Not doing anything at all.
These techniques are on the level of resetting a stack pointer or calling `sbrk()`. Incorporating them doesn't produce more-advanced GC schemes, it just means you neglected to consider similar allowances for RC.
The line of contention is at traversing the object graph and pausing threads.
Of course I have considered similar allowances for Rc (this has nothing to do with stack allocation by the way, I'm hoping this is not a misunderstanding on your part). I referenced RC Immix throughout the post. It is extremely clear that the author of the article did not consider such allowances because they do not see that there is a problem. Even if the author had done so, there is a big difference between saying "whatever, I could do that same thing for Rc!" and actually doing it--a hugely nontrivial one that has not been fully bridged until quite recently. These kinds of techniques are still not used in production languages with reference counting garbage collection, including Swift.
An advantage of RC is that you can also use it to verify ownership.
When the counter is 1, you can do anything with the object without affecting any other references.
Like the object could be mutable for a counter=1, and copy-on-write otherwise. Then you can make a (lazy) deep copy by just increasing the counter.
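A hedged sketch of that count==1 copy-on-write idea, using CPython's `sys.getrefcount` (whose exact values are an implementation detail; the threshold below assumes current CPython calling conventions):

```python
import sys

def cow_set(buf, i, val):
    # sys.getrefcount counts its own temporary argument plus the local
    # `buf`, so a uniquely-owned list typically reports 3 here in
    # CPython. Anything higher means another owner exists, so we copy
    # before writing -- the refcount doubles as an ownership check.
    if sys.getrefcount(buf) > 3:
        buf = list(buf)
    buf[i] = val
    return buf

shared = [1, 2, 3]
alias = shared                  # a second owner forces a copy
result = cow_set(shared, 0, 99)
assert shared == [1, 2, 3]      # original untouched: copy-on-write kicked in
assert result == [99, 2, 3]
```

Swift's arrays do essentially this natively via `isKnownUniquelyReferenced`.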
>> "The Python case is more inarguable. If GC is so good, why wouldn't Python just garbage collect everything,... ? It is because RC outperforms garbage collecting in all these standard cases"
Pretty weird argument for one of the slowest languages out there ...
> This is about as minimal as you can get short of nothing at all.
With GC, you can do nothing at all. In a system with lots of garbage, you can do a GC by copying everything accessible from the GC root, then de-allocating all the garbage in a single free.
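A toy illustration of that idea, with objects modeled as lists of child ids and recursion standing in for Cheney's iterative scan: only live objects are ever touched, and the old heap (garbage included) can then be dropped in one step.

```python
def evacuate(heap, roots):
    """Copy everything reachable from `roots` out of `heap`.

    `heap` maps object id -> list of child ids. The returned to-space
    contains only live objects; the old heap is freed wholesale.
    """
    to_space = {}
    forward = {}   # forwarding table: old id -> new id

    def copy(old_id):
        if old_id in forward:          # already evacuated (handles cycles)
            return forward[old_id]
        new_id = len(to_space)
        forward[old_id] = new_id
        to_space[new_id] = []          # reserve the slot before recursing
        to_space[new_id] = [copy(c) for c in heap[old_id]]
        return new_id

    new_roots = [copy(r) for r in roots]
    return new_roots, to_space

# A live pair (forming a cycle) plus one garbage object that is never visited.
heap = {0: [1], 1: [0], 2: []}
roots, new_heap = evacuate(heap, [0])
assert len(new_heap) == 2   # object 2 cost nothing to "collect"
```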
Atomic inc/dec is hella expensive relative to not doing it. It’s true that CPUs optimize it, but not enough to make it free. RC as a replacement for GC means doing a lot more of this expensive operation - which the GC will do basically zero of in steady state - so this means RC just costs more. Like 2x slowdown more.
The atomic inc/dec also have some nasty effects on parallel code. The cpu ends up thinking you mutated lines you didn’t mean to mutate.
So, GC is usually faster. RC has other benefits (more predictable behavior and timing, uses less memory, plays nicer with OS APIs).
> So, GC is usually faster.
GC is way faster if there is little collection.
In memory or cache intensive applications, garbage collection as a whole can be significantly slower.
GC is faster even if you collect a lot. GCs create better cache locality especially for recently allocated objects, and their cache behavior is not generally worse than malloc (but there are many GCs and many mallocs and some try harder than others to make caches happy).
The total time spent in GC across a program’s execution time is usually around 30% or so. Maybe more in some cases (some crazy Java workloads can go higher) or less in others (JavaScript since the mutator is slow), but 30% is a good rule of thumb. That includes the barriers, and total cost of all allocations, including the cost of running the GC itself.
Reference counting applied as a solution to memory safety, as a replacement for GC, is going to cost you 2x overhead just for the ref counting operations and then some more on top of that for the actual malloc/free. When you throw in the fact that GCs always beat malloc/free in object-churn workloads, it’s likely that the total overhead of counting refs, calling free(), and using a malloc() that isn’t a GC malloc is higher than 2x, i.e., more than 50% of time spent in memory management operations (inc, dec, malloc, free).
It’s a trade off, though. The GC achieves that 30% because it uses more memory. All of the work of understanding the object graph is amortized into a graph search that happens infrequently, leading to many algorithmic benefits (like no atomic inc/dec, faster allocation fast path, freeing is freeish, etc), but also causing free memory to be reused with a delay, leading to 2x or more memory overhead.
That also implies that if you ask the GC to run with lower memory overhead, it’ll use more than 30% of your execution time. It’s true that if you want the memory usage properties of RC, and you try to tune your GC to get you there, you gonna have a slow GC. But that’s not how most GC users run their GCs.
Another RC advocate who misses the point that RC is a GC algorithm from a CS point of view.
This article provides very little evidence for its claims and seems to have only a superficial understanding of modern GCs.
"Increments and decrements happen once and at a predictable time. The GC is running all the time and traversing the universe of GC objects. Probably with bad locality, polluting the cache, etc."
This is only the case with a mark-sweep collector, usually most of your allocations die young in the nursery. With reference counting you pay the counting cost for everything.
"In object-oriented languages, where you can literally have a pointer to something, you simply mark a reference as a weak reference if it might create a cycle."
As someone who has tried to identify memory leaks in production where someone has forgotten to "simply" mark a reference in some deep object graph as weak, this is naive.
"With an 8-byte counter you will never overflow. So...you know...just expand up to 8-bytes as needed? Usually you can get by with a few bits."
So now my "about as minimal as you can get short of nothing at all" check has an unpredictable branch in it?
"If you must overflow, e.g., you cannot afford an 8-byte counter and you need to overflow a 4-byte counter with billions of references, if you can copy it, you create a shallow copy."
I don't even know where to begin with this.
"If GC is so good, why wouldn't Python just garbage collect everything, which they already did once and could trivially do, instead of going through the hassle of implementing reference counting for everything but the one case I mentioned?"
This probably has more to do with finalising resources and deterministic destruction than anything else.
--
Anyone who is interested in actually studying this area would probably find https://courses.cs.washington.edu/courses/cse590p/05au/p50-b... interesting. Also https://gchandbook.org/
>If GC is so good, why wouldn't Python just garbage collect everything, which they already did once and could trivially do
I don't think Python ever did pure mark-and-sweep (CPython, at least; I'm sure Jython and other alternate implementations have).
My understanding was that they did pure reference counting, and kludged on a sweep GC to do cycle breaking eventually, as manually breaking cycles in early versions of python was a pain point. A quick lookup seems to indicate python1 was pure reference counting, and they added the cycle breaking when they released python2.
That's my recollection as well (though I've not found evidence to support it). In fact, IIRC, as part of the python performance thing one of the topics was adding a proper generational GC into python for objects that don't refer to pinned memory.
Whenever I read something like this, I wonder what kind of programming the author is doing. I’m getting a strong whiff of embedded and/or real-time systems.
Reference counting forces the developer to think about memory management.
Apple has a nice talk on ARC [1] but it got me thinking: if I have to think about reference counting this much I might just as well manage memory all by myself.
The true joy of Garbage Collection is that you can just create objects left and right and let the computer figure out when to clean them up. It's a much more natural way of doing things and lets computers do what they're best at: taking tedious tasks out of the hands of humans.
[1]: https://developer.apple.com/videos/play/wwdc2021/10216/
One great advantage of garbage collection is that it removes the need for thread synchronization in cases where it is only needed to make sure that the object you are going to call free/decref on is not in use in another thread. A corollary is that GC is the thing you need for many lock-free data structures to be practical and not only of theoretical interest.
It might seem that it is simply about pushing your synchronizations problems onto the GC, but the synchronization issue that GC solves internally is different and usually more coarse-grained, so in the end you have significantly smaller synchronization overhead.
You are in for a fun time when you need to make circular data structures with ARC.
What if programming languages started offering both? In my opinion RC is GC in disguise. For example, Golang's GC at least has the merit of running in a separate thread (it still has to stop the world when reclaiming memory, and the memory allocator's design helps achieve great GC performance).
Python does both reference counting and garbage collection.
Btw, GC is also often RC in disguise. What I mean is that generational garbage collectors are basically a hybrid of tracing GC and RC. See https://web.eecs.umich.edu/~weimerw/2012-4610/reading/bacon-... for the details.
My understanding of Python's Global Interpreter Lock is that reference counting cannot be done efficiently between threads, so we cannot remove the GIL with reference counting
Java's GC is concurrent and runs at safe points and stops the world so it avoids this problem.
When I wrote a Lisp interpreter in the '90s, that's how I did it and I'm ashamed to admit that I have no idea how modern GC is done - I've always assumed (naively!) that it was like Lisp's!
Modern Lisps likely have modern GCs. I mean there's no reason for them not to.
Racket probably has a state-of-the-art garbage collector. (I don't actually know, but that's where I would start looking.) Clojure obviously has the same garbage collector as any other JVM language.
Racket has like 5 GCs, perhaps more.
At one extreme, you can build Racket using the Senora GC, which is conservative and non-moving and is used only for bootstrapping.
On the other extreme, both of the normal versions of Racket have custom moving incremental GC. The docs with some high level explanations are in https://docs.racket-lang.org/reference/garbagecollection.htm...
The implementation details of the main "CS" version are in https://github.com/racket/racket/blob/master/racket/src/Chez... It's a replacement for the default GC of Chez Scheme with better support for some features that are used in Racket, but I have never looked deeply into the details.
Thanks for doing the legwork! I didn't find anything definite in twenty seconds of Googling, so I left my comment vague.
Pardon the ignorance but I thought refcount was a GC strategy?
Academia tends to call RC a form of GC. For programmers experienced in languages with manual memory management, those are very different beasts.
> I've already stated I'm not going to do benchmarks.
Yikes
Isn’t garbage collection needed to solve circular reference counts?
Nope, you can just mark the back-reference as weak.
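For instance, in Python terms (the same idea applies to Swift's `weak` references), a hypothetical parent/child pair where the back-reference is weak so pure reference counting suffices:

```python
import weakref

class Parent:
    def __init__(self):
        self.children = []   # strong: the parent owns its children

    def add(self, child):
        child.parent = weakref.ref(self)  # weak back-reference: no cycle
        self.children.append(child)

class Child:
    pass

p = Parent()
c = Child()
p.add(c)
assert c.parent() is p    # dereference while the parent is alive
del p
# In CPython the parent dies immediately on refcount zero,
# since the child only holds it weakly.
assert c.parent() is None
```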
GC is only required if you as a programmer (or programming language) do not provide sufficient information to the compiler or runtime to understand the object graph.
It's not always obvious to know which reference to mark as weak, and there's not necessarily a clear indication of which reference is a back-reference.
You can find various algorithms in journals or whatnot written with the assumption that there's GC. Algorithms designed with this assumption may not have clear ownership for objects, and those objects may have cyclic references.
It's easy to say, "objects should have clear ownership relationships" but that kind of maxim, like most maxims, doesn't really survive if you try to apply it 100% of the time. Ownership is a tool that is very often useful for managing object lifetimes--it's not always the tool that you want.
But, I mean, the whole purpose of using a reference counted GC language is the productivity gain. If I'm going to be using a reference counted language and manually specifying weak pointers then I'd just use C++.
It sounds like you're saying that the only productivity gain from ARC (automatic reference counting) is that you don't have to manually annotate what type of pointer you want. I don't agree with that.
Yes, if you forget to mark a ref-counted pointer as weak, then you may get a memory leak. Memory leaks can be disastrous, but they can be benign, and they're always better than use-after-free.
In C++ it is easy to create a raw pointer (with & or *). It's unsafe. Soon enough, you have some lambda inside another function, but the lambda is executed after the enclosing function returns, and you've captured a variable with &. Oops. You thought that the lambda would get executed during the enclosing function's execution, but you misread the API you were using.
IMO the big productivity gain is being able to write code where I don't have to think too hard about whether the code is memory-safe. Modern C++ code makes this easier, but languages with ARC (like Objective-C or Swift) make this even easier. Code is mostly safe by default, and you can visually inspect code to look for unsafe behavior, more easily than you can with C++.
There are also hybrid ref-counted + tracing GC options. CPython uses this approach.
As a slight tangent, weak pointers are useful in languages with a tracing GC, too. Haskell offers weak pointers.
For example they make certain kinds of caches easier.
And if you have really clear ownership, you don't need reference counting either..
That's definitely not true. "Clear ownership" may still be shared ownership.
Hmm, ok. But how is that different from situations where tracing GC is useful? (And even languages with a tracing GC often offer weak pointers. Eg Haskell and Java do. Python does as well, but Python has both reference counting and tracing garbage collection.)
Apart from performance (latency vs. throughput) considerations, the only difference between RC and GC for your algorithm and data structure design is whether you allow cycles in your data structures.
Just allocate and then deallocate manually; use neither automatic method.
What do you mean by 'manually'? Malloc and free still do lots of work.
Or do you want to manually assign memory addresses to your objects?
What about - garbage collect by reference counting, like Python?
Python has reference counting for historical reasons, and added tracing garbage collection for dealing with cycles.
If you wanted performance these days, you wouldn't go for that architecture. It's a historical accident that they can't really free themselves from because of backwards compatibility.
Is there such a thing as a compacting RC?
Backup compaction can be useful, like backup tracing can be, but you can also use all the initial increments in a coalescing RC collector to determine which pointers need to be fixed up for copying, without tracing. See http://users.cecs.anu.edu.au/~steveb/pubs/papers/rcix-oopsla... pages 8 and 9 on "Defragmentation with Opportunistic Copying" e.g.
There was a great talk at Strange Loop about a drop-in malloc replacement which compacts.
Apparently it actually led to memory usage improvements in industrial projects like Redis:
See https://github.com/plasma-umass/Mesh for the code and a link to the paper.
I read most of the article and it's just a lot of the same tired old arguments and an extremely simplified worldview of both GC and reference counting. I wish I had the author's address, because I'd like to mail them a copy of the Garbage Collection Handbook. They clearly have a very naive view of both garbage collection and reference counting. And there isn't a single dang measurement anywhere, so this can be completely dismissed IMHO.
Agreed. What I particularly disliked is how absent of nuance it is. RC is a form of GC, and all GC algorithms make tradeoffs. RC trades throughput for latency; compacting mark-and-sweep trades latency (and usually memory) for throughput.
The rant at the end can be boiled down to "I use confirmation bias [1] to make my engineering decisions". The OP has already decided that "GC" is slow, so I'm sure every time a runtime with it misbehaves it's "Well, that darn GC, I knew it was bad!" and every time RC misbehaves it's likely "Oh, well you should have nulled out your link here to break the cycle dummy!"
I really don't like such absolutist thinking in software dev. All of software dev is about making tradeoffs. RC and GC aren't superior or inferior to each other, they are just different and either (or both) could be valid depending on the circumstance.
> absent of nuance
Yes, this is a good point. It makes overly general claims.
E.g. a GC proponent could claim "well, tracing collectors do no work for dead objects, so they have no overhead!" Which is a good point, but not the whole story. Tracing collectors may need to repeatedly traverse live objects. Sure. But then generational collectors only traverse modified live objects that point to new objects. True. And concurrent collectors can trace using spare CPU resources, incremental collectors can break marking work up into small pauses, on and on. There are zillions of engineering tradeoffs and the GC Handbook covers most of them really well.
Boy, I can't wait for theangeryemacsshibe (posts here as hayley-patton) to tear into this one.
But yeah, the correct way to handle resources (not just memory!) is with value semantics and RAII. Because then you know the object will be cleaned up as soon as it goes out of scope with zero additional effort on your part. In places where this is not appropriate, a simple reference counting scheme may be used, but the idea is to keep the number of rc'd objects small. Do not use cyclical data structures. Enforce a constraint: zero cycles. For data structures like arbitrary graphs, keep an array of vertices and an array of edges that reference vertices by index.
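A quick sketch of that last suggestion (in Python for consistency with the other examples here, though the advice is aimed at C++/RAII code): edges reference vertices by index, so the graph can contain cycles while the objects themselves form a strict ownership tree.

```python
class Graph:
    """Arbitrary graph with no pointer cycles: edges are index pairs."""

    def __init__(self):
        self.vertices = []   # vertex payloads, owned solely by the graph
        self.edges = []      # (src, dst) index pairs into self.vertices

    def add_vertex(self, payload):
        self.vertices.append(payload)
        return len(self.vertices) - 1

    def neighbors(self, v):
        return [d for s, d in self.edges if s == v]

g = Graph()
a = g.add_vertex("a")
b = g.add_vertex("b")
g.edges.append((a, b))
g.edges.append((b, a))   # a "cycle", but only as plain integers
assert g.neighbors(a) == [b]
```

Destroying the graph frees everything, with no cycle detection or weak references needed.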
If you use a language with GC, you're probably just contributing to global warming.
Why not just write embedded programs with fixed size memory allocation then if we are that okay with restricting the programs we write?
Because maybe we're not okay with restricting the programs we write that much.
Nor am I okay with restricting my programs to the point that a tree-only allocation strategy would suffice; circular data structures are not evil.
Doesn't RAII only work when your lifetimes are in a nested hierarchy?
(Basically, your lifetimes have to be the same as your scopes, which are in a simple tree structure only.)