Ubuntu 24.04 LTS will enable frame pointers by default

ubuntu.com

222 points by jnsgruk 2 years ago · 120 comments

stephendause 2 years ago

For anyone else who didn't know what a frame pointer was: https://softwareengineering.stackexchange.com/questions/1943...

ghotli 2 years ago

I somewhat painstakingly figured this out the hard way, pulling core dumps off of embedded Linux devices that gdb had a hard time working with. At the time I wondered why omitting frame pointers was the default in so many places when it didn't seem to make a measurable difference in the performance of the software. I guess it's just vestigial these days, and yes, please use a compiler flag like this when it measurably makes sense. Making debugging simpler for the rest of us is the way to go.

  • o11c 2 years ago

    Something has to be seriously wrong with the way it was compiled for gdb to have trouble. Debugging (or in-process exception dumps), which only does a reasonable number of backtraces, should always be able to use separate unwind data sections.

    Including frame pointers should only have a performance effect for sample-based profiling, which does a very large number of backtraces. And the general fact is - people don't profile, and if they do, they don't do it correctly.

    Omitting frame pointers has significant performance wins on platforms with about 6 registers, like 32-bit Intel x86. It's much less of a win on platforms with about 14 registers, like 64-bit x86 or 32-bit ARM, let alone platforms with about 30 registers, like 64-bit ARM.

    Since modern architectures are strongly trending toward designs that support more registers, a frame pointer isn't unreasonable to choose. But that's still no excuse for all the shitty software that refuses to work correctly without them, rather than merely more slowly.

    (Note that theoretically it is possible to design an ISA/ABI combo that supports easy and fast unwinding even without frame pointers, but there's always going to be some overhead, and to my knowledge nobody has made that choice.)

    • brancz 2 years ago

      Founder of Polar Signals here, the profiling product that's mentioned in the blog post. We've already made it possible to unwind without frame pointers relatively cheaply (<1% overhead), and have written about this extensively [1].

      That said, no matter how we spin it, frame pointer unwinding is always going to be cheaper, and while profiling is getting better, I'm almost more excited about the other aspects of debuggability this gains: out-of-the-box working bpftrace, bcc-tools, and anything else that needs to deal with unwinding, with just about anything that's running on the box. I think we'll see a huge gain in capabilities over the next few years with frame pointers more prevalent in Fedora and Ubuntu, and I'm sure more distros will now follow.

      [1] https://www.polarsignals.com/blog/posts/2022/11/29/dwarf-bas...

    • stefan_ 2 years ago

      Embedded devices, certainly in the days of 16 MiB NOR flash, do not contain unwind or debug information. Even today OpenWrt and similar will routinely strip all binaries installed to the final firmware image.

      There are some structural issues in the Linux world, too; the default of debug data contained within the binary is often undesirable, symbol servers (they finally learned about those in Ubuntu 22) require extra setup & tooling support that isn't often invested in, widely used libraries like libunwind are both arcane and terrible (yes, an instruction pointer of 0 will not have associated unwind information; use your brain and realize someone called a NULL function pointer).

    • ghotli 2 years ago

      > Something has to be seriously wrong with the way it was compiled for gdb to have trouble.

      Cargo-cult culture in embedded devices, of just using some old toolchain copy-pasted from some vendor, seems to be pervasive. I cut fresh compilers and align their output with the old crusty toolchains. That made entire classes of issues go away. Regardless, agreed: the omission of the frame pointer was merely one issue at play with those particular core dumps on those particular devices, years back at this point. :)

brancz 2 years ago

Frame pointers are such a destructive micro-optimization to omit by default, I am beyond excited about this collaboration with the folks at Canonical to make Ubuntu debuggable by default!

  • duskwuff 2 years ago

    This option was almost certainly a holdover from the bad old 32-bit x86 days, when disabling frame pointers gave you a seventh valuable general-purpose register. It's no longer beneficial on x86_64 -- even with rbp locked down, you still have fourteen registers there.

    • jandrese 2 years ago

      Yeah, back in the 32-bit days -fomit-frame-pointer was the only optimization you could count on to really make a difference. It wasn't small either, often a 10-15% speedup. No other gcc flag would make even a full percentage point of difference in my testing.

      The AMD64 architecture fixed the underlying problem, so this is pretty much just a holdover. I'm surprised they even enabled it by default.

      • grrandalf 2 years ago

        (iirc) I think I used `-fomit-frame-pointer` with DJGPP on DOS on a 486. It was an [unimpressive :)] software-rendered 3D graphics demo, and I was very happy with the substantial free speedup I got.

    • bodyfour 2 years ago

      That much is true -- the difference between 6 and 7 registers is much larger than the benefit of going from 14 to 15.

      However, even under zero register pressure having a frame pointer is still an extra register that needs to be touched on every function invocation, extra instructions taking space in the I-cache, etc. It's a small thing, but it's still a cost that has to be paid by all compiled code.

      I'm not going to claim that re-enabling frame pointers was the wrong choice -- the people involved in the debate know the tradeoffs and I have to start with the assumption that I would have made the same decision if I were in their position. It does make me slightly sad, though. The idea behind removing frame pointers isn't that backtraces aren't important, it's that computing the frame pointer after-the-fact is possible -- i.e. for normal functions without alloca() or dynamically-sized stack arrays map %rip -> frame size.

      The problem seems to be that despite years of experience with "no-frame-pointer" being the default I guess the profiling tools never got as reliable or good as the with-frame-pointer variants. My personal hope was that the problem would fade over time as tools improved, but it seems that's unlikely to ever happen. After all, once no-frame-pointer stops being the default there won't be any pressure for tools to improve. The towel has been thrown in.

      • brancz 2 years ago

        Profiling tools have already solved reliable unwinding in the absence of frame pointers[1], but for plenty of tools this kind of investment is simply too much and will never happen, like bpftrace or bcc-tools.

        [1] https://www.polarsignals.com/blog/posts/2022/11/29/dwarf-bas...

        • account42 2 years ago

          Do those tools really need to implement their own unwinding though? This really should just need to be implemented once in a library and then used wherever unwinding is needed.

          • brancz 2 years ago

            With the combination of eBPF and DWARF-based unwinding it's not quite that simple; there's a pretty elaborate dance between user-space and kernel-space that needs to happen for this to work. With frame pointers it's just walking a linked list in kernel-space.

      • rightbyte 2 years ago

        Ye. Bytecode interpreters very much benefit from the extra register. And any function that looks similar to one.

        I mean, e.g. getter and setter functions get a lot of extra code to run.

        • account42 2 years ago

          Note that even without -fomit-frame-pointer, the compiler can still omit the frame pointer in some cases. From the GCC documentation:

          > Note that -fno-omit-frame-pointer doesn’t guarantee the frame pointer is used in all functions. Several targets always omit the frame pointer in leaf functions.

          Fully inlined functions also won't have a separate stack frame at all. I imagine that is a big part of why the perf impact of turning the frame pointer back on is as small as it is.

          It however also means that a complete stack trace will still require using debugging information, in which case you don't really need a frame pointer at all.

    • kmeisthax 2 years ago

      Also, apparently Intel is planning to extend x86_64 to 32 GPRs, with an extension called... sigh[0]... Intel APX[1]. So the overhead of frame pointers will be even lower in the future.

      [0] Intel APX is extremely confusable with the iAPX 432, a failed non-x86 architecture Intel made that's completely unrelated to doubling the size of the x64 register file.

      [1] https://www.intel.com/content/www/us/en/developer/articles/t...

    • cesarb 2 years ago

      To emphasize this point: on 64-bit x86 with frame pointers, you have twice as many registers as on 32-bit x86 without frame pointers, and these registers are twice as wide. A 64-bit value (more common than you'd expect even when pointers are 32 bits) takes two registers on 32-bit x86, but only a single register on 64-bit x86.

      • jamesfinlayson 2 years ago

        So there's no point in disabling frame pointers for 32-bit code running on a 64-bit processor then?

    • brancz 2 years ago

      100% the way I see it! On 32-bit the performance benefit is major on just about everything, but not so on 64-bit.

    • PaulDavisThe1st 2 years ago

      and 14 registers should be enough for everyone!

  • amelius 2 years ago

    > to make Ubuntu debuggable by default!

    How's that? You'd still need the debug symbols.

    Also has anyone else noticed that running stuff through Valgrind is really only possible if the program was made with Valgrind in mind? For example, Python and its many extensions generate numerous errors and warnings, so many that any real problem becomes hidden.

    I'd say that modern Linux systems are very far from being debuggable.

  • account42 2 years ago

    It's a "micro" optimization that can be automatically applied everywhere. And it's not destructive, because debug info contains all you need to calculate the frame pointers after the fact. Really a no-brainer to have it on unless you are dealing with broken tools.

  • karmakaze 2 years ago

    > Ubuntu leaps forward with frame pointers by default

    Yeah it's more like "...will stop omitting frame pointers by default".

londons_explore 2 years ago

So the whole world should take a 1-2% performance penalty on everything so some users can maybe run a profiler?

Wouldn't it make more sense to just have an 'apt reinstall all --with-frame-pointers' command that power users could run before they wanted to profile something?

  • brendangregg 2 years ago

    I don't know where 1-2% comes from, but for many scale production workloads I studied it was so close to 0% that it was tough to measure beyond noise on the cloud. That's not to say that 1-2% is wrong, but that it's likely specific to someone's workload, and other people see less.

    Helping people find ~30-3000% perf wins, helping debugging and automated bug reports, is huge. For some sites it may be like 300 steps forward, one step back. But it's also not the end of the road here. Might we go back to frame pointer omission one day by default if some other emerging stack walkers work well in the future for all use cases? It's a lot of ifs and many years away, and it assumes a lot of engineering work continues to be invested for recovering a gain that's usually less than 1%, but anything's possible.

    There's a couple of problems with an apt reinstall. One is that people often don't work on performance until the system is melting down -- many times I've been handed an issue where apt is dreadfully slow due to the system's performance issue and just installing a single package can take several minutes -- imagine reinstalling everything, it could turn the outage into over an hour! The other is I'd worry that reinstalling everything introduces so many changes (updating library versions) that the problem could change and you'd have no idea which package update changed it. If there was such an apt reinstall command, I know of large sites (with experience with frame pointer overheads) that would run it and then build their BaseAMI so that it was the default. Which is what Ubuntu is doing anyway.

    • account42 2 years ago

      Even 0.1% scaled across all users is a huge amount of wasted energy. You can profile without subjecting everyone to frame pointers.

      • nemetroid 2 years ago

        Not nearly as much as the missed optimizations from not having this easily available.

        • account42 2 years ago

          [citation needed]

          That's just the same handwavey reasoning as given in the article. Where is the evidence that this will actually result in any significant amount of optimization that wouldn't be possible without making everything (slightly) slower for end users?

    • Levitating 2 years ago

      > Our analysis suggests that the penalty on 64-bit architectures is between 1-2% in most cases.

      Right from the article. I find it a difficult subject; as a developer/power user I am happy to see frame pointers. But I cannot speak for others.

  • aseipp 2 years ago

    Many systems take various kinds of performance hits in return for things all the time; reliability, observability, safety, etc. Many systems can be run at higher peak throughput in return for various instabilities, even. Performance is not actually a uniform number across the system. You're looking at an aggregate, but changes like this can make it much, much more practical to diagnose specific performance issues for users in specific scenarios, which may have extremely large impacts far beyond 1-2%. That's very important in practice especially when users can often feel those outliers, e.g. why does this application enter a spinning state and suddenly burn CPU for 1 minute before returning to normal.

    > Wouldn't it make more sense to just have an 'apt reinstall all --with-frame-pointers' command that power users could run before they wanted to profile something?

    I don't see why it makes any more sense than just changing the default that the distribution uses. For one it's way more work, maintaining another copy of everything for a ~1% performance difference is not an obviously good tradeoff for the distro teams to make. Not to mention it often isn't possible to do this in the cases people want it i.e. they want to continuously profile an existing production system that they can't just run apt on willy nilly.

  • alexey-salmin 2 years ago

    I've seen how instantly-available profiles affect the engineering culture in practice, and it's transformative. The difference between "yeah strange, I'll deploy an fp build some time later and check... maybe" and "see this thing right here on the flamegraph" is huge and often repays 5-15x of the initial 1% slowdown.

    • nightowl_games 2 years ago

      This x1000

      This "1%" loss will never manifest. It will be pure gain.

      • nerpderp82 2 years ago

        Single digit percentages are noise in moore-units. Layout has a bigger effect. So many "optimizations" in our tech culture are around removing the headlights and brakes so that the car can go slightly faster in the dark, on hills.

        • nightowl_games 2 years ago

          Good analogy.

          Stack traces and good performance profiles are table stakes to even starting to make good software imo

  • dymk 2 years ago

    FTA

    > I’ve enabled frame pointers at huge scale for Java and glibc and studied the CPU overhead for this change, which is typically less than 1% and usually so close to zero that it is hard to measure.

  • alfalfasprout 2 years ago

    This is very hyperbolic and inaccurate.

    In 2023 it's not a 1-2% performance penalty anymore and certainly not for most use cases. Only if the 15th register is critical for performance on an x86_64 CPU.

    Certain workloads might suffer more, but most will certainly suffer less than a 1-2% hit.

    • akira2501 2 years ago

      Using any of the upper 8 registers on x86_64 requires an opcode prefix and makes your instruction 1 byte longer. There is still a small reward for avoiding r8-r15.

    • issafram 2 years ago

      very allegorical

  • m463 2 years ago

    You are prematurely optimizing.

    "can make use of this improved debugging information to diagnose and target performance issues that are orders of magnitude more impactful than the 1-2% upfront cost."

    Also, can't you get reliable stack dumps when something goes wrong too?

    • account42 2 years ago

      You can get reliable stack dumps without frame pointers.

      Removing a needless instruction and register pressure that 99.99% of users don't rely on in any way (and that the rest wouldn't need if they fixed their tools) is not premature optimization but simple common sense. Which is why omission was on by default in the first place.

      Calling 1% or even 0.1% optimizations that apply across the board "premature optimization" is a great example of the culture of wastefulness that has made computers less responsive even though hardware has gotten a million times faster. These things do add up.

  • pjmlp 2 years ago

    They already take performance hits left and right with all those containers running all over the place, parsing JSON.

  • coldtea 2 years ago

    >So the whole world should take a 1-2% performance penalty on everything so some users can maybe run a profiler?

    If so, I'm all for it. The win from easy access to profiling can dwarf this 1-2%

    • account42 2 years ago

      That's assuming you can't profile without frame pointers which is simply not true. If some tools have issues then fix them.

      • coldtea 2 years ago

        No, it's only assuming that you can more easily (with less overhead) profile with frame pointers. Which is simply true.

  • gloryjulio 2 years ago

    At this point I would think observability and debuggability are the qualities I'd like my systems to have. The productivity gains are immeasurable.

zX41ZdbW 2 years ago

ClickHouse has always-on profiling without frame pointers. But the implementation is very hard - it required patching of LLVM's libunwind to make it 100% async-signal safe. Using frame pointers should be easier and faster.

pryz 2 years ago

The post from the Polar Signals blog: https://www.polarsignals.com/blog/posts/2023/12/13/embracing...

Exciting!

kapilvt 2 years ago

hmm.. I wonder if the 10% performance regression in python has been resolved. https://discuss.python.org/t/python-3-11-performance-with-fr...

  • jonseager 2 years ago

    We’re waiting on the benchmarks once the archive has been rebuilt to double check if we’ll be affected in that way, but if indeed we do find that sort of regression, we’ll exclude Python from this change (and any other package where there is a substantial hit).

amne 2 years ago

Even in context it is hard to understand this: "The performance wins that these can provide far outweigh the comparatively tiny loss in performance."

My guess is that on average the potential performance discovered with the techniques this enables is higher than the guaranteed negligible performance loss.

  • o11c 2 years ago

    That is highly optimistic though. Profiling is hard even when you know what you're doing, and doing it wrong can easily lead to pessimization (thinking particularly of cases where your profiling workload exercises a different set of branches).

    • brancz 2 years ago

      Hyperscalers have long been doing infrastructure-wide profiling (or "Google-Wide Profiling" as the first whitepaper on the topic calls it [1]). This tech allows Google to reduce infra-resource usage by multiple percentage points per quarter.

      [1] https://research.google/pubs/google-wide-profiling-a-continu...

      • o11c 2 years ago

        "Software that runs at scale" is a very different story than "Software that is shipped with a distro", unfortunately.

        Very few distro packages can even define what a normal workload looks like.

        • dralley 2 years ago

          That's precisely why it's nice for users to be able to collect these profiles themselves.

  • saagarjha 2 years ago

    Yes, that's correct.

daoistmonk 2 years ago

hasn't this already been in fedora for almost a year?

  • brancz 2 years ago

    You're not wrong, but isn't it exciting that it's gaining more traction? Ubuntu is also very widely deployed, so I'll celebrate every distro that makes this the default, be it Ubuntu in this case, or Debian, or Arch, or Suse, or or or...

  • pavon 2 years ago

    Yeah. To be fair there is a stronger argument for it being enabled in fedora than an LTS release (RHEL or Ubuntu) since it has more cutting edge software that needs more frequent debugging, is less likely to be used in production where the (minor, but uneven) performance hits may matter, and has so many upstream developers using it as their daily driver.

    • brancz 2 years ago

      I would argue that LTS releases are going to be deployed millions of times and stay around effectively forever with lots of very critical software being deployed on it. Having all processes/binaries be debuggable cheaply and easily in stressful situations is a major improvement that's now here to stay.

    • kelnos 2 years ago

      Agreed that cutting-edge software likely needs more frequent debugging, but I don't think that means LTS releases shouldn't be easier to debug.

      Consider that you're a big company deploying software to hundreds or thousands of machines, and you hit a difficult-to-diagnose performance issue, crash, etc. You'll very much appreciate if the OS has made it easier for you to debug things.

      Put another way, Fedora users/developers might appreciate having frame pointers because they have to debug more frequently, but RHEL/LTS release users might appreciate frame pointers because on the less-frequent occasion when they need to debug, the stakes are much higher.

  • bravetraveler 2 years ago

    I believe so, I remember a fair amount of hubbub over it.

    Can't say it's been useful here, any development I do is miles away from this. It hasn't hurt either so... cool, I guess

    Call me pessimistic, but I'm not convinced this being the default will lead to more profiling. There's plenty that could be done without this, that isn't, so I'm not buying it.

    • brancz 2 years ago

      While profiling benefits immensely, imagine all the other ramifications. Cheap tracing of MySQL, postgres, or anything else using bpftrace or any other bcc tools. Getting stack traces of processes that are already core dumping...

      Profiling is one part, but the debuggability this enables is going to be huge in the long term I predict.

herodoturtle 2 years ago

For the layman, what does this mean?

  • jcranmer 2 years ago

    There are two key effects of this decision.

    The first effect is that it makes one additional general-purpose integer register unavailable for use for code. x86-64 has 16 general-purpose registers, but one of these is the stack pointer and basically can't be used for any other purpose; this would add a second reserved register for the frame pointer. This effect may cause slowdowns if the 15th register was critical for performance.

    The second effect is on the ability to identify (and potentially unwind) the stack trace. With frame pointers, the pseudocode for computing a stack trace is essentially:

      do
        load return address, previous frame pointer from current frame pointer
        print return address
        move previous frame pointer into current frame pointer
      until current frame pointer is invalid
    
    Without frame pointers, the way you have to do this procedure is:

      while current address has corresponding entry in unwind table:
        parse unwind table entry to find a program to run
        run this program on the current frame to generate return address
        print return address
        move return address to current address
    
    It turns out that there is a full Turing-complete program described in the unwind tables to be able to generate a return address. This makes unwinding quite expensive, and it can also create lots of security headaches if you want to do something like unwind in the kernel (since the unwind table is arbitrary user code!). It can also be pretty unreliable at times, especially in cases where your program crashed due to stack smashing, so you have to expect that the data has been randomly overwritten with garbage and is thus horrifically inaccurate.

    • PeterisP 2 years ago

      Does this provide any benefit at all to the majority of machines that aren't intended to be used for development of binaries (i.e. user workstations, developers working with interpreted languages, and servers)? What would be the use cases where frame pointers would help on those machines?

      Like, even for developers, I assume a random web development shop using Ubuntu and hosting stuff on Ubuntu would likely not ever attempt debugging a binary executable, and likely doesn't have any employees who could do it if they wanted. Of course there are companies who can and do debug and profile binaries running on their servers, but IMHO those who are capable and willing to do that are a relatively small minority of Ubuntu users.

      • not_the_fda 2 years ago

        No benefit, but no downside either; it's effectively irrelevant, an implementation detail.

    • nerpderp82 2 years ago

      Since it hasn't been mentioned in this entire thread, frame pointers are required to get good high resolution flame graphs.

      https://www.brendangregg.com/flamegraphs.html

      With systems like Phlare/Pyroscope SRE can monitor application performance in a very granular way in realtime.

      https://grafana.com/blog/2023/03/15/pyroscope-grafana-phlare...

      https://github.com/grafana/pyroscope

    • mFixman 2 years ago

      Could programs compiled in architectures with 16 general purpose registers fail in one with 15?

      • globular-toast 2 years ago

        Well, for a start you probably mean compiled for an arch with 16 registers. It doesn't actually matter what arch the compiler ran on (assuming a modern cross-compiler like GCC).

        If a program uses 16 registers then it needs 16 registers. But note it's the program itself that decides to use a frame pointer; it's not being reserved by the operating system or something. Programs don't even have to use the stack pointer as a stack pointer, they could use all 16 as general purpose, but in practice almost all programs use a call stack (I guess all C programs must, though you might be able to avoid it if you make no function calls?)

        • mFixman 2 years ago

          Ohh, I misunderstood what Ubuntu was doing.

          So the only change is that GCC and its toolchain will compile programs using a register as a frame pointer by default? That seems like a very reasonable change. If having an extra general-purpose register is critical for a program's performance, then this can be disabled in that program's makefile.

          • brancz 2 years ago

            Correct, and there are already known exceptions such as the Python interpreter, where the “interpret function” function actually falls into this case, and so for the foreseeable future Python is going to continue to be compiled omitting frame pointers. But this destructive micro-optimization is now off by default until proven a performance bottleneck, just like any performance issue should be!

      • kelnos 2 years ago

        Not really? I mean, there are architectures with fewer than 16 GP registers (IA32 is one of them); if you can compile some C code on that architecture, then surely you can also compile it on x86_64 with 15 (well, 14, really, due to the stack pointer also being reserved) rather than 16 (15).

        If someone is writing in assembly, then they've already decided whether to allocate a register for the frame pointer, and Ubuntu's change isn't going to affect that, as this is about compiled code, not hand-written assembly.

        The only real issue is performance: if a program has a particular hot-path function (or just many functions overall) that really benefit from having that extra register available, and would otherwise have to spill data into memory, then this change could have a big negative impact. But that's not really a big deal; the packager can decide to omit frame pointers just for that particular app or library.

    • mathiasgredal 2 years ago

      I thought modern speculative CPUs had way more registers than you can normally access. Why reserve these registers for speculative execution instead of exposing them to the program if it needs them?

      • jcranmer 2 years ago

        That's not really how registers or speculative execution works. Intuitively, you can think of assembly as trying to describe a graph of instruction dependencies. Having 16 registers in the ISA allows you to have 16 live outputs at any given "time". Speculative execution allows instructions to execute out-of-order, and to enable this, it has ~140 registers that allow it to have 140 live outputs at once, so that it can run some code while a really long load is waiting for its data.

        From the ISA perspective, however, adding more registers means you have to spend more bits naming a register. With 16 registers, you need 12 bits of your instruction just to name the operands of a typical 3-address instruction (rA = rB op rC). With 128 registers, that is now a whopping 21 bits, which means code density is a more pressing issue.

        • mathiasgredal 2 years ago

          I get that there is a tradeoff with code density, although you could have an encoding scheme or extended register mode to alleviate this. I was just thinking that if you have e.g. a loop where you run out of registers, since you only have 16, then the compiler will spill values to memory and reuse a register, which creates an instruction dependency that doesn't really have to exist.

          If the compiler could use the hidden registers, then the cpu would know that it could run this instruction ahead of time.

          It is probably not worth it, since it adds a lot of complexity to an already complex system, which is why it isn’t done.

          • peterfirefly 2 years ago

            All AMD64 CPUs support at least SSE2 which means they have 16 (not 15 or 14!) XMM registers they can spill to. This is just as fast as a move between two GPRs.

      • peterfirefly 2 years ago

        There is a difference between registers and register names.

        The AMD64 architecture only has 15 general-purpose registers (because the stack pointer is mostly treated as if it were a GPR as well). It is customary to use one of those (bp/ebp/rbp depending on mode) as a base pointer register. That leaves 14 GPR register names.

        The physical CPU the code runs on might have 200 physical registers -- those are the ones that matter for speculative, out-of-order execution -- but the code itself can only refer to 14 (or 15) GPRs at a time and has to include instructions to transfer values to/from memory or to/from XMM registers if that's not enough. Those extra instructions take up space + might slow the code down.

      • andreyv 2 years ago

        The number of registers available to the program is fixed in the instruction set. The program cannot address more registers without recompiling it to an extended instruction set.

      • mdpye 2 years ago

        Because then it would need more registers for the other purpose?

        But actually the decision about which general-purpose registers to use for what is made at compile time (hence we're discussing a compiler flag here; the frame pointer is not a hardware-dictated feature), so the question is actually kind of moot. If the compiler is out of registers to allocate and instead uses the stack, the CPU isn't reasonably going to be able to undo that.

        • mathiasgredal 2 years ago

          Sure, but wouldn’t it make sense to extend the instruction set to allow the compiler to use these registers instead of reserving them for speculative / out-of-order execution? It was just a thought I had after watching a talk by a compiler guy: https://youtu.be/2EWejmkKlxs?feature=shared&t=2409

          • peterfirefly 2 years ago

            jcranmer got it right. Read that reply (and mine). And then maybe rewatch what Chandler Carruth says.

            The current practice allows for CPUs to transparently increase their physical register count (to gain performance) and still run old code -- and older CPUs can still run new code. That's usually quite practical...

            Adding more register names takes more bits for the register numbers -- which leads to larger instructions. It also leads to more complicated encodings if we want backwards compatibility. AMD64 does that by adding an optional prefix byte that carries a payload of 4 more instruction bits. That's one bit each for the three possible register names encoded in a traditional IA32 instruction + a bit to indicate whether to operate on 32-bit or 64-bit data (the actual rules are a bit more complex). Intel published a whitepaper recently suggesting a future encoding with a different (optional) prefix that encodes 8 more bits -- so each of the three register names can be extended to 5 bits (32 register names). It all ends up being quite complicated + new code won't run on older CPUs, which is not great.

            I think you are suggesting not just bigger register names but also doing away with register renaming -- that would be... less than entirely useful because you would lose almost all your out-of-order capability and thereby almost all your ability to hide cache misses. Cache misses are very, very hard to predict statically (before actually running the code on a real CPU with real data) so good luck trying to do magic ahead-of-time allocation of those registers...

  • im3w1l 2 years ago

    I was going to write a long essay about the relation between C and assembly but after thinking for a while, I think there is an easier way to explain it.

    Stack frames are basically a (single) linked list of information about the call stack. Every frame corresponds to one function call, and says where the local variables are stored, and where the function should return after it has finished.

    The head of this list is stored in a register (a scarce resource, superfast memory). So to use frame pointers, you have to spend one register, and also every function has to do some work to maintain the linked list: two instructions' worth of work when the function is entered, and one when it exits.

    The alternative to doing this explicitly is to keep track of it all implicitly which is faster but a bit more complex.

  • brancz 2 years ago

    We wrote pretty extensively about what was needed to be able to profile things without frame pointers [1]. It's still possible at less than 1% overhead with the right set of technologies, but frame pointer unwinding is virtually free.

    [1] https://www.polarsignals.com/blog/posts/2022/11/29/dwarf-bas...

    • mratsim 2 years ago

      Intel VTune, Apple Instruments and perf have been able to profile without frame pointers.

      • brancz 2 years ago

        A lot of profilers have; it’s literally what unwind information is for. But we did it in the kernel, so the entire stack doesn’t need to be copied to user space, which means far less overhead.

  • rwmj 2 years ago

    I wrote this about the change when Fedora did it about a year ago: https://rwmj.wordpress.com/2023/02/14/frame-pointers-vs-dwar...

  • saagarjha 2 years ago

    Better stack traces for profiling tools when you don't have debugging symbols, or can't use them for whatever reason

  • supportengineer 2 years ago

    Exciting new failure modes

dmpk2k 2 years ago

Thank god. AMD making the base pointer optional in the x86-64 ABI was foolish.

AndyKelley 2 years ago

Why not let the upstream application developer decide, rather than choosing for them?

This is one of the downsides of using C/C++ rather than a modern programming language like Rust or Zig. In the former case, the system maintainers reach across the table and change the settings, despite what the actual application developer has chosen. In the latter, the upstream developers' choices are respected more, mainly because the tooling is less standardized.

Shoutouts to this NixOS bug which is still ongoing after causing much pain for many years: https://github.com/NixOS/nixpkgs/issues/18995

  • comex 2 years ago

    Because the upstream developer is not the only one who needs to profile the program.

    For one thing, you might be the upstream developer of a library the program links against, and your stack frames might be hidden under frames from the program that didn’t save the frame pointer. Or if it’s a library that isn’t saving the frame pointer, vice versa.

    But the more interesting use case is if you’re not the upstream developer of anything, but a skilled user who wants to get a view of your entire system and diagnose problems yourself. Personally I do that all the time, both in my spare time and at work. With respect to performance in particular I have more experience with the macOS tooling than the Linux tooling, but it’s analogous – with tools like Instruments and dtrace I can get a profile of any process I want or of the entire system, and I find that incredibly valuable. And that’s made possible in part by the stock macOS toolchain enabling frame pointers by default.

    The NixOS case you linked sounds rather annoying, but turning on optimizations and -Werror and PIC by default is very different from just enabling frame pointers by default.

    • lathiat 2 years ago

      Yeah, I agree with this.

      Not clearly written in the original article is that "many" (I'm not sure what actual percentage, but vaguely "most") packages already have frame pointers enabled.

      The problem packages that don't are exactly all of those upstream projects that intentionally compile with -fomit-frame-pointer because of those small performance gains. And those are also usually the exact same projects you end up wanting to profile or otherwise analyse :)

      I work in the Support organisation at Canonical and the two most frequent projects I run into this with are Ceph and Openvswitch - they both compile with -fomit-frame-pointer by default upstream (and currently in the Ubuntu packages) which makes using perf (which I often need to do with both of those) a pain.

      While you can do it, you have to record an extra ~8kB of stack for every sample (times 1000 per second, times the number of CPUs...) and then unwind it later with the DWARF debug symbols. The resulting perf exports are multiple gigabytes for 0-2 minutes of capture, compared to maybe 25-200MB for frame-pointer-enabled cases, which I can usually capture easily.

      The problem is that the product or upstream project wants to claim the absolute best performance, even 1% better, but the end user rarely needs that last 1%. Both they and their support team would much rather be able to easily profile in production to fix the far more significant 10-100% performance bugs that inevitably crop up when actually using the software rather than benchmarking it :)
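      For reference, the two unwinding modes being compared map onto two `perf record` invocations. A hedged sketch (assumes linux-tools installed; the PID and duration are placeholders):

```shell
# Frame-pointer unwinding: cheap per sample, small perf.data.
perf record --call-graph fp -F 99 -p 1234 -- sleep 10

# DWARF unwinding: copies a stack snapshot (here 8192 bytes) with every
# sample, to be unwound later in userspace -- hence the huge exports.
perf record --call-graph dwarf,8192 -F 99 -p 1234 -- sleep 10
```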

m463 2 years ago

interesting article on frame pointer optimization:

https://community.ibm.com/community/user/wasdevops/blogs/kev...

brenns10 2 years ago

This is a good idea for the short-term. As of now, frame pointers are the most reliable way to ensure that software can be profiled by tools like perf*. The core issue is that the kernel must be the one to unwind the userspace stack, and it only knows how to unwind stacks with frame pointers**. The .eh_frame data will never be supported by the kernel, because it involves a Turing-complete program that must be executed to compute the necessary unwind info***.

For the long term, the more exciting option that's emerging is SFrame[1]. This is a new data section which would be generated by the compiler and contains unwind tables which the kernel will be able to understand. Unlike DWARF/.eh_frame, these tables would remain in the final binary (i.e. not be stripped away), and on exec(), the kernel would store them for use during profiling. Since the format is quite similar to ORC(*), and Steven Rostedt is quite invested in the format, it seems a safe bet that support will land in the kernel.

My hope isn't necessarily that a distribution completely disables frame pointers once this format becomes available... though it could be an interesting thing to try. Rather, there can be a conscious choice about whether frame pointers are used, or SFrame, which would be useful for cases like Python, where it's mentioned that frame pointers may still have a significant performance impact. The kernel should be able to fall back to frame pointers when SFrame is unavailable, which means that either will be acceptable. Ideally, in a few years time we'll be able to go back to forgetting about frame pointers for most cases :)

---

* Ironically, the kernel itself tends not to use frame pointers! It has its own unwind format called ORC, which gets generated by an in-kernel program called "objtool" which essentially reverse engineers the assembly generated by the compiler. It's x86_64-specific and frequently needs adjustment when the compiler changes code generation. It can't be used for userspace programs.

** it also knows how to unwind kernel stacks with ORC (see above)

*** There is an option to allow perf to unwind with DWARF, but it's a total hack (though a very effective one). By passing --call-graph=dwarf, you can instruct the kernel to copy the userspace stack (by default, 8k bytes!) into the perf event buffer with each sample (this can be as many as 100 or 1000 samples per second, per CPU...). Later, the perf userspace program will use that info, along with information about each process's address space, and the debuginfo for each program, to unwind the stacks. This has huge performance overhead, and it requires that you have easy access to debuginfo, which may not be the case, especially for container workloads.

[1] https://lwn.net/Articles/940686/

  • brancz 2 years ago

    We’ve also figured out an alternative format to use from within eBPF to unwind stacks (we happen to only support dwarf at the moment but theoretically any source information could work): https://www.polarsignals.com/blog/posts/2022/11/29/dwarf-bas...

    • brenns10 2 years ago

      Yeah I saw Vaishali & Javier's presentation [1] at LPC last year! Great stuff, & certainly available to use now rather than when SFrame becomes available and supported.

      In the same spirit, it seems that the .eh_frame -> BPF unwind table process could be (relatively) easily modified to produce SFrame, which you could attach to the binaries if you have a trustworthy way of doing that (which is... a big if). So that once SFrame support becomes available in the kernel, you could apply it to applications without rebuilding them.

      [1]: https://lpc.events/event/16/contributions/1361/

      • brancz 2 years ago

        I would need to double check with the team on this detail, but if I recall correctly the architecture as it stands is specifically designed to make the BPF verifier happy, and we didn’t think it was going to be possible with existing formats. But happy to reconsider; we’d of course much rather use a standardized format if possible!

ur-whale 2 years ago

What about binary size increase ?

therealmarv 2 years ago

Pro tip for performance optimisation in Ubuntu (you also gain more RAM):

Remove snap (or choose a Ubuntu distro variant like Pop OS without snap)

  • solarkraft 2 years ago

    Is that even practical nowadays? My understanding is that snap is deeply integrated. Don't some apt packages point to their snap variants?

    • woodruffw 2 years ago

      It’s still possible, although increasingly annoying.

      I believe Firefox now points to its snap variant, which I discovered when it broke a bunch of my browser extensions. Switching to the official Mozilla PPA was easy enough, but left a bad taste; if Canonical continues down the route of silently nudging users onto snap, I’ll probably switch to Debian.

      (I have no particular opinions about snap itself, other than that it seems poorly documented and doesn’t adhere to the “do what I say” philosophy when it’s secretly injected into apt.)

      • uxp8u61q 2 years ago

        > if Canonical continues down the route of silently nudging users onto snap, I’ll probably switch to Debian.

        Serious question: why don't you switch now? I guess I just don't see the point of ubuntu anymore.

        • woodruffw 2 years ago

          Laziness.

          • aitchnyu 2 years ago

            Laziness. And waiting for latest and greatest software for my 5k screen: fractional scaling, brightness setting, reconnecting etc are flaky.

westurner 2 years ago

Call stack > Structure > Stack and Frame pointers: https://en.wikipedia.org/wiki/Call_stack#Stack_and_frame_poi...

What do the Coding Guidelines listed in e.g. awesome-safety-critical say about Frame pointers? https://awesome-safety-critical.readthedocs.io/en/latest/#co...

(Edit)

/? "cert" "frame pointer" https://www.google.com/search?q=%22cert%22+%22frame+pointer%... :

- Stack buffer overflow > Exploiting stack buffer overflows: https://en.m.wikipedia.org/wiki/Stack_buffer_overflow :

> In figure C above, when an argument larger than 11 bytes is supplied on the command line foo() overwrites local stack data, the saved frame pointer, and most importantly, the return address

What about the Top 25?

/? site:cwe.mitre.org "frame pointer" https://www.google.com/search?q=site%3Acwe.mitre.org+%22fram... :

- CWE-121: Stack-based Buffer Overflow https://cwe.mitre.org/data/definitions/121.html

This is closer to a better approach for security, debuggability, and performance IMHO:

https://news.ycombinator.com/item?id=38138010 :

> gdb on Fedora auto-installs signed debuginfo packages with debug symbols; Fedora hosts a debuginfod server for their packages (which are built by Koji) and sets `DEBUGINFOD_URLS=`

> Without debug symbols, a debugger has to read unlabeled ASM instructions (or VM opcodes (or an LL IR)).

  • westurner 2 years ago

    When frame pointers are omitted, there are fewer places in memory that can be overwritten to hijack control flow of a program.

    Perhaps someone could prepare a demo of a frame-pointer buffer overflow exploit to illustrate?
