C++ Headers are Expensive

virtuallyrandom.com

107 points by kbwt 7 years ago · 114 comments

AndyKelley 7 years ago

In the Zig stage1 compiler (written in C++), I tried to limit all the C++ headers to as few files as possible. Not counting vendored dependencies, the compiler builds in 24 seconds using a single core on my laptop. It's because of tricks like this:

    /*
     * The point of this file is to contain all the LLVM C++ API interaction so that:
     * 1. The compile time of other files is kept under control.
     * 2. Provide a C interface to the LLVM functions we need for self-hosting purposes.
     * 3. Prevent C++ from infecting the rest of the project.
     */


    // copied from include/llvm/ADT/Triple.h

    enum ZigLLVM_ArchType {
        ZigLLVM_UnknownArch,
    
        ZigLLVM_arm,            // ARM (little endian): arm, armv.*, xscale
        ZigLLVM_armeb,          // ARM (big endian): armeb
        ZigLLVM_aarch64,        // AArch64 (little endian): aarch64
    ...
and then in the .cpp file:

    static_assert((Triple::ArchType)ZigLLVM_UnknownArch == Triple::UnknownArch, "");
    static_assert((Triple::ArchType)ZigLLVM_arm == Triple::arm, "");
    static_assert((Triple::ArchType)ZigLLVM_armeb == Triple::armeb, "");
    static_assert((Triple::ArchType)ZigLLVM_aarch64 == Triple::aarch64, "");
    static_assert((Triple::ArchType)ZigLLVM_aarch64_be == Triple::aarch64_be, "");
    static_assert((Triple::ArchType)ZigLLVM_arc == Triple::arc, "");
    ...
I found it more convenient to redefine the enum and then static assert that all the values are the same (which has to be updated with every LLVM upgrade) than to use the actual enum, which would pull in a bunch of other C++ headers.

The file that has to use C++ headers takes about 3x as long to compile as Zig's ir.cpp file, which is nearing 30,000 lines of code but depends only on C-style header files.

beached_whale 7 years ago

You can find out where your time is going, at least with clang, by adding -ftime-report to your compiler command line. Often the reason headers take a long time is that the compiler can do a better job at optimizing and inlining when everything is visible. Just timing your compiles is like trying to find things in the dark: you know the wall is there, but what are you stepping on? :) It's good to know what is taking a long time, but it may not be the header itself so much as how much more work the compiler can now do to (potentially) give better output.

  • fouronnes3 7 years ago

    I've been working with -ftime-report, but unfortunately it reports times per cpp file. I'm looking for a way to get a summary across an entire CMake build. Right now reading 100+ -ftime-report outputs is not really useful, although deep down I know it's all template instantiation anyway.

    • beached_whale 7 years ago

      When I look, most of the time has gone to inlining and optimization. But I only look sometimes, and the sample size is me.

nanolith 7 years ago

I recommend three things for wrangling compile times in C++: precompiled headers, using forward headers when possible (e.g. ios_fwd and friends), and implementing an aggressive compiler firewall strategy when not.

The compiler firewall strategy works fairly well in C++11 and even better in C++14. Create a public interface with minimal dependencies, and encapsulate the details for this interface in a pImpl (pointer to implementation). The latter can be defined in implementation source files, and it can use unique_ptr for simple resource management. C++14 added the missing make_unique, which eases the pImpl pattern.
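
As a rough sketch of the shape this takes (Widget and Impl are placeholder names, and move support is omitted for brevity):

    // widget.h -- public interface; no heavy includes needed here
    #include <memory>

    class Widget {
    public:
        Widget();
        ~Widget();                  // declared here, defined where Impl is complete
        Widget(const Widget&) = delete;
        Widget& operator=(const Widget&) = delete;
        void frobnicate();

    private:
        class Impl;                 // forward declaration only
        std::unique_ptr<Impl> impl_;
    };

    // widget.cpp -- the only translation unit that pays for the heavy includes
    #include "widget.h"
    #include <string>
    #include <vector>

    class Widget::Impl {
    public:
        void frobnicate() { log_.emplace_back("frobnicated"); }
    private:
        std::vector<std::string> log_;
    };

    Widget::Widget() : impl_(std::make_unique<Impl>()) {}
    Widget::~Widget() = default;    // Impl is complete in this file
    void Widget::frobnicate() { impl_->frobnicate(); }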

That being said, compile times in C++ are typically going to be terrible if you are used to compiling in C, Go, or other languages known for fast compilation. A build system with accurate dependency tracking and on-demand compilation (e.g. a directory watcher or, if you prefer IDEs, continuous compilation in the background) will eliminate a lot of this pain.

  • shereadsthenews 7 years ago

    The pImpl pattern is great for those who don't care about performance, but it's inappropriate for most header libraries. You wouldn't want a library that hides the implementation of std::vector, for example. With a visible implementation the compiler can compile e.g. operator[] down to one x86 instruction. With a pImpl pattern it will in all likelihood be an indirect function call that is hundreds of times slower. It can make sense for libraries where every function is really expensive anyway, but it's ruinous for the STL and the like.

    • _0w8t 7 years ago

      For cases where performance matters, one can replace the members with a stub of the same size and alignment and cast the stub to the real definition in the implementation.

      • adrianN 7 years ago

        Ugh, that violates strict aliasing, does it not?

        • _0w8t 7 years ago

          I suppose it can, but in practice it works with any sane compiler that deals reasonably with reinterpret_cast and aliasing, as long as the size and alignment requirements of the stub and the real thing are the same. The latter can be enforced with static asserts.

    • kccqzy 7 years ago

      Using the pimpl pattern doesn't mean an indirect function call. The function to be called is always known. It's just an extra indirection in the data member. It's cheap. Think of it as Java-style memory layout: everything non-primitive stored in an object is a reference and therefore behind one level of indirection. The performance of Java is acceptable in the vast majority of use cases. Using pimpl will be the same.

      • Someone 7 years ago

        ”It's just an extra indirection in the data member. It's cheap”

        That extra indirection often means a cache miss. That isn’t cheap. Accessing each item traversed through a pointer can easily halve program speed.

        Java tries hard to avoid the indirections (local objects may live on the stack, their memory layout need not follow what the source code says, objects may even exist only in CPU registers)

        • repsilat 7 years ago

          Hmm... if you were a horrible person you could declare a `char[n]` member instead of a pointer. Then you could placement-new the impl in the constructor, and static-assert that `sizeof(impl) <= n`... No more cache misses :-).

          :-(
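
          Spelled out, the horrible-person version looks something like this (names and sizes invented; the strict-aliasing caveat from the sibling thread still applies):

              // gizmo.h -- no heavy includes; the impl lives inside the object
              #include <cstddef>
              #include <new>

              class Gizmo {
              public:
                  Gizmo();
                  ~Gizmo();
                  Gizmo(const Gizmo&) = delete;            // copying the raw bytes
                  Gizmo& operator=(const Gizmo&) = delete; // would be wrong
                  void frobnicate();
              private:
                  class Impl;                              // defined in gizmo.cpp
                  static constexpr std::size_t kSize  = 64; // guessed upper bound
                  static constexpr std::size_t kAlign = 8;
                  alignas(kAlign) unsigned char storage_[kSize];
                  Impl* impl() { return reinterpret_cast<Impl*>(storage_); }
              };

              // gizmo.cpp
              #include "gizmo.h"
              #include <vector>

              class Gizmo::Impl {
              public:
                  void frobnicate() { data_.push_back(42); }
              private:
                  std::vector<int> data_;
              };

              Gizmo::Gizmo() {
                  static_assert(sizeof(Impl) <= kSize, "bump kSize");
                  static_assert(alignof(Impl) <= kAlign, "bump kAlign");
                  new (storage_) Impl();                   // placement new into the buffer
              }
              Gizmo::~Gizmo() { impl()->~Impl(); }         // explicit destructor call
              void Gizmo::frobnicate() { impl()->frobnicate(); }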

      • vbezhenar 7 years ago

        Java JIT-compiler inlines short method calls whenever possible. Though C++ compiler should be able to do the same.

        • pjmlp 7 years ago

          And they do, when given PGO data, or when doing LTO.

          • maccard 7 years ago

            Given this is a topic of slow c++ builds, mentioning LTCG should come with the caveat that it will absolutely destroy your compile times.

            It's also not infallible, and you might find it difficult to track down a regression if it is introduced by someone silently breaking a heuristic in the optimiser.

      • branchless 7 years ago

        The only advantage of C++ is max perf. If we could afford to skip a beat, we couldn't justify using C++.

        • pjmlp 7 years ago

          It still wins in the "portability + expressiveness + safer than C" areas.

          There are still more platforms with a C++ compiler available than Ada, Java or C# ones, let alone Go, D, Rust, Swift.

          So if the goal is to make the code available to all platforms, without having to deal with C's lack of safety, then C++ it is.

    • nanolith 7 years ago

      This depends on what you are building. Don't commit the sin of early optimization.

      Does a client of the framework you are writing -- which is probably using STL internally -- need a single instruction operation for adding a value for a call that you make less than 0.001% of the time?

      Optimization is about end results. Apply the Pareto Principle, and don't forget that your users also need to compile your code in a reasonable amount of time.

      • shereadsthenews 7 years ago

        That only makes sense if you are planning to offer two implementations of your library. Which I of course urge you to not do. This article is about the STL headers. The reason std::sort beats the pants off all other languages’ sort routines is because the iterators of every collection, all specializations of swap, and the comparator can all be visible to the compiler. If they weren’t, it would be a lot slower.

        Premature optimization is not really a thing but foreclosing future avenues of optimization definitely can be.

        • nanolith 7 years ago

          > Premature optimization is not really a thing

          Okay. I'm going to stop this thread right there and take some opportunity to provide some mentoring. I hope you accept this, as it will help in your career.

          Read this paper. It is a classic.

          https://pic.plover.com/knuth-GOTO.pdf

          • kllrnohj 7 years ago

            You should read the paper a bit more closely.

            "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3 %. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified"

            We know that cache misses are not a small inefficiency. This has been measured and observed on many real systems as a real, systemic problem. It's why data-oriented design is currently the king of the game engine world: cache misses kill performance. It is not premature to warn against them as a general practice, because that's a systemic de-optimization that will likely impact the critical 3%.

            • nanolith 7 years ago

              I think you may want to read the quote from the paper a bit more carefully. "...he will be wise to look carefully at the critical code; but only after that code has been identified"

              I was told that "premature optimization is not really a thing" as a response to a reply I received that pImpls should be avoided at all costs.

              When we analyze the performance impact of software, we don't shotgun change things because of a generalized fear of cache misses. We examine critical code paths and make changes there based on profile feedback. That is the spirit of what Knuth is saying in this quote. Look carefully at critical code, BUT ONLY AFTER that code has been identified.

              A cache miss is critical when it is in a critical path. So, we write interfaces with this in mind. Compilation time matters, as does runtime performance. Either way, we identify performance bottlenecks as they come up and we optimize them. Avoiding clearer coding style, such as encapsulation, because it MIGHT create faster code, is counter-productive.

              We can apply the Pareto Principle to understand that 80% of the performance overhead can be found in 20% of the code. The remaining 80% of the code can use pImpls or could be rewritten in Haskell for all it matters for performance. But, that code still needs to be compiled, and often by clients who only have headers and library files. Subjecting them to long compiles to gain an unimportant improvement in performance in code that only rarely gets called is a bad trade. Spend that time optimizing the parts of the code that matter, which, as Knuth says, should only be done after this code has been identified.

              EDIT: the downvoting on this comment is amusing, given that "avoid pImpls" is exactly the sort of 97% cruft that Knuth was addressing.

              • kllrnohj 7 years ago

                > EDIT: the downvoting on this comment is amusing, given that "avoid pImpls" is exactly the sort of 97% cruft that Knuth was addressing

                Again, no, it isn't. You seem to be severely underestimating the systemic impact of cache misses if you are considering them a "small" impact to efficiency.

                It's a well-proven, well-known problem. Ignoring it falls under Knuth's guidance of "A good programmer will not be lulled into complacency by such reasoning."

                pImpls are the sort of thing you use at API boundaries to avoid leaking implementation details into users, but that's trading efficiency & complexity for a more stable API boundary. Scattering them throughout your code base would be like compiling with -O0. It's a silly, nonsensical waste of a user's constrained resources for a slight gain at compile time at a cost of code complexity.

                Or, alternatively, using pImpls to optimize for compile time is a premature optimization. You should only optimize for compile time at most 3% of your source files, ever. The other 97% of your source files should be written for clarity & simplicity, which means no pImpls.

                • nanolith 7 years ago

                  > Again, no, it isn't. You seem to be severely underestimating the systemic impact of cache misses if you are considering them a "small" impact to efficiency.

                  You are severely overestimating the impact of cache misses if you think that all indirection must be eliminated and any use of pImpls at all is always wrong, as you seem to be implying.

                  > pImpls are the sort of thing you use at API boundaries to avoid leaking implementation details into users, but that's trading efficiency & complexity for a more stable API boundary. Scattering them throughout your code base would be like compiling with -O0.

                  It's a good thing that I never advocated using them everywhere then. Where did you read me saying this?

                  > Or, alternatively, using pImpls to optimize for compile time is a premature optimization.

                  Only if this is done by default, which I have not advocated for anywhere in this thread. I called it a tool in the toolbox. I mentioned it as one of several possibilities. Somehow you have translated this into "use pImpls everywhere", which is a strawman.

          • erubin 7 years ago

            You may not be aware, but your comment came off as somewhat condescending, given that you don't really have any idea where the parent poster is coming from or what their background is.

            • nanolith 7 years ago

              If someone says that premature optimization isn't a thing, I don't think it is condescending to point out that it is by posting original source material. :-)

              • cgrealy 7 years ago

                > it only wastes the programmer’s time. ... Wasting CPU time on the other hand bothers me a lot!

                As always the answer is... it depends. Programmer time costs money, CPU time is cheap by comparison.

                If you're building something that runs occasionally, or is IO/UI/network bound... CPU time is largely irrelevant. But if you're building something that runs in a tight loop or a library that will be compiled in millions of lines of code, then the wasted programmer time will absolutely be worth the ROI.

                • shereadsthenews 7 years ago

                  Yes, but, again, this article is about standard headers. Since it is impossible for STL library authors to decide that they are or are not writing for a performance-sensitive audience, it behooves them to provide completely visible header-only implementations of everything.

                • kllrnohj 7 years ago

                  The pImpl pattern costs programmer time as it's more code to write. By contrast compiling is just CPU time and you can trivially throw a bigger workstation at the problem.

                  • nanolith 7 years ago

                    This depends on who is writing the code and who is compiling the headers. A software developer who is building headers for someone else (an internal or external client) may trade the overhead of this pattern for a faster compilation time. Reducing compilation time by 80% may be well worth the overhead of adding 10% more code to an interface.

                    It is not always possible to just throw more hardware at the problem of compilation. For instance, one may be using a build pipeline that requires specific steps to be followed as part of gating tasks. The time it takes to compile code over and over again for unit testing, behavioral testing, acceptance testing, integration testing, etc., each impacts delivery time and handoff.

                    Earlier in my career, I worked with a code base that was approximately 10 million lines of code. Compiling it took approximately 7 hours on the best hardware we could buy. The C++ developers were adamant about ensuring that their headers were "complete", as they called it. With a few changes, such as forward declarations, abstract interfaces, and encapsulation, my team was able to reduce that compile time to less than 35 minutes. Based on profile feedback, we saw less than a quarter of a percent difference in overhead. Productivity-wise, we sped up developer workflows so significantly that it was a better use of our time, in terms of overall cost, than working on a billable project.

                    Most projects aren't nearly that bad, but it does go to show that it is possible to significantly reduce compilation time without significantly impacting runtime performance, even in C++.

                    • kllrnohj 7 years ago

                      > It is not always possible to just throw more hardware at the problem of compilation. For instance, one may be using a build pipeline that requires specific steps to be followed as part of gating tasks. The time it takes to compile code over and over again for unit testing, behavioral testing, acceptance testing, integration testing, etc., each impacts delivery time and handoff.

                      All of that is solved by throwing more hardware at it.

                      Alternatively if compile time is not the slow part of that pipeline, then you're prematurely optimizing the wrong thing anyway.

                      > Earlier in my career, I worked with a code base that was approximately 10 million lines of code in size. Compiling this code base would take approximately 7 hours on the best hardware we could buy. The C++ developers were adamant about ensuring that their headers were "complete" as they called it. With a few changes, such as forward declarations, abstract interfaces, and encapsulation, my team was able to reduce that compile time to less than 35 minutes.

                      In other words you only optimized the critical 3% of the codebase rather than prematurely optimizing everything with pImpl abstractions?

                      • nanolith 7 years ago

                        > Alternatively if compile time is not the slow part of that pipeline, then you're prematurely optimizing the wrong thing anyway.

                        Developer productivity does not matter in your world?

                        > In other words you only optimized the critical 3% of the codebase rather than prematurely optimizing everything with pImpl abstractions?

                        Yes, because nowhere in this thread have I advocated prematurely optimizing everything with pImpl abstractions. Those are words you have put in my mouth. pImpl abstraction is a single tool that can be used to improve compile-time performance. Not all the time, but in a fraction of the 3% of the time where it is appropriate.

              • shereadsthenews 7 years ago

                Well, to the extent that it is a thing it only wastes the programmer’s time. I happen to think programmers are spending too little time, so that doesn’t bother me. Wasting CPU time on the other hand bothers me a lot!

          • CyberDildonics 7 years ago

            Software still needs to be architected for performance from the start. Trying to micro-optimize a loop before you know you need it is what Knuth was saying to avoid.

            • nanolith 7 years ago

              Right. Much like avoiding pImpl because it might make a function call that occurs 0.001% of the time faster. That is the basis of the thread I was replying to.

              Understanding what you are optimizing FOR and where the most attention should be spent is the crux of Knuth's argument. Trying to be clever up-front is often counter-productive.

              There is nothing wrong with making some architectural decisions up front, but that is much different than avoiding pImpls at all costs because indirection is slower. Indirection doesn't always matter, and it should only be tackled when and where it does.

              • srean 7 years ago

                Well pImpl itself could be a premature optimization.

                • rleigh 7 years ago

                  It's not an optimisation. It's to encapsulate logic which does not need to be present in public headers.

                  Given that it makes every object instantiation perform a memory allocation, followed by the indirection required to access it, and will also prevent the creation of the default copy constructor and assignment operator etc. due to the use of unique_ptr, it adds complexity as well as two sources of performance loss.

                  As a result, I would use this pattern only where strictly necessary. For example, I've used it when wrapping C libraries with C++ classes. It means the C headers aren't included in the public C++ headers, only the implementation, making the implementation cleaner. Or I might use it in situations where private members drag in dozens of headers which expose a lot of unnecessary implementation details to the user, which might drag in transitive library dependencies simply by #including the headers. The fact that the compilation speed might be marginally faster is incidental to this.

                • AdieuToLogic 7 years ago

                  The "pimpl idiom"[0] is about insulation, not optimization. What it affords is ensuring collaborators have no knowledge of a type's implementation details (data as well as private methods), which also has the byproduct of allowing for faster compilation times.

                  HTH

                  0 - https://cpppatterns.com/patterns/pimpl.html

                • nanolith 7 years ago

                  Sure, and I would never recommend that indirection be used all the time as that would be premature.

                  However, if compilation times have gotten painful enough that we need to examine performance improvements to our headers, the pImpl pattern is one of many tools in the toolbox. So are forward headers and other compiler firewall techniques.

          • nikanj 7 years ago

            Knuth wasn't saying "just ignore performance altogether", he was saying "stop making things needlessly complicated for the last bit of juice".

            • nanolith 7 years ago

              For instance, pulling in the STL for an interface header instead of encapsulating these details. :-)

              No one is claiming that we should ignore performance all together. But, understanding through profiling where performance issues are and designing toward a faster implementation is more important than trying to inline definitions up front.

              • Silhouette 7 years ago

                Do you also write all your functions to pass large inputs by value until a profiler says you can pass a const reference?

                Some implementation details are so well understood that you really don't need a profiler to do what is probably the right thing by default.

                • nanolith 7 years ago

                  > Do you also write all your functions to pass large inputs by value until a profiler says you can pass a const reference?

                  No, but neither do I pass types that fit in a native integer by const reference because copies should be avoided at all costs. There is always a tradeoff.

                  > Some implementation details are so well understood that you really don't need a profiler to do what is probably the right thing by default.

                  Avoiding any and all indirection at all costs is not one of these.

                  The pImpl pattern, much like virtual methods, function pointers, etc., is a tool. Indirection is a trade-off that is either worth the expense in cache misses or is not. The cost is not as cut and dried as others in the thread have assumed.

                  • Silhouette 7 years ago

                    Avoiding any and all indirection at all costs is not one of these.

                    "Any and all"? Perhaps not. However, I believe the context here was standard library headers. Those are full of small, often-used functions, so avoiding idioms based on indirection such as pImpl is about as close to a black and white rule as you're ever going to find in the programming world.

                    • nanolith 7 years ago

                      Nowhere did I say that only pImpls should be used. That was just one of several strategies I discussed.

                      The article may have discussed standard headers, but it was neither titled to indicate that it was talking about only standard headers, nor are the problems it discussed localized only to standard headers. My original comment did not limit the discussion to only standard headers.

      • nikanj 7 years ago

        Overall user experience, mobile battery life, and many other metrics are really hard to fix by micro-optimizing a few functions. The key to a system that doesn't feel sluggish is being conscious of performance issues when making design decisions.

        This pendulum swings back and forth, and we went from "every bit counts" madness of the early days, to the polar opposite of "just burn cycles, whatever".

        Systems where every interaction feels sluggish are a pain to use, and often nearly impossible to refactor for better performance.

  • beached_whale 7 years ago

    One hiccup is that with unique_ptr you now have a rule-of-5 situation: you need to declare a destructor, which means you need to declare the copy/move constructors and assignment operators too. Not usually a big deal, but it is extra code.
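
    For reference, a minimal sketch of what that ends up looking like (Foo and Impl are placeholder names):

        // foo.h
        #include <memory>

        class Foo {
        public:
            Foo();
            ~Foo();                          // Impl is incomplete here, so no = default
            Foo(Foo&&) noexcept;             // declaring the destructor suppresses the
            Foo& operator=(Foo&&) noexcept;  // implicit moves, so declare these too
        private:
            struct Impl;
            std::unique_ptr<Impl> impl_;
        };

        // foo.cpp
        #include "foo.h"
        struct Foo::Impl { /* heavy members */ };
        Foo::Foo() : impl_(std::make_unique<Impl>()) {}
        Foo::~Foo() = default;               // fine here: Impl is complete
        Foo::Foo(Foo&&) noexcept = default;
        Foo& Foo::operator=(Foo&&) noexcept = default;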

    • ChrisSD 7 years ago

      This is why the rule of zero advocates are getting louder.

      • beached_whale 7 years ago

        Rule-of-zero classes are awesome. It forces a separation of concerns too, generally a good thing :), as the handling of special things is done by a class dedicated to that (e.g. unique_ptr, vector...) and your class describes only what is in it and how to interact with it, but no more detailed than that.

        • blt 7 years ago

          Totally agree. I recently used unique_ptr with a custom deleter to consume a C library that requires heap allocation with its own alloc/free functions. No destructor!

          • beached_whale 7 years ago

            I have a class, boringly called value_ptr, that is like unique_ptr but, if the underlying value supports copy, provides a copy constructor and assignment too. Then I don't have to write one in classes like this where I am using a pointer for other reasons. It also has const correctness.

    • nanolith 7 years ago

      This is true. Fortunately, these do not need to be inlined, which can still free client code of compile time overhead.

      It's a tradeoff between compile time and complexity.

      • beached_whale 7 years ago

        Another option is something like pImpl but keeping the state in the public class. Now you get stack allocation, but the private part is still private, firewalling all the other headers and details not needed for the public interface. Just pass this to the private methods.

        Edit: Something like https://github.com/beached/stack_pimpl

        • dkersten 7 years ago

          I don't get the example, can you explain? What does this buy that you don't get by simply putting the content of private.cpp into stack_pimpl.cpp? The data members are already in stack_pimpl.h (so nothing private or hidden about that), and the methods are already declared in both headers and implemented in both cpp files, so what are you buying over just putting the declarations in one header and the implementation in one source file, without delegating from one to the other?

          Was it just an oversimplified example, and the benefit is actually when priv_t has a bunch of internal methods that you want to keep out of the stack_pimpl.h interface?

          • beached_whale 7 years ago

            So PIMPL is a compile firewall to keep compile times down and to keep changes in one section from cascading and impacting your whole project with a recompile. It is not going to keep things secret, as I can look at the binary and pick it apart.

            With that said, this keeps the data (state) in the public-facing class. This allows one to keep everything stack allocated instead of defaulting to the heap. So for something that is created en masse (a vector of them) or created and destroyed often, this is a runtime win.

            What it does is have a proxy that mirrors the public interface and is passed the this pointer. That proxy is a friend class. Only the proxy's header (in this case private.h, which I should probably rename firewalled.h) has static members that mirror the public members of the public class, which limits the interaction between your class's users and its implementation, just as with unique_ptr (or whatever pointer/heap-based) PIMPL designs. So changes in private.cpp, which does all the work, are reflected only in that one file. This file also brings in the heavy templates or algorithm code that may have long compile times.
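
            As a rough reconstruction from this description (class and member names invented, not the actual repo code):

                // stack_pimpl.h -- public class; the state lives here
                class thing {
                public:
                    thing();
                    int compute() const;
                private:
                    friend struct priv_t;        // the proxy, declared in private.h
                    unsigned long long state_ = 0;
                };

                // private.h -- included only by the two .cpp files below
                class thing;
                struct priv_t {
                    static int compute(thing const& self);  // mirrors the public API
                };

                // stack_pimpl.cpp -- thin forwarding layer, rarely changes
                #include "stack_pimpl.h"
                #include "private.h"
                thing::thing() = default;
                int thing::compute() const { return priv_t::compute(*this); }

                // private.cpp -- the heavy includes and the real work live here
                #include "stack_pimpl.h"
                #include "private.h"
                #include <numeric>               // stand-in for expensive headers
                int priv_t::compute(thing const& self) {
                    return static_cast<int>(self.state_ % 97);
                }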

            • dkersten 7 years ago

              So, you mean that changing private.h only requires private.cpp and stack_pimpl.cpp to be recompiled, whereas changing stack_pimpl.h would require anything that uses it to be recompiled too? Ok, that makes sense.

              However...

              > With that said, this keeps the data (state) in the public-facing class. This allows one to keep everything stack allocated instead of defaulting to the heap.

              Ok, having it on the stack is useful, but in my personal experience the state is exactly the thing that changes the most (typically together with the code), so by keeping the state in the public header, changes will still require recompilation of anything that includes the header. So I'm not sure this really wins much (at least, based on my own C++ projects).

            • beached_whale 7 years ago

              Never mind, I was thinking wrong and neglected the ABI stability

              • AdieuToLogic 7 years ago

                Please ignore the question I just posted regarding this topic. I only saw this comment after posting my own.

  • pjmlp 7 years ago

    Additionally, actually make use of binary libraries across modules, and extern templates for common type parameters.
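
    For the extern templates part, the usual shape is roughly this (Matrix3 is a made-up project template):

        // common_instantiations.h -- included wherever these types are used
        #include <vector>
        #include "matrix3.h"                     // hypothetical project header

        // promise the compiler these are instantiated elsewhere,
        // so every including TU skips the work
        extern template class Matrix3<double>;
        extern template class std::vector<Matrix3<double>>;

        // common_instantiations.cpp -- the one TU that pays the cost
        #include "common_instantiations.h"
        template class Matrix3<double>;
        template class std::vector<Matrix3<double>>;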

  • Gupie 7 years ago

    Any advice for reducing link times?

AdieuToLogic 7 years ago

If C++ compile time is a concern and/or impediment to productivity, I recommend the seminal work regarding this topic by Lakos:

Large-Scale C++ Software Design[0]

The techniques set forth therein are founded in real-world experience and can significantly address large-scale system build times. Granted, the book is dated and likely not entirely applicable to modern C++, yet remains the best resource regarding insulating modules/subsystems and optimizing compilation times IMHO.

0 - https://www.pearson.com/us/higher-education/program/Lakos-La...

  • de_watcher 7 years ago

    If it's the book I'm thinking of, it already seemed very dated to me 10 years ago. Too many limitations, and some weird rules about boundaries between elements of the architecture.

kazinator 7 years ago

Speaking of GNU C++ (and C), headers are getting cheaper all the time relative to the brutally slow compilation itself.

Recently, after ten years of not using ccache, I was playing with it again.

The speed-up you obtain from ccache today is quite a bit more than a decade ago; I was amazed.

ccache does not cache the result of preprocessing. Each time you build an object, ccache passes it through the preprocessor to obtain the token-level translation unit which is then hashed to see if there is a hit (ready made .o file can be retrieved) or miss (preprocessed translation unit can be compiled).

There is now more than a 10 fold difference between preprocessing, hashing and retrieving a .o file from the cache, versus doing the compile job. I just did a timing on one program: 750 milliseconds to rebuild with ccache (so everything is preprocessed and ready-made .o files are pulled out and linked). Without ccache 18.2 seconds. 24X difference! So approximately speaking, preprocessing is less than 1/24th of the cost.

Ancient wisdom about C used to be that more than 50% of the compilation time is spent on preprocessing. That's the environment from which came the motivations for devices like precompiled headers, #pragma once and having compilers recognize the #ifndef HEADER_H trick to avoid reading files.
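
(The guard trick, for reference, is just the standard pattern, which compilers learned to recognize so that they can skip re-reading the file on later includes:)

    /* my_header.h */
    #ifndef MY_HEADER_H
    #define MY_HEADER_H

    /* ... declarations ... */

    #endif /* MY_HEADER_H */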

Nowadays, those things hardly matter.

Nowadays when you're building code, the rate at which .o files "pop out" of the build subjectively appears no faster than two decades ago, even though memory, L1 and L2 cache sizes, CPU clock speeds, and disk space are vastly greater. Since not a lot of development has gone into preprocessing, it has more or less sped up with the hardware, but overall compilation hasn't.

Some of that compilation laggardness is probably due to the fact that some of the algorithms have tough asymptotic complexity. Just extending the scope of some of the algorithms to do a slightly better job causes the time to rise dramatically. However, even compiling with -O0 (optimization off), though faster, is still shockingly slow given the hardware. If I build that 18.2 second program with -O0, it still takes 6 seconds: an 8X difference compared to preprocessing and linking cached .o files in 750 ms. A far cry from the ancient wisdom that character- and token-level processing of the source dominates the compile time.

  • int_19h 7 years ago

    > Ancient wisdom about C used to be that more than 50% of the compilation time is spent on preprocessing.

    Ancient wisdom was that more than 50% of the time is spent compiling the headers, after they become a part of your translation unit after preprocessing. I don't see why preprocessing itself would ever be singled out, given that it's comparatively much simpler than actual compilation.

    • josefx 7 years ago

      Opening and reading all the included files could be costly. Also it is "ancient" wisdom so it might predate compilers that could detect the include guard pattern and had to repeatedly preprocess the same files. There is an ancient "Notes on programming" article by Rob Pike that comes up every now and then with a paragraph against include guards for that outdated reason.

  • deng 7 years ago

    > ccache does not cache the result of preprocessing

    Only if you explicitly disable 'direct mode'.

RcouF1uZ4gsC 7 years ago

> The test was done with the source code and includes on a regular hard drive, not an SSD.

In my opinion, this makes any conclusion dubious. If you really care about compile times in C++, step 0 is to make sure you have an adequate machine (at least quadcore CPU/ lot of RAM/SSD). If the choice is between spending programmer time trying to optimize compile times, versus spending a couple hundred dollars for an SSD, 99% of the time, spending money on an SSD will be the correct solution.

  • loeg 7 years ago

    > The test was performed by compiling the source code below 128 times, calculating the average time.

    Presumably, 127/128 runs have both the test file and the single header file in memory cache, so the distinction is moot.

    Also, I find the conclusion that we should all just buy top end machines and ignore performance problems that don't manifest there fairly unconvincing. I think that kind of thinking is responsible for a good chunk of the reason the web is so bloated today. :-)

    • cgrealy 7 years ago

      There's a difference between development (read: build) performance and runtime performance.

      For any kind of even vaguely profitable software, your developers should all have kick arse machines.

      But they should test on a $200 laptop :)

    • 21 7 years ago

      Doom/Quake were developed on a NeXT machine much faster and more capable than the targeted IBM PCs.

      You don't need to develop on a $200 notebook to care about performance.

  • StillBored 7 years ago

    It's the RAM that is key: you want enough to keep all the source files and intermediate files sitting in cache, so the only disk activity is updating timestamps and flushing the .o files to disk.

    I've seen this problem a few times: someone looks at their N-core machine with M GB and says, "oh look, I'm only using 3/4 of M, so when I buy the 4xN-core machine I'm going to put M RAM in it again." Then everything runs poorly because the disks are getting hammered now that there are another 32 jobs (or whatever) each consuming a GB. Keep adding RAM until there is still free RAM during the build. It's going to run faster from RAM than waiting for a super speedy disk to read the next .c/.o/etc. file.

  • cjensen 7 years ago

    Just to address part of your concern: Traditionally disk speed makes very little difference to compile times for real world C/C++ projects. This is because real world projects have many files, and each one can be compiled in parallel. Once you spawn sufficient compilers in parallel, the CPU becomes the bottleneck, not the disk. (I.e. when a compilation asks for I/O, it then yields the CPU to other compilers which have CPU work to do)

    Note that Visual Studio, for example, does a poor job of this because it only spawns one compilation per CPU thread. This results in individual threads being idle more than they ought to be.

    • berkut 7 years ago

      I guess it depends on how you define "very little", and what system includes you have.

      I've just tested one of my ~300 KLOC C++ projects, broken into 479 .cpp files and 583 .h files.

      Using Linux (GCC) after dropping the disk cache, on a 5400 RPM HD, the full build on 14 threads took: 78 seconds.

      On a fast SSD (same machine, after dropping caches again) it took 61 seconds.

      Linking was ~7 seconds faster on the SSD, so arguably the compilation itself wasn't sped up by the same ratio, but overall build time is most definitely faster.

      Source was on the same drive as the build target directory.

      At a previous company I worked at, we got SSDs to speed up compilation (and it did).

    • rleigh 7 years ago

      This very much depends upon the project. Have you seen the size of C++ object files with -g3 and lots of template usage? It can swallow tens of gigabytes of disc space only to have the linker elide most of it and give you a library or executable a few megabytes in size. Compared with the size of the inputs, the output is causing a vastly disproportionate amount of disc I/O, and this can end up being limiting, both during compilation and in particular during linking.

    • ahaferburg 7 years ago

      Absolutely not true. The problem is not the compilation of one single file, but that every one of these single files pulls in large amounts of headers, distributed over various libraries (e. g. Qt/Boost/STL), all of which won't fit into the disk cache.

      If it doesn't make a difference, all that means is that your project is small, or doesn't have too many dependencies. Good for you. But that's not the reality for all projects.

      • cjensen 7 years ago

        My projects take 10 minutes to build on a modern system, which is plenty complicated enough. Don't appreciate the "good for you" flippancy.

lbrandy 7 years ago

All of msvc, gcc, clang, and the isocpp committee have active work ongoing for C++ modules.

We'll have them Soon™.

  • Valmar 7 years ago

    Who knows whether they'll see much use, due to C++ needing to keep backwards compatibility for older projects that demand older versions of C++.

    It probably partially depends on whether old-style headers can be used simultaneously with new-style modules.

_0w8t 7 years ago

Opera contributed the jumbo build feature to Chromium. The idea is to feed the compiler not individual sources but a file that includes many sources. This way, common headers are compiled only once. The compilation time saving can be up to a factor of 2 or more on a laptop.

The drawback is that sources from the jumbo cannot be compiled in parallel. So if one has access to an extremely parallel compilation farm, like developers at Google do, it will slow things down.
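
Mechanically, a jumbo (or "unity") translation unit is just a generated file along these lines (file names invented):

    // jumbo_unit_0.cc -- generated by the build system; the common headers
    // pulled in by these sources are parsed once per batch, not once per file
    #include "audio/mixer.cc"
    #include "audio/resampler.cc"
    #include "net/socket.cc"
    #include "net/http_client.cc"
    // ... up to the configured batch size

One side effect worth noting: file-local statics and anonymous namespaces from the batched sources now share a single translation unit, so previously harmless name clashes can become errors.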

  • maccard 7 years ago

    > The drawback is that sources from the jumbo cannot be compiled in parallel. So if one has access to an extremely parallel compilation farm, like developers at Google do, it will slow things down.

    Generally the way this works is that rather than combining everything into one jumbo file, you combine the sources into multiple jumbo files, which can then be compiled in parallel. UE4 supports it (disclosure: I work for them), and it works by including 20 files at a time and compiling the larger files normally.

    There is also a productivity slowdown where a change to any of those files causes all the other files in the batch to be recompiled, so you can pull frequently changed files out of the batch and compile them individually.

    > The compilation time saving can be up to a factor of 2 or more on a laptop.

    The compilation time savings are orders of magnitude in my experience, even on a high-end desktop. That's for a full build. For an incremental build, there is a penalty (see above for workarounds).

  • speps 7 years ago

    This is also called "unity builds" and is used by Unreal Engine 4[1] and can definitely be used in parallel (eg. IncrediBuild).

    [1] https://api.unrealengine.com/INT/Programming/UnrealBuildSyst...

mcv 7 years ago

This reminds me of my very first job after university. We used Visual C++, with some homebrew framework with one gigantic header file that tied everything together. That header file contained thousands or possibly tens of thousands of const uints, defining all sorts of labels, identifiers and whatever. And that header file was included absolutely everywhere, so every object file got those tens of thousands of const uints taking up space.

Compilation at the time took over 2 hours.

At some point I wrote a macro that replaced all those automatically generated const uints with #defines, and that cut compilation time to half an hour. It was quickly declared the biggest productivity boost by the project lead.
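
The transformation was roughly from the first form to the second (identifier invented for illustration):

    // before: every translation unit that includes the header gets its own
    // copy of thousands of these named constants
    const unsigned int ID_CUSTOMER_NAME = 10341;

    // after: a plain preprocessor token substitution, nothing to emit per object file
    #define ID_CUSTOMER_NAME 10341u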

fizwhiz 7 years ago

Isn't this the reason precompiled headers are a thing?

  • jchw 7 years ago

    As far as I understand it's also one of the reasons modules are a thing... or at least people want them to be.

    Precompiled headers are a pretty ugly solution and the way they've been implemented in the past could be really nasty. (IIRC in old GCC versions it would copy some internal state to disk, then later load it from disk and manually adjust pointers!)

    • jepler 7 years ago

      There must still be some dark pointer magic going on, because I noticed that unless I disabled ASLR on Debian Stretch, each build of a precompiled header came out different, screwing up ccache. I can only conclude that the specific memory layout during an individual run influences the precompiled header (".gch") output. We now run our build process under 'setarch x86_64 --addr-no-randomize'.

      • jepler 7 years ago

            $ for i in `seq 3`; do gcc-6 -x c-header /dev/null -o x.h.gch; sha256sum x.h.gch; done
            98d8093503565836ba6f35b7adf90330d63d9d1c76dfb8e3ad1aeb2d933d1a45  x.h.gch
            17e5de099860d94aaa468c5ad103b3f0dd5e663f6cdbd01b4f12cf210023e71c  x.h.gch
            3cc2f1c0a517b5fedbbd49bb3a34084d9aa1428f33f3c30278a8c61f9ed9ba88  x.h.gch
    • nneonneo 7 years ago

      This isn't uncommon, especially for file formats which are meant for internal consumption. Of course, they end up being huge cans of worms in terms of security, stability and maintainability going forward.

      Basically, instead of defining a real serialization format (and thus having to write serializer/deserializer code), it's way easier to just `fwrite` out your internal structs to disk, one after another, and write some much simpler walker code to walk through any pointed fields appropriately. At some point though this becomes technical debt which needs to be repaid in the form of a total serialization rewrite.
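
      In its crudest form that looks something like this (a simplified illustration, not any particular project's code):

          #include <cstdint>
          #include <cstdio>

          struct Node {
              std::uint32_t id;
              float position[3];
              Node* next;      // raw pointer: meaningless on disk, must be
                               // rewritten by the "walker" code on load
          };

          void save(const Node* n, std::FILE* f) {
              // dump the in-memory bytes as-is; struct layout, padding and
              // pointer width all leak straight into the file format
              std::fwrite(n, sizeof(Node), 1, f);
          }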

      Blender, the popular open-source 3D modelling tool, uses a format like this for their .blend files, and it is really gross. IIRC a few releases back they started working to improve the format to be a little less dependent on low-level internal details, but now they have the nightmare of backwards compatibility to deal with.

      The basic problem is that C/C++ have no mechanism for native serialization, unlike e.g. Java, Python, or any number of other languages, so you're either stuck `fwrite`ing structs or reinventing the wheel.

      • StillBored 7 years ago

        Which is mostly due to the lack of run-time reflection. OTOH, with a little creativity it's possible to create code generators that attach a commonly named method (say, .serialize) to classes to dump their POD fields and call serialize on directly encapsulated classes.

        But you're basically right: everyone ends up doing it their own way, which just ends up being a PITA.

    • icholy 7 years ago

      What do you mean by "manually adjust pointers"?

  • usefulcat 7 years ago

    IME, using precompiled headers with gcc is largely a waste of time. I desperately wanted it to be otherwise. I tried many variants: including everything, nothing, a few select, commonly used headers. No matter what I tried, nothing was faster than no PCH. This is a project that has ~90k LOC in 136 object files and compiles in about 2 minutes on 64 cores.

    Yes, I was measuring time to rebuild everything (including the PCH) from scratch. So it's probable that incremental compilation is slightly faster using PCH, it's just not nearly as much as I was hoping for.

  • Someone 7 years ago

    It is, but precompiled headers are a pain in the a… (often less so than not using them, but still)

    They force programmers to tell the compiler what intermediate result to cache. Finding the best intermediate result to cache is a black art, and that set will change when your source code grows, forcing you to either accept that your precompiled headers may not help that much, or to spend large amounts of time optimizing build times.

    • int_19h 7 years ago

      There's no need for black arts for most. For most code out there, the headers that really blow up compilation times come from either the standard library, or from large third party libraries. A simple rule of thumb, therefore, is to simply shove all such headers into your precompiled header, and only ever #include the latter in your code.

      Simply put - if it's #include <...>, it goes into the precompiled header. Otherwise, it goes directly into the source.

      The downside of this is that every time you add a new dependency, the entire project is rebuilt, since the change in your precompiled header affects all translation units. But adding dependencies is rare, and changing code and rebuilding is far more common.
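
      Concretely, that rule of thumb gives you a precompiled header along these lines (library choices just for illustration):

          // pch.h -- only stable external headers; compiled once into a .gch/.pch
          #include <algorithm>
          #include <map>
          #include <memory>
          #include <string>
          #include <vector>

          #include <boost/asio.hpp>

          // foo.cpp
          #include "pch.h"     // always first
          #include "foo.h"     // project headers stay out of the precompiled header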

timvisee 7 years ago

I would love to see the times of this on a Linux system (preferably on the same hardware).
