Making a Go program faster with a one-character change

hmarr.com

511 points by hcm 3 years ago · 248 comments

ludiludi 3 years ago

> If you read the title and thought “well, you were probably just doing something silly beforehand”, you’re right!

Don't feel too silly. Russ Cox, one of the technical leads on the Go language, made the same mistake in the regexp package of the standard library.

https://go-review.googlesource.com/c/go/+/355789

sakras 3 years ago

A while ago at my company we switched from GCC to Clang, and noticed a couple of massive regressions (on the order of 50%?) in performance having to do with floating point.

After profiling for a bit, I discovered that suddenly a lot of time was spent in isinf on Clang and no time in GCC… Clang was emitting a function call where GCC wasn’t. I happened to randomly change isinf to std::isinf (it’s a random habit of mine to put std:: in front of these C functions). Suddenly the regression disappeared! I guess on Clang only std::isinf was a compiler intrinsic while GCC recognized both? Anyway, that’s my small-change optimization story.

  • 10000truths 3 years ago

    C defines isinf as a macro, whereas C++’s std::isinf is a function. Perhaps the discrepancy has to do with differences in how they’re evaluated?

  • jeffrallen 3 years ago

    > random habit of mine to put std:: in front of these C functions

    And did you learn your lesson about making random changes that "shouldn't matter" without proving they don't matter? :)

    I find that once I account for the time it takes to make these changes correctly, they're rarely worth making.

asim 3 years ago

If you want to have a solid understanding and need to do it in just a few hours, here are a few things to review.

- The Go programming language spec https://go.dev/ref/spec

- Effective Go https://go.dev/doc/effective_go

- Advanced Go concurrency patterns https://go.dev/talks/2013/advconc.slide#1

- Plus many more talks/slides https://go.dev/talks/

karmakaze 3 years ago

> I did consider two other approaches:
>
> - Changing Ruleset from being []Rule to []*Rule, which would mean we no longer need to explicitly take a reference to the rule.
> - Returning a Rule rather than a *Rule. This would still copy the Rule, but it should stay on the stack instead of moving to the heap.
>
> However, both of these would have resulted in a breaking change as this method is part of the public API.

The problem with heap-allocated objects could be due to the incorrect public API.

The change that improves performance also gives out pointers to the actual elements of the Ruleset itself, permitting the caller to change the contents of the Ruleset, which wasn't possible before the speed-up. Perhaps you're already aware, since a change to []*Rule was being considered.
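
A minimal sketch of that aliasing (Rule, Ruleset, and Match follow the article's names; the bodies are hypothetical):

    package main

    import "fmt"

    type Rule struct{ Name string }
    type Ruleset []Rule

    // Match returns a pointer into the slice's backing array, which is
    // what makes the post-fix version allocation-free.
    func (r Ruleset) Match(name string) *Rule {
        for i := range r {
            if r[i].Name == name {
                return &r[i]
            }
        }
        return nil
    }

    func main() {
        rs := Ruleset{{Name: "a"}}
        rs.Match("a").Name = "changed" // mutates rs[0] through the returned pointer
        fmt.Println(rs[0].Name)        // prints "changed"
    }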

  • mook 3 years ago

    The best part is, there's a subtle API breakage here: the returned *Rule is now a reference to an element of the []Rule, so if the caller was previously modifying the returned value it'll change the slice.

    It's debatable what API guarantees existed around this though; most of the time this would be unspecified.

  • ok_dad 3 years ago

    Sometimes it doesn't matter if a public API is incorrect, because it's set in stone for whatever reason, and you just need to fix the problem internally.

    • spockz 3 years ago

      This is why I like https://github.com/openrewrite so much. One gets to tell users how to rewrite code automatically. It makes refactoring almost as easy as in a mono repo.

      • bspammer 3 years ago

        This looks super interesting, but as far as I can tell they don’t go into how to inform downstream users to add your recipes to their maven/gradle configuration when you make a breaking change.

        In my head, the ideal flow would be an annotation on your library function/class which triggers an IntelliJ suggestion for downstream users affected by your breaking change to run the automated refactor for them. Kinda like a more helpful @Deprecated.

        • spockz 3 years ago

          It would definitely be nice to have this generated from a @Deprecated-like annotation!

          The maven plug-in seems to use the recipes directly as dependencies: https://github.com/openrewrite/rewrite-maven-plugin

          In our internal situation the parent POM could already control this plug-in definition, including the versions of the recipes. At the very least the recipes could follow the same version as the artifact.

          I thought I saw somewhere that the recipe was bundled with the artifact itself. That is very neat for simple use cases.

          However, it suffers from the same flaw as the native-image configuration for GraalVM in artifacts. Sometimes the configuration/recipe needs to change, e.g. because of new insights or because it is incompatible with new versions of GraalVM/OpenRewrite. So I suppose having an extra version dimension could solve this. Then one can always depend on the latest version of the recipe for that artifact.

    • karmakaze 3 years ago

      The way to fix it in that manner here is to undo the 42% speedup and return the heap-allocated object for the caller to mangle.

      • account42 3 years ago

        Yes, if you are serious about backwards compat then you pay that cost and perhaps add a separate function that is faster for callers to opt into.

        • karmakaze 3 years ago

          Effectively deprecating the old one and recommending the new one is a way to smoothly 'fix' an API.

coder543 3 years ago

There is potentially another option: use the midstack inliner to move the allocation from the heap to the stack of the calling function: https://words.filippo.io/efficient-go-apis-with-the-inliner/

As long as the global slice is never mutated, the current approach is probably fine, but it is definitely a semantic change to the code.
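
A rough sketch of the idea from the linked article (hypothetical names): keep the exported constructor small enough to inline, so escape analysis runs in the caller's context and can keep the value on the caller's stack.

    package rules

    type Rule struct {
        Name  string
        state [64]byte
    }

    // New is cheap enough for the compiler to inline into callers. Once
    // inlined, the &Rule{} no longer automatically escapes: if the caller
    // keeps the pointer local, the Rule can live on the caller's stack.
    func New(name string) *Rule {
        r := &Rule{Name: name}
        r.init()
        return r
    }

    // init holds the expensive setup, outlined so New stays inlinable.
    func (r *Rule) init() {
        for i := range r.state {
            r.state[i] = byte(i)
        }
    }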

  • ploxiln 3 years ago

    That seems like overkill for this particular case, but it's a very interesting technique, thanks for the link!

  • throwaway232iuu 3 years ago

    Why doesn't Go use RVO like C++ and Rust?

    https://en.wikipedia.org/wiki/Copy_elision#Background

    • coder543 3 years ago

      I don’t think we’re on the same page about what midstack inlining is being used for in my suggestion. This discussion is about eliminating a heap allocation, which as far as I understand, RVO never does. Please read the article I linked if you want to discuss this further. I don’t want to repeat the article pointlessly.

      I’m also fairly sure Go uses RVO here too, which cuts down on the number of times the object is copied around, but again, it’s irrelevant to the discussion of heap allocations. Copying the object isn’t the performance problem here, needlessly allocating a very short-lived object on the heap over and over is.

  • morelisp 3 years ago

    Packages should be exposing an API with destination slices more often to begin with. The stdlib is pretty good about this (there are a few missing, though 1.19 closed the most obvious gaps), but most third-party code is awful. Or worse, it only takes strings.
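
The shape being described, in the style of strconv.AppendInt (AppendEncode is a hypothetical example): the caller owns the destination and can reuse one buffer across calls.

        // AppendEncode appends the big-endian encoding of v to dst and
        // returns the extended slice; it allocates nothing if dst
        // already has spare capacity.
        func AppendEncode(dst []byte, v uint32) []byte {
            return append(dst, byte(v>>24), byte(v>>16), byte(v>>8), byte(v))
        }

    Callers then amortize the allocation: `buf = AppendEncode(buf[:0], v)` in a loop reuses the same backing array.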

tuetuopay 3 years ago

Aaaaaand that's why I love Rust's decision to make copies explicit with `.clone()`. Annoying as hell when you're not used to it but overall worth it.

  • BlackFly 3 years ago

    Except a lot of structs also derive and prefer `Copy`, and a lot of Rust code avoids the heap allocations that would require `Clone`. The `Copy` trait can be used implicitly, like in the example here. On the other hand, due to the lack of a garbage collector, you wouldn't be able to return a reference to the copy, which might lead you to find your accidental copy.

    • tuetuopay 3 years ago

      I have yet to come across structs that implement `Copy` while being expensive to actually copy. The largest I can think of is `Uuid` from the `uuid` crate, which is 128 bits in size. This is a single copy for most machines, since modern hardware has 128-bit support for cases like this. Still, two 64-bit words to copy is definitely negligible: that's the equivalent of copying two pointers.

      I agree with you for the garbage collector. By design, a GC allows you to willy-nilly copy without thinking about the consequences.

    • kangalioo 3 years ago

      The Copy trait can only be used for bitwise copies. Expensive copying with heap allocations will never happen implicitly

assbuttbuttass 3 years ago

Returning a pointer to a local variable is convenient, but can be a source of hidden allocations.

It's best to treat each struct as a "value" or "pointer" type, and use one or the other consistently for each type. This mostly avoids the need to use & in the first place.
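
For example (hypothetical type; returning the address of a local forces the compiler's escape analysis to move it to the heap):

    package config

    type Config struct{ Retries int }

    // &c outlives the call, so c is moved to the heap:
    // one allocation per call.
    func NewConfig() *Config {
        c := Config{Retries: 3}
        return &c
    }

    // The value is copied to the caller instead; no heap allocation.
    func DefaultConfig() Config {
        return Config{Retries: 3}
    }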

cbsmith 3 years ago

As an old C/C++ programmer, I'm always surprised by how often software developers are surprised by the performance costs of inopportune value semantics (C and C++, even more so, punish you severely for using value semantics when you shouldn't). I increasingly see the wisdom of languages with implicit reference semantics.

It's not that value semantics can't be better (they most assuredly can be), or that reference semantics don't cause their own complexity problems, but rather that so often we thoughtlessly imply/impose value semantics through interfaces in ways that negatively impact performance; getting interfaces wrong is a much tougher bell to unring.

The vast majority of my mental energy when I define an interface in C++ is carefully thinking through a combination of ownership contracts and value vs. reference semantics that I can mostly ignore in languages with implicit reference semantics. While occasionally ignoring those contracts while developing in Java/Python/whatever comes back to bite me, the problem isn't nearly as common or problematic as when I unintentionally impose value semantics in a language that allows me to.

  • thomascgalvin 3 years ago

    > I increasingly see the wisdom of languages with implicit reference semantics.

    I spend most of my time in a JVM language of one flavor or another, and when I was learning Go, the first thing that stuck out at me was, "why would I ever want the compiler to invisibly copy a data structure for me?"

    I suppose the primary reason is to prevent the callee from modifying the caller's data out from under them; unless you pass a reference value, you know the callee cannot modify your data.

    But, as someone who leans heavily into "everything should be as immutable as possible," the second thing that stuck out at me was "wait, a struct can't have const fields?"

    When I write code, it's common to have references to immutable classes thrown around with wild abandon, heedless of ownership, threads, or good taste, because the data just can't change. But that's a paradigm that Go simply doesn't support.
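
    The closest Go gets is immutability by convention (a sketch, not a language guarantee): unexported fields plus value receivers, enforced only at the package boundary.

        package geom

        // Point has no exported fields and no setters, so code outside
        // this package cannot mutate one; code inside the package still can.
        type Point struct {
            x, y int
        }

        func NewPoint(x, y int) Point { return Point{x: x, y: y} }

        func (p Point) X() int { return p.x }
        func (p Point) Y() int { return p.y }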

    • cbsmith 3 years ago

      > When I write code, it's common to have references to immutable classes thrown around with wild abandon, heedless of ownership, threads, or good taste, because the data just can't change.

      If there's anything I wish languages with implicit reference semantics would adopt, it's implicit immutability. Java would be so much nicer with a keyword halfway between "final" and "volatile" that means "yes, you can actually mutate this", with final semantics as the default for fields & variables.

      • cogman10 3 years ago

        Agreed.

        Could you imagine a Java where you have a `Map` and a `MutableMap` and that's what you put in your API? It'd make it SO much clearer how safe any individual API is to call.

        • spockz 3 years ago

          Scala has had this for ages. You can have it today, even in Java, either through the Google collections library or through a library that mimics FP-style programming. The name eludes me for the moment.

        • morelisp 3 years ago

          In general this doesn't work: the history rule says mutable types are not proper subtypes of immutable ones (and the converse is obvious). If you want to capture mutability in your type system, it needs to be orthogonal to subtyping (like C/C++ const).

          • a_t48 3 years ago

            C++ still has this problem - `std::unordered_map<std::string, std::string>` and `std::unordered_map<std::string, const std::string>` are basically unrelated types - you can't const-cast the templated const away. (I may be misunderstanding here)

            • masklinn 3 years ago

              > you can't const-cast the templated const away.

              That seems like a good thing. If you're handed a map to const values you can't just go "imma gunna mutate them anyway".

              • a_t48 3 years ago

                Yup, it's definitely a bit of a code smell if you do. The issue is more the reverse, though - I can't make a mutable map, then hand it by pointer/reference to something that says it wants an immutable map later.

                • masklinn 3 years ago

                  Not necessarily a bad thing either, things can get odd if you're not the only owner and it's mutated under you unexpectedly.

                  • a_t48 3 years ago

                    That only matters if you're storing it. The big issue is 99% of code out there uses mutable template types for containers, and if you ever declare a container that doesn't have a mutable template type, you stub your toe as your new container isn't compatible with anyone else's code. You can't even easily copy your way out.

                      std::unordered_map<std::string, const std::string> m;
                      m["foo"] = "bar";
                      std::unordered_map<std::string, std::string> m2 = m;
                    
                    doesn't compile.
                    • account42 3 years ago

                      > That only matters if you're storing it.

                      Or if you (or something you call) mutate it through another reference/pointer. Sometimes I wish there was a stronger const where the compiler (and programmer) could actually assume the object is not mutated while that reference/pointer is alive. Of course checking that is no longer doable with just the type system.

                      > You can't even easily copy your way out.

                      That depends on the type. std::unordered_map could provide a copy constructor that does the conversion.

                      And you can do the copy via iterators, although that is probably not as efficient as it could be because it needs to re-hash the keys.

                        std::unordered_map<std::string, std::string> m2(m.begin(), m.end());
                      • a_t48 3 years ago

                        Sure, copying via iterators works, it's just sad and very much not implicit.

              • morelisp 3 years ago

                Normally (when containers are not involved) this is exactly the point of a const cast.

                • account42 3 years ago

                  No, the primary point of a const cast is "I need to pass this to an API that expects a non-const pointer even though it won't mutate it".

                  Casting const away and then mutating is a footgun as you are in undefined behavior territory as soon as your pointer/reference is not just const itself but points to a const variable.

          • cogman10 3 years ago

            > the history rule

            I'm unfamiliar with this rule (and not finding anything good to google). Can you elaborate?

            I can't really think of a scenario where an immutable data structure isn't a subset of actions against a mutable data structure.

            • kapep 3 years ago

              I had to look it up too; it apparently is a constraint for subtypes defined in Liskov's substitution principle [1]. From Wikipedia:

              > History constraint (the "history rule"). Objects are regarded as being modifiable only through their methods (encapsulation). Because subtypes may introduce methods that are not present in the supertype, the introduction of these methods may allow state changes in the subtype that are not permissible in the supertype. The history constraint prohibits this. It was the novel element introduced by Liskov and Wing. A violation of this constraint can be exemplified by defining a mutable point as a subtype of an immutable point. This is a violation of the history constraint, because in the history of the immutable point, the state is always the same after creation, so it cannot include the history of a mutable point in general. Fields added to the subtype may however be safely modified because they are not observable through the supertype methods. Thus, one can define a circle with immutable center and mutable radius as a subtype of an immutable point without violating the history constraint.

              [1]: https://en.wikipedia.org/wiki/Liskov_substitution_principle#...

              • aatd86 3 years ago

                Just curious, isn't it obvious from a logical standpoint? I don't see how one could consider a mutable type to be a subtype of an immutable one. On the other hand, an immutable subtype of a mutable one seems plausible?

                • cogman10 3 years ago

                  Immutable from a mutable is just as implausible. You cannot remove a method from a subtype, which is why mutating methods end up being disabled through other means (like throwing exceptions).

                  This is also a violation of LSP and the open-closed principle.

                  Consider a `List` with an `add` function. What would you do with that `add` function to make an `ImmutableList` subtype?

                  • aatd86 3 years ago

                    Ah I see. Why throw exceptions? They could just be no-ops depending on the typestate?

                    But still it doesn't look very nice in practice. I think that in that case, it does make much more sense for a mutable type to be a subtype of the immutable one.

                    The immutable threw me off track. It's just that the supertype would lack mutation operations. (in a sense immutable would be the default for every type).

                    Thank you.

          • gowld 3 years ago

            Not quite.

            Conceptually, you can have constness in your subtype system (as long as you are sticking with interfaces (methods), as Liskov's subtyping model does, and aren't inheriting potentially-mutable fields).

            MutableMap and ImmutableMap are both subtypes of a hypothetical ReadableMap. ImmutableMap is the same as ReadableMap, but has an informal contract that subclasses shouldn't add mutability.

        • foverzar 3 years ago

          Kotlin has this

          • thomascgalvin 3 years ago

            Kotlin has this, but the Map is (usually) a MutableMap under the covers, because it's Java bytecode at the lower levels. You have to go out of your way to footgun yourself, but it's still possible.

    • titzer 3 years ago

      > When I write code, it's common to have references to immutable classes thrown around with wild abandon, heedless of ownership, threads, or good taste, because the data just can't change. But that's a paradigm that Go simply doesn't support.

      You might get a kick out of Virgil. It's easy (and terse!) to define immutable classes and you can have immutable ADTs too. (plus tuples, generics with separate typechecking, etc).

  • throwaway894345 3 years ago

    I also have a background in C/C++, etc and I've only ever found myself missing value semantics when I use languages with implicit reference semantics. I guess I always figured the solution was "value semantics with better education / tooling". Education: people should understand value semantics. Tooling: imagine an IDE that highlights allocation points automatically (or perhaps the problem is implicit allocations rather than value semantics?).

    • codeflo 3 years ago

      > perhaps the problem is implicit allocations rather than value semantics?

      I think that’s true. Expensive copies should never have been implicit. There was a story some time ago about a single keypress in the address bar of Chrome causing thousands of memory allocations. The culprit: lots of std::string arguments up and down the call stack.

      Rust gets this right, with the hindsight of C++’s example: “a = b” is a move operation by default and clone() is always explicit, except for plain data types where copying is literally memcpy — and those are clearly marked as such by the type system.

      • cbsmith 3 years ago

        IMHO, implicit allocations are a bit of a red herring. Yes, in C/C++ heap allocations are proportionately pretty expensive, but I've seen Java programs make just ridiculous amounts of implicit allocations and there really isn't much of a problem.

        But allocations aren't the same as copies, and the argument for reference semantics has always been that implicit copies are problematic. In your std::string example, having that many String copies in a Java program would be similarly terrible (and this sometimes happens by accident because of abstraction layers that hide all the copying going on under the covers).

        I do think Rust gets a lot of stuff right, but Rust's cognitive load is broadly recognized. I tend to see it as C++ with a lot fewer foot guns. ;-)

        • throwaway894345 3 years ago

          > Yes, in C/C++ heap allocations are proportionately pretty expensive, but I've seen Java programs have just ridiculous amounts of implicit allocations but there really isn't much of a problem.

          Java programs make "ridiculous amounts of implicit allocations" because allocations are cheap in Java. And they need to be cheap because Java doesn't have value semantics so it leans hard on escape analysis + cheap allocations.

          I agree with the rest of your comment, although I think most of Rust's "cognitive load" amounts to borrow-checker-vs-garbage-collection. You could envision a Rust with explicit allocations and a GC, and that language would have a "cognitive load" approaching that of Go while also being a fair bit more performant insofar as people can much more easily reason about allocations and thus performance.

          • cbsmith 3 years ago

            > Java programs make "ridiculous amounts of implicit allocations" because allocations are cheap in Java. And they need to be cheap because Java doesn't have value semantics so it leans hard on escape analysis + cheap allocations.

            Yes, but that's kind of the point, right? Implicit allocation isn't really a problem because a runtime that optimizes the allocations magically for you is a lot easier to build than a runtime that optimizes whether you really need to be copying objects as much as you do.

            • throwaway894345 3 years ago

              > Implicit allocation isn't really a problem because a runtime that optimizes the allocations magically for you is a lot easier to build

              As far as I know, Java's (default) runtime gives cheap allocations at the cost of long GC pause times.

              > than a runtime that optimizes whether you really need to be copying objects as much as you do

              It's not "copying", it's "allocating", and avoiding allocations isn't that much work (and frankly I'm surprised it's such a minor problem that no one has bothered to build an IDE plugin that highlights these allocation points automatically--or at least I haven't heard of such a thing). Anyway, "a runtime that minimizes allocations" is just an escape analyzer and Java has one of these too, and IIRC it's a lot more sophisticated than Go's (but it's also a lot harder to reason about as a consequence).

              • cbsmith 3 years ago

                > As far as I know, Java's (default) runtime gives cheap allocations at the cost of long GC pause times.

                "long GC pause times" is kind of vague, so I guess you could be correct, but in practice there's a LOT of different ways the memory management can be handled, many of which are deemed "pauseless GC" (though the term is somewhat misleading).

                My statement was considering that reality though. While not true for some use cases, in the vast majority of cases, the runtime optimizes the allocations more than sufficiently.

                > It's not "copying", it's "allocating"

                Allocators can do a pretty good job of minimizing the overhead of allocation, to the point the amortized cost isn't much more than a single machine instruction. Allocating gigabytes of memory quickly is possible. Copying the data can be a lot more work, and often objects have copy semantics that add a lot more additional work.

                > Anyway, "a runtime that minimizes allocations" is just an escape analyzer and Java has one of these too, and IIRC it's a lot more sophisticated than Go's (but it's also a lot harder to reason about as a consequence).

                I think you're implicitly saying "a runtime that minimizes heap allocations" there, in which case I'd agree.

                • throwaway894345 3 years ago

                  > in practice there's a LOT of different ways the memory management can be handled, many of which are deemed "pauseless GC" (though the term is somewhat misleading).

                  Yes, but I'm pretty sure those "pauseless GC" schemes impose other tradeoffs.

                  > My statement was considering that reality though. While not true for some use cases, in the vast majority of cases, the runtime optimizes the allocations more than sufficiently.

                  I'm not sure I follow. The same could be said for Go--in the vast majority of cases, Go's tradeoffs (slow allocations, low latency / non-moving GC) are also suitable.

                  > Allocators can do a pretty good job of minimizing the overhead of allocation, to the point the amortized cost isn't much more than a single machine instruction.

                  As far as I know, speeding up allocations to this degree requires a moving GC which imposes a bunch of other constraints (including copying a bunch of memory around).

                  > Allocating gigabytes of memory quickly is possible. Copying the data can be a lot more work, and often objects have copy semantics that add a lot more additional work.

                  Yes, but the bottleneck here wasn't the copying, it was the allocations. And if you optimized away allocation cost entirely such that only the copy cost remained, that cost would be so small that the OP would never have bothered to profile because copying small objects like this is so cheap compared to everything else (even if it is expensive compared to bump allocating).

                  > I think you're implicitly saying "a runtime that minimizes heap allocations" there, in which case I'd agree.

                  Yes, the allocator and GC are concerned with heap allocations and not stack allocations. I'm using "allocations" as a shorthand for "heap allocations".

                  • cbsmith 3 years ago

                    In hindsight, I think I chose how to present this poorly, because yes, in this case, the allocation is what is killing the performance. I look at it, and I just see unnecessary implied behaviour creating a performance problem. Usually it isn't the allocations themselves that kill you, but it certainly is the case here.

                    • throwaway894345 3 years ago

                      I agree with you (I think?) that the implicit allocations are a pain point. I think in the Go case, it is the allocations that kill you most of the time (at least that's the case for me), but in C++ it's more likely to be expensive copy constructors or destructors or so on.

          • jhoechtl 3 years ago

            A long long time ago Rust was a GC language.

          • pjmlp 3 years ago

            OCaml and Standard ML.

        • morelisp 3 years ago

          Allocations are as damaging as your free function is slow.

          Java has a tremendously good GC, so can cope with lots of allocations. Go has an OK one, so needs some help (but mollifying it often pays dividends elsewhere in locality and memory usage too). C++ has your default system heap, good luck.

          • throwaway894345 3 years ago

            Historically Java has traded long pause times for fast allocations, although I'm of the impression that it has recently found a way to have its cake and eat it.

            • klodolph 3 years ago

              Java has been tunable for a long time. Periodically, the recommended tuning changes, or new GC algorithms become available, etc. But it has long been possible to get short pause times with various combinations of choosing the right algorithm and writing your program the right way.

              I think what really throws people off here is that getting good performance out of a Java application involves some skills which are alien to C++ programmers, and vice versa. You take an experienced C++ programmer and drop them into a Java codebase, they may have a very poor sense of what is expensive and what is cheap. Vice versa… experienced Java programmers don’t do well in C++ either.

              The result is that you have precious few people with any significant, real-world experience fixing performance issues in both languages.

              • throwaway894345 3 years ago

                Agreed, but usually tuning for short pause times involves trading off throughput or allocation performance. But at the end of the day, if you aren't allocating a bunch of garbage in the first place, then you don't need to be as concerned about the costs of allocating or cleaning up the garbage. I wish Go did more to make allocations explicit so they could be more easily recognized and avoided; I dislike Java's approach of making allocations even more implicit/idiomatic while trying to fix the cost problem in the runtime (although I admire the ambition).

                • SpaghettiCthulu 3 years ago

                  Parallel garbage collection in Java has been a thing for a long time. With the right tuning you might have very infrequent STW GCs, even in a game as allocation-heavy as Minecraft.

                • lazide 3 years ago

                  What do you consider a ‘long’ pause time?

                  I’ve had no issues with Java 17+ under heavy allocation/garbage collection (a data encryption pipeline I haven’t tuned to reuse buffers yet), and its pause times are on the order of a handful of milliseconds, without meaningful tuning. I think it’s doing something like a GB/s of garbage collection.

                  And the jvm in question is doing a LOT more than just this, so it’s coping with millions of allocated objects at the same time.

                  • throwaway894345 3 years ago

                    I consider tens of milliseconds to be a long pause time (P99 should be <10ms), which is what I understand to be the ballpark for G1 (I haven't tested 17+). No doubt JVM is a fine piece of engineering, but I prefer Go's approach of just not allocating as much to begin with (Java's cheap allocations require a moving GC, which introduces challenges with respect to rewriting pointers and so on, which in turn introduces constraints for FFI and makes the whole system more complex). I wish it went further and made allocations more explicit.

                    • pjmlp 3 years ago

                      If one cares about pause times, G1 isn't it; rather the pauseless ZGC, Azul's C4, or Shenandoah.

                      They're capable of handling TB-sized heaps with microsecond pauses.

                      • throwaway894345 3 years ago

                        Yeah, I’m nominally familiar with these, but I can’t understand why one of these low latency collectors wouldn’t be the default GC unless they impose other significant tradeoffs.

                        • lazide 3 years ago

                          They don’t change it from the default because G1 isn’t terrible and Java has a long history of ensuring backwards compatibility - not just in APIs, but also behavior, warts and all.

                          If they changed the default GC, a lot of folks would freak out, even if it was objectively better in nearly every situation. Because someone, somewhere is relying on some weird bug in G1, or whatever and now their software behaves differently, and it’s a problem.

                          Give it a couple more major revs, and it might still happen. G1 became the default in what, Java 9?

                        • pjmlp 3 years ago

                          Naturally there are tradeoffs; for example, the amount of infrastructure they need is only justified when one is working at such scale. ZGC, for example, requires large pages and makes use of NUMA if available.

        • codeflo 3 years ago

          I’m not sure I get what you mean. You wouldn’t have that many String copies in Java by passing an unchanged String down the call stack. My point is that it’s too easy to make this mistake in C++.

          • cbsmith 3 years ago

            In Java, the mistake happens only when there's an abstraction that hides the copying from you, so it isn't implicit in the same way, but it's still implicit.

      • adrian17 3 years ago

        > Rust gets this right, with the hindsight of C++’s example: “a = b” is a move operation by default and clone() is always explicit

        Note that a move can still do a copy; in fact, Rust is kinda notorious for generating more on-stack memory copy operations than C++. It’s slowly improving, but it can still be surprisingly bad in some cases.

      • xdavidliu 3 years ago

        > except for plain data types where copying is literally memcpy

        What do you mean by this? If I say `let x = 5; let y = x;` in Rust, that's a "plain data type copy" of a stack value, but memcpy is usually used to copy heap memory. What connection between copying of primitive simple stack values and memcpy are you suggesting here?

        • kccqzy 3 years ago

          The compiler can optimize memcpy with a known size into a small number of move instructions so they are identical to copying stack values.

          Try playing with memcpy on Godbolt and you'll find that the compiler will compile the memcpy to a single mov instruction when the size is small, and some movdqu/movups when the size is slightly large, and only a function call when the size is huge.

          > memcpy is usually used to copy heap memory

          memcpy is often used in low-level serialization / deserialization code since you can't just cast a buffer pointer to a uint32_t pointer and dereference that; the solution is memcpy between variables that are often both on the stack.

        • orra 3 years ago

          > What connection between copying of primitive simple stack values and memcpy are you suggesting here?

          They're just using 'memcpy' as a shorthand for saying the bit pattern is blitted. Semantically, that's like a memcpy. The point is, there are no expensive allocations, nor do any embedded pointer fields need to be adjusted, etc.

        • chlorion 3 years ago

          Why do you think memcpy is normally used to copy heap memory? It's just a general bitwise copy from one location to the other.

          I think the confusion here is that there isn't always a literal call to memcpy for copying small types like ints in the emitted code, but it's always doing something with the same effect and maybe sometimes using an actual memcpy (probably when copying arrays?).

          Also something interesting is that memcpy is used for copying data between stack variables in C sometimes when you need to convert some type to another one without using a cast.

      • throwaway894345 3 years ago

        Allocating isn't "an expensive copy"; it's not analogous to clone() in Rust. The copy isn't the problem, it's the allocation.

        • cbsmith 3 years ago

          I'd argue quite the reverse. Allocation can be quite efficient if done properly, but copying involves a lot of other work.

          • throwaway894345 3 years ago

            I disagree--the bottleneck here is entirely the allocation. The copying is just a memcpy and it's very fast for small structs like this; like I said, it's not the same as a clone() in Rust, which is a deep copy. If you optimized the allocation away entirely (leaving only the copy cost), there wouldn't have been a significant performance problem and this blog would never have been written.
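
              A pair of benchmarks makes this visible (hypothetical Rule-like struct; run with `go test -bench=. -benchmem`, and the pointer version should report one allocation per op while the value version reports zero):

                  package rules

                  import "testing"

                  type Rule struct{ Name string; Score int }

                  var ptrSink *Rule // sinks prevent the compiler from
                  var valSink Rule  // optimizing the calls away

                  func byPointer() *Rule { r := Rule{Name: "x"}; return &r } // escapes to heap
                  func byValue() Rule    { return Rule{Name: "x"} }          // stack copy

                  func BenchmarkByPointer(b *testing.B) {
                      for i := 0; i < b.N; i++ { ptrSink = byPointer() }
                  }

                  func BenchmarkByValue(b *testing.B) {
                      for i := 0; i < b.N; i++ { valSink = byValue() }
                  }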

            • glandium 3 years ago

              Actually, you'll find that in Rust, Box::new(stuff) will too often put stuff on the stack before copying it into the newly allocated memory. For large enough stuff, that can be slower than the allocation.

    • cbsmith 3 years ago

      > I've only ever found myself missing value semantics when I use languages with implicit reference semantics.

      Oh, I miss it every time. ;-)

      I will say though that some newer languages seem to have a confused idea about how to offer mixed semantics. A bunch of them tie semantics to types. The ideal interface can vary by usage context. It's hard enough getting the semantics right as the callee (as opposed to caller), let alone when you're defining a type that will be used by who knows how many interfaces.

      > I guess I always figured the solution was "value semantics with better education / tooling".

      I've always thought much the same, but I have slowly come to appreciate that it's more than just education & tooling. Even with good education & tooling, there's a cognitive load that comes with getting interfaces right that for the general case is just not worth it.

      • adgjlsfhk1 3 years ago

        I think this is half right. For anything 64 bits or smaller, value semantics are pretty much always going to be better. That said, being able to choose between value and reference semantics per object, for larger objects, is a pretty useful feature.

        • cbsmith 3 years ago

          > For anything 64 bits or smaller, value semantics are pretty much always going to be better.

          That's assuming a 64-bit CPU (which admittedly seems like a reasonable assumption). The nice thing about the abstraction though is that there's nothing preventing the runtime from applying value semantics for those trivial small-object cases where they're obviously more efficient.

          • account42 3 years ago

            Even for a 32-bit CPU, a 64-bit type is only two words to copy - and in many cases those "copies" are just register loads. In contrast, reference types mean that even to access the value you have to read the reference and then indirectly load the memory it points to. You'd have to make something really contrived for a two-word type to end up more efficient as a reference than as a value.

      • throwaway894345 3 years ago

        > I will say though that some newer languages seem to have a confused idea about how to offer mixed semantics. A bunch of them tie semantics to types.

        Curious about what you mean here. This sounds like C#'s class/struct distinction to me.

        • cbsmith 3 years ago

          That's exactly the example I was thinking of.

          • throwaway894345 3 years ago

            Yeah, I never cared for that. Specifically, I'd prefer that everything was just "struct", but structs could implement interfaces, which is essentially the Go/Rust model.

    • TremendousJudge 3 years ago

      >or perhaps the problem is implicit allocations rather than value semantics

      To me, it sounds like this is it. "Explicit is better than implicit" is a very useful truism.

      • cbsmith 3 years ago

        The counterargument to "explicit is better than implicit" is that abstraction & encapsulation are such significant force multipliers. If done properly, implicit is good. It's just that in the case of copying, doing it "properly" is well nigh impossible.

        • kazinator 3 years ago

                    explicit  implicit
          
            good       *    <    *
           
                       v    v    v
          
            bad        *    >    *
          
          
          Good implicit is better than good explicit. (If all is good, go for implicit.)

          Bad explicit is better than bad implicit. (If all is bad, go for explicit; don't hide bad explicit with bad implicit.)

          Good explicit or implicit is better than bad explicit or implicit.

    • ok123456 3 years ago

      > Tooling: imagine an IDE that highlights allocation points automatically

      Rider does this already for C#.

    • Serow225 3 years ago

      The JetBrains IDEs can do this, at least for .NET

  • titzer 3 years ago

    I think the main issue in C++ isn't value semantics, it's deep copy semantics. E.g. in a functional language, ADTs are immutable and don't have identity. They can be freely copied, or not, passed by reference or by value, but they are never deep copied. Comparison may be deep, but passing them around never is.

    That is to say, I think I mostly am agreeing with you. In Java, objects are always passed by reference, never by value, and never implicitly copied. But Java doesn't have any value types other than primitives. When I added ADTs to Virgil, I wanted to get the best of both worlds; objects are pass by reference, like Java, and ADTs are immutable, identity-less, so they are deep compared but never deep copied. (ADTs in Virgil can also sometimes be flattened, so they don't necessarily result in heap allocation).

    • cbsmith 3 years ago

      In a functional language, you don't have to worry about bits of code mutating your data. ;-) On the flip side, there's a lot of cognitive load that comes with functional languages, so while they do address the problem neatly...

      I'd have to take a look at Virgil to appreciate your approach, but I'm always leery of implicit value vs. reference semantics tied to types (aside from the whole array fiasco, easily the ugliest part of Java's type system). So often the particular semantics you want are driven by the context of what you're doing, rather than what you're doing it with.

    • nyanpasu64 3 years ago

      I still don't see why value structs need to be immutable; ints are mutable in all languages, and structs are mutable in C, C++, and Rust (if you `let mut`) and it's a feature of the language.

  • tylerhou 3 years ago

    The performance issue here is not value semantics, it's the overhead of automatic lifetime management. The copy is cheap. The lifetime tracking is not because it forces a heap allocation and creates additional GC pressure. In fact, assuming Rule is small, if Match returned by value, the code would be similarly as fast.

    • cbsmith 3 years ago

      > The performance issue here is not value semantics

      There's no performance cost to value semantics, so of course not.

      > The copy is cheap. The lifetime tracking is not because it forces a heap allocation and creates additional GC pressure. In fact, assuming Rule is small, if Match returned by value, the code would be similarly as fast.

      I'm referring more to how this stuff seeps in without the programmer realizing it. It's the implicit nature of all this behaviour that is the problem.

      • tylerhou 3 years ago

        Sorry, because of value semantics. (Also, you wrote "performance costs of inopportune value semantics". So the correction is annoying; you know what I meant.)

        This stuff seeps in without the programmer realizing it because Go made a deliberate design decision to have automatic lifetime management. In other words, this is a feature of the language and not a bug. The only way for this to not seep in is if Go forced programmers to specify most lifetimes, which would make the language much more cumbersome to use.

        I.e., this is not a value-vs-reference semantics issue. It's a manual-lifetime-management vs automatic-lifetime-management issue. The solution is to either 1) write in a language with more explicit lifetimes if performance is that important, or 2) profile your code and reduce heap allocations, which is what the person who wrote the article did.

        • cbsmith 3 years ago

          Sorry for being annoying. I wasn't trying to be.

          I would disagree about this stuff seeping in because of automatic lifetime management. The equivalent code in Java, Smalltalk, etc., would either not have the performance problem or it would be much more obvious that the code wasn't as efficient as it could be.

          • tylerhou 3 years ago

            > Sorry for being annoying. I wasn't trying to be.

            Thanks, no worries.

            You're right that Java doesn't have this performance problem because it (apart from primitives) has no implicit copies. (So the code wouldn't copy, it would return a reference, hence no allocation.)

            I think what you're trying to say is: programs written in languages with value semantics can implicitly create more objects, which (in some cases) means additional allocation/lifetime tracking in languages with automatic lifetime management. So maybe one solution is to prevent copies in most circumstances, and when you do need to copy, make them explicit.

            But I think this conflicts with Go's philosophy as well. First, in highly concurrent programs, copies are necessary. A mutex/atomic will become a bottleneck; sharing is bad for performance. This means that copies should be part of most types in the language. (If they aren't, you'll run into issues like not being able to pickle when using multiprocessing in Python [1].)

            OK, so we need copies. But clearly, I shouldn't have to write `a = b.Copy()` if b is an integer. That's too verbose. And if everything has a `.Copy()` call, that leads to visual noise and it's not clear when copies are actually expensive. So what set of "blessed" types should be implicitly copyable? Java picks primitives, but this means it's basically impossible to create an int-like object with value semantics. I think this is bad. C++ is at the other extreme with "(almost) all objects are implicitly copyable" but some people dislike how forgetting a & on a std::vector type might lead to a huge copy.

            Go has chosen a reasonable middle ground -- dynamically sized types like slices and maps are not implicitly copyable, but "cheap" objects like structs are implicitly copyable. This means that when you see a Copy call, it's meaningful. But it also means you run into these situations where an implicit copy interacting with lifetimes/garbage collection causes a slowdown. I don't see any other alternative, though. Getting rid of implicit copies for cheap types would cause visual noise. Getting rid of GC makes your language harder to use. Both of these alternatives seem worse to me.

            To be clear, I'm not a huge fan of Go. Especially around the lack of monadic error handling. But I think they have done a decent job wrt. their design goals.

            [1] https://stackoverflow.com/questions/8804830/python-multiproc...

  • bitexploder 3 years ago

    Becoming good at Go is mostly knowing all the sharp edges of the built-in types, IMO. Of course goroutines and concurrency primitives are difficult to master, but that is a different beast, and if you understand concurrency from some other language, Go just makes it easy. But knowing all the behaviors of slices, and really intuiting how they work, makes your life a lot easier and a lot less prone to bugs. And slice behavior almost all comes down to what a slice is internally, and how and where I am copying things as I write my code. Generally the copying is fine, but in some cases it is not, or it is in a tight performance-critical section where you need to be thinking about it.

    Edit: Also, pointers into slices will probably leave you sad. You get a pointer into the slice's storage, not a pointer to the thing. And if the slice resizes, your pointer now references dead storage. Basically, pointers and slices are not friends. Unless you have a slice of pointers, which idiomatic Go avoids unless there is a decent reason for it :)
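
    Concretely (minimal sketch):

        package main

        import "fmt"

        func main() {
            s := []int{1, 2, 3}
            p := &s[0]        // points into the current backing array
            s = append(s, 4)  // len == cap here, so append moves to a new array
            *p = 99           // writes to the old array, not to s
            fmt.Println(s[0]) // still 1: the pointer and the slice diverged
        }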

  • Jemaclus 3 years ago

    I'm a dumb dumb. Can you define "implicit reference semantics" and "value semantics"? You use the phrases several times in your post, but I don't really understand what you mean. If it helps, I'm not a C++ programmer, but I am familiar with higher level languages like Go, Python, Ruby, PHP and Javascript.

    • TheRealPomax 3 years ago

      It's the difference between assigning/passing around "copies of the data" vs. assigning/passing around "the memory address for that data" under the hood.

      PHP, for example, has explicit references. If you have an `$arr1=array(1,2,3)` and an `$arr2 = $arr1`, that second array is a full copy of the first array, and updating $arr1 does nothing to $arr2. Similarly, `function update_array($arr) { $arr[0] = 'cake'; }` called with `$arr1` will create a copy of $arr1 for use inside that function. Any changes to `$arr` inside the function only apply to that copy, not to $arr1. Unless you explicitly tell PHP you want to work with references, by using `function update_array(&$arr) { ... }` instead. PHP uses value semantics.

      JS, on the other hand, uses implicit reference semantics. If you have a `const arr1 = [1,2,3];` then `arr1` is an alias, simply pointing to a memory location, and declaring a `const arr2 = arr1`, makes arr2 also be an alias for that same memory location. They both reference the same thing, but neither "is" the thing. Similarly, if you have a `function updateArray(arr) { arr[0] = 'cake'; }` then that `arr` is also just an alias pointing to the memory location of whatever you passed in, and changes to `arr` change the underlying data, so whether you try to access that through `arr1`, `arr2` and `arr`, it's the same thing. JS uses implicit reference semantics.

      (But note that many languages with implicit reference semantics make exceptions for immutable primitives like numbers or strings)

      • datalopers 3 years ago

        That's not technically correct with regards to PHP. Your statement that any changes to $arr1 or $arr2 only impact the one in question, however, is accurate. If no changes are made they still refer to the same data in memory. It's copy-on-write semantics.

            $arr1 = [1,2,3]; // $arr1 is a pointer to a zval array [1,2,3], refcount: 1
            $arr2 = $arr1;   // $arr1 and $arr2 point to the same zval array; refcount incremented to 2
            $arr2[] = 4;     // $arr2 creates a new zval array, copies over the data, sets its
                             // refcount to 1, and appends int(4); the [1,2,3] array has its
                             // refcount decremented to 1 as it's still referenced by $arr1

        [1] http://hengrui-li.blogspot.com/2011/08/php-copy-on-write-how...

        • johannes1234321 3 years ago

          Actually that is outdated since at least PHP 7. In modern PHP the engine uses copy on write only for "larger" things, but "small" things like integers or booleans live on the stack and are copied around, avoiding heap allocations etc.

        • morelisp 3 years ago

          This isn't semantics, though, but implementation. As far as I'm aware there's no way to "observe" the copy short of digging into runtime debugging tools, so I think it's still fair to say the language has pass-by-value semantics. In C++ we still refer to things returned-by-value as returned by value despite the reality of NRVO, etc.

        • masklinn 3 years ago

          > It's copy-on-write semantics.

          CoW is not semantics, it’s a way of implementing value semantics which avoids unnecessary defensive copies.

        • TheRealPomax 3 years ago

          true, good point.

      • unfunco 3 years ago

        PHP has implicit references too: objects are passed by reference by default (primitives by value, as your note says).

      • Jemaclus 3 years ago

        Thank you

    • gdwatson 3 years ago

      “Implicit reference semantics” means that variables ordinarily refer to objects rather than containing them. “Value semantics” means that variables contain values rather than references, though there are pointer types that let you explicitly store references when you want them. (Often this is discussed in terms of parameter passing, for historical reasons, but the same ideas apply to variables more generally.)

      If your language has implicit reference semantics, “x = y” will cause x and y to refer to the same object. If it has value semantics, x will be a copy of y.
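
      In Go terms (minimal sketch):

          package main

          import "fmt"

          type Point struct{ X, Y int }

          func main() {
              a := Point{1, 2}
              b := a  // value semantics: b is an independent copy
              b.X = 9
              fmt.Println(a.X) // 1: the copy changed, not a

              p := &a // explicit reference, spelled with &
              p.X = 9
              fmt.Println(a.X) // 9: mutated through the pointer
          }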

    • masklinn 3 years ago

      "Implicit reference semantics" = everything (or near enough) is implicitly a reference. That's what you get in Python, Ruby, or Javascript: when you pass something around, you're really passing a reference to it, a pointer. Everyone ends up referring to the same thing, and sharing it.

      "Value semantics": you pass the value itself around, which means you're shallow-copying it all the time. That's what you get when you pass or return a bare non-interface type in Go.

      PHP is a bit of a mess, objects are passed by reference (= implicit reference semantics) but arrays are passed by value, and you can opt them into pass-by-reference. You can also pass non-arrays by reference, though I don't think that's very common.

    • marcosdumay 3 years ago

          a = [1, 2]
          b = a
          b[0] = 3
          print(a)

      What does the above print? If the language implements reference semantics, it prints [3, 2]. If it implements value semantics, it prints [1, 2].

    • klodolph 3 years ago

      Languages like Python, Java, and C# have implicit reference semantics. When you create an object, you get a pointer to that object (usually, or commonly).

      In languages like C, C++, Go, Rust… references are more explicit. If you want a pointer to an object, you have to &, or something similar.

      It gets a bit fuzzy.

    • strifey 3 years ago

      If you're familiar with Python and Go, you'll likely be able to quickly spot the differences in how they handle parameter passing. Python uses references and Go uses values.

      • erik_seaberg 3 years ago

        Go maps and channels have reference semantics. Slices are passed and returned by value, but backing arrays are often shared (not copied) in an error-prone way, so it’s safest to pretend they have move semantics and treat only one copy as valid.
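
        For instance (minimal sketch):

            func mutate(m map[string]int, s []int) {
                m["k"] = 1       // visible to the caller: map values share state
                s[0] = 42        // visible: the copied slice header shares the backing array
                s = append(s, 7) // usually invisible: may reallocate, and the new
                                 // header exists only in this local copy of s
            }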

        IMHO they should have used pointers everywhere they didn’t want value semantics with deep copies.

    • scubbo 3 years ago

      Thank you for asking - I, too, was a dumb dumb.

  • brundolf 3 years ago

    Having used C++ relatively little, and a long time ago (and never used Go), I don't think I ever realized that value semantics could be used to implicitly clone heap objects

    Yeesh

    • masklinn 3 years ago

      Normally they’re not, that’s really specific to C++ nonsense.

      In a normal value-semantics system, if you have a pointer you just copy the pointer. Obviously if the language doesn't have your back it also means you've now fucked up your ownership, but you're in C so that was to be expected.

      • brundolf 3 years ago

        Yeah, this was my assumption when I read the original comment, so I was confused about what it was saying (I thought it was suggesting you should pass references instead of doing a bitwise-copy most of the time, which is just plausible enough that I believed it for a minute) until I read further down in the thread and realized they meant implicit heap-cloning

  • dimitrios1 3 years ago

    As a current Go programmer, hunting down escaping pointers via escape analysis is a common part of the routine now.
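
    The compiler reports these directly if you ask; for example (illustrative output, file and line numbers hypothetical):

        $ go build -gcflags='-m' ./...
        ./rules.go:42:9: &Rule literal escapes to heap
        ./rules.go:51:2: moved to heap: rule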

  • dleslie 3 years ago

    I would also lay a chunk of blame on the use of type inference which causes the information about the behaviour to be hidden from view.

    Had the author been looking at the type information within the syntax of the code, the profile output might not have been a surprise. Perhaps the problem would never have existed in the first place.

    • adrianmonk 3 years ago

      Yeah. Often, writing out the type explicitly is just busywork. But it seems it would have paid off here.

      If you were forced to stop and think what type to declare, I bet you'd write "var rule *Rule". Even if you don't think deeply and just look at the return type.

      And then if you assigned "r[i]" to "rule", you'd get a type error.

amluto 3 years ago

Somewhat off topic, but I find a different part of this to be quite ugly:

    if match || err != nil {
        return rule, err
    }
Translating this code to actual logic takes too much thought and is too fragile. Is that an error path or a success path? It’s both! The logic is “if we found a rule or if there was an error then return a tuple that hopefully indicates the outcome”. If any further code were to be added in this block, it would have to be validated for the success and the error case.

But this only makes any sense at all if one is okay with reading Go result returns in their full generality. A canonical Go function returns either Success(value) or Error(err not nil, meaningless auxiliary value). And this code has “meaningless auxiliary value” != nil! In fact, it’s a pointer that likely escapes further into unrelated error handling code and thus complicates any kind of lifetime or escape analysis.

I don’t use Go, but if I did, I think this part of the language would be my biggest peeve. Go has very little explicit error handling; fine, that’s a reasonable design decision. But Go’s error handling is incorrectly typed, and that is IMO not a reasonable design.

  • ericbarrett 3 years ago

    I write a lot of Go, and I agree that this is a big wart in its error handling that would be solved by a proper Result type.

    Nevertheless, the convention is that if a function returns (value, err), and err != nil, the value is discarded (I think of it as "undefined"). So the code is conventional.

    • amluto 3 years ago

      In C, “discarding” a pointer in a way that leaves the value visible is quite common. At least if one doesn’t accidentally use the pointer, it’s harmless. (In the way that all manner of unsafeness is harmless in C as long as no actual UB occurs, which is to say it’s not great.)

      But Go is a garbage collected language, and there is no such thing as “discarding” a pointer. Either it’s there or it isn’t, and this kind of leak has side effects. I find it baffling that the language designers and the community consider this acceptable.

      (One thing I really like about Rust is that you can’t play fast and loose with lifetimes like this. If you have function taking &'a Vec<T> and you return &'a T, you can’t arbitrarily “discard” and leak that reference up the call chain. You need to genuinely get rid of it by the time the vector is gone.)

      • foldr 3 years ago

        >and this kind of leak has side effects

        Only if the calling function does something with the pointer (which it generally won't, if err is non-nil). If the calling code does do something with the pointer even when err is non-nil, then either

        (a) the value of the pointer is meaningful, and this is fine; or

        (b) there's a logic error in the calling code, which is a far more serious bug than a potential memory leak.

        So I don't really see the problem here.

        • amluto 3 years ago

          I should have been more explicit. Go is garbage collected. If you keep a pointer live, then it’s not garbage, which prevents it from being collected.

          At least Go doesn’t use RAII, so inadvertently pinning a pointer won’t keep a socket open or a lock held.

          • foldr 3 years ago

            Just returning the pointer doesn't keep the value that it points to live, unless the calling function does something with the pointer.

            The following toy example doesn't leak memory. Are you thinking of some other pattern that would leak memory? If so, what is it? I can only think of patterns where a logic error is also involved. That is, where the meaningless value of 'val' is used in code paths where err != nil.

                func alwaysError() (*int, error) {
                    var dummy int
            
                    // We return &dummy even though it's a meaningless value.
                    // Will this cause a memory leak?
            
                    return &dummy, fmt.Errorf("An error")
                }
            
                func caller() {
                    val, err := alwaysError()
                    if err != nil {
                        fmt.Printf("Error\n")
            
                        // It won't! Because the value pointed to by 'val'
                        // can be GCed from this point on.
                        return
                    }
            
                    // never get here        
            
                    fmt.Printf("Value %v\n", val)
                    //
                    // ...lots of code that uses *val...
                    //
                }
            
            You might think that you'd get a leak if the error branch was also long-lived, but Go's GC seems to be precise with respect to conditional branching (as far as I can tell by experimenting with runtime.SetFinalizer). That is, as long as the error branch doesn't refer to 'val', then the value pointed to by 'val' can be collected once the error branch is entered (and before 'caller' returns).
            • amluto 3 years ago

              I haven't actually played with this, but if the same pattern of lazily (sloppily?) returning arbitrary pointers along with errors continues, the lifetime ought to be extended arbitrarily. The caller could itself return val, err, and so on up the chain.

              • foldr 3 years ago

                I'm not seeing how that would cause a memory leak. It just means that the value might be GCed a bit later. Could you give an example?

                • amluto 3 years ago

                  It’s a temporary leak. My point is that it’s an actual side effect, unlike how in C it has no effect whatsoever if the value is ignored.

                  In a language where destruction of an object often releases external resources (C++, for example, or Python or Rust to a lesser extent), this effect would be more dramatic. Imagine someone doing this in Python:

                      def func():
                        f = tmpfile(…)
                        err = do something
                        return f, err
                  
                  This style would be absurd and wrong — it keeps f alive too long.
    • za3faran 3 years ago

      What I found frustrating is that there is nothing stopping a function from returning both a non-nil value and a non-nil error. I've seen weird code that did that, and it could easily have resulted in incorrect behavior.

      • klodolph 3 years ago

        The one place where you’ll actually see this is in Read().

        If you try to read 1000 bytes, you might get “800 bytes read, EOF” as a result. The worst part is that os.File won’t ever do this!
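
        The io.Reader contract is explicit about this: callers should process the n > 0 bytes before looking at the error. A minimal sketch of the canonical loop (r is any io.Reader; process is a placeholder):

            buf := make([]byte, 1000)
            for {
                n, err := r.Read(buf)
                if n > 0 {
                    process(buf[:n]) // handle the data before inspecting err
                }
                if err == io.EOF {
                    break // clean end of stream, possibly alongside n > 0
                }
                if err != nil {
                    return err // a real failure
                }
            }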

        • ok_dad 3 years ago

          I really hate when people use "errors" for signals. Or, alternately, use the signal mechanism for actual errors. An error should be "something went wrong", and reading a file to the EOF is not "going wrong"; that's just what "read()" should do if you tell it to!

          I like Go's multiple returns and error checking by default, but it definitely should have been implemented with some sort of "Result/Error" type union instead. Go made so many good decisions that I am frankly amazed they made this one really bad one.

          • robotresearcher 3 years ago

            If you take the position that the semantics of read() are 'read data from this endless stream', then EOF is an error indicating that this model no longer applies.

            Doesn't bother me at all.

            • account42 3 years ago

              Except in that model it would be fine to return an error as soon as you know there is an EOF, while in the real world you do want to read right up to the EOF, because EOF is usually not an error condition but expected, even with streams of indeterminate length.

          • bouncycastle 3 years ago

            But that's what "errors" are: they're signals.

        • nemo1618 3 years ago

          Presumably this is for performance reasons: if a `Read` call involves, say, a network request, then returning "800, EOF" in one roundtrip is certainly nicer than "800, nil" -> "0, EOF" in two roundtrips. In practice... I suspect this happens rarely enough that it's not worth the cost. The performance edge case could be handled with an alternative API, e.g. an optional EOF() method.

  • slcjordan 3 years ago

    They should probably replace that with 2 if statements to make the error path and the non-error path obvious:

        if err != nil {
            return nil, err
        }
        if match {
            return rule, nil
        }
  • foldr 3 years ago

    The regular return value doesn’t have to be meaningless just because the error is non-nil. In this case the function returns the rule that triggered the error, which is potentially useful information. I don’t think it is an official Go convention that err being non-nil entails that the other return value should be meaningless.

    • za3faran 3 years ago

      It's quite error-prone; most golang code is `if err != nil { return nil, err }`.

      Now of course it's important to read the documentation, but a language with sum types (or exceptions) would have used a separate type to indicate an error condition plus useful information on that error.

      • foldr 3 years ago

        Right, but Go doesn't have sum types, and I think it's a mistake to interpret Go's error handling conventions through that lens. It's perfectly fine to return an error together with a meaningful value. At least, I haven't been able to find any official Go docs suggesting otherwise. You might wish that Go had sum types, but that doesn't mean that it's an actual Go coding convention to pretend that (err, X) tuples are sum types.

      • adamwk 3 years ago

        That only works if a function is returning a pointer, though, so I don't think it's right to say "most golang code." Any function that returns a string or any struct by value and can fail doesn't return nil.

enedil 3 years ago

Went from 4.139s to 2.413s. I fail to see how it is 70%. I think it is explained as 4.139/2.413 ≈ 1.7, which of course doesn't make sense here.

  • CodesInChaos 3 years ago

    I do think saying it's 71% faster makes sense here, since "x% faster" and "speed increased by x%" mean the same thing. This reduces the runtime by 42%, but that doesn't mean it's just 42% faster.
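
    Concretely, with the article's numbers:

        4.139 / 2.413 ≈ 1.715      → ~71% more work per unit time ("71% faster")
        1 - 2.413 / 4.139 ≈ 0.417  → ~42% less runtime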

  • bmicraft 3 years ago

    It would probably be more accurate to say it can do 70% more stuff in the same time. Or that it takes 42% less runtime.

    • xnorswap 3 years ago

      But 70% more stuff in the same time is 70% faster.

      • OJFord 3 years ago

        Let's say you do 10 things in 100 time (100/10 = 10 time per thing) to start with:

        70% more stuff in the same time is 17 things in 100 time (100/17 = 5.9 time per thing);

        70% faster is 10 things in 30 time (30/10 = 3 time per thing) - or, I would argue incorrectly, perhaps said to mean 10 things in 70 time (70/10 = 7 time per thing).

        • LanceH 3 years ago

          90mph is 50% faster than 60mph. 17/time is 70% faster than 10/time.

          You're describing "in 70% less time", not "70% faster".

  • morelisp 3 years ago

    This is an extremely common mistake in reporting performance numbers. That the old version is 70% slower does not make the new version 70% faster.

    • wizofaus 3 years ago

      70% slower is a bit ambiguous though - it could mean 70% extra runtime or it could mean 30% of the new speed. Whereas 70% faster would always suggest to me that it can do 70% more work in the same amount of time, i.e. a 1.7x increase in speed.

      • enedil 3 years ago

        I do not agree. When you benchmark, you usually measure the differences between the times needed to complete. This is because it is highly non-obvious that if you double the workload, the time also doubles. Perhaps the algorithm is not linear. Perhaps if you have more data, you suddenly need to swap memory. Perhaps something (like disk access in parallel) means that it actually takes less than 2x the time. This means that a concept of speed per unit of work is undefined. So the only reasonable interpretation of "70% faster" is "takes 30% of the original time".

        • wizofaus 3 years ago

          I can categorically state I've never thought of or understood 70% faster as meaning that, and certainly not 100% faster as meaning "completes instantly".

          I see the OP has solved the problem by removing any reference to how much faster it is from the article title!

          You're right about non-linear algorithms though. If an O(n^2) algorithm is 2x / 100% faster, it can't process 100% more items in the same time, but I'd understand it to mean taking half the time for the same n.

        • yongjik 3 years ago

          But under your definition, if something becomes "three times as fast" (i.e., 200% faster), it will have to finish its task in negative time!

  • hcmOP 3 years ago

    Doh! Thanks for pointing out another silly mistake – I'll fix that.

    • xnorswap 3 years ago

      You had it right the first time, 1.7x speed is 70% faster.

      If something that previously took 4s now takes 2s, then it's 100% faster.

      Think of driving 10miles. If you drive at 20mph then it takes 30 minutes. If you drive twice as fast, 40mph, it takes 15 minutes.

      40mph is 100% faster than 20mph.

      Half the time is twice as fast!

    • wizofaus 3 years ago

      I disagree, I think the 70% is right, and matches what you still describe as a 1.7x speed increase. If it originally took 4 seconds and now takes 2, I'd call that a 100% speed increase, i.e. twice as fast.

    • markoman 3 years ago

      @hcm: Would have loved to see the 'after' flamegraph just for comparison purposes! I'm still getting used to grokking flamegraphs when optimizing. They're a somewhat new tool, IMO.

  • barbegal 3 years ago

    It's gone from 14.5 operations per minute to 24.9 operations per minute, so a 70% speedup.

gp 3 years ago

I was trying to debug and improve the performance of some parallelized C++ code over the weekend for parsing CSV files. What would happen was that parsing each file (~24k lines, 8 columns) took 100ms with one execution context, but when the work was split across many threads, each thread slowed down proportionally and the throughput of the whole program strictly decreased as the thread count increased.

I tried all of the obvious things, but the offender ended up being a call to allocate and fill a `struct tm` object from a string representation of a date. There's no obvious reason (to me) why this would cause cache invalidation, etc., so I'm a little in the dark.

Still, replacing this four-line block improved single-threaded performance by 5x and fixed the threaded behavior, so on the whole it is now ~70x faster and parses about 400MB of CSV per second.

  • Thorrez 3 years ago

    Maybe date-related code calls out to the operating system to find your time zone, and maybe that can't be done in parallel.

  • Quekid5 3 years ago

    False sharing, maybe?

lanstin 3 years ago

That seems like a potential compiler optimization. It should already know that the rule value is only used once, as the target of a &, and this pattern must be somewhat common in managing return values.

  • ithkuil 3 years ago

    The semantics change. You're now returning a pointer to the actual Rule in the Ruleset, while before you'd be returning a pointer to a copy of the Rule.

    The optimization would only work if you had a way to tell the compiler that some values are constant/immutable.

    • ithkuil 3 years ago

      BTW; I'm using both Go and Rust lately.

      In Rust you can write a function that returns the pointer of one element of a slice. You can also write a function that returns the pointer to a heap-allocated copy of an element of the slice. The two functions would have different signatures.

      The compiler would also prevent mutation of the slice as long as there are any references to individual elements of the slice being passed around.

      • ok123456 3 years ago

        >In Rust you can write a function that returns the pointer of one element of a slice.

        Have fun fighting the borrow checker on that one.

        • oconnor663 3 years ago
          • ok123456 3 years ago

            Now try to actually do that in a larger program without static lifetimes...

            • piperswe 3 years ago

              It shouldn't be too bad as long as you keep in mind that it is a reference to an item in that slice, so whatever that slice is pointing to needs to stick around as long as you're using elements from it. I don't often encounter borrow checker issues anymore, because once you program Rust long enough you know what things need to live for what lifetimes, and you architect your "larger programs" around that.

        • ithkuil 3 years ago

          Yes and that's the whole point!

          The borrow checker ensures that you either do it right or you find another way that is safe.

          For example it may be totally fine to heap-allocate a result; not all programs need to be optimized to the bone. Just return a boxed value. If you need to share it, wrap it in an Rc or Arc.

          One problem I see with it is that Rust chooses to make the most efficient way of programming also the most "simple looking", and so it lines up incentives in such a way that people will unnecessarily try to avoid using an extra Arc just because it looks like clutter.

          When you learn to embrace your boxing you'll learn that the borrow checker is not your enemy.

    • lanstin 3 years ago

      Oh yeah. Too bad. Just ran the escape-to-heap analysis on my current project; not looking too promising. Mostly it is allocating structures and saving them in a huge in-memory hash.

  • oconnor663 3 years ago

    I think the optimization is only valid if we know that nothing is ever going to use the returned pointer to do mutation.

  • kgeist 3 years ago

    I don't think it can be optimized without altering semantics. If it's a pointer to a value in the slice, changing the slice's values (ruleset[i] = ...) will be reflected in all Rule values returned from the function, because they all point to the same memory. In the same way, changing the returned value's fields will change the data in the original slice. The author's code is prone to this behavior after the change.

    When it's a pointer to a copy, no such implicit dependencies occur.

    • derefr 3 years ago

      You're not wrong in general, but one interesting thing about Go as an ecosystem (rather than as a language) is that golang programs are mostly statically compiled — all sources, one pass, one code unit, one object-code output — so they're (theoretically) very amenable to compile-time (rather than link-time) Whole-Program Optimization techniques.

      In this specific case, that technique would be whole-program dataflow analysis. Given a Golang function that passes out references-to-copies-of owned data, you could actually determine for certain — at least in the default static-binary linkage mode — whether these two properties hold universally within the resulting binary:

      1. whether no caller of the function will ever try to do anything that would cause data within their copy of the struct to be modified;

      2. whether the owner of the data will never modify the data of the original struct in such a way that, if the copy were elided, the changes would be "seen" by any reads done in any of the callers. (The owner could still modify internal metadata within the struct for its own use, as long as such internal metadata is 1. in private fields, 2. where all callers live outside the package defining the struct, making those fields inaccessible; and 3. the fields are never accessed by any struct methods called by borrowers of the struct — keeping in mind that such methods can be defined outside the package by caller code.)

      If you could prove both of these properties (using dataflow analysis), then you could safely elide the copy within the function, turning the return of a reference-to-a-copy-of-X into a return of a reference-to-X.

      (And, in fact, if you can only prove the second property universally, and the first property in specific instances, then you can still elide the copy from the function itself; but you'd also generate a wrapper function that calls said function [receiving a reference-to-X], copies, and so returns a reference-to-a-copy-of-X; and then, for any call-site where the first property doesn't hold — i.e. callers whose transitive call-graph will ever modify the data — you'd replace the call to the original function with a call to the wrapper. So "safe" connected caller sub-graphs would receive references, while "unsafe" connected caller sub-graphs would receive copies.)

  • runevault 3 years ago

    Nah, an optimization is dangerous, as others have said. A lint that detects oversized copies could be worthwhile though.

jimsmart 3 years ago

From the headline alone, I guessed this was to do with pointers/references to values vs values themselves.

Yep — with values that take a lot of memory, it's faster to pass pointers/references around than to pass the values themselves, because there are fewer bytes to copy.

Of course, there is more to such a decision than just performance: if the code makes changes to the value which are not meant to be persisted, then one wants to be working with a copy of the value, not a pointer to it. So one should take care when simply switching some code from values to pointers-to-values.

All of these are things that coders with more experience of languages that use such semantics kinda know already, almost as second nature, since the first day they got caught out by them. But everyone is learning, to various degrees, and we all have to start somewhere (i.e. knowing little to nothing).

hoosieree 3 years ago

> You can see these decisions being made by passing -gcflags=-m to go build:

That's a very nice feature! I wonder if compilers for other languages have something similar.
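
For reference, here's a minimal sketch of the kind of code that triggers the diagnostic (the exact message wording varies by Go version, but current compilers report things like "moved to heap"):

    package main

    type Rule struct{ pattern string }

    func pick(rules []Rule) *Rule {
        r := rules[0] // local copy
        return &r     // go build -gcflags=-m reports: moved to heap: r
    }

    func main() {
        _ = pick([]Rule{{pattern: "*"}})
    }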

  • masklinn 3 years ago

    You'd have to see if all compilers support it, but LLVM has a "remarks" system, which should provide similar information (though likely a lot more of it) for optimization passes which are traced: https://llvm.org/docs/Remarks.html#introduction-to-the-llvm-...

    The frontend may or may not have its own optimizations and logs though; e.g. rustc now has MIR optimizations (https://rustc-dev-guide.rust-lang.org/mir/optimizations.html), but while you can dump MIR data (https://rustc-dev-guide.rust-lang.org/mir/debugging.html) I don't remember seeing an optimisation log.

    At the end of the day, I think it's more likely that you take a look at the assembly and infer problems from there if the profiler doesn't tell you straight. Another difference is the kind of decisions the compiler makes: e.g. while a compiler can optimize away allocations in "manual allocation" languages (https://godbolt.org/z/5nEo7xjEr), the allocations are plainly visible, so if they're trivially avoidable... you'll just avoid them.

    Using Rust as an example, you'd have something like this:

        pub fn match_(&self, path: &str) -> Result<&Rule, Error> {
            for rule in self.0.iter() {
                if rule.match_(path)? {
                    return Ok(rule);
                }
            }
            Err(Error)
        }
    
    You couldn't miss an allocation, because the return type would have to change, and you'd need to perform the copy out:

        pub fn match_(&self, path: &str) -> Result<Box<Rule>, Error> {
            for rule in self.0.iter() {
                if rule.match_(path)? {
                    return Ok(Box::new(rule.clone()));
                }
            }
            Err(Error)
        }
  • _old_dude_ 3 years ago

    In Java

        -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining
    
    and

        -XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation
    
    The generated logs can be seen with JITWatch [1].

    [1] https://github.com/AdoptOpenJDK/jitwatch

  • throwaway894345 3 years ago

    Does anyone with VS Code or vim plugin development experience know how hard it would be to call this behind the scenes and highlight allocation points in the editor?

Beltalowda 3 years ago

The deeper lesson here is "don't use pointers unless you're sure you need them". I've seen quite a few people use pointers for no reason in particular, or there's simply the assumption it's faster (and have done this myself, too), but it puts a lot more pressure on the GC than simple local stack variables.

Of course sometimes pointers are faster, or much more convenient. But as a rule of thumb: don't use pointers unless you've got a specific reason for them. This applies even more so if you're creating a lot of pointers (like in a loop, or a function that gets called very frequently).

  • throwaway894345 3 years ago

    Eh, I've waffled a couple of times between "pass values by default" and "pass pointers by default". Ultimately, I don't think there's a really good answer except to understand escape analysis and get comfortable with the escape analyzer. Notably, "using pointers" doesn't inherently put pressure on the GC, but rather allocations put pressure on the GC and there are plenty of instances where pointers don't put pressure on the GC at all (notably, if you're passing data into a function by pointer, it's not going to allocate, but it may if you're returning a pointer to "a stack-allocated value").

    Notably, if you're defaulting to values, you may still have a bunch of allocations when you try to implement interfaces, which usually (always?) requires implicitly boxing values; however, if you pass a pointer into a function that takes an interface, I don't think it gets moved to the heap (but I'm not sure, which is why Go programmers need to be comfortable with profiling the escape analyzer and also why it would be great if Go actually had explicit semantics for allocations).
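
    A hedged sketch of the boxing point (names made up; whether a given conversion actually allocates also depends on escape analysis and the Go version):

        type Matcher interface{ Match(string) bool }

        type Rule struct{ pattern string }

        func (r Rule) Match(p string) bool { return r.pattern == p }

        func check(m Matcher) bool { return m.Match("x") }

        func demo() {
            r := Rule{pattern: "x"}
            check(r)  // converting the value to Matcher typically heap-allocates a copy
            check(&r) // stores the pointer in the interface instead; no copy, but
                      // taking r's address may itself cause r to escape
        }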

    • kevincox 3 years ago

      > notably, if you're passing data into a function by pointer, it's not going to allocate

      Isn't it still possible for the value to escape here? For example, the callee could stick it in a global data structure.

      In fact it seems like a pointer passed to a function would need to be on the heap "by default" unless the compiler can prove that it doesn't escape.

  • dist1ll 3 years ago

    > but it puts a lot more pressure on the GC than simple local stack variables.

    Do you have evidence for this claim? AFAIK the Go compiler does escape analysis, and allocates pointers that don't escape on the stack.

    • throwaway894345 3 years ago

      This is true, but it's hard to tell if a pointer escapes or not without actually profiling. That said, I don't think the answer is to avoid pointers, but rather to get comfortable with profiling the escape analyzer. By default, I just stick to the subset of Go which I know won't cause escapes: functions can take pointer parameters, but I'm very careful about returning pointers to data that would otherwise be stack-allocated (even though it's not especially idiomatic, I'll often prefer mutating an `out T` parameter rather than returning a `T` because I know the former will not allocate).
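
      A sketch of that style, with hypothetical names (whether the first version really allocates depends on inlining and escape analysis):

          // May heap-allocate: the returned pointer escapes the function.
          func parse(s string) (*Rule, error) {
              r := Rule{pattern: s}
              return &r, nil
          }

          // Usually doesn't allocate: the caller owns the memory and can keep
          // it on its own stack.
          func parseInto(s string, out *Rule) error {
              out.pattern = s
              return nil
          }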

  • kccqzy 3 years ago

    That's not a deeper lesson. A deeper lesson is to understand where the pointer points and then decide accordingly.

    • Beltalowda 3 years ago

      That is pretty much what I said, just phrased differently.

      • throwaway894345 3 years ago

        It sounded to me like you were saying "avoid pointers by default" (bad advice IMO) rather than "if it matters, verify whether the pointer escapes" (good advice IMO).

  • bitwise101 3 years ago

    Most of the time I use a pointer just because I want to return nil in case of error. Is this a valid reason?

chubot 3 years ago

FWIW, to prevent the bug where a = b is slow for big types, Google's C++ style guide used to mandate DISALLOW_COPY_AND_ASSIGN (which used to be DISALLOW_EVIL_CONSTRUCTORS, I think) on all types (most types?).

Looks like that's been gone for a while in favor of C++11 stuff, which I don't really like:

https://google.github.io/styleguide/cppguide.html#Copyable_M...

A lot of good software was written in that style, but it has grown bureaucratic over time as the C++ language evolved.

sendfoods 3 years ago

1 character, in 2 places ;) I did not know profiling support for Go was so seamless, thank you!

May I ask, is that theme custom or available somewhere? I really enjoyed it

  • gwd 3 years ago

    > 1 character, in 2 places ;)

    Moving a single character from one place to another. :-)

    A good explanation of why "fire the developers with the lowest 50% of lines added" is an idiotic thing to do: this sort of deep analysis takes a lot of time and expertise, and frequently results in tiny changes.

  • hcmOP 3 years ago

    Thanks! It's just a few dozen lines of CSS. The body font is Inter and the monospaced font is JetBrains Mono.

  • blowski 3 years ago

    > is that theme custom or available somewhere

    Looks a bit like https://newcss.net/ or Water CSS

is_taken 3 years ago

Would be interesting to see the performance difference if you undo that move-&-change and change the function signature from:

  func (r Ruleset) Match(path string) (*Rule, error)
to:

  func (r *Ruleset) Match(path string) (*Rule, error)
  • masklinn 3 years ago

    Likely none: Ruleset is

        type Ruleset []Rule
    
    The original code creates a local copy of a rule and explicitly returns a pointer to that. Taking the ruleset by address wouldn't change that issue.
lxe 3 years ago

This is the kind of stuff that the compiler needs to really understand. If all this dereferencing and referencing magic is under the control of the user, it needs to have a meaningful effect on what the code does. Otherwise we might as well just write C.

  • silisili 3 years ago

    The compiler does understand it and did what was asked - it was just written rather poorly.

    There are valid use cases for wanting to take a copy, and then pass along a pointer of the copy. Perhaps to go through a series of modification methods that don't touch the original. I'd sure hate it if the compiler tried to outsmart me on that and changed the behavior away from what I'd written.
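
    For instance, a sketch of that copy-then-mutate pattern (normalize is a hypothetical pointer-receiver method):

        r := rules[i]  // deliberate copy
        r.normalize()  // mutations touch only the copy
        return &r, nil // rules[i] is untouched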

tonymet 3 years ago

Overall a good review of profiling tactics. But there’s nothing egregious about Golang here. Pass-by-value vs. pass-by-reference is a common performance issue.

  • masklinn 3 years ago

    > But there’s nothing egregious about Golang here. Pass-by-value vs. pass-by-reference is a common performance issue.

    The trap here is that everything is passed by reference (pointer), but the intermediate local value is, well, a value (a copy).

    Rule is not a gigantic monster struct (it's 72 bytes), chances are returning it by value would not have been an issue.

    Anyway I would say there is an issue with Go here: it's way too easy to copy out of a slice.
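
    For instance, the range form quietly copies every element, while indexing and taking the address does not:

        for _, r := range rules { // r is a fresh copy of each element
            _ = r
        }

        for i := range rules {
            r := &rules[i] // no copy; r points into the slice
            _ = r
        }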

infamousclyde 3 years ago

Thank you for sharing. I'm curious if you would recommend any good resources for profiling with Go. I enjoyed your code snippets and methodology.

stephen123 3 years ago

Great post. I always feel smart when I find these kinds of optimisations. Then I wonder why the compiler isn't smarter, so I don't have to be.

erdaniels 3 years ago

Is there any nice tooling / static analysis for golang that instruments the build process to add all the gcflags with verbose output and gives you hints as to what can be optimized?

renewiltord 3 years ago

Clear tutorial of how to go about identifying this. Good blog post. Since the problem was constrained and real, it helps someone know when to use these tools. Thank you for sharing.

cratermoon 3 years ago

This has made me go back through all the Go I've written recently and look at the & uses.

amtamt 3 years ago

It falls within that critical 3% of code one should still think about while otherwise not optimizing prematurely.

notpushkin 3 years ago

Well, technically it's either a 2-character or 0-character change! :-)

AtNightWeCode 3 years ago

So, this is very basic Go design, and you could write something about how it works in C and Go and why an older lang like C doesn't have this prob, but at the end of the day the Go fanclub will downvote the hell out of you no matter what.

  • AtNightWeCode 3 years ago

    The Go compiler is garbage by design. A 20-year-old C compiler does not have this prob. This is also why Go has declined so much during the last couple of years. The benefits of Go have not increased and most of the quirks are still there. Like the error handling, the naive compiler, and the syntax sugar that somewhat hides the diff between pointers and direct heap allocs.


    • kosherhurricane 3 years ago

      I work on a code base that is a mixture of Go and C.

      It's IO, CPU and Memory hungry, and it's distributed.

      C is fast because it's close to how CPU and memory actually work. Go gives you 95+% of that, plus an easy-to-learn, easy-to-use language. A new person could start contributing useful features and bug fixes immediately. A senior person could get C-level performance.

      More and more of our code is moved from C to Go, with very little performance penalty, but with a lot more safety and ease of use.

      Our customers benefit, and our company makes more money.

      In the end, that's what software is about.

      • dist1ll 3 years ago

        > C is fast because it's close to how CPU and memory actually work.

        Out-of-order execution, cache hierarchies, branch prediction, virtual memory, pipelining, vector instructions, ILP, NUMA are all pretty transparent to the C spec.

        Trying to accommodate hardware quirks with C feels like blackbox engineering. It's certainly better than with managed languages, but still....

        • kosherhurricane 3 years ago

          C is an abstraction, and most (all?) CPU architectures pretend to be C machines.

          I don't know of any other way of manipulating the CPU as closely as C and assembly. Do you?

      • AtNightWeCode 3 years ago

        I do prefer Go over C in most cases and it's not like there are no pitfalls in C. It is just disappointing when a lang doesn't improve over time and has the same stupid problems year after year.

      • YesThatTom2 3 years ago

        Here’s an article that refutes your claims. https://queue.acm.org/detail.cfm?id=3212479

        • kosherhurricane 3 years ago

          I know about this article. Thank you.

          I said C is "close to" how a computer works. C is an abstraction, and most (all?) cpu architectures pretends to be c machines. There was a talk (Kelly?) where even assembly is nowhere close to what a cpu actually does.

          I don't know of any other way of manipulating the cpu as closely as C and assembly. Do you?

    • YesThatTom2 3 years ago

      C is not a good match for modern CPUs. https://queue.acm.org/detail.cfm?id=3212479

    • fear91 3 years ago

      Can you show any examples of "go compiler being garbage"? In my experience, it often generates much smarter code than C# or Java.
