The 185-Microsecond Type Hint

blog.sturdystatistics.com

70 points by kianN 20 hours ago · 13 comments

zahlman 14 hours ago

> the compiler had enough static information to emit a single arraylength bytecode instruction.

I'm skeptical.

If it can prove that the input actually matches the hint, then why does it need the hint?

If it can't, what happens at runtime if the input is something else?

> We replaced a complex chain of method calls with one CPU instruction.

JVM bytecodes and CPU instructions really shouldn't be conflated like that, although I assume this was just being a bit casual with the prose.

  • mkmccjr 14 hours ago

    Thank you for this! On the second point, you are absolutely correct; that was sloppy writing on my part. I will correct that in the post.

    I'm not certain I understand your first point. When I add the type hint, it's me asserting the type, not the compiler proving anything. If the value at runtime isn't actually a byte array, I would expect a ClassCastException.

    But I am new to Clojure, and I may well be mistaken about what the compiler is doing.
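[A minimal Java sketch of the semantics mkmccjr describes; the class and method names are mine, not from the post. The cast compiles to a single `checkcast` bytecode, and a value of the wrong type fails at runtime with ClassCastException, exactly as asserted above.]

```java
// A type assertion on the JVM: the cast compiles to `checkcast`,
// and a mismatched value surfaces as ClassCastException at runtime.
public class CastDemo {
    static int lengthOf(Object o) {
        return ((byte[]) o).length;  // checkcast byte[], then arraylength
    }

    public static void main(String[] args) {
        System.out.println(lengthOf(new byte[5]));  // prints 5
        try {
            lengthOf("not an array");
        } catch (ClassCastException e) {
            System.out.println("ClassCastException, as expected");
        }
    }
}
```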

    • zahlman 13 hours ago

      I mean, I think that probably is what happens. But then, while it's sped up a lot, the generated bytecode presumably also includes instructions to try the cast and raise that exception when it fails.

      (And the possible "JIT hotpath optimization" could be something like, at the bottom level, branch-predicting that cast.)

      • mkmccjr 13 hours ago

        Aha, I think I better understand your point: since the generated bytecode includes a cast, my explanation about the optimization is too simplistic.

        I haven't actually inspected the emitted bytecode, so I was only reasoning from the observed speedup.

        Your point about branch prediction is really interesting; it would explain how the cast becomes almost free once the type is stable in the hot path.

        I'm learning a lot from this thread -- thank you for pushing on the details!

        • EdNutting 9 hours ago

          Without seeing the actual differences in the bytecode it will be hard to tell what’s really going on. From my experience with other JITs, I’d expect the situation to be something like:

          A) Without the typecast, the compiler can’t prove anything about the type, so it has to assume a fully general type. This creates a very “hard” bytecode sequence in the middle of the hotpath which can’t be inlined or optimised.

          B) With the typecast, the compiler can assume the type, and thus only needs to emit type guards (as suggested in this thread). However, I’d expect those type guards to be lifted through the function as far as possible - if possible, the JIT will lift them all the way out of the loop if it can, so they’re only checked once, not on every loop iteration. This enables a much shorter sequence for getting the array length each time around the loop, and ideally avoids doing type/class checks every time.

          This would avoid pressuring the branch predictor.

          Most JITs have thresholds for “effort” that depend on environment and on how hot a path is measured to be at runtime. The hotter the path, the more effort the JIT will apply to optimising it (usually also expanding the scope of what it tries to optimise). But again, without seeing the assembly code (not just bytecode) of what the three different scenarios produce (unoptimised, optimised-in-test, optimised-in-prod) it would be hard to truly know what’s going on.

          At best we can just speculate from experience of what these kinds of compilers do.
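[To make EdNutting's point (B) concrete, here is a hand-written Java analogue of the guard-hoisting transformation; the names are illustrative, and a JIT performs the equivalent rewrite automatically when it can prove the type is stable.]

```java
// Hoisting the type guard: check the type once before the loop, so
// the loop body is a plain field read rather than a per-iteration check.
public class GuardHoisting {
    // Naive: the cast (type guard) sits inside the loop.
    static long sumNaive(Object o, int iters) {
        long total = 0;
        for (int i = 0; i < iters; i++) {
            total += ((byte[]) o).length;  // guard repeated every iteration
        }
        return total;
    }

    // Hoisted: one checkcast up front, then a guard-free loop body.
    static long sumHoisted(Object o, int iters) {
        byte[] arr = (byte[]) o;  // single guard, lifted out of the loop
        long total = 0;
        for (int i = 0; i < iters; i++) {
            total += arr.length;  // plain arraylength, no type check
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] data = new byte[3];
        System.out.println(sumNaive(data, 10));    // 30
        System.out.println(sumHoisted(data, 10));  // 30
    }
}
```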

          • imtringued 7 hours ago

It's speculation because the author didn't show the bytecode, or even just what the code decompiles to in Java.

            But even with speculation, it shouldn't be that surprising that dynamic dispatch and reflection [0] are quite expensive compared to a cast and a field access of the length property.

            [0] https://bugs.openjdk.org/browse/JDK-8051447
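[For reference, the two paths being compared look roughly like this in Java; the method names are mine. Both return the same value, but the reflective path dispatches on the runtime type, while the second is just a cast plus a field read.]

```java
import java.lang.reflect.Array;

// Reflective array-length access vs. a cast plus a direct field read.
public class LengthPaths {
    static int reflectiveLength(Object o) {
        return Array.getLength(o);   // reflection: dispatches on runtime type
    }

    static int directLength(Object o) {
        return ((byte[]) o).length;  // checkcast + arraylength
    }

    public static void main(String[] args) {
        byte[] buf = new byte[42];
        System.out.println(reflectiveLength(buf));  // 42
        System.out.println(directLength(buf));      // 42
    }
}
```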

            • mkmccjr 35 minutes ago

              These are both great points. When I wrote the post, it didn't occur to me that I could inspect the emitted bytecode. In hindsight, including that would have made the explanation much stronger.

              To be honest, this is my first time really digging into performance on a JIT runtime. I learned to code as an astronomy researcher and the training I received from my mentors was "write Python when possible, and C or Fortran when it needs to be fast." Therefore I spent a lot of time writing C, and I didn't appreciate how aggressively something like HotSpot can optimize.

              (I don't mean that as a dig against Python; it's simply the mental model I absorbed.)

              The realization that I can have really good performance in a high-level language like Clojure is revolutionary for me.

              I'm learning a ton from the comments here. Thanks to everyone sharing their knowledge -- it's genuinely appreciated.

  • packetlost 3 hours ago

    > If it can't, what happens at runtime if the input is something else?

    ClassCastException

EdNutting 19 hours ago

Wild speculation: Could the extra speedup be due to some kind of JIT hotpath optimisation that the previous reflective non-inlinable call prevented, and which the new use of the single `arraylength` bytecode enabled? E.g. in production maybe you're seeing the hotpath hit a JIT threshold for more aggressive inlining of the parent function, or loop unrolling, or similar, which might not be triggered in your test environment (and which is impossible when inlining is prevented)?

  • mkmccjr 17 hours ago

    Author of the blog post here. That explanation sounds very plausible to me!

    If the whole enclosing function became inlinable after the reflective call path disappeared, that could explain why the end-to-end speedup under load was even larger than the isolated microbench.

    I admit that I don't understand the JIT optimization deeply enough to say that confidently... as I mentioned in the blog post, I was quite flummoxed by the results. I’d genuinely love to learn more.

TacticalCoder 14 hours ago

> When a client asks for the time, it sends a random nonce. The server replies with a signed certificate containing both the nonce and a timestamp, proving the response happened after the request.

Oh that's cool. Apparently one of the protocol's goals is to catch lying parties and to prove they were lying about the (rough) time.
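[A toy sketch of the nonce-binding idea quoted above, NOT the real Roughtime wire format or algorithm: the server signs (nonce || timestamp), so a valid signature proves the timestamp was produced after the client chose the nonce. All names here are mine.]

```java
import java.nio.ByteBuffer;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

// Server signs (nonce || timestamp); the client verifies that the
// signed response is bound to the nonce it just generated.
public class NonceTimestamp {
    static byte[] sign(KeyPair kp, byte[] nonce, long ts) throws Exception {
        Signature s = Signature.getInstance("SHA256withRSA");
        s.initSign(kp.getPrivate());
        s.update(nonce);                                     // bind the client's nonce
        s.update(ByteBuffer.allocate(8).putLong(ts).array()); // bind the timestamp
        return s.sign();
    }

    static boolean verify(KeyPair kp, byte[] nonce, long ts, byte[] sig) throws Exception {
        Signature s = Signature.getInstance("SHA256withRSA");
        s.initVerify(kp.getPublic());
        s.update(nonce);
        s.update(ByteBuffer.allocate(8).putLong(ts).array());
        return s.verify(sig);
    }

    public static void main(String[] args) throws Exception {
        KeyPairGenerator kpg = KeyPairGenerator.getInstance("RSA");
        kpg.initialize(2048);
        KeyPair kp = kpg.generateKeyPair();

        byte[] nonce = {1, 2, 3, 4};                 // client's random nonce
        long ts = System.currentTimeMillis();        // server's claimed time
        byte[] sig = sign(kp, nonce, ts);

        System.out.println(verify(kp, nonce, ts, sig));            // true
        System.out.println(verify(kp, new byte[]{9}, ts, sig));    // false: wrong nonce
    }
}
```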

  • geon 10 hours ago

    What's the use case? I'm guessing it doesn't actually have anything to do with getting the time?

    • kianNOP 2 hours ago

Roughtime is a really cool protocol we came across when we were hardening a license server. It provides a distributed mechanism for cryptographically verifiable time via chained requests. It’s not as precise as NTP (hence rough) but in practice it’s more than precise enough. It also has some nice additional properties: for example, NTP servers are often used as DDoS amplifiers, whereas Roughtime servers return a smaller payload than the request.

The ecosystem is currently very young. Each additional deployment meaningfully strengthens the ecosystem (ours is only the fifth server) and each additional implementation helps harden the spec (which is approaching 1.0).

      We wrote a bit more about it in a separate article: https://blog.sturdystatistics.com/posts/roughtime/

      Official protocol document: https://datatracker.ietf.org/doc/html/draft-ietf-ntp-roughti...
