The missed opportunity of constrained decoding
michaelorenstein.comI was working on a speculative decoding optimisation and its accompanying blog post. Explaining the more basic concepts filled so much of the post I decided to pull them out, forming this article.
I had a bit too much fun with the tokenisation diagrams / animations. The raw text is provided to an Astro component, which tokenises it, and forms the individual DOM elements of the tokens. I find it really hard to read 'tokenised' text, I figured some consistent colouring would help. The 'Probabilities' component is a trivial grid, but all the other components support 'word wrap'.
I ended up writing a 'responsive design aware graph colouring solver'.
Multiple screen widths, 'desktop' and 'mobile' are 'simulated', forming an adjacency graph of tokens that touch. Colours are then greedily allocated, then optimised per page over a few hundred iterations, swapping allocations to enforce minimum hue distance between touching tokens at those common screen sizes. The optimising value function prioritises even distribution of colours, because it looks nicer than maximal hue difference.
Originally I naively outputted the palette styles per component, but found the css post processing optimisers didn't handle that as well as I'd have thought. So then I wrote a little 'CSS compiler' that takes the high level palette and timing concepts of the animations, and optimally merges rule declarations.
The start of the post really relies on the animation occurring while fully in view, so I set up some IntersectionObservers that do the 'please scroll' text.
I tried my best to have it all work when JS is disabled on the client. I tried to get the 'hovering' to be CSS-only, but found the JS solution much more performant.
The DAG diagrams are formed with this neat Needleman-Wunsch algorithm from the bioinformatics field. The Astro component accepts several 'examples' then aligns common subsequences, producing the CSS grid and the 'basic SVG' on the server. The responsive nature meant I had to move the final 'allow' generation to the client.
Some browsers seem to throttle the token animations sometimes but I haven't figured out what causes that. This is my first time leaning hard on CSS variables.