Getting Lossless Compression Adopted For Rigorous LLM Benchmarking

They set out to test the generative capabilities of gzip against an LLM.

They understand that the LLM uses a tokenizer, i.e., the internal "language" of the model isn't formed by literal bytes (usually; not for that particular LLM anyway), but by a carefully optimized collection of tokens, designed to "increase the information content in their context" (their own words, and no debate there). Since the LLM provides an easily accessible cost-model, one can choose from a myriad of sampling methods (top-k, top-p, beam-search, etc) when using it as a generative model.
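To make "easily accessible cost-model" concrete: at every step the model hands you a distribution over next tokens, and any truncation-plus-renormalization scheme can sample from it. Here is a minimal top-k sketch in Python; the array of log-probabilities is just a stand-in for whatever the model actually outputs, and the function name and defaults are my own illustration.

```python
import numpy as np

def top_k_sample(log_probs: np.ndarray, k: int = 40,
                 rng: np.random.Generator | None = None) -> int:
    """Sample a token id from the k most likely next tokens.

    `log_probs` stands in for the per-token log-probabilities an LLM
    exposes at each step; k=40 is an arbitrary illustrative choice.
    """
    rng = rng or np.random.default_rng()
    top = np.argsort(log_probs)[-k:]               # ids of the k most likely tokens
    p = np.exp(log_probs[top] - log_probs[top].max())
    p /= p.sum()                                   # renormalize over the truncated set
    return int(rng.choice(top, p=p))
```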

But then they fail to understand that DEFLATE (of which gzip is an implementation) also has an internal language that is made up of "tokens" too - its literals and matches, not bytes. Just like the LLM preprocesses the input sequence into an internal representation of tokens, DEFLATE also preprocesses it - into a series of matches and literals, and it's that which gets encoded.

By doing so, they completely screw up their analysis of the cost-model, and hence get meaningless results.

Their proposed method for evaluating gzip as a generative model by conditioning it and then using autoregressive sampling is simple, yet flawed: compress a sequence of length L and note the compressed sequence size S; for every possible byte B in the input alphabet, concatenate it to the input sequence and check the compressed sequence size Sb'; sample among these possibilities with probability relative to how each minimizes Sb', i.e., the shorter the concatenated compressed sequence, the higher the probability that this particular byte B will be sampled.
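For reference, here's roughly what that procedure looks like in Python, using zlib (a DEFLATE implementation, same format family as gzip) as the compressor. The softmax-over-size-difference weighting is my own reading of "probability relative to how each minimizes Sb'"; the paper may weight the candidates differently.

```python
import zlib
import numpy as np

def compressed_size(data: bytes) -> int:
    # zlib is a DEFLATE implementation, like gzip.
    return len(zlib.compress(data, 9))

def sample_next_byte(context: bytes, rng: np.random.Generator,
                     temperature: float = 1.0) -> int:
    """Pick the next byte by comparing compressed sizes of context + candidate."""
    base = compressed_size(context)
    # S_B for every candidate byte B, measured only through the compressed length.
    sizes = np.array([compressed_size(context + bytes([b])) for b in range(256)],
                     dtype=float)
    # Smaller compressed size -> higher probability (softmax over the negative size delta).
    logits = -(sizes - base) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(256, p=probs))

def generate(context: bytes, n_bytes: int, seed: int = 0) -> bytes:
    rng = np.random.default_rng(seed)
    out = bytearray(context)
    for _ in range(n_bytes):
        out.append(sample_next_byte(bytes(out), rng))
    return bytes(out[len(context):])
```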

This method completely ignores the intrinsic cost-model of DEFLATE, and so even for inputs it's usually reasonably efficient at compressing, like text, you mostly get a rubbish sequence of whatever literals happened to have the lowest coding cost.

In DEFLATE, a literal "token" encodes a single input byte in N bits, whereas a "match" token encodes L (anywhere from 3 up to 258) input bytes in M bits. So the cost for a literal is N bits, while the amortized cost for a byte inside a match is M/L bits, and so very long match tokens completely dwarf literal tokens in terms of autoregressive sampling. Intuitively, this makes sense - if you have a method whose failure mode is basically spitting out long references to previous content interleaved with single characters, and you ask it to guess (extrapolate from its known state) what would be a likely continuation of the sequence it has just seen, it will overwhelmingly spit out long copies of what it saw. Do it long enough, and once its sliding window is made up of just its own extrapolated data, it will, with high probability, start "looping" on that data.
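To put rough numbers on that (the bit counts below are illustrative, not measured; actual Huffman code lengths depend on each block's statistics):

```python
literal_bits = 8        # N: a typical literal token, covering exactly 1 byte
match_bits   = 20       # M: a typical length+distance pair, here covering 258 bytes
match_len    = 258      # L: the maximum DEFLATE match length

literal_cost_per_byte = literal_bits / 1        # 8 bits per byte
match_cost_per_byte   = match_bits / match_len  # ~0.078 bits per byte

print(literal_cost_per_byte / match_cost_per_byte)  # literals are ~100x costlier per byte
```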

At any point in processing the input sequence, we would either be inside a match or we'd be looking for a match.

If we're inside a match (simplifying the analysis, without loss of generality, since different parsing strategies might be used), the extended input sequence that minimizes the compressed size is the one that eventually extends that match the furthest, followed by other, less ideal matches. Since, when extrapolating, we don't need to care whether the original data would actually have extended the match, every match offset into the sliding window that could have provided a match up to this point can be extended to the maximum possible length, so the offsets closest to our current position will have ever so slightly higher probability of being sampled next.

If we've just coded a literal, then the extended input sequence that minimizes the compressed size is any that provides the longest possible match a single match token can encode, so the match tokens that minimize the distance to the match offset will have the highest probability of being sampled.
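As a toy illustration of that last point (this is not DEFLATE's actual match finder, just a sketch of the idea): every offset in the window whose content matches the current suffix is a place a match token could be copying from, each one predicts a particular continuation byte, and nearer offsets are marginally cheaper to encode as distances.

```python
def candidate_continuations(window: bytes, min_match: int = 3):
    """List (distance, predicted next byte) for every offset matching the current suffix."""
    suffix = window[-min_match:]
    candidates = []
    for i in range(len(window) - min_match):
        if window[i:i + min_match] == suffix:
            distance = len(window) - min_match - i       # how far back the copy source sits
            candidates.append((distance, chr(window[i + min_match])))
    return sorted(candidates)                            # nearest (cheapest) offsets first

# The suffix "he " occurs twice earlier; the nearest occurrence predicts 'm' ("the mat"),
# the farther one predicts 'c' ("the cat").
print(candidate_continuations(b"the cat sat on the mat near the "))
# [(13, 'm'), (28, 'c')]
```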

This cost-model analysis relies on internal state information that cannot be accurately derived by simply looking at the compressed sequence length at each step of their method.

The LLM also has tokens expressing an individual byte. If you were to perform this process on the LLM, you'd be introducing a bias that would also greatly diminish its modelling power.

Look at page 9 of that paper. What they list as "gzip Samples" is exactly what you'd expect: complete rubbish. As stated earlier, just random sequences sampled mostly from whatever literals had the lowest coding cost.

Now look at the context text, and consider the DEFLATE encoding process.
The text ends with ",_he_or_" (where "_" is a space here for clarity purposes; it's not clear if there's a terminating space character due to formatting, but it won't make much difference for our thought process). Since the expression "_he_" isn't found anywhere else in the context-text, but "_or_" is, and it reaches the threshold of minimum length for a match, I'd say the generated sequence would (depending on implementation details) probably be "baronesses, and are created under the [[Life Peerages Act 1958]]. Like all other peers, life peers are created by the Sovereign, who...".

To reiterate, the maximum likelihood estimate is that gzip would just start regurgitating an (eventually looping) copy of the input sequence, not gibberish. Now hold on, look at Fig. 4 on that same page. That's exactly what their LLM does when extrapolating (generating) audio data, and they note it too - "Chinchilla predictions exhibit a typical “loop” pattern of autoregressive generation."

Well, so would gzip, if they'd done it correctly. But yes, that loop pattern is what you'd expect, as both gzip and the LLM have a limited context window (a sliding window of up to 32KB for DEFLATE, 2048 tokens for the LLM), and it's clear that once they're generating with nothing but their own extrapolated data as context, they'll get "stuck".

I don't care if it is from DeepMind or if Marcus Hutter is listed as one of the authors, that paper is junk, I'd be ashamed to put my name on it.

Now on to my initial remark that this is all easier said than done. Without a proper understanding of the process, it isn't always clear (or feasible) how to evaluate a lossless compressor under such requirements. DEFLATE is simple enough and they botched it. PPM compressors might seem like an obviously good choice, but even there you'll find subtle nuances that need proper analysis. For a simple RLE compressor, you'd expect it to eventually generate a sequence of a single repeated byte (see the sketch below). For LZ compressors with context-modelling, things quickly escalate to extremely high computational cost, as you'd need to look ahead not just a single byte but potentially much longer sequences to accurately sample the coding-model.
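A minimal sketch of the RLE case (a toy run-length cost model of my own, just to show the mechanism): under size-minimizing sampling, repeating the last byte never opens a new run, so it always wins, and the generator locks onto a single repeated byte.

```python
def rle_size(data: bytes) -> int:
    """Toy RLE cost model: every run costs 2 bytes (count, value); run-length caps ignored."""
    if not data:
        return 0
    runs = 1 + sum(1 for prev, cur in zip(data, data[1:]) if cur != prev)
    return 2 * runs

context = b"aaabbbccc"
base = rle_size(context)
print(rle_size(context + b"c") - base)  # 0 -> extending the last run is free, so 'c' dominates
print(rle_size(context + b"a") - base)  # 2 -> any other byte pays for a new run
```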

Context-mixing compressors usually also have a match model, which would make them even more expensive to evaluate, but if you remove that model, they are arguably the best-suited option for this among traditional methods. There are, however, some caveats. Do we stop online learning once we're extrapolating? Do we use a pre-trained model, and if so, do we still keep learning online on top of it or not?