78% MNIST accuracy using GZIP in under 10 lines of code

jakobs.dev

363 points by js98 2 years ago · 140 comments

montebicyclelo 2 years ago

I tried replacing the distance function in the code with some simpler distance measures:

    Gzip distance: ~3 minutes, 78% accuracy
    Euclidean distance: ~0.5 seconds, 93% accuracy
    Jaccard distance * : ~0.7 seconds, 94% accuracy
    Dice dissimilarity * : ~0.8 seconds, 94% accuracy

    * after binarising the images
So, as a distance measure for classifying MNIST digits, GZIP has lower accuracy, and is much more computationally demanding than other measures.

I'm not that familiar with how the GZIP algorithm works, so it's kind of interesting that it's so much lower. I wonder whether image-focused compression algorithms might do better?

(Edit: Btw, I enjoyed the post; it's a creative idea, the writing and code were great, and it's sparked some good discussion. But after having a closer look, I think the baselines above provide some context for the gzip scores.)
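
For reference, a minimal sketch of the kind of gzip-distance classifier being benchmarked here (not the post's exact code; it assumes the images are 28x28 uint8 numpy arrays):

    import gzip
    import numpy as np

    def ncd(x1, x2):
        # Normalized compression distance from gzip'd byte lengths
        c1 = len(gzip.compress(x1.tobytes()))
        c2 = len(gzip.compress(x2.tobytes()))
        c12 = len(gzip.compress(x1.tobytes() + x2.tobytes()))
        return (c12 - min(c1, c2)) / max(c1, c2)

    def classify(test_img, train_imgs, train_labels):
        # Nearest neighbour under the chosen distance: smallest NCD wins
        dists = [ncd(test_img, x) for x in train_imgs]
        return train_labels[int(np.argmin(dists))]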

  • w-m 2 years ago

    Best one I found is Normalized Mutual Information at 95%. It's a little more complex, but you could still compute it relatively quickly on binarized images.

        NMI skimage: ~30 seconds, 95% accuracy
        NMI numba *: ~0.6 seconds, 95% accuracy
    
    Here's the code, courtesy of ChatGPT:

        import numpy as np
        from numba import njit

        @njit(cache=True)
        def compute_joint(img1, img2):
            stacked = 2 * img1.ravel() + img2.ravel()
            return np.bincount(stacked, minlength=4).reshape(2, 2)
    
        @njit(cache=True)
        def entropy(counts, total):
            # Convert counts to probabilities
            p = counts / total
            # Only calculate for non-zero entries to avoid log(0)
            return -np.sum(p[p > 0] * np.log2(p[p > 0]))
    
        @njit(cache=True)
        def normalized_mutual_information(img1, img2):
            joint_counts = compute_joint(img1, img2)
            total_pixels = 28*28
            
            # Marginal counts
            c1 = np.sum(joint_counts, axis=1)
            c2 = np.sum(joint_counts, axis=0)
            
            # Compute entropies directly from counts
            h1 = entropy(c1, total_pixels)
            h2 = entropy(c2, total_pixels)
            joint_entropy = entropy(joint_counts.flatten(), total_pixels)
            
            mutual_info = h1 + h2 - joint_entropy
            return 2 * mutual_info / (h1 + h2)
  • tysam_and 2 years ago

    Holy cow, I knew that MNIST was simple, but not that simple.

    Could you post a snippet of the code that you used to achieve this? It would be really, really nice to have a baseline to work from I'm sure, and I feel like this could be really useful to a few other areas (my personal obsession is speed-training on CIFAR10 :')))) )

    Holy cow, that's insane. :O

    • montebicyclelo 2 years ago

      I used the notebook linked in the original post [1]. It evaluates using 100 samples from the test set (I'm guessing because the gzip method is slow - it would take ~7 hours on the full test set on my machine).

      I plugged in the distance measures in place of the `compute_ncd` function. (Jaccard/Dice have been negated and the -1 removed.)

          def euclidean(x1, x2):
              return np.sum(np.square(x1 - x2))
      
          def jaccard(x1, x2):
              x1_binary = x1 > 0.5
              x2_binary = x2 > 0.5
              return np.logical_or(x1_binary, x2_binary).sum() / np.logical_and(x1_binary, x2_binary).sum()
      
          def dice(x1, x2):
              x1_binary = x1 > 0.5
              x2_binary = x2 > 0.5
              return 2 * np.logical_or(x1_binary, x2_binary).sum() / (x1_binary.sum() + x2_binary.sum())
      
      
      [1] https://github.com/Jakob-98/mono/blob/73168bc0ea904e75865815...
      • tysam_and 2 years ago

        Ohhh, you know what? I think this makes sense, since it's basically like making prototypes from the training set, only it's instead comparing against every training example, then averaging. Cool. Did not know that would be remotely close to 90%, but that makes sense to me.

        I wonder if something like a weighted max_mean would perform better, or something like L1. Or maybe L2 is ideal; it is at the center of a lot of things in information theory, after all! ;PPPP

    • KateLawson 2 years ago

      MNIST is what the Post Office solved in the late 1990s. It's basically the "hello world" of visual recognition these days.

      https://www.govexec.com/federal-news/1999/02/postal-service-...

  • sr192 2 years ago

    ben recht's kernel method implementation in 10 lines hits 98%

    https://github.com/benjamin-recht/mnist_1_pt_2/tree/main

  • cqz 2 years ago

    Very cool. I tried using PNG compression; it does actually do slightly better:

        PNG : ~15.1 seconds, 83% accuracy
    
    I also tried dropping in zstandard compression:

        Zstd (level=3) : ~3.5 seconds, 88% accuracy
    
    Much faster than gzip, at least. And if I use (x1-x2)*2 instead of x1+x2 to calculate Cx1x2, that pushes zstd up to 93% accuracy.

    I'm somewhat interested in the fact that if I stack the two arrays on top of each other instead of adding them together, I get absolutely garbage performance (<20%); however, as far as I can tell that approach actually works well when classifying strings.

    • w-m 2 years ago

      Re garbage performance on stacking: I had tried interleaving the values (creating a 56x28 img), it dropped one percentage point of accuracy when using gzip.

  • jdthedisciple 2 years ago

    That's pretty amazing.

    So the whole gzip thing - while fancy - really is extra steps for less bang.

GaggiX 2 years ago

For a comparison with other techniques:

Linear SVC (best performance): 92 %

SVC rbf (best performance): 96.4 %

SVC poly (best performance): 94.5 %

Logistic regression (prev assignment): 89 %

Naive Bayes (prev assignment): 81 %

From this blog page: https://dmkothari.github.io/Machine-Learning-Projects/SVM_wi...

Also, it seems from reading online articles that people are able to obtain much better results just by using k-NN, so I imagine that the author just made his job harder by using gzip, but I could be wrong about this.
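
Baselines in that ballpark are straightforward to reproduce; a minimal scikit-learn sketch (hyperparameters are rough guesses, so exact numbers will differ from the blog's):

    from sklearn.datasets import fetch_openml
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC

    # MNIST from OpenML: 70k flattened 28x28 images, standard 60k/10k split
    X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
    X = X / 255.0
    X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

    for model in (LogisticRegression(max_iter=200), LinearSVC()):
        model.fit(X_train, y_train)
        print(type(model).__name__, model.score(X_test, y_test))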

  • IKantRead 2 years ago

    Thanks for posting this.

    Most people don't realize that Logistic regression can get ~90% accuracy on MNIST.

    As a big fan of starting with simple models first and adding complexity later, I've frequently been told that "logistic regression won't work!" for problems where it can in fact perform excellently.

    When faced with this resistance to logistic regression I'll often ask what they think the baseline performance of it would be on MNIST. The guesses I hear most often are 20-30%.

    People, even machine learning people, often don't realize the rapidly diminishing returns you get for adding a lot of complexity in your models. Sometimes it's worth it, but it's always good to start simple first.

    It's also been my experience that if you don't get good performance with a simple model, you are very unlikely to get great performance from a more complex one.

    • nextos 2 years ago

      Came here to say the same thing; actually, NCD can probably do much better than 78%. Li & Vitanyi's book about Kolmogorov complexity has some interesting unsupervised examples.

      A simple CNN as implemented in the Keras tutorial can easily exceed 98%. 78% is very poor performance for MNIST even if model complexity is penalized.

    • mardifoufs 2 years ago

      When would 90% accuracy on a dataset like MNIST ever be useful? And I mean useful as in usable for actual products or software, especially considering MNIST is more of a toy dataset.

      I think that's why machine learning is the way to go for this type of detection. Why go with anything other than a CNN (in this case) when it is now trivial to set up and train? Again, unless it's just to mess around with, 90% MNIST accuracy is not useful in the real world.

      • vannevar 2 years ago

        I don't think the point was that you should use logistic regression on MNIST. In lesser-known problems, say a custom in-house model, if you don't try the simpler approach first, you'll never know that your more complex solution is not worth the extra expense, or is actually worse than a simpler, cheaper model. MNIST is well-known to have nearly perfect solutions at this point, but for most novel problems, the data scientist has no idea what is theoretically possible.

        Now, you can say that CNNs or other techniques are easily accessible these days, and almost trivial to set up. But they may not be trivial to train and run in terms of compute in the real world.

    • KRAKRISMOTT 2 years ago

      If you stack logistic regression in layers, you get a neural network.

    • bugglebeetle 2 years ago

      > People, even machine learning people, often don't realize the rapidly diminishing returns you get for adding a lot of complexity in your models.

      Are you me? I was just having this argument at work with someone about using an old (fasttext/word2vec) model vs. the overhead on fine-tuned BERT model for a fairly simple classification problem.

    • short_sells_poo 2 years ago

      To a large degree this happens because logistic regression is not a sexy approach that one can add to their CV. Everyone wants to solve problems with big, complicated and buzzwordy models, because that sells (and is perhaps more interesting as well).

      It's really a tragedy, because so many classic models would work fine for real world applications.

    • vjerancrnjak 2 years ago

      Extending existing features with their non-linear mappings would improve logistic regression too, probably to the SVC level (an RBF or poly kernel is that, but implicit).

      Linear models are really well researched, and today with the compute and with proper training and data preparation they can easily get to satisfying levels of performance for a variety of tasks.

    • make3 2 years ago

      this is true unless you can use large pretrained models, which are very simple to use, are very resistant to all sorts of noise, & would get ultra high accuracy with just logistic regression on the output of the model

    • salty_biscuits 2 years ago

      You can get 80-ish percent with a linear model (if you one-hot the labels).
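
      A minimal sketch of that kind of linear model (assuming X_train/X_test are flattened pixel arrays and y_train/y_test hold integer labels):

          import numpy as np

          # One-hot encode the labels, then fit ordinary least squares
          Y = np.eye(10)[y_train]
          W, *_ = np.linalg.lstsq(X_train, Y, rcond=None)

          # Predict by taking the argmax over the 10 regression outputs
          pred = np.argmax(X_test @ W, axis=1)
          accuracy = np.mean(pred == y_test)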

  • dmazzoni 2 years ago

    That blog page is not showing state-of-the-art results, it's just taking relatively naive SVM implementations and comparing them.

    The original research paper that introduced the MNIST data set was getting around 98% accuracy, and today neural nets are getting 99.87% accuracy.

    https://paperswithcode.com/sota/image-classification-on-mnis...

    • GaggiX 2 years ago

      Yep, I was showing other simple techniques; of course a neural network would be much better at it.

  • TimPC 2 years ago

    The whole point isn't to do better. It's to show that enough information survives compression that you can still get very large signal. Compression is intended to make things harder and still does.

    • GaggiX 2 years ago

      The article OP linked is inspired by the gzip+kNN paper that was released a few months ago; compression is not intended to make things harder but to extract useful features that can be used by simple algorithms like kNN or SVM etc...

      For example, it would be almost impossible to distinguish cat and dog images by using an SVM or kNN directly on the raw data, but it would be much easier if the images were first passed through an image encoder and a small embedding vector were used for each one instead. In the article OP linked, it doesn't seem that gzip is very good at extracting useful patterns from the MNIST dataset.

    • amelius 2 years ago

      Why? I mean it's trivial to add a layer of encryption ...

      (Also the terminology seems off here, as 100% of information survives gzip.)

    • minitech 2 years ago

      No, that wasn’t the point at all. Compressibility was used to determine similarity.

      • vintermann 2 years ago

        And what it accidentally showed, was that NCD between individual digits in the training set is a really terrible distance metric for classification.

        You can do classification with KNN, which is obvious. You can also do classification with compression, which is less obvious, and neat. This approach tries to combine them in a way which doesn't work.

  • grandma_tea 2 years ago

    While it's cool that this works at all, I wish we would stop using MNIST as a benchmark given how trivial it is.

    • IKantRead 2 years ago

      It's a good benchmark because it's so trivial.

      Sure it's not great at differentiating between SotA techniques, but it's very useful for sanity checks like this one.

      Even for SotA models, it's still useful to verify that you can get greater than 98% accuracy on MNIST before exploring larger, more complex benchmarks.

      It certainly shouldn't be the only benchmark but it's a great place to start iterating on ideas.

      • TimPC 2 years ago

        It's a bad benchmark because it's artificially clean. It's effectively a 2d dataset with no occlusions. So nearly everything you try on it will work, and many things you try on it won't scale to typical image problems. There are good 3d datasets with more realistic examples that are still fairly simplistic compared to the state of the art large datasets, but at least give you signal that your technique is robust to common problems in vision. MNIST is so simplistic that you encounter none of the typical problems in computer vision settings so it doesn't give you a good prediction of how good your technique is.

        • ghshephard 2 years ago

          Can you imagine a model that performs very poorly on MNIST but would perform well in real-world computer vision problems? If you can't, then MNIST is a nice simple smoke test when assessing models.

        • tysam_and 2 years ago

          That's what makes it a good benchmark.

          It's a benchmark.

          Not a real world problem.

          That's why the traditional path has been MNIST -> CIFAR10 (optionally -> CIFAR100) -> ImageNet -> ????!?.

          Because it gets gradually more complicated.

          Your iteration time is the constraint to development progress.

          Keep that down, and the bugs from your initial implementation will be significantly less impactful.

        • d3w4s9 2 years ago

          What's wrong with "artificially clean"? The goal of benchmarks is to compare and know whether one model is better than the other. There is never a "perfect" or "objective" benchmark. Different benchmarks may highlight advantages in certain models, which is a good thing, but there is absolutely nothing wrong with using MNIST as a dataset to give you a basic idea of how models perform.

          • TimPC 2 years ago

            Artificially clean gives you too many false positives. Most researchers I know these days start on CIFAR10 which is much more real world and has far fewer false positive signals. A portion of the hype in CV is a paper showing a technique works on MNIST that then fails to go anywhere on any other dataset.

      • tbrownaw 2 years ago

        So MNIST for ML models is kind of like FizzBuzz for humans doing software development.

    • dmazzoni 2 years ago

      It's still a great data set because it's real-world, useful, high-quality, and an excellent size (~60k examples), and the "correct" classification isn't very subjective at all - humans can easily get 99% accuracy.

    • unoti 2 years ago

      MNIST is pretty good as a proof of concept. While I agree it's trivial, I see MNIST as being more like how, when you're making a toy programming language, the first thing you do with it is write a recursive Fibonacci function.

    • antiquark 2 years ago

      Take a look at the EMNIST - Extended MNIST dataset. It has both digits and letters of the alphabet.

  • westurner 2 years ago

    So there is a more optimal compression algorithm for the relation between the MNIST inputs and outputs!

    Other models tend to add noise somewhere in there; what about feature engineering before the gzip? Maybe a Gaussian blur and a convolution first, and then deep learning for feature selection.

tdr2d 2 years ago

Obviously, the code may be elegant and compact, but 78% accuracy is considered very, very bad for MNIST.

A dummy model written with TensorFlow easily reaches 90% accuracy. The best models rank at 99.87%; see the benchmark: https://paperswithcode.com/sota/image-classification-on-mnis...

  • esafak 2 years ago

    The article emphasizes the wrong thing, in my view. The interesting part is that compression -- without learning a model -- can be used for classification. This raises the question of what other information-theoretic measures can be used; cheaper, lossy ones.

    To Compress or Not to Compress - Self-Supervised Learning and Information Theory: A Review https://arxiv.org/abs/2304.09355

    • anitil 2 years ago

      I remember seeing an example of using zip to classify languages. You take a set of documents of equal size where you know the languages, then individually concatenate and zip them with the unknown text. The smallest compressed output is likely to be the target language.

      I can't find the original blog, but there's a note about it here - https://stackoverflow.com/questions/39142778/how-to-determin...
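
      A rough sketch of that trick (assuming a `corpora` dict mapping language names to reference texts):

          import gzip

          def guess_language(unknown, corpora):
              # Pick the language whose reference text lets gzip compress
              # the concatenation the most, relative to the reference alone.
              def overhead(ref):
                  return len(gzip.compress((ref + unknown).encode())) - len(gzip.compress(ref.encode()))
              return min(corpora, key=lambda lang: overhead(corpora[lang]))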

      • vintermann 2 years ago

        Ideally, you'd take all the documents in each language, and compress them in turn with the unclassified text, to see which compresses it better. But this won't work very well with gzip, since it compresses based on a 32KB sliding window. You might as well truncate the training data for each class to the last 32KB (more or less). So to get any performance at all out of a gzip-based classifier, you need to combine a ton of individually quite bad predictors with some sort of ensemble method. (The linked code demonstrates a way of aggregating them which does not work at all).

      • 6510 2 years ago

        How much better would that get if you append all but one of the equal size documents? (or other combinations like 2 of the top results after using a single one)

        • vintermann 2 years ago

          Better, if the compressor can use all that extra context. Gzip, and most traditional general purpose compressors, can't.

          It's hard to use distant context effectively. Even general purpose compression methods which theoretically can, often deliberately reset part of their context, since assuming a big file follows the same distribution throughout as in its beginning often hurts compression more than just starting over periodically.

      • esafak 2 years ago

        Now that you mention it, I vaguely recall writing a language classifier based on character histograms as a youth. Good times.

    • Quenty 2 years ago

      I mean, they’re using kNN underneath. You can apparently get 97% accuracy with normal kNN at k=4 if you compare pixel distance.

      So another way to frame this might be that gzip costs a lot of accuracy but may lead to better performance.

      https://newpblog.netlify.app/2018-01-24-knn-analysis-on-mnis...
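
      A minimal sketch of that plain-kNN baseline with scikit-learn (k=4, with X_train/X_test as raw pixel arrays):

          from sklearn.neighbors import KNeighborsClassifier

          # k-nearest neighbours on raw pixel values (Euclidean distance)
          knn = KNeighborsClassifier(n_neighbors=4)
          knn.fit(X_train, y_train)
          print(knn.score(X_test, y_test))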

    • a_wild_dandan 2 years ago

      An increasingly common refrain in machine learning is “intelligence is compression.” Folks who believe that might bristle at the distinction between learning and compression.

      • esafak 2 years ago

        I doubt it because learned features already constitute lossy compression. The question is what kind of compression; lossy vs. lossless, learned vs. unlearned.

  • sundarurfriend 2 years ago

    The point is not to have "elegant and compact" code, this is meant to be a fun curiosity, and doing it in 10 lines is just an additional layer of challenge for the heck of it.

    The interesting thing is not in whether GZip can achieve SOTA, it's that it can do a decent job at all. (The interesting thing is not in whether the bear can recreate Mozart exactly, it's that it can play the piano at all.)

    • ActivePattern 2 years ago

      Yeah, it does demonstrate that you can use compression to measure similarity of two images.

      But it also demonstrates that it's a pretty poor similarity measure. Something as simple as counting % of matches between the black and white pixels performs much better.

  • m00x 2 years ago

    It's not trying to break records; it just shows a neat aspect of compression. It's still 8 times better than the random baseline, which showcases that compression can learn a representation.

tobeyey 2 years ago

Replacing the NCD

  distances = [(compute_ncd(x1, x), label) for x, _, label in compressed_lengths]
with the Euclidean distance

  distances = [(np.sqrt(np.sum(np.square(x1-x))), label) for x, _, label in compressed_lengths]
gives you +15% test accuracy and saves you a lot of compute.

benreesman 2 years ago

My favorite book about the deep connections between information theory, compression, and learning algorithms is MacKay's Information Theory, Inference, and Learning Algorithms (most probably know about it but I didn’t for a long time, so maybe some will benefit from the mention).

I gather this is common knowledge (if not sufficiently emphasized at times) among those with serious educations, but as a self-taught, practical application-type ML person this profound thread running through all these topics (and seemingly into heavy-ass particle physics and cosmology and stuff like that) was a blinding “Aha!” moment that I’ll venture this comment in the hopes that even one other person has that unforgettable moment.

  • doubloon 2 years ago

    i have put MacKay on my to do list, thank you

    i was quite struck when i learned the original Lempel Ziv compression (which gzip is partly based on) came out of their study of the "complexity of finite sequences" not necessarily trying to shrink things, https://ieeexplore.ieee.org/document/1055501

jszymborski 2 years ago

In fairness, you can run MNIST through UMAP and get near-perfect separation. I'm of the belief that you have to try pretty hard not to do well on MNIST these days.

https://github.com/lmcinnes/umap_paper_notebooks/blob/master...

EDIT: I should add, unless it isn't clear, that we really should retire the dataset. Something like the QuickDraw dataset makes a lot more sense to me.

  • js98OP 2 years ago

    Author here: completely agree. I think it is not much of an achievement by itself, but it is interesting to see that it works!

    I will add a comment/edit to the post once I am home to clarify the relative ease of solving MNIST

    • jszymborski 2 years ago

      Sorry if my comment came off as dismissive of the post! I was just trying to frame the results for those unfamiliar with the properties of MNIST.

      I think it is indeed interesting, as it shows that Gzip can capture the same things that e.g. UMAP et co. are capturing when they achieve good scores on MNIST.

      Also, I'll add, that even despite some of the suspicions people have cast on the Gzip results in other experiments, I'm bullish on the utility of reducing entropy for classification problems.

  • ekidd 2 years ago

    > I should add, unless it isn't clear, that we really should retire the dataset.

    From a research perspective, MNIST is basically solved. I think current performance on MNIST is better than humans, correct?

    But it's still valuable as a teaching tool or a "Hello world" data set, because it's so simple and because most reasonable algorithms will reach 97% accuracy. Even if you build your tools from scratch, it makes a nice homework-sized problem. And it's an obviously useful task that anyone can understand. "Oh, digit recognition for mail!"

    • jszymborski 2 years ago

      I was thinking about this just now... the real utility of MNIST is that, since it's basically impossible to get a bad result, it's a good sanity check.

  • krick 2 years ago

    Yeah, but the thing is, gzip isn't "these days". It was a thing long before UMAP and even MNIST itself. And the approach isn't particularly novel either; it's a very simple idea if you understand compression. It could've been written the first day MNIST was published, and it would still have gotten 78% accuracy. This is kinda amazing to me.

  • sundarurfriend 2 years ago

    Y'all are making the (rude) person below complaining about acronyms look reasonable. The repository doesn't define UMAP either, but if you believe ChatGPT it is:

    > UMAP, which stands for Uniform Manifold Approximation and Projection, is a dimensionality reduction technique and data visualization method commonly used in machine learning and data analysis.

    • wenc 2 years ago

      We work in our own ecological niches which sometimes entails having a specialized language that allows us to be precise when referring to things.

      I sometimes see HN comments complaining that acronyms and terms are dropped with no explanation. These comments come across as entitled to me — instead of complaining, why not be curious and ask “what does that mean in this context?”

      Imagine someone dropping in here and complaining, “the article sucks because it uses the word ‘compiler’ and assumes we all know what a ‘compiler’ is.” This is what it sounds like to those of us to work in specialized fields.

      Instead if someone said, “I’m new to this, could someone help me understand what a ‘compiler’ is?”, it demonstrates curiosity and people are more inclined to explain.

    • jszymborski 2 years ago

      It's[0] a non-linear dimensionality reduction technique in the vein of t-SNE[1] that is very well known in the field.

      My two cents is that if your problem can be solved with something like UMAP + kNN[2], then you really shouldn't be using Deep Learning to solve it.

      [0] https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduc...

      [1] https://en.wikipedia.org/wiki/T-distributed_stochastic_neigh...

      [2] https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

    • tomrod 2 years ago

      UMAP is as ubiquitous as T-SNE these days, though to be fair those on HN that focus on JVM, LLVM, NPM, or TLS may not know about UMAP/T-SNE.

wayeq 2 years ago

I'm merely a hobbyist in this domain but isn't highly compressed data (like encrypted data) also high entropy? If this is finding patterns in the compressed data to figure out which digit the uncompressed data represents, shouldn't those patterns be exploitable for better compression?

  • fritzo 2 years ago

    The demonstration isn't classifying based on compressed data, rather it's classifying on the compressibility of data. The idea is that "7 7" should be more compressible than "7 3", and similarly raster images of "7 7" should be more compressible than raster images of "7 3".
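
    A quick way to see the idea, assuming `seven_a`, `seven_b`, and `three` are raw image byte strings:

        import gzip

        # Two images of the same digit share more structure, so gzip
        # should usually compress their concatenation a bit smaller.
        same = len(gzip.compress(seven_a + seven_b))
        diff = len(gzip.compress(seven_a + three))
        print(same, diff)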

  • Zamicol 2 years ago

    Encrypted data ideally is incompressible. Incompressibility is a hallmark of efficient cryptographic operations.

    See the Wikipedia article on Kolmogorov complexity, which has a short section about compression. https://en.wikipedia.org/wiki/Kolmogorov_complexity#Compress...

    Edit: One of my favorite concepts in the domain of compression is the pigeonhole principle, which states that for all compression algorithms, some outputs will be larger than their inputs.

    Well-designed encrypted payloads may be compressed, but the outputs should on average be larger than the inputs, rendering compression useless; thus the data is said to be "incompressible".

    https://en.wikipedia.org/wiki/Pigeonhole_principle#Uses_and_...

tjungblut 2 years ago

I can't immediately find it, but a couple of years back there was a "meta-feature" that was just the size of the MNIST image. I think that scored about 90-ish% accuracy on its own - without even looking at the image.

  • 13of40 2 years ago

    A few years back I worked on a project that involved fingerprinting screenshots of web pages, and compressed image size was pretty much as good as any fingerprinting method we could come up with for comparing the similarity between them.

    • momojo 2 years ago

      The off-the-beaten-path nature of this reminds me of banks sending $<arbitrary decimal amount> as a PIN for auth.

    • tysam_and 2 years ago

      Makes sense, considering that's not too terribly far off from what the KL divergence does

  • p1esk 2 years ago

    What do you mean by “size”? Gzipped size? If you simply look at how dark an MNIST image is (count the percentage of dark pixels) you’ll get about 20% accuracy, which is twice as good as a random guess but a long way from 90-ish%.

    • yvdriess 2 years ago

      What do you mean by accuracy here? Usually 50% accuracy means a coin toss, meaning 20% accuracy is equal to 80% accuracy, which is better than the article's 78% and not that far from 90%.

      • Frotag 2 years ago

        > meaning 20% accuracy is equal to 80% accuracy,

        Only if your model is outputting a yes/no answer right? And that your definition of accuracy is "class with highest probability" (and not "N classes with highest prob")

        If your dataset has more than 2 classes like MNIST, a super low accuracy only tells you to ignore the class the model guesses. It doesn't tell you which of the remaining classes is correct

      • rjh29 2 years ago

        There are ten choices, so getting the answer right 20% of the time is very plausible.

      • tysam_and 2 years ago

        "One simple trick to beat the statistical odds...."

      • p1esk 2 years ago

        There are 10 classes in MNIST. A random guess would be 10%.

3abiton 2 years ago

Didn't it turn out that the authors of that paper made mistakes that catapulted their results to the top of the benchmark charts? I thought the theory was inconsistent after that incident. 78% accuracy from just GZIP is impressive.

jp57 2 years ago

Leaving aside whether this problem is a good application of this compression trick, I want to say that everyone experimenting with this should stop using `gzip` and start using `zlib`.

If you change the first line from `gzip.compress` to `zlib.compress` you should get the same classifier performance with a 3x speedup.
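
Concretely, the swap is just the following (a sketch; `C` stands in for whatever compressed-length helper the notebook uses):

    import zlib

    # Same DEFLATE algorithm as gzip, but without the gzip header/CRC
    # overhead, so it's noticeably faster per call.
    def C(x):
        return len(zlib.compress(x))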

bob1029 2 years ago

General purpose compressors and information distance measures have become super interesting to me while I've been investigating alternative language models.

I've been playing around with an attention mechanism that combines the idea of using normalized compression distance (gzip) with discrete convolution between candidate sequences (sliding window of N bytes over each). Another round of normalization over the convolution outputs - accommodating varying lengths - allows us to compare candidate sequences for relevant information on equal grounds.

The NCD formula I am using right now:

  NCD(x,y) = (C(xy) - MIN(C(x),C(y))) / MAX(C(x),C(y))
No weird parameters or any other things to tune. The only parameters are the source documents and the input context/query.

  • tripzilch 2 years ago

    So, if I understand your sliding window explanation right, would the distance between two strings X and Y then be a feature vector of all the NCDs of its windows?

    Kind of reminds me of auto correlation :)

    Also, just a note about your NCD formula, if C(xy) is the compressed size of the concatenation of x and y, then I would recommend also trying (C(xy)+C(yx))/2 for that term, because a lot of compressors don't compress xy the same as yx, and you probably want your distance to be symmetrical.
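
    For example (a sketch, with `C` as the compressed-length helper):

        def ncd_symmetric(x, y, C):
            # Average both concatenation orders so that d(x, y) == d(y, x)
            cxy = (C(x + y) + C(y + x)) / 2
            return (cxy - min(C(x), C(y))) / max(C(x), C(y))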

yannccc2 2 years ago

All these ideas date back to at least 2010, probably earlier. It was called "information distance". https://arxiv.org/abs/1006.3520

tysam_and 2 years ago

If you'd like to play around with MNIST yourself, I wrote a PyTorch training implementation that gets ~99.45%+ val accuracy in <13.6 seconds on a V100, est. < 6.5 seconds on an A100. Made to be edited/run in Colab: https://github.com/tysam-code/hlb-CIFAR10

It's originally kitted for CIFAR10, but I've found the parameters to be quite general. The code is very easy to read and well-commented, and is a great starting place for exploration.

Min-cut deltas to run MNIST:

  .datasets.CIFAR10(' -> .datasets.MNIST(' (both occurrences)

  'whiten': Conv(3, -> 'whiten': Conv(1,

  crop_size = 28 -> crop_size = 28
Compute for the project was funded by Carter Brown and Daniel Gross; my appreciation to them both for helping make this possible. Their support via encouragement has been very helpful as well. <3 :)

m00x 2 years ago

It goes along the same lines as the recent paper from DeepMind stating that DL (language modeling in their case) is compression.

https://arxiv.org/pdf/2309.10668.pdf

thomasahle 2 years ago

Similar result (2019) using ZIP, getting 74%: https://www.blackhc.net/blog/2019/mnist-by-zip/

0xDEF 2 years ago

Has anyone tested how different compression algorithms compare when doing NCD classification?

gzip first does LZ77 and then Huffman coding. ANS is a more modern alternative to Huffman coding that can achieve higher compression ratios.

Lerc 2 years ago

I'm sure there was something like this mentioned in "Managing Gigabytes" or thereabouts. It might have been using bitwise arithmetic compression which makes sense for the problem at hand.

benob 2 years ago

What explains the huge difference in perf with the approach from 2019 mentioned at the end?

For image classification, couldn't one use the residual from applying a JPEG codebook learned on a training example as a metric?

SomewhatLikely 2 years ago

Wouldn't it make more sense to create ten streams of images of the same class and then see which one results in the smallest compressed size for a test image? That is, if I gzip a hundred '6's plus my test image and get a compressed size for my test image of 10 bytes, but doing the same for other digits gives me say 15 bytes then I conclude the test image is a '6'.
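
A sketch of that scheme (assuming `train_by_class` maps each digit to a list of raw image byte strings):

    import gzip

    def classify_by_class_stream(test_img_bytes, train_by_class):
        # For each digit, measure how many extra bytes the test image adds
        # when appended to a stream of that digit's training images.
        def extra_bytes(digit):
            stream = b"".join(train_by_class[digit])
            return len(gzip.compress(stream + test_img_bytes)) - len(gzip.compress(stream))
        return min(train_by_class, key=extra_bytes)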

yieldcrv 2 years ago

I want someone to show me how to use compression to compile exploits on emulated logic gates, like NSOGroup keeps doing

Seems like a sufficiently novel and important advancement that should be taught in universities at this point, since we need to harden software around this kind of possibility.

petters 2 years ago

I wonder why you started with a code-golfed version? That seems original to the main point of the post

  • seeknotfind 2 years ago

    original -> orthogonal?

    I like short code! It's like a joke "this is so easy, 10 loc" :)

    • hinkley 2 years ago

      The internet has turned into a game of “is it him or is it autocorrect?”

      I can’t count how many times I’ve looked up at a line of text to spot check it, type in a couple more letters, hit save without looking, and find later that I sound like I’m having a stroke. Never mind the times I’ve simply forgotten to look. Yes that’s a word, stop trying to replace it.

      (Five minutes ago I had to type “laissez-faire” three times. Yesterday it turned “understanding” into god knows what. And if iOS doesn’t stop adding apostrophes to “its” I’m gonna fuckin lose my shit)

      • burnished 2 years ago

        I ended up turning auto correct off due to iOS being more aggressive with its corrections than my previous android phones (both of which frequently fail to recognize common words, and maybe both but certainly the android was absurdly brand aware).

        • hinkley 2 years ago

          > brand aware

          I got so tired of Mathematica italicizing its own name when I was in college so I figured out how to type it in two phases so it wouldn’t. It was my tiny little petty victory over Stephen Wolfram ruining math for me.

          I know that struggle.

great_psy 2 years ago

Having some artificial intelligence algorithm that is completely understood and tunable, like gzip is, would be great for many uses.

I think it’s pretty hard to just improve an NN without more data or some non-trivial amount of effort.

tysam_and 2 years ago

Additionally, try flipping your images and averaging the size before comparing the distance between them. I'd expect a boost of about 78% -> 84% or so, based on how this typically works as TTA.

  • jdthedisciple 2 years ago

    Can you elaborate more on that technique?

    Why would it improve the result so much? Sounds very interesting so I'm curious.

    • tysam_and 2 years ago

      facepalm I just realized I was talking about this in the context of MNIST -- my apologies.

      It would be better in a problem with horizontal symmetry like CIFAR, esp if the zipping is serialized by unwinding the 2d pixel array into 1d.

      One trick that _should_ work is comparing the distances you get from starting the compression at the 4 different corners of the image, in the 2 separate directions for each corner. That should provide way more than enough information for k-means clustering.

      My apologies again for my mistake, thank you for asking, I wouldn't have really seen that otherwise :')))).

tripzilch 2 years ago

What they mean by "intelligence is compression" is that there's actually an equivalence between them.

See, an "intelligence" is a predictor. Like an LLM, it predicts the next char/token depending on what it's seen before.

In order to turn this into a compressor, you store the "difference" between the predicted token and the actual token. If the predictor is good, this stream of data will be very low entropy which can be compressed with arithmetic encoding or something similar.

In the case of arithmetic encoding you get lossless compression. (additionally, because arithmetic encoding is based on probabilities (frequencies), if the predictor outputs probabilities instead of a single prediction, this can also be used to crunch the entropy)

Now look at for instance the speech codecs in GSM. They use Linear Prediction Coding, which has the same concept of using a predictor and storing the error stream. Except the latter's coefficients are rounded down, making it a form of lossy compression.

And yes, you can probably make a pretty good (lossless or perhaps even lossy, but I don't think you want that) text compressor by using an LLM to (deterministically) predict (the likelihood of) tokens and storing only the errors/differences. It should be able to outperform zip or gzip, because it can make use of language knowledge to make predictions.
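
The "ideal" cost of such a scheme is easy to estimate even without implementing the arithmetic coder; a sketch, assuming `probs` holds the model's predicted probability of each actual next token:

    import math

    # Shannon code length: a perfect arithmetic coder driven by this
    # predictor would need roughly this many bits for the whole sequence.
    def ideal_compressed_bits(probs):
        return sum(-math.log2(p) for p in probs)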

There's a catch, however, which is that in the case of LLM compression you also need to store all those weights somewhere, cause you need the predictor to decode. This is always the catch with compression, there is always some kind of "model" or predictor algorithm that is implicit to the compressor. In the case of gzip it's a model that says "strings tend to repeat in patterns", which is of course "stored" (in some sense) in the gzip executable code. But with gzip we don't count the size of gzip to our compressed files either, because we get to compress a lot of files with one model. Similarly for this hypothetical LLM-text compression scheme, but you just need to compress a whole lot more text before it's worth it.

All that said, however, like many others pointed out, 78% isn't a great score for MNIST.

Then again, I also don't think that gzip compression is the best predictor for gray scale image similarity. For one if you have two pixels of grayscale values 65 and 66, gzip will see them as "different bytes", regardless of them being very similar in grayscale level. You might even be able to increase the score by thresholding the training set to BW 0/255.

_a_a_a_ 2 years ago

"MNIST"?

accuracy - but of what?

what's this about?

  • simonw 2 years ago

    MNIST is a classic image classification exercise - a dataset of 60,000 training images and 10,000 testing images where each image is a handwritten numeral as a 28x28 pixel grayscale image.

    The challenge is to build a computer vision model that can tell which numeral each handwritten digit represents.

    https://en.wikipedia.org/wiki/MNIST_database

    78% accuracy on a solution is pretty bad, but achieving it just using GZIP is a very neat hack.

Karellen 2 years ago

You solved some problem, including implementing the GZIP algorithm, in less than 10 lines of code?

...

Oh, ok, not that.
