Using machine learning to choose compression algorithms
vks.ai

Is there a way to do the reverse?
There's a quite "legendary" Game Boy Advance game out there (Klonoa - Densetsu no Star Medal) that never got a translation to English because it has some sort of in-house created compression by Namco applied to the game that was made so it could fit into a GBA cartridge. AFAIK no one was ever able to crack it open and release the code to de/compress it.
A while ago I had a "bounty" of USD100 for anyone that could do it (just the decompression and re-compression, not translation) but there aren't many people that want to fiddle with low-level GBA coding.
Anyone who does malware analysis professionally has to deal with packed data in new and funky ways, as malware binaries tend to be packed (i.e. compressed) in unique ways to get past antivirus software.
If the bounty was high enough there are people out there who do this sort of thing professionally and would probably jump on the opportunity.
For something like an obscure GBA game, you can probably write up some blogposts to add reputation to the minuscule monetary value of the bounty.
Well, the bounty nowadays is $0. :-)
I have given up until a friend and I have the proper time to try to crack the game's insides on our own. If anyone did this now, it'd be for the community.
Instead of trying to crack the compression algorithm directly, it would make sense to disassemble the machine code and try to understand what it does.
I'm assuming the game is playable, i.e. the decompression code is included on the cartridge and you just don't know how it works. In that case you could emulate the game and use a language model to identify strings containing Japanese text (you'd need to know the encoding to do that) so they can be extracted for translation. That doesn't allow you to put the translations into the compressed code, but you might be able to instrument the emulator to inject translated strings on-the-fly.
You might as well write a tool that extracts strings from a video signal using OCR, and translates them. That would make the solution more universal, and you could even use it to e.g. suppress ads.
Well, that's super hard, since Japanese text encoding is an epic story in digital archaeology in itself.
I'm not well-versed on the subject, how does encoding come into play for text displayed on the screen? Did they use a strange way of representing the Japanese text because of technical limitations?
The GBA didn't have much RAM. There is a good chance tiny chunks of the game get decompressed as needed, and there is never a time when the whole thing is decompressed at once and can be dumped.
You don't need everything to be decompressed all at once to dump it. You could continuously dump the memory contents while exploring the game (or having a fuzzer do it for you).
It might end up smarter to RE the decompression and then patch the game to accept an uncompressed translation (and bump the ROM size from 16 MB to 32 MB).
You should be able to do it with enough sample data, but it might not be perfect since you'd still be guessing at what it's really doing.
Doesn't it depend on whether the algorithm is lossy? If it's lossy (not bijective), it's impossible to invert the function.
This is discussing text or code compression. There's no point in using lossy methods for the dialogue of your game.
Are all compressors simply being run with their default arguments? There's a lot of scope for speed/filesize tradeoffs within a single compressor.
EDIT: You are missing `csv+zstd`? It should obsolete `csv+gzip` at all speeds and compression levels.
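To illustrate the point about tradeoffs within a single compressor: even with nothing but stdlib zlib, the level parameter trades speed for size (the data and numbers here are just an illustrative sketch; timings will vary by machine):

```python
import time
import zlib

data = b"the quick brown fox jumps over the lazy dog " * 20000

for level in (1, 6, 9):
    t0 = time.perf_counter()
    out = zlib.compress(data, level)
    dt = time.perf_counter() - t0
    print(f"level {level}: {len(out):>7} bytes in {dt:.4f}s")
```

Level 1 is much faster but leaves bytes on the table; level 9 squeezes harder at a CPU cost, and that whole frontier exists inside every one of the benchmarked compressors.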
There is a pareto-optimality frontier here - I ran my testing back in 2016 https://code.ivysaur.me/compression-performance-test/ but the numbers are now a little bit obsolete (e.g. zstd and brotli have both seen a lot of improvements).
Yea, I thought parameters per compression algorithm should indeed be added in a next version :) More compute, but definitely an improvement. I think pandas doesn't offer zstd as an option with csv, but I'll check once more.
EDIT: indeed, it's missing in `to_csv` - seems like an oversight.
+1 - and if you could use zstd with custom dictionary, then you can achieve even better compression ratios
Yea zstd is really amazing... if I would choose a single one all the time it'd be zstd for sure.
A somewhat different but related example (at least by name): I recall an article from Chris Wellons' blog, Null Program, where he wanted to "discover" a new hashing algorithm, so he randomly generated them, JIT compiled them to native code, then tested them.
There was, however, no machine learning or optimizing. Instead, he called it "prospecting" and just generated a new one from scratch each time until he found something interesting.
Yea - that is related to genetic programming. That, and using auto-encoders for e.g. image compression are known approaches in "AI".
I'm particularly proud of this meta approach and I am actually thinking this could become huge: the same thing can be done for hyperparameter optimization in machine learning tasks.
Hyperparameter optimization is currently focused on minimizing cross-validation error, but using this concept you could have weights on accuracy, training time and prediction time (very similar to compression, where the 3 dimensions are size, write time and read time), and then given a new unknown dataset you could predict which model/hyperparameters to use.
Maybe this should be patented ;)
> I'm particularly proud of this meta approach and I am actually thinking this could become huge: the same thing can be done for hyperparameter optimization in machine learning tasks.
There is already a substantial field of Machine Learning/Meta Learning which focuses on exactly this. For example, this paper [1] from NeurIPS 2015 does exactly what you suggest.
[1]: https://papers.nips.cc/paper/5872-efficient-and-robust-autom...
Yea, I am aware of the meta hyperparameter approach for ML, except they only focus on accuracy instead of also including train/prediction times in the equation :) That's what I was referring to! (You can save A LOT of compute and zoom in on things that work if you can weed out slow / badly performing algorithms as part of meta-learning hyperparameters.)
To make it extra clear: by doing a lot of compute on different datasets and recording not only the accuracy but also the time it took, and then including that as a dimension, it will give even better results.
When and why would you want to apply ML to this problem?
Off the cuff, this seems like it competes with an alternative of simply running every considered compression algorithm and choosing the optimal one. I guess this would be advantageous if the RF classifier is meaningfully faster to run than the compression algorithms themselves. Is it?
Spot on (I briefly touch on this in the article)! This is why I try to work with cheap-to-compute features. I used to calculate how unique all values were, but ended up taking a sample instead to speed that part up for large data!
How does that compare to just doing compression on a sample of data? :)
Perfect interview questions to get to the juicy details haha!
The problem is that choice of compression is very much dependent on the sample size, so this is why just choosing the algorithm based on running benchmarks on the sample will be off.
However, computing certain statistics on the sample and then doing machine learning makes a lot of sense!
Obviously not for simple/cheap to compute features such as number of rows/columns which actually do not take longer to compute as the data gets larger.
But I actually do this for predicting the uniqueness of values! You basically want to get an idea of how many unique values you have per column. But if you naively take a sample, this will give you completely wrong numbers.
You can see my approach here: https://github.com/kootenpv/shrynk/blob/6a8675061d82aa65fc3b... Basically the formula calculates the uniqueness on a sample (e.g. 10000 rows), but then extrapolates the result to your actual data size. E.g. it finds 100 unique values in a sample of 10000, but in reality that means you maybe have 500 unique values.
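For illustration, a simplified sketch of that idea (this is not shrynk's exact formula, just linear extrapolation of sample uniqueness, capped at the data size):

```python
import random

def estimate_unique(values, sample_size=10000, seed=0):
    """Estimate the number of unique values in a column by counting
    uniques in a random sample and extrapolating linearly to the
    full size. (Simplified sketch; shrynk's actual formula differs.)"""
    n = len(values)
    if n <= sample_size:
        return len(set(values))
    sample = random.Random(seed).sample(values, sample_size)
    u = len(set(sample))
    # e.g. 100 uniques in a 10k sample of 50k rows -> estimate 500
    return min(n, round(u * n / sample_size))
```

So a fully unique column of 50,000 rows shows 10,000 sample uniques and extrapolates back to 50,000, instead of wrongly reporting 10,000.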
Running the compression algorithms is O(kn) where k is the number of compression algorithms. Taking the machine learning approach is O(n).
As the author noted after you posted this, it’s not a given that the ML algorithm is O(n). It may be constant time (by looking only at column headers and a sample of data, say).
That said, I was really more interested in practical runtimes. Like, in practice the ML may have a high startup cost (e.g., due to cost of loading a model), whereas for most sized datasets linear complexity may be fine...
Models are cached and not large so the setup time is very low. I'll time it when I get to a PC
Isn't one of the benefits of time-series databases more compact storage at minimal overhead?
Definitely! I really optimized for "no development time spent" and was just using pandas to extract html tables into csv and just store them :-). 2 lines of code really. I had no idea I would have it running for so long.
It was really just the example that made me wonder why I have to consider which compression would be best for files with my characteristics - but not saying it was best practice to begin with haha!
Does anyone know what is the leading time-series database people use today? Like, eg, bar / tick data.
(I use PostgreSQL, due to my ignorance.)
If you can model the domain, then you can achieve very good compression.
See this recent timescaledb post (since you mentioned Postgres) which goes over an array of techniques used in column-store databases that general-purpose compressors on a csv file full of data would not be able to match:
https://blog.timescale.com/blog/building-columnar-compressio... discussion: https://news.ycombinator.com/item?id=21412596
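As a toy illustration of the kind of domain modeling described there (a sketch, not TimescaleDB's actual code): regularly spaced timestamps collapse to tiny, repetitive deltas, which any general-purpose compressor then squeezes far better than the raw absolute values:

```python
def delta_encode(xs):
    """Store the first value plus successive differences."""
    return xs[:1] + [b - a for a, b in zip(xs, xs[1:])]

def delta_decode(ds):
    """Invert delta_encode by cumulative summation."""
    out = ds[:1]
    for d in ds[1:]:
        out.append(out[-1] + d)
    return out

ts = [1573000000, 1573000010, 1573000020, 1573000030, 1573000045]
print(delta_encode(ts))  # [1573000000, 10, 10, 10, 15]
```

Column stores stack techniques like this (delta-of-delta, run-length, dictionary encoding) per column, which is why they beat a general-purpose compressor run over a CSV.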
I've been happy with influxdb, but I've also noticed that a lot of people switch to a time series database way before they need to. You can go pretty far with postgres/oracle/sql server, and it has the advantage you don't need to manage different databases.
Thanks for the response. I've avoided influxdb because it drops data. Bar and tick data needs to be ACID or at least very close to ACID.
InfluxDB is great for server usage stats though.
I don't know much about bar or tick data. Mind going into more detail about dropping data?
We didn't really have a need for transactions because we were just recording a bunch of sensor data from oil and gas wells from the around the world and then running computations and then displaying and alerting on the results.
Bar and tick data is financial, so dealing with numbers. It has to be precise, like not 0.0000001 off and dropped data isn't the end of the world, but it's pretty bad.
I've never used it myself, but kdb+ is meant to be excellent if you need it.
There's TimescaleDB built on top of PostgreSQL.
I'm a long time fan of your blog :O
Me too! A few months ago I made detailed notes on how to set up your own: https://github.com/shawwn/wiki
Gwern was kind enough to assist by sending over the exact version numbers of all the Haskell libraries it depends on, and answering some questions about deployment. The version numbers turned out to be crucial to getting everything running.
IMO https://www.gwern.net/ is the ideal combination of style + ease of use (for the writer) + effective ways of organizing knowledge.
The whole thing is hosted out of an S3 bucket, so there's no server to manage and zero downtime. I've wondered if it'd be possible to use github pages for this purpose, since that would make it completely free. But it only takes a couple hours of work to get everything up and running. The biggest delay is waiting for haskell to compile all the libraries.
Yea, look at jekyll in combination with github pages. You can see my blog for example (https://vks.ai), the code is hosted here: https://github.com/kootenpv/kootenpv.github.io
The downside with switching over to Jekyll is that gwern.net is increasingly dependent on Haskell integration with Pandoc, which would be difficult if you were compiling with another language: you'd have to refactor all of the additional rewrite passes (for interwiki scripts, image dimension specification, affiliate links, link annotations, inflation adjustment...) to make it at all possible, and you'd be using a lot of Haskell anyway. So it goes - the Law of Equivalent Exchange.
Thanks. I think it's a nice point in design space. Sort of a Art Deco Modernist minimalism, perhaps? But with rich features and excellent performance.
We did a fun experiment in Chapter 5 of Classic Computer Science Problems in Python. We ran a genetic algorithm to find what order of a list (assuming the order doesn't matter) of names led to the best compression using one of the Python standard library's built-in compression algorithms. This article has made me interested in re-running the experiment with multiple compression algorithms.
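Not the book's code, but the idea can be sketched in a few lines with zlib as the fitness function (an elitist toy GA with swap mutations; all parameters made up):

```python
import random
import zlib

def compressed_size(names):
    """Fitness: compressed byte size of the joined list."""
    return len(zlib.compress("\n".join(names).encode()))

def evolve_order(names, generations=100, pop_size=20, seed=0):
    """Toy elitist genetic algorithm: evolve a permutation of
    `names` that zlib compresses well, using swap mutations."""
    rng = random.Random(seed)
    # seed the population with the original order plus random shuffles
    pop = [list(names)] + [rng.sample(names, len(names))
                           for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=compressed_size)
        survivors = pop[: pop_size // 2]  # elitism: keep the best half
        children = []
        for parent in survivors:
            child = parent[:]
            i, j = rng.randrange(len(child)), rng.randrange(len(child))
            child[i], child[j] = child[j], child[i]  # swap mutation
            children.append(child)
        pop = survivors + children
    return min(pop, key=compressed_size)
```

Because the original order is in the initial population and the best individuals always survive, the result is never worse than the input order; grouping names with shared prefixes is typically what wins.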
Why not use ML to build a statistical model of your data, and then use that model to compress it? You will likely get better codelengths for your specific dataset than off-the-shelf algorithms can achieve. And if you figure out a good model that describes the fluctuations in the cryptocurrency prices very well, then you can use that model pretty directly to make money via automated trading.
I've got an easy-to-use library for arithmetic encoding in Java, it would be easy to port to other languages: https://github.com/comperical/MirrorEncode
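The core intuition, as a toy sketch (not the linked library's code): the better your probability model fits the data, the shorter the ideal codelength, and an arithmetic coder gets within a couple of bits of that bound:

```python
import math
from collections import Counter

def codelength_bits(tokens, model):
    """Ideal (Shannon) codelength in bits of `tokens` under a
    probability `model` mapping symbol -> probability."""
    return sum(-math.log2(model[t]) for t in tokens)

tokens = "up up up down up up sideways up".split()
uniform = {t: 1 / 3 for t in set(tokens)}
fitted = {t: c / len(tokens) for t, c in Counter(tokens).items()}

# the fitted model assigns short codes to frequent symbols
print(codelength_bits(tokens, uniform))  # ~12.68 bits
print(codelength_bits(tokens, fitted))   # ~8.49 bits
```

Swap the frequency table for an ML model that predicts the next symbol from context and you have exactly the "model your data, then entropy-code it" scheme described above.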
How does it compare to context mixing algorithms such as https://en.wikipedia.org/wiki/ZPAQ ?
shrynk is not doing any compression itself, it's using ML to do a "meta" approach: it computes features on your data like how many rows, how many columns, how much duplication etc, and then predicts which existing compression algorithm (out of the available compressions) will be best given your requirements/preferences in terms of size, write time and read time.
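A hypothetical miniature of that pipeline (the features, benchmark values and algorithm names are all made up here, and a 1-nearest-neighbour lookup stands in for shrynk's trained classifier):

```python
import math

def features(rows):
    """Cheap features of a table given as a list of row tuples."""
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    dup_frac = 1 - len(set(rows)) / n_rows if n_rows else 0.0
    return [math.log10(n_rows + 1), n_cols, dup_frac]

# (features, winning algorithm) pairs from earlier benchmark runs --
# entirely invented for this sketch
benchmarks = [
    ([3.0, 3, 0.0], "csv+gzip"),
    ([6.0, 3, 0.6], "parquet+snappy"),
    ([2.5, 20, 0.9], "csv+bz2"),
]

def predict(rows):
    """Predict the best compression from the nearest benchmarked case."""
    f = features(rows)
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(benchmarks, key=lambda fb: dist(fb[0], f))[1]
```

The real tool trains a classifier on many benchmarked files and weights size/write/read per your preferences, but the shape is the same: features in, algorithm name out, no compression run at prediction time.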
Is it possible to run data through a DNN, making sure the output is the same as the input (or close for lossless data), then take an autoencoder variant, record the 'compressed' data, then on the other end have the other half of the DNN to decompress it?
I'm pretty ignorant on the topic, so I get that that may be off, but if so, why wouldn't that be a valid solution to compressing data?
Sure, and you can add some correction bits to fix the lossy encoding of the autoencoder.
The problem is that you have to encode/send the DNN itself, otherwise your receiver won't know how to decode the data. If you are not smart, the added codelength of the DNN will likely blow away your savings. If you are smart, this leads to a whole formulation of machine learning called MDL (minimum description length).
Autoencoders have lossy compression.
Indeed, but this is for lossless compression :)
Funny, it deeply resembles the last episodes of "Silicon Valley"
More seriously, the generalized compression algorithm can be one of the keys or even one of the definitions of generalized artificial intelligence
And Kolmogorov complexity was present via a subtle reference in the very last episode...
The wording could be improved to make it easier to understand more quickly what is being offered.
This is using machine learning to predict, for the given heterogeneous tabular data, which compression algorithm will yield the best compression ratio.
There are multiple other ways to do this approximation.
From the data shown it's not obvious that some of the options would ever be best, would be interesting to see if any of the non csv options ever beat the csv options and for what type of data.
I can give you this... the larger the data the more useful parquet and compression on it will be...
From the brief description, it seems like Parquet is just a weaker version of what a decent LZ-style compressor will do anyway.