Improved chess game compression (2018)

lichess.org

109 points by psuter 6 years ago · 35 comments

dzdt 6 years ago

This reminds me of how, many years ago, I learned about the world record holder for computer optical character recognition (OCR) accuracy.

The computer scientists took as a target an eastern European chess journal which printed move-by-move reports of tournament chess matches. They incorporated a crude chess engine in the recognition step, estimating the likelihood of possible next moves and combining that with the OCR engine's estimate of how likely the printed characters were to be particular glyphs. Despite the very low quality of the printing, the journal had very high quality editing, so the source material was self-consistent. Completely illegible characters could mostly be filled in as the sensible game moves that were allowed. It took hundreds of hours of human review time to find a single OCR mistake from this process!
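
The combination described here is essentially a product of two scores: the OCR engine's confidence in each candidate reading and a chess engine's prior over plausible next moves. A minimal sketch with invented numbers, not the original system:

```python
# Hypothetical example: suppose the glyphs could plausibly read "Nf3", "Nf1"
# or "Nh3", but only one of those is a sensible move in the position.
# Score each candidate by OCR confidence times the engine's move prior.
ocr_confidence = {"Nf3": 0.30, "Nf1": 0.35, "Nh3": 0.35}  # invented OCR scores
engine_prior   = {"Nf3": 0.80, "Nf1": 0.02, "Nh3": 0.18}  # invented move prior

best = max(ocr_confidence, key=lambda m: ocr_confidence[m] * engine_prior[m])
print(best)  # "Nf3" wins even though it was the weakest pure-OCR reading
```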

  • djaque 6 years ago

    That's really awesome, but it feels like cheating to call it the OCR record holder. It's really OCR + context. However, it would be interesting if you could apply the same idea at the word and sentence level of written language. I'm guessing there are people who already do this.

    • chongli 6 years ago

      > it feels like cheating to call it the OCR record holder

      This is how humans recognize text though. For the most part, humans don’t try to read languages we don’t understand. To deny a computer access to context is like asking a human to transcribe a language they don’t understand.

      • mlyle 6 years ago

        I'm a pretty fast typist, but if you ask me to transcribe Latin-script text of gibberish, or of a language I don't know... not so fast. A lot of it is how much I have to slow down to recognize the characters accurately.

        • Someone 6 years ago

          I would guess your typing also gets significantly slower. Mine does; I’ve gone through periods where I had trouble typing the word "in" because my muscle memory turned it into "int".

  • kaiabwpdjqn 6 years ago

    > It took hundreds of hours of human review time to find a single OCR mistake from this process!

    This stands out to me as improbable. Not in that the error rate could be that low, but in that they actually had humans spend hundreds of hours checking the accuracy of difficult character recognition. How did that happen?

willvarfar 6 years ago

Hmm, a lot of effort to make documents small, but then storing them in MongoDB?!?!

If size and performance are a focus, just store them in a normal sorted table with compression (e.g. LevelDB, or MySQL using RocksDB).

This means all these small documents can be compressed with repetition between games and not just within each game.

And it would probably be much, much faster and simpler, etc.

Basically, the size taken by the database should be at the same kind of level as you get by just having a text file with one line per game, and gzipping the whole thing. I'd expect it to be an order of magnitude smaller than per-document compression.
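
For a rough feel of the difference, here is a minimal sketch (with invented move strings standing in for real games) comparing per-game gzip against gzipping one file containing all the games:

```python
import gzip

# Invented games sharing common openings, standing in for a real corpus.
games = [
    "e4 e5 Nf3 Nc6 Bb5 a6 Ba4 Nf6 O-O Be7",
    "e4 e5 Nf3 Nc6 Bb5 a6 Ba4 Nf6 O-O b5",
    "d4 d5 c4 e6 Nc3 Nf6 Bg5 Be7 e3 O-O",
] * 1000

per_game   = sum(len(gzip.compress(g.encode())) for g in games)
whole_file = len(gzip.compress("\n".join(games).encode()))

# Per-game compression cannot exploit repetition across games;
# whole-file (or page-level) compression can, so it comes out far smaller.
print(per_game, whole_file)
```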

  • hencoappel 6 years ago

    I wonder if you could store them as a trie; I feel like that would give you quite good savings given the commonality of openings. Or at least use a trie for the openings and then apply the Huffman coding to the rest of the game and store that separately.
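
A minimal sketch of the trie idea, with invented opening lines; a real design would still need to decide where to cut over from the shared-prefix trie to per-game coding:

```python
def insert(trie, moves):
    """Insert a move sequence into a nested-dict trie."""
    node = trie
    for move in moves:
        node = node.setdefault(move, {})
    node["#"] = True  # mark the end of a stored line

def count_nodes(trie):
    """Count stored move nodes, ignoring the end-of-line markers."""
    return sum(1 + count_nodes(child) for key, child in trie.items() if key != "#")

trie = {}
insert(trie, ["e4", "e5", "Nf3", "Nc6", "Bb5"])  # Ruy Lopez
insert(trie, ["e4", "e5", "Nf3", "Nc6", "Bc4"])  # Italian Game
insert(trie, ["e4", "c5", "Nf3", "d6"])          # a Sicilian line

# 14 moves were inserted, but shared prefixes mean only 9 nodes are stored.
print(count_nodes(trie))
```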

    • corey_moncure 6 years ago

      Right. Every possible chess game, as a sequence of legal positions, can be uniquely enumerated. Per Wikipedia, the game-tree complexity is around 10^123, so you'd need about a 124-digit number to represent them. This is as dense as it can possibly get.

      On the other hand, lichess supports alternative chess modes and starting boards that may or may not be reachable from the standard initial configuration and legal moves, so this won't work for those use cases.
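
A quick sanity check of the arithmetic, taking the ~10^123 game-tree complexity figure at face value:

```python
import math

# ~10^123 distinct games needs about 124 decimal digits, i.e. roughly 409 bits
# (about 52 bytes) per game index, before any attempt to give likely games
# shorter codes than unlikely ones.
print(len(str(10**123)))               # 124 decimal digits
print(math.ceil(123 * math.log2(10)))  # 409 bits
```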

  • giovannibonetti 6 years ago

    Postgres TEXT columns are pretty good for compression as well [1]

    [1] https://www.postgresql.org/docs/current/storage-toast.html

  • gowld 6 years ago

    Games are quite small -- under 50 turns, each under 2 bytes in binary, but more if you start with a simple text language before compression. You need to compress each game individually. Would database-wide compression work?

    • willvarfar 6 years ago

      You do not need to compress every game individually. Behind the scenes, your mainstream database is storing the data in pages, and these pages - containing many adjacent rows - can be compressed.

      The compression level achieved at the page level is much higher than compressing rows individually, because there is lots of repetition between rows. It also speeds up I/O and other good things. It is almost always a win, which is why most modern database storage engines do it, and do it by default.

      Consider a csv file, and compare it to the same data stored as json objects, one row per line. The uncompressed json file is going to be much bigger, as the columns are repeated in every line. But both files gzip to much the same size, because all those keys are repeated again and again and the two files have basically the same entropy.

      On the other hand, compressing each line in the file individually would be a poor choice, giving relatively poor gains.

      There were database engines that did row-level compression, but these performed poorly, and I know of nobody who used e.g. InnoDB compression.
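
A small sketch of the csv-versus-json comparison, with invented rows; the point is only that the repeated keys cost very little once the whole file is compressed together:

```python
import gzip
import json

# Invented rows; only the "moves" field varies from row to row.
rows = [{"white": "playerA", "black": "playerB", "result": "1-0", "moves": 40 + i % 60}
        for i in range(10000)]

csv_text = "white,black,result,moves\n" + "\n".join(
    f"{r['white']},{r['black']},{r['result']},{r['moves']}" for r in rows)
jsonl_text = "\n".join(json.dumps(r) for r in rows)

# Raw, the JSON file is several times larger because every key repeats on
# every line; gzipped, the two end up in the same ballpark.
print(len(csv_text), len(jsonl_text))
print(len(gzip.compress(csv_text.encode())), len(gzip.compress(jsonl_text.encode())))
```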

    • SilasX 6 years ago

      I see several levels of compression:

      1) First, as you mentioned, the limited symbol table for expressing all the moves.

      2) But not every move is possible; you just need a way to order all the possible moves and then say the nth of those moves.

      3) Even better: not all of those moves are equally likely. You can define (n cycles of) a chess playing engine that ranks all moves and reserves shorter symbols for the best moves (as judged within that time frame). Players tend to select higher ranked moves, so that’s another regularity to exploit.

      Of course, 3) comes at the cost of compression/decompression time.
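
A rough sketch of idea 3), using the third-party python-chess package and an invented ordering heuristic (captures first, then an alphabetical tiebreak); the list of ranks it produces is what a Huffman coder would then compress, with the shortest codes reserved for the most common ranks:

```python
import chess  # third-party package: python-chess

def ranked_moves(board):
    # Crude stand-in for a real ranking: captures first, then a deterministic
    # alphabetical tiebreak on the UCI string.
    return sorted(board.legal_moves, key=lambda m: (not board.is_capture(m), m.uci()))

def move_ranks(san_moves):
    """Replay a game and record each played move's rank under the heuristic."""
    board = chess.Board()
    ranks = []
    for san in san_moves:
        move = board.parse_san(san)
        ranks.append(ranked_moves(board).index(move))
        board.push(move)
    return ranks

# Scholar's mate as a toy example.
print(move_ranks(["e4", "e5", "Qh5", "Nc6", "Bc4", "Nf6", "Qxf7#"]))
```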

thom 6 years ago

I have always loved just how much of computer science you could learn by doing nothing but working on chess your entire life.

SethTro 6 years ago

The article doesn't say it anywhere but I sure hope all the games are backed up to disk/tape in txt format and that this is just used to keep in-memory DB/memcache size down.

Otherwise an interesting article about injecting domain knowledge to improve beyond what's possible with a normal compression algorithm.

jameshart 6 years ago

The basis of this algorithm is to rank the possible moves from the current position, then use that ranking to choose a Huffman encoding. In essence, they use a very naive single-move-lookahead chess AI to quickly rank moves, giving them a crude measure of how ‘surprising’ a particular move would be at that point in the game.

Interesting question: If you just generated the bit string that corresponded to taking the ‘most obvious move’ according to their heuristics, what game would that play out? In a way, that would be the ‘most obvious chess game’, or perhaps the ‘least creative chess game’...

In theory, a better version would use a more sophisticated AI to rank the next moves, and even choose how aggressively to Huffman-code the possible options.

In a sense this would measure the information content of a chess game (at least relative to the ‘knowledge’ of that AI). I wonder what the variance in the number of bits needed to encode various games would tell you about the play style of different players, or the nature of the game, or even the relationship between information theory and creativity...
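
As a thought-experiment sketch of the ‘most obvious game’ question: always play the top move of some fixed heuristic (an invented one here, not the article's ranking) and see what comes out. Uses the third-party python-chess package:

```python
import chess  # third-party package: python-chess

def top_move(board):
    # Invented stand-in heuristic: prefer captures, then a deterministic
    # alphabetical tiebreak on the UCI string.
    return min(board.legal_moves, key=lambda m: (not board.is_capture(m), m.uci()))

board = chess.Board()
while not board.is_game_over(claim_draw=True) and board.fullmove_number <= 60:
    board.push(top_move(board))

# Whatever this prints is the "most obvious game" relative to this heuristic;
# under a coder built from the same ranking it would compress to very few bits.
print(board.fen())
print(board.result(claim_draw=True))
```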

jhrmnn 6 years ago

This nicely illustrates the point of the Hutter Prize, that efficient compression comes from understanding the content on a nontrivial level.

https://en.wikipedia.org/wiki/Hutter_Prize

6510 6 years ago

The 'too slow' approach is fun. If you freeze and use a reasonable engine, a bit string could represent which moves match the engine's move (let's call it block 1). The non-matching moves could be 3-bit numbers: 000 for the second-best engine move, 001 for the 3rd, 010 for the 4th, 011 for the 5th, 100 for the 6th, 101 for the 7th, 110 for the 8th, and 111 for other moves (block 2). The other moves are simply written out, like E4, E24 or NF3G5 (block 3).
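
A minimal sketch of this three-block layout; `ranked_moves` below is a stand-in for the frozen engine (any deterministic function from the move history so far to an ordered candidate list), and moves are plain strings:

```python
def encode(moves, ranked_moves):
    block1, block2, block3 = [], [], []
    history = []
    for move in moves:
        ranking = ranked_moves(history)        # frozen engine's ordered candidates
        if ranking and move == ranking[0]:
            block1.append("1")                 # matched the engine's top move
        else:
            block1.append("0")
            if move in ranking[1:8]:
                # 000 = 2nd-best engine move ... 110 = 8th-best
                block2.append(format(ranking.index(move) - 1, "03b"))
            else:
                block2.append("111")           # escape: move spelled out in block 3
                block3.append(move)
        history.append(move)
    return "".join(block1), "".join(block2), block3

# Toy usage with a fake "engine" that always suggests the same candidates:
print(encode(["e4", "c5", "Nf3"], lambda history: ["e4", "d4", "Nf3", "c4"]))
# -> ('100', '111001', ['c5'])
```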

  • dmurray 6 years ago

    For optimum compression with this approach, you'd use an engine that generated the most likely moves (perhaps given the time control and the players' ratings, since Lichess stores those anyway), not the strongest moves. That might not look much like a regular chess engine.

  • thegeomaster 6 years ago

    In that case decompression might use significant CPU time, the cost of which can offset the storage savings.

steerablesafe 6 years ago

As I understand it, the same Huffman code is used for all moves, including the opening moves. Alternatively, statistics could be gathered for all first moves, all second moves, etc., and then a different Huffman code could be applied to the opening moves.

I wouldn't be surprised if the statistics for the first few moves were significantly different from those for moves deep into the game.
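
A sketch of the per-ply idea with invented games: gather a separate frequency table for each ply and build a separate Huffman code from each (only the code lengths are computed here):

```python
import heapq
from collections import Counter, defaultdict

def huffman_code_lengths(freqs):
    # Standard Huffman construction, returning only each symbol's code length.
    heap = [(count, i, [sym]) for i, (sym, count) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2][0]: 1}
    lengths, tiebreak = defaultdict(int), len(heap)
    while len(heap) > 1:
        c1, _, syms1 = heapq.heappop(heap)
        c2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:
            lengths[sym] += 1                  # one level deeper in the tree
        heapq.heappush(heap, (c1 + c2, tiebreak, syms1 + syms2))
        tiebreak += 1
    return dict(lengths)

# Invented toy corpus; real statistics would come from the game database.
games = [
    ["e4", "e5", "Nf3"], ["e4", "c5", "Nf3"], ["e4", "e5", "Bc4"],
    ["d4", "d5", "c4"], ["e4", "e6", "d4"],
]

per_ply = defaultdict(Counter)
for game in games:
    for ply, move in enumerate(game):
        per_ply[ply][move] += 1

for ply in sorted(per_ply):
    print(ply, huffman_code_lengths(per_ply[ply]))  # one code table per ply
```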

  • jameshart 6 years ago

    Probably worth just having short Huffman codes specifically to encode particular opening sequences, before falling back to per-move coding.

gowld 6 years ago

> Our database runs on 3 servers (1 primary, 2 secondaries), each with two 480GB SSDs in a RAID 1 configuration

Algorithmically cool, but quite a lot of work to save <$500/yr in hardware for a handful of 1TB SSDs.

Converting the data to the new format cost more than upgrading disks.

  • hinkley 6 years ago

    In the conclusion they mentioned rate of growth as well.

    If the rate of growth in disk usage is at or below the rate of growth in mid-tier SSD capacity, then yes, it's $500/yr. If you are growing faster than that, then an improvement might be saving you $350/yr + $150/yr², and a wall in the future might be pushed out for years.

    With your Cloud provider, it might be hard to get more disk, memory, or CPU without paying for more of the other two. Also, many real organizations and vendor agreements are full of the most stupid artificial constraints. Just because someone else can build a 1PB SSD storage array doesn't mean that any of my coworkers will be allowed to build one any time soon. CAPEX austerity measures are among the most frustrating penny-wise pound-foolish policies we have to deal with.

SamReidHughes 6 years ago

Nice. A touchy balance between overengineering and too much overengineering. Maybe you could better compress combinations of moves, like exchanges.

If you also store move times, there’s not much of a win to this.

bumbledraven 6 years ago

The article includes a brief but clear explanation of Huffman coding. I didn’t understand how it works until now!
