Kolmogorov Complexity and Compression Distance (2023)

smunshi.net

149 points by rgbimbochamp 2 years ago · 105 comments

tromp 2 years ago

> let’s assume that there exists a universal language U

Why not specify it?

> That gives us the true language-agnostic definition of Kolmogorov Complexity as follows:

Choosing the language of Turing Machines does not make the definition language agnostic.

Aiming for the simplest definition of description complexity, I instead based my definitions on the older computational model of lambda calculus in [1].

Unlike the assumed UTM above, the universal lambda machine is easy to describe in detail:

    (λ 1 1) (λ λ λ 1 (λ λ λ λ 3 (λ 5 (3 (λ 2 (3 (λ λ 3 (λ 1 2 3))) (4 (λ 4 (λ 3 1 (2 1)))))) (1 (2 (λ 1 2)) (λ 4 (λ 4 (λ 2 (1 4))) 5)))) (3 3) 2) (λ 1 ((λ 1 1) (λ 1 1)))
Furthermore, it allows almost identical definitions of several variants of descriptional complexity, namely

1) plain complexity

2) prefix complexity

3) monotone complexity

all of which have their application in Algorithmic Information Theory [2].

[1] https://gist.github.com/tromp/86b3184f852f65bfb814e3ab0987d8...

[2] https://homepages.cwi.nl/~paulv/kolmogorov.html

  • canjobear 2 years ago

    The whole point of Kolmogorov complexity is that description lengths under different Turing-complete description languages (such as UTM and lambda calculus) are only different up to a constant that depends on the languages and not on the thing being described.
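
    For reference, the usual formal statement of this (the invariance theorem) is that for any two universal description languages U and V there is a constant c_{U,V}, depending only on U and V and not on the string being described, such that

        $$ |K_U(x) - K_V(x)| \le c_{U,V} \qquad \text{for all strings } x. $$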

    • uoaei 2 years ago

      The whole point of Kolmogorov complexity is that there exists some language for minimal description length of an arbitrary program and you compare optimal descriptions across languages. In other words, the point is to explicitly consider the choice of language as part of the encoding scheme that needs describing. That choice is included as part of the description whose length is being measured.

      • data_maan 2 years ago

        Is that really true? You could just take whatever you want to describe to be part of the language and arrive trivially at a description length of 1...?

        • smilliken 2 years ago

          I have an encoding improvement that can reduce the length of that program by 100% to an unbeatable 0 symbols (patent pending).

        • uoaei 2 years ago

          How long is the description length of the language...?

  • asplake 2 years ago

    > let’s assume that there exists a universal language U such that it always gives us the shortest description length for all strings.

    Read on a bit and it looks like proof by contradiction:

    > However, let’s bring back the paradox we discussed above. According to that paradox, U cannot exist or U cannot provide shorter descriptions than every arbitrary L.

  • rhelz 2 years ago

    There is another, more insidious, problem with trying to give a language-agnostic definition: different languages will have different symbols that can be output.

    If you have a Turing machine which can only print out binary digits, then it can't print out a Chinese character, no matter how long the input program is.

    Yeah, you can do something like Unicode, and associate a binary string with each Chinese character--but printing out that binary string is not printing out the Chinese character. It's printing out a binary string.

    In particular, your lambda-calculus based turing machine cannot print out chinese characters. It therefore cannot be used to define a universal complexity for any string.

    • AnotherGoodName 2 years ago

      This is quite misguided, as you seem to think the alphabet for Shannon entropy or Kolmogorov complexity is anything like what we usually mean by an alphabet.

      Did you know the best compression methods out there all have a variable-length (measured in bits) alphabet? E.g. Dynamic Markov Coding will start with just '0' and '1' and then predict the next bit, but as it sees more symbols it will extend this to single characters (so see 'a' or 'b' and predict the next bit). They'll then continue as they learn more of the file, and their alphabet will essentially include common letter pairs, then words and entire common phrases.

      This is actually a commonly missed aspect of Shannon entropy. A file of '0111' repeated will give you a different result if you consider a 1-bit alphabet of 25% '0' and 75% '1' than a 4-bit alphabet of 100% '0111'. No one in the real world is using the character frequencies of English characters as a measure of Shannon entropy or Kolmogorov complexity. No algorithm expects that. They all work at the binary level and they will try to adjust the symbol lengths of the alphabet to common sequences to achieve the best result.
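
      A small Python sketch of that point (the helper below is just illustrative, not from any library):

          from collections import Counter
          from math import log2

          def shannon_entropy(data, symbol_len):
              # Split the file into fixed-length symbols and apply -sum(p * log2(p)).
              symbols = [data[i:i + symbol_len] for i in range(0, len(data), symbol_len)]
              counts = Counter(symbols)
              total = len(symbols)
              return -sum((c / total) * log2(c / total) for c in counts.values())

          data = "0111" * 1000
          print(shannon_entropy(data, 1))  # ~0.81 bits/symbol: alphabet {0, 1}, p = 0.25/0.75
          print(shannon_entropy(data, 4))  # 0.0 bits/symbol: alphabet {"0111"}, p = 1.0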

      This is in fact the reason Kolmogorov complexity is used rather than Shannon entropy. Shannon entropy doesn't tell you how to define an optimal alphabet. That part is actually undefinable. It just tells you what to do if you have that already. Kolmogorov complexity says more completely 'find the optimal alphabet and the symbol probabilities and make a minimal sequence from that'.

      Different human languages don't figure into this at all and are completely irrelevant.

      • rhelz 2 years ago

        > Different human languages don't figure into this at all and are completely irrelevant.

        Back to basics: A Turing machine is specified by a set of symbols it can read/write to a tape, and a state machine matching current state and read symbol to next state and actions.

        If that set of symbols is just {1,0}, then it absolutely, positively, cannot print out the string "ABC".

        > the best compression methods out there all have a variable-length (measured in bits) alphabet.

        This is a category error....if the compression algorithm reads and writes binary data, its alphabet is "zero" and "one." The symbols read and written by a Turing machine are atomic--they are not composed of any simpler parts.

        Sure, external to the Turing machine, you can adopt some conventions which map bit patterns to letters in a different alphabet. But a binary Turing machine cannot print out those letters--it can only print out a string of 1's and 0's.

        The mapping from strings in the binary language to letters in the target language cannot participate in the calculations for the complexity of a string in another language. Because if it did, again, you could make the Kolmogorov complexity of any arbitrary string S you choose to be 0, because you could just say the null output maps to S.

        This is a subtle problem, often glossed over or just missed entirely. We are so used to encoding things in binary that it might not occur to us unless we think about it deeply.

        Nevertheless, it is a real, genuine problem.

        • AnotherGoodName 2 years ago

          Just because a Turing machine prints out 0 and 1 at each step doesn't mean the sequences it factors into the calculation of what to print out next can't be longer binary sequences.

          Pretty much all the best compression methods are language agnostic and work on bitwise sequences. They also pretty much all predict the next bit and feed that into an arithmetic coder.

          E.g. look up Dynamic Markov Coding, which is commonly used by Hutter Prize winners. The paper is short and readable. It dynamically creates a binary tree as binary sequences are seen, so if the pattern '01101000 01100101' comes in it walks down the binary tree. It'll probably predict the next bit as '0', since '0110100001100101' just so happened to be a common sequence in English that will likely have a next bit of '0', but the Dynamic Markov Coding model has no idea of that. It just has binary sequences of bits and a prediction of the next bit given that.

          Likewise it can continue reading the bits of the file in and walking down its binary tree, where a history of the next bit is stored in every node, and it sees '111001001011110110100000'. It makes a prediction of the next bit as a likely '1' that it feeds into an arithmetic coder, which takes predictions and forms an optimally minimal bit sequence from them. That second binary sequence forms part of 你好.

          In both cases the Turing machine doesn't care about that. It's also just writing 1's and 0's as per a Turing machine. Eventually those 1's and 0's form sequences that happen to map to characters in various languages but it doesn't care about that.
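
          A toy Python sketch of that kind of bit-level context model (not DMC itself, just the general flavour; the function name is made up):

              from collections import defaultdict

              def next_bit_probabilities(bits, context_len=8):
                  # For each position, estimate P(next bit = '1') from how often '1'
                  # followed the same recent bit-context earlier in the stream.
                  counts = defaultdict(lambda: [1, 1])  # Laplace-smoothed [zeros, ones]
                  probs = []
                  for i, b in enumerate(bits):
                      ctx = bits[max(0, i - context_len):i]
                      zeros, ones = counts[ctx]
                      probs.append(ones / (zeros + ones))  # an arithmetic coder would consume this
                      counts[ctx][int(b)] += 1
                  return probs

              # '01101000 01100101' is the ASCII bit pattern for "he", as in the example above.
              probs = next_bit_probabilities("0110100001100101" * 50)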

          >The mapping from strings in the binary language to letters in the target language cannot participate in the calculations for the complexity of a string in another language. Because if it did, again, you could make the Kolmogorov complexity of any arbitrary string S you choose to be 0, because you could just say the null output maps to S.

          One other thing to address here is that Kolmogorov complexity explicitly includes any dictionary you use in its calculation. A dictionary of the file you wish to compress would just blow out your Kolmogorov complexity to that size exactly. That's why Kolmogorov complexity is an excellent tool. You explicitly cannot cheat in this way.

          • rhelz 2 years ago

            > the Turing machine doesn't care about that. It's also just writing 1's and 0's as per a Turing machine.

            But a Turing machine does not have to be restricted to just printing out zeros and ones. It can use any finite set of symbols. For example, the Soviets built a computer which used base 3--its symbol set was {-1, 0, 1}. It didn't have bits, which can only store "1" or "0"; it had trits, which can store "-1", "0", or "1".

            And why the Soviets built such a computer is germane to Kolmogorov complexity: you can make shorter strings in base 3 than you can in base 2. The choice of symbol set absolutely impacts the length of strings and programs, and therefore impacts the Kolmogorov complexity of strings relative to the computer.

            With this in mind, please consider three Turing Machines: A, B, and C

            The symbols A prints out on its tape are {"A", "B", "C"}

            The symbols B prints out on its tape are {"0", "1"}

            The symbols C prints out on its tape are {"0", "1", "A", "B", "C"}

            Now consider two strings: "1000001" and "A". Turing machine B could print out "1000001". Turing machine A could print out "A".

            You might be tempted to equate "1000001" and "A". But consider the same two strings printed out by machine C:

            "1000001" and "A"

            Clearly, these are different strings. If C printed out "A", it did not print out "1000001", and vice versa.

            > Kolmogorov complexity explicitly includes any dictionary you use in its calculation.

            Sure, but on a binary Turing machine, like Machine B above, that dictionary is not going to be mapping between binary strings and, say, Roman letters. It's going to be mapping from one binary string to another binary string.

            Using Machine C, you certainly could write a program which inputed "1000001" and output "A". But you absolutely, positively, cannot write such a program in either machines A or B. Machine A cannot print the string "1000001". And B cannot print the string "A".

            Different strings, different things.

            • AnotherGoodName 2 years ago

              All of those examples are exactly equivalent and convertible, but where you are confusing yourself is that you're thinking of the differing states of a given n-ary system as having explicit symbols. It's best to think of it as simply numeric. Binary 010 is 2 if I were to write it in decimal. Ternary 020 is the number 6 if I were to write it in decimal. Etc.

              Any mapping from the numbers of an n-ary system to actual symbols like you are showing here is actually arbitrary, and it is part of the calculation of space when measuring an algorithm's Kolmogorov complexity. The program size in Kolmogorov complexity includes the dictionary for the numeric-to-symbolic lookup, which is what you are getting at here.

              • rhelz 2 years ago

                > where you are confusing yourself is that you're thinking of the differing states of a given n-ary system as having explicit symbols. It's best to think of it as simply numeric.

                But what is "simply numeric"? How do we explain numeric things, and how to compute with them? Uh, we use Turing machines.

                We don't explain Turing machines using numeric concepts, we explain numeric concepts using Turing machines.

                We explain what computation is, by describing how a Turing machine can manipulate symbols on its tape. But that's where the analysis stops. The symbols themselves are the simplest entities, there's nothing simpler than they are to explain them with.

                Nobody who is objecting to me here has ever noticed that the symbols "0" and "1" are atomic, and not further explained. Because that's where they bottom out--explanations have to stop somewhere.

                But you don't have to bottom out at {0,1}. Trinary computers use a different symbol set, {-1, 0, 1}. Saying that "A" can be equated with "1000001" is as silly as saying we could store any of the three values -1, 0, or 1 in a single bit of a binary computer.

                • AnotherGoodName 2 years ago

                  There simply is no symbol set for any machine. That's literally an arbitrary mapping. You can literally call the two states whatever you want. You can call one state the contents of file A and another state the contents of file B if you want. But that mapping has a size, and Kolmogorov complexity very directly says you need to include the size of that mapping. End of story.

                  The reason to think numerically is that there are n^y states for an n-ary machine. You can count them. And you can map each one of those to any other n-ary system.

                  Note that Turing in his paper talked about states; he "simply stated that the machine had certain m-configurations (enumerated) and all m-configurations obtainable". There's no binary machine that outputs 0 or 1 directly except by a mapping of its two states to symbols. Kolmogorov knew this and explicitly stated that any dictionary be included as part of the size measurement.

                  • rhelz 2 years ago

                    Let’s do a concrete example. B is a Turing machine which has 2 symbols, and uses base 2 to encode numbers. N is a Turing machine with n symbols and uses base n to encode numbers.

                    Suppose that machine N can print out string s with a program k symbols long.

                    Machine B cannot literally print out s, in general, because s can contain symbols which B can’t print.

                    If I understand you correctly, you are saying “so what?” We can just encode the n symbols of N in the obvious way, using only the two symbols of B, and then we just program B to emulate N. Right? Do I get your point?

                    If so, well how much longer is a program for the emulated N than the program encoded in N's native language? I calculate it would be longer by a factor of log(n)/log(2).
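
                    A quick numeric check of that factor (Python, purely back-of-the-envelope):

                        from math import ceil, log2

                        def bits_needed(k, n):
                            # A k-symbol program over an n-symbol alphabet takes roughly
                            # k * log2(n) bits under a straightforward fixed-width re-encoding.
                            return ceil(k * log2(n))

                        for n in (2, 3, 10, 256):
                            print(n, bits_needed(100, n))  # 100 symbols -> 100, 159, 333, 800 bits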

                    So, as n grows, the length of the translated program also grows. We can make the complexity measured by B to be as much larger than the complexity measured by N as we wish, just by increasing n, the number of symbols used by machine N.

                    Now, if this were just a constant factor—if the discrepancy was just O(1) more, I would agree with you. Yeah, sure we need a map, but that's just a constant bit more of information…oh wait, the size of the map is going to grow at O(log(n)) as well…

                    O(1) we can sweep under the rug. O(log(n)) we can’t, because this means there is a trade-off between the complexity of the Turing machine and the complexity it assigns to strings.

                    A more complex Turing machine will systematically assign less complexity than a simple one does.

                    I mean in retrospect, this is obvious: I can reduce the complexity of any number to 1, just by using that number as one of the symbols in my machine. Pi has infinite digits and can't be printed out? No problem, just include "pi" as one of your symbols and you can print it in one character.

                    This is easy to miss, because very quickly most presentations of Kolmogorov complexity settle in on a binary language. Then they prove, say, an invariance theorem, saying, e.g., that any two Turing machines will assign the same complexity up to O(1)…

                    But that is if you hold n constant. If you can vary n, then a tacit presupposition of those theorems is violated. Those theorems no longer hold.

                    If you have a black and white monitor, you can’t display a green pixel. Sure, you can use dithering—assign different bit patterns to each color. But then you need a bigger monitor, because the dithered pixel patterns are more than 1 bit long.

                    Similarly, a binary machine can only print two symbols. It can’t print 3 symbols, any more than a black and white pixel can display a green pixel.

                    And even if we relax the requirement, and say it doesn’t have to be literally a green pixel, or literally a third symbol (“literally” is a most appropriate word here) we find we need more pixels, or longer programs. And the more colors we want to dither, the more pixels we need.

                    I hope this clarifies; I’d be happy to answer any questions.

            • kaba0 2 years ago

              You can trivially convert any such Turing machine to another simply by adding n new states to it that map a letter of the previous alphabet to the new one's.

              With all due respect, you are misunderstanding something/bring something up with no relevance to complexity.

              • rhelz 2 years ago

                chuckle Ok, give me the benefit of the doubt for 10 more minutes, and follow my argument like a leopard stalking its prey...

                Sure, you could use a binary Turing machine, and have it simulate, say, a ternary or a decimal Turing machine. You could do this just by mapping bit patterns to the symbols the other Turing machines have. E.g. "3" could be mapped to "00000011".

                But here's one thing nobody who has objected to me so far has noticed: why don't you feel it's necessary to explain how the symbols "0" and "1" on a binary machine work---say, by using another map to another Turing machine? Or by any other means? Why do people just stop there?

                Take the simplest possible question we could ask about "0" and "1": what explains why the symbol "0" is different from the symbol "1"?

                Cf. with the question why is "3" different from "4"? I mean, we could explain why "3" isn't the same symbol as "4", because a binary turing machine would map "3" and "4" to the binary strings "011" and "100" and these are different binary strings, because "0" is different from "1".

                But what explains why "0" is different from "1"? Absolutely nothing explains it!!

                On pain of infinite regress, explanations have to stop somewhere, and for Turing machines, this is where it stops. A set of symbols is specified, but those symbols are atomic, not decomposable into any simpler or more basic symbols, and not further explainable.

                Everything else is explained in terms of these symbols and how they are manipulated by the Turing machine. There is nothing more simple, or more basic, than these symbols to explain these symbols.

                Try thinking about it going the other way: Suppose you were trying to explain how the symbols "0" and "1" behaved on a binary turing machine---by emulating it with a decimal turing machine!! You certainly could say something like "Well, the binary turing machine knows that "0" is different from "1", because we have this dictionary which maps "0" and "1" on the decimal turing machine to "0" and "1" on the binary turing machine. And since the decimal turing machine knows that "0" is different from "1", well, that's how the binary turing machine knows it....

                See what I'm getting at? Sooner or later you get to the axioms. The unexplained explainers. The atoms out of which everything else is constructed. I know you are wishing I'd get to the point, but please stick with me :-)

                If a turing machine does not have a symbol in its symbol set, it cannot print that symbol out on the tape, any more than a binary computer can store "-1" in a single bit. Just like bits and trits are the smallest, most elementary storage units for binary and ternary computers, the cell on the turing machine tape is the simplest, most elementary storage unit, and the only things it can store are what are specified in the symbol set.

                Now just like numbers written in decimal are systematically shorter than numbers written in binary, Turing machines which use 10 symbols have systematically shorter programs than Turing machines which use 2 symbols. This has obvious ramifications for Kolmogorov complexity.

                When applying Kolmogorov complexity, we typically do limit ourselves to an alphabet of {0,1}. But it is important to realize that this is a tacit assumption in most of the theorems. For example, you can prove that for any two turing machines, U and V, KU(s) <= KV(s) + O(1).....but this theorem relies on the fact that if U and V are universal turing machines, there is an O(1)-length program for V which emulates U, and vice versa. **but this is only true if U and V share the same symbol set**. KU(s) and KV(s) can be made arbitrarily different from each other if they don't have the same symbol set.

                And it's not even just a theoretical, esoteric point: Knuth has argued for ternary computers, just because they can systematically express things using shorter strings of trits than a binary computer can with strings of bits.

                There is more than one "knob we can turn" when it comes to Turing machines: one "knob" is which program we feed it, another knob is the number of symbols it uses. By making one bigger, we can make the other one smaller. Program size isn't the only thing that matters (as my wife likes to reassure me...)

        • tromp 2 years ago

          > cannot print out the string "ABC"

          When you write a program in any modern programming language to print "ABC", that program merely outputs the 3 bytes 01000001 01000010 01000011, ASCII codes which your console window (not the program you wrote) decides to display graphically as those characters.
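
          A quick Python illustration of those three bytes:

              data = "ABC".encode("ascii")               # the bytes a program actually emits
              print([format(b, "08b") for b in data])    # ['01000001', '01000010', '01000011']
              print(data.decode("ascii"))                # the terminal turns them back into ABC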

          So any machine outputting bits can be said to print out "ABC" just as well.

          Furthermore, your comment above was transmitted by your browser to Hacker News not as English characters but as those very bits, and everyone reading your comment was receiving those bits. The way that the browser decides to display those bits does not change their character.

          • rhelz 2 years ago

            > When you write a program in any modern programming language

            Think about it this way. Say you have 1 bit of memory. Can you store any of the three numbers -1, 0, 1 in that bit? No. The only thing a bit can store is 0 or 1.

            There have been trinary computers built. They don't use bits, they use trits. A trit can store -1, 0, or 1.

            A bit cannot. Saying that a Turing machine whose symbol set is {0,1} can print "A" on its tape is like saying you can store "-1" >>>in a single bit<<< on a binary computer.

          • DemocracyFTW2 2 years ago

            > The way that the browser decides to display those bits does not change their character

            ... pun intended—?

    • mxkopy 2 years ago

      I feel like a conversion from binary strings to Unicode/Chinese characters would be in PTIME, so adding a conversion machine would be a nonfactor for languages in most complexity classes.

      • smallnamespace 2 years ago

        The stronger result here is that any sort of conversion you can explicitly specify can be turned into a program.

        Since Kolmogorov Complexity is specified in terms of lengths of programs, that means the KC under two different encodings can differ by at most a constant amount (the size of the program that converts back and forth).

        The above is a bit handwavey, there are details you can tighten up (Is it the program size or something smaller? The program length in which encoding?), but heuristically that's why theorists can talk about "the" Kolmogorov complexity without getting bogged down with pesky encoding details.

        It's also why we usually don't worry too much about the fine details of Turing Machines (alphabet, etc.), since you can generally emulate one sort of Turing Machine pretty easily with a short program written on another.

        • rhelz 2 years ago

          > any sort of conversion you can explicitly specify can be turned into a program.

          If your Turing machine can only print out zeros and ones, there's no program which can get it to print out "ABC". So it cannot specify a conversion between a language whose symbols are {0,1} and a language whose symbols are {"A","B","C"}.

          It could specify a mapping between one binary string and another binary string, but it can't even print out "ABC" so how could it possibly specify a conversion?

          This is elementary guys.

          • smallnamespace 2 years ago

            You pick an obvious encoding (such as binary) yourself, in the same way your computer is not outputting some platonic ideal "A" but a series of electrical impulses that your monitor plus your eyes and brain interprets as "A".

            Sure, you can object that the encoding is "outside" the TM, but for the purposes of discussing complexity these objections are pretty trivial, again for the same reasons (whatever encoding you pick the conversion process is a program you can write down, and once you write it down it means the Kolmogorov Complexity is the same between different TMs up to the length of whatever encoding/decoding program you come up with).

            Put another way, a TM whose alphabet is {0, 1} is technically not the same as a TM with alphabet {A, B}. But it's obvious to us that the TMs are equivalent.

            • rhelz 2 years ago

              Well, here's why it's not that simple....I can make any number N have a Kolmogorov complexity equal to 1--if the Turing machine has N+1 symbols :-) I just express the number in base N. (which will be "10" for any base :-)

              It's true that we typically limit ourselves to binary when we are proving theorems, etc., in Kolmogorov complexity. You can prove that for any two Turing machines U and V, KU(X) <= KV(X) + O(1).....but this relies on the fact that there is an O(1)-sized program which lets U emulate V, and V emulate U.

              And that is only true if U and V share the same symbol set. If they don't, then the Kolmogorov complexities assigned by the two machines can be made arbitrarily different from each other, just by changing the symbol set.

          • cscurmudgeon 2 years ago

            It is not the intractable problem you believe it to be.

            E.g., make the machine print out pixel values for a large screen. The screen can display Chinese characters in canonical ways.

            • rhelz 2 years ago

              This might sound a bit...but hang with me please...

              Are the symbols "0" and "1" also definable as pixel values on a large screen? If so, what is a pixel? It it further composed of smaller micro pixels? Or is it pixels all the way down?

              What I'm saying is that a binary computer cannot write "A" to its memory any more than you can write a green patch to a black-and white monitor's pixel.

              Sure, there are various work around...you could dither, and make some bit patterns represent red, green, etc...or you could put a picture of an apple onscreen with an arrow pointing to it which said "this is red".

              And just like a color monitor can display more information than a black and while monitor can, the programs of a Turing machine with more symbols can express more information in smaller strings than a binary turing machine can.

    • canjobear 2 years ago

      Why is this a problem? No information is lost when characters (or graphics) are encoded in binary.

      • rhelz 2 years ago

        A turing machine is defined as a set of symbols which it can read/write to the tape, and a state machine which maps the current symbol read to the next state and some actions.

        The symbols of the Turing machine are atomic. They are not composed of any simpler parts. If one of the Turing machine's symbols is the letter "A", it's the letter "A". It is not, say, the ASCII code (1000001).

        1000001 could be the Goedel number for "A", but it's not the symbol "A". The two strings "A" and "1000001" are two different strings.

        It's a map-vs-territory kind of thing. If you are really good at programming--which is to say, you are really good at Goedel-mapping your problem to integers--you might, by years of long familiarity, just start thinking of them as one and the same, but they are not.

        It might make it vivid to consider a turing machine whose symbols were {1, 0, A}. Clearly, the string "1000001" and the string "A" are two different outputs for this turing machine. The lengths of the strings "1000001" and "A" are different. They are composed of different symbols. They are absolutely, positively, not the same string, so they are not the same thing.

    • tromp 2 years ago

      > It therefore cannot be used to define a universal complexity for any string.

      It defines a complexity for anything which can be represented in binary. Which in practice is all we want. Who wants to define a new complexity measure for every new alphabet of symbols?

    • SimplyUnknown 2 years ago

      But Chinese (or Mandarin) does not have a context-free grammar, whereas I believe that encoding a language on a Turing machine implies a context-free grammar, so this example doesn't hold.

      • rhelz 2 years ago

        Well, a couple of points: it's not obvious that Chinese doesn't have a context-free grammar: see the talk by David Branner, "The Grammar of Classical Chinese is Very Close to Being a Context-Free Grammar".

        And a properly programmed turing machine can parse languages which are way more complex than context-free languages are.

arketyp 2 years ago

Richard von Mises (brother of the economist) formulated a definition of randomness: a sequence of data is random if, were you a gambler, you cannot by any strategy make money on betting on the outcomes. This was before the theory of computation and was later developed by Kolmogorov and others into algorithmic complexity. The modern variation would be (Wiki) "considering a finite sequence random (with respect to a class of computing systems) if any program that can generate the sequence is at least as long as the sequence itself".

  • n4r9 2 years ago

    > you cannot by any strategy make money on betting on the outcomes

    What does "strategy" mean here? I might just happen to have a strategy which involves betting on the exact sequence of heads and tails in a given sequence. The analogy in terms of languages is that my language might just happen to have a short keyword that represents a given sequence of heads and tails.

    I don't know much about Kolmogorov complexity so I'm certainly missing something here. Potentially there is a subtle clause in the technical definition that doesn't make it through to these articles.

    • PartiallyTyped 2 years ago

      > What does "strategy" mean here? I might just happen to have a strategy which involves betting on the exact sequence of heads and tails in a given sequence.

      That's a very narrow program.

      > The analogy in terms of languages is that my language might just happen to have a short keyword that represents a given sequence of heads and tails.

      The sequence still needs to be generated "somehow". Either by executing the program and producing the sequence, or by explicitly stating it. Even if you have it "cached" and "represented" in your language, you still need to generate the sequence. The size of whatever you need in order to generate it is the Kolmogorov complexity.

      The easiest way to generalize your little program is to say that you have a seed such that successive outputs of the generator match the sequence, up to the period of the generator. Now it is more generic, but has a period. You can then extend this to accept multiple seeds: once the generator has completed a period, simply take the next seed.

      Should this sequence be finite, you are in luck. Your program can have length O(generator + N/P), where N is the length of the sequence and P is the period of your RNG.

      All of this is just compression, which plays into the whole idea of Kolmogorov complexity.
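
      A toy Python sketch of that "seed as description" idea (names are made up; this is only the compression intuition, not a formal definition):

          import random

          def expand(seed, n):
              # The whole n-bit sequence is determined by (seed, n), so this pair
              # plus the generator itself is a short description of the sequence.
              rng = random.Random(seed)
              return [rng.randint(0, 1) for _ in range(n)]

          # A million bits, describable in a few dozen bytes: its Kolmogorov
          # complexity is at most |this generator| + |(42, 1_000_000)| + O(1).
          long_sequence = expand(42, 1_000_000)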

    • inimino 2 years ago

      The idea is that you bet before the sequence is known. Nowadays we would say it is the distribution (or the process producing the random sequences) that can be truly random or not, and we recognize that saying "sequence [...] is random" is incoherent, same as the joke of the random int set to 4 with a comment in the source code that it was chosen by fair dice roll.

      If you know everything about the process and still can't beat chance at predicting it, that's the quality we are after. In this definition "random" just means unpredictable, which is another way to explain why it can only be a meaningful distinction when you don't yet know the result.

    • canjobear 2 years ago

      > What does "strategy" mean here?

      Any function that outputs bets.

  • a_wild_dandan 2 years ago

    Doesn't the modern variation break for programs with lossless encoding/decoding? At least, for sufficiently long sequences? A Huffman/byte-pair encoding would shred any trillion-bit+ sequence, for instance. But I intuitively expect many random trillion-bit sequences exist.

    • chaboud 2 years ago

      There is no encoding that would shred "any" (read: every) trillion bit sequence. If that were true, some fundamentals of information theory and compressibility would break down.

      Lossless encoding works by taking advantage of the more commonly observed sequences of data having lower information entropy. For things like audio encoding, where discontinuous sequences aren't naturally observed (or pleasing to listen to), lossless encoding has a lot to work with.

    • mxkopy 2 years ago

      For any fixed compression scheme, there is an input string that is actually lengthened by it rather than shortened.

      However Huffman isn’t a fixed compression scheme since it makes a different frequency tree for different corpora.

  • Ar-Curunir 2 years ago

    The two definitions say different things. What von Mises said is closer to cryptographic definitions of pseudorandomness, and in particular to next-bit unpredictability.

  • canjobear 2 years ago

    Do you have a citation? I didn’t know the idea went back that far.

    • arketyp 2 years ago

      Thanks, I had to dig. I read about it in [1]. Mises was concerned about the formalization of probability theory. It seems the idea appears at least as early as in his 1919 paper [2].

      [1] An Introduction to Kolmogorov Complexity and Its Applications, M. Li & P. Vitányi

      [2] Grundlagen der Wahrscheinlichkeitsrechnung, R. von Mises

rhelz 2 years ago

I think it's bootless to try to define the "minimum possible" Kolmogorov complexity. Here's why:

1. Note, Kolmogorov complexity is defined by the length of the shortest program which prints out the string. What counts is the number of instructions, and not the complexity of those instructions.

2. So say S is a very complex string. We can always construct a Turing machine which could print out S using a zero-length program: it could just start in a state which prints out S when you turn it on, and then halts.

3. So there is no Turing machine which prints out every string with a shorter program than every other Turing machine does, QED.

That's the bad news. The good news is we don't even need to do that. For any string S, say that M and N are any two universal turing machines. Without loss of generality, specify that KM(S) <= KN(S). Then there is always some C for which KM(S) <= KN(S) + C. The constant C being the length of the program required to emulate machine M on machine N.

We are used to abstracting out constant sums and constant factors like this. The strings we are dealing with (as a species) are growing in length exponentially--that's why we went from 8-bit to 16-bit, etc., computers. So as the length of S goes to infinity, the difference between its complexity for any two machines becomes negligible relative to that length.

anonzzzies 2 years ago

Kolmogorov complexity is a lovely subject and one of the more influential ones in my life.

THE book https://link.springer.com/book/10.1007/978-0-387-49820-1 is absolutely a thing to read. It was for me 30 years ago and it aged well.

  • derbOac 2 years ago

    That is a great book on the subject — the authors have published some important work in this area in papers as well.

  • woliveirajr 2 years ago

    One of Vitányi's students used it to create the NCD (normalized compression distance), and then I went on to get a Master's/PhD on using it for authorship attribution.

  • mnky9800n 2 years ago

    I bought the book based on your, albeit anonymous, recommendation. Is there a python library you recommend for playing around with it?

wood_spirit 2 years ago

An excellent rabbit hole to dive into is the equivalence of compression and general AI. Every programmer should make a compressor (and, separately, a ray tracer)!

See http://prize.hutter1.net/

JDEW 2 years ago

> It has been demonstrated that KC(x), can be reasonably estimated by the number of bits required to encode x using a compressor C (such as gzip)

Talk about a cliffhanger :)

Using [0] you get 32B for Alice and 40B for Bob.

[0] It has been demonstrated that KC(x), can be reasonably estimated by the number of bits required to encode x using a compressor C (such as gzip)
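
For anyone who wants to try the same rough estimate, a sketch with Python's zlib (the sequences below are stand-ins since the article's exact strings aren't reproduced here, and exact byte counts depend on the compressor's header overhead):

    import zlib

    bob = b"TTTTTTTTTTTTTTTTTTTT"     # 20 tails in a row
    alice = b"HTTHTHHTTHTHHHTTHTHH"   # a hypothetical mixed sequence
    print(len(zlib.compress(bob, 9)), len(zlib.compress(alice, 9)))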

causal 2 years ago

Confused how the interesting number paradox proves KC cannot be computed.

  • Opocio 2 years ago

    Me neither.

    But how I see it is that for solving KC in full generality you'll have to:

    - Start with the program that explicitly returns the original string. Let's say it has length N

    - Run all possible programs that are shorter than N (just try all combinations of characters)

    - Look at the results and pick the shortest program that compiles and outputs the original string

    The problem there is that you have to wait for all programs to end, and you don't know if they will end or not. So you have a problem that's equivalent to the halting problem (and that's not solvable) (and the halting problem is loosely related to the interesting number problem).
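
    A Python sketch of that search (purely illustrative; it is not an algorithm precisely because exec() may never return for some candidate programs, and the "output" convention here is made up):

        from itertools import product
        import string

        def naive_kc(target):
            # Enumerate candidate programs from shortest up; the first one that sets
            # a variable `output` equal to the target gives an upper bound on KC(target).
            length = 1
            while True:
                for chars in product(string.printable, repeat=length):
                    scope = {}
                    try:
                        exec("".join(chars), scope)   # may loop forever: the halting problem
                    except Exception:
                        continue
                    if scope.get("output") == target:
                        return length
                length += 1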

    (This is not a proof and I don't have a background in the field btw)

  • srcreigh 2 years ago

    The author is referring to something called Chaitin incompleteness.

    https://en.wikipedia.org/wiki/Kolmogorov_complexity#Chaitin'...

    Of course trivially some KC values can be proven, e.g. for a language where a 1- or 0-character program is interpreted as a specific string. Or, if x has a description of length N and you can list out the results of all programs of length less than N and none of them equals x, that proves KC(x) = N.

    The interesting number paradox (Berry's paradox) is more related to Chaitin incompleteness.

    Basically, given a language there’s some code which enumerates proofs that KC of a string is more than some constant L, and returns the first one it finds.

    If the constant L is large enough, it becomes larger than the entire proof generating code. So the proof generating code will never find a proof of any KC larger than L.

    It's interesting to think about how, as the language gets more complex, proofs for larger strings become possible. And what it would mean for the languages to keep getting more complex indefinitely.

    It's a similar train of thought to busy beaver numbers and how values like BB(745) become independent of systems of logic (PA, ZFC), and what it could mean to have more and more advanced types of logic which don't hit independence until some higher target n.

    • causal 2 years ago

      This seems to assume that KC can be infinite. That must have been proven at some point? Otherwise it may be that there is some upper-bound for L which happens to also be the KC for a KC-computer.

      • tromp 2 years ago

        KC(x) is always finite for any finite x. What we can say instead is that KC is unbounded. I.e. there is no finite bound on the value of KC(x).

      • srcreigh 2 years ago

        yes, it’s a pigeonhole principle argument. A bit tricky to actually enumerate everything. Imagine KC had a limit k. Then there’s a fixed number of strings that can be compressed to k characters. so considering how there’s an infinite number of strings of length greater than k, they can’t all be compressed to be at most k. Therefore KC has no limit

  • nyrikki 2 years ago

    Impredicativity is the property you may want to dig into for formal proofs on why self references can be problematic.

    There is an important difference between semantically complete and syntactically complete that may cause some barriers.

    Gödel's completeness theorem is about semantic completeness, while his incompleteness theorems are about syntactic completeness.

    From Wikipedia: > A formal system is syntactically complete if and only if no unprovable sentence can be added to it without introducing an inconsistency.

    'This statement is not provable', which Gödel mapped to the natural numbers, is an example of such a sentence.

    If KC were computable, there would be an infinity of paradoxes like the interesting number paradox.

    The Berry paradox, which the interesting number paradox link on the page points to, has a subheading that relates it to KC computability.

    https://en.m.wikipedia.org/wiki/Berry_paradox

  • explaininjs 2 years ago

    Similar to how the interesting number paradox relies on a "shortcut statement" to pin down the smallest non-interesting number, if Kolmogorov complexity were computable you could create a "shortcut program" to force down the length of the shortest program:

    Given: TM length of a JS runtime is 1,000,000 cells.

    Assume: KC is computable, and the TM length of a `function KolmogorovComplexity(string s)` is 4,000,000 cells.

    Known: KCs of values grow infinitely large - only 2^n-1 possible values can ever be encoded by fewer than n bits.

    Take: function Shortcut() { for (const s of generateEveryStringFromShortestUp()) { if (KolmogorovComplexity(s) > 10_000_000) return s } }

    You see that the Shortcut function is encoded in 5,000,135 cells (plus that string generator, but that's small/constant), yet it returns a value whose complexity exceeds 10,000,000 (and a one-cell increase in the program length lets you demand 10x the complexity). A contradiction.

    • causal 2 years ago

      Still confused. What is contradictory about a simple program computing a more complex program? Randomly generating a more complex program does not make the complex program reducible to a random string generator.

      • Dylan16807 2 years ago

        > Randomly generating a more complex program does not make the complex program reducible to a random string generator.

        The complex program probably does something other than generate random strings.

        But the complex program is not actually more complex than the generator plus a smidge.

        Because anywhere you're using complex(z), you can replace it with generator()(z).

      • basil-rash 2 years ago

        The complexity cannot be over 10,000,000 if that simple program generated it. That is the precise definition of complexity.

        I don’t understand what you mean by reducibility to random strings, randomness has precisely nothing to do with complexity, even if they do tend to go together.

davesque 2 years ago

Something I've always noticed with the notion of Kolmogorov complexity is that the question of determining the lowest level of computation is problematic.

For example, in the article, the author first defines the basic idea of KC. But then they correctly point out that the basic idea depends very much on the exact language that is chosen. So they describe how theorists have defined the notion of universal computation. But even this adjustment doesn't seem to escape the fact that we still depend on a system of mathematical symbols to describe the theory. And the notion of a Turing machine itself depends on other abstract concepts such as time and space, each with their own inherent, conceptual complexity. What sorts of minds (i.e. brains) are required to make sense out of the theory and what physical system is required for them to operate correctly? If the definition of KC includes a notion of how complex the Turing machine is that is required to compute a string, then the further down you go, the less the difference in complexity should be between any one string and another. After all, they all exist in the same universe!

I guess it just goes to show how much the idea of KC lives in the realm of theory. As soon as you pose the question of complexity so abstractly, you invite in all kinds of theoretical considerations that make the meaning more slippery. That's why KC really doesn't deserve to be compared to Shannon entropy as it often is.

But let me draw a comparison anyway like I said you shouldn't! Because Alice from the article could also have made a strong argument against Bob by just pointing out that the Shannon entropy of his string was lower, which is very relevant in terms of the number of heads or tails and the likelihood of seeing a particular count of them.

  • veerd 2 years ago

    1. Choice of language only matters up to an additive constant (e.g. you could just write a simulator so language A can run language B).

    2. If you want something with less physical grounding, you could use lambda calculus instead of Turing machines.

    3. Kolmogorov Complexity and Shannon Entropy are compared with one another because they both are talking about the same thing: optimal compression. Kolmogorov Complexity talks about the compressibility of individual objects and Shannon Entropy talks about compressibility of streams of i.i.d. random variables.

  • Xcelerate 2 years ago

    Many people seem to get hung up on questions related to the specifics about the implementation of computational models, but mathematicians don’t, because the fundamental aspects of the theory don’t change much when you swap out one particular model of universal computation for another.

    As noted, Kolmogorov complexity depends on the specific UTM only up to an additive constant (this is known as the invariance theorem). But even the change in runtime, memory usage, and essentially anything else you might think is important is bounded by a factor that is either a constant or a "slow-growth" function (e.g. a polynomial) when you swap out one computational model for another. These small terms are generally dwarfed by the size of the data itself (even for small datasets) and the complexity of the algorithms used.

    That said, I also share some of your confusion on the “specifics” when it comes to Solomonoff induction. I have yet to understand why the universal distribution uses negative exponentiated program size to weight the universal a priori probability of a particular string as opposed to some measure that involves program runtime or frequency over an equivalence class of programs that implement the same algorithm.

    Solomonoff was careful to point out that his universal distribution is more of a class of distributions that have certain “universally optimal” convergence properties given a reasonable assumption on the underlying model of data generation: a deterministic algorithm with a short description that has access to a source of randomness. But I think many people since then have made the unwarranted leap that Solomonoff induction is the best induction scheme for all models of data generation, including data obtained via observation within our universe. I’m not sure that has been proven true. And if it has, I certainly haven’t come across the paper showing it.

  • AnotherGoodName 2 years ago

    Shannon entropy and Kolmogorov complexity are absolutely literally the same thing though! They are both purely theoretical and you cannot calculate the minimum Shannon entropy any better than you can calculate the Kolmogorov complexity. In fact if you could calculate one you could calculate the other trivially but we don't have a way to do that.

    For those now thinking about how to calculate Shannon entropy using the defined formula: what are you using for the symbols? If you used 1-bit symbols of '1' and '0' and the probability of each appearing, a file that was just 11101110... repeating would give you a different Shannon entropy than it would for someone using 4-bit symbols. Shannon entropy is literally uncomputable in the real world. You can only compute it if you are given a fixed alphabet and frequencies, but in the real world the optimal alphabet for a given file, the one that minimizes the Shannon entropy, is actually unknowable.

    That's where Kolmogorov complexity comes in. It states that "well we don't actually have a way to define the alphabet in Shannon entropy in the real world but if we pretend we have a system (the universal computation) we could calculate it". They then add in the size of the program length that does the calculation as well to prevent cheating by having a language that has a dictionary specific to the thing to encode and call that Kolmogorov complexity. But that's it. They are literally the same thing in essence.

    Kolmogorov complexity is in fact better than Shannon entropy for real world usage. It's every bit as computable in the real world (ie. not at all but at the very least you can do the best compression you can and make a guess!) but it at least states that upfront.

    For anyone wanting to claim that they had a CS assignment to calculate Shannon entropy and that it's totally computable: your teacher should probably have explained that the symbol frequencies for the alphabet given aren't actually obtainable like that in the real world, as the optimal symbol lengths themselves aren't computable. You cannot in the real world just say "compute the Shannon entropy of an alphabet with two symbols - B 30% and A 70%" because you don't actually know if B and A are the optimal alphabet to define to minimize Shannon entropy. BBBAAAAAAA repeated has no entropy, but it fits the definition of the question given and would give you a different result.

copx 2 years ago

>Bob claims that since the probability of getting both his and Alice's sequence is the same (2^-20), it proves that there was no foul-play involved.

..and Bob is 100% right.

>Bob credits his excellent luck. Alice is smart and cannot be easily convinced. She gets back at Bob by claiming that probability cannot be used in this context as it reveals no information regarding the randomness of the obtained sequences. One can take a quick glance at the obtained sequences and easily point out that Alice's sequence is more random than Bob's sequence.

No, it is not. Given a perfectly random coin toss Bob's sequence is indeed just as likely as Alice's sequence and in no way "less random" because both sequences result from the same randomness with equal probability.

A nice example of human intuition being at odds with probability math, though. Bob's result seems less likely but it really is not. Which reminds me that I actually had to write my own computer simulation of the Monty Hall Problem before I was willing to believe the correct answer. I think (most?) human brains have a bug in the "understanding probability" subroutine.

  • ptero 2 years ago

    Not quite. Specifically, assuming independent random tosses, the A and B sequences are equally likely. No objection here.

    But the question posed is different: given a specific sequence, how likely is it to have come from independent coin tosses? That is, how likely is it that Bob is cheating and his sequence was in fact not a sequence of fair coin tosses?

    And for this KC is a reasonable measure. My 2c.

  • Hunpeter 2 years ago

    The "bug" in this case imo, is that we interpret A's sequence as "random garbage" without regard to the actual contents, whereas we interpret B's as "all Ts". The question our brain asks then is "is it more likely to get random garbage or all Ts?"

    • nerdponx 2 years ago

      Right. It might be more interesting to consider the count of Ts and Hs instead of considering exact sequences.

  • Gimpei 2 years ago

    Couldn’t you say that the distribution of tosses is less likely in the case of Bob?

mojomark 2 years ago

I'm going to keep reading (because I love the KC topic), but I'd appreciate anyone confirming if the following are errors in this article:

1.) Conflating usage of the terms "random" and "complexity". After all, a set of "randomly" drawn sample permutations from an alphabet are all equally likely. However, their "complexity" may differ, which is basically the point of the article, but the term "random" keeps being used to refer to permutations with more or less "complexity", which I think is probably going to perpetuate confusion on this topic.

2.) From the article: "Moreover, a string cannot be compressed if its KC(x)≥|x|". Shouldn't the expression accompanying this statement be KC(x)=|x| ?

  • tromp 2 years ago

    Regarding point 1), one can easily show that with probability >= 1 - 2^{-k}, a randomly chosen bitstring x of length n must satisfy KC(x) >= n-k. After all, there are only 1+2+...+2^{n-k-1} = 2^{n-k}-1 shorter descriptions. So highly compressible strings are highly unlikely.

    Regarding 2), No, most strings x do not satisfy KC(x) = |x|, since you need to use some bits to specify that you're giving x literally. See the first theorem of [1].

    [1] https://gist.github.com/tromp/86b3184f852f65bfb814e3ab0987d8...

  • rhelz 2 years ago

    re #1: the conflation is justified, but you couldn't guess that just from what was presented in the OP. There are some cool theorems which justify it tho---if you like Kolmogorov complexity you are in for a fun ride.

    re #2: No. Basically the > part of it handles the case when the smallest program which prints out the string is actually LARGER than the length of the string. In that case, the string is still incompressible. Compression means mapping from larger strings to smaller strings.

pizza 2 years ago

I think maybe another way to put this is that Alice's number is in a typical set [0] of the distribution of bitstrings whereas Bob's might not be. Depending on the tolerance, the typical set can have near-total coverage of the distribution. Another way of making this about compression is that a random code that could encode typical set strings well probably will suffer some overhead when encoding Bob's, but most strings it will encode close to optimally.

[0] https://en.wikipedia.org/wiki/Typical_set

alfanick 2 years ago

A side question: is this taught in the CS curricula you know? It was at my uni (a fairly good one, in a minor European country), and this experience biases me because I assume every CS grad knows Kolmogorov complexity.

  • sunshowers 2 years ago

    At my university (IIT, top school in India and well-known around the world) this was covered in an elective you could take, not part of the core CS curriculum.

  • quibono 2 years ago

    Yes, at least in the UK. From working through some US university curricula - it's also present there as well.

yamrzou 2 years ago

Well, I reached the end of the article (interesting btw), and I'm still not convinced why Bob can't claim that there was no foul-play involved and that he got his result due to excellent luck.

  • ComplexSystems 2 years ago

    You don't need Kolmogorov complexity for this; simple hypothesis testing will do. The null hypothesis is that the coin is fair and the alternative is that it's biased. If Bob was correct, then there would simply never be any way to refute the null hypothesis of a fair coin, no matter what, since it can simply output anything at all with equal probability as anything else. In reality, that isn't how hypothesis testing works, and pretty much any standard technique (computing p-values, likelihood ratios, etc) will agree that 20 tails in a row is extremely unlikely given the null hypothesis in a way that 10 tails and 10 heads is not.
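
    For concreteness, a quick Python check of those two probabilities (for exact outcomes of 20 fair flips):

        from math import comb

        n = 20
        print(1 / 2**n)            # P(20 tails in a row)          ~ 9.5e-7
        print(comb(n, 10) / 2**n)  # P(exactly 10 heads, 10 tails) ~ 0.176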

avmich 2 years ago

> Another thing to note is that Kolmogorov complexity of a string cannot be computed. There cannot exist a computer that will always guarantee the Kolmogorov complexity for all the strings.

Sounds a bit puzzling. Surely for a particular programming language we can enumerate all programs, ordered by length etc. and check which is the shortest one giving the given string. So what's uncomputable here? For long strings that could take long time, but - ?

  • floobertoober 2 years ago

    With a Turing complete language, you can't know whether a given program eventually yields the string, or continues indefinitely

  • tromp 2 years ago

    > So what's uncomputable here?

    Deciding whether the universal machine will ever halt on a particular input. I.e. the good old halting problem.

  • MutualMisinfo 2 years ago

    The programs we check might not halt.

robrenaud 2 years ago

If you want something like Kolmogorov complexity for molecules, check out assembly theory. I am a CS person, but there are interesting, related ideas here.

https://en.m.wikipedia.org/wiki/Assembly_theory

Ono-Sendai 2 years ago

Kolmogorov Complexity does not help with giving a universal measure of complexity or randomness: https://forwardscattering.org/page/0

  • AnotherGoodName 2 years ago

    It's a lot better than the alternatives. Particularly the misused Shannon entropy.

    The top question for "how do I measure Shannon entropy" on Stack Overflow, for example, has an accepted answer of "count the probabilities of all 8-bit sequences and then multiply the log of those probabilities together as per the equation". Which is a problematic answer. A file of all 8-bit characters in sequence, repeated many times over, won't have any real entropy but will have high entropy by this particular arbitrary measure. The problem with Shannon entropy is that you have no way to define the optimal symbol lengths and frequencies for any given file.

    Kolmogorov Complexity on the other hand at least gives some way for us to get a rough estimate. It's just as incalculable as Shannon entropy but at least by essentially explicitly stating "compress it using the best tool you have at hand and see how small it gets and also include the size of the compression program in the calculation to prevent cheating by using a dictionary" you can get some rough estimate.

    Basically Kolmogorov Complexity is the best tool we have. It's not perfect because just like Shannon Entropy it's incalculable in reality but unlike Shannon Entropy we do have a good way to measure if one tool of calculating Kolmogorov Complexity is better than another tool. That measure is simply "does it compress better?".

    It's literally the best way to measure randomness of an arbitrary file. Any other way is pretty game-able. If someone uses Shannon entropy to measure randomness just look at the alphabet they use for that measurement and repeat that alphabet sequentially over and over again and you'll have a high shannon entropy for a clearly non-random file. Likewise other measurements might be game-able with large dictionaries to lookup. Kolmogorov complexity includes the entire program so that game doesn't work here.

    • Ono-Sendai 2 years ago

      Practically speaking, trying to compress a file is a nice way of measuring... something. I was more talking about the theoretical notion of complexity.

    • Dylan16807 2 years ago

      I'd argue that for most compression methods, not including the decompression program size will give you a better rough estimate.

      • AnotherGoodName 2 years ago

        Including it just excludes the old 'program that returns a compressed size of 0 for a common test case because it contains the common test case within itself' scenario. There are variations of that too. GPT is fantastic at predicting text and you could use that prediction with arithmetic coding to make a very minimal length for some common English text. But is that really fair? The GPT model essentially contains that English text. Including the program size adds little overhead for most cases (code size is generally small) but it ensures we catch out cheating, which is why Kolmogorov complexity includes it.

        • Dylan16807 2 years ago

          If you're cheating your own rough estimate, what are you even doing?

          Custom cheats and large language model predictors are outside the realm of "most compression methods".

          My thought is that lots of things you might want to compress are pretty small, so if you toss a 200KB not-ultra-size-optimized decompressor on top you get a very misleading number. Not having it will tend to understate a bit, but also current compression algorithms are far from perfect so that counterbalances the effect.

          I think I disagree with you about how big a blob of text is getting compressed in "most cases". (If we look at photographs or videos then it works better, but we still want to arrange a minimal decompressor instead of grabbing a typical implementation. And the lossy/lossless factor makes the entire thing far more complicated.)

marius_k 2 years ago

Sometimes I wonder what would be the smallest program to generate human DNA. How many operations would it take, and how would it compare to the real-world iterations of evolution as a whole?
