You toss a die, and it lands on 1. Surprising?
Hardly, with my luck. But even in general, I guess not that much — it must have landed on something, right?
What if I were to tell you that the die is loaded, and that 90% of the time it lands on 6?
Then I would be surprised that I got a 1.
What if you toss that die again, and now get a 6?
Considering it lands on a six 90% of the time, not surprising at all.
In general, if you observe an event happening, how surprised will you be?
That depends on the probability of the event. The less probable the event is, the more I will be surprised that it happened.
So, what would be a good way to quantify the surprise?
Something like -P(X) should work. The larger the probability, the smaller the surprise, and vice versa.
By your definition, what would be the value of your surprise when you rolled a 6 on that loaded die?
My surprise would be -0.9.
A surprise of -0.9. Doesn’t that seem weird?
Maybe. A negative value of surprise isn’t really intuitive.
What could be an alternative definition of surprise that would not have this problem?
Defining the surprise as 1 / P(X) should do the trick.
Why?
Less probable events are still more surprising, and additionally the surprise is never negative now.
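(As a quick aside, here is a minimal Python sketch of this candidate definition applied to the loaded die. The conversation only states that the six comes up 90% of the time, so splitting the remaining 10% evenly over the other faces is purely an assumption, and the function name `surprise` is mine.)

```python
# Sketch of the reciprocal definition of surprise, using the loaded die above.
# The probabilities of the non-six faces are assumed, not given in the text.

def surprise(p: float) -> float:
    """Surprise of an event with probability p, defined here as 1 / p."""
    return 1 / p

p_six = 0.9          # the loaded die lands on six 90% of the time
p_one = 0.1 / 5      # assumption: the remaining 10% is split evenly over faces 1-5

print(surprise(p_six))  # ~1.11 -> barely surprising
print(surprise(p_one))  # 50.0  -> very surprising
```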
Good. What if I were to generalize your definition and say that the surprise of an event X is F(1 / P(X)), for any real function F? Would that make sense?
In general, no. For example, applying F(a) = 1 / a to 1 / P(X) simply gives back P(X), and that hardly satisfies our conditions for a proper surprise measure.
For which F would the generalized definition make sense, then?
Only for those that are increasing. That way, events with small probability are guaranteed to be more surprising than the more probable ones.
And what about our second condition, the one about not having a negative surprise?
Oh, right. So only functions that are increasing and whose values are nonnegative.
Good. Let’s say that I found such an F, and that I calculated that the surprise of you rolling a one is 50, and the surprise of you rolling a six is 10. What do you think is your total surprise from these two events?
I would say it should be a simple sum of those two surprises, that is, 60. But it’s not that simple, right?
It could be. But first, how do you calculate the surprise of observing two independent events?
The probability of me observing two independent events, X and Y, is equal to P(X) * P(Y), so according to our definition, my surprise would be F(1 / (P(X) * P(Y))).
Now, how would you phrase the condition of additivity?
I want F(1 / (P(X) * P(Y))) to be equal to F(1 / P(X)) + F(1 / P(Y)).
Does such an F even exist?
This question would be much easier to answer were it not for the reciprocals inside our formula.
How so?
Well, we would be looking for some F satisfying F(P(X) * P(Y)) = F(P(X)) + F(P(Y)). And the logarithm is such a function.
…
Now that I think about it, the logarithm also satisfies the equation with the reciprocals, since 1 / (P(X) * P(Y)) is just (1 / P(X)) * (1 / P(Y))!
Does the base of the logarithm matter?
No, any base works. Changing the base only rescales every surprise by the same constant factor, so the additivity still holds.
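(Here is a quick numerical sketch of this claim; the probabilities 0.9 and 0.5 are made up for illustration, and the helper `surprise` is mine, not part of the conversation.)

```python
import math

# Check that the log-of-reciprocal surprise is additive for independent
# events, and that this holds in any base. Probabilities are illustrative.

def surprise(p: float, base: float) -> float:
    """Surprise of an event with probability p: log(1 / p) in the given base."""
    return math.log(1 / p, base)

p_x, p_y = 0.9, 0.5  # assumed probabilities of two independent events

for base in (2, math.e, 10):
    combined = surprise(p_x * p_y, base)                # surprise of observing both
    summed = surprise(p_x, base) + surprise(p_y, base)  # sum of individual surprises
    print(base, math.isclose(combined, summed))         # True for every base
```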
The logarithm is also an increasing function, so it satisfies the first condition we set in the generalized definition for surprise. Does it satisfy the second one?
It does, sort of. Its values are negative only for inputs smaller than 1.
Is that a problem for our use-case?
It is not. P(X) is less than or equal to 1 by definition, so its reciprocal is greater than or equal to 1. Consequently, log(1 / P(X)) will never be negative, which is exactly what we need.
Tell me again, what have we gained by adding the logarithm to the original definition of surprise as 1 / P(X)?
With the logarithm the surprise of observing two independent events is additive, meaning we can add up the surprise of each of the events and the result will be the surprise of the combined event.
And as an aside, have we lost something?
With the logarithm added it is no longer true that if some event is k times less likely to happen, we are also k times more surprised when it really does.
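(A small sketch of what was given up, with made-up numbers: making an event k times less likely multiplies the reciprocal surprise by k, but only adds log(k) to the logarithmic surprise.)

```python
import math

p, k = 0.2, 10  # illustrative probability and scaling factor

# Reciprocal surprise scales multiplicatively by k...
print((1 / (p / k)) / (1 / p))                  # 10.0, i.e. k times larger

# ...while logarithmic surprise only shifts by an additive log(k).
print(math.log(1 / (p / k)) - math.log(1 / p))  # log(10) ~ 2.30
```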
To sum up, what definition of surprise have we arrived at?
The surprise of an event X is log(1 / P(X)), or equivalently -log P(X). The base of the logarithm doesn’t matter.
And what properties does this surprise measure have?
- The more probable an event is, the smaller is our surprise when we observe it.
- Our surprise is always non-negative.
- The surprise when observing two independent events is equal to the sum of their individual surprises (additivity).
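(Putting it all together, here is a minimal sketch of the final definition and a quick check of the three properties above; the probabilities are again illustrative, and the natural logarithm is used only because the base doesn’t matter.)

```python
import math

def surprise(p: float) -> float:
    """Surprise (surprisal) of an event with probability p: -log p."""
    return -math.log(p)

p_x, p_y = 0.9, 0.02  # assumed probabilities of two independent events

# 1. Less probable events are more surprising.
assert surprise(p_y) > surprise(p_x)

# 2. Surprise is never negative (p is at most 1, so -log p >= 0).
assert surprise(p_x) >= 0 and surprise(1.0) == 0

# 3. Additivity for independent events.
assert math.isclose(surprise(p_x * p_y), surprise(p_x) + surprise(p_y))
```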
Fantastic! And lastly, do you know where you can apply this newly-gained knowledge?
I think the amount of surprise has something to do with information and entropy. But I’m sure we’ll talk about it more in our next conversation.