Why There Is Randomness in Ranking Algorithms

The Yandex Leak as an Example

The Yandex leak gave us many insights into how search engines work. Alex Buraks (@alex_buraks) and Michael King (@iPullRank) have written nice summaries of it. But one tweet caught my attention and left me wondering: why is there a random factor in a ranking algorithm? That is what I’m going to explain.

First, there is not just a single random factor; there are several, including the following:

The way the last one is defined is very interesting:

RandomFeature defined by UserID and (time in minutes)/5

It’s a random number between -2.0 and 2.0 that is assigned to each UserID and changes every 5 minutes. It’s tagged as “Query-Only Factor” and “Dynamic”. Don’t worry about the “Unused” tag, because PageRank is also marked as unused.
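
To make the idea concrete, here is a minimal sketch of how such a value could be derived. It is not the code from the leak: the function name random_feature, the use of SHA-256 as the mixing function, and the string format of the seed are all my own assumptions. Only the inputs (UserID and time in minutes divided by 5) and the -2.0 to 2.0 range come from the factor description.

```python
import hashlib


def random_feature(user_id: str, unix_seconds: int) -> float:
    """Hypothetical sketch: a stable pseudo-random value in [-2.0, 2.0]
    for a given user within one 5-minute window."""
    # (time in minutes) / 5, as an integer window index.
    window = (unix_seconds // 60) // 5
    # Assumption: any stable hash would do; SHA-256 is used for illustration.
    digest = hashlib.sha256(f"{user_id}:{window}".encode()).digest()
    # Map the first 8 bytes to a fraction in [0, 1], then scale to [-2, 2].
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return -2.0 + 4.0 * fraction
```

With this sketch, the same user gets the same value for every query inside one 5-minute window and a new value once the window rolls over, while two different users generally get different values at the same moment.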

There are 3 types of ranking factors in Yandex MatrixNet:

Static Factors
Dynamic Factors
Search Factors

The reason for both RandomFactor and RandomBoostFactor is given in their descriptions as “relevance rotation”.

The reason given for RandomFactor is “relevance rotation”

What is Relevance Rotation?

Think of it like this: if equally good pages are always presented in the same order, one page gets more chances to be clicked, depriving the other equally good pages of that chance. Randomly boosting some pages rotates this chance among equally good pages. It’s like how a grocery store shuffles its shelves from time to time.

Tie, Randomly Untie, Then Reasonably Untie

Also, if there is a burst of traffic at some peak time, you don’t want a single page to be presented to all users. All fresh pages have the same click and user engagement metrics, namely zero, because they are too fresh to have any clicks. Take breaking news on BBC, CNN and Aljazeera: none of those pages has any clicks yet, so different people are presented with different variations of the ranking for 5 minutes. In the next run, after 5 minutes, randomness won’t be the key player, because now you have click data.
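
Here is a toy sketch of that behaviour. It is a deliberate simplification, not how MatrixNet combines factors: I assume a plain lexicographic sort (score, then clicks, then a per-user random value), and the names rank, user_seed and the sample pages are hypothetical.

```python
import random


def rank(pages, user_seed):
    """Toy ranking: coarse score first, then clicks, then a per-user
    random tie-breaker that only matters when everything else is equal."""
    rng = random.Random(user_seed)  # assumption: one seed per user and time window
    tie_breaker = {page["url"]: rng.random() for page in pages}
    return sorted(
        pages,
        key=lambda page: (-page["score"], -page["clicks"], tie_breaker[page["url"]]),
    )


# Three fresh breaking-news pages: same coarse score, no clicks yet.
fresh = [
    {"url": "bbc", "score": 5, "clicks": 0},
    {"url": "cnn", "score": 5, "clicks": 0},
    {"url": "aljazeera", "score": 5, "clicks": 0},
]
```

While all three pages are tied, different seeds put different pages on top, so the first clicks are spread across them. As soon as one page collects more clicks than the others, the click count decides the order and the random value no longer matters.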

The trick is to introduce ties (equal scores) by squashing factors into a small number of buckets or discrete coarse steps. For example, freshness is measured like this:

As you can see, time is measured in months rather than seconds, and it is not open-ended: it is capped at 10 years. So we have only 120 buckets.
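
A minimal sketch of this kind of bucketing, assuming a 30-day month and a hard cap at 120 months (the function name freshness_bucket and the exact rounding are my assumptions, not the leaked definition):

```python
def freshness_bucket(age_days: float) -> int:
    """Hypothetical sketch: squash a page's age into whole months,
    capped at 10 years (120 months)."""
    months = int(max(age_days, 0) // 30)  # assumption: 30-day months
    return min(months, 120)
```

Two pages published a few hours apart both land in bucket 0, and a 12-year-old page scores the same as a 20-year-old one. That coarseness is exactly what produces the ties.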

Summary

Tie-inducing normalization of factors (coarse, not precise; squashed into buckets)
Tie-breaking randomness
Randomness-dismissing engagement metrics (CTR, clicks, bookmarks)