What is the Probability that a Horse Finishes Last?

This post introduces you to a new contest where you deploy an algorithm that predicts the rankings of horses in a five-horse race. The only information you are given is the market’s perception of the probability of each horse winning.

Photo by Mathew Schwartz on Unsplash

The question of assigning rank probabilities to a finite number of variables is a basic statistical one that is likely to find many applications beyond horse racing. I find this problem recurs fairly frequently, whether I am looking at dealers in over-the-counter markets or algorithms being stacked together.

And although the challenge is rather different, the M6-Financial Forecasting Competition, to begin in Feb 2022, features a collection of five-horse races between stocks, where you will also be asked to estimate rank probabilities. Consider this a warm-up.

How to participate in the contest

Open this notebook and start burning your new identity while you read on. (This is your ticket to the world of microprediction and if it is confusing, just bother us on slack.) When the WRITE KEY is created, paste it into the dashboard to reveal your nom de plume. For example:

Your public ID is a hash of your private WRITE KEY. Your WRITE KEY is a Memorable Unique Identifier explained in this video.

Next, take a look at estado_sole.py. This is the example script you’ll run once an hour, either as a cron job or with whatever scheduler you prefer. All we need is a writer, which can be instantiated using your WRITE KEY.

While your KEY is burning, notice that there are ten different live streams that are of interest to us in this example (you can find the full stream list here, if you are interested) and they all relate to a sequence of five-horse races.

For example, the stream horse_finish_4.json contains a record of finishing positions for the 4th horse in each race. The horses are different each time, but horse 4 is always the 4th most popular in the betting. As you can see it does not win very often.

Your mission is to estimate the probability that horse 4 finishes 1st, and also estimate the probability it finishes 2nd, and so on up to 5th place.

Actually, you are somewhat constrained because you cannot provide the five probabilities directly. Instead you provide a list of 225 values. So for instance, if you think the horse has a 1 in 15 chance of winning, then 15 of the 225 entries in your list should be 1’s. Perhaps 30 are 2’s, because it is more likely to come second, and so forth.
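Here is a minimal sketch of that counting rule, assuming you already hold five rank probabilities for one horse. The function name and the largest-remainder rounding are my own illustrative choices, not part of the contest code, and the probabilities in the usage lines are made up apart from the 1 in 15 and 30 out of 225 figures mentioned above.

```python
def rank_probabilities_to_values(probs, num_values=225):
    """Turn probabilities for ranks 1..len(probs) into a list of rank labels.

    Hypothetical helper: the rounding scheme (largest remainder) is an
    assumption, chosen only so the list always has exactly num_values entries.
    """
    total = sum(probs)
    # Ideal (possibly fractional) number of entries for each rank
    ideal = [p / total * num_values for p in probs]
    # Round down first ...
    counts = [int(x) for x in ideal]
    # ... then give the leftover slots to the ranks with the largest remainders
    leftover = num_values - sum(counts)
    by_remainder = sorted(
        range(len(probs)), key=lambda i: ideal[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    # Expand the counts into the flat list: rank 1 repeated, then rank 2, ...
    values = []
    for rank, count in enumerate(counts, start=1):
        values.extend([rank] * count)
    return values


# 1 in 15 to win, 2 in 15 to run second, the rest split evenly (made up)
values = rank_probabilities_to_values([1 / 15, 2 / 15, 4 / 15, 4 / 15, 4 / 15])
print(len(values), values.count(1), values.count(2))
```

With coarser probabilities the rounding matters more, so it is worth checking the counts before you submit.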

That’s why the last part of the example entry looks like this:

Since someone asked, the stream sponsor is Lady Damselfly, who appears in the sponsor list. Her public key is:

1a24da35e1f140d20e21ecbac7436d8e (aka LADY DAMSELFLY)

Information

The information you are provided with is contained in the price streams, such as horse_price_4.json shown below:

This uses a European/Australian convention for odds. The price $6.54 means that if you bet $1 you will collect $6.54 if the horse wins. Thus we can think of dividends as the inverse of market-implied probabilities. In the American tradition, the price would be quoted as roughly +550, similar to the bookmaker convention 11/2.
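The conversion can be sketched in a few lines. This is only an illustration: the function names are mine, the five dividends are made up apart from the $6.54, and the proportional normalization is an assumption about how one might strip out any bookmaker margin, not a description of how the streams are prepared.

```python
def implied_probabilities(dividends):
    """Convert decimal (European/Australian) dividends to win probabilities.

    Assumption: inverse dividends are simply rescaled so they sum to one.
    """
    raw = [1.0 / d for d in dividends]
    total = sum(raw)  # exceeds 1 when the prices include a margin
    return [r / total for r in raw]


def decimal_to_american(dividend):
    """Convert a decimal dividend to the American moneyline convention."""
    if dividend >= 2.0:
        # Profit on a $100 stake, quoted with a plus sign
        return round((dividend - 1.0) * 100)
    # Stake needed to win $100, quoted as a negative number
    return round(-100.0 / (dividend - 1.0))


dividends = [2.5, 4.0, 5.5, 6.54, 12.0]  # hypothetical five-horse market
probs = implied_probabilities(dividends)
print([round(p, 3) for p in probs])
print(decimal_to_american(6.54))
```

The exact moneyline for $6.54 comes out at +554, which a bookmaker would typically round to something like +550.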

Monitoring your performance

So long as you schedule the Python script to run once an hour, your dashboard at www.microprediction.org will start to populate:

If you click through to the transactions you can see why your credits increase or decrease — though I’d strongly suggest you read about the mechanics of scoring.

Techniques & Benchmarks

Perhaps the stack exchange question Given Every Horse’s Probability of Winning a Race… is a good place to discuss possible solutions, though feel free to comment here too.

I’ve provided a benchmark that uses techniques I recently published in an article Inferring Relative Ability From Winning Probability in Multi-Entrant Contests (SIAM J on Financial Mathematics, 2021 pdf). Although the motivations for that work lie mostly in much higher dimensional problems, it rekindled an interest in order statistics from a childhood fascination with the racetrack. The approach infers a performance distribution for each horse.
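To make the idea of a performance distribution concrete, here is a rough sketch of the forward direction only: given a location offset for each horse, simulate races and count finishing positions. It is not the calibration algorithm from the paper, which works the other way round and infers the offsets from win probabilities. The standard normal noise, the offsets and the function name are all assumptions made for illustration.

```python
import random


def simulate_rank_probabilities(offsets, num_races=20000, seed=0):
    """Estimate rank probabilities by simulating running times.

    Assumption: each horse's time is its offset plus standard normal noise,
    and the lowest time wins. Row i of the result holds the estimated
    probabilities that horse i finishes 1st, 2nd, ..., last.
    """
    rng = random.Random(seed)
    n = len(offsets)
    counts = [[0] * n for _ in range(n)]
    for _ in range(num_races):
        times = [offset + rng.gauss(0.0, 1.0) for offset in offsets]
        # Horses sorted from fastest to slowest
        order = sorted(range(n), key=lambda i: times[i])
        for position, horse in enumerate(order):
            counts[horse][position] += 1
    return [[c / num_races for c in row] for row in counts]


# Hypothetical offsets: the favorite is listed first and is fastest on average
rank_probs = simulate_rank_probabilities([-0.5, -0.2, 0.0, 0.3, 0.8])
for row in rank_probs:
    print([round(p, 3) for p in row])
```

Simulation like this is noisy, which is one reason a numerical approach is attractive when speed or accuracy matters.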

I reckon you can beat it though, and even if you simply modify it, you’ll notice there are some free parameters. My final selection of values is also unacceptably lazy, as it uses Monte Carlo and thus fails to harness Jensen’s Inequality (longer discussion here).

As also noted in the paper (and the discussion) there is another benchmark that I frame as an application of Luce’s Axiom of Choice. I think most people come to this answer semi-consciously, without necessarily realizing the assumptions being made and very rarely realizing it is consistent with exponential running time distributions (whose plausibility is … well you decide).

But following Harville you can, if you wish, estimate probabilities for all orderings by assuming that once a horse wins, the conditional second place probabilities are merely renormalizations of the remaining horses’ win probabilities (taking out the winner, of course). Then one continues, taking out first and second, and so forth.
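The renormalization recipe just described can be sketched by brute force, since a five-horse race has only 120 possible orderings. The function name and the win probabilities below are illustrative assumptions, not taken from the benchmark code.

```python
from itertools import permutations


def harville_rank_probabilities(win_probs):
    """Rank probabilities under the renormalization (Harville) assumption.

    Returns a table where table[i][k] is the probability that horse i
    finishes in position k (0 = first, len(win_probs) - 1 = last).
    """
    n = len(win_probs)
    table = [[0.0] * n for _ in range(n)]
    for order in permutations(range(n)):
        prob = 1.0
        remaining = sum(win_probs)  # probability mass of horses not yet placed
        for horse in order:
            # Conditional chance this horse is next, given who has finished
            prob *= win_probs[horse] / remaining
            remaining -= win_probs[horse]
        for position, horse in enumerate(order):
            table[horse][position] += prob
    return table


win_probs = [0.40, 0.25, 0.15, 0.12, 0.08]  # hypothetical market-implied values
table = harville_rank_probabilities(win_probs)
last_place = [row[-1] for row in table]
print([round(p, 3) for p in last_place])
```

The final column answers the question in the title, at least under this assumption: it gives each horse's probability of finishing last.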

A variation on the theme applies a power transform to the probabilities first. Another variation uses approximations by Henery for the normal case — although my repository provides a fast solution anyway as you can see from the benchmark example.
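A minimal sketch of the power transform step, assuming it means raising each win probability to a common exponent and renormalizing before any renormalization scheme is applied. The exponent of 0.8 is a made-up value; treat it as a free parameter to be fitted, and note the function name is my own.

```python
def power_transform(win_probs, exponent):
    """Raise each probability to a power and rescale so they sum to one.

    Assumption: a single exponent is applied to every horse. Exponents below
    one flatten the distribution, exponents above one sharpen it.
    """
    powered = [p ** exponent for p in win_probs]
    total = sum(powered)
    return [x / total for x in powered]


win_probs = [0.40, 0.25, 0.15, 0.12, 0.08]  # hypothetical market-implied values
flattened = power_transform(win_probs, 0.8)
print([round(p, 3) for p in flattened])
```

The transformed values can then be used in place of the raw win probabilities when estimating the minor placings.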

Getting Fancy

Though you don’t need to concern yourself with it if you don’t want to, what fascinates me is the possibility of inferring the market-implied copula for five-horse races. If you hunt and peck on the stream listing all the way at the bottom you might even find trivariate streams that can be used to infer whether there is implied dependence between the runners.

There are related questions that one might hope to answer, such as whether 2-margins are sufficient to reconstruct the joint distribution. It is a similar motivation to another ongoing study of cryptocurrency movements explained in the post How to Enter a Cryptocurrency Copula Contest. That’s probably a good place to start if you want to try to predict z2 and z3 streams related to these five-horse races.

Need Help?

If this kind of thing interests you, you’re always welcome to natter about copulas or order statistics in our Slack channel (here).