Modeling for Sports Betting: Football Player Props


You have experience as an alpha-producing quant and as a portfolio manager. What led you to work on sports analytics?

Anthony Lage (CEO of The Crowd’s Line) and I worked at the same hedge fund for a couple of years following the ‘08 financial crisis. He ran the trading desk and I did risk management for the hedge fund and manager selection / portfolio construction for their fund of funds. As you can imagine, that was a stressful period, and on many days when the closing bell rang at 4pm, Anthony would come plop down in my office to unwind and shoot the breeze.

Our topics would vary but we spent plenty of time discussing what we were seeing in the markets. Anthony would impress upon me what I would eventually realize to be the traits of all good traders: a flexible mind and a lack of ego. All the good traders I’ve known have these traits because they let them collect information and opinions from everyone around them (effectively ensembling a diverse set of signals) and give them the deftness to change their mind on a dime if new information comes to light.

Fast forward to last April. Anthony reaches out to me because he’s been laying the groundwork for an agreement with Genius Sports (official data provider of the NFL) to gain access to their real-time in-game data which includes telemetry data tracking all the players and the football. More importantly, with the recent advances in LLMs (a la ChatGPT), he sees an opportunity to build what he’s been wanting to build for a while: a versatile and flexible system for amalgamating all the different forms of rapidly shifting information and analysis that sports fans, fantasy players, and bettors would be interested in.

This includes making use of the in-game telemetry data – otherwise unavailable to the public – for in-game insights and analysis and eventually using it to build out our own predictive analytics. Anthony has a great network in the sports and sports betting world but less so when it comes to the tech and quant side of things. That’s where I came in.

At the time, I was on the backside of having two hedge funds go out of business on me in less than two years because of other groups losing money. I told Anthony that I’d help him get off the ground, start building out the data analysis ecosystem, and find the tech / quant talent he would need but I was probably looking to get back to trading. Then I kind of became obsessed with the problems we were solving.

Though I had kept up with neural network and machine learning research in a lot of areas, I had fallen behind on LLMs, so for the first few months after getting involved, I was mainlining research papers and getting myself up to speed as best one can when the research is moving as fast as it is. In late 2012, I stumbled upon Geoffrey Hinton’s Coursera class on neural networks and immediately realized the importance of what would become the second AI renaissance. But when all the talk started about AGI happening in our lifetime, I found myself on the more pessimistic side of outlooks (though it seemed to me one’s optimism on the matter was negatively correlated with the number of hours actually spent trying to train neural networks).

But when the 2022 version of ChatGPT came out, I found myself realizing again that this was going to change a lot so I was keen to add these tools to my kit and saw this as a great opportunity to make use of this rapidly advancing tech to build something that wasn’t possible before.

On the other side of the ball, I grew up in a little farm town in south Texas and as the book, movie, and TV series Friday Night Lights have pointed out, football is kind of a big deal there. Like ~90% of the guys in my school, I started playing in 7th grade and would continue to play through all four years at MIT (how many of you just learned MIT has a football team?). I’ve always been a student of the game and find it to be one of the more fascinating sports to watch.

Since the roles and responsibilities of positions are more segmented than other sports, this leads to many different battles occurring on every play. And given this complexity, I believe most sports fans would appreciate understanding the game and the ever-developing strategy chess match occurring on the field at a deeper level. That’s part of the vision we have for what we’re building: utilizing LLMs to help the casual fan see and understand more of the complexity that is occurring in front of them – to help them watch the game the way a coach would.

On the side, I was working on building out a betting model for the NFL, and when that started to show promising results, we recognized it as an exciting opportunity to differentiate ourselves and a catalyst to jump-start our user growth. We shifted focus towards its automation and enhancement. Our website (chatTCL.ai) now updates NFL bets every 15 minutes, integrating the latest player statistics and lines/odds from ten sportsbooks. We're now circling back to the LLM side of the problem, aiming to integrate the betting model into the chatbot alongside diverse data sources. By combining our data science and tech expertise with our extensive understanding of sports, we're moving towards a more informed and interactive approach to sports analytics and engagement.

You built a player prop betting model — what’s a player prop bet? What data are you using? How is the model constructed?

The most familiar sports bet is the spread. A sportsbook will set the line: for the Super Bowl [as of Monday], it is the 49ers winning by 2.5 points. As a bettor, you can bet on the 49ers “covering the spread” if you think they’ll win by 3 points or more, or you can take the other side of that bet if you think they’ll win by 2 or fewer points or that the Chiefs will win. This same logic can be extended to the various player stats in a game.

Currently, our model covers the number of completed passes, number of pass attempts, passing yards, number of interceptions thrown, number of receptions, total receiving yards, number of rush attempts, total rushing yards, number of field goals, and if someone will score a touchdown. For each of these stats, the sportsbooks will have a line for the relevant players in the game. For example, the line for number of receptions Travis Kelce will have in the Super Bowl is 6.5 (the lines are usually in half-increments to avoid ties). You can bet “the over” if you think he’ll have 7 or more receptions or “the under” if you think he’ll have 6 or fewer.
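The payout on each side of a line encodes the sportsbook's implied probability. A minimal sketch of that conversion, assuming the standard -110 pricing convention (the specific odds here are illustrative, not quoted from the article):

```python
def american_to_prob(odds):
    """Convert American odds to the implied probability of winning."""
    if odds < 0:
        return -odds / (-odds + 100)
    return 100 / (odds + 100)

# A typical prop is priced at -110 on both sides.
over_p = american_to_prob(-110)
under_p = american_to_prob(-110)

# The two implied probabilities sum to more than 1;
# the excess is the sportsbook's vig.
total = over_p + under_p
```

Each side implies about a 52.4% chance, which is why a bet only has positive expected value when your model disagrees with the line by more than the vig.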

The Crowd’s Line was originally using a third party to determine the value of the bets and had integrated these into our chatbot such that you could get the top bets for a given game or for that week. After a few weeks of games, I looked into how the third-party bets were doing and let’s just say that there was room for improvement. There seemed to be some signal there, and as someone who has never met a signal he didn’t want to ensemble, the course of action seemed obvious: get more signals.

Over the course of the next few weeks, I found and started collecting more than 10 publicly available sources for player stat projections – mostly from fantasy football sites – along with various python and R libraries with historical data for every NFL stat you could possibly think of. Most of these stat projection sites do not leave up their old projections so you need to be collecting them each week to have the data.

The first important observation needed when setting up this model is that fantasy sites are predicting the mean of each stat, since fantasy football points are proportional to the stat number achieved. However, when it comes to betting, you care more about the median projected outcome than the mean — since betting lines set by the sportsbooks are (mostly) priced like coin flips, with a roughly 50/50 implied probability of going over or under the line.

Further, given that player stats are mostly bounded by zero on the left, these distributions often have significant positive skew which causes a meaningful difference between the projected median and the projected mean. So, a crucial first step is to use historical data to figure out how to adjust these stat projections from mean to median. Then, using the median as your starting point, you can construct a probability distribution for each bet using historical data as your guide. Once you have a probability distribution, you can determine the probability of a stat ending up above or below any given line. Combining that with the bet payout, you have the expected value for each bet. You can also calculate the standard deviation of a bet given the probability of winning and the payout.
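The steps above can be sketched end to end. Everything here is illustrative: the lognormal distribution, its parameters, and the -110 pricing are assumptions chosen to show the mean/median gap and the EV calculation, not the article's actual model:

```python
import math

# Hypothetical receiving-yards model: lognormal is positively skewed
# and bounded below by zero, like the stats described above.
mu, sigma = math.log(60), 0.5
mean = math.exp(mu + sigma**2 / 2)   # ~68 yds: what fantasy sites project
median = math.exp(mu)                # 60 yds: what a 50/50 line cares about

def p_over(line, mu, sigma):
    """P(stat > line) under the lognormal assumption."""
    z = (math.log(line) - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))  # standard normal survival fn

line, b = 55.5, 100 / 110            # line, and net payout per $1 at -110
p = p_over(line, mu, sigma)
ev = p * b - (1 - p)                 # expected value of a $1 over bet
sd = (b + 1) * math.sqrt(p * (1 - p))  # std dev of the two-outcome bet
```

Note the mean sits several yards above the median purely because of the skew, which is exactly why using fantasy projections as-is would bias the over/under call.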

Using the expected value and standard deviation for each bet, you can construct a portfolio of bets with the highest expected return to risk ratio.
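As a toy sketch of that last step: if the bets were independent, the maximum Sharpe-ratio portfolio would stake each bet in proportion to its expected value divided by its variance. The numbers below are made up for illustration; the article's actual optimizer is not described:

```python
# Hypothetical EVs and standard deviations for three independent $1 bets.
evs = [0.07, 0.04, 0.02]
sds = [0.95, 0.90, 1.00]

# Max-Sharpe weights for independent bets: w_i proportional to ev_i / var_i.
raw = [ev / sd**2 for ev, sd in zip(evs, sds)]
weights = [w / sum(raw) for w in raw]

# Portfolio expected return and risk (independence => variances add).
port_ev = sum(w * ev for w, ev in zip(weights, evs))
port_sd = sum((w * sd) ** 2 for w, sd in zip(weights, sds)) ** 0.5
sharpe = port_ev / port_sd
```

The combined return-to-risk ratio beats any single bet's, which is the whole point of building a portfolio rather than picking one "best" bet.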

I would also go on to train CNNs on more traditional time-series input data. It doesn’t seem to be well known, but CNNs are actually pretty good with time series as well. The approach is still chart-based in the sense that the open, high, low, and close prices are represented by their vertical positions on a price chart over a given lookback period (if the low sits exactly in the middle of a 30-day price chart, it’s represented as a zero, for example).
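That normalization can be sketched as follows. This is one plausible reading of the chart encoding just described, with hypothetical function and parameter names, not the actual implementation:

```python
import numpy as np

def chart_encoding(open_, high, low, close, lookback=30):
    """Encode each OHLC bar by its vertical position on the window's
    price chart, scaled so the window low maps to -1, the midpoint
    to 0, and the window high to +1."""
    o, h, l, c = (np.asarray(x, dtype=float)[-lookback:]
                  for x in (open_, high, low, close))
    lo, hi = l.min(), h.max()
    scale = lambda p: 2 * (p - lo) / (hi - lo) - 1  # position in [-1, 1]
    # 4 x lookback array, image-like input a CNN can convolve over.
    return np.stack([scale(o), scale(h), scale(l), scale(c)])

enc = chart_encoding([1, 2], [2, 4], [1, 2], [2, 3], lookback=2)
```

The encoding makes windows at very different price levels look identical to the network, so the CNN learns shapes on the chart rather than absolute prices.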

Given the flexible framework of neural networks, I trained a large number of CNNs that produced very diverse results. Each CNN predicts the future return for each futures contract being traded. Then, those return predictions are combined with a covariance matrix (that’s been cleaned using a methodology I developed that harnesses insights from random matrix theory) to form an optimal-Sharpe-ratio portfolio. Then the sub-models are combined with a weighted average.
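The author's cleaning methodology is his own, but a standard RMT-inspired approach is to clip the correlation eigenvalues that fall inside the Marchenko-Pastur noise band before building the portfolio. A minimal sketch under that assumption (this is the textbook technique, likely not his exact method):

```python
import numpy as np

def clean_covariance(returns):
    """Flatten correlation eigenvalues inside the Marchenko-Pastur
    'noise' band, then rebuild a covariance matrix. returns: T x N."""
    T, N = returns.shape
    vols = returns.std(axis=0)
    corr = np.corrcoef(returns, rowvar=False)
    evals, evecs = np.linalg.eigh(corr)
    # Upper edge of the eigenvalue spectrum for a pure-noise matrix.
    lam_max = (1 + np.sqrt(N / T)) ** 2
    noise = evals < lam_max
    if noise.any():
        evals[noise] = evals[noise].mean()  # flatten the noise band
    cleaned = evecs @ np.diag(evals) @ evecs.T
    # Restore a unit diagonal, then rescale back to covariance units.
    d = np.sqrt(np.diag(cleaned))
    cleaned = cleaned / np.outer(d, d)
    return cleaned * np.outer(vols, vols)
```

Clipping the noise band stabilizes the small eigenvalues that a Sharpe optimizer would otherwise lever up, which is why cleaning matters before portfolio construction.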

None of the sub-models is good enough to trade on its own given slippage, but the average inter-correlation of the sub-model returns is about 0.15, and given the volatility compression that occurs when combining them, there are effectively 20 uncorrelated quant strategies under the hood. That’s why this approach works.
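The volatility compression has a simple closed form: for n strategies with equal volatility, equal Sharpe, and pairwise correlation rho, equal-weighting boosts the ensemble Sharpe by sqrt(n / (1 + (n-1)·rho)). A quick sketch using the ~0.15 figure mentioned above (the sample sizes are arbitrary):

```python
import math

def sharpe_multiplier(n, rho):
    """Sharpe boost from equally weighting n strategies with equal vol,
    equal Sharpe, and pairwise correlation rho."""
    return math.sqrt(n / (1 + (n - 1) * rho))

# How the boost grows as sub-models are added at rho = 0.15.
boosts = {n: round(sharpe_multiplier(n, 0.15), 2) for n in (1, 5, 20, 100)}
```

The multiplier rises quickly at first and then saturates toward sqrt(1/rho), which is why an ensemble of individually untradeable sub-models can clear slippage when no single one can.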

When I think about what I built here, I don’t think of it as building a quant strategy. I think of it as building the machinery that can mass-produce a diverse set of quant strategies. I have mainly used this framework with futures, but I also trained CNNs on cash equities and have traded those as well. The neural networks were trained on the S&P 500, but they were able to generalize and work in international equity markets.

You have access to a player position data feed from the NFL. What can you tell us about that data set - is it streaming to you live? Is it a two dimensional position?

Yes, the data streams live, which means it’s 15-20 seconds faster than your TV. There is an x, y, and z coordinate for each player, along with velocity and acceleration for those axes. It’s a treasure trove of data that isn’t otherwise available. Unfortunately, I have been stretched so thin with our other initiatives that I haven’t had the time needed to dig into this. But we have the data and see a lot of opportunity in using it once we have more resources.

You’ve had a really exciting career so far. For any undergraduate or high school students reading this who might want to pursue a career in finance or sports analytics - what advice do you have?

First off, I’m incredibly jealous of all the information that is available to high schoolers and undergrads on the internet these days. As a kid who grew up in Inez, TX and was academically bored out of my mind, I hope they realize how lucky they are to have so many resources readily available to learn anything they want to learn. And that’s the main thing: keep learning and keep exploring. There is so much exciting research going on and you can find inspiration and ideas in unlikely places.

Even more importantly, you want to build a solid foundation of basic stats. You can get a lot of mileage out of some very simple statistical methods but you need to break through abstraction layers and really understand what those stats mean, when they break, and how they can be altered to work with the data you have. Think in probabilities and distributions. Point-estimates don’t represent reality. Nothing is a point-estimate. We live in a probabilistic universe with tons of uncertainty and noise.

One concrete example of advice I give to someone building models is that the results need to be stable with respect to all model parameters. If setting a parameter to 0.26 gives you a much different result than setting it to 0.25, you have a problem. Look for regions of stability. Inject noise. Stability of results is more important than absolute results.
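That check is easy to automate. A minimal sketch, with a made-up helper name and tolerance; `model_fn` stands in for whatever maps a parameter value to your result metric:

```python
def stability_scan(model_fn, values, tol=0.05):
    """Sweep one model parameter and flag adjacent settings whose
    results diverge by more than `tol` -- a hint you're fitting noise."""
    results = [model_fn(v) for v in values]
    flagged = [
        (values[i], values[i + 1])
        for i in range(len(results) - 1)
        if abs(results[i + 1] - results[i]) > tol
    ]
    return results, flagged

grid = [0.20, 0.25, 0.26, 0.30]
# A smooth response surface passes; a cliff at 0.26 gets flagged.
_, smooth_flags = stability_scan(lambda p: p**2, grid)
_, jumpy_flags = stability_scan(lambda p: float(p >= 0.26), grid)
```

The flagged pairs point you at exactly the parameter regions where the model's behavior is too sensitive to trust.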

One more: do you have a model for who will win the superbowl? Or how would you approach the problem of predicting a single game outcome? What’s your thought process for that kind of problem?

I haven’t tackled game outcomes yet. I would take a similar approach as the player prop and look to combine projections from a lot of sources. I’m not sure if that alone will be enough to work but I see an opportunity to use my player prop model to inform game outcomes as an extra signal to add in.

If the player prop model has a bunch of overs on receiving yards and passing touchdowns for one team and unders for the other, that should be a useful signal that my projection sources see the game going differently than the sportsbook. It’s another exciting area of research that’s on the long list of ideas to explore.

The NFL season is coming to a close. What’s next?

I’ve already begun work on the NBA player prop model and hope to have something working by the end of the month. We’re also refocusing on the LLM aspect of the business and looking to raise capital and expand the team. The team will be headed to the MIT Sloan Sports Analytics Conference on March 1st, so feel free to reach out to us if you’ll be there and want to talk more about what we’re working on.