The Math Behind Moneyball

Rich Moss, M.Sc. Analytics

Introduction

Moneyball is often cited as a major inflection point for the use of analytics in sports. For those unfamiliar with the story, Oakland A’s GM Billy Beane believed that teams evaluated a batter’s success largely by slugging percentage (SLG) and undervalued a batter’s ability to draw a walk. In economic terms, this meant that the pre-Moneyball market for batters was inefficient: players who could draw walks may have been undervalued, while players who had a good slugging percentage but made poor decisions at the plate may have been overvalued.

The A’s outperformed projections by a wide margin, and many teams scrambled to understand how analytics could help them, but the theory was not formally tested until Jahn Hakes and Skip Sauer published their paper “An Economic Evaluation of the Moneyball Hypothesis” (https://www.aeaweb.org/articles?id=10.1257/jep.20.3.173). In this article, I will use R to recreate Tables 1 and 3 of that paper and explain some of the findings from testing their hypothesis. The inspiration for this project comes from the University of Michigan’s Sports Analytics courses, which are an excellent resource for applying data science concepts to sports-based scenarios.

Hakes and Sauer Table 1

The above table is taken directly from the Hakes and Sauer article and presents four regression models, each regressing win percentage on a set of variables. The first uses on-base percentage (OBP) for and against a team, the second uses slugging percentage for and against a team, and the third uses all four of these variables. The fourth uses the difference between OBP for and against and the difference between SLG for and against, so it has two coefficients attached to these differences rather than the four coefficients attached to the individual predictors in model three.

For those unfamiliar with these baseball statistics, they can be defined as follows:

SLG = (Singles + 2*Doubles + 3*Triples + 4*Home Runs) / At Bats

OBP = (Hits + Walks + Hit by Pitch) / (At Bats + Walks + Hit by Pitch + Sacrifice Flies)
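
As a quick check of the two definitions, here is a minimal sketch in plain Python (not the code used for the analysis itself). The function names, dictionary keys, and the season line are made up for illustration.

```python
def slugging(singles, doubles, triples, home_runs, at_bats):
    """Total bases per at bat."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats


def on_base(hits, walks, hit_by_pitch, at_bats, sac_flies):
    """Times on base divided by the plate appearances that count toward OBP."""
    return (hits + walks + hit_by_pitch) / (at_bats + walks + hit_by_pitch + sac_flies)


# Made-up season line for illustration, not real player data.
season = dict(ab=500, h=150, doubles=30, triples=5, hr=20, bb=60, hbp=5, sf=5)

# Singles are whatever hits remain after removing extra-base hits.
singles = season["h"] - season["doubles"] - season["triples"] - season["hr"]

slg = slugging(singles, season["doubles"], season["triples"], season["hr"], season["ab"])
obp = on_base(season["h"], season["bb"], season["hbp"], season["ab"], season["sf"])
print(f"SLG = {slg:.3f}, OBP = {obp:.3f}")
```

Note that walks and hit-by-pitches raise OBP but leave SLG untouched, which is exactly the skill the Moneyball hypothesis says was underpriced.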

The data I am using for this analysis comes from Retrosheet and can be found here: https://raw.githubusercontent.com/maxtoki/baseball_R/master/data/game_log_header.csv

While this data does not have headers, the individual game logs with headers can be found here: https://www.retrosheet.org/gamelogs/index.html. Of note, we are only looking at games from 1999–2003 for this experiment.

Cleaning the Dataset

First, I created a variable to identify whether the home or away team won by comparing the home and visitor scores associated with each game as follows:
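
The original code for this step was only available as an image, so what follows is a minimal sketch in plain Python rather than the author's code. The field names (`home_score`, `visitor_score`, and so on) are hypothetical stand-ins for the Retrosheet game log columns, and the games are made up.

```python
# Made-up games; the keys are hypothetical stand-ins for the
# Retrosheet game log fields.
games = [
    {"home_team": "OAK", "visitor_team": "SEA", "home_score": 5, "visitor_score": 3},
    {"home_team": "OAK", "visitor_team": "TEX", "home_score": 2, "visitor_score": 7},
    {"home_team": "SEA", "visitor_team": "OAK", "home_score": 1, "visitor_score": 4},
]

for game in games:
    # Compare the two scores to flag the winner of each game.
    game["home_win"] = int(game["home_score"] > game["visitor_score"])
    game["away_win"] = int(game["visitor_score"] > game["home_score"])

print([(g["home_team"], g["home_win"], g["away_win"]) for g in games])
```

Keeping two separate 0/1 flags means a tied game, if one appears in the logs, counts as a win for neither side.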

Next, we can aggregate the statistics needed for SLG and OBP (ab, h, 2b, 3b, hr, sf, bb and hbp) and define singles as hits minus doubles, triples and home runs (h - 2b - 3b - hr). To do this for the home team, we can manipulate the data frame as follows:
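
A minimal sketch of this aggregation in plain Python, assuming each game row carries the home team's batting totals under hypothetical `home_`-prefixed keys. The rows are made up, and the original analysis presumably used a data frame group-by instead.

```python
from collections import Counter, defaultdict

STATS = ["ab", "h", "2b", "3b", "hr", "bb", "hbp", "sf"]

# Made-up game rows; the "home_" prefixed keys are hypothetical names.
games = [
    {"year": 2002, "home_team": "OAK", "home_win": 1, "home_ab": 34, "home_h": 10,
     "home_2b": 2, "home_3b": 0, "home_hr": 1, "home_bb": 4, "home_hbp": 0, "home_sf": 1},
    {"year": 2002, "home_team": "OAK", "home_win": 0, "home_ab": 31, "home_h": 6,
     "home_2b": 1, "home_3b": 1, "home_hr": 0, "home_bb": 2, "home_hbp": 1, "home_sf": 0},
    {"year": 2002, "home_team": "SEA", "home_win": 1, "home_ab": 33, "home_h": 9,
     "home_2b": 3, "home_3b": 0, "home_hr": 2, "home_bb": 3, "home_hbp": 0, "home_sf": 0},
]

# Sum every statistic per (team, year), counting wins and games along the way.
home = defaultdict(Counter)
for g in games:
    key = (g["home_team"], g["year"])
    for stat in STATS:
        home[key][stat] += g[f"home_{stat}"]
    home[key]["wins"] += g["home_win"]
    home[key]["games"] += 1

# Singles are hits that were not doubles, triples or home runs.
for totals in home.values():
    totals["1b"] = totals["h"] - totals["2b"] - totals["3b"] - totals["hr"]

print(dict(home[("OAK", 2002)]))
```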

We then repeat the process for the away teams to generate two datasets, each with our required variables. To get each team’s record over the full season, we can merge the home and away data frames:
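
The merge can be sketched in plain Python as an inner join on (team, year). This is an illustration with made-up totals and hypothetical field names, not the original code. The `_x` and `_y` suffixes are added by hand here to mimic what a data frame merge does when both inputs share column names (pandas `merge`, for instance, appends `_x` and `_y` by default).

```python
# Made-up season totals keyed by (team, year); field names are hypothetical.
home = {
    ("OAK", 2002): {"wins": 50, "games": 81, "ab": 2700, "h": 720},
    ("SEA", 2002): {"wins": 48, "games": 81, "ab": 2750, "h": 740},
}
away = {
    ("OAK", 2002): {"wins": 45, "games": 81, "ab": 2800, "h": 730},
}

merged = {}
for key in home.keys() & away.keys():  # inner join: keep keys present in both
    row = {f"{col}_x": value for col, value in home[key].items()}        # home totals
    row.update({f"{col}_y": value for col, value in away[key].items()})  # away totals
    merged[key] = row

print(merged)
```

Because this is an inner join, a team-season that appears in only one of the two inputs (SEA in this toy data) is dropped.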

Keep in mind that the two data frames share column names, so by default the columns from the home data frame have _x appended and those from the away data frame have _y.

With this master dataset created, we can then calculate OBPFOR, OBPAGN, SLGFOR and SLGAGN using the definitions above. I will show an example of OBPFOR below.
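
A minimal sketch of the OBPFOR calculation in plain Python, assuming a merged row in which `_x` columns hold a team's batting totals at home and `_y` columns hold its totals on the road. The function name and the numbers are made up for illustration.

```python
def obp_for(row):
    """Season OBP for a team, combining its home (_x) and away (_y) batting totals."""
    def total(stat):
        return row[f"{stat}_x"] + row[f"{stat}_y"]

    times_on_base = total("h") + total("bb") + total("hbp")
    chances = total("ab") + total("bb") + total("hbp") + total("sf")
    return times_on_base / chances


# Made-up merged row for one team-season.
row = {"ab_x": 2700, "ab_y": 2800, "h_x": 720, "h_y": 730,
       "bb_x": 300, "bb_y": 290, "hbp_x": 30, "hbp_y": 35,
       "sf_x": 25, "sf_y": 20}

print(f"OBPFOR = {obp_for(row):.3f}")
```

OBPAGN, SLGFOR and SLGAGN follow the same pattern, with the opponents' totals used for the "against" versions.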

Finally, win percentage can be calculated for each team in each year as follows:
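
In plain Python, and assuming the merged row keeps home wins and games under `_x` and away wins and games under `_y` (hypothetical column names, made-up numbers), the calculation is a single ratio:

```python
def win_percentage(row):
    """Share of all games won, combining the home (_x) and away (_y) records."""
    wins = row["wins_x"] + row["wins_y"]
    games = row["games_x"] + row["games_y"]
    return wins / games


# Made-up record for one team-season.
row = {"wins_x": 50, "games_x": 81, "wins_y": 45, "games_y": 81}
print(f"Win percentage = {win_percentage(row):.3f}")
```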

Creating the Regression Models

Running the first regression in R, which regresses win percentage on OBPFOR and OBPAGN, we see the following outcome:

Repeating this for each other regression model, we can create a summary table that looks like this:
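
The four specifications can be sketched as follows. This is a plain Python illustration, not the original analysis: the `ols` helper is a small hand-written least-squares solver, and the team-seasons are synthetic, generated without noise from arbitrary planted coefficients (2.0 on the OBP difference, 1.0 on the SLG difference) so that the fit can be checked. None of the printed numbers are results from the paper or from Retrosheet data.

```python
import random


def ols(y, columns):
    """Least-squares coefficients, intercept first, from the normal equations."""
    X = [[1.0, *values] for values in zip(*columns)]
    k = len(X[0])
    # Augmented matrix [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(k):
            if r != c:
                factor = A[r][c]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]


# Synthetic team-seasons (assumption: not real data, no noise added).
rng = random.Random(0)
n = 150
obp_for = [rng.gauss(0.335, 0.012) for _ in range(n)]
obp_agn = [rng.gauss(0.335, 0.012) for _ in range(n)]
slg_for = [rng.gauss(0.425, 0.025) for _ in range(n)]
slg_agn = [rng.gauss(0.425, 0.025) for _ in range(n)]
wpc = [0.5 + 2.0 * (a - b) + 1.0 * (c - d)
       for a, b, c, d in zip(obp_for, obp_agn, slg_for, slg_agn)]

# One entry per column of Table 1: the predictors used by each model.
models = {
    "model_1": [obp_for, obp_agn],
    "model_2": [slg_for, slg_agn],
    "model_3": [obp_for, obp_agn, slg_for, slg_agn],
    "model_4": [[a - b for a, b in zip(obp_for, obp_agn)],
                [c - d for c, d in zip(slg_for, slg_agn)]],
}

results = {name: ols(wpc, cols) for name, cols in models.items()}
for name, coefs in results.items():
    print(name, [round(c, 3) for c in coefs])
```

With real data a statistics package would also report standard errors and R-squared for each model. The point here is only how the four sets of predictors differ.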

This is essentially Table 1 from the Hakes and Sauer paper. The main conclusion we can draw from this table is that OBP matters more than SLG in determining win percentage. Looking at Models 1 and 2 in isolation, the coefficients for OBP are much larger, indicating that a unit increase in OBP is associated with a larger change in win percentage than a unit increase in SLG. When we combine the variables in Model 3, the impact of OBP on winning is again about twice that of SLG, and this is mirrored in the differences in Model 4.

Recreating Hakes and Sauer Table 3

Where Table 1 focuses on the impact of SLG and OBP on winning, Table 3 asks whether the market was competitive and efficient: before the publication of Moneyball, were players with higher OBP adequately compensated? Because other factors also affect salaries, Hakes and Sauer incorporated plate appearances, arbitration eligibility, free agency, and fielding position into the model.

Looking at the top two rows of Hakes and Sauer’s Table 3, we can see that SLG is statistically significant and has a large effect on a player’s salary in every year. OBP, however, is statistically insignificant before 2004. In 2004, its coefficient becomes significant (the ratio of the coefficient to the standard error in parentheses is larger than 2 in absolute value) and increases dramatically compared to 2003, indicating a market-wide recognition that OBP was now valued highly by teams.

Of note, the salaries do not scale linearly, so we can take the natural log of salary for the regression in Table 3. This regression can be written as:

log(Salary) = b0 + b1*OBP + b2*SLG + b3*PA + b4*Arb + b5*Free + b6*Catcher + b7*Infielder

The salary data frame was provided by the University of Michigan and contains a year ID, a team ID, a player ID, and a salary value. First, any players without salary values were dropped and the log of the salary column was taken. We also rename the yearID column to SalYear, since players get paid in the year after their performance in a season. The OBP, SLG, and PA values were pulled from Sean Lahman’s baseball database. To get a player’s arbitration status and free agency eligibility, we can extract the player’s debut year and calculate their years of experience as the season in which their statistics are recorded minus the debut year. Finally, we merge in the player’s position, grouping each position from the Lahman data frame as Catcher, Infielder, or Other. Since some players play multiple positions in a season, we only consider the position they play most frequently. Joining all this data together results in a master data frame as shown:
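
A minimal sketch of these preparation steps in plain Python, with made-up players and hypothetical field names. The experience thresholds are an assumption on my part, since the text does not state them: arbitration eligibility is approximated as three to five years of experience and free agency as six or more, following the general MLB service-time rules. Experience is counted from the debut year, which only approximates actual service time.

```python
import math

INFIELD = {"1B", "2B", "3B", "SS"}

# Made-up players; field names are hypothetical. SalYear is the year the
# salary is paid, so the matching statistics come from SalYear - 1.
players = [
    {"playerID": "a01", "SalYear": 2003, "salary": 400000, "debut_year": 2001,
     "games_by_pos": {"C": 90, "1B": 10}},
    {"playerID": "b02", "SalYear": 2003, "salary": 3500000, "debut_year": 1998,
     "games_by_pos": {"SS": 120, "2B": 30}},
    {"playerID": "c03", "SalYear": 2003, "salary": 9000000, "debut_year": 1994,
     "games_by_pos": {"OF": 140}},
    {"playerID": "d04", "SalYear": 2003, "salary": None, "debut_year": 2000,
     "games_by_pos": {"3B": 50}},
]

rows = []
for p in players:
    if not p["salary"]:  # drop players with a missing (or zero) salary
        continue
    experience = (p["SalYear"] - 1) - p["debut_year"]
    # Keep only the position played most often that season.
    primary = max(p["games_by_pos"], key=p["games_by_pos"].get)
    rows.append({
        "playerID": p["playerID"],
        "SalYear": p["SalYear"],
        "log_salary": math.log(p["salary"]),
        "Arb": int(3 <= experience < 6),   # assumed threshold, see note above
        "Free": int(experience >= 6),      # assumed threshold, see note above
        "Catcher": int(primary == "C"),
        "Infielder": int(primary in INFIELD),
    })

for row in rows:
    print(row)
```

Players who are neither arbitration eligible nor free agents, and positions that are neither catcher nor infield, act as the baseline categories in the regression.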

To match the Hakes and Sauer paper, we can filter from 2000 to 2004, then run the regression as follows:
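
The filter and regression can be sketched in plain Python as below. This is an illustration only: the `ols_with_se` helper is hand-written, the players are simulated, and the coefficients in `TRUE` are arbitrary values used to generate the fake salaries, so the output says nothing about the real market. The significance flag uses the same rule of thumb as the paper's table, a coefficient more than two standard errors from zero.

```python
import math
import random


def ols_with_se(y, X):
    """OLS coefficients and standard errors; rows of X must include the constant."""
    n, k = len(X), len(X[0])
    # Reduce [X'X | I] to [I | (X'X)^-1] by Gauss-Jordan elimination.
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [float(i == j) for j in range(k)] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(k):
            if r != c:
                factor = A[r][c]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[c])]
    inv = [row[k:] for row in A]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [yi - sum(b * x for b, x in zip(beta, r)) for r, yi in zip(X, y)]
    sigma2 = sum(e * e for e in resid) / (n - k)  # residual variance
    se = [math.sqrt(sigma2 * inv[i][i]) for i in range(k)]
    return beta, se


NAMES = ["const", "OBP", "SLG", "PA", "Arb", "Free", "Catcher", "Infielder"]
TRUE = [13.0, 1.0, 2.5, 0.003, 1.2, 1.7, 0.1, 0.0]  # arbitrary, simulation only

# Simulated player-seasons (assumption: not real salary data).
rng = random.Random(1)
players = []
for _ in range(700):
    status = rng.choice(["pre", "arb", "free"])
    pos = rng.choice(["C", "IF", "OTHER"])
    x = [1.0, rng.gauss(0.33, 0.03), rng.gauss(0.42, 0.06), rng.uniform(130, 700),
         float(status == "arb"), float(status == "free"),
         float(pos == "C"), float(pos == "IF")]
    log_salary = sum(b * v for b, v in zip(TRUE, x)) + rng.gauss(0, 0.5)
    players.append({"SalYear": rng.randint(1999, 2005), "x": x, "log_salary": log_salary})

# Keep salary years 2000 to 2004, then fit the pooled regression.
sample = [p for p in players if 2000 <= p["SalYear"] <= 2004]
beta, se = ols_with_se([p["log_salary"] for p in sample], [p["x"] for p in sample])

for name, b, s in zip(NAMES, beta, se):
    flag = "significant" if abs(b / s) > 2 else "not significant"
    print(f"{name:<10} {b:9.4f} ({s:.4f})  {flag}")
```

The same function can be run on one salary year at a time by narrowing the filter, which is how a year-by-year table would be built.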

Of note, when I compare my table to that of Hakes and Sauer, my coefficients are close but not identical. The pooled regression shows that, across the 2000 to 2004 sample as a whole, the coefficient on slugging percentage indicates a much larger impact on a player’s log(Salary) than the coefficient on OBP, with a lower standard error and p-value. Running the regression only on 2000 to 2003 again shows that SLG mattered more than OBP in determining a player’s salary.

Finally, to generate a table that matches that of Hakes and Sauer, we can use the following code in R:

This shows a major shift in 2004: OBP was statistically significant in determining salary for the first time, and its coefficient was much larger than that of SLG (4.35 to 2.17). The market adjusted to some extent post-Moneyball, as other teams around the league began to value OBP correctly. This is what we would expect in an efficient market: when presented with sufficient information, teams tend to make decisions that result in winning, and players who provide winning skills get paid as a result.

If you enjoyed this article, I highly recommend the course from the University of Michigan (found on Coursera), as it goes into further analysis of how the markets changed post-Moneyball and provides counter-criticism to the Hakes and Sauer paper, while also providing other interesting sports analytics projects to try.