Moneyball

In honor of everyone’s favorite baseball movie, my math senior project focused on assessing an important model used in Sabermetrics, the term for “baseball analytics” coined by Bill James. It is called the log5 model. After studying it for two quarters, I can justify the “log” part of the name: in one of its forms, the model boils down to a natural logarithm of a linear combination of variables. However, I am still unsure where the 5 comes from. Maybe there were four models before it that didn’t quite make the cut.

Anyway, here is how a model of paired outcomes like the log5 model can be used in sports other than baseball. Take soccer, for instance. How would you predict the winner between two soccer teams facing each other? Well, a good start might be to analyze the outcomes of previous matches between the two teams.

Ah, but what if the two teams have never faced each other? Maybe they're in different leagues? As you may know, this happens all the time in a sport like baseball, where teams are divided into two leagues: the American League and the National League.

In this scenario, a statistician would proceed by finding a set of statistics for each team that are strong predictors of the most likely winner. In soccer, this could be a team's number of shots on target divided by its total number of shots taken. But what if it is easier to get shots on target in one league than in another? This likely scenario skews the probability estimate for a match between two teams from different leagues. Therefore, we must take into account the two leagues' combined average rate of success in getting a shot on target.

In essence, what I just described is what a log5 model does. It compares the corresponding statistics between a pair, while taking into account the league average statistic, and it predicts the probability of success for an outcome.
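To make that concrete, here is a minimal Python sketch. It assumes the commonly cited odds-ratio form of log5, which may not be the exact form assessed in the project. The function name, argument names, and all rates are made up for illustration.

```python
def log5(rate_a: float, rate_b: float, league_rate: float) -> float:
    """Estimate the probability of an outcome for one pairing.

    rate_a      -- how often side A produces the outcome (e.g. a batter's hit rate)
    rate_b      -- how often side B allows the outcome (e.g. a pitcher's hits-allowed rate)
    league_rate -- how often the outcome happens league-wide

    Assumption: this is the standard odds-ratio form of log5.
    """
    for name, rate in (("rate_a", rate_a), ("rate_b", rate_b), ("league_rate", league_rate)):
        if not 0.0 < rate < 1.0:
            raise ValueError(f"{name} must be strictly between 0 and 1, got {rate}")

    # Weight of the outcome happening, scaled by how common it is league-wide.
    success = rate_a * rate_b / league_rate
    # Weight of the outcome not happening, scaled the same way.
    failure = (1 - rate_a) * (1 - rate_b) / (1 - league_rate)
    return success / (success + failure)


# Illustrative numbers only: a .300 hitter against a pitcher who allows
# hits at a .200 rate, in a league where the overall rate is .250.
estimate = log5(0.300, 0.200, 0.250)
print(round(estimate, 3))
```

A handy sanity check on this form: when one side is exactly league average, the estimate collapses to the other side's own rate.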

In the case of my math senior project, I considered a pair to be a baseball batter and pitcher from different teams. One of the outcomes I was interested in was the probability of the batter getting a "hit" off the pitcher. I also examined the probability of the batter hitting into a "ground-out" and a "fly-out." These can also be thought of as the probability of the pitcher allowing a hit, inducing a ground-out, and inducing a fly-out. It just depends on which perspective you take.

So, in one form, the log5 model looks something like this:

(picture of equation), explanation of variables
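For readers without the picture, the form of log5 most commonly given for a batter/pitcher matchup is written out below. This is an assumption about which form is meant, and the symbols are chosen here for illustration: p_B is the batter's rate for the outcome, p_P is the rate the pitcher allows, p_L is the league-wide rate, and P is the predicted probability for the matchup.

```latex
P = \frac{\dfrac{p_B \, p_P}{p_L}}
         {\dfrac{p_B \, p_P}{p_L} + \dfrac{(1 - p_B)(1 - p_P)}{1 - p_L}}
\qquad\Longleftrightarrow\qquad
\ln\frac{P}{1 - P}
  = \ln\frac{p_B}{1 - p_B}
  + \ln\frac{p_P}{1 - p_P}
  - \ln\frac{p_L}{1 - p_L}
```

The second version is the same model rewritten in log-odds, where it becomes a plain sum and difference of natural-log terms.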

Testing this model fairly with actual baseball data required a very careful and tedious process. I will spare you the details for the sake of the length of this post, but for those following along, feel free to take a gander at the final presentation of this project, which outlines in detail the methodology of our assessment of the log5 model.

One last note:

I first implemented the methodology explained in the presentation using Python's pandas package with a local SQLite database of play-by-play data extracted from retrosheet.org.

Some portions of this code took an hour or two to run, for reasons that included inefficiently written code and the overall "slowness" of operations on large pandas data frames.

Therefore, I redid my math senior project after taking a distributed computing course on Hadoop Map/Reduce and Spark. The Java code, which performed a map/reduce to find a list of all the "testable" batter/pitcher combinations, ended up running in less than 10 seconds on a Hadoop cluster with 9 nodes. The actual log5 calculations and the assessment of the model were written in Scala and took advantage of Spark's RDDs to run in under 2 seconds.

The new results confirmed the original ones: the log5 model predicts probabilities significantly better for fly-outs and ground-outs than for hits.

So how can these results be used in the real world of baseball? Well, if a coach is able to determine which of his players in the batting lineup would fare the best against a given pitcher, the coach has the power to adjust his lineup to optimize his team's scoring before the three outs are made. This usually means putting your best batter fourth in the lineup and putting players with a higher probability of hitting into a ground-out early in the lineup, so that they don't set up a potential double or triple play for the defense.

$100 prize if someone can answer me this question: why does the log5 model I used work so well for ground-outs and not hits, and how can we fit it to work better for hits? I have a feeling I know the answer to the first question, but not the second question. Would love to have a chat with any statistically inclined individuals about this.