Predicting March Madness Winners with Bayesian Statistics in PYMC3!

So in the last two blog posts, I talked about how to do some bayesian inference in the realm of some linear models. And that is all well and good, but we can have even more fun in a bayesian framework. That’s because we are just sampling things from probability distributions. In this post I want to look at using PYMC3 to build a model to assess the strength of different NCAA basketball teams. Where we are going with this. Probably, not by the end of this blog post is a game simulator. Once this is built up, we will be able to use our game simulator to generate a probability that a given team will win a match up against another team. But this is going to be a lot of work. So I’m going to break it up over a few posts.

I know, I know. March Madness has been over for several months. The point is to get prepared for next year. And it would be interesting to take a blank bracket and to run my simulations over it to see how well it would pick up on upsets, and how it just does in general. Yay for historical data!

In this post, we’re going to be getting the broad strokes of the model together using just the final score of the games, and then in a future post, we’ll use the framework from this post to fill in some details for the model. That way, the model is using all of the wonderful stats that sports will throw off about a particular game. in yet another post we’ll use it to evaluate the 2017 march madness brackets. Since that has already happened, we can see how well our model does in predicting the final outcome. And then I will have a ready made model for filling out my bracket next year, which you better believe will become a post sometime in late February or early March!

By and large, the main model was inspired by a couple of blog posts that I have found on hierarchal bayesian modelling. Which focuses on Soccer, and Rugby. You can find them here and here. As I was reading them, I realized that I could use some of their ideas to build my algorithm for predicting March Madness next year.

So the first things first, where am I getting my data? I will be using data from Kaggle’s March Madness Competition 2017. This is one of the hardest problems out there and everyone has their own approach. As far as I know, no one has produced a perfect bracket, ever! That’s why I’m building it now, so that I’m not scrambling to come up with an algorithm next year. So if you want to follow along go right ahead.

I’ve Got Data, Now Let’s Clean It Up!

I’m not super thrilled about the way that this data is set up. Instead of cleaning it up, and providing you with a pristeen dataset (in my opion at least, other modelling choices demand a different structure), I wanted to show you how to munge this data for the analysis at hand, mostly, because this is what a data scientist does most of the time, modifying data to fit the analysis they need to do. It is the most important, and most time consuming part of data science (aside from collecting the data to begin with of course), that tends to be glossed over. So we’ll be doing some basic cleaning operations in this post to start out. We’ll start by importing pandas and pymc3 and any other libraries we think might end up being useful. Then we load the data.

import pandas as pd
import pymc3
import numpy as np
import theano.tensor as tt
import matplotlib.pyplot as plt



This get’s the data that we want into python, but it gets a whole bunch of other stuff that we don’t really care about too! Namely, I feel like although there is information in previous seasons, what really matters is this season. Okay, I know that I should incorporate older season’s data because coaches don’t change from year to year (very often), and players are on a 4-year cycle. And I’m sure there is some autocorrelation there that can be exploited, but remember, this post is broad strokes super simple. So let’s drop everything except the current season.

df=df[df['Season']==2017]


That leaves us with an astonishing 5395 records. Each corresponds to exactly 1 game. I had no idea there were this many. And each game is roughly 2 hours in duration (if you count timeouts and half-times, etc.), give or take. That means that every year there is an astonishing 10,790 hours of basketball played. Which equates to 450 days of basketball. There is so much basketball in a single year that you literally couldn’t watch all of the basketball in a year! Anyway, I digress.

The big problem is not which season you are looking at, but the fact that scores are separated into winning and losing teams. This is done, because the teams can play on neutral courts, but it is murder for the way I want to run the algorithm. So we need to morph the data into home and away team scores. If they are on a neutral court, we will assume the winning team is the home team. I hate to do that, but I have to make a decision on what to do with that situation and it seems like the least bad option. I will explain why in a second. So we want to run some code that will clean up this data.

We convert this dataset into Home vs. Away Teams like this:

df['Home Score'] = [obj2 if obj1 == 'H' or obj1 == 'N' else obj3 for obj1,obj2,obj3 in zip(df['Wloc'],df['Wscore'],df['Lscore'])]
df['Away Score'] = [obj3 if obj1 != 'H' or obj1 != 'N' else obj2 for obj1,obj2,obj3 in zip(df['Wloc'],df['Wscore'],df['Lscore'])]
df['Home Team'] = [obj2 if obj1 == 'H' or obj1 == 'N' else obj3 for obj1,obj2,obj3 in zip(df['Wloc'],df['Wteam'],df['Lteam'])]
df['Away Team'] = [obj3 if obj1 != 'H' or obj1 != 'N' else obj2 for obj1,obj2,obj3 in zip(df['Wloc'],df['Wteam'],df['Lteam'])]


Ahhh! Now we’re starting to get there, at least now we’re not organized by which team won the game. What’s nice is that we can now get whether or not the game was played on a home court. So that we can control for home court advantages, which will be useful for the tournament which is typically played on a court that is neutral. That is why I think this is the least bad option, we can simply turn off the home court advantage parameter and thus control for that down stream.

At this point, we’ll reindex the team numbers. This isn’t strictly necessary. I’m doing it for two reasons. First, I’m following pretty closely the structure of the blog posts that inspired me. And I don’t want to increase my cognitive burden while developing the model, (I can code, but it wasn’t what I was trained in primarily). Second, it gives me a more convenient index to work with than what is in the original data comes in. The burden associated with that is that if I want to link to team names, I will have an extra step. Oh well, it’s all good, because that extra burden comes with slightly more readable code, which is always nice.

teams = df['Home Team'].unique()
teams = pd.DataFrame(teams, columns=['team'])
teams['i'] = teams.index

df = pd.merge(df, teams, left_on='Home Team', right_on='team', how='left')
df = df.rename(columns = {'i': 'i_home'}).drop('team', 1)
df = pd.merge(df, teams, left_on='Away Team', right_on='team', how='left')
df = df.rename(columns = {'i': 'i_away'}).drop('team', 1)


Next we’ll want to create some helper variables so that the code is a little bit more readable. Again this isn’t necessary. But as I was going through the other blogs, I found this helpful when we get to the modelling code to make it readable, so I’m following suit and doing the same thing for your sake. Just realize that everything that I am doing in this code block and the last are not strictly necessary, but they do make the model clearer and more readable, I think.

observed_home_points = df['Home Score'].values
observed_away_points = df['Away Score'].values

advantage = np.array([1 if obj=='H' or obj=='A' else 0 for obj in df['Wloc']],dtype='int64')

home_team = df.i_home.values
away_team = df.i_away.values

num_teams = len(df.i_home.drop_duplicates())
num_games = len(home_team)


Enough Data Munging, Let’s Build a Model

So the model that we’re going to be playing with is a hierachal bayesian model. These models are cool, because they allow data to flow from the end of your model through as many latent variables that you want. This is similar to the way that we calculated WTP in bayesian logistic regression, but it is much better. Instead, of just doing a calculation for a variable that does not affect the model at all, this one allows us to build latent variables directly into the model that effect the outcome.

Since these two blog posts that cover the math and the logic behind this type of model, I thought I would start by giving you the code, and then sort of explaining the reasoning behind the model, without going too deeply into the weeds on the math.

model = pm.Model()
with pm.Model() as model:
# global model parameters
home = pm.Flat('home')
sd_att = pm.HalfStudentT('sd_att', nu=3, sd=2.5)
sd_def = pm.HalfStudentT('sd_def', nu=3, sd=2.5)
intercept = pm.Flat('intercept')

# team-specific model parameters
offs_star = pm.Normal("offs_star", mu=0, sd=sd_att, shape=num_teams)
defs_star = pm.Normal("defs_star", mu=0, sd=sd_def, shape=num_teams)
offs = pm.Deterministic('offs', offs_star - tt.mean(offs_star))
defs = pm.Deterministic('defs', defs_star - tt.mean(defs_star))

# derive the scoring intensity for a game
home_theta = tt.exp(intercept + home*advantage + offs[home_team] + defs[away_team])
away_theta = tt.exp(intercept + offs[away_team] + defs[home_team])

# likelihood of observed data
home_points = pm.Poisson('home_points', mu=home_theta, observed=observed_home_points)
away_points = pm.Poisson('away_points', mu=away_theta, observed=observed_away_points)


Alright let’s go through the model. We start off with what I think of as the model primitives. Specifically, these parameters, are what everything else builds off of. You can think of them as the foundation that the model rests on. These include standard deviations for the normal distributions that are coming down the line, the model intercept and the effect that playing on your home court has.

The next step up the chain is the offensive and defensive strength of every team. These parameters are latent, and take information from the model primitives in the form of the standard deviation. We do something a little funny, we let the posterior on these go where they want to go, but in order to do that we have a second one that is recentered around the group average. My thought on this is that it will help to alleviate the shrinkage on the model, and give us a more accurate version of how relatively strong one team is over another.

The next step up the model takes the offensive and defensive strength of the different teams and uses that on a game by game level. This let’s us build a parameter for the home team and the away team that gives us the scoring intensity for that team in that game. If a team has a really high defensive strength, it will sap any mojo that a team has. Likewise if they are really good offensively, they will get a boost as well. Finally, the home team gets a boost from being the home team, which we turn off if the game was played on a neutral court, via the advantage dummy variable that we created beforehand.

The scores for both teams are modelled as coming from individual poisson distributions, with their parameters set to the scoring intensities. And the model is thus calibrated using the observed scores. And that is our basic model. The big difference in my model from the other two blog posts is that I can turn the home court advantage on and off.

Let’s run the model with this code:

with model:
trace = pm.sample(1000, tune=1000)


Again when you do this go find something else to do, unless you are running this on a GPU, because it is going to take some time. On my old laptop (read ancient laptop built in 2009) this took about 15 minutes to run.

When it does finish running for you, you can view the model parameters with the following code:

pm.traceplot(trace)
plt.show()


That will produce the following result:

This image shows the posterior distribution for all of the model parameters. Notice that the offensive strength, and defensive strength parameters look like spaghetti. That’s because of how many teams there are. But you can see that there are a few stand out teams. Remember that lower is better for defensive and higher is better for offensive. We could have corrected this in the model above by tweaking the scoring intensities to subtract the defensive score instead of adding to the scoring intensity but at the end of the day you’d get to the same place. There are definitely teams that are stronger than others. In a future blog post we’ll look into modifying these scoring intensities, or rather augmenting them with more data.

We can also view the scoring intensities in a more readable way so that we can easily compare the individual teams. We do that with a forest plot. Here’s the code:

pm.forestplot(trace, varnames=['atts'], main="Team Offense")
plt.show()
pm.forestplot(trace, varnames=['defs'], main="Team Defense")
plt.show()


And here’s what it looks like:

Again, this has the same information that you saw in the traceplot above. It just breaks it down so that you aren’t trying to interpret a bowl of spaghetti. It just let’s you digest the information a little bit better.

I’ve got a Posterior Distribution, How Does This Help With Filling Out a Bracket?

So I have a feeling that this is going to be the subject of an entire blog post by itself, but I would feel like something was off, if I didn’t at least include a blurb on how this model was useful for filling out a bracket. Since I have an entire set of probability distributions, I can use them to simulate the outcomes of match ups, even if I have never seen the teams play before!

The basic idea is to draw parameter values from the posterior and use that to simulate the outcome of a game. We can then take as many draws as we want from the posterior distribution and average over them to get an idea of how likely a given outcome will be. We can even counterfactually see what would happen if we tweak some model parameters, like say, whether or not the meet up on team A’s, team B’s, or a neutral court.

Here’s some code to do just that, I pulled two teams at random via a random number generator at random.org, they happened to be team 3 and team 172:

teamA_home_wins_=[]
teamB_home_wins_=[]
neutral_A_wins_=[]
for i in range(1000):
draw=np.random.randint(0,1000)
home_=trace['home'][draw]
intercept_=trace['intercept'][draw]
atts_=trace['atts'][draw]
defs_=trace['defs'][draw]
home_theta_=np.exp(intercept_+home_+atts_[3]+defs_[172])
away_theta_=np.exp(intercept_+atts_[172]+defs_[3])
home_scores_=np.random.poisson(home_theta_,1)
away_scores_=np.random.poisson(away_theta_,1)

teamA_home_wins=[1 if obj>0 else 0 for obj in home_scores_-away_scores_]
teamA_home_wins_.append(np.mean(teamA_home_wins))

home_theta_=np.exp(intercept_+home_+atts_[172]+defs_[3])
away_theta_=np.exp(intercept_+atts_[3]+defs_[172])
home_scores_=np.random.poisson(home_theta_,1)
away_scores_=np.random.poisson(away_theta_,1)

teamB_home_wins=[1 if obj>0 else 0 for obj in home_scores_-away_scores_]
teamB_home_wins_.append(np.mean(teamB_home_wins))

home_theta_=np.exp(intercept_+atts_[3]+defs_[172])
away_theta_=np.exp(intercept_+atts_[172]+defs_[3])
home_scores_=np.random.poisson(home_theta_,1)
away_scores_=np.random.poisson(away_theta_,1)

neutral_A_wins=[1 if obj>0 else 0 for obj in home_scores_-away_scores_]
neutral_A_wins_.append(np.mean(neutral_A_wins))

print('Probability Team 3 Wins at Home: {0}%'.format(100*np.mean(teamA_home_wins_)))
print('Probability Team 3 Wins Away: {0}%'.format(100-100*np.mean(teamB_home_wins_)))
print('Probability Team 3 Wins at Neutral: {0}%'.format(100*np.mean(neutral_A_wins_)))


Which gave me an output of:

Probability Team 3 Wins at Home: 48.4%
Probability Team 3 Wins Away: 18.89999999999999%
Probability Team 3 Wins at Neutral: 32.1%


It doesn’t look good for team 3. It looks like it would be pretty much a toss up if they play at home, but with a slight edge for team 172. If they played on 172’s court, they would only win about 1 in 5 times, and about 1 in 3 times on a neutral court. Your results may vary, because I didn’t set a random seed.

Here’s the full code:

import pandas as pd
import pymc3 as pm
import numpy as np
import theano.tensor as tt
import matplotlib.pyplot as plt
#%%

df=df[df['Season']==2017]

df['Home Score'] = [obj2 if obj1 == 'H' or obj1 == 'N' else obj3 for obj1,obj2,obj3 in zip(df['Wloc'],df['Wscore'],df['Lscore'])]
df['Away Score'] = [obj3 if obj1 != 'H' or obj1 != 'N' else obj2 for obj1,obj2,obj3 in zip(df['Wloc'],df['Wscore'],df['Lscore'])]
df['Home Team'] = [obj2 if obj1 == 'H' or obj1 == 'N' else obj3 for obj1,obj2,obj3 in zip(df['Wloc'],df['Wteam'],df['Lteam'])]
df['Away Team'] = [obj3 if obj1 != 'H' or obj1 != 'N' else obj2 for obj1,obj2,obj3 in zip(df['Wloc'],df['Wteam'],df['Lteam'])]

teams = df['Home Team'].unique()
teams = pd.DataFrame(teams, columns=['team'])
teams['i'] = teams.index

df = pd.merge(df, teams, left_on='Home Team', right_on='team', how='left')
df = df.rename(columns = {'i': 'i_home'}).drop('team', 1)
df = pd.merge(df, teams, left_on='Away Team', right_on='team', how='left')
df = df.rename(columns = {'i': 'i_away'}).drop('team', 1)

observed_home_points = df['Home Score'].values
observed_away_points = df['Away Score'].values

advantage = np.array([1 if obj=='H' or obj=='A' else 0 for obj in df['Wloc']],dtype='int64')

home_team = df.i_home.values
away_team = df.i_away.values

num_teams = len(df.i_home.drop_duplicates())
num_games = len(home_team)

model = pm.Model()
with pm.Model() as model:
# global model parameters
home = pm.Flat('home')
sd_att = pm.HalfStudentT('sd_att', nu=3, sd=2.5)
sd_def = pm.HalfStudentT('sd_def', nu=3, sd=2.5)
intercept = pm.Flat('intercept')

# team-specific model parameters
atts_star = pm.Normal("atts_star", mu=0, sd=sd_att, shape=num_teams)
defs_star = pm.Normal("defs_star", mu=0, sd=sd_def, shape=num_teams)

atts = pm.Deterministic('atts', atts_star - tt.mean(atts_star))
defs = pm.Deterministic('defs', defs_star - tt.mean(defs_star))
home_theta = tt.exp(intercept + home + atts[home_team] + defs[away_team])
away_theta = tt.exp(intercept + atts[away_team] + defs[home_team])

# likelihood of observed data
home_points = pm.Poisson('home_points', mu=home_theta, observed=observed_home_points)
away_points = pm.Poisson('away_points', mu=away_theta, observed=observed_away_points)

with model:
trace = pm.sample(1000, tune=1000)
#%%
pm.traceplot(trace)
plt.show()
#%%
pm.forestplot(trace, varnames=['atts'], main="Team Offense")
plt.show()
pm.forestplot(trace, varnames=['defs'], main="Team Defense")
plt.show()

#%%
#Simulate a game
teamA_home_wins_=[]
teamB_home_wins_=[]
neutral_A_wins_=[]
for i in range(1000):
draw=np.random.randint(0,1000)
home_=trace['home'][draw]
intercept_=trace['intercept'][draw]
atts_=trace['atts'][draw]
defs_=trace['defs'][draw]
home_theta_=np.exp(intercept_+home_+atts_[3]+defs_[172])
away_theta_=np.exp(intercept_+atts_[172]+defs_[3])
home_scores_=np.random.poisson(home_theta_,1)
away_scores_=np.random.poisson(away_theta_,1)

teamA_home_wins=[1 if obj>0 else 0 for obj in home_scores_-away_scores_]
teamA_home_wins_.append(np.mean(teamA_home_wins))

home_theta_=np.exp(intercept_+home_+atts_[172]+defs_[3])
away_theta_=np.exp(intercept_+atts_[3]+defs_[172])
home_scores_=np.random.poisson(home_theta_,1)
away_scores_=np.random.poisson(away_theta_,1)

teamB_home_wins=[1 if obj>0 else 0 for obj in home_scores_-away_scores_]
teamB_home_wins_.append(np.mean(teamB_home_wins))

home_theta_=np.exp(intercept_+atts_[3]+defs_[172])
away_theta_=np.exp(intercept_+atts_[172]+defs_[3])
home_scores_=np.random.poisson(home_theta_,1)
away_scores_=np.random.poisson(away_theta_,1)

neutral_A_wins=[1 if obj>0 else 0 for obj in home_scores_-away_scores_]
neutral_A_wins_.append(np.mean(neutral_A_wins))

print('Probability Team 3 Wins at Home: {0}%'.format(100*np.mean(teamA_home_wins_)))
print('Probability Team 3 Wins Away: {0}%'.format(100-100*np.mean(teamB_home_wins_)))
print('Probability Team 3 Wins at Neutral: {0}%'.format(100*np.mean(neutral_A_wins_)))