Yep, we’re going there. Which football team is the best? Well, ask any fan and they are going to tell you that it is their personal favorite. But we are data scientists. Surely, we can do better!
And of course, we can.
That’s why in this post we will be learning how to use NetworkX to figure out which team is the best, from a purely scientific perspective. We won’t bias the outcome with our personal beliefs. Math won’t lie, and we will find out who is the best.
However, I will readily admit that I have a bias. There was one college football season that still makes me very upset. It was when I was a sophomore in college, attending the University of Utah. That year, Utah went undefeated, the only team to do so, and we got a solid national ranking: we ended in the number 2 position, right behind Florida. But we deserved so much more. If you asked any U of U fan at the time, the BCS robbed Utah. We weren’t even given a chance at the title.
I’m sure that you hear similar arguments every year about team after team. So the question becomes, can we prove that our team deserved to be number 1?
In this post, we’ll be looking at some graph theory via the NetworkX interface to see if we can prove it. We’re doing stone-cold data science to either confirm or reject my personal bias.
Step 1: let’s import the libraries that we know we’ll need for this analysis.
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
%matplotlib inline
Getting the Data
That takes care of that. Now we’ll need some data. I scraped some data from Wikipedia and cleaned it up a little bit (manually, in Excel). At the time, I wasn’t thinking about sharing it. Now that I am writing this blog post, I think that I should have done the cleaning in Python. Because of this oversight, I am providing the data that I am using in my GitHub repo instead of going over the scraping and cleaning process. My apologies for that.
So let’s load this data into a pandas dataframe, and take a look at it.
df = pd.read_csv('./ncaa2008.csv', usecols=['Winner', 'Loser'], encoding='iso8859_15')
print(df)
Which will give you this output:
     Winner                Loser
0    Ball State            Northeastern
1    Buffalo               Texas-El Paso
2    Central Michigan      Eastern Illinois
3    Cincinnati            Eastern Kentucky
4    Connecticut           Hofstra
5    Eastern Michigan      Indiana State
6    Georgia Tech          Jacksonville State
7    Iowa State            South Dakota State
8    Miami FL              Charleston Southern
9    South Carolina        North Carolina State
10   Stanford              Oregon State
11   Troy                  Middle Tennessee State
12   Vanderbilt            Miami OH
13   Wake Forest           Baylor
14   Rice                  Southern Methodist
15   Temple                Army
16   Air Force             Southern Utah
17   Alabama               Clemson
18   Arizona               Idaho
19   Arizona State         Northern Arizona
20   Arkansas              Western Illinois
21   Arkansas State        Texas A&M
22   Auburn                Louisiana-Monroe
23   Boise State           Idaho State
24   Boston College        Kent State
25   Bowling Green State   Pittsburgh
26   Brigham Young         Northern Iowa
27   Cal Poly              San Diego State
28   California            Michigan State
29   Central Florida       South Carolina State
..   ...                   ...
775  Southern Mississippi  Troy
776  Texas Christian       Boise State
777  Notre Dame            Hawaii
778  Florida Atlantic      Central Michigan
779  California            Miami FL
780  Florida State         Wisconsin
781  West Virginia         North Carolina
782  Louisiana Tech        Northern Illinois
783  Missouri              Northwestern
784  Rutgers               North Carolina State
785  Maryland              Nevada
786  Oregon                Oklahoma State
787  Rice                  Western Michigan
788  Houston               Air Force
789  Kansas                Minnesota
790  Louisiana State       Georgia Tech
791  Oregon State          Pittsburgh
792  Vanderbilt            Boston College
793  Georgia               Michigan State
794  Iowa                  South Carolina
795  Nebraska              Clemson
796  Southern California   Penn State
797  Virginia Tech         Cincinnati
798  Kentucky              East Carolina
799  Mississippi           Texas Tech
800  Utah                  Alabama
801  Connecticut           Buffalo
802  Texas                 Ohio State
803  Tulsa                 Ball State
804  Florida               Oklahoma

[805 rows x 2 columns]
Note that you will not see Utah in the Loser column. No, not a single time. They were really good this season, remember that! Anyway, this data is actually pretty simple. Each row represents a game that was played, and all that we record is who won and who lost.
Starting with NetworkX
So where are we going with this data? We need to start analyzing it with NetworkX, of course! We are going to use this data as an edge list and import it into NetworkX to create a graph object.
G = nx.from_pandas_edgelist(df,source='Loser',target='Winner',create_using=nx.DiGraph())
That may have sounded complicated, but with NetworkX’s helper function for reading an edge list out of a pandas dataframe, it was really a painless process. You will notice that we wanted a directed graph, so we told NetworkX to use its DiGraph() object. And since the graph is directed, we need to decide who “points” to whom. I set this up so that the loser points to the winner.
Conceptually, I like the idea that the loser points towards the winner. In effect, we are asking each team, “Who is better than you?” Each team points to the teams that beat it, and those teams in turn point to the teams that beat them. Essentially, we should be able to follow these routes to the greatest team in the NCAA for this season. In fact, this is essentially the idea behind PageRank, the algorithm Google introduced to decide how important a web page is before showing you search results.
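To see what the loser-points-to-winner convention buys us, here is a minimal sketch on a tiny made-up schedule. The frame `toy` and the team names are hypothetical and are not rows from ncaa2008.csv; only the `from_pandas_edgelist` call mirrors the one above.

```python
import pandas as pd
import networkx as nx

# Hypothetical results for illustration only (not taken from ncaa2008.csv).
toy = pd.DataFrame({
    "Winner": ["Team A", "Team A", "Team B", "Team C"],
    "Loser":  ["Team B", "Team C", "Team C", "Team D"],
})

# Same call shape as in the post: each edge runs from the loser to the winner.
toy_graph = nx.from_pandas_edgelist(
    toy, source="Loser", target="Winner", create_using=nx.DiGraph()
)

# Out-degree = number of distinct teams you lost to.
# In-degree  = number of distinct teams you beat.
losses = dict(toy_graph.out_degree())
wins = dict(toy_graph.in_degree())

# A team with no outgoing edges never lost a game.
undefeated = sorted(team for team, n in losses.items() if n == 0)
print(undefeated)
```

One caveat with this setup: a DiGraph stores at most one edge per ordered pair, so if the same team beats the same opponent twice in a season, the rematch collapses into a single edge.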
Wait a gosh darn minute, are you saying that this problem has already been solved really well by one of the most prolific companies in the world?
Yep. And we can just use their algorithm to decide which team is truly the best team from this season.
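To make the idea concrete, here is a simplified sketch of the PageRank iteration in plain Python. It is not NetworkX's implementation, just the textbook power iteration under a few assumptions: a damping factor of 0.85, a fixed number of iterations, and undefeated teams (which have no outgoing edges) spreading their score evenly over everyone. The function name `pagerank_sketch` and the toy results are hypothetical.

```python
def pagerank_sketch(edges, damping=0.85, iterations=100):
    """Simplified PageRank over (loser, winner) pairs. Illustrative only."""
    unique_edges = set(edges)  # treat rematches as a single edge
    teams = sorted({team for edge in unique_edges for team in edge})
    beaten_by = {team: [] for team in teams}  # loser -> list of winners
    for loser, winner in unique_edges:
        beaten_by[loser].append(winner)

    n = len(teams)
    rank = {team: 1.0 / n for team in teams}
    for _ in range(iterations):
        # Score held by undefeated teams is shared equally by all teams
        # (an assumption of this sketch).
        dangling = sum(rank[t] for t in teams if not beaten_by[t])
        new_rank = {t: (1 - damping) / n + damping * dangling / n for t in teams}
        # Every team splits its score equally among the teams that beat it.
        for loser in teams:
            winners = beaten_by[loser]
            for winner in winners:
                new_rank[winner] += damping * rank[loser] / len(winners)
        rank = new_rank
    return rank


# Hypothetical (loser, winner) results: A is undefeated, D never wins.
toy_edges = [("B", "A"), ("C", "A"), ("C", "B"), ("D", "C")]
toy_scores = pagerank_sketch(toy_edges)
best_team = max(toy_scores, key=toy_scores.get)
print(best_team)
```

Notice that a team with a single loss hands nearly all of its passed-on score to the one team that beat it, so one big upset can move a team a long way up this kind of ranking.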
Before we get there, aren’t you at least a little bit curious as to what this network looks like, with each team pointing to at most 16 other teams? I know that I want to know what my network looks like.
Drawing the network gives the following graph:
So in this layout, the closer to the center you are, the stronger your team is. I love that we can sort of make out conferences as the major clumps in the network. It would be fun to add labels to this graph, but I felt like that would just make it even messier than it already is. As it stands, it does look like a bowl of spaghetti.
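The plotting code is not reproduced here. As a minimal sketch of how such a picture could be drawn, the block below assumes a spring layout (the layout actually used is not stated) and a small hypothetical graph named `toy_graph` standing in for the full one. The `Agg` backend line is also an assumption, there only so the sketch runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; not needed with %matplotlib inline
import matplotlib.pyplot as plt
import networkx as nx

# Hypothetical (loser, winner) pairs standing in for the real edge list.
toy_graph = nx.DiGraph()
toy_graph.add_edges_from([
    ("Team B", "Team A"),
    ("Team C", "Team A"),
    ("Team C", "Team B"),
    ("Team D", "Team C"),
])

# Force-directed layout; the seed only makes the picture reproducible.
pos = nx.spring_layout(toy_graph, seed=42)

fig, ax = plt.subplots(figsize=(6, 6))
# Labels are switched off, as in the post, to keep the plot readable.
nx.draw_networkx(toy_graph, pos=pos, ax=ax, with_labels=False, node_size=60)
ax.set_axis_off()
```

Passing the full graph `G` in place of `toy_graph` would plot every team in the data.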
Determining Which Team is The Best
So let’s have Google’s algorithm tell me how good my Utah team really is.
sorted(nx.pagerank(G).items(), key=lambda kv: kv[1], reverse=True)
Let’s break this code down before I show you the results. First of all, we are getting the PageRank (Google’s algorithm for determining the greatness of a web page) of every team. We then sort the teams by that score in descending order, since a higher PageRank means a greater team, which gives us each team’s name along with its score. In my GitHub repo, I try a number of different algorithms. Here on the blog, I’m just going to use PageRank, since it is probably the best fit for this problem given how closely it mirrors the web-page ranking problem that Google solved so well.
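The sort-then-trim step can be sketched on its own with a small dictionary of made-up scores. The helper `top_n` and the numbers in `toy_scores` are hypothetical; the point is that the sort key has to be the score (the second item of each pair), otherwise the pairs are ordered by team name.

```python
# Hypothetical scores for illustration; not real PageRank output.
toy_scores = {"Team A": 0.41, "Team B": 0.22, "Team C": 0.24, "Team D": 0.13}


def top_n(scores, n):
    """Return the n highest-scoring (team, score) pairs, best first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


top_two = top_n(toy_scores, 2)
print(top_two)  # [('Team A', 0.41), ('Team C', 0.24)]

# Sorting on the whole pair compares team names first, so the result
# is reverse-alphabetical rather than best-first.
by_name = sorted(toy_scores.items(), key=lambda kv: kv, reverse=True)
print(by_name[0])  # ('Team D', 0.13)
```

With the real scores, `top_n(nx.pagerank(G), 25)` would give just the teams that should be ranked.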
That’s it. Now on to the results. Due to the fact that I have like a bazillion teams, I’m only printing out the top 25, essentially, the teams that should be ranked.
[('Mississippi', 0.05172757043959218),
 ('Florida', 0.04084328159842189),
 ('Oklahoma', 0.027284759730861738),
 ('Texas Tech', 0.026434005259661438),
 ('Utah', 0.02576817646552803),
 ('Wake Forest', 0.02345857061925446),
 ('Oregon State', 0.023264366225787057),
 ('Alabama', 0.0228803383725877),
 ('Texas', 0.02273737788275593),
 ('Vanderbilt', 0.02033229498781814),
 ('Virginia Tech', 0.01984625073202384),
 ('Southern California', 0.018436191424897896),
 ('Boston College', 0.01809162315826114),
 ('Georgia Tech', 0.017080006946476094),
 ('South Carolina', 0.01678613947150378),
 ('Florida State', 0.014641678798074867),
 ('North Carolina', 0.014432727743850961),
 ('Maryland', 0.014030885365276724),
 ('Texas Christian', 0.013625117440031892),
 ('Georgia', 0.013286504646774114),
 ('Penn State', 0.01299169133011859),
 ('North Carolina State', 0.012554064609177313),
 ('Miami FL', 0.01222998408958102),
 ('Virginia', 0.01204515708259813),
 ('Iowa', 0.011799965588289696),
 ... ]
We can see here that my number 2 ranked Utah was actually given quite a generous ranking. It looks like they should have ended the season at number 5, not number 2, so they got more than they probably deserved. The real surprise is that Mississippi totally got robbed that year. The thing is that this method accounts for the strength of your schedule, so it looks like Mississippi had a really strong schedule and performed really well against it. And when they lost, they lost to really, really strong teams. So they should have had the title this season.
What do you think? Is my method for ranking college football teams more or less accurate than the BCS? Let me know in the comments section below.