Using Networkx in Python to Find the Best Team Via a Directed Graph

Yep, we’re going there. Which football team is the best? Well, ask any fan and they are going to tell you that it is their personal favorite. But we are data scientists. Surely, we can do better!

And of course, we can.

That’s why in this post we will be learning how to use Networkx to figure out which team is the best, from a purely scientific perspective. We won’t bias the outcome with our personal beliefs. Math won’t lie and we will find out who is the best.

However, I will readily admit that I have a bias. One college football season still makes me very upset. It was my sophomore year at the University of Utah. That year, Utah went undefeated, the only team to do so, and we got a solid national ranking: we ended in the number 2 position, right behind Florida. But didn’t we deserve so much more? If you asked any U of U fan at the time, the BCS robbed Utah. We weren’t even given a chance at the title.

I’m sure that you hear similar arguments every year about team after team. So the question becomes, can we prove that our team deserved to be number 1?

In this post, we’ll be looking at some graph theory via the Networkx interface to see if we can prove it. We’re doing stone cold data science to either confirm or reject my personal bias.

Step 1, let’s import some libraries that we know we’ll need to do this analysis.

import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
%matplotlib inline

Getting the Data

That takes care of that. Now we’ll need some data. I scraped some data from Wikipedia and cleaned it up a little bit (manually, in Excel). At the time, I wasn’t thinking about sharing it. Now that I am writing this blog post, I wish I had done the cleaning in Python. Because of that oversight, I am providing the cleaned data in my GitHub repo instead of going over the scraping and cleaning process. My apologies for that.

So let’s load this data into a pandas dataframe, and take a look at it.

df = pd.read_csv('./ncaa2008.csv', usecols=['Winner', 'Loser'], encoding='iso8859_15')
df

Which will give you this output:

                   Winner                   Loser
0              Ball State            Northeastern
1                 Buffalo           Texas-El Paso
2        Central Michigan        Eastern Illinois
3              Cincinnati        Eastern Kentucky
4             Connecticut                 Hofstra
5        Eastern Michigan           Indiana State
6            Georgia Tech      Jacksonville State
7              Iowa State      South Dakota State
8                Miami FL     Charleston Southern
9          South Carolina    North Carolina State
10               Stanford            Oregon State
11                   Troy  Middle Tennessee State
12             Vanderbilt                Miami OH
13            Wake Forest                  Baylor
14                   Rice      Southern Methodist
15                 Temple                    Army
16              Air Force           Southern Utah
17                Alabama                 Clemson
18                Arizona                   Idaho
19          Arizona State        Northern Arizona
20               Arkansas        Western Illinois
21         Arkansas State               Texas A&M
22                 Auburn        Louisiana-Monroe
23            Boise State             Idaho State
24         Boston College              Kent State
25    Bowling Green State              Pittsburgh
26          Brigham Young           Northern Iowa
27               Cal Poly         San Diego State
28             California          Michigan State
29        Central Florida    South Carolina State
..                    ...                     ...
775  Southern Mississippi                    Troy
776       Texas Christian             Boise State
777            Notre Dame                  Hawaii
778      Florida Atlantic        Central Michigan
779            California                Miami FL
780         Florida State               Wisconsin
781         West Virginia          North Carolina
782        Louisiana Tech       Northern Illinois
783              Missouri            Northwestern
784               Rutgers    North Carolina State
785              Maryland                  Nevada
786                Oregon          Oklahoma State
787                  Rice        Western Michigan
788               Houston               Air Force
789                Kansas               Minnesota
790       Louisiana State            Georgia Tech
791          Oregon State              Pittsburgh
792            Vanderbilt          Boston College
793               Georgia          Michigan State
794                  Iowa          South Carolina
795              Nebraska                 Clemson
796   Southern California              Penn State
797         Virginia Tech              Cincinnati
798              Kentucky           East Carolina
799           Mississippi              Texas Tech
800                  Utah                 Alabama
801           Connecticut                 Buffalo
802                 Texas              Ohio State
803                 Tulsa              Ball State
804               Florida                Oklahoma

[805 rows x 2 columns]

Again, note that you will not see Utah in the Loser column. Not a single time. They were really good this season, remember that! Anyway, this data is actually pretty simple: each row represents one game that was played, and all we record is who won and who lost.
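If you would rather not take my word for it, a check like the following is a minimal sketch of how to confirm it. The sample dataframe is just a handful of rows copied from the table above (the names sample and undefeated are made up for this illustration); on the real data you would run the same lines against df.

```python
import pandas as pd

# A few rows copied from the table above; the real df has 805 rows.
sample = pd.DataFrame(
    {
        "Winner": ["Utah", "Alabama", "Florida", "Texas", "Nebraska"],
        "Loser": ["Alabama", "Clemson", "Oklahoma", "Ohio State", "Clemson"],
    }
)

# Teams that win at least once but never show up as a loser.
undefeated = sorted(set(sample["Winner"]) - set(sample["Loser"]))
print(undefeated)

# Direct check for a single team.
print("Utah" in set(sample["Loser"]))
```

On this tiny sample several teams look undefeated simply because their losses are not among the rows I copied, so the list only means something when you run it on the full dataframe.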

Starting with Networkx

So where are we going with this data? We need to start analyzing it with Networkx, of course! To do that, we are going to treat the dataframe as an edge list and import it into Networkx to create a graph object.

G = nx.from_pandas_edgelist(df,source='Loser',target='Winner',create_using=nx.DiGraph())

That may have sounded complicated, but Networkx’s helper functions for pulling data out of a pandas dataframe make it a painless process. You will notice that we wanted a directed graph, so we told Networkx to use its DiGraph() object. And since the graph is directed, we have to decide who “points” to whom. I set this up so that the loser points to the winner.
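To make the direction concrete, here is a small sketch using three results from the table above. The names games, toy, wins, and losses are just for this illustration.

```python
import networkx as nx
import pandas as pd

# Three games copied from the table above.
games = pd.DataFrame(
    {
        "Winner": ["Utah", "Alabama", "Nebraska"],
        "Loser": ["Alabama", "Clemson", "Clemson"],
    }
)

toy = nx.from_pandas_edgelist(
    games, source="Loser", target="Winner", create_using=nx.DiGraph()
)

# Every edge runs loser -> winner.
print(sorted(toy.edges()))

# In-degree: distinct opponents beaten. Out-degree: distinct opponents lost to.
wins = dict(toy.in_degree())
losses = dict(toy.out_degree())
print(wins)
print(losses)
```

One thing to keep in mind: a DiGraph stores at most one edge per ordered pair of teams, so if the same team beat the same opponent twice, the graph would only record it once.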

Conceptually, I like the idea that the loser points towards the winner. It is as if we ask each team, “Who is better than you?” and it points to the teams that beat it. Those teams, in turn, point to the teams that beat them. Essentially, we should be able to follow this route to the greatest team in the NCAA for this season. In fact, this turns out to be how Google decides how important a web page is before it shows the results to you: pages link to other pages, and a page that collects links from important pages is considered important itself.

Wait a gosh darn minute, are you saying that this problem has already been solved really well by one of the most prolific companies in the world?

Yep. And we can just use their algorithm to decide which team is truly the best team from this season.

Before we get there, aren’t you at least a little bit curious about what this network looks like, with each team pointing to at most 16 other teams? I know that I want to see what my network looks like.
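The drawing code isn’t reproduced in this post, so here is a minimal sketch of one way to draw a network like this. The spring layout, the seed, and the styling are all assumptions, not necessarily what produced the figure in this post, and toy is a small stand-in built from a few results in the table above.

```python
import matplotlib.pyplot as plt
import networkx as nx

# Toy stand-in for G: a few results from the table above, as loser -> winner edges.
toy = nx.DiGraph(
    [
        ("Alabama", "Utah"),
        ("Clemson", "Alabama"),
        ("Clemson", "Nebraska"),
        ("Oklahoma", "Florida"),
        ("Ohio State", "Texas"),
    ]
)

# Assumption: a force-directed (spring) layout. The seed only makes it repeatable.
pos = nx.spring_layout(toy, seed=42)

fig, ax = plt.subplots(figsize=(8, 8))
# Labels are left off because they get unreadable quickly on a large graph.
nx.draw_networkx(toy, pos=pos, ax=ax, with_labels=False, node_size=30, arrows=True)
ax.set_axis_off()
```

Passing G instead of toy would draw every team in the dataset.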


Here is what the full network looks like:

The network of all NCAA football teams

In this layout, the closer to the center you are, the stronger your team is. I love that we can sort of make out conferences as the major clumps in the network. It would be fun to add labels, but I felt that would make the graph even messier than it already is. As it stands, it already looks like a bowl of spaghetti.

Determining Which Team is The Best

So let’s have Google’s algorithm tell me how good my Utah team really is.

sorted(nx.pagerank(G).items(), key=lambda kv: kv[1], reverse=True)[:25]

Let’s break this code down before I show you the results. First of all, we compute the PageRank of every team (PageRank is Google’s algorithm for determining the greatness of a web page). nx.pagerank returns a dictionary that maps each team to its score, so we sort its items by score in descending order; a higher PageRank means you are greater. The result is a list of (team, score) pairs. On my GitHub page, I try a number of different algorithms. Here on the blog, I’m just going to use PageRank, since it is probably the best fit for this problem given how closely it mirrors the problem that Google built it to solve.
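If you are curious what is going on under the hood, here is a simplified sketch of PageRank as a plain power iteration. This is not the Networkx implementation; pagerank_sketch and the toy season are made up for illustration. The damping factor of 0.85 matches Networkx’s default alpha, and undefeated teams (which have no outgoing edges) spread their score evenly over everyone.

```python
def pagerank_sketch(edges, alpha=0.85, iterations=100):
    """Simplified PageRank by power iteration over (loser, winner) edges."""
    nodes = sorted({team for edge in edges for team in edge})
    beaten_by = {team: [] for team in nodes}  # team -> teams that beat it
    for loser, winner in edges:
        beaten_by[loser].append(winner)

    n = len(nodes)
    rank = {team: 1.0 / n for team in nodes}
    for _ in range(iterations):
        # Teams with no losses have no outgoing edges; share their score evenly.
        dangling = sum(rank[t] for t in nodes if not beaten_by[t])
        new_rank = {t: (1.0 - alpha) / n + alpha * dangling / n for t in nodes}
        for loser in nodes:
            for winner in beaten_by[loser]:
                # A loser splits its score equally among the teams that beat it.
                new_rank[winner] += alpha * rank[loser] / len(beaten_by[loser])
        rank = new_rank
    return rank


# Toy season: A and B both win twice, but one of A's wins is over B.
edges = [("B", "A"), ("C", "A"), ("C", "B"), ("D", "B")]
scores = pagerank_sketch(edges)
ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranking)
```

A and B have the same number of wins here, but A comes out on top because it collects B’s score. That is the sense in which beating a strong team counts for more than beating a weak one.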

That’s it. Now on to the results. Since I have like a bazillion teams, I’m only printing out the top 25: essentially, the teams that should be ranked.

[('Mississippi', 0.05172757043959218),
 ('Florida', 0.04084328159842189),
 ('Oklahoma', 0.027284759730861738),
 ('Texas Tech', 0.026434005259661438),
 ('Utah', 0.02576817646552803),
 ('Wake Forest', 0.02345857061925446),
 ('Oregon State', 0.023264366225787057),
 ('Alabama', 0.0228803383725877),
 ('Texas', 0.02273737788275593),
 ('Vanderbilt', 0.02033229498781814),
 ('Virginia Tech', 0.01984625073202384),
 ('Southern California', 0.018436191424897896),
 ('Boston College', 0.01809162315826114),
 ('Georgia Tech', 0.017080006946476094),
 ('South Carolina', 0.01678613947150378),
 ('Florida State', 0.014641678798074867),
 ('North Carolina', 0.014432727743850961),
 ('Maryland', 0.014030885365276724),
 ('Texas Christian', 0.013625117440031892),
 ('Georgia', 0.013286504646774114),
 ('Penn State', 0.01299169133011859),
 ('North Carolina State', 0.012554064609177313),
 ('Miami FL', 0.01222998408958102),
 ('Virginia', 0.01204515708259813),
 ('Iowa', 0.011799965588289696)]

We can see here that my number 2 ranked Utah was actually given quite a generous ranking. By this measure, they should have ended the season at number 5, not number 2, so they got more than they probably deserved. The real surprise is that Mississippi totally got robbed that year. This method accounts for the strength of your schedule, so it looks like Mississippi had a really strong schedule and performed really well against it. And when they lost, they lost to really strong teams. So, according to this ranking, they should have had the title this season.
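When a ranking surprises you like this, it helps to look at the raw record behind it. Here is a minimal sketch of how you might tally wins and losses and list who beat a given team. The sample rows are copied from the table near the top of the post, and record and lost_to are names I made up for this illustration; on the real data you would pass df instead of sample.

```python
import pandas as pd

# A few games copied from the table near the top of the post.
sample = pd.DataFrame(
    {
        "Winner": ["Utah", "Alabama", "Nebraska", "Mississippi", "Florida"],
        "Loser": ["Alabama", "Clemson", "Clemson", "Texas Tech", "Oklahoma"],
    }
)

# Wins and losses per team; a team missing from one column gets 0 there.
record = (
    pd.DataFrame(
        {
            "wins": sample["Winner"].value_counts(),
            "losses": sample["Loser"].value_counts(),
        }
    )
    .fillna(0)
    .astype(int)
)


def lost_to(team, games):
    """Return the teams that beat `team`, in the order the games appear."""
    return games.loc[games["Loser"] == team, "Winner"].tolist()


print(record.sort_values("wins", ascending=False))
print(lost_to("Clemson", sample))
```

Run against the full dataframe, lost_to('Mississippi', df) would show exactly which teams Mississippi lost to, so you can judge for yourself how strong those opponents were.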

What do you think? Is my method for ranking college football teams more or less accurate than the BCS? Let me know in the comments section below.