So I’m writing this little post about Empirical Bayes and how you can use it to build risk scores in python. This is a topic that is very useful for a number of disciplines. I tend to use it to look for abnormalities with incentive programs at a financial institution, but you can apply it anywhere that you are measuring percentages over a number of groups of objects. The classical example is in baseball statistics, where you want to see if someone had a higher batting average, but you are skewed towards people that have had few at bats. The idea is the same, no matter what the field that you are looking at is. So, if you have people attempting different amounts, you will have this problem.

Empirical Bayes is just a correction for people/entities that haven’t had as many opportunities for success. Here’s an example, let’s say you are trying to monitor employee cold call success rates to figure out which employee is the best cold caller. So you monitor the number of phone calls made, and the number of sales made. Each employee will have a certain percentage. Let’s say the first employee has a success rate of 100% but only made 1 call. The second employee made 26 sales on 50 calls for a success rate of 52%. The third one has made 1000 calls resulting in 519 sales obviously for 51.9%. The question is, which is your most successful employee by percentage. Basically what we are going to do is figure out what the group average is, and bias you towards that group average if you haven’t had a lot of attempts.

Our intuition suggests that employee 1 may have gotten lucky. Employee 2 is probably alright, and employee 3 clearly stands out given the raw numbers, though the true percentages would rank them in the reverse order.

The Math Behind Empirical Bayes

The math is quite simple actually. It doesn’t even require a very deep understanding of probability theory. However, for the nerds out there, I am going to be doing this mostly wrong. The right way to do this is to estimate parameters based on the data using something like the maximum likelihood of a beta distribution. I’m not going to do that in this post. You can read about how to do maximum likelihood in this post. And if you want a more in depth explanation of why the beta distribution is so important in Empirical Bayes, and why you would want to do use it you can read about that here. I’m going to bastardize the math a little bit here for an answer that is technically, close enough, and much faster to code up.

The beta distribution requires two parameters. We are going to estimate them by simply taking two averages. The first average is the average number of successes that were achieved per person. This average is going to be our alpha parameter. The next average is going to be the number of attempts per person. This will be our beta number. Note that this is technically wrong, on so many levels, but again, we are going for a down and dirty “correction” not the right answer.

Now that we have those two averages we just take the same numbers we used to compute them, and use the following formula:

{x+\alpha}/{y+\alpha+\beta}[\latex]

This is how I computed the example above. I think that this gives about the right answer. But it isn’t going to be the right answer. For one, we just estimated based off of what would give an approximately right answer. Again, you will want to compute alpha and beta using maximum likelihood, or something else like method of moments. My method is close-ish to method of moments so we’ll run with it for now. Technically, I should have used mean and variance of the percentages and then solved a system of equations to back out alpha and beta, but nobody ain’t got time for that! A neat hack that will actually get you closer to the right distribution is to simply subtract the average number of successes from the average number of attempts and call that beta.

Some Python Code

It turns out that this is fairly simple. First import the things that we need, which is just pandas and our data:

import pandas

df = pd.read_csv('path_to_your_data')

Now we just need to compute some basic statistics to get our estimated alpha and beta parameters.

a = df['Successes'].mean()
b = df['Attempts'].mean()-a

With an estimate for our parameters, we can compute a biased percentage like this:

df['eb_pct'] = (df['Successes']+a)/(df['Attempts']+a+b)

And that is all that there really is to it.

Bonus: Risk Scoring with Empirical Bayes

With one final bit of python, you can compute a risk score for each person.

from scipy.stats import beta
df['score'] = 100*beta.cdf(df['eb_pct'],a,b)

With that in place, you now have a risk score for each individual that ranges from 0 to 100. Isn’t that nice! Basically, we are just asking what percentage of individuals would we expect to have a lower score than this individual. Which in my world of fraud detection, comes in kind of handy.