Anyone who has taken a slightly rigorous statistics class will have bumped into these probability distributions before. What I want to focus on is giving an overview of each of these probability distributions, when they are useful, and what you can do with them, with some real life examples. This post is meant to be a sort of cheat sheet reference, and not so much a tutorial. So be sure to bookmark this page; it is sure to be useful.
Bernoulli Distribution
This is probably the simplest probability distribution that you can think of. I’m not even using hyperbole here. I really, truly can’t think of a distribution that is simpler.
There are 2 outcomes, each with some probability of being observed. That is it.
This is the probability distribution that you use when you are talking about flipping a coin. There is a certain probability of heads or tails. For a fair coin, it is 50/50. For a biased coin it is anything, so long as the two probabilities add to 1. Another way of putting that is that heads comes up with probability p and tails with probability 1-p. That’s it.
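Here is a minimal sketch of the coin flip in Python, using only the standard library. The function names `bernoulli_pmf` and `flip_coin` are my own for illustration, and I am assuming heads is coded as 1 and tails as 0.

```python
import random


def bernoulli_pmf(outcome: int, p: float) -> float:
    """Probability of `outcome` (1 = heads, 0 = tails) when heads has probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if outcome == 1:
        return p
    if outcome == 0:
        return 1.0 - p
    return 0.0  # anything other than heads or tails cannot happen


def flip_coin(p: float, rng: random.Random) -> int:
    """Simulate one flip: 1 (heads) with probability p, otherwise 0 (tails)."""
    return 1 if rng.random() < p else 0


rng = random.Random(42)  # fixed seed so the run is repeatable
flips = [flip_coin(0.5, rng) for _ in range(10)]  # ten flips of a fair coin
print(flips)
print(bernoulli_pmf(1, 0.5), bernoulli_pmf(0, 0.5))
```

Swapping 0.5 for any other value between 0 and 1 gives you a biased coin.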
Categorical Distribution
The categorical distribution is another one of those distributions that we all carry around in our heads but don’t have a name for, until someone gives us one. The categorical distribution is just a straightforward generalization of the Bernoulli distribution to more than 2 possible outcomes. If you have 6 possible outcomes, then the categorical distribution describes the probability of rolling a die. Again, depending on the probabilities, you will have either a biased or a fair die.
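As a rough sketch, rolling a die is just a weighted draw from a list. The loaded die weights below are made up for illustration, and `roll` is a hypothetical helper built on the standard library's `random.Random.choices`.

```python
import random

faces = [1, 2, 3, 4, 5, 6]
fair_die = [1 / 6] * 6
loaded_die = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]  # hypothetical bias toward rolling a 6


def roll(weights, n, seed=0):
    """Draw n rolls from a categorical distribution over `faces`."""
    if len(weights) != len(faces):
        raise ValueError("need one probability per face")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("probabilities must add to 1")
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return rng.choices(faces, weights=weights, k=n)


fair_rolls = roll(fair_die, 10)
loaded_rolls = roll(loaded_die, 10)
print(fair_rolls)
print(loaded_rolls)
```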
In principle you can extend the categorical distribution to be as big as you want it to be. However, at some point other probability distributions are likelier to be easier to work with.
These two distributions are foundational, and many of the other distributions in this post can be built up from them. I won’t do that here, but the binomial, geometric, and negative binomial distributions, for example, all come from counting the outcomes of repeated Bernoulli trials. On their own these two aren’t very interesting, but I will say they do make good likelihood functions, and you see them in, say, logistic regression type models.
Binomial Distribution
The binomial distribution is what you get when you string a bunch of Bernoulli trials together and ask: what is the probability that I will see heads X times out of Y trials, where 0<=X<=Y? So when is this actually useful?
Suppose that you want to measure some performance aspect of an employee. For concreteness, let’s say that you have a customer service rep who is supposed to give some sort of disclaimer when people call in. You want to monitor the compliance of a single employee.
If you have some baseline probability for giving that disclaimer, say the rate among their peers, you can get a sense for whether this employee is performing better, worse, or on target in a sample of their phone calls. In fact, you can compute the probability of observing as many disclaimers as you did and use it as an objective measure of someone’s compliance.
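To make the compliance check concrete, here is a small sketch. The numbers are invented: I am assuming a baseline disclaimer rate of 0.9, a sample of 20 calls, and 15 disclaimers observed. `binomial_pmf` and `binomial_cdf` are illustrative helper names.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(exactly x successes in n independent trials with success probability p)."""
    if x < 0 or x > n:
        return 0.0
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binomial_cdf(x: int, n: int, p: float) -> float:
    """P(at most x successes in n trials)."""
    return sum(binomial_pmf(k, n, p) for k in range(0, x + 1))


baseline = 0.9  # assumed peer rate of giving the disclaimer
calls = 20      # assumed number of sampled calls
given = 15      # assumed number of calls where the disclaimer was given

p_exact = binomial_pmf(given, calls, baseline)
p_at_most = binomial_cdf(given, calls, baseline)
print(f"P(exactly {given} of {calls}) = {p_exact:.4f}")
print(f"P({given} or fewer of {calls}) = {p_at_most:.4f}")
```

The "or fewer" probability is usually the more useful one: if it is small, an employee who really complied at the baseline rate would rarely look this bad by chance.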
Geometric and Negative Binomial Distributions
I lump these two together because they are so very similar. They both deal with the number of trials that have to happen before you see a certain number of successes. The negative binomial distribution describes how many trials it takes to see a given number of successes, and the geometric distribution is the special case where you only need a single success.
As such, these distributions are really good for counting. How many do I need? Let’s look at an actual business case. You are looking for fraudulent accounts. If you randomly select accounts to audit, how long will it take you to find a fraudulent one?
This is an example of a geometric distribution. You can spit out the probability that you will find a fraud on the 2nd account you audit. Also, if you have a decent estimate of the probability of fraud in your dataset, 1 over that probability will tell you the average number of accounts that you will look at. So for example, if fraudulent accounts happen at a rate of 0.0002, or 1 in 5000, then you have to look at 5000 accounts on average to find 1 fraudulent account. Good luck with that.
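The fraud numbers above can be checked with a few lines. This is only a sketch, assuming each audited account is fraudulent independently with the same probability; the helper names are mine.

```python
def geometric_pmf(k: int, p: float) -> float:
    """P(the first success happens on trial k), for k = 1, 2, 3, ..."""
    if k < 1:
        return 0.0
    return (1 - p) ** (k - 1) * p


def geometric_mean(p: float) -> float:
    """Average number of trials needed to see the first success."""
    return 1 / p


def prob_found_within(k: int, p: float) -> float:
    """P(at least one success somewhere in the first k trials)."""
    return 1 - (1 - p) ** k


fraud_rate = 0.0002  # 1 in 5000, as in the example above

print(geometric_pmf(2, fraud_rate))         # first fraud is the 2nd account audited
print(geometric_mean(fraud_rate))           # average number of accounts to audit
print(prob_found_within(5000, fraud_rate))  # chance of a hit within 5000 audits
```

One thing the average hides: auditing 5000 accounts does not guarantee a find. The chance of at least one fraud in the first 5000 accounts is only roughly 63%.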
The negative binomial will tell you how many accounts you need to look at on average to find 2 or more fraudulent accounts, and the probabilities associated with all the possibilities, from the number of frauds you are looking for up to infinity.
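Here is the same idea for more than one fraud, again as a sketch. I am using the "number of trials until the r-th success" convention to match the text; be aware that some libraries count failures instead, so check before comparing numbers. The helper names are illustrative.

```python
from math import comb


def negative_binomial_pmf(n: int, r: int, p: float) -> float:
    """P(the r-th success happens on trial n), for n = r, r + 1, ..."""
    if r < 1 or n < r:
        return 0.0
    # The last trial must be a success; the other r - 1 successes
    # can fall anywhere in the first n - 1 trials.
    return comb(n - 1, r - 1) * p**r * (1 - p) ** (n - r)


def negative_binomial_mean(r: int, p: float) -> float:
    """Average number of trials needed to see r successes."""
    return r / p


fraud_rate = 0.0002  # same assumed rate as the geometric example

print(negative_binomial_mean(2, fraud_rate))         # average audits to find 2 frauds
print(negative_binomial_pmf(10000, 2, fraud_rate))   # 2nd fraud is exactly account 10000
```

With r set to 1 this collapses back to the geometric distribution.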
Hypergeometric Distribution
The hypergeometric distribution is the distribution to use when you sample without replacement, that is, when whatever you draw does not go back into the pool. The canonical example is pulling black and white balls out of an urn.
That isn’t very interesting. Suppose that you want to form a committee of employees. You want it to be a truly random sample of individuals. You want to know how fair your random sample is according to some criterion, say the man/woman split. Is your random sample representative of the population of your company, team, etc.? Or did you get too many men, or too many women?
This distribution will tell you the probability of having that many women on your committee. If the probability is high enough, accept the committee composition. Otherwise, draw again. You could also use this argument to defend or monitor your hiring, promoting, and firing practices, etc.
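A quick sketch of the committee check. The company size, number of women, and committee size below are made-up numbers, and `hypergeometric_pmf` is an illustrative helper.

```python
from math import comb


def hypergeometric_pmf(k: int, population: int, successes: int, draws: int) -> float:
    """P(exactly k 'success' items in `draws` picks made without replacement)."""
    failures = population - successes
    if k < 0 or k > draws or k > successes or draws - k > failures:
        return 0.0
    return comb(successes, k) * comb(failures, draws - k) / comb(population, draws)


employees = 100  # assumed company size
women = 40       # assumed number of women in the company
committee = 10   # assumed committee size

# Probability of each possible number of women on the committee.
for k in range(committee + 1):
    print(k, round(hypergeometric_pmf(k, employees, women, committee), 4))
```

Scanning the output shows which splits are typical for a random draw and which ones should make you suspicious.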
Poisson Distribution
The Poisson distribution is about how many times something happens in some interval of space or time. It directly models the number of times that something happens. This may sound like the geometric and negative binomial distributions. It should; this is one of those things that looks like them.
In fact, there are some nice relationships between the Poisson and the negative binomial. The thing is that there is a difference. We can model count data very nicely and directly with the Poisson. We can model count data sort of indirectly with the negative binomial.
This direct modelling is interesting because it comes at a cost. That cost is that the mean has to equal the variance. Otherwise, the model is, strictly speaking, wrong. If we generalize the Poisson so that the variance is not equal to the mean, we get a model that is equivalent to the negative binomial. We have to do some rejiggering of the parameters to make it match exactly, but it can be done.
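You can see the mean-equals-variance property numerically, and use it as a rough sanity check on your own counts. This is a sketch: the rate of 3 and the `counts` list are invented for illustration, and the helper names are mine.

```python
from math import exp, factorial
from statistics import mean, variance


def poisson_pmf(k: int, rate: float) -> float:
    """P(exactly k events in an interval where `rate` events are expected)."""
    if k < 0:
        return 0.0
    return exp(-rate) * rate**k / factorial(k)


def poisson_mean_and_variance(rate: float, max_k: int = 100):
    """Mean and variance computed by summing over the pmf (truncated at max_k)."""
    ks = range(max_k + 1)
    mu = sum(k * poisson_pmf(k, rate) for k in ks)
    var = sum((k - mu) ** 2 * poisson_pmf(k, rate) for k in ks)
    return mu, var


model_mean, model_var = poisson_mean_and_variance(3.0)
print(model_mean, model_var)  # the two should agree

# Hypothetical observed counts, e.g. events per day.
counts = [2, 3, 1, 4, 0, 2, 9, 3, 1, 12]
sample_mean, sample_var = mean(counts), variance(counts)
print(sample_mean, sample_var)
```

If the sample variance comes out much larger than the sample mean, as it does for these invented counts, that is a hint that a plain Poisson is too restrictive and the negative binomial may be the better fit.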
So yeah, Poisson distributions are great for count data, but you have to be careful about when and how you use them.