So I wanted to write a little tutorial on quantile regression: what it is, how it works, and how to use it to great effect in Python. This is one of my favorite statistical models, and I feel like it is very underutilized. So do refer back to this tutorial often.
What is Quantile Regression?
Unlike ordinary least squares regression, quantile regression isn’t trying to fit the best line through the middle of your data. Instead, it tries to pass the best-fit line through a certain quantile of your data. At first glance, you may be tempted to say that seems pointless: we are trying to predict what will happen, and you are giving me a purposefully biased estimator. It is a little counter-intuitive. However, I want you to stop thinking in terms of the mean, and start thinking in terms of full probability distributions.
Intuitively, you know that there is some distribution of your data around that best-fit ordinary least squares line. Typically, for a given X-value, we assume a normal distribution around our predicted Y-value. If the data fall somewhere in the expected range of the distribution, we’re happy and say that the regression is successful.
But what if I told you that not all dependent variables are normally distributed? Also, it is often the case that you care about what is happening at the extreme ends of the distribution, rather than what is happening in the middle. Here’s an example that I think you can appreciate: climate change. Average global temperatures have only risen a little bit, which is scary in and of itself, but what about the temperature at the poles? Arguably, that is a more meaningful and impactful thing to study. Sure, you could restrict your dataset to the temperatures at the poles, but you’d be throwing away good data about what is happening at the equator. This is where quantile regression shines. You can incorporate information from every sensor, and then examine what pushes extreme temperatures world-wide. What causes an unusually hot day to occur, controlling for latitude? What about an extremely cold day? When you care more about the tail than the mean, quantile regression is the way to go.
Why Should You Care About Quantile Regression?
What I said above means that you can study things that are not normally distributed, and that don’t necessarily have a linear relationship. That makes this technique powerful. You are no longer constrained to a world where everything is built out of bell curves. Some things can only take on positive values, like the dollar amount of a transaction. The normal distribution extends past zero and into the negative numbers. For some applications this is completely fine, because “the bell” of your bell curve is so far away from zero that the mass that spills into negative territory is negligible. But then again, that also depends on the variability. Why take that risk? You can handle it easily with quantile regression, because you are not making any distributional assumptions.
Another reason to care about quantile regression is that it allows you to use a linear model to estimate non-linear effects. That is a huge advantage. The reason is that it lets you estimate an approximation of the conditional probability distribution, so you are no longer locked into a single point estimate. Most of us know that prediction is hard, especially about the future. Yet when we build models, we predict a single value. Doesn’t it make sense to predict a range of outcomes instead?
It is also a good idea to talk about risk estimation. I’ve talked about doing VaR calculations before, and you can easily apply quantile regression to do much of the same as what we did in that post. Imagine that you want a model that predicts the 90th percentile of some amount of financial risk. You can set up a simple linear model to do exactly that, and quantile regression will spit out an answer for you.
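As a minimal sketch of what that looks like, here is a 90th-percentile model fit with statsmodels’ QuantReg. The dataframe and the "loss" and "volatility" columns below are simulated placeholders I made up for illustration, not a real risk dataset:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in data: losses with a heavy right tail, scaled by volatility.
rng = np.random.default_rng(42)
df = pd.DataFrame({"volatility": rng.uniform(0.1, 2.0, 1000)})
df["loss"] = df["volatility"] * rng.exponential(1.0, 1000)

# Fit the conditional 90th percentile of loss as a linear function of volatility.
fit = smf.quantreg("loss ~ volatility", data=df).fit(q=0.90)
print(fit.params)  # intercept and slope at the 90th percentile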
How does Quantile Regression Work?
Quantile regressions work by estimating the parameters at a certain quantile of the distribution. In particular, they are simple regressions, but instead of using Mean Squared Error as the loss function, they use the quantile loss, which is slightly different from a typical loss function. The difference is that you must specify a parameter for the loss function: the quantile that the loss function will be looking at. The typical way that you will see this written in Python is:
import numpy as np

def quantile_loss(q, y, f):
    # q: Quantile to be evaluated, e.g., 0.5 for the median.
    # y: True value.
    # f: Fitted (predicted) value.
    e = y - f
    return np.maximum(q * e, (q - 1) * e)
Now you can use the minimize routine in scipy.optimize to determine the parameters of the regression given this loss function. However, this isn’t the full power of this type of model. You may only be interested in a particular quantile, but in my experience it is actually very useful to estimate multiple quantiles. Essentially, what you want to do is estimate a whole array of quantiles. This will give you a full distribution of the data: you can then generate Conditional Cumulative Distribution Functions (CCDFs), i.e., a cumulative distribution function for the variable of interest in your dataset, where the distribution of the output depends on your input variables. That leads me to:
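To make that concrete, here is a minimal sketch of fitting a grid of quantiles with scipy.optimize.minimize and the quantile_loss function above. The simulated data, the quantile grid, and the fit_quantile helper are all made up for illustration, not a canonical recipe:

from scipy.optimize import minimize

# Simulated data with skewed, non-constant noise, just for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 500)
y = 2.0 * x + rng.gumbel(0, 1 + 0.3 * x)

def fit_quantile(q, x, y):
    # Find the intercept and slope that minimize the average quantile loss.
    def objective(beta):
        f = beta[0] + beta[1] * x
        return np.mean(quantile_loss(q, y, f))
    return minimize(objective, x0=np.zeros(2), method="Nelder-Mead").x

quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]
fits = {q: fit_quantile(q, x, y) for q in quantiles}

# Predicted conditional quantiles at a single point, x = 5:
for q in quantiles:
    b0, b1 = fits[q]
    print(f"q={q:.2f}: {b0 + b1 * 5:.2f}")

Each fitted line is one slice of the conditional distribution; stacked together, the grid of quantiles traces out an approximate conditional CDF at any value of x.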
Things you can do that you can’t do with regular regressions
Here are some fun things you can do with quantile regression:
- Now that you are armed with a full distribution for your data at every possible point, if you are missing an observation you can draw a quantile from a uniform distribution and, using the inputs you do know, pick the corresponding point on the conditional distribution to fill in the missing value (see the sketch after this list).
- You can estimate power user responses to a policy change, i.e. the 90th percentile of your distribution may react differently than, say, your average user would.
- You can estimate likely ranges for a response variable. Instead of saying “the outcome is going to be 513 users,” you can say things like “there is a 90% probability that we will have between 423 and 619 users.” That is subtly different from saying “I’m 90% confident that the true number of users is between 413 and 613 users” (but that discussion is another blog post for another day).
- You can estimate a median regression instead of a mean regression.
- You can compare the full distribution under one set of inputs against the distribution under another set, and test statistically whether they are different (e.g., with a KS test).
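To close, here is a hedged sketch of the imputation idea from the first bullet. It assumes the fits and quantiles objects from the scipy example above; the interpolation scheme is just one simple way to turn a grid of fitted quantiles into a draw from the conditional distribution:

def impute_y(x_known, fits, quantiles, rng=None):
    # Predicted conditional quantiles of y at the known x value.
    rng = rng or np.random.default_rng()
    preds = [fits[q][0] + fits[q][1] * x_known for q in quantiles]
    # Draw a random quantile level and interpolate along the conditional CDF.
    u = rng.uniform(quantiles[0], quantiles[-1])
    return np.interp(u, quantiles, preds)

# Fill in a missing y for an observation whose x is 5.
print(impute_y(5.0, fits, quantiles))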