I wanted to write a little tutorial on quantile regression: what it is and how it works. Then I want to show you how to use it to great effect in Python. This is one of my favorite statistical models, and I feel like it is very underutilized, so do refer back to this tutorial often.

## What is Quantile Regression?

Unlike ordinary least squares regression, quantile regression isn’t trying to fit the best line through the middle of your data. Instead, it tries to pass the best-fit line through a chosen quantile of your data. At first glance, you may be tempted to say that seems pointless: we are trying to predict what will happen, and you are giving me a purposefully biased estimator. It is a little bit counter-intuitive. However, I want you to stop thinking in terms of the mean, and start thinking in terms of full probability distributions.

Intuitively, you know that there is some distribution of your data around that best-fit ordinary least squares line. Typically, for a given X-value, we assume a normal distribution around our predicted Y-value. If the data fall somewhere in the expected range of that distribution, we’re happy and say that the regression is successful.

But what if I told you that not all dependent variables are normally distributed? Also, it is often the case that you care about what is happening at the extreme ends of the distribution, rather than what is happening in the middle. Here’s an example that I think you can appreciate: climate change. Average global temperatures have only risen a little bit, which is scary in and of itself, but what about the temperature at the poles? Arguably, that is a more meaningful and impactful thing to study. Sure, you could restrict your dataset to the temperatures at the poles, but you’d be throwing away good data about what is happening at the equator. This is where quantile regression shines. You can incorporate information from every sensor, and then you can examine what pushes extreme temperatures worldwide. What causes an unusually hot day to occur, controlling for latitude? What about an extremely cold day? When you care more about the tail than the mean, quantile regression is the way to go.

## Why Should You Care About Quantile Regression?

What I said above means that you can study things that are not normally distributed, and that don’t necessarily have a linear relationship. That makes this technique powerful. You are no longer constrained to a world where everything is built from bell curves. Some things can only take on positive values, like the dollar amount of a transaction, whereas the normal distribution extends past zero and into the negative numbers. For some applications this is completely fine, because “the bell” of your bell curve is so far from zero that the mass falling in negative territory is negligible. But that also depends on the variability. Do you want to take that risk? With quantile regression you don’t have to, because it makes no assumption about the shape of the distribution.
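To make the transaction example concrete, here is a minimal sketch using made-up, strictly positive amounts drawn from a lognormal distribution (the data, the seed, and the variable names are all assumptions for illustration). It compares the 5th percentile you would get by assuming a bell curve with the 5th percentile read straight off the data.

```python
import numpy as np

# Hypothetical transaction amounts: strictly positive and right-skewed.
rng = np.random.default_rng(0)
amounts = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)

# 5th percentile under a normal assumption: mean - 1.645 standard deviations.
mu, sd = amounts.mean(), amounts.std()
normal_low = mu - 1.645 * sd

# 5th percentile taken directly from the data, no distribution assumed.
empirical_low = np.quantile(amounts, 0.05)

print(f"smallest amount:          {amounts.min():.2f}")
print(f"normal-based 5th pct:     {normal_low:.2f}")
print(f"empirical 5th percentile: {empirical_low:.2f}")
```

With data this skewed, the normal-based bound lands below zero even though no transaction is negative, while the empirical quantile stays inside the range of the data. That is the risk the paragraph above is describing.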

Another reason to care about quantile regression is that it allows you to use a linear model to estimate non-linear effects. That is a huge advantage. The reason is that it lets you estimate an approximation of the conditional probability distribution, one quantile at a time. You aren’t locked into a single point estimate. Most of us know that predicting is hard, especially forecasting the future. Yet when we build models, we predict a single value. Doesn’t it make sense to predict a range of outcomes?
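Here is a rough sketch of what "predicting a range" can look like. It fits three straight lines (the 10th, 50th, and 90th percentiles) to synthetic data whose spread grows with X. Everything in it is an assumption for illustration: the data are simulated, `fit_quantile_line` is a hypothetical helper, and the solver is a simple iteratively reweighted least squares loop chosen so the example only needs NumPy. In practice you would reach for a dedicated library.

```python
import numpy as np

def fit_quantile_line(x, y, q, n_iter=200, eps=1e-6):
    """Fit y ~ a + b*x at quantile q with iteratively reweighted least squares."""
    X = np.column_stack([np.ones_like(x), x])        # intercept + slope
    beta = np.linalg.lstsq(X, y, rcond=None)[0]      # start from the OLS fit
    for _ in range(n_iter):
        r = y - X @ beta
        # Quantile loss = w * |r| with w = q above the line and 1 - q below it.
        # Dividing by |r| rewrites it as a weighted squared error.
        w = np.where(r > 0, q, 1 - q) / np.maximum(np.abs(r), eps)
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta

# Synthetic data: the noise gets wider as x grows.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 1000)
y = 2 + 0.5 * x + rng.normal(0.0, 0.5 + 0.3 * x)

fits = {q: fit_quantile_line(x, y, q) for q in (0.1, 0.5, 0.9)}
for q, (a, b) in fits.items():
    below = np.mean(y <= a + b * x)                  # share of points under the line
    print(f"q={q}: intercept={a:.2f}, slope={b:.2f}, share below={below:.2f}")

# A range of outcomes for a new observation instead of a single number.
x_new = 8.0
lo, mid, hi = (fits[q][0] + fits[q][1] * x_new for q in (0.1, 0.5, 0.9))
print(f"at x={x_new}: 10th pct={lo:.2f}, median={mid:.2f}, 90th pct={hi:.2f}")
```

Each line is linear, but the slopes differ across quantiles, so taken together the lines describe a distribution that fans out as X increases. A single least squares line could not show that.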

It is also a good idea to talk about risk estimation. I’ve talked about doing VaR (value at risk) calculations before, and you can easily apply quantile regression to do much of what we did in that post. Imagine that you want a model that predicts the 90th percentile of some amount of financial risk. You can set up a simple linear model to do exactly that, and quantile regression will spit out an answer for you.

## How does Quantile Regression Work?

Quantile regression works by estimating the parameters at a certain quantile of the distribution. These are ordinary regressions, but instead of using mean squared error as the loss function, they use the quantile loss (also known as the pinball loss). The difference is that this loss takes a parameter: you must specify the quantile that the loss function targets. The typical way that you will see this written in Python is:

```python
import numpy as np


def quantile_loss(q, y, f):
    # q: Quantile to be evaluated, e.g., 0.5 for median.
    # y: True value.
    # f: Fitted (predicted) value.
    e = y - f
    # Under-predictions (e > 0) cost q * e; over-predictions cost (1 - q) * |e|.
    return np.maximum(q * e, (q - 1) * e)
```
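
To see why this loss targets a quantile, here is a small sketch. It repeats the loss function so the block runs on its own, shows that the penalty is lopsided, and then searches a grid of constant predictions for the one with the lowest average loss. The exponential sample, the seed, and the grid are assumptions for illustration.

```python
import numpy as np

def quantile_loss(q, y, f):
    e = y - f
    return np.maximum(q * e, (q - 1) * e)

# With q = 0.9, missing low by 2 costs far more than missing high by 2.
under = quantile_loss(0.9, 10.0, 8.0)    # prediction too low:  0.9 * 2
over = quantile_loss(0.9, 10.0, 12.0)    # prediction too high: 0.1 * 2
print(f"under-prediction cost: {under:.2f}, over-prediction cost: {over:.2f}")

# Skewed sample, so the 90th percentile is far from the mean.
rng = np.random.default_rng(42)
y = rng.exponential(scale=10.0, size=2000)

# Try many constant predictions and keep the one with the lowest average loss.
q = 0.9
candidates = np.linspace(y.min(), y.max(), 1000)
mean_loss = np.array([quantile_loss(q, y, c).mean() for c in candidates])
best = candidates[mean_loss.argmin()]

print(f"loss-minimizing constant: {best:.2f}")
print(f"sample 90th percentile:   {np.quantile(y, q):.2f}")
print(f"sample mean:              {y.mean():.2f}")
```

The constant that minimizes the loss sits at (approximately, given the grid) the sample's 90th percentile rather than its mean. Because under-predictions are nine times as expensive as over-predictions, the best guess gets pushed up until only about 10% of the data lie above it. Setting `q = 0.5` makes the penalty symmetric and recovers the median.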