Bayesian Auto-Regressive Time Series Analysis in PYMC3

I have written a lot of blog posts on using PYMC3 to do bayesian analysis. And I have a few where I have even dealt with Time-Series datasets. To name a one, I have done one on time varying coefficients. In this post, I want to explore a really simple model, but it is one that you should know about. That is the AR(1) model. Typically, you will see this model in a frequentist setting. It is really about patching up the errors of a model so that they are normally distributed. Really, what is going on under the hood is a partial differencing of the time-series to achieve stationarity. You can read more about my thoughts on the subject here. So today we’ll explore the Bayesian Auto-Regressive model.

Anyway, the nice thing about this model is that it is already available in the form of a PYMC3 distribution. So we just need some data that we can plug into the model and it should be as simple as running it as is. There is no special coding needed to do the the analysis fit the data.

So we will use a familiar dataset. The data is the Prussian Calavry horse kick data set. We used it previously in the post about doing A/B testing on count data when we have multiple categories, or versions/groupings of the counts. You can of course get the count data from here.

Cleaning the Data

As always the first step in doing an analysis is getting the data into a useable format. So we need to do some cleaning on this dataset. We will collapse the dataset into yearly counts, instead of having counts for every unit per year. But in order to do that we need to import some of our favorite libraries and get our data into a pandas dataframe.

import pandas as pd
import pymc3 as pm
import matplotlib.pyplot as plt

df=pd.read_csv('/home/ryan/Documents/HorseKicks.csv')

As advertised the code above just imports libraries and loads the data. If you ran this code, you would find that the data isn’t in a very useable format. So we need to change the data into a simple time series by aggregating on the year. To do that we need to make the year the index variable, and then sum up the columns. This is pretty easy to do with the code below.

df.index = df['Year']
df.drop('Year',axis=1,inplace=True)
df = df.sum(1)

Let’s have a look at what this data looks like, shall we?

df.plot()
plt.show()

That will give you the following plot:

It looks pretty stationary to me. But there may be some autocorrelation going on. So I think this model would be good to use, just eyeballing it. You can of course run all kinds of statistical tests like we did to look at whether ARIMA was a solid model, I won’t drag you through that again though and we will proceed with the technique.

So we will need the model to do an AR(1) likelihood. Fortunately for us, PYMC3 already has that likelihood prebuilt we just have to use it. So the code below develops the full bayesian model.

with pm.Model() as model:
    k_=pm.Uniform('k',-1,1)
    tau_=pm.Gamma('tau',mu=1,sd=1)
    obs=pm.AR1('observed',k=k_,tau_e=tau_,observed=df)
    trace=pm.sample()

You will notice that I am breaking with my traditional approach of using a flat prior and using a Uniform prior. I’m doing that, because AR coefficients should be bound between -1 and 1. Which you can see is the prior that I placed on the parameter k. k is the coefficient on the AR process.

Running this will give you output that should look like this:

Auto-assigning NUTS sampler...
Initializing NUTS using ADVI...
Average Loss = 77.368:   5%|▍         | 9773/200000 [00:03<01:05, 2913.42it/s] 
Convergence archived at 9800
Interrupted at 9,800 [4%]: Average Loss = 369.97
100%|██████████| 1000/1000 [00:03<00:00, 291.06it/s]

But that doesn't let you analyze anything so you need to have a look at the trace for this model which you can do using the following code:

pm.plot_posterior(trace,'k')
print(pm.summary(trace))

That will give you the following output:

k:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  0.842            0.095            0.007            [0.662, 0.995]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  0.633          0.780          0.848          0.916          0.989


tau:

  Mean             SD               MC Error         95% HPD interval
  -------------------------------------------------------------------
  
  0.038            0.011            0.000            [0.018, 0.061]

  Posterior quantiles:
  2.5            25             50             75             97.5
  |--------------|==============|==============|--------------|
  
  0.020          0.030          0.037          0.044          0.063

This output shows that there is a 50% probability that the coefficient on the AR(1) term is greater than 0.848. And its most likely value is 0.842. That makes a lot of sense given what I typically see in frequentist statistics on these AR terms, so I'm going to throw this into the reasonable pile.

The code above should also display this figure:

And that shows the posterior distribution of the coefficient on the AR(1) term. So you can see that it is fairly wide but it looks about like what we would expect. And that's it with the Bayesian AR(1) model.