Bayesian Panel Data Cigarette Tax Analysis

Cigarette Taxes

TL;DR Raising the cigarette tax by one cent in every state would save us about $1 billion per year in tobacco-related health expenditures. In the grand scheme of health care costs, that isn’t much money.

So with this post, I wanted to get back to my roots as an economist. I decided that I wanted to write a blog post similar to the economics papers that got me so gosh darn interested in the field of economics. When I was a tender undergraduate majoring in physics, I ended up taking a class called “The economics of drugs, sex, and crime”. It was supposed to be my social science general credit. Unfortunately, that class derailed me from my path as a physicist. Oh well, I never really got an intuitive grasp on Maxwell’s equations so probably for the best.

At any rate, the thing that fascinated me about this class was how it blended all sorts of thorny things together. You had politics, in bed with rational policy, making out with culture and crime, all wrapped up in the comfortable blanket of mathematics. It was like a drug for me; I couldn’t turn away. I had to have more. The thought that something as simple as raising or lowering a tax could change whether people smoked or not. Or that the money supposedly to be had in the drug trade is mostly an illusion, and it’s really about indebtedness and aspirational goods, which makes the kids selling dope on the street corner look more middle class than you might think. Just delicious.

So I started down this path, and it became fascinating to me to see how policy levers can be pulled to affect actual real-life behavior. And that’s where this analysis comes in. Today, I want to answer the question, “What would happen to health expenditures if we raised cigarette taxes by 1 percent?”

So I jumped onto the CDC’s website and pulled down the data related to cigarette taxes, and tobacco related health expenditures. You can get this data from here. And you can follow along at home.

Effect Identification Strategy

So this is a blog post and not an academic article. I’m not trying to pretend that it rises to the level of rigor that would be required of a peer reviewed academic article, but I see a lot of blog posts out there that I look at and go, well that’s probably not the right answer. To that end, I want to take some time and talk about the panel data method that I am using and why I think that it will get us pretty close to the right answer when it comes to this question.

The basic idea for why panel regressions give the correct answer comes down to quasi-experiments. You have some entity, a person, a state, whatever, that you measure at various points in time. Concretely, in our example we’ll be looking at states every year from 2005 to 2009. It isn’t a huge dataset, but again, I want to approach believability, not academic peer-review rigor, so this will do fine. Each of these entities acts as a potential control for the others, and observing several time periods gives us controls within a state as well. How? Basically, we can watch a variable get turned up or down over time and see what effect it has, washing out any state-level fixed effects and any temporal fixed effects.
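In equation form, the two-way fixed-effects model I have in mind looks like this, with $i$ indexing states and $t$ indexing years:

```latex
\log(\text{expenditure}_{it}) = \beta \,\log(\text{tax}_{it}) + \alpha_i + \gamma_t + \varepsilon_{it}
```

Here $\alpha_i$ are the state fixed effects, $\gamma_t$ are the year fixed effects, and $\beta$ is the elasticity we are after. This is exactly the structure the model code below implements.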

I think a picture is worth a thousand words to explain what I’m talking about here. More on how to generate this figure below.

Each line in this figure represents a state. You will notice that some states played around with their cigarette taxes during this time, while other states kept their taxes exactly the same. Nature has essentially handed us a perfect experiment. Some of our states are going to act as a control group; the treatment group is the set of states that changed their cigarette tax laws during this period. Moreover, some of the treatment group lowered the tax, some increased it, and others did both. So we should be able to see what happened to their health expenditures over time. This is the essence of why I think this method will get me the right answer.

Clean the Data

So let’s start by cleaning the data. It came in two separate files, so we know there will be merging involved. Some of our numerical values are encoded as strings, so we’ll need to convert those to numbers, which may involve some fancy footwork with regular expressions, since some of the values have words like “per pack” appended to them. We’ll also do some grouping: the tax is recorded every quarter, but health expenditures are only recorded for the year, so we’ll take the average tax for the year. If this were an academic paper, we’d examine the assumption that averaging over the year is appropriate; since it is not, we’re just going to say we’re good. We’re also going to drop missing values. That’s probably a bad idea too, but let’s go with it because it makes life easier for a humble blogger.

import pandas as pd
import re
import numpy as np
import matplotlib.pyplot as plt
import pymc3 as pm

temp1 = pd.read_csv('~/Documents/tobacco/Expenditures.csv')
temp2 = pd.read_csv('~/Documents/tobacco/Taxes.csv')

temp3 = temp2[temp2['ProvisionDesc'] == 'Cigarette Tax ($ per pack)'].copy()
tax = []
for obj in temp3['ProvisionValue']:
    # pull the first numeric token, e.g. "0.435" out of "$0.435 per pack";
    # re.search avoids the findall capturing-group pitfall
    match = re.search(r'\d+(?:[.,]\d+)?', str(obj))
    if match:
        tax.append(match.group(0).replace(',', '.'))
    else:
        tax.append(np.nan)
temp3['tax'] = tax
temp3['tax'] = temp3['tax'].astype('float64')
temp4 = temp3.groupby(['Year','LocationAbbr'])['tax'].mean().reset_index()
df = pd.merge(
                temp1[temp1['Variable']=='Total'],
                temp4,
                how='inner',
                on=['Year','LocationAbbr']
            )
df = df[['Year','LocationAbbr','Data_Value','tax']]

Great! Now we have data in a usable format. Let’s take a look at some preliminary visualization to see if things make sense. I already introduced this chart, and why it is important, so I won’t belabor that point any longer. I will, however, show you the code that generates it:

for state in df['LocationAbbr'].unique():
    sub = df[df['LocationAbbr'] == state]
    plt.plot(sub['Year'], sub['tax'])
plt.title('Variation in Tobacco Tax By State')
plt.ylabel('Tax ($ per pack)')
plt.xlabel('Year')
plt.show()

Here is a similar plot, this time for our response variable. The interesting thing is that we see a clear upward trend for all states. Some trend upward faster than others, but it points to the fact that we’ll need to control for time somehow, so we’ll use temporal fixed effects in addition to state-level fixed effects.

It was generated very similarly, so similar that I probably should have written a function. I’ll kick myself for not doing that later.

for state in df['LocationAbbr'].unique():
    sub = df[df['LocationAbbr'] == state]
    plt.plot(sub['Year'], sub['Data_Value'])
plt.title('Tobacco-Related Health Expenditures By State')
plt.ylabel('Expenditures (Millions of Dollars)')
plt.xlabel('Year')
plt.show()
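Since I admitted I should have written a function, here is roughly what it would look like. This is a quick sketch; the function name is mine, and the column names match the dataframe above.

```python
import pandas as pd
import matplotlib.pyplot as plt

def plot_by_state(df, value_col, title, ylabel):
    """Plot one line per state for the given column over time."""
    for state in df['LocationAbbr'].unique():
        sub = df[df['LocationAbbr'] == state]
        plt.plot(sub['Year'], sub[value_col])
    plt.title(title)
    plt.ylabel(ylabel)
    plt.xlabel('Year')
    plt.show()

# Both figures above become one-liners:
# plot_by_state(df, 'tax', 'Variation in Tobacco Tax By State', 'Tax ($ per pack)')
# plot_by_state(df, 'Data_Value', 'Tobacco-Related Health Expenditures By State',
#               'Expenditures (Millions of Dollars)')
```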

So now all we need to do is generate our fixed effects, and then get started building a model.

df['year_index'] = pd.factorize(df['Year'])[0]
df['state_index'] = pd.factorize(df['LocationAbbr'])[0]
# avoid log(0): nudge zero-tax observations to a small positive value
df['tax'] = [obj if obj != 0 else 0.0001 for obj in df['tax']]
df['logtax'] = np.log(df['tax'])
df['logExpense'] = np.log(df['Data_Value'])
df.dropna(inplace=True)

I did some stuff here. The year and state indices that I created using pandas’ factorize method will let us index the fixed effects when we build the model. I also take the log of both taxes and expenditures, which lets me estimate the elasticity of health expenditure with respect to taxes, hereafter just “the elasticity.” This is also where I drop the missing values.
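If you haven’t seen factorize before, here is a toy example of what it does (made-up values, just for illustration):

```python
import pandas as pd

# factorize maps each unique value to a dense integer code,
# which is exactly what we need to index the fixed-effect vectors
codes, uniques = pd.factorize(pd.Series([2005, 2006, 2005, 2007]))
print(list(codes))    # [0, 1, 0, 2]
print(list(uniques))  # [2005, 2006, 2007]
```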

Now let’s build a model.

with pm.Model() as model:
    state_fixed_effects = pm.Flat('State_Fixed', shape=len(df['LocationAbbr'].unique()))
    time_fixed_effects = pm.Flat('Time_Fixed', shape=len(df['Year'].unique()))
    tax_beta = pm.Flat('beta')
    lm = pm.Deterministic(
        'mu',
        tax_beta * np.array(df['logtax'])
        + state_fixed_effects[np.array(df['state_index'])]
        + time_fixed_effects[np.array(df['year_index'])]
    )
    # the noise scale must be positive, so use a flat prior on the half-line
    sigma = pm.HalfFlat('sigma')
    obs = pm.Normal('Observed', mu=lm, sd=sigma, observed=df['logExpense'])
    trace = pm.sample(1000, tune=1000)

This will set up and run the model. I used flat priors so that I wasn’t making too many assumptions. We could have set stronger priors, or we could even have set up a hierarchical model. But that would just be adding fanciness that doesn’t buy us much unless the prior is strong enough to actually affect the outcome.

We can examine the posterior distribution of the elasticity with a trace plot, which takes a single line of code.

pm.traceplot(trace,varnames=['beta'])
plt.show()

And that should get you something that looks like this:

This figure shows that the elasticity is somewhere around -0.0035. That would mean that a 1% increase in the cigarette tax should lead to a 0.0035% decrease in expenditures related to tobacco. That’s pretty inelastic, but then again, think about it: smoking is an addiction, so we should expect it to be inelastic.

A one-cent increase in the tax is, on average, about a 2.3% increase in the tax per pack. So, on average, raising the cigarette tax by one cent should decrease expenses by something like 0.0081%, which works out to about $19.29 million per state per year, or $964.57 million nationally per year. That sounds like a huge number, but it basically rounds up to a billion dollars per year for a one-cent-per-pack national increase in the cigarette tax. This is, of course, a back-of-the-envelope calculation, and we could improve it significantly.
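The back-of-envelope arithmetic is simple enough to reproduce in a few lines. The elasticity and the 2.3% figure come from the analysis above, and the $19.29 million per-state number is the one quoted in the text; I’m just showing how the pieces multiply together.

```python
# Posterior elasticity from the model above, and the ~2.3% per-cent-of-tax figure
elasticity = -0.0035        # % change in expenditures per 1% change in tax
pct_tax_increase = 2.3      # a one-cent hike is ~2.3% of the average per-pack tax

# Elasticity is defined on percent changes, so the two simply multiply
pct_expense_change = elasticity * pct_tax_increase   # about -0.0081 percent

# Dollar figures: the post's per-state estimate, scaled to 50 states
per_state_savings_millions = 19.29
national_savings_millions = per_state_savings_millions * 50   # about $964.5M/year
```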

Now, I don’t know how much consumer surplus you would lose under such a plan, and I am not sure how much tax revenue it would raise. It is quite possible the net present value of this hare-brained scheme of imposing a national one-cent tax on every pack of cigarettes could be negative. Also, when we’re talking about health care costs, a billion dollars annually just isn’t that much. The Centers for Medicare and Medicaid Services tell me that we spent $3.3 trillion in 2016, so $1 billion amounts to a decrease of about 0.03% of all health care spending. You could easily see that get gobbled up by inflation next year. So we aren’t talking about a huge gain here, but it could help a little. What do you think? Leave me a comment below.