The inaugural podcast episode for “The Data Nerd Podcast” is finally here!
In this episode:
I talk about my fun little exercise in dynamic time warping. If you don’t know what that is, it is a new drug to hit the streets for data nerds everywhere. Actually, dynamic time warping is this nifty little algorithm which I like to think of as the time series equivalent of Levenshtein distance. Levenshtein distance, which works on strings, counts the number of edits you need to make to one string so that it is identical to another. In the same way, dynamic time warping stretches and compresses one time series until it lines up with another as closely as possible, and the total cost of that alignment is the distance between them. It does this by filling in a dynamic-programming cost matrix that literally warps the time dimension of the series. Hence the name: dynamic time warping.
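If you want to see what that cost matrix looks like in code, here is a minimal sketch of the classic dynamic-programming formulation (not the exact code from the episode, just the textbook version of the algorithm):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-programming DTW between two 1-D series."""
    n, m = len(a), len(b)
    # cost[i, j] = cheapest cumulative cost of aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # just like Levenshtein: step from a match, an insertion, or a deletion
            cost[i, j] = d + min(cost[i - 1, j],      # stretch b
                                 cost[i, j - 1],      # stretch a
                                 cost[i - 1, j - 1])  # advance both
    return cost[n, m]

# identical series have zero distance...
print(dtw_distance([0, 1, 2, 1, 0], [0, 1, 2, 1, 0]))     # 0.0
# ...and a time-shifted copy does too: the warp absorbs the shift
print(dtw_distance([0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0]))  # 0.0
```

That last line is the whole point: two series with the same shape but shifted in time still come out as near-identical, which plain Euclidean distance would never tell you.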
Anyway, I launch into talking about dynamic time warping early on in the episode.
But what I was really proud of was using dynamic time warping to embed website data into a single dimension. I picked one prototype time series and computed each series’ DTW distance to it, which gave me one feature, in similarity space, that I could work with. Essentially, I embedded every time series I was working with into that similarity space.
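The embedding itself is a one-liner once you have a DTW function: map each series to its distance from the prototype. Here is a sketch with toy data (the function and series here are illustrative, not the real website data):

```python
import numpy as np

def dtw_distance(a, b):
    """Minimal dynamic-programming DTW between two 1-D series."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = abs(a[i - 1] - b[j - 1]) + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def embed(series_list, prototype):
    """Map each series to a single number: its DTW distance to the prototype."""
    return np.array([dtw_distance(s, prototype) for s in series_list])

# toy example: two shapes close to the prototype, one very different
prototype = [0, 1, 2, 1, 0]
series = [[0, 1, 2, 1, 0], [0, 1, 3, 1, 0], [5, 5, 5, 5, 5]]
features = embed(series, prototype)
```

Each series is now a single number, and “looks like the prototype” just means “small number”.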
The cool thing about this transformation is that my dataset turned out to be linearly separable in this space. So I dusted off my logistic regression and built a simple classifier that could distinguish between the two different types of time series.
And now for the big reveal, what was I trying to classify?
I know that it is kind of a silly application, but it had some real business value. On Super Bowl Sunday, we had all kinds of alerts going off saying that traffic and transaction volume were down. It turns out that nobody was visiting because of the Super Bowl.
So, I pulled down some data from Google Analytics. In fact, I pulled about 8 hour-by-hour accounts of what went on on the website, covering several Sundays, including a couple of Super-Bowl Sundays. First I normalized the data; usage had grown over the years, and normalizing put every Sunday on a common scale. A simple plot showed that Super-Bowl Sunday was highly anomalous compared to other Sundays.
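The episode doesn’t spell out the exact normalization, but one simple way to get that common scale is to divide each day’s hourly counts by that day’s total, so you compare the shape of the traffic rather than its volume. A sketch, under that assumption:

```python
import numpy as np

def normalize(day_counts):
    """Scale a day's hourly traffic to fractions of the daily total,
    so years with very different overall volume become comparable."""
    day_counts = np.asarray(day_counts, dtype=float)
    return day_counts / day_counts.sum()

# hypothetical numbers: 2017 has double 2015's traffic but the same shape
sunday_2015 = [100, 300, 400, 200]
sunday_2017 = [200, 600, 800, 400]
# after normalizing, the two curves are identical
print(normalize(sunday_2015))
print(normalize(sunday_2017))
```

A per-series z-score would work just as well; the point is only that shape, not raw volume, drives the comparison.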
In the figure above, several Sundays are plotted in blue, but Super-Bowl Sunday is plotted in red. You can tell that it is different from all of the recent Sundays: slightly higher traffic around noon, and much lower traffic in the afternoon. Now at this point, all I had done was confirm that website traffic was anomalous that day, not that the Super Bowl caused the alarms to go off. For that I needed to check whether or not this pattern looked the same as other Super-Bowl Sundays.
So I grabbed data for the last 3 Super Bowl Sundays, and produced a similar plot.
Seeing this figure, I was pretty convinced that the alarms were being set off by a typical Super-Bowl-Sunday traffic pattern, and that it was actually nothing to worry about. Now, a lesser data professional would have stopped here. But no. I am a true data nerd. So I applied dynamic time warping to this problem.
What I discovered was that, as these pictures indicate, Super-Bowl Sunday 2017 was much closer to previous Super-Bowl Sundays than to regular Sundays. In fact, I got really nerdy and ran a t-test, which showed a statistically significant difference between Super-Bowl Sunday’s DTW distance to normal Sundays and its distance to other Super-Bowl Sundays. The effect size is hard to interpret, but it was about 50 units closer, whatever that means.
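For the curious, that kind of test is a few lines with SciPy. The distances below are made-up stand-ins (the post doesn’t give the real values), chosen only so the made-up gap is in the same ballpark as the roughly 50-unit effect mentioned above:

```python
import numpy as np
from scipy import stats

# hypothetical DTW distances from Super-Bowl Sunday 2017 to each comparison day
dist_to_regular_sundays = np.array([120.0, 135.0, 128.0, 140.0, 131.0])
dist_to_superbowl_sundays = np.array([75.0, 82.0, 79.0, 88.0])

# two-sample t-test: are the two groups of distances different on average?
t_stat, p_value = stats.ttest_ind(dist_to_regular_sundays, dist_to_superbowl_sundays)
effect = dist_to_regular_sundays.mean() - dist_to_superbowl_sundays.mean()
print(f"t = {t_stat:.2f}, p = {p_value:.5f}, gap = {effect:.1f} units")
```

With real data you would also want `equal_var=False` (Welch’s test) unless you have a reason to assume equal variances.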
Anyway, you would think that would have been good enough. But I really am a big ol’ data nerd. And I thought to myself: I can build a classifier to detect whether or not the Super Bowl occurred. Looking back on the effort that went into this, I think I may have been going a little too deep down this rabbit hole. But meh, data nerd…
So I built a logistic regression to tell me the probability that a sample came from a Super-Bowl Sunday rather than a regular Sunday. The nice thing was that the data was linearly separable in this similarity space, which is what makes embedding the data in this space such a useful thing to do.
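With one similarity feature per Sunday, the classifier is about as small as classifiers get. A sketch with scikit-learn, using invented distance values (small distance = close to the Super-Bowl prototype):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# one feature per Sunday: DTW distance to a Super-Bowl prototype (hypothetical values)
X = np.array([[5.0], [8.0], [12.0], [90.0], [95.0], [110.0], [102.0]])
y = np.array([1, 1, 1, 0, 0, 0, 0])  # 1 = Super-Bowl Sunday, 0 = regular Sunday

clf = LogisticRegression().fit(X, y)

# a new Sunday with a small distance should look like a Super-Bowl Sunday
print(clf.predict([[10.0]]))   # Super-Bowl-like
print(clf.predict([[100.0]]))  # regular-Sunday-like
print(clf.predict_proba([[10.0]]))  # class probabilities, not just labels
```

Since the two groups don’t overlap on this one axis, the model is just learning a threshold, which is exactly what linear separability buys you.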
The end result: on my holdout dataset of 2 observations (the data from this Sunday, and data from a regular Sunday), I could almost perfectly predict whether a sample came from a Super-Bowl Sunday. And for those of you who think I was wasting my time: well, I got to report that the alarms were being set off by the Super Bowl, and that the probability the Super Bowl caused the weird patterns we were seeing was 99.99999%. Boom! Data nerds for the win! Actually, that statement is wrong. P(x|y) does not equal P(y|x), so forget about it.
Almost as impressive as the Patriots’ comeback.
For those of you who want to see my messy scripting, check out this GitHub repo.