Over the past few months I’ve had a thought coalescing in my brain. It has been swirling around, banging up against the inside of my skull. With each collision, whether with my head, with other ideas, or with the situations I find myself in, this idea has been taking shape.
It has been forming into the seeds of a ranting post about data science. I realize that this is going to grow into a pretty healthy rant. There just isn’t a whole lot that I can do about it. I need to get this out of me.
In these blog posts that we tend to write, we’ve been doing data science wrong.
Here’s the problem:
We don’t focus enough on infrastructure. The sad fact is that we focus on technique, but where the rubber meets the road, at least in my opinion, is getting predictions in front of an end user. We don’t talk enough about getting the data to the person. Often we settle for a hand-wavy statement like “and then you display this output to someone” or “and then it would kick off an automated process.”
I have no intention of fixing this in my own blog, especially in posts that are about a particular technique I want to cover. But at the end of the day, what we need to do is talk about how to get a model somewhere it can be used.
I have had quite a bit of success deploying models to Heroku as APIs, so I may do a post on that in the near future. But the truth is that it takes infrastructure to do machine learning in the first place. You have to have the ability to run A/B tests. My career has been littered with good ideas that were never implemented because of a lack of infrastructure. So maybe we should focus on teaching newcomers to this field how to develop infrastructure, and how to negotiate to make that happen, rather than on cool algorithms.
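To make "a model as an API" a little less hand-wavy, here is a minimal sketch of what that could look like. It uses only Python's standard library. The weights, the feature names, and the `/predict` route are all made up for illustration and stand in for a real trained model and a real web framework. The one platform-specific assumption is that the host passes the port in a `PORT` environment variable, which is how Heroku does it.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for a trained model: fixed weights for a linear score.
WEIGHTS = {"visits": 0.5, "purchases": 2.0}
BIAS = 1.0


def predict(features):
    """Return a linear score; features missing from the input count as 0."""
    return BIAS + sum(
        weight * float(features.get(name, 0.0))
        for name, weight in WEIGHTS.items()
    )


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Only one (hypothetical) route is supported.
        if self.path != "/predict":
            self.send_error(404, "unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            features = json.loads(self.rfile.read(length) or b"{}")
            body = json.dumps({"prediction": predict(features)}).encode()
        except (ValueError, TypeError, AttributeError):
            # Bad JSON, a non-object payload, or non-numeric feature values.
            self.send_error(400, "expected a JSON object of numeric features")
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def main():
    # Assumption: the hosting platform supplies the port via PORT.
    port = int(os.environ.get("PORT", 8000))
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Calling `main()` starts the server, and posting `{"visits": 2, "purchases": 1}` to `/predict` returns a JSON body with a `prediction` field. The point is not the toy model. The point is that everything around `predict` is the infrastructure part, and it is the part we usually skip.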
Anyway, there is my rant. What do you think? Am I totally right, or am I off base? Leave your thoughts in the comments.