So you want to be a data scientist. That’s great. But you may be wondering where to start. Sometimes it feels like you are trying to do the impossible, like getting up to the speed of light. Things get awfully heavy as you start going faster, and each incremental increase in speed seems harder than the last. Good news for you: it isn’t that hard!
Remember, data science may seem like a daunting discipline to learn, but it is a fairly new one. Sure, it borrows from many other disciplines, like statistics, machine learning, database storage, and information retrieval, all of which have long histories. But remember, they were created by other human beings. That brings me to the main point: another human being created the technology. The concepts really do fit inside a human brain. You can understand these things.
So what do you need to be a good data scientist? In my experience, and in my totally subjective, not-data-driven opinion, it’s coding chops.
I think that is what separates a good statistician from a data scientist waiting to happen. I remember in graduate school, when somebody had something that needed to be coded that wasn’t canned in a statistics package, they turned to me. You need to pull data off of a thousand pages on this website, and then run that data through some analysis? No problem, give me an afternoon and a thank-you in your paper. That is the sort of coding chops that I am talking about. How about an estimator that isn’t supplied in the standard library of my statistics program? Write down the math and I will crank it out in an hour or two. Caution: it may run slowly until I dig into the guts of the math. How about estimating the solution to an ordinary differential equation with no closed-form solution, given these initial conditions? No problem, I wrote a program that did that sort of thing last week.
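To make the differential equation request concrete, here is a minimal sketch of that kind of program. It uses the classical fourth-order Runge-Kutta method, which is my choice for illustration and not necessarily what the original program used. The function name `rk4`, the step counts, and the example equation are all assumptions.

```python
import math


def rk4(f, t0, y0, t_end, steps):
    """Approximate y(t_end) for y' = f(t, y) with y(t0) = y0, using classical RK4."""
    h = (t_end - t0) / steps
    t, y = t0, y0
    for _ in range(steps):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h * k1 / 2)
        k3 = f(t + h / 2, y + h * k2 / 2)
        k4 = f(t + h, y + h * k3)
        y += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h
    return y


# Sanity check on an equation whose answer is known: y' = y, y(0) = 1, so y(1) = e.
approx_e = rk4(lambda t, y: y, 0.0, 1.0, 1.0, 100)
print(approx_e, math.e)

# An illustrative equation without an obvious closed-form solution:
# y' = sin(t * y), y(0) = 1, evaluated at t = 2.
y_at_2 = rk4(lambda t, y: math.sin(t * y), 0.0, 1.0, 2.0, 200)
print(y_at_2)
```

Checking the solver against an equation with a known answer first is a cheap way to catch mistakes before trusting it on an equation where there is nothing to compare against.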
That is the kind of coding chops that you should have to be a data scientist. If you can make yourself that sort of resource, then you stand a chance of becoming a data scientist. I suggest learning to code in Python. If you want to get started with Python, check out my course that teaches the basics of Python.
Next on your list of things to learn to be a data scientist is some statistics. Sure, you can learn all sorts of great things, but in my experience, the descriptive statistics that I produce tend to be the best-understood analyses that I have done. These sorts of analyses also tend to be the ones that make the most impact on my organization. That isn’t to say that the other algorithms aren’t important; it’s just that for the most part I tend to do lots and lots of A/B testing. So we get to do things like take the average number of clicks per impression for each variant and see if one is statistically different from the other. This is just something that you have got to know.
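As a rough sketch of what that A/B comparison can look like, the function below runs a two-proportion z-test on click-through rates. It assumes each impression produces at most one click, so the rate is a proportion, and that the samples are large enough for the normal approximation. The function name and the counts in the example are made up for illustration.

```python
import math


def two_proportion_z_test(clicks_a, impressions_a, clicks_b, impressions_b):
    """Return (z, two-sided p-value) for the difference in click-through rate."""
    rate_a = clicks_a / impressions_a
    rate_b = clicks_b / impressions_b
    # Pooled rate under the null hypothesis that both variants share one rate.
    pooled = (clicks_a + clicks_b) / (impressions_a + impressions_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / impressions_a + 1 / impressions_b))
    z = (rate_b - rate_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: variant A got 200 clicks and variant B got 260,
# each on 10,000 impressions.
z, p = two_proportion_z_test(200, 10_000, 260, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the two click-through rates differ by more than chance alone would explain; what counts as small is a threshold to agree on before looking at the data.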
But descriptive statistics is really what you need to know. Imagine my shock when I got my first job, and my manager (a non-technical business type) said, “Yes, this is the mean and standard deviation. I get that, but I just wish that there was some way to actually <i>see</i> the distribution that you are talking about. I am a very visual person.” I quickly produced a histogram, and boom, almost-instant celebrity status.
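A histogram does not need a plotting library to get the point across. The sketch below prints a plain-text one using only the standard library. The helper name `text_histogram`, the bin width, and the randomly generated sample are all assumptions for illustration, not the data from the story.

```python
import random
from collections import Counter


def text_histogram(values, bin_width):
    """Return one line per bin between the lowest and highest occupied bins."""
    # Floor-divide each value by the bin width to find which bin it falls in.
    bins = Counter(int(v // bin_width) for v in values)
    lines = []
    for b in range(min(bins), max(bins) + 1):
        # A Counter returns 0 for missing bins, so empty bins print no marks.
        lines.append(f"{b * bin_width:>6.1f} | {'#' * bins[b]}")
    return lines


# Illustrative data: 200 draws from a normal distribution (mean 50, std dev 10).
random.seed(0)
samples = [random.gauss(50, 10) for _ in range(200)]
print("\n".join(text_histogram(samples, 5)))
```

Each row is labeled with the lower edge of its bin, and each `#` is one observation, so the shape of the distribution is visible at a glance.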
Then you need to move on from that. You need to know some advanced algorithms. They come in handy. I’ve had to figure out how to compare multiple time series to observe the differences between a normal day and an abnormal one. This sort of analysis gives you similarity matrices. Check out this podcast episode to see how to do this.
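One simple way to build such a similarity matrix is to compute the Pearson correlation between every pair of series. That measure is my assumption here; the post does not say which one was used, and others (such as dynamic time warping) may suit real data better. The three short "days" below are invented numbers.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def similarity_matrix(series):
    """Entry [i][j] is the correlation between series i and series j."""
    return [[pearson(a, b) for b in series] for a in series]


# Hypothetical traffic counts for three days, sampled at the same six times.
normal_1 = [10, 12, 30, 45, 40, 20]
normal_2 = [11, 13, 28, 47, 41, 19]
abnormal = [40, 42, 15, 10, 12, 44]

matrix = similarity_matrix([normal_1, normal_2, abnormal])
for row in matrix:
    print(" ".join(f"{value:6.2f}" for value in row))
```

The two normal days correlate strongly with each other, while the abnormal day stands out with negative correlations against both.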
These things can get quite unruly, so you need to know linear algebra and how to apply it to a sparse matrix. Again, this is going to require coding chops, but it also implies a certain comfort with higher-level mathematics. Remember, humans came up with these mathematical ideas. So they are intelligible; in fact, they are often quite beautiful, if you know how to look at them. I wouldn’t bother with calculus, at least not initially. Find a book on linear algebra that you like. I would curl up with that thing and just keep playing around until almost everything that you see starts to look like a matrix or a vector. Once you have started to think like that, you have started to think like a data scientist.
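To show what applying linear algebra to a sparse matrix means in code, here is a minimal sketch of a matrix-vector product that stores only the nonzero entries. The dictionary-of-keys layout and the name `sparse_matvec` are illustrative assumptions; in practice a library such as SciPy provides tuned sparse formats.

```python
def sparse_matvec(entries, vector, n_rows):
    """Multiply a sparse matrix, stored as {(row, col): value}, by a dense vector."""
    result = [0.0] * n_rows
    # Only stored (nonzero) entries contribute, so the work scales with the
    # number of nonzeros, not with rows times columns.
    for (row, col), value in entries.items():
        result[row] += value * vector[col]
    return result


# A 4x4 matrix with just three nonzero entries.
entries = {(0, 0): 2.0, (1, 3): -1.0, (3, 1): 5.0}
product = sparse_matvec(entries, [1.0, 2.0, 3.0, 4.0], n_rows=4)
print(product)  # [2.0, -4.0, 0.0, 10.0]
```

For a similarity matrix where most pairs are unrelated, skipping the zeros like this is what keeps both memory use and run time manageable.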
In short, being a data scientist is simple. Write more code, to do more things, than a statistician would. And do more statistics than a coder would. When you hit this happy medium, guess what? You are a data scientist.