Data science is a huge, ever-expanding field. It deals heavily with math, computers, statistics, and all sorts of other disciplines. Jumping into data science often feels overwhelming, because frankly, it is. I often hear questions like this one from my grad school buddy, Nii Amon Neequaye, PhD: “… do u have some good data science resources or materials to recommend(?)” Now Nii is no laggard when it comes to statistical thinking, reasoning, calculation, etc. In fact, I count him as one of the smartest people I have ever met. No joke, he is crazy smart. So if he is feeling a little lost in the subject, then just know that you are in good company.
I think the biggest problem is that there is so much to learn, and that you need to build a sort of mental map of how different concepts fit together to create the world of data science. Unfortunately, committing my own personal mental map of data science to a blog post would do it a great disservice, mostly because that map is constantly in flux. Concepts come together and pull apart to accommodate new knowledge, techniques, and ideas. Instead, here are some resources that I have used to pick up the knowledge that I have (minus grad school).
Also, I want to point out that when I get a question like this, I usually suspect that the person is really asking me about the sexy part of data science, machine learning. They usually don’t want to hear about where the real work in this field happens, which is data cleaning and preparation. And people like Nii already deal with data all the time; I know that he knows how to prepare data for analysis. So the resources I point them to revolve around machine learning.
1. SciKit-Learn’s Documentation
I’m not even joking: the documentation for scikit-learn is fantastic. It is comparable in breadth and depth to what you would expect from a professional, paid statistical package’s documentation. If you have ever dealt with Stata and its manual, the similarities are striking.
What is great about the scikit-learn documentation is that it shows you examples of how to code what you are looking at, combines that with advice for when it might make sense to use a particular algorithm, and links to the original papers that published the algorithms.
I like that sort of combination of things.
First, it shows you how to actually run the algorithm. This is important, because if you can figure out how to run the algorithm, you can blindly use it. That is mostly bad, but there is some good in it: you can expose yourself to the algorithm and get a feel for its properties. I don’t think it makes a lot of sense to do that with anything that needs to go into production, but for familiarizing yourself with an algorithm, nothing beats hands-on experience.
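For instance, here is a minimal sketch of what “blindly running” an estimator looks like. The dataset and model choice are mine, purely for illustration; the point is the fit/predict/score pattern that nearly every scikit-learn estimator shares:

```python
# A minimal sketch of "just running" an estimator, using the classic
# iris dataset that ships with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit, then score -- the same pattern works for almost every estimator
# in the library, which is what makes this kind of experimenting cheap.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on held-out data
```

Swap `LogisticRegression` for another estimator and the rest of the code stays the same, which is exactly what makes it so easy to get a feel for an algorithm’s properties.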
Second, it gives advice. I use the term advice with caution, mostly because the documentation doesn’t explicitly say things like “use logistic regression if your problem has properties X and Y but not Z; if it has property Z, use an SVM instead.” Rather, the advice comes in the form of things like benchmarking one algorithm against another on a particular task. This is surprisingly helpful for figuring out whether a particular approach is likely to work.
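You can mimic that benchmark-style comparison yourself in a few lines. This is a hedged sketch; the dataset, the pair of estimators, and the fold count are my choices, not recommendations from the documentation:

```python
# Score two algorithms on the same task and let the numbers do the talking.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                  ("SVM", SVC())]:
    # 5-fold cross-validated accuracy for each classifier
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```

On a real problem you would compare on the metric you actually care about, but the shape of the exercise is the same.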
Last, the documentation links back to the original papers where an algorithm was first published. I really like this, because it lets you dive deep into the guts of an algorithm and figure out what is going on under the hood. You don’t need to implement it from scratch, but once you really understand it and what is going on, you can make informed decisions about which algorithm you should use.
That’s why reading scikit-learn’s documentation is a great way to get a feel for what is going on in data science. You will also obviously learn new techniques. I’m in there almost daily, reading that stuff.
2. Coursera’s Machine Learning Course
Andrew Ng made this course, and it is amazing. I took it back in 2013, I think. I was in grad school back then, working on my dissertation, and I used the course as an excuse to procrastinate. It worked, but it also taught me techniques that economists just don’t use: regularization, support vector machines, tree algorithms, etc. It seemed like machine learning folks were bastardizing the maximum likelihood estimators that I had spent the last few years studying and thinking about. Later, I realized that regularization just adds a constraint, like a Kuhn-Tucker condition, to the model, and that eased my stomach a little, since I can reasonably come up with some sort of economic/intuitive meaning for the regularization.
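To make that Kuhn-Tucker connection concrete, here is a sketch of the standard equivalence. The notation is my own, not the course’s: $\ell(\beta)$ is the log-likelihood, $\lambda$ the penalty weight, and $t$ the constraint level.

```latex
% Penalized maximum likelihood (ridge-style L2 regularization):
\hat{\beta} \;=\; \arg\min_{\beta}\; -\ell(\beta) \;+\; \lambda \lVert \beta \rVert_2^2

% For an appropriate t depending on \lambda, this is the Lagrangian form
% of a constrained MLE, with \lambda as the Kuhn--Tucker multiplier:
\hat{\beta} \;=\; \arg\min_{\beta}\; -\ell(\beta)
\quad \text{subject to} \quad \lVert \beta \rVert_2^2 \le t
```

Read that way, the penalty is just a budget on how large the coefficients are allowed to get, which is the intuitive meaning I eventually made peace with.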
That being said, the nice thing about this resource is that it forces you to write these algorithms from scratch. That way you get a pretty deep understanding of how the different algorithms work. I think that is crazy valuable and easy to miss. Also, if you know some linear algebra, you can really impress people by doing some of the exercises in a single line of code.
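In that spirit, here is a hedged one-liner example, in NumPy rather than the Octave/MATLAB the course uses: ordinary least squares via the normal equations. The simulated data and coefficient values are made up for illustration:

```python
import numpy as np

# Simulate a small regression problem with known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_beta = np.array([1.0, -2.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=100)

# OLS in a single (vectorized) line, solving the normal equations
# (X'X) beta = X'y instead of looping over observations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```

One line of linear algebra replaces what would otherwise be an explicit loop over gradient-descent updates, which is exactly the kind of trick the exercises reward.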
3. Machine Learning Mastery
I really like this website by Jason Brownlee! In fact, many of my own blog posts have been patterned after some of his. What I really like about this blog is that Jason takes the time to show you how to wire up neural networks without getting into the weeds on the math, stochastic gradient descent, or the theory. His basic approach is: grab some data, wire the algorithm up, and run it.
He doesn’t worry about tuning the algorithms, wring his hands about architecture, or fret over whether the data could be modeled a different way. His main goal seems to be showing you how to do things on your own. I really like that. He also lets computers do what they are good at: crunching numbers. That way you get what you came for, which is learning how to wire things up and use them in practice, even if it isn’t a Google-, Amazon-, or IBM-level project and a neural net might be overkill.
If you want to just learn how to wire up a neural net, you just can’t beat this resource.
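Just to illustrate what “wiring up a neural net” can look like, here is a minimal sketch using scikit-learn’s MLPClassifier rather than the Keras code Jason typically uses. The dataset, layer size, and iteration count are arbitrary choices of mine, not tuned at all, which is rather the point:

```python
# Wire up a small neural net on the built-in digits dataset:
# grab data, fit, score. No tuning, no hand-wringing about architecture.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# One hidden layer of 32 units; everything else left at its defaults.
net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
net.fit(X_train, y_train)
print(net.score(X_test, y_test))  # held-out accuracy
```

A net this small is overkill for digits, but the exercise gets you over the “I’ve never actually run one of these” hump, which is what this resource is for.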
4. Deep Learning Book
This is a really neat book. It was written as a textbook for graduate CS students interested in machine learning, and specifically in deep learning. As of writing this post, I am working my way through it. You can pick up a copy on Amazon.
But this book is pretty spendy. So if you don’t want to buy a copy, I would highly recommend reading it online for free. That’s right, it is 100% online and FREE to read; all you have to do is go to deeplearningbook.org. Where Jason Brownlee focuses on just wiring these things up, Ian Goodfellow et al. go in the other direction. So if you want to do a deep dive into neural networks, this is the essential guide.
Note: I’m not reading the book online; I like the physicality of having a book, so I picked up a copy. Just a thought.
5. Becoming a Data Scientist
Renee Teate is amazing. She interviews data scientists about what it is like to make a transition into the field, and she also posts tutorials and guides on data science and machine learning. I’m a super fan. I found her through her podcast, and then I quickly got interested in her data science learning group. Her site DataSciGuide offers tutorials and guides that she finds useful. She tweets regularly, and you can follow her @BecomingDataSci. Frankly, where does she get the energy to produce so much content?
What she’s putting out there is pure gold. I think you would do well in your career to follow Renee and learn to be a data scientist alongside her. She will set you on a good course, and you will have a great career if you just dive into what she has produced.
There you have it: 5 simple resources you can use to learn data science. My only caveat is that if you really want to learn this stuff, you should just dive in and start doing it. I think that learning by doing is the only way to make progress in this field. You can check out my other blog post on how to get started in a data science career for further thoughts on how to go about learning to be a data scientist.