This post is all about causality. In data science we are often concerned with simple correlations. As someone trained in econometrics, as any one can tell you, the key thing that you should be concerned with is causality. We are often worried not about correlations; correlations lie.
We want to get at causality. Causality is the basis of prediction. If you have a causal relationship, thing A causes thing B to happen. You now have a prediction engine for thing B. Assuming that one would want to predict thing B.
In data science, and mainly in machine learning, what we care about more often than not is some form of prediction. So your strategy should always be to derive a causal relationship. “If I increase advertising spend, sales will go up.” That is a causal relationship. The noisy world though, you need to ask not only direction but magnitude to be useful. This is where things get interesting. That’s because causal relationships imply a correlational relationship, and correlations lie. So mis-measuring the size of the causal relationship becomes very difficult indeed.
By construction, a regression (or classification type regression such as logistic regression) is designed to assume a causal relationship. This is why we get away with the correlational prediction that they supply. We may be able to predict with these models, but we can’t trust individual parameters to be their correct (i.e. causal) values. They just don’t get disentangled enough from their dependent variables.
So you can do 2 stage least squares, panel regressions, time-series regressions, etc. to derive a causal relationship. But those methods each present complexity, and problems.
So here is some advice on strategy. Define clearly what your stakeholders want from their data science project. Are they looking for knowledge, or are they looking for predictions. I think that is why you have started to see Kaggle competitions for not only predictions but for the best notebooks that generate the most unique insights. From a strategic point of view, you need to understand what type of output you expected from your project.
Therefore, the first question I ask when starting a new project is: “Are you trying to learn something about your users, or are you trying to predict something about your users?” We need much more sophisticated tools than the blunt instrument of machine learning to come up with a satisfactory solution when you want to learn something.