Learning XOR with Only Linear Regression (Yeah it’s Possible)

Alright, so that title isn’t very clickbaity unless you are a real data nerd. Linear models do terrible at learning XOR. Simply because XOR is highly non-linear. It requires a non-linear model to learn it. And my model is non-linear too. But the only model that I use is a linear regression, straight up OLS. My trick is that I use an ensemble of OLS regressions.

I know, that’s like cheating to say that I’m saying that I am “only” using Linear Regression. I suppose the more appropriate way to say it is that I am only using Linear Regressions combined using stacking. But I digress.

So let’s start out with what is XOR anyway? XOR is a type of logic gate that you need to have to build a computer. Computer scientists have known about XOR for quite some time, like 60 or 70 years at this point. It takes two inputs, both of them being bits. In other words, both of the inputs can be a 0 or a 1. It also outputs a 0 or a 1. So there are only 4 distinct inputs, and 2 distinct outputs. Here it is in mathematical terms, the function f is the XOR function.

f(0,0)=0\\f(1,0)=1\\f(0,1)=1\\f(1,1)=0

And that’s it really. That is the whole XOR function. Two bits are in the off or on position output a bit that is off, if exactly 1 input is on, output a bit that is on.

Let’s try to model it with Linear Regression

Okay, let’s model it. First we need to create the data, and load the libraries that we’ll need. First off, we’ll be using sklearn (a couple of parts of it anyway), numpy, and pandas. So here’s the code to get them into python.

from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression,LogisticRegression
from sklearn.metrics import confusion_matrix
import numpy as np
import pandas as pd

And now we need to get some data. Fortunately, there are only 4 data points that describe the entire function, so we can hand code them. We’ll also throw them into a pandas dataframe just so that we can use them later.

X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
y = np.array([0, 1, 1, 0])
X = pd.DataFrame(X)
X['depvar'] = y

Great, now we have our dataset, and all the libraries that we need to do a linear regression. Let’s run OLS against this data and see what we get.

model=LinearRegression()
model.fit(X[[0,1]],X[‘depvar’])
print(model.predict(X[[0,1]]))

So it ran just fine. And here is the output:

[ 0.5  0.5  0.5  0.5]

Well, that sucks. The way that I interpret this is that the model decides there is a always a 50% probability that the output should be on. That’s not very helpful when you are trying to make decisions, essentially, you don’t know. Flip a coin. It’s going to be just as accurate. Sad day for OLS, it can’t do what it is supposed to do. It is just too weak for this problem, it can’t learn the non-linearity.

Ensembles Will Make it Better

So the approach that I am going to take is that of ensembling. There are a number of other ways that you can approach this problem. Introducing non-linear terms, a cubic would be able to fit XOR, for example. You could apply some sort of kernel to do an SVM type thing, maybe. You can run it through a neural network. But all of those take us away from simple models. Let’s give ensembling a try.

First we need to separate the data set into a number of folds. I could cheat a little bit and learn 4 folds, one for each output, but that’s boring. We’ll just do 2 folds. The code below, will separate the data into 2 random folds.

kf = KFold(n_splits=2,shuffle=True)
kf.get_n_splits(X)
i=0
for train_index, test_index in kf.split(X):
   for j in test_index:
       X.loc[X.index == j,'fold'] = int(i)
   i+=1

This code adds a column to the dataset which indicates which fold an observation belongs to. This will be important when it is time to fit models for stacking. Okay, now in order to do proper stacking we’ll need to create a copy of the dataset where we can store estimates from our models in the stacking procedure. We’ll call it meta_X:

meta_X = X.copy()
meta_X['model1']=None

The model1 variable is where we’ll store the output from the linear regressions that we’re going to be using as base models.Now we need to loop over all of our folds, and split the data into training and test sets. We’ll train a model on data in the training set and use that trained model to predict the test set. These predictions, the ones on data that the model in that fold hasn’t seen will be used to fill in the model1 variable in the test set. I feel like this paragraph has been crazy convoluted. Hopefully, some code will clear things up if I managed to confuse you. So check out this for loop, it is essentially what I was trying to describe in this paragraph:

for idx in X['fold'].unique():
    train=X[X['fold'] != idx]
    test=X[X['fold'] == idx]
    model1 = LinearRegression()
    model1.fit(train[[0,1]],train['depvar'])
    print(model1.coef_)
    meta_X.loc[test.index,'model1'] = model1.predict(test[[0,1]])

Okay, so I’ve filled in the model1 variable. All is looking good, except that I still don’t have predictions from the model. The way that I solve this problem, is by running a linear regression over the whole dataset. But instead of using my original data, I use the predictions from my base models.

model_all = LinearRegression()
model_all.fit(meta_X[['model1']],meta_X['depvar'])
print(model_all.predict(meta_X[['model1']]))

Which gives the following output:

[ 0.  1.  1.  0.]

Hey, that’s not too shaby.

What Is Going On Here?

I actually look at this problem, and it looks a lot like a multi-layer perceptron to me. I have a hidden layer, an input layer and an output layer. The only difference is that instead of running the algorithm from end to end, I was in the loop making decisions about how the weights and layers would work together. I think that I was manually building out a neural network, and building out activations, etc. But it works, and I would call it distinctly different from a neural network. But it does make a nice example of how an ensemble of linear models can produce very non-linear outputs.