So it’s time for a fun little announcement: my wife and I are having a baby, and we’re so excited! Having a baby presents a novel problem for every parent every time. And that is: “We’re going to have a little human being running around, and that human is going to need a name.” It strikes me as a little weird every time that I have to give a name to anything. Just think about it: this is the name that this person is going to be known by forever. It is an awesome and somewhat ridiculous responsibility. It is a thing that we do out of necessity. After all, we have to distinguish ourselves from one another. Here’s the thing: as a parent, I’m pretty much clueless about what would make a good baby name.
That’s why I decided to help myself in the way that only a data scientist would think to help themselves out. I decided to build a neural network that predicts the popularity of a name. This is actually a really silly way to come up with names, because it won’t tell you whether or not the name is good. It will just tell you whether or not the name has the characteristics of a popular name. But hey, who cares? Running the network is a fun way to distract myself from the fact that I will soon be a sleep-deprived daddy. So we’ll build out a network.
Before we go hog wild building a neural network that will tell us whether or not a name is good, we’ll need data that can be used to train such a network. Fun fact: the government maintains this exact data. They even brag that they have a 100% sample of every name on Social Security applications for newborns. So this is the data that we’ll use to train our network. You can get it here.
Okay Let’s Code
So the first thing that we’ll need to do is to import Keras. Now, my Keras is using Theano. I have it set up that way because my laptop’s operating system is 32-bit Ubuntu, and for some reason TensorFlow doesn’t like that. So anyway, we need to import Keras into the workspace, which will also load its backend. We’re also going to need pandas and numpy.
from keras.layers import LSTM, Dense, Dropout
from keras.models import Sequential
import pandas as pd
import numpy as np
Next we’ll import the data. From the link above I just took the most recent year, but you can go crazy and discount popularity, add the year, whatever you want to do. We’ll keep it simple for this tutorial though. I’m also going to mix up the entries, because they are in order from most popular to least popular for girls and then most popular to least popular for boys. Mixing it up will help the neural network learn faster. Otherwise it takes a long time to learn anything but the average. I’m not sure why that helps, but it does.
df = pd.read_csv('/home/ryan/Desktop/yob2016.txt', header=None)
df = df.sample(frac=1)
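For orientation, here is a small sketch of the layout that the loading code relies on: one record per line, no header row, in the order name, sex, count. The three rows below are made-up placeholders, not real figures from the file, and the sketch uses the standard csv module instead of pandas so that it runs without the data file on disk. It is only meant to show why column 0 and column 2 are the ones that get used later.

```python
import csv
import io

# Made-up placeholder rows in the name,sex,count layout.
# These are NOT real values from yob2016.txt.
sample_text = "Aaa,F,300\nBbb,F,200\nCcc,M,100\n"

rows = list(csv.reader(io.StringIO(sample_text)))

# Column 0 holds the name and column 2 holds the count,
# which matches the df[0] and df[2] lookups used in this post.
names = [row[0] for row in rows]
counts = [int(row[2]) for row in rows]

print(names)
print(counts)
```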
So we have names and their popularities at this point. What features are we going to extract to train the model? My thought was that we could use the letters in the name itself. This is kind of a fun idea, because later we can have a network that learns the next appropriate letter and generates names from a seed. But that isn’t what we’re doing today. I’m thinking that we can do that to generate a name, and then use this network to tell us how good that name is.
So let’s write some functions to prepare the data to do that. The first thing that we need to do is to tell the neural network which letters comprise a valid alphabet. This is a surprisingly difficult task when you are coding a neural network for, say, imitating Shakespeare. That’s because you need to include all of the punctuation, plus numbers for acts and lines. Fortunately, we have English names, which do not have punctuation, so we can get by with a standard set of English letters and a space for padding the sequence.
chars = sorted(list(set('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz ')))
char_to_int = dict((c, i) for i, c in enumerate(chars))
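To see what that lookup table actually contains, here is a quick sketch that rebuilds the same alphabet and prints a few entries. Sorting puts the space first, then the uppercase letters, then the lowercase letters, so the padding character ends up with code 0.

```python
# Same alphabet as in the post: a space, A-Z, then a-z.
chars = sorted(set('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz '))
char_to_int = {c: i for i, c in enumerate(chars)}

print(len(chars))        # 53 symbols: 26 upper, 26 lower, 1 space
print(char_to_int[' '])  # 0, because the space sorts before every letter
print(char_to_int['A'])  # 1
print(char_to_int['a'])  # 27
print(char_to_int['z'])  # 52
```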
Finally, we need to use this alphabet to encode each of the names as a sequence for the neural network. We do this with a helper function. This function encodes a string as a sequence of 15 characters. If there are not enough characters in the string, it will pad the end of the string with blank spaces until we get to 15 characters.
def namer(x):
    num = 15-len(x)
    return [char_to_int[char] for char in x+' '*num]
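Here is a small sketch of what the helper produces for one name. The input 'Emma' is just an illustrative choice. One thing to watch: a name longer than 15 characters would come back longer than 15 entries, so the sketch also includes a hypothetical variant, namer_fixed, which is my own addition and not part of the original code. It truncates as well as pads.

```python
chars = sorted(set('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz '))
char_to_int = {c: i for i, c in enumerate(chars)}

def namer(x):
    # Pad the name with spaces on the right up to 15 characters.
    num = 15 - len(x)
    return [char_to_int[char] for char in x + ' ' * num]

def namer_fixed(x, length=15):
    # Hypothetical variant (an assumption, not from the post):
    # cut overly long names, then pad short ones, so the
    # result always has exactly `length` entries.
    return [char_to_int[char] for char in x[:length].ljust(length)]

encoded = namer('Emma')
print(encoded)
# [5, 39, 39, 27, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

print(len(namer('A' * 20)))        # 20, longer than the network expects
print(len(namer_fixed('A' * 20)))  # 15
```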
Now that all the pieces are in place, we’ll run this against our data to clean it up so that it is presentable to the neural network.
X = np.array([namer(obj) for obj in df[0]])
X = X.reshape((len(df),15,1))
y = np.array(df[2])
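The reshape is there because a Keras LSTM layer expects input with three dimensions: samples, timesteps, and features per timestep. Here each name is a sample, each of the 15 character positions is a timestep, and the single feature is the character code. A minimal sketch with three illustrative names standing in for df[0]:

```python
import numpy as np

chars = sorted(set('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz '))
char_to_int = {c: i for i, c in enumerate(chars)}

def namer(x):
    num = 15 - len(x)
    return [char_to_int[char] for char in x + ' ' * num]

# Illustrative names standing in for the df[0] column.
names = ['Emma', 'Noah', 'Ava']

X = np.array([namer(name) for name in names])
print(X.shape)  # (3, 15): one row of 15 codes per name

X = X.reshape((len(names), 15, 1))
print(X.shape)  # (3, 15, 1): samples, timesteps, features
```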
Okay, the Data Is Clean, Build a Neural Network!
I used a really dumb architecture. There is an LSTM layer with a dropout layer attached. During training, the dropout layer randomly zeroes 20% of the LSTM’s outputs on each update. This is done as a regularization procedure to keep the network on its toes about what it is learning. We then feed that into a hidden layer, which feeds a single output unit that produces the prediction. Here’s the code:
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(256))
model.add(Dense(1))
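If the dropout step feels mysterious, here is a rough plain-Python sketch of the idea. This is not how Keras implements it internally; the function name and the use of a seeded random generator are my own assumptions for illustration. Each value is zeroed with probability 0.2, and the survivors are scaled up by 1/0.8 so the expected total stays about the same.

```python
import random

def dropout(values, rate, rng):
    # Illustrative sketch only, not the Keras implementation.
    # Zero each value with probability `rate`; scale the rest
    # by 1 / (1 - rate) so the expected sum is unchanged.
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(0)   # seeded so the run is repeatable
activations = [1.0] * 10
dropped = dropout(activations, 0.2, rng)
print(dropped)
```

At prediction time the dropout layer does nothing, which is why it only affects training.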
I train the network to minimize the mean absolute error. I like this loss function because it is very interpretable. Essentially, it says how far off my predictions are on average. In the case that we’re looking at, that means: on average, by how many people giving their child that name are my predictions off? That last sentence isn’t a great sentence, but it’s the best that I could do. English teachers, forgive me. We then fit the model.
model.compile(loss='mae',optimizer='adam')
model.fit(X,y)
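To make the mean absolute error a little more concrete, here is a tiny hand-rolled version with made-up numbers. The counts below are placeholders, not model output, and the function is my own sketch and not the Keras loss itself.

```python
def mean_absolute_error(y_true, y_pred):
    # Average of the absolute differences between truth and prediction.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up counts for three names, purely for illustration.
actual = [100, 50, 10]
predicted = [90, 70, 10]

print(mean_absolute_error(actual, predicted))  # 10.0
```

The errors are 10, 20, and 0, so on average the predictions are off by 10 babies per name.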
I let this bad boy run for a couple of days, letting it go for something like 32,000 iterations, and it started to produce some reasonable results. I think it needs more. It would go faster on a GPU, but I am cheap and running this on my old laptop. I will package it up and post something that you can play with later.
Here’s The Full Code
from keras.layers import LSTM, Dense, Dropout
from keras.models import Sequential
import pandas as pd
import numpy as np

# Load the names file (no header row) and shuffle the rows.
df = pd.read_csv('/home/ryan/Desktop/yob2016.txt', header=None)
df = df.sample(frac=1)

# Alphabet: a space for padding plus upper and lower case letters.
chars = sorted(list(set('ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz ')))
char_to_int = dict((c, i) for i, c in enumerate(chars))

# Encode a name as 15 integer codes, padded with spaces.
def namer(x):
    num = 15-len(x)
    return [char_to_int[char] for char in x+' '*num]

# Column 0 is the name, column 2 is the count.
X = np.array([namer(obj) for obj in df[0]])
X = X.reshape((len(df),15,1))
y = np.array(df[2])

model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(256))
model.add(Dense(1))

model.compile(loss='mae',optimizer='adam')
model.fit(X,y)