Since pandas 1.0 was released earlier this year, I thought it would be a good opportunity to discuss some of the nicer features in pandas that you might not know existed and that I have found very useful. One of the most useful things I have learned about pandas is that it has robust tools for dealing with data stored as strings. At first this may seem like a trivial thing for a data scientist to concern themselves with; however, most of data science is moving data around, transforming it, and in general preparing it for analysis. In fact, in my career I have found that these skills are more sought after in job interviews than advanced neural network techniques. The ability to take a string and extract meaningful data from it is the skill I most often see aspiring data scientists lack when I post a job as the hiring manager. Hopefully, this post goes a long way toward helping you learn how to deal with strings in your dataset.
I want to take the time to show you three things that pandas can do to make your life easier as a data scientist when you need to deal with strings.
Turning strings into categories
By far the most common way to deal with a string is to turn it into a categorical variable. Categorical variables are often stored as plain strings; however, they should not be left in a dataframe as strings for analysis. That is a real problem, because a given algorithm may treat a categorical variable very differently than it treats a raw string.
A good example of this is category embedding in a neural network. Until I learned this technique, my neural networks generally suffered because I tried to one-hot encode high-cardinality categorical features. That procedure generated an extremely sparse feature matrix that made my neural networks very brittle.
Furthermore, if new categories showed up that were not in my training set, I either had to go back and retrain my model or ignore the input entirely. Obviously, that is not an ideal situation to be in. By learning to embed categories into a dense feature vector, I immediately saw a large boost in the performance of these models. The catch is that most neural network libraries, like Keras, TensorFlow, or PyTorch, require you to index the categories with integers. Until I learned this trick in pandas, I would do silly things like build a dictionary that mapped strings to integers, and then use a lambda function to create a new pandas series using that dictionary.
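To make that concrete, here is a sketch of the kind of thing I used to write (the data and column names are made up for illustration):

import pandas as pd

df = pd.DataFrame({'myStringVar': ['red', 'blue', 'red', 'green']})

# Build a string-to-integer mapping by hand...
mapping = {value: code for code, value in enumerate(df['myStringVar'].unique())}

# ...and apply it with a lambda to get integer codes.
df['myCategoryVar'] = df['myStringVar'].apply(lambda s: mapping[s])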
No longer do I do silly things like that, because I found, and fell in love with, pandas' elegant category dtype and its .cat accessor. Here is how simple it is to use. You have a dataframe with a column of strings that represent a category. All you have to do is something similar to:
df['myCategoryVar'] = df['myStringVar'].astype('category')
df['myCategoryCodes'] = df['myCategoryVar'].cat.codes
It is faster to write and easier to maintain, and it is far more understandable when you read the code than trying to figure out what your lambda function was doing. I spent some time last year fixing my entire codebase to use this handy little feature of the pandas library.
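Here is the same idea end to end, as a minimal sketch with made-up data, so you can see the integer codes that an embedding layer wants:

import pandas as pd

df = pd.DataFrame({'myStringVar': ['red', 'blue', 'red', 'green']})

# Convert the string column to pandas' categorical dtype.
df['myCategoryVar'] = df['myStringVar'].astype('category')

# The .cat accessor exposes an integer code for each row,
# which is what the embedding layer indexes on.
df['myCategoryCodes'] = df['myCategoryVar'].cat.codes

print(df)
#   myStringVar myCategoryVar  myCategoryCodes
# 0         red           red                2
# 1        blue          blue                0
# 2         red           red                2
# 3       green         green                1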
Extracting a pattern from a string
I have found that one of the most useful things I have learned during my career as a data scientist has been the ability to use regular expressions. It is kind of funny to me how many different problems come down to the data you need being stored halfway through a long string of text. You see this all the time. A good example I came across recently: a customer's account number was tucked away in an invoice stored as an HTML page in the database. I kid you not, the full HTML! Oh well, the account number is stuck in the middle of the page, so I guess it was gone forever.
Regular expressions to the rescue. Each invoice had the exact same structure. The lengths may have differed, as some invoices had more products than others, but the structure was the same. Now, my mission, should I choose to accept it, was to extract all of the requisite data from the entire HTML string and turn it into a nice dataset. These are the things that no one tells you about when you sign up to be a data scientist. So rather than go through hell trying to get SQL to parse these HTML strings, I simply pulled them into a nice pandas dataframe. Pandas has some really good features for extracting parts of a string using regular expressions, chief among them the extract method. You can use it like this:
df['numericalVar'] = df['myStringVar'].str.extract(r'(some RegEx pattern)', expand=False).astype('float')
Look at how simple that is! Let me break down what this little snippet does. You select the column, reach for the .str accessor, and call the extract method. The extract method needs a valid regular expression pattern with at least one capture group; whatever the group matches is what gets stored in your new variable. You can then cast the result to a type more appropriate than yet another string, like a float. It will do this for every single string stored in your variable.
It is such a useful tool. I probably find a place to use this little feature in pandas almost every week. It isn't that the data I'm working with is horrible. Usually, it is great data. There are just so many times when the piece you need is a small snippet inside a larger string. I'll give you another example. The database I was working with had a description for products. Some of these products had a time frame associated with them. Think memberships that would expire after a certain period of time. The description contained things like "1 month membership" and "36 month membership". To calculate how long someone had until their membership expired, we had to extract the numbers from this string. It was a product database; no one ever thought we would need to calculate membership expiration dates from it down the road.
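To make the membership example concrete, here is a minimal sketch (the column name and pattern are just illustrative):

import pandas as pd

df = pd.DataFrame({'description': ['1 month membership',
                                   '36 month membership']})

# The capture group grabs the digits in front of "month";
# expand=False returns a Series instead of a dataframe.
df['months'] = df['description'].str.extract(r'(\d+)\s*month', expand=False).astype('float')

print(df)
#            description  months
# 0   1 month membership     1.0
# 1  36 month membership    36.0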
I don’t know a battle-tested, true data scientist who doesn’t have at least a few stories exactly like this one.
But going back to the invoice problem for just a few moments. How do you go about extracting each line item’s data from a mess of a string? Pandas comes to the rescue once again with a handy feature similar to extract: the ever useful extractall method.
It is so similar, in fact, that I hesitated to include it in this article at all. It does the same thing, but it returns every match that the regular expression pattern found in each string. The difference, of course, is that the extract method pulls only the very first match it finds, while extractall gets every last occurrence. The price you pay is that you have to deal with getting multiple matches back per string: extractall hands you a dataframe with one row per match, indexed by the original row and the match number, not just a single number.
Usually, I just handle this with a simple for loop over the matches for each row. But sometimes, depending on your situation, you may need to get fancier than that, or possibly even simpler. I really wouldn't be able to say without looking at the specific circumstances.
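Here is a rough sketch of the for-loop approach, assuming invoice-like strings with dollar amounts buried in the markup (the data and pattern are made up for illustration):

import pandas as pd

df = pd.DataFrame({'invoice': ['<tr>Widget</tr><tr>$10.00</tr><tr>Gadget</tr><tr>$5.25</tr>',
                               '<tr>Doohickey</tr><tr>$7.50</tr>']})

# extractall returns one row per match, with a MultiIndex of
# (original row, match number).
amounts = df['invoice'].str.extractall(r'\$(\d+\.\d{2})')
amounts[0] = amounts[0].astype('float')

# A simple loop over the matches belonging to each original row.
for row, group in amounts.groupby(level=0):
    print(row, group[0].tolist())
# 0 [10.0, 5.25]
# 1 [7.5]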
Using contains to build labels
This trick is really useful when I am trying to build labels for a machine learning model. Often, I see systems where a user is able to leave a note; I think this is pretty common. In my line of work at my day job, I use data science and machine learning to detect fraud. Oftentimes the notes that a front-line employee leaves will give clues as to whether I should mark an account as fraud. Those clues don't always show up anywhere else in the data.
Sometimes, I may see notes like, “I verified that this was a case of identity theft fraud. So I did not open the account.” These are valuable cases that need to be included in my model. The fraudster has supplied me with rich data to detect him, or people like him, in the future. As such, I want a way to mark these records as being frauds.
The contains method helps out enormously. All you have to do in this situation is call it:
df['fraud'] = df['note'].str.contains('fraud')
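In practice I usually pass a couple of extra arguments: case=False so the match is case-insensitive, and na=False so missing notes come back as False rather than NaN. A minimal sketch with made-up notes:

import pandas as pd

df = pd.DataFrame({'note': ['Confirmed FRAUD on this account',
                            None,
                            'routine address update']})

# case=False ignores capitalization; na=False turns missing
# notes into False instead of NaN.
df['fraud'] = df['note'].str.contains('fraud', case=False, na=False)

print(df)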
Now, this method of creating labels is not perfect. You can imagine any number of counter-examples where it will fail you in the most spectacular fashion, so you need to be careful. I'll give you one example. Suppose you ran the above line of code on a note that read, "I'm pretty sure this is not fraud". You would, of course, mark this record as fraudulent. Negation in natural language is a horrible thing to try to deal with. A simple way to handle this problem is to keep using the contains method. You could write something like:
df['fraud'] = (df['note'].str.contains('fraud') & ~df['note'].str.contains('not'))
This will work, but the astute reader will notice that it breaks down once again, this time on my original example, which also contained the word "not". The point I am trying to make is that although it is possible to use contains in this way, you need to be very careful about how you use it.
My general rule of thumb is to never use the contains method on a series of strings that were input by a user as free-form text. If the strings are system generated, say a bunch of letters that encode different properties of a product, or a reference to a time frame, contains works great. With user-entered free-form text, though, you can quickly go down a rabbit hole of counter-examples to counter-examples to counter-examples. My recommendation would be to stay away from those.
These three methods of dealing with strings within pandas are incredibly powerful. I use them all the time in my day-to-day work. Talking about these things isn't the most glamorous data science you will do. However, out in the real world, these little tips and tricks, at least in my experience, are what separate really great data scientists from neophyte data scientists. So let me end this article on a little bit of a soapbox.
An experienced data scientist will spend most of their time getting the data right. They will ignore the siren call of complex, esoteric machine learning models to improve their results. Most of the time, they are dealing with all of the crap in the database: they clean the data, groom the data, and get the data ready. In a lot of ways, data science is like a dog show. Most of the work happens long before the main event: cleaning, grooming, and teaching the star of the show how it needs to behave, so that you can put the best face forward when it is time to be judged. For dogs, a human will determine which is the best specimen. Similarly, for data science, the algorithm will determine which data points are relevant. So spend your time learning to clean and groom your data.