You know, usually, I sit here and I just take things for granted. I can import data into pandas to be manipulated. It is something that just is. Well, here’s the thing. Somebody wrote that code. Frankly, they are a hero. I couldn’t imagine doing my job without pandas and if I had to jump through hoops to make it work, that would be awful. So here are a couple of ways to get data into a pandas dataframe.
Scrap It!
Yep, if there’s a website out there with the data formatted in a nice table, you can probably just scrap it directly from the web. The way you do this is by using the read_html method. This one is great because you literally just have to pass a url to the function and it will lift every html table from that page, and return them to you as nice pandas dataframes. You might have to figure out which of the tables is the one that you want, but you don’t need to deal with messy data downloads, javascript impersonation, or anything too hard. It is almost as easy as point and click.
Read From a Text File
This is probably the most common thing that I do. Once you’ve got a text file, all that you really need is to figure out how it is formatted. Text files come in all different styles. The most common is a csv (comma separated), followed by tsv (tab separated), there’s also whitespace separated, pipe separated, etc. But that is all just a variation of the csv theme. For these types of files, use the read_csv method. The sep parameter let’s you choose the type of delimiter that your file is in.
There is another more insidious file type that is technically text based. The fixed-width file type. In my experience these tend to come from legacy systems where memory was at a premium. They are rigid, and break all the time. But they exist and get used regularly for when you can’t spare any extra characters for separators. Use the read_fwf method for these tricky files. You will, however, need to specify how many characters go into each column.
Do you have a json file. This file type is very popular with web developers these days and their structure is very similar to python dictionaries. You can turn them into dictionaries and manipulate them into a form that can be read into pandas, but it really is just simpler to use read_json instead.
SQL and Pandas
It turns out that you can write sql and execute that within pandas. I do this daily, it is a good way to pull down data to be manipulated. A good example is read_sql_query. If you pass a sql query to this little function along with a connection string to the sql server, you can pull in whatever data from the server that you wish and drop it directly into pandas.
If it is an entire table that you want, then read_sql is sufficient. Though, to date, I can’t think of a single time that I wanted to do that. Usually, I want to at the very least join and filter the table down in some way. So I write a query instead of pull an entire table.
Custom Formats
It turns out that people will share files in all kinds of different formats. I started my life as a stata user. It turns out that stata is fantastic if you are an economist. It really is the de facto language of modern empirical economics. But it has its own custom data format. Pandas can handle it with read_stata. That is not all there are readers for SAS, SPSS, feather, parquet and so many more! Whatever form that the data is given to you in, rest assured there is probably a way to read it into python using pandas.