5 awesome tips about Pandas

10 minute read

Introduction

Pandas is probably one of the most important Python libraries for Data Science, and honestly it has saved my life many times: I really love it, and I tend to use it a lot in any project I am directly involved in. This is to say that it is always present in any requirements.txt file I typically have to write when I set up a project.

The great thing about pandas is that it is simple and comes with great documentation, and its growing popularity within the Data Science community has been a driving force for ever more efficient code.

However, whenever I analyse pandas-based code, I observe that people tend to use standard pandas techniques to perform operations within (and between) data frames. This is perfectly fine, but I think most of them are not aware that these methods can become incredibly inefficient as the number of rows in the dataset grows.

The purpose of this post is to share some advanced tips that any data science developer should know in order to run more complex operations with pandas efficiently. So, let’s dive into them.

Data

We consider two sources of data, namely sales_df and prices_df, which have been synthetically created for illustration: the former contains daily information on the sales of fruits, whereas the latter contains price information for those fruits. If you are curious about the synthetic generation of those datasets, please check the source code here. The sales_df is shown here:

Lemon Apple Orange Banana Mango Total
2020-10-01 17 12 28 6 10 73
2020-10-02 20 14 11 11 14 70
2020-10-03 5 24 29 12 18 88
2020-10-04 8 23 29 19 13 92
2020-10-05 8 9 17 22 14 70

The prices_df is given below for the sake of completeness:

item price
0 Lemon 1.2
1 Apple 0.5
2 Orange 0.4
3 Banana 0.7
4 Mango 1.5

1. The .eval() method

Although this function is still not very popular among Data Scientists, it is worth noting that it is used in the backend of many of the most popular pandas functions. But what does it actually do? It evaluates arithmetic expressions involving operations between columns of a given dataframe. Typically, we pass to the .eval() method a string containing the expression we wish to evaluate on our dataframe: the string form is needed because, under the hood, the method relies on the numexpr backend, which parses and evaluates the expression very efficiently.
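As a minimal illustration of the string-based syntax (the toy frame df below is hypothetical and not part of our sales data):

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
pd.eval("df.a + df.b")   # the string is resolved against the local variable df, returning a Series [4, 6]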

As an example, we might want to compute the total daily revenues per item. To do so, we can employ the .eval() on the sales_df as follows:

import pandas as pd

for row in prices_df.itertuples():
    # row[1] is the item name, row[2] its price
    col_name = row[1] + ' ' + 'Revenues'
    sales_df[col_name] = pd.eval("sales_df[row[1]] * row[2]")

The above snippet basically appends to sales_df a revenues column for each fruit item involved. In particular, we see that eval() takes the string "sales_df[row[1]] * row[2]": since, by default, the parser is the pandas one, the local variable row is resolved inside the string and the multiplication is evaluated element-wise over the whole column.

A simple inspection of the new columns confirms our expectations.

sales_df.loc[:, 'Lemon Revenues':]
Lemon Revenues Apple Revenues Orange Revenues Banana Revenues Mango Revenues
2020-10-01 20.4 6.0 11.2 4.2 15.0
2020-10-02 24.0 7.0 4.4 7.7 21.0
2020-10-03 6.0 12.0 11.6 8.4 27.0
2020-10-04 9.6 11.5 11.6 13.3 19.5
2020-10-05 9.6 4.5 6.8 15.4 21.0
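For comparison, here is a sketch of the equivalent non-eval code, just to make explicit what the expression computes:

for row in prices_df.itertuples():
    sales_df[row[1] + ' Revenues'] = sales_df[row[1]] * row[2]

The .eval() version can become advantageous on much larger frames, since numexpr evaluates the expression without materialising intermediate results.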

Note that .eval() can also be applied directly to the dataframe. In this case, you pass the column names inside the expression. So, for instance, if you wanted to compute the total sales per day - which is already present in the dataframe as the Total column - you would just sum the column names as follows:

sales_df.eval("Total = Lemon + Apple + Orange + Banana + Mango", inplace=True)
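For reference, this assignment is essentially equivalent to the plain column sum below (assuming the five fruit columns are the only ones to be summed):

sales_df['Total'] = sales_df[['Lemon', 'Apple', 'Orange', 'Banana', 'Mango']].sum(axis=1)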

2. The .query() method

This is my favourite one: it is based on the eval() method, and it is used to filter the dataframe according to a given expression. Since the backend is still numexpr, we need to pass the expression to this method as a string.

For instance, we might want to filter out all the rows that do not match the condition "Total >= 80": this can be easily done as follows:

sales_df.query("Total >= 80")
Lemon Apple Orange Banana Mango Total
2020-10-03 5 24 29 12 18 88
2020-10-04 8 23 29 19 13 92

What I really like about this method is that it allows you to reference external variables inside the expression with the @ prefix (this is inherited from the eval() method):

condition = 80
sales_df.query("Total >= @condition")

We can even pass multiple conditions in the same expression, as follows:

condition = 50
sales_df.query("Total >= @condition & Mango >15")
Lemon Apple Orange Banana Mango Total
2020-10-03 5 24 29 12 18 88

Note that we could have written sales_df[(sales_df['Total'] >= 50) & (sales_df['Mango'] > 15)] to perform the same operation. Honestly, performance aside, I find that version harder to read, especially as the number of conditions increases. With the .query() method, everything gets simpler and more efficient.
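If you want to check the performance claim yourself, here is a rough timing sketch on a large synthetic frame (big_df below is illustrative only; the actual gap depends on your machine, your pandas/numexpr versions and the data size):

import timeit
import numpy as np
import pandas as pd

# a large synthetic frame, just to make the timing difference visible
big_df = pd.DataFrame(np.random.randint(0, 30, size=(1_000_000, 5)),
                      columns=['Lemon', 'Apple', 'Orange', 'Banana', 'Mango'])
big_df['Total'] = big_df.sum(axis=1)

print(timeit.timeit(lambda: big_df.query("Total >= 50 & Mango > 15"), number=10))
print(timeit.timeit(lambda: big_df[(big_df['Total'] >= 50) & (big_df['Mango'] > 15)], number=10))

In my experience the string-based version tends to pay off only on reasonably large frames, so do not expect miracles on a five-row toy dataset.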

3. The .lookup() method

Let us consider a slightly different dataset, which basically consists of sales_df with an extra column, named "Best": this column indicates the category of fruit that was sold the most on that particular day.

Lemon Apple Orange Banana Mango Best
2020-10-01 17 12 28 6 10 Orange
2020-10-02 20 14 11 11 14 Lemon
2020-10-03 5 24 29 12 18 Orange
2020-10-04 8 23 29 19 13 Orange
2020-10-05 8 9 17 22 14 Banana

Suppose you are interested in getting the maximum quantity sold each day. There are many ways to do that, and one of the fastest is by means of the .lookup() method. This requires two arguments: row_labels and col_labels. The former is typically the index of our dataframe, whereas the latter is a sequence of column labels, one per row, telling the method where to look - in our case, the values of the "Best" column.

sales_best_df['Best Quantity'] = sales_best_df.lookup(
      sales_best_df.index, sales_best_df["Best"]
)

The expected result is shown in the following table:

Best Best Quantity
2020-10-01 Orange 28
2020-10-02 Lemon 20
2020-10-03 Orange 29
2020-10-04 Orange 29
2020-10-05 Banana 22
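A word of caution: in more recent pandas releases .lookup() has been deprecated (since version 1.2) and eventually removed, so on newer versions you may need an alternative. A minimal sketch based on plain NumPy indexing, assuming the fruit columns are exactly the five listed below, would look like this:

import numpy as np

fruit_cols = ['Lemon', 'Apple', 'Orange', 'Banana', 'Mango']
# positional index of the "Best" column for each row
col_pos = sales_best_df[fruit_cols].columns.get_indexer(sales_best_df['Best'])
row_pos = np.arange(len(sales_best_df))
sales_best_df['Best Quantity'] = sales_best_df[fruit_cols].to_numpy()[row_pos, col_pos]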

4. The .where() method

This method can be extremely useful in the initial exploration phase of a dataset, especially if you want to get both the number of non-null values and the values themselves under a specific condition. Indeed, .where() simply replaces with a null value the rows for which the condition is not satisfied, whereas it retains the original values where the condition is True.

For instance, if you want to see where the column "Best Quantity" is greater than 25, you can simply use the .where() method, which takes a boolean condition as its first argument.

sales_best_df.where(sales_best_df['Best Quantity'] > 25)
Lemon Apple Orange Banana Mango Best Best Quantity
Date
2020-10-01 17.0 12.0 28.0 6.0 10.0 Orange 28.0
2020-10-02 NaN NaN NaN NaN NaN NaN NaN
2020-10-03 5.0 24.0 29.0 12.0 18.0 Orange 29.0
2020-10-04 8.0 23.0 29.0 19.0 13.0 Orange 29.0
2020-10-05 NaN NaN NaN NaN NaN NaN NaN

You can also replace the null values with something else: for instance, we can pass an integer - say 0 - or a more complex expression as the second argument. As an example, take the following one:

sales_best_df.where(sales_best_df['Best Quantity'] > 25, 0)['Mango']
    Date
    2020-10-01    10
    2020-10-02     0
    2020-10-03    18
    2020-10-04    13
    2020-10-05     0
    Name: Mango, dtype: int64

The opposite of .where() is .mask(): as you can see below, it returns the complement of the result obtained by .where() on the dataframe, replacing the values where the condition is True.

sales_best_df.mask(sales_best_df['Best Quantity'] > 25, 0)['Mango']
    Date
    2020-10-01     0
    2020-10-02    14
    2020-10-03     0
    2020-10-04     0
    2020-10-05    14
    Name: Mango, dtype: int64
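Since .where() leaves NaN where the condition fails, combining it with .count() gives the number of days satisfying the condition per column - the "total non-null values" use mentioned above (a small sketch):

sales_best_df.where(sales_best_df['Best Quantity'] > 25).count()

Each column here would report 3, i.e. the three days with a best quantity above 25.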

5. The .select_dtypes() method

This is a nice one. Typically, we have to decide whether or not to include variables of type object in our analysis. One way to convert those variables into a proper numerical form is One-hot Encoding, which is performed via the pd.get_dummies() function. However, sometimes we just want to remove them, since they would not bring us useful information. To do so, we can control which column types to include via the .select_dtypes() method.
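Just for context, one-hot encoding the "Best" column would look roughly like this (a sketch, not something we need for the rest of the section): it replaces "Best" with one indicator column per fruit, such as Best_Orange or Best_Banana.

pd.get_dummies(sales_best_df, columns=['Best'])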

For instance, our dataset has a column of type object, as shown below:

Lemon             int64
Apple             int64
Orange            int64
Banana            int64
Mango             int64
Best             object
dtype: object

Since that column is redundant, we might want to remove it. In standard pandas, we would have performed this operation as follows:

import numpy as np
sales_best_df.loc[:, sales_best_df.dtypes == np.int64]

Alternatively, we can use the .select_dtypes() method, which allows us to either include or exclude particular types of variables with a simpler syntax. For instance,

sales_best_df.select_dtypes(include=['int64'])

will remove all columns that are not of type int64. Alternatively, if you want to be more conservative and keep both int64 and float64 columns, you can explicitly state which type to exclude, i.e. all columns of type object. This can easily be done as follows:

sales_best_df.select_dtypes(exclude=['object'])
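As a side note, .select_dtypes() also accepts the generic 'number' alias, which keeps every numeric column (integer or float) in one go - handy when you do not want to enumerate dtypes explicitly:

sales_best_df.select_dtypes(include='number')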

Conclusions

In this post, we have seen five different methods that you can now use in your pandas-based Data Science projects. I strongly recommend all of them, since they have been designed to improve the performance of your code. If you have enjoyed this post, and you want to get more, please feel free to subscribe here!