5 awesome tips about Pandas
Introduction
Pandas is probably one of the most important Python libraries for Data Science, and honestly it has saved my life many times: I really love it, and I tend to use it a lot in any project I am directly involved in. This is to say that it is always present in any requirements.txt file I typically have to write when I set up a project.
The great thing about pandas is that it is simple and comes with great documentation, and its growing popularity within the Data Science community has been a driving force for ever more efficient code.
However, whenever I analyse pandas-based code, I observe that people tend to use standard pandas techniques to perform operations within (and between) data frames. This is perfectly fine, but I think most of them are not aware that these methods can become incredibly inefficient as the number of rows in the dataset grows.
The purpose of this post is to share some advanced tips that any data science developer should know to run more complex operations with pandas efficiently. So, let's dive into them.
Data
We consider two sources of data, namely sales_df and prices_df, which have been synthetically created for illustration: the former contains daily information on the sales of fruits, whereas the latter contains price information on those fruits. If you are curious about how those datasets were generated, please check the source code here. The sales_df is shown here:
Date | Lemon | Apple | Orange | Banana | Mango | Total
---|---|---|---|---|---|---
2020-10-01 | 17 | 12 | 28 | 6 | 10 | 73 |
2020-10-02 | 20 | 14 | 11 | 11 | 14 | 70 |
2020-10-03 | 5 | 24 | 29 | 12 | 18 | 88 |
2020-10-04 | 8 | 23 | 29 | 19 | 13 | 92 |
2020-10-05 | 8 | 9 | 17 | 22 | 14 | 70 |
The prices_df is given below for the sake of completeness:
item | price
---|---
Lemon | 1.2
Apple | 0.5
Orange | 0.4
Banana | 0.7
Mango | 1.5
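For reference, here is a minimal sketch that reconstructs the two frames shown above; the values are copied from the tables, while the use of date_range and the Date index name are assumptions based on the outputs shown later in the post.

import pandas as pd

# Rebuild the sample data shown above (values copied from the tables)
dates = pd.date_range("2020-10-01", periods=5, freq="D", name="Date")
sales_df = pd.DataFrame(
    {
        "Lemon": [17, 20, 5, 8, 8],
        "Apple": [12, 14, 24, 23, 9],
        "Orange": [28, 11, 29, 29, 17],
        "Banana": [6, 11, 12, 19, 22],
        "Mango": [10, 14, 18, 13, 14],
    },
    index=dates,
)
sales_df["Total"] = sales_df.sum(axis=1)

prices_df = pd.DataFrame(
    {"item": ["Lemon", "Apple", "Orange", "Banana", "Mango"],
     "price": [1.2, 0.5, 0.4, 0.7, 1.5]}
)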
1. The .eval() method
Although this function is still not very popular among Data Scientists, it is worth noticing that it is used in the backend of many of the most popular pandas functions. But what does it do? It evaluates arithmetic expressions involving operations between columns of a given dataframe. Typically, we pass to the .eval() method a string containing the expression we wish to evaluate on our dataframe: the reason is that this method relies on the numexpr backend, which parses the string and evaluates it efficiently.
As an example, we might want to compute the total daily revenues per item. To do so, we can apply .eval() to the sales_df as follows:
# Append one revenues column per fruit, letting pd.eval() do the arithmetic
for row in prices_df.itertuples():
    col_name = row[1] + ' ' + 'Revenues'  # e.g. 'Lemon Revenues'
    sales_df[col_name] = pd.eval("sales_df[row[1]]*row[2]")  # quantity * price
The above snippet appends to the sales_df a revenues column for each fruit item involved. In particular, eval() takes a string containing the expression "sales_df[row[1]]*row[2]": since the string is handed to the numexpr engine (with the default pandas parser), the multiplication is applied to all rows at once in a vectorised fashion, rather than in an explicit Python loop.
A simple inspection of the new columns confirms our expectations.
sales_df.loc[:, 'Lemon Revenues':]
Date | Lemon Revenues | Apple Revenues | Orange Revenues | Banana Revenues | Mango Revenues
---|---|---|---|---|---
2020-10-01 | 20.4 | 6.0 | 11.2 | 4.2 | 15.0 |
2020-10-02 | 24.0 | 7.0 | 4.4 | 7.7 | 21.0 |
2020-10-03 | 6.0 | 12.0 | 11.6 | 8.4 | 27.0 |
2020-10-04 | 9.6 | 11.5 | 11.6 | 13.3 | 19.5 |
2020-10-05 | 9.6 | 4.5 | 6.8 | 15.4 | 21.0 |
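For comparison, the same revenue columns could also be built with plain vectorised arithmetic; the eval() version simply delegates the computation to the numexpr backend, which tends to pay off as the frame grows.

# Plain-pandas equivalent of the eval()-based loop above
for row in prices_df.itertuples():
    sales_df[row[1] + ' Revenues'] = sales_df[row[1]] * row[2]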
Note that .eval() can also be called directly on the dataframe. In this case, you refer to the columns by name inside the expression. So, for instance, if you wanted to compute the total sales per day - which is already present in the dataframe as the Total column - you would just sum the columns by name as follows:
sales_df.eval("Total = Lemon + Apple + Orange + Banana + Mango", inplace=True)
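As a side note, DataFrame.eval() also accepts multi-line expressions, so several derived columns can be assigned in a single call; the Citrus column below is just a hypothetical example.

# Assign two derived columns in one eval() call (Citrus is a made-up example)
sales_df.eval(
    """
Total = Lemon + Apple + Orange + Banana + Mango
Citrus = Lemon + Orange
""",
    inplace=True,
)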
2. The .query() method
This is my favourite one: it is based on the eval() method, and it is used to filter the dataframe according to a given expression. Since the backend is still numexpr, we need to pass the expression to this method as a string. For instance, we might want to filter out all the rows that do not match the condition "Total >= 80": this can easily be done as follows:
sales_df.query("Total >= 80")
Date | Lemon | Apple | Orange | Banana | Mango | Total
---|---|---|---|---|---|---
2020-10-03 | 5 | 24 | 29 | 12 | 18 | 88 |
2020-10-04 | 8 | 23 | 29 | 19 | 13 | 92 |
What I really like about this method is that it allows you to reference external variables inside the expression by prefixing them with @ (a feature inherited from the eval() method):
condition = 80
sales_df.query("Total >= @condition")
We can even pass multiple conditions in the same expression, as follows:
condition = 50
sales_df.query("Total >= @condition & Mango >15")
Date | Lemon | Apple | Orange | Banana | Mango | Total
---|---|---|---|---|---|---
2020-10-03 | 5 | 24 | 29 | 12 | 18 | 88 |
Note that we could have written sales_df[(sales_df['Total'] >= 50) & (sales_df['Mango'] > 15)] to perform the same operation. Honestly, apart from the performance aspect, I find that version harder to read, especially when the number of conditions increases. With the .query() method, everything gets simpler and more efficient.
3. The .lookup() method
Let us consider a slightly different dataset, sales_best_df, which basically consists of the sales_df with an extra column named "Best": this column indicates the fruit that sold the most on that particular day.
Date | Lemon | Apple | Orange | Banana | Mango | Best
---|---|---|---|---|---|---
2020-10-01 | 17 | 12 | 28 | 6 | 10 | Orange |
2020-10-02 | 20 | 14 | 11 | 11 | 14 | Lemon |
2020-10-03 | 5 | 24 | 29 | 12 | 18 | Orange |
2020-10-04 | 8 | 23 | 29 | 19 | 13 | Orange |
2020-10-05 | 8 | 9 | 17 | 22 | 14 | Banana |
Suppose you are interested in getting the maximum quantity sold each day. There are many ways to do that, and one of the fastest is the .lookup() method. It basically requires two arguments: row_labels and col_labels. The former is typically the index of our dataframe, whereas the latter is a sequence of column labels of the same length - in our case, the values of the "Best" column, so that for each row the value is looked up in the column named there.
# For each day, pick the quantity from the column named in "Best"
sales_best_df['Best Quantity'] = sales_best_df.lookup(
    sales_best_df.index, sales_best_df["Best"]
)
The expected result is shown in the following table:
Date | Best | Best Quantity
---|---|---
2020-10-01 | Orange | 28 |
2020-10-02 | Lemon | 20 |
2020-10-03 | Orange | 29 |
2020-10-04 | Orange | 29 |
2020-10-05 | Banana | 22 |
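One caveat: DataFrame.lookup() was deprecated in pandas 1.2 and has since been removed, so on recent versions you will need an alternative. A minimal sketch using NumPy fancy indexing on the underlying array gives the same result:

import numpy as np

# Equivalent of the deprecated .lookup() call on recent pandas versions:
# for each row, pick the value from the column named in "Best".
fruit_cols = sales_best_df.columns.drop("Best")
col_positions = fruit_cols.get_indexer(sales_best_df["Best"])
sales_best_df["Best Quantity"] = sales_best_df[fruit_cols].to_numpy()[
    np.arange(len(sales_best_df)), col_positions
]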
4. The .where() method
This method can be extremely useful in the initial (or exploratory) phase of working with a dataset, especially if you want to know both how many values satisfy a given condition and what those values are. Indeed, .where() simply replaces with a null value (NaN) the rows for which the condition is not satisfied, whereas it retains the original values where the condition is True.
For instance, if you want to see on which days the column "Best Quantity" is greater than 25, you can simply use the .where() method, which takes a boolean condition to be verified:
sales_best_df.where(sales_best_df['Best Quantity'] > 25)
Date | Lemon | Apple | Orange | Banana | Mango | Best | Best Quantity
---|---|---|---|---|---|---|---
2020-10-01 | 17.0 | 12.0 | 28.0 | 6.0 | 10.0 | Orange | 28.0 |
2020-10-02 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2020-10-03 | 5.0 | 24.0 | 29.0 | 12.0 | 18.0 | Orange | 29.0 |
2020-10-04 | 8.0 | 23.0 | 29.0 | 19.0 | 13.0 | Orange | 29.0 |
2020-10-05 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
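Since the rows that fail the condition become NaN, chaining a count() immediately gives the number of days that satisfy it:

# How many days had a best-selling quantity above 25? (3 in the table above)
sales_best_df.where(sales_best_df['Best Quantity'] > 25)['Best Quantity'].count()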
You can also replace the masked values with something other than NaN: for instance, we can pass an integer - say 0 - or a more complex expression. As an example, take the following one:
sales_best_df.where(sales_best_df['Best Quantity'] > 25, 0)['Mango']
Date
2020-10-01 10
2020-10-02 0
2020-10-03 18
2020-10-04 13
2020-10-05 0
Name: Mango, dtype: int64
The opposite of .where() is .mask(): as you can see below, it takes the complementary part of the result obtained by .where() on the dataframe, i.e. it replaces the values for which the condition is True.
sales_best_df.mask(sales_best_df['Best Quantity'] > 25, 0)['Mango']
Date
2020-10-01 0
2020-10-02 14
2020-10-03 0
2020-10-04 0
2020-10-05 14
Name: Mango, dtype: int64
5. The .select_dtypes() method
This is a nice one. Typically, we have to decide whether or not to include variables of type object in our analysis. Obviously, one way to convert those variables into a proper numerical form is One-Hot Encoding, which is performed via the pd.get_dummies() function (a quick sketch of this is shown right below). However, sometimes we just want to remove them, since they would not bring us useful information. To do so, we can control which column types to include via the .select_dtypes() method.
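For completeness, if you did want to keep the categorical information instead of dropping it, a one-hot encoding of the Best column might look like this:

# One-hot encode the "Best" column rather than removing it
pd.get_dummies(sales_best_df, columns=["Best"], prefix="Best")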
For instance, inspecting sales_best_df.dtypes shows that our dataset has one column of type object:
Lemon int64
Apple int64
Orange int64
Banana int64
Mango int64
Best object
dtype: object
Since that column is redundant, we might want to remove it. In standard pandas, we would have performed this operation as follows:
sales_best_df.loc[:, sales_best_df.dtypes == np.int64]
Alternatively, we can use the .select_dtypes() method, which allows us to either include or exclude particular types of variables with a simpler syntax. For instance,
sales_best_df.select_dtypes(include=['int64'])
will keep only the columns of type int64, removing all the others. If you want to be more conservative and, say, keep both int64 and float64 columns, you can instead explicitly specify which types to exclude, i.e. all columns of type object. This can easily be done as follows:
sales_best_df.select_dtypes(exclude=['object'])
Conclusions
In this post, we have seen five different methods that you can now use in your pandas-based Data Science projects. I strongly recommend all of them, since they have been designed to enhance your code's performance. If you have enjoyed this post and you want more, please feel free to subscribe here!