Data Filtering in Pandas

Data Filtering in Pandas

Pandas provides extensive methods for data selection, Below are some methods you could choose,

  • Pass condition expression to [] operation.
  • Use the query() function.
  • Use the eval() function.

To understand which is more optimised way of data filtering, We will create a sample dataframe with a large data set and check it out.

Photo by Delphine Ducaruge on Unsplash

#creating a sample large data set
import numpy as np
import pandas as pd
import timeit
np.random.seed(42)
d = np.random.randint(1, 10, size=(500000, 2))
df = pd.DataFrame(d)
df.columns = ["col1 ", "col2"]

We will be using this dataset, for the entire analysis repo_link, When the data size is small, it doesn’t matter which one you choose, but the [] operation may be faster (depending on the system you choose) . If the data size is large or the filter condition is complicated, you should choose it carefully.

Select data by **[]**

Indexing is the easier way of filtering the data, We can use pandas indexing to subset the DataFrame. Actually, the condition expression creates a Boolean series, and we can use it to filter the DataFrame.

# [] operator
start = timeit.default_timer()
dfTemp = df[(df["col1 "] % 2==0) & (df["col2"] > 5)]
stop = timeit.default_timer()
print("Run time of df[df[(df[\"col1\"] % 2==0) & (df[\"col2\"] > 5)] is {} seconds.".format(stop-start))

#output
Run time of df[df[(df["col1"] % 2==0) & (df["col2"] > 5)] is 0.020918252997944364 seconds.

Select data by query()

query() is a much cleaner and easier way to filter rows. Just pass a conditional expression with quotes to the query(). Moreover, query() supports much more complicated conditional expressions.

Note*: This operation wouldn’t change the DataFrame in place. If you want to change it, just pass `inplace=True`.*

#select data by query() operator
start = timeit.default_timer()
dfTemp = df.query("col1 % 2 == 0 & col2 > 5")
stop = timeit.default_timer()
print("Running time of df.query(\"col1 % 2 == 0 & col2 > 5\") is {} seconds.".format(stop-start))

#output
Running time of df.query("col1 % 2 == 0 & col2 > 5") is 0.01401120700073079 seconds.

Query seems to be faster when compared to [], can we do better !?

Select data by eval()

Pandas **dataframe.eval()** function is used to evaluate an expression in the context of the calling dataframe instance. The expression is evaluated over the columns of the dataframe and filters the data as needed.

#select data by eval() operator
start = timeit.default_timer()
dfTemp = df[df.eval('col1 % 2 == 0 & col2 > 5')]
stop = timeit.default_timer()
print("Running time of df[df.eval('col1 % 2 == 0 & col2 > 5')] is {} seconds.".format(stop-start))

#output
Running time of df[df.eval('col1 % 2 == 0 & col2 > 5')] is 0.001771175499743549 seconds.

When data is small, the orthodox *[]* operation tends to be faster. The advantages of *query()* and *eval()* lies in humongous dataset. Moreover, *eval()* has much cleaner syntax.

Below are some conditional statements and operators supported by eval:

  • Arithmetic operations except for the left shift (<<) and right shift (>>) operators.
  • Comparison operations.
  • Boolean operations, such as df.col1> 5 and df.col2 <=10
  • Attribute access, such as df.col1.
  • Subscript expressions, such as df[0].
  • Math functions.
import math  
df[df.eval('math.sin(df.col1)>0')]
  • List and tuple literals.
df[df.eval('df.col1 in ["Sun", "Moon"]')]

Photo by Luca Bravo on Unsplash

Optimising the way we filter the data, would be one of the options to look into system to improve the performance while working with dataframe, Optimising the filtering operations eval(), query(), [] wouldn’t guarantee the performance enhancement, It’s a multi-variant equation, but this would be a poignant variable and worth a try, Cheers!