Pandas Best Practices for Column Operations


In data processing, we often need to apply an operation to every value in certain columns of a DataFrame. pandas provides several ways to handle column operations. In this article, we will compare them and cover the best practice.


When we are dealing with very large data sets, a poor choice of method takes a heavy toll on performance, which in turn hurts productivity and wastes resources.

Pandas provides the following methods to operate on columns:

  • Iteration by iloc.
  • Iteration by .iterrows().
  • The apply() function.
  • Vectorized operations, as in NumPy.
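To make the four approaches concrete, here is a minimal sketch that computes the square root of one column each way. The DataFrame, the column name "value", and the variable names are hypothetical, chosen only for illustration.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical sample data; the column name "value" is an assumption.
df = pd.DataFrame({"value": [1.0, 4.0, 9.0, 16.0]})

# 1. Iteration by iloc: fetch each row by position, then read the column.
by_iloc = [math.sqrt(df.iloc[i]["value"]) for i in range(len(df))]

# 2. Iteration by .iterrows(): yields (index, row Series) pairs.
by_iterrows = [math.sqrt(row["value"]) for _, row in df.iterrows()]

# 3. apply(): calls math.sqrt once per element of the column.
by_apply = df["value"].apply(math.sqrt)

# 4. Vectorized: a single NumPy call over the whole column.
vectorized = np.sqrt(df["value"])

print(vectorized.tolist())
```

All four produce the same values. They differ in how much Python-level work happens per row.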

Let's measure performance with a simple task: calculating the square root of each number in one column of a dataset of 1 million records, and observing how each of the methods above performs. The performance test was run on Replit.
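A benchmark of this kind could be sketched roughly as follows. This is not the snippet used for the test, only a minimal illustration under stated assumptions: the helper names `time_it` and `run_benchmark`, the column name "value", and the random input data are all hypothetical. The row count defaults to 10,000 so the sketch finishes quickly. Pass `n_rows=1_000_000` to approximate the test described above, and expect the row-by-row methods to take a long time.

```python
import math
import time

import numpy as np
import pandas as pd


def time_it(fn):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def run_benchmark(n_rows=10_000, seed=0):
    """Time four ways of taking the square root of one column."""
    rng = np.random.default_rng(seed)
    # Hypothetical input: non-negative random floats in a column named "value".
    df = pd.DataFrame({"value": rng.random(n_rows) * 100})
    col = df.columns.get_loc("value")

    methods = {
        "iloc": lambda: [math.sqrt(df.iloc[i, col]) for i in range(len(df))],
        "iterrows": lambda: [math.sqrt(row["value"]) for _, row in df.iterrows()],
        "apply": lambda: df["value"].apply(math.sqrt).tolist(),
        "vectorized": lambda: np.sqrt(df["value"].to_numpy()).tolist(),
    }

    results, timings = {}, {}
    for name, fn in methods.items():
        results[name], timings[name] = time_it(fn)
    return results, timings


results, timings = run_benchmark()
for name, seconds in timings.items():
    print(f"{name:>10}: {seconds:.6f} s")
```

A single timed call per method is noisy. For steadier numbers, repeat each call several times and take the minimum, for example with the standard library's timeit module.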

Here is the output of the test run on Replit.

The results above, taken on Replit, may vary widely across environments and configurations. From the output we can infer that vectorization is the fastest way to perform column operations and iterrows() is the slowest.

Vectorization is approximately 5000x faster than iterrows(). So the next time you're performing a column operation, consider the vectorized approach for optimal performance. Cheers!