In data processing, we sometimes need to apply an operation to every value in certain columns of a DataFrame. pandas provides several methods to handle column operations, and in this article we will cover the best practices for them.
When we are dealing with humongous data sets, a poor choice of method can take performance for a toss, which hurts productivity and wastes resources.
Pandas provides the following methods to operate on columns:
- Iteration by iloc.
- Iteration by .iterrows().
- The apply() function.
- Vectorize like NumPy.
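To make the four options concrete, here is a minimal sketch of each one computing a square root over a single column. The DataFrame and the column name "value" are assumptions made up for illustration, and math.sqrt is used with apply() so that the function really is called once per element.

```python
import math

import numpy as np
import pandas as pd

# Small illustrative frame; the column name "value" is an assumption.
df = pd.DataFrame({"value": [1.0, 4.0, 9.0, 16.0]})

# 1. Iteration by iloc: positional access, one cell at a time.
col = df.columns.get_loc("value")
by_iloc = [math.sqrt(df.iloc[i, col]) for i in range(len(df))]

# 2. Iteration by iterrows(): yields (index, row) pairs, each row being a Series.
by_iterrows = [math.sqrt(row["value"]) for _, row in df.iterrows()]

# 3. apply(): calls the Python function once per element of the column.
by_apply = df["value"].apply(math.sqrt)

# 4. Vectorized: a single NumPy call over the whole column.
vectorized = np.sqrt(df["value"])

print(by_iloc, by_iterrows, by_apply.tolist(), vectorized.tolist())
```

All four produce the same values; they differ only in how much per-row Python overhead they pay to get there.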
Let's get the performance metrics by calculating the sqrt of a number in a particular column of a dataset of 1 million records and observing the performance of the above methods. Below is the code snippet used for the performance testing on repl.it.
Here is the output for the same.
The above result from repl.it might vary vastly for different environment settings and configurations. From the output we can infer that Vectorize is the fastest and iterrows() is the slowest method to perform column operations.
Vectorize is approximately 5000x faster than iterrows().
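If you want to check a comparison like this on your own machine, a rough timing harness could look like the sketch below. It is not the original benchmark: the helper time_call, the column name "value" and the row count n are assumptions chosen for illustration, and n is kept far below 1 million so the slow loops finish quickly. The absolute numbers and the ratios will differ from one environment to another.

```python
import math
import time

import numpy as np
import pandas as pd


def time_call(fn):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Hypothetical size, kept small so the row-by-row loops stay fast.
n = 10_000
df = pd.DataFrame({"value": np.arange(n, dtype="float64")})

# Each entry computes the square root of every value in the column.
methods = {
    "iloc": lambda: [math.sqrt(df.iloc[i, 0]) for i in range(n)],
    "iterrows": lambda: [math.sqrt(row["value"]) for _, row in df.iterrows()],
    "apply": lambda: df["value"].apply(math.sqrt).tolist(),
    "vectorized": lambda: np.sqrt(df["value"].to_numpy()).tolist(),
}

results, timings = {}, {}
for name, fn in methods.items():
    results[name], timings[name] = time_call(fn)
    print(f"{name:>10}: {timings[name]:.4f} s")
```

A single run per method is noisy, so for more stable figures consider repeating each call several times (for example with the standard library's timeit module) and comparing the best runs.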
So the next time you're performing a column operation, consider the vectorize method for optimal performance. Cheers!