Efficient memory usage in Pandas

When you load a large file into a pandas DataFrame, pandas often consumes more memory than expected. In this blog, we will try to mitigate this with some best practices.


Type Conversion

Taking care of data types serves multiple use cases:

  • Reducing memory usage
  • Avoiding mismatched types
  • Enforcing a specific type

Let’s focus on memory management. int64 takes much more memory than int8. When we load data from a CSV file without specifying a dtype for each column, pandas defaults to int64, float64, or object, which are the largest data types. If a column only stores numbers from 0 to 20, int8 is enough. Likewise, float32 can often replace float64.
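To see why the smaller types help, here is a quick sketch that prints the value range and per-value size of each integer dtype using NumPy’s iinfo:

```python
import numpy as np

# Value range and size of common integer dtypes; pick the
# smallest dtype that still covers your data.
for dtype in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(dtype)
    print(f"{np.dtype(dtype).name}: {info.min} .. {info.max}, "
          f"{np.dtype(dtype).itemsize} byte(s) per value")
```

An int8 value occupies one byte versus eight for int64, so a column that fits in int8 uses an eighth of the memory.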

Let’s see how much memory we can save. Here is a sample code snippet to check the usage:

import numpy as np
import pandas as pd

data = np.random.randint(-100, 100, size=(50000, 10))
df = pd.DataFrame(data)
print("The total information about the original object")
print(df.info())
print("The memory usage of all columns")
print(df.memory_usage())
print('--------------')
intCols = df.select_dtypes(include=['int64']).columns.tolist()
df[intCols] = df[intCols].apply(pd.to_numeric, downcast='integer')
print("The total information about the modified object")
print(df.info())
print("The memory usage of all columns")
print(df.memory_usage())

The optimization effect is very obvious: total memory usage dropped from 1.5 MB to 195 KB, a reduction of nearly 90 percent. The only thing we did was change the data type from int64 to int8.

Using Category type instead of Object

In the real world, some object columns have a limited set of possible values. For example, imagine a dataset with 100 million rows containing a column named Continent that identifies one of the seven continents. Even though the dataset has 100 million rows, there are only seven distinct values, so storing a separate string object in every row is a huge waste of memory. In pandas, we can use the category type instead of object for such a column.

import pandas as pd
import string
import random

# Two-character strings from a small alphabet, so values repeat often.
l = [''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(2))
     for i in range(2000000)]
s = pd.Series(l)
print("The original series memory usage")
print(s.memory_usage())
s = s.astype("category")
print("The modified series memory usage")
print(s.memory_usage())

From the results it’s clear that category performs much better, so do give it a try!
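One caveat worth knowing: for object columns, memory_usage() counts only the 8-byte pointers to each Python string, not the strings themselves. Passing deep=True measures the string objects as well, which makes the savings from category even more apparent. A small sketch (the example values are made up for illustration):

```python
import pandas as pd

# Repeating string values, the ideal case for the category dtype.
s = pd.Series(["north", "south", "east", "west"] * 250_000)

# Shallow: counts only the pointer array for object dtype.
shallow = s.memory_usage()
# Deep: also measures the Python string objects themselves.
deep = s.memory_usage(deep=True)
# After converting to category, even the deep measurement is tiny,
# since only small integer codes plus four category labels are stored.
cat_deep = s.astype("category").memory_usage(deep=True)
print(shallow, deep, cat_deep)
```

Always compare with deep=True when object columns are involved, or the "before" number will be understated.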

Specify Column types when loading CSV files

This is more of a tip we already know: you can specify column types when loading CSV files. Moreover, you don’t need to load all columns; you can load only the subset you actually need. In the real world, most data is stored in files, and CSV is the most common format.
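Putting both ideas together, read_csv accepts a dtype mapping and a usecols list, so pandas never materialises the full-width, full-precision frame. A minimal sketch, using an in-memory StringIO with made-up column names in place of a real file on disk:

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the columns are hypothetical.
csv_data = io.StringIO(
    "id,age,score,continent\n"
    "1,23,3.5,Asia\n"
    "2,45,4.1,Europe\n"
    "3,31,2.8,Africa\n"
)

# Declare compact dtypes up front and read only the needed columns.
df = pd.read_csv(
    csv_data,
    usecols=["age", "score", "continent"],
    dtype={"age": "int8", "score": "float32", "continent": "category"},
)
print(df.dtypes)
```

The id column is never loaded at all, and the remaining columns arrive already downcast, so no post-load conversion pass is needed.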

Hope this article is helpful. Do share your thoughts in the comments on how to improve it. Cheers!