When you load a large file into a pandas DataFrame, pandas often consumes more memory than expected. In this blog, we will try to mitigate this with some best practices.
Type Conversion
Taking care of data types serves multiple use cases:
- Reducing memory usage
- Fixing mismatched types
- Enforcing a specific type
Let’s focus on memory management. int64 takes much more memory than int8. When we load data from a CSV file and don’t specify the dtype for each column, the default dtype will be int64, float64, or object. These are the largest data types. If a column only stores numbers from 0 to 20, int8 is enough. For float64, you can often use float32 as a replacement.
Let’s see how much memory we can save. Here is a sample code snippet to check the usage:
import numpy as np
import pandas as pd

# Generate 50,000 rows of random integers in [-100, 100)
data = np.random.randint(-100, 100, size=(50000, 10))
df = pd.DataFrame(data)

print("The total information about the original object")
print(df.info())
print("The memory usage of all columns")
print(df.memory_usage())
print('--------------')

# Downcast every int64 column to the smallest integer type that fits
intCols = df.select_dtypes(include=['int64']).columns.tolist()
df[intCols] = df[intCols].apply(pd.to_numeric, downcast='integer')

print("The total information about the modified object")
print(df.info())
print("The memory usage of all columns")
print(df.memory_usage())
Here is the conclusion: the optimization effect is obvious. With 50,000 rows of int64 data, the total memory usage drops roughly eightfold, from about 3.8 MB to under 500 KB, a reduction of nearly 90 percent, which is very impressive. The only thing we did was change the data type from int64 to int8.
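The same downcast trick applies to floating-point columns. A minimal sketch (the column name here is made up for illustration):

```python
import numpy as np
import pandas as pd

# A column of floats gets float64 by default
df = pd.DataFrame({"price": np.random.rand(50000) * 100})
print(df["price"].dtype)  # float64

# Downcast to the smallest float type pandas supports (float32)
df["price"] = pd.to_numeric(df["price"], downcast="float")
print(df["price"].dtype)  # float32

print(df.memory_usage())
```

Note that float32 has less precision than float64, so this is a trade-off; it is fine for most analytics, but be careful with values that need many significant digits.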
Using Category type instead of Object
In the real world, some object-type columns have a limited set of values. For example, in a dataset with 100 million rows, there is a column named Continent which identifies one of the seven continents. Although this dataset has 100 million rows, the column holds only seven distinct values; even a country or region column would only have a few hundred. If each row stores a separate string object, this is a huge waste of memory. In pandas, we can use the category type instead of object for such a column.
import pandas as pd
import string
import random

# Build 2,000,000 random two-character strings (at most 36^2 = 1,296 distinct values)
l = [''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(2))
     for i in range(2000000)]
s = pd.Series(l)

print("The original series memory usage")
print(s.memory_usage(deep=True))  # deep=True counts the strings themselves, not just the pointers

s = s.astype("category")
print("The modified series memory usage")
print(s.memory_usage(deep=True))
From the results it is clear that category performs far better, so do give it a try!
Specify Column types when loading CSV files
This is more of a tip we already touched on: you can specify column types when loading CSV files. Moreover, you don’t need to load all columns; you can load only a subset. In the real world, most data is stored in files, and the most common file type is CSV.
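Putting both ideas together, a minimal sketch of pd.read_csv with the dtype and usecols parameters (the file contents and column names are made up for illustration; an in-memory buffer stands in for a real file):

```python
import io
import pandas as pd

# A small in-memory CSV standing in for a real file on disk
csv_data = io.StringIO(
    "country,year,population,notes\n"
    "FR,2020,67000000,hello\n"
    "DE,2020,83000000,world\n"
)

# Load only the columns we need, with compact dtypes chosen up front
df = pd.read_csv(
    csv_data,
    usecols=["country", "year", "population"],  # skip the 'notes' column entirely
    dtype={"country": "category", "year": "int16", "population": "int32"},
)

print(df.dtypes)
```

This way the columns never pass through int64 or object at all, so peak memory stays low even while the file is being parsed.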
Hope this article was helpful. Do share your thoughts in the comments on how to improve it. Cheers!