How to limit Pandas memory usage
On 09 August 2024 - tagged python
Pandas is a powerful library for data analysis in Python. It is widely used in data science and data engineering. However, it can be memory-intensive, especially when working with large datasets. In this post, we will explore how to limit Pandas memory usage and reduce it by 80% on an example trading dataset.
Pandas memory usage #
When working with Pandas, it is important to understand how memory is being used. Pandas stores column data in NumPy arrays (and extension arrays such as Categorical), and these can consume a significant amount of memory, especially when dealing with large datasets.
The most common mistake with Pandas is not managing dtypes correctly. Dtypes set the underlying storage format of column values. In Pandas, the common dtypes are int64, float64, category, datetime64[ns] and object. If you don't specify a dtype, Pandas will infer it from the data. However, Pandas tends to choose the widest dtype, e.g. int64 instead of int8, even if you only store values up to 255.
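You can see the inference in action on a tiny Series, and use pd.to_numeric to downcast to the smallest dtype that still holds the values:

```python
import numpy as np
import pandas as pd

# Pandas infers int64 even though every value fits into a single byte
s = pd.Series([0, 128, 255])
print(s.dtype)  # int64

# pd.to_numeric can downcast to the smallest matching dtype
small = pd.to_numeric(s, downcast='unsigned')
print(small.dtype)  # uint8
```

The downcast version uses 1 byte per value instead of 8.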
Example #
Let's see how this works with an example:
import pandas as pd
import numpy as np
# Create an example DataFrame with trades happening in some market
df = pd.DataFrame({
    'timestamp': np.random.randint(1, 100, size=1000000),
    'side': np.random.choice(['buy', 'sell'], size=1000000),
    'price': np.random.randint(1, 1000, size=1000000),
    'size': np.random.randint(1, 100, size=1000000),
})
# Price is some float
df['price'] /= 100
df['timestamp'] = pd.to_datetime(df['timestamp'].cumsum(), unit='s')
# Print memory usage (deep=True counts the actual string contents of object columns)
print(f"Initial memory usage: {round(df.memory_usage(deep=True).sum() / 1024 / 1024, 2)} MB")
print(df.dtypes)
print(df.memory_usage(deep=True))
This results in the following output:
Initial memory usage: 80.52 MB
timestamp datetime64[ns]
side object
price float64
size int64
Index 132
timestamp 8000000
side 60499953
price 8000000
size 8000000
As we can see, all the values are stored in 64-bit formats, occupying 8 bytes per value. The side column occupies much more space because an object column can hold arbitrary strings: pandas keeps a full Python string object for every one of the million 'buy' and 'sell' values.
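The overhead is easy to reproduce on a single column. Note that memory_usage only counts the string contents when deep=True is passed; without it, only the 8-byte pointers are counted:

```python
import sys
import pandas as pd

# Each cell of an object column points to a full Python string object
print(sys.getsizeof('buy'))  # roughly 52 bytes on CPython, plus an 8-byte pointer in the array

s = pd.Series(['buy', 'sell'] * 500_000)
print(s.memory_usage())           # pointers only, about 8 MB
print(s.memory_usage(deep=True))  # including the strings, about 60 MB
```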
Improvement #
This can be solved by simply converting the column to better dtypes:
df['side'] = df['side'].astype('category')
df['price'] = df['price'].astype('float32')
df['size'] = df['size'].astype('int32')
This immediately reduces the memory usage to 16.21 MB:
Memory usage after: 16.21 MB
Index 132
timestamp 8000000
side 1000229
price 4000000
size 4000000
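You can also set the better dtypes directly when loading the data, so the 64-bit versions are never allocated at all. A sketch using pd.read_csv, with io.StringIO standing in for a hypothetical trades.csv file:

```python
import io
import pandas as pd

# In practice you would pass a file path; StringIO stands in for trades.csv here
csv = io.StringIO(
    "timestamp,side,price,size\n"
    "2024-08-09 00:00:01,buy,9.99,10\n"
    "2024-08-09 00:00:02,sell,9.98,5\n"
)
df = pd.read_csv(
    csv,
    dtype={'side': 'category', 'price': 'float32', 'size': 'int32'},
    parse_dates=['timestamp'],
)
print(df.dtypes)
```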
The ability to work efficiently with Pandas is essential for data science. In trading, larger datasets can lead to more precise models and give you an edge in the market.
New posts are announced on twitter @jan_skoda or @crypto_lake_com, so follow us!