How to limit Pandas memory usage
On 09 August 2024 - tagged python
Pandas is a powerful library for data analysis in Python. It is widely used in data science and data engineering. However, it can be memory-intensive, especially when working with large datasets. In this post, we will explore how to limit Pandas memory usage and reduce it by 80% on an example trading dataset.
Pandas memory usage #
When working with Pandas, it is important to understand how memory is being used. Pandas stores column data in NumPy arrays (and extension arrays such as Categorical), and these can consume a significant amount of memory, especially when dealing with large datasets.
The most common mistake with Pandas is not managing dtypes correctly. Dtypes set the underlying storage format of column values. In Pandas, the common dtypes are int64, float64, category, datetime64[ns] and object. If you don't specify a dtype, Pandas will infer it from the data. However, Pandas tends to choose the widest dtype, e.g. int64 instead of int8, even if you only store values up to 255.
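You can see the inference in action on a tiny Series, and use pd.to_numeric to downcast to the smallest dtype that still holds the values:

```python
import numpy as np
import pandas as pd

# Pandas infers int64 even though every value fits into a single byte
s = pd.Series([0, 128, 255])
print(s.dtype)  # int64

# pd.to_numeric can downcast to the smallest matching dtype
small = pd.to_numeric(s, downcast='unsigned')
print(small.dtype)  # uint8
```

The downcast version uses 1 byte per value instead of 8.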
Example #
Let's see how this works with an example:
import pandas as pd
import numpy as np
# Create an example DataFrame with trades happening in some market
df = pd.DataFrame({
    'timestamp': np.random.randint(1, 100, size=1000000),
    'side': np.random.choice(['buy', 'sell'], size=1000000),
    'price': np.random.randint(1, 1000, size=1000000),
    'size': np.random.randint(1, 100, size=1000000),
})
# Price is some float
df['price'] /= 100
df['timestamp'] = pd.to_datetime(df['timestamp'].cumsum(), unit='s')
# Print memory usage (deep=True counts the actual string contents of object columns)
print(f"Initial memory usage: {round(df.memory_usage(deep=True).sum() / 1024 / 1024, 2)} MB")
print(df.dtypes)
print(df.memory_usage(deep=True))
This results in the following output:
Initial memory usage: 80.52 MB
timestamp datetime64[ns]
side object
price float64
size int64
Index 132
timestamp 8000000
side 60499953
price 8000000
size 8000000
As we can see, all the values are stored in 64-bit formats, occupying 8 bytes per value. The side column occupies much more space because an object column can hold arbitrary strings: pandas keeps a full Python string object for every one of the million 'buy' and 'sell' values.
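The overhead is easy to reproduce on a single column. Note that memory_usage only counts the string contents when deep=True is passed; without it, only the 8-byte pointers are counted:

```python
import sys
import pandas as pd

# Each cell of an object column points to a full Python string object
print(sys.getsizeof('buy'))  # roughly 52 bytes on CPython, plus an 8-byte pointer in the array

s = pd.Series(['buy', 'sell'] * 500_000)
print(s.memory_usage())           # pointers only, about 8 MB
print(s.memory_usage(deep=True))  # including the strings, about 60 MB
```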
Improvement #
This can be solved by simply converting the column to better dtypes:
df['side'] = df['side'].astype('category')
df['price'] = df['price'].astype('float32')
df['size'] = df['size'].astype('int32')
This immediately reduces the memory usage to 16.21 MB:
Memory usage after: 16.21 MB
Index 132
timestamp 8000000
side 1000229
price 4000000
size 4000000
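You can also set the better dtypes directly when loading the data, so the 64-bit versions are never allocated at all. A sketch using pd.read_csv, with io.StringIO standing in for a hypothetical trades.csv file:

```python
import io
import pandas as pd

# In practice you would pass a file path; StringIO stands in for trades.csv here
csv = io.StringIO(
    "timestamp,side,price,size\n"
    "2024-08-09 00:00:01,buy,9.99,10\n"
    "2024-08-09 00:00:02,sell,9.98,5\n"
)
df = pd.read_csv(
    csv,
    dtype={'side': 'category', 'price': 'float32', 'size': 'int32'},
    parse_dates=['timestamp'],
)
print(df.dtypes)
```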
The ability to work efficiently with Pandas is essential for data science. In trading, larger datasets can lead to more precise models and give you an edge in the market.
New posts are announced on twitter @jan_skoda or @crypto_lake_com, so follow us!