
How to fix MemoryError in Jupyter Notebook for large datasets

You run a cell in Jupyter, your fan goes crazy, then: MemoryError or a dead kernel. Let’s walk through practical ways to keep your notebook alive when working with large data.

Step 1 – Confirm it’s really a memory issue

On most machines you can quickly check memory usage with psutil (pip install psutil if it isn't available):

import psutil

mem = psutil.virtual_memory()
print(f"Total: {mem.total/1024**3:.2f} GB")
print(f"Used:  {mem.used/1024**3:.2f} GB")
print(f"Free:  {mem.available/1024**3:.2f} GB")

Step 2 – Avoid loading everything at once

The classic trap is doing:

import pandas as pd
df = pd.read_csv("huge.csv")   # raises MemoryError on a big file

Instead, use chunked reading:

import pandas as pd

reader = pd.read_csv("huge.csv", chunksize=100_000)
rows = 0

for chunk in reader:
    # process chunk here (aggregate, filter, feature engineering)
    rows += len(chunk)

print("Processed rows:", rows)

Step 3 – Work with a sample first

For exploratory analysis (EDA) and prototyping, a sample is often enough:

import pandas as pd

sample = pd.read_csv("huge.csv", nrows=50_000)
sample.head()
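
If the first rows aren't representative (say the file is sorted by date), a rough alternative is a probabilistic row skip. This sketch keeps roughly 1% of rows; the exact count will vary from run to run:

import random
import pandas as pd

# Skip each data row with 99% probability; row 0 (the header) is always kept
sample = pd.read_csv(
    "huge.csv",
    skiprows=lambda i: i > 0 and random.random() > 0.01,
)
print(len(sample))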

Step 4 – Optimize dtypes

Use smaller integer / float types to cut memory usage:

for col in sample.select_dtypes(include="int64").columns:
    sample[col] = pd.to_numeric(sample[col], downcast="integer")

for col in sample.select_dtypes(include="float64").columns:
    sample[col] = pd.to_numeric(sample[col], downcast="float")
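
Repetitive string columns are usually the biggest win: converting them to category stores each distinct value only once. A sketch that also compares the footprint before and after; the 50% uniqueness threshold is an arbitrary heuristic:

before = sample.memory_usage(deep=True).sum()

for col in sample.select_dtypes(include="object").columns:
    # Only convert columns whose values repeat a lot
    if sample[col].nunique() / len(sample) < 0.5:
        sample[col] = sample[col].astype("category")

after = sample.memory_usage(deep=True).sum()
print(f"{before/1024**2:.1f} MB -> {after/1024**2:.1f} MB")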

Step 5 – Clear variables and restart smarter

In notebooks, old variables stick around. Use:

%reset -f   # clears variables in current kernel

Or manually delete large objects:

del df
import gc
gc.collect()
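
Before reaching for a full restart, it can help to see which variables are actually big. A rough sketch; sys.getsizeof undercounts nested containers, so treat the numbers as lower bounds:

import sys

# Rough sizes of top-level notebook variables, largest first
sizes = {
    name: sys.getsizeof(obj)
    for name, obj in globals().items()
    if not name.startswith("_")
}
for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{name:20} {size/1024**2:8.2f} MB")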

Step 6 – Keep plots reasonable

Very large heatmaps or pairplots can explode memory. Try plotting only a subset of rows/columns or aggregate first.
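
For example, instead of a pairplot over every row, one sketch is a correlation heatmap computed on a capped sample of the numeric columns (this assumes seaborn and matplotlib are installed and that sample is the DataFrame from Step 3):

import seaborn as sns
import matplotlib.pyplot as plt

# Correlate numeric columns on at most 10,000 rows instead of plotting every point
subset = sample.select_dtypes("number").sample(n=min(10_000, len(sample)))
sns.heatmap(subset.corr(), cmap="coolwarm", center=0)
plt.show()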

Reminder: Jupyter Notebooks are great for exploration, but for production-scale data pipelines, consider moving heavy lifting into scripts or jobs that don’t depend on the browser.

Snapshot your working notebook with NoteCapsule

After you’ve finally built a memory-safe workflow, capture it as a reproducible Capsule so you don’t lose this “known good” state.

Create a Capsule from your Jupyter project
pip install "git+https://github.com/YOUR_GITHUB_USERNAME/notecapsule.git@main"

from notecapsule import create_capsule

create_capsule(
    name="jupyter-large-data-safe",
    notebook_path="analysis.ipynb",
    data_dirs=["./data"],
    base_dir="."
)
Commit the Capsule folder to Git or share it as a zip for grading, viva, or collaboration.