Large datasets in Google Colab – how to avoid out-of-memory crashes
Many Colab sessions crash when students try to load a 4M-row CSV, a big image dataset, or a dense heatmap into memory. The trick is to treat Colab as a limited laptop, not a big server.
Understand Colab’s memory limits
Free Colab typically gives you around 12 GB of RAM, and the exact amount varies by session and runtime type. You can check with:
import psutil
ram_gb = psutil.virtual_memory().total / (1024**3)
print(f"RAM: {ram_gb:.2f} GB")
Tip 1 – Use sampling for exploratory analysis
Instead of loading all rows, sample a subset for prototyping:
import pandas as pd
df_iter = pd.read_csv("data.csv", chunksize=100_000)
sample = next(df_iter).sample(5000, random_state=42)
sample.head()
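Note that next(df_iter) only draws the sample from the first 100,000 rows. If you want a sample spread across the whole file, read_csv also accepts a callable for skiprows. A rough sketch, where the 1% keep rate is an assumption you should tune to your file size:
import random
import pandas as pd

random.seed(42)
# Keep the header (row 0) and roughly 1% of the remaining rows
sample_all = pd.read_csv(
    "data.csv",
    skiprows=lambda i: i > 0 and random.random() > 0.01,
)
print(len(sample_all))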
Once your code works on the sample, move to chunked processing for the full dataset.
Tip 2 – Load CSVs in chunks instead of all at once
import pandas as pd
chunks = pd.read_csv("data.csv", chunksize=100_000)
total_rows = 0
for chunk in chunks:
    # do your processing here, e.g. aggregate, clean, feature engineer
    total_rows += len(chunk)
print("Processed rows:", total_rows)
This way, only one chunk is in memory at a time.
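For example, a per-group aggregation can be accumulated chunk by chunk and combined at the end. The column names below ("category", "value") are made up; replace them with your own:
import pandas as pd

partial_sums = []
for chunk in pd.read_csv("data.csv", chunksize=100_000):
    # Keep only a tiny per-chunk summary in memory
    partial_sums.append(chunk.groupby("category")["value"].sum())

# Combine the per-chunk summaries into one final result
totals = pd.concat(partial_sums).groupby(level=0).sum()
print(totals.head())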
Tip 3 – Don’t build huge heatmaps or full correlation matrices
Plots like sns.heatmap(dataset.isnull()) rendered over millions of cells can crash Colab.
Prefer:
- Sampling rows and columns (a correlation example is sketched below).
- Aggregating first (e.g., null counts per column).
import seaborn as sns
import matplotlib.pyplot as plt
# Instead of massive heatmap, just show null counts
null_counts = sample.isnull().sum().sort_values(ascending=False)
sns.barplot(x=null_counts.values, y=null_counts.index)
plt.tight_layout()
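If you really need a correlation heatmap, compute it on a sampled subset of rows and only on numeric columns. A sketch that reuses the sample dataframe from Tip 1:
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation of numeric columns on the small sample, not the full dataset
numeric_sample = sample.select_dtypes(include="number")
corr = numeric_sample.corr()
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.tight_layout()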
Tip 4 – Use efficient dtypes and drop useless columns
Convert columns to smaller dtypes where possible:
df = pd.read_csv("data.csv")
# Convert integers
for col in df.select_dtypes(include="int64").columns:
df[col] = pd.to_numeric(df[col], downcast="integer")
# Convert floats
for col in df.select_dtypes(include="float64").columns:
df[col] = pd.to_numeric(df[col], downcast="float")
Drop columns you do not need for your model:
cols_to_drop = ["raw_text", "description_html"]
df = df.drop(columns=cols_to_drop)
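Even better, apply both ideas at read time so the full-width, full-precision dataframe never exists in memory. The column names and dtypes below are placeholders:
import pandas as pd

# Read only the columns you need, with compact dtypes, straight from disk
df = pd.read_csv(
    "data.csv",
    usecols=["user_id", "price", "quantity", "country"],
    dtype={"user_id": "int32", "price": "float32",
           "quantity": "int16", "country": "category"},
)
print(df.dtypes)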
Tip 5 – Use external storage for very large data
For truly large datasets (tens of GBs), consider:
- BigQuery or another cloud database, querying only the subset you need (see the sketch after this list).
- Pre-aggregating data offline, then using a smaller file in Colab.
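For example, Colab can authenticate against Google Cloud and pull only an aggregated slice from BigQuery. The project, dataset, and table names below are placeholders:
from google.colab import auth
from google.cloud import bigquery

auth.authenticate_user()  # opens Colab's Google auth flow
client = bigquery.Client(project="your-gcp-project")

query = """
    SELECT country, COUNT(*) AS n_rows, AVG(price) AS avg_price
    FROM `your-gcp-project.your_dataset.big_table`
    GROUP BY country
"""
# Only the small, pre-aggregated result ever reaches Colab's RAM
small_df = client.query(query).to_dataframe()
print(small_df.head())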
Protect your working large-dataset notebook with NoteCapsule
Getting a large-dataset pipeline stable in Colab is painful. Once it works, snapshot it so you can always roll back to a known-good version.
!pip install "git+https://github.com/YOUR_GITHUB_USERNAME/notecapsule.git@main"
from notecapsule import create_capsule
create_capsule(
    name="large-dataset-pipeline",
    notebook_path="notebook.ipynb",
    data_dirs=["./"],
    base_dir="."
)
NoteCapsule will record the exact notebook, data files (with hashes), and environment so you don't have to guess what worked later.