Large datasets in Google Colab – how to avoid out-of-memory crashes
Many Colab sessions crash when students try to load a 4M-row CSV, a big image dataset, or a dense heatmap into memory. The trick is to treat Colab as a limited laptop, not a big server.
Understand Colab’s memory limits
Free Colab typically gives you around 12 GB of RAM, and the exact amount varies by session and runtime type. You can check with:
import psutil
ram_gb = psutil.virtual_memory().total / (1024**3)
print(f"RAM: {ram_gb:.2f} GB")
Tip 1 – Use sampling for exploratory analysis
Instead of loading all rows, sample a subset for prototyping:
import pandas as pd
df_iter = pd.read_csv("data.csv", chunksize=100_000)
sample = next(df_iter).sample(5000, random_state=42)
sample.head()
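Note that next(df_iter) only draws the sample from the first 100,000 rows. If you want a sample spread across the whole file, read_csv also accepts a callable for skiprows. A rough sketch, where the 1% keep rate is an assumption you should tune to your file size:
import random
import pandas as pd

random.seed(42)
# Keep the header (row 0) and roughly 1% of the remaining rows
sample_all = pd.read_csv(
    "data.csv",
    skiprows=lambda i: i > 0 and random.random() > 0.01,
)
print(len(sample_all))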
Once your code works on the sample, move to chunked processing for the full dataset.
Tip 2 – Load CSVs in chunks instead of all at once
import pandas as pd
chunks = pd.read_csv("data.csv", chunksize=100_000)
total_rows = 0
for chunk in chunks:
    # do your processing here, e.g. aggregate, clean, feature engineer
    total_rows += len(chunk)
print("Processed rows:", total_rows)
This way, only one chunk is in memory at a time.
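For example, a per-group aggregation can be accumulated chunk by chunk and combined at the end. The column names below ("category", "value") are made up; replace them with your own:
import pandas as pd

partial_sums = []
for chunk in pd.read_csv("data.csv", chunksize=100_000):
    # Keep only a tiny per-chunk summary in memory
    partial_sums.append(chunk.groupby("category")["value"].sum())

# Combine the per-chunk summaries into one final result
totals = pd.concat(partial_sums).groupby(level=0).sum()
print(totals.head())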
Tip 3 – Don’t build huge heatmaps or full correlation matrices
Plots like sns.heatmap(dataset.isnull()) rendered over millions of cells can crash Colab.
Prefer:
- Sampling rows and columns (a correlation example is sketched below).
- Aggregating first (e.g., null counts per column).
import seaborn as sns
import matplotlib.pyplot as plt
# Instead of massive heatmap, just show null counts
null_counts = sample.isnull().sum().sort_values(ascending=False)
sns.barplot(x=null_counts.values, y=null_counts.index)
plt.tight_layout()
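If you really need a correlation heatmap, compute it on a sampled subset of rows and only on numeric columns. A sketch that reuses the sample dataframe from Tip 1:
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation of numeric columns on the small sample, not the full dataset
numeric_sample = sample.select_dtypes(include="number")
corr = numeric_sample.corr()
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.tight_layout()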
Tip 4 – Use efficient dtypes and drop useless columns
Convert columns to smaller dtypes where possible:
df = pd.read_csv("data.csv")
# Convert integers
for col in df.select_dtypes(include="int64").columns:
df[col] = pd.to_numeric(df[col], downcast="integer")
# Convert floats
for col in df.select_dtypes(include="float64").columns:
df[col] = pd.to_numeric(df[col], downcast="float")
Drop columns you do not need for your model:
cols_to_drop = ["raw_text", "description_html"]
df = df.drop(columns=cols_to_drop)
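Even better, apply both ideas at read time so the full-width, full-precision dataframe never exists in memory. The column names and dtypes below are placeholders:
import pandas as pd

# Read only the columns you need, with compact dtypes, straight from disk
df = pd.read_csv(
    "data.csv",
    usecols=["user_id", "price", "quantity", "country"],
    dtype={"user_id": "int32", "price": "float32",
           "quantity": "int16", "country": "category"},
)
print(df.dtypes)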
Tip 5 – Use external storage for very large data
For truly large datasets (tens of GBs), consider:
- BigQuery or another cloud database, querying only the subset you need (see the sketch after this list).
- Pre-aggregating data offline, then using a smaller file in Colab.
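For example, Colab can authenticate against Google Cloud and pull only an aggregated slice from BigQuery. The project, dataset, and table names below are placeholders:
from google.colab import auth
from google.cloud import bigquery

auth.authenticate_user()  # opens Colab's Google auth flow
client = bigquery.Client(project="your-gcp-project")

query = """
    SELECT country, COUNT(*) AS n_rows, AVG(price) AS avg_price
    FROM `your-gcp-project.your_dataset.big_table`
    GROUP BY country
"""
# Only the small, pre-aggregated result ever reaches Colab's RAM
small_df = client.query(query).to_dataframe()
print(small_df.head())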
Protect your working large-dataset notebook with NoteCapsule
Getting a large-dataset pipeline stable in Colab is painful. Once it works, snapshot it so you can always roll back to a known-good version.
!pip install "git+https://github.com/YOUR_GITHUB_USERNAME/notecapsule.git@main"
from notecapsule import create_capsule
create_capsule(
    name="large-dataset-pipeline",
    notebook_path="notebook.ipynb",
    data_dirs=["./"],
    base_dir="."
)
NoteCapsule will record the exact notebook, data files (with hashes), and environment so you don't have to guess what worked later.