How to load Hugging Face datasets in Google Colab (without errors)
The Hugging Face datasets library is the easiest way to get popular NLP and CV datasets,
but in Google Colab it often runs into confusing errors, slow downloads, or out-of-memory crashes.
Let’s fix that with a clean, reproducible pattern.
Quick answer
In most cases you only need to:
!pip install -q datasets
from datasets import load_dataset
ds = load_dataset("imdb") # or your dataset name
print(ds)
print(ds["train"][0])
But on Colab, you should also:
- Pin versions of datasets and transformers.
- Store datasets in Google Drive so they survive runtime resets.
- Use streaming or subsets for very large datasets.
Step 1 – Install the right libraries in Colab
Use a dedicated cell at the top of your notebook:
!pip install -q "datasets>=2.18.0" "transformers>=4.40.0"
Run this, then restart the runtime if Colab asks. Avoid mixing very old and very new versions in the same notebook.
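After the restart, it is worth confirming which versions the runtime is actually using, for example:
import datasets
import transformers

# Both libraries expose __version__, so you can verify the pins took effect
print("datasets:", datasets.__version__)
print("transformers:", transformers.__version__)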
Step 2 – Load a small dataset first (sanity check)
Test with something small like IMDB:
from datasets import load_dataset
ds = load_dataset("imdb")
print(ds)
print("Train size:", len(ds["train"]))
print("Example:", ds["train"][0])
Step 3 – Cache datasets in Google Drive (so they persist)
By default, Hugging Face caches data under /root/.cache/huggingface, which is wiped whenever the Colab runtime resets.
- Mount Drive:
from google.colab import drive
drive.mount("/content/drive")
- Tell datasets to use a cache folder inside Drive:
import os
os.environ["HF_DATASETS_CACHE"] = "/content/drive/MyDrive/hf_datasets_cache"
Now future runs will reuse the downloaded dataset, even if Colab restarts.
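One caveat: datasets typically reads HF_DATASETS_CACHE when the library is first imported, so set the environment variable before importing datasets (or restart the runtime after setting it). If you prefer to be explicit, load_dataset also accepts a cache_dir argument; a minimal sketch, assuming the same Drive folder as above:
import os
from google.colab import drive
from datasets import load_dataset

drive.mount("/content/drive")
cache_dir = "/content/drive/MyDrive/hf_datasets_cache"
os.makedirs(cache_dir, exist_ok=True)  # make sure the folder exists on the first run

# cache_dir points this call at the Drive folder, independent of any env vars
ds = load_dataset("imdb", cache_dir=cache_dir)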
Step 4 – Use subsets & streaming for large datasets
Loading an entire large dataset into RAM can crash Colab. Two safer options:
Load only a split
ds = load_dataset("ag_news", split="train[:10%]") # first 10% of train
Use streaming (don’t load all into RAM)
ds_stream = load_dataset("allenai/c4", "en", split="train", streaming=True)  # the C4 corpus lives at allenai/c4 on the Hub
for i, example in enumerate(ds_stream):
    print(example)
    if i == 3:
        break
Streaming is great when you just need to iterate, not hold everything in memory.
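If you want a fixed-size sample from a stream instead of a manual loop, streamed datasets also support shuffle (buffer-based) and take; a short sketch, with the buffer size and sample count chosen arbitrarily:
from datasets import load_dataset

small_stream = (
    load_dataset("allenai/c4", "en", split="train", streaming=True)
    .shuffle(seed=42, buffer_size=10_000)  # approximate shuffle via a buffer
    .take(1_000)                           # keep only the first 1,000 shuffled examples
)

texts = [example["text"] for example in small_stream]  # materialize just this small sample
print(len(texts))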
Common errors & how to fix them
1. ModuleNotFoundError: No module named 'datasets'
- You didn’t run the pip install cell, or you ran it after importing.
- Solution: run the install cell, then restart the runtime and import again.
2. SSL / network errors when downloading
- Retry the cell after a minute – sometimes it's just Colab’s network.
- If it keeps failing, download the dataset on your local machine and upload it to Drive, but that’s rarely needed.
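If you would rather retry automatically than re-run the cell by hand, a small wrapper works; this is only a sketch, and the retry count and delay are arbitrary:
import time
from datasets import load_dataset

def load_with_retries(*args, retries=3, delay=30, **kwargs):
    """Retry load_dataset a few times to ride out transient Colab network hiccups."""
    for attempt in range(1, retries + 1):
        try:
            return load_dataset(*args, **kwargs)
        except Exception as err:  # network/SSL failures vary, so catch broadly here
            if attempt == retries:
                raise
            print(f"Attempt {attempt} failed ({err}); retrying in {delay}s...")
            time.sleep(delay)

ds = load_with_retries("imdb")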
3. Out-of-memory when converting to pandas / NumPy
The Dataset object is memory-mapped from disk, so it uses little RAM on its own; converting everything to one huge in-memory DataFrame is what kills the runtime:
# Risky: loads everything
df = ds["train"].to_pandas()
Instead, sample or iterate in batches:
sample = ds["train"].shuffle(seed=42).select(range(5000)) # 5k rows
df = sample.to_pandas()
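If you really do need to touch every row, iterate in fixed-size batches instead of converting the whole split at once; a minimal sketch using Dataset.iter (the batch size here is arbitrary):
import pandas as pd

# Process the train split 1,000 rows at a time instead of building one giant DataFrame
for batch in ds["train"].iter(batch_size=1000):
    batch_df = pd.DataFrame(batch)  # each batch is a dict of column name -> list of values
    # ... per-batch processing goes here ...
    break  # remove this break to process every batch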
Make your Hugging Face pipeline reproducible with NoteCapsule
After you finally get the dataset + preprocessing + training working in Colab, you should capture that working state so it doesn’t break right before submission.
In Colab:
!pip install "git+https://github.com/YOUR_GITHUB_USERNAME/notecapsule.git@main"
from notecapsule import create_capsule
create_capsule(
    name="huggingface-experiment-v1",
    notebook_path="your_notebook.ipynb",
    data_dirs=["./"],
    base_dir="."
)
This creates a Capsule folder (notebook + data manifest + environment) that you can download,
submit, or re-run later.