How to handle Google Colab disconnects for long-running training
Google Colab is a lifesaver when you don’t have a GPU locally. But if you’ve tried training anything non-trivial, you’ve probably seen this:
- runtime disconnects after some hours,
- long epochs getting cut off halfway,
- losing progress when the notebook tab is idle.
In this guide, we’ll cover realistic ways to live with Colab’s limits: **saving progress**, **resuming training**, and protecting your notebook + project state so disconnects don’t kill you.
Understand Colab’s runtime limits
Colab gives you a free (or Pro) VM with a GPU, but:
- Sessions are temporary – they reset after inactivity or a fixed max time.
- Installed packages and in-RAM objects disappear when the runtime restarts.
- Files stored under /content (without Drive) vanish after a reset.
Strategy 1: Save model checkpoints during training
1. Save weights after each epoch (Keras example)
import os

from tensorflow.keras.callbacks import ModelCheckpoint

# Make sure the checkpoint directory exists before the first save.
os.makedirs("checkpoints", exist_ok=True)

# Save the weights after every epoch, tagging each file with the epoch number
# and validation loss. (Newer Keras releases expect the filename to end in
# ".weights.h5" when save_weights_only=True.)
checkpoint_cb = ModelCheckpoint(
    "checkpoints/model_epoch_{epoch:02d}_val{val_loss:.3f}.h5",
    save_weights_only=True,
    save_best_only=False
)

history = model.fit(
    train_dataset,
    epochs=20,
    validation_data=val_dataset,
    callbacks=[checkpoint_cb]
)
If Colab disconnects after, say, epoch 7, you can reload the last checkpoint and continue from there in a fresh runtime.
2. Store checkpoints in Google Drive
Make sure you’re saving to a Drive-mounted path so checkpoints survive runtime reset:
from google.colab import drive

drive.mount('/content/drive')  # Colab will prompt you to authorize access

# Anything written under this path survives runtime resets.
CHECKPOINT_DIR = "/content/drive/MyDrive/colab-checkpoints/"
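As a sketch (reusing the CHECKPOINT_DIR above and the Keras callback from Strategy 1), pointing the callback at the Drive-backed folder is just a path change:

import os

from tensorflow.keras.callbacks import ModelCheckpoint

os.makedirs(CHECKPOINT_DIR, exist_ok=True)

# Same callback as before, but the files now land in Drive and survive resets.
drive_checkpoint_cb = ModelCheckpoint(
    os.path.join(CHECKPOINT_DIR, "model_epoch_{epoch:02d}.h5"),
    save_weights_only=True,
    save_best_only=False,
)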
Strategy 2: Resume training after a disconnect
1. Reload model and optimizer state
For Keras, if you saved the full model (rather than weights only, as in the callback above), load_model restores architecture, weights, and optimizer state in one call:
from tensorflow.keras.models import load_model
model = load_model("checkpoints/best_model.h5")
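From there you can pick up the run with model.fit and its initial_epoch argument. A minimal sketch, assuming the datasets and checkpoint_cb from Strategy 1 have been recreated in the fresh runtime and that the last completed epoch was 7:

last_epoch = 7  # epoch number of the newest checkpoint (1-based)

history = model.fit(
    train_dataset,
    epochs=20,                 # total target, not "20 more epochs"
    initial_epoch=last_epoch,  # training continues with epoch 8
    validation_data=val_dataset,
    callbacks=[checkpoint_cb],
)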
For PyTorch, you often save a dict with model + optimizer + epoch:
import torch

# Rebuild the model and optimizer objects first, then restore their state.
checkpoint = torch.load(PATH, map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1
Then continue training from start_epoch. The key is to **always** save enough information
to rebuild your training state.
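For reference, the matching save call inside the training loop looks roughly like this (a sketch assuming an epoch-indexed loop and the same dictionary keys as above):

# At the end of each epoch (or every N steps for very long epochs):
torch.save(
    {
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    PATH,  # ideally a Drive-backed path, e.g. under CHECKPOINT_DIR
)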
2. Log your hyperparameters and config
Keep a small config dict (or YAML/JSON file) with learning rates, batch sizes, and other hyperparameters. This makes resuming or reproducing runs much easier later.
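For example, a few lines like these (a sketch; the file name, keys, and values are placeholders for your own settings) are enough to rebuild the exact run later:

import json

config = {
    "learning_rate": 3e-4,
    "batch_size": 64,
    "epochs": 20,
    "optimizer": "adam",
}

# Keep the config next to the checkpoints so it survives a reset too.
with open("/content/drive/MyDrive/colab-checkpoints/config.json", "w") as f:
    json.dump(config, f, indent=2)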
Strategy 3: Make your Colab projects resilient, not fragile
Beyond model checkpoints, disconnects are a good reminder that your **entire project** needs to be resilient: notebook, data paths, dependencies, and code.
1. Keep everything under a single project folder in Drive
/content/drive/MyDrive/projects/my-long-run/
    notebook.ipynb
    data/
    checkpoints/
    capsules/
This way, even if Colab resets, your core project files are safe in Drive.
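If you prefer to set this up from the notebook itself, a few lines will do (a sketch, assuming the project path shown above):

import os

PROJECT_DIR = "/content/drive/MyDrive/projects/my-long-run/"

# Create the standard subfolders if they don't exist yet.
for sub in ["data", "checkpoints", "capsules"]:
    os.makedirs(os.path.join(PROJECT_DIR, sub), exist_ok=True)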
2. Use NoteCapsule to snapshot working states
At key milestones (e.g. “baseline training runs to completion”, “best validation accuracy so far”), use NoteCapsule to create a Capsule – a snapshot of your project state.
!pip install notebookcapsule -q
from notebookcapsule import create_capsule

create_capsule(
    name="longrun_epoch_10_checkpoint",
    data_dirs=["./data", "./checkpoints"]
)
This builds a folder with:
- your notebook,
- a suggested requirements file from your imports,
- a data_manifest.json including checkpoint paths,
- a README template so you (or others) can recreate the run later.
Strategy 4: Break huge runs into resumable chunks
Instead of planning a 7-day training run, design your training so you can:
- train for a few epochs at a time,
- save checkpoints frequently,
- evaluate and adjust before the next chunk.
This is good ML practice anyway – Colab’s limits just force you to behave sensibly.
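One way to structure such chunks, sketched here in PyTorch under the same assumptions as Strategy 2 (a Drive-backed CHECKPOINT_DIR, and a train_one_epoch helper standing in for your own training step), is a cell that always resumes from the newest checkpoint and only runs a few epochs per session:

import glob
import os

import torch

CHECKPOINT_DIR = "/content/drive/MyDrive/colab-checkpoints/"
EPOCHS_PER_SESSION = 3  # small chunks; rerun this cell in the next session

# Resume from the newest checkpoint, if any exist.
start_epoch = 0
checkpoints = sorted(glob.glob(os.path.join(CHECKPOINT_DIR, "chunk_*.pt")))
if checkpoints:
    ckpt = torch.load(checkpoints[-1], map_location="cpu")
    model.load_state_dict(ckpt["model_state_dict"])
    optimizer.load_state_dict(ckpt["optimizer_state_dict"])
    start_epoch = ckpt["epoch"] + 1

for epoch in range(start_epoch, start_epoch + EPOCHS_PER_SESSION):
    train_one_epoch(model, optimizer, train_loader)  # your own training step
    torch.save(
        {
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        },
        os.path.join(CHECKPOINT_DIR, f"chunk_{epoch:04d}.pt"),
    )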
Quick checklist for surviving Colab disconnects
- ✅ Mount Google Drive and save data + checkpoints there, not just under /content.
- ✅ Use framework-specific checkpointing (Keras callbacks, PyTorch state_dict, etc.).
- ✅ Reload model + optimizer state in a fresh runtime before resuming training.
- ✅ Keep project files in a single Drive folder for stability.
- ✅ Use NoteCapsule Capsules to snapshot working project states at key milestones.
Want your Colab projects to survive disconnects?
NoteCapsule gives you a simple create_capsule(...) call to capture your notebook, dependencies, data layout, and checkpoints – so Colab resets are annoying, not catastrophic.
Sign up from the homepage and we’ll send you a setup guide plus an example Capsule tailored for Colab training workflows.