How to handle Google Colab disconnects for long-running training
Google Colab is a lifesaver when you don’t have a GPU locally. But if you’ve tried training anything non-trivial, you’ve probably seen this:
- runtime disconnects after some hours,
- long epochs getting cut off halfway,
- losing progress when the notebook tab is idle.
In this guide, we’ll cover realistic ways to live with Colab’s limits: **saving progress**, **resuming training**, and protecting your notebook + project state so disconnects don’t kill you.
Understand Colab’s runtime limits
Colab gives you a free (or Pro) VM with a GPU, but:
- Sessions are temporary – they reset after inactivity or a fixed max time.
- Installed packages and in-RAM objects disappear when the runtime restarts.
- Files stored under /content (without Drive) vanish after a reset.
Strategy 1: Save model checkpoints during training
1. Save weights after each epoch (Keras example)
import os

from tensorflow.keras.callbacks import ModelCheckpoint

# Make sure the checkpoint directory exists before the first save.
os.makedirs("checkpoints", exist_ok=True)

# Save the weights after every epoch, tagging each file with the epoch number
# and validation loss. (Newer Keras releases expect the filename to end in
# ".weights.h5" when save_weights_only=True.)
checkpoint_cb = ModelCheckpoint(
    "checkpoints/model_epoch_{epoch:02d}_val{val_loss:.3f}.h5",
    save_weights_only=True,
    save_best_only=False
)

history = model.fit(
    train_dataset,
    epochs=20,
    validation_data=val_dataset,
    callbacks=[checkpoint_cb]
)
If Colab disconnects after, say, epoch 7, you can reload the last checkpoint and continue from there in a fresh runtime.
2. Store checkpoints in Google Drive
Make sure you’re saving to a Drive-mounted path so checkpoints survive runtime reset:
from google.colab import drive

drive.mount('/content/drive')  # Colab will prompt you to authorize access

# Anything written under this path survives runtime resets.
CHECKPOINT_DIR = "/content/drive/MyDrive/colab-checkpoints/"
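As a sketch (reusing the CHECKPOINT_DIR above and the Keras callback from Strategy 1), pointing the callback at the Drive-backed folder is just a path change:

import os

from tensorflow.keras.callbacks import ModelCheckpoint

os.makedirs(CHECKPOINT_DIR, exist_ok=True)

# Same callback as before, but the files now land in Drive and survive resets.
drive_checkpoint_cb = ModelCheckpoint(
    os.path.join(CHECKPOINT_DIR, "model_epoch_{epoch:02d}.h5"),
    save_weights_only=True,
    save_best_only=False,
)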
Strategy 2: Resume training after a disconnect
1. Reload model and optimizer state
For Keras, if you saved the full model (rather than weights only, as in the callback above), load_model restores architecture, weights, and optimizer state in one call:
from tensorflow.keras.models import load_model
model = load_model("checkpoints/best_model.h5")
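From there you can pick up the run with model.fit and its initial_epoch argument. A minimal sketch, assuming the datasets and checkpoint_cb from Strategy 1 have been recreated in the fresh runtime and that the last completed epoch was 7:

last_epoch = 7  # epoch number of the newest checkpoint (1-based)

history = model.fit(
    train_dataset,
    epochs=20,                 # total target, not "20 more epochs"
    initial_epoch=last_epoch,  # training continues with epoch 8
    validation_data=val_dataset,
    callbacks=[checkpoint_cb],
)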
For PyTorch, you often save a dict with model + optimizer + epoch:
import torch

# Rebuild the model and optimizer objects first, then restore their state.
checkpoint = torch.load(PATH, map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1
Then continue training from start_epoch. The key is to **always** save enough information
to rebuild your training state.
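For reference, the matching save call inside the training loop looks roughly like this (a sketch assuming an epoch-indexed loop and the same dictionary keys as above):

# At the end of each epoch (or every N steps for very long epochs):
torch.save(
    {
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    PATH,  # ideally a Drive-backed path, e.g. under CHECKPOINT_DIR
)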
2. Log your hyperparameters and config
Keep a small config dict (or YAML/JSON file) with learning rates, batch sizes, and other hyperparameters. This makes resuming or reproducing runs much easier later.
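For example, a few lines like these (a sketch; the file name, keys, and values are placeholders for your own settings) are enough to rebuild the exact run later:

import json

config = {
    "learning_rate": 3e-4,
    "batch_size": 64,
    "epochs": 20,
    "optimizer": "adam",
}

# Keep the config next to the checkpoints so it survives a reset too.
with open("/content/drive/MyDrive/colab-checkpoints/config.json", "w") as f:
    json.dump(config, f, indent=2)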
Strategy 3: Make your Colab projects resilient, not fragile
Beyond model checkpoints, disconnects are a good reminder that your **entire project** needs to be resilient: notebook, data paths, dependencies, and code.
1. Keep everything under a single project folder in Drive
/content/drive/MyDrive/projects/my-long-run/
    notebook.ipynb
    data/
    checkpoints/
    capsules/
This way, even if Colab resets, your core project files are safe in Drive.
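If you prefer to set this up from the notebook itself, a few lines will do (a sketch, assuming the project path shown above):

import os

PROJECT_DIR = "/content/drive/MyDrive/projects/my-long-run/"

# Create the standard subfolders if they don't exist yet.
for sub in ["data", "checkpoints", "capsules"]:
    os.makedirs(os.path.join(PROJECT_DIR, sub), exist_ok=True)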
2. Use NoteCapsule to snapshot working states
At key milestones (e.g. “baseline training runs to completion”, “best validation accuracy so far”), use NoteCapsule to create a Capsule – a snapshot of your project state.
!pip install notebookcapsule -q
from notebookcapsule import create_capsule

create_capsule(
    name="longrun_epoch_10_checkpoint",
    data_dirs=["./data", "./checkpoints"]
)
This builds a folder with:
- your notebook,
- a suggested requirements file from your imports,
- a data_manifest.json including checkpoint paths,
- a README template so you (or others) can recreate the run later.
Strategy 4: Break huge runs into resumable chunks
Instead of planning a 7-day training run, design your training so you can:
- train for a few epochs at a time,
- save checkpoints frequently,
- evaluate and adjust before the next chunk.
This is good ML practice anyway – Colab’s limits just force you to behave sensibly.
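One way to structure such chunks, sketched here in PyTorch under the same assumptions as Strategy 2 (a Drive-backed CHECKPOINT_DIR, and a train_one_epoch helper standing in for your own training step), is a cell that always resumes from the newest checkpoint and only runs a few epochs per session:

import glob
import os

import torch

CHECKPOINT_DIR = "/content/drive/MyDrive/colab-checkpoints/"
EPOCHS_PER_SESSION = 3  # small chunks; rerun this cell in the next session

# Resume from the newest checkpoint, if any exist.
start_epoch = 0
checkpoints = sorted(glob.glob(os.path.join(CHECKPOINT_DIR, "chunk_*.pt")))
if checkpoints:
    ckpt = torch.load(checkpoints[-1], map_location="cpu")
    model.load_state_dict(ckpt["model_state_dict"])
    optimizer.load_state_dict(ckpt["optimizer_state_dict"])
    start_epoch = ckpt["epoch"] + 1

for epoch in range(start_epoch, start_epoch + EPOCHS_PER_SESSION):
    train_one_epoch(model, optimizer, train_loader)  # your own training step
    torch.save(
        {
            "epoch": epoch,
            "model_state_dict": model.state_dict(),
            "optimizer_state_dict": optimizer.state_dict(),
        },
        os.path.join(CHECKPOINT_DIR, f"chunk_{epoch:04d}.pt"),
    )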
Quick checklist for surviving Colab disconnects
- ✅ Mount Google Drive and save data + checkpoints there, not just under /content.
- ✅ Use framework-specific checkpointing (Keras callbacks, PyTorch state_dict, etc.).
- ✅ Reload model + optimizer state in a fresh runtime before resuming training.
- ✅ Keep project files in a single Drive folder for stability.
- ✅ Use NoteCapsule Capsules to snapshot working project states at key milestones.
Want your Colab projects to survive disconnects?
NoteCapsule gives you a simple create_capsule(...) call to capture your notebook, dependencies, data layout, and checkpoints – so Colab resets are annoying, not catastrophic.
Sign up from the homepage and we’ll send you a setup guide plus an example Capsule tailored for Colab training workflows.