Checkpointing Integration Guide

CheckpointManager provides save/restore of model state, optimizer state, LR scheduler, tokenizer, and training configuration. It supports keep-last and keep-best rotation policies to bound disk usage.

Basic Usage

from pathlib import Path
from codex_ml.utils.checkpointing import CheckpointManager

mgr = CheckpointManager(Path("output/checkpoints"), keep_last=5, keep_best=1)

Save a checkpoint after each epoch:

mgr.save(
    epoch=epoch,
    model=model,
    optimizer=optimizer,
    scheduler=scheduler,
    tokenizer=tokenizer,
    config=config,
    metrics={"val_loss": val_loss},
)

Resume from a saved checkpoint by passing its directory:

info = mgr.resume_from(
    Path("output/checkpoints/epoch-10"), model, optimizer, scheduler
)
print(f"Resumed from epoch {info['epoch']}")

CLI Flags

Add these flags to your training entry-point and wire them to CheckpointManager:

Flag              Default             Description
--checkpoint-dir  output/checkpoints  Where checkpoints are saved
--resume-from     None                Path to a checkpoint to resume from
--keep-last       5                   Keep the N most-recent checkpoints
--keep-best       1                   Keep the N best checkpoints by metric
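
A minimal wiring sketch using argparse; the flag names and defaults match the table above, while model, optimizer, and scheduler are assumed to come from your own setup code:

import argparse
from pathlib import Path

from codex_ml.utils.checkpointing import CheckpointManager

parser = argparse.ArgumentParser()
parser.add_argument("--checkpoint-dir", type=Path, default=Path("output/checkpoints"),
                    help="Where checkpoints are saved")
parser.add_argument("--resume-from", type=Path, default=None,
                    help="Path to a checkpoint to resume from")
parser.add_argument("--keep-last", type=int, default=5,
                    help="Keep the N most-recent checkpoints")
parser.add_argument("--keep-best", type=int, default=1,
                    help="Keep the N best checkpoints by metric")
args = parser.parse_args()

mgr = CheckpointManager(args.checkpoint_dir,
                        keep_last=args.keep_last, keep_best=args.keep_best)
if args.resume_from is not None:
    info = mgr.resume_from(args.resume_from, model, optimizer, scheduler)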

Rotation Policy

When keep_last=5, checkpoints older than the 5 most recent are deleted automatically. keep_best=1 retains the single lowest-loss checkpoint regardless of age.
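
For example, with the manager configured above (keep_last=5, keep_best=1), a hypothetical seven-epoch run plays out as follows; the loss values are made up for illustration:

# Hypothetical validation losses; epoch 2 happens to be the best.
losses = {1: 0.9, 2: 0.3, 3: 0.8, 4: 0.7, 5: 0.6, 6: 0.5, 7: 0.4}
for epoch, val_loss in losses.items():
    mgr.save(epoch=epoch, model=model, optimizer=optimizer,
             scheduler=scheduler, tokenizer=tokenizer, config=config,
             metrics={"val_loss": val_loss})
# After epoch 7: epochs 3-7 survive under keep_last=5, and epoch 2 is
# also retained because it has the lowest val_loss (keep_best=1).
# Only epoch 1 has been deleted.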