First principles curricula¶
Curriculum-first training anchors the reasoning roadmap. This guide walks through how we design, stage, and validate curricula across the training, evaluation, and deployment pipelines.
Design principles¶
- Start from the target metric — Anchor every phase on the milestone gate (for example M1 win rate ≥0.55).
- Minimise hidden state — Curriculum YAML fragments must be diff-friendly; prefer declarative overrides to custom code.
- Embed observability — Each phase should emit trace markers (`phase_id`, `prompt_complexity`) for downstream dashboards (see the sketch after this list).
- Document fallback paths — Capture baselines in `../status_updates/` before experimenting.
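To make the observability principle concrete, the sketch below shows one way a phase could append trace markers as NDJSON records under `.codex/reasoning_runs/` (the directory referenced in Troubleshooting). This is a minimal sketch: the record schema and the `emit_trace_marker` helper are illustrative assumptions, not a fixed API.

```python
# Illustrative sketch: the directory matches the Troubleshooting section
# (.codex/reasoning_runs/), but the record schema and helper name are
# assumptions, not a fixed API.
import json
import time
from pathlib import Path

TRACE_DIR = Path(".codex/reasoning_runs")

def emit_trace_marker(run_id: str, phase_id: str, prompt_complexity: float) -> None:
    """Append one NDJSON trace marker for the current phase."""
    TRACE_DIR.mkdir(parents=True, exist_ok=True)
    record = {
        "ts": time.time(),
        "run_id": run_id,
        "phase_id": phase_id,              # e.g. "warmup", "first_principles"
        "prompt_complexity": prompt_complexity,
    }
    with (TRACE_DIR / f"{run_id}.ndjson").open("a") as fh:
        fh.write(json.dumps(record) + "\n")

emit_trace_marker("first-principles-demo", "warmup", 0.42)
```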
Curriculum blueprint¶
| Phase | Objective | Dataset preset | Signals |
|---|---|---|---|
| Warm-up | Stabilise reasoning traces | `datasets/reasoning/warmup.jsonl` | Trace coverage, loss |
| First principles | Teach decomposition heuristics | `datasets/reasoning/first_principles.jsonl` | Win rate, critique density |
| Challenge set | Stress bespoke behaviors | `datasets/reasoning/challenge.jsonl` | Latency deltas, judge disagreement |
Phase definitions live in `configs/training/reasoning/curricula/`. Each YAML file exports:
```yaml
phase_schedule:
  - id: warmup
    dataset: datasets/reasoning/warmup.jsonl
    steps: 200
  - id: first_principles
    dataset: datasets/reasoning/first_principles.jsonl
    steps: 400
  - id: challenge
    dataset: datasets/reasoning/challenge.jsonl
    steps: 300
```
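Because schedules are declarative, they are easy to sanity-check before launch. Below is a minimal validation sketch; it assumes only the `phase_schedule` shape shown above, while the `load_phase_schedule` helper, its checks, and the example file name are hypothetical.

```python
# Minimal schedule validation sketch (requires PyYAML). The checks, the
# helper name, and the example file name are illustrative assumptions
# layered on the phase_schedule shape shown above.
from pathlib import Path

import yaml

REQUIRED_KEYS = {"id", "dataset", "steps"}

def load_phase_schedule(path: str) -> list:
    """Parse a curriculum file and fail fast on malformed phases."""
    phases = yaml.safe_load(Path(path).read_text())["phase_schedule"]
    for phase in phases:
        missing = REQUIRED_KEYS - phase.keys()
        if missing:
            raise ValueError(f"phase {phase.get('id', '?')} missing {missing}")
        if phase["steps"] <= 0:
            raise ValueError(f"phase {phase['id']} has non-positive steps")
    return phases

phases = load_phase_schedule(
    "configs/training/reasoning/curricula/first_principles.yaml"
)
print(f"{len(phases)} phases, {sum(p['steps'] for p in phases)} total steps")
```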
Training pipeline¶
- Select a curriculum file (for example `curriculum=first_principles`).
- Launch training with explicit overrides:

  ```bash
  codex-train +reasoning=baseline \
    curriculum=first_principles \
    logging.reasoning_trace=true \
    training.output_dir=artifacts/runs/first-principles
  ```

- Monitor traces with `codex metrics summarize --metric reasoning.trace_coverage` (see the sketch after this list).
- Record qualitative notes (prompt outliers, judge disagreements) in `status_updates/first_principles.md`.
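For a quick local check between dashboard refreshes, trace coverage can be approximated straight from the marker files. This sketch assumes markers are NDJSON records with a `phase_id` field under `.codex/reasoning_runs/`; `codex metrics summarize` remains the canonical view.

```python
# Approximate reasoning.trace_coverage from raw trace markers. Assumes
# NDJSON records with a "phase_id" field under .codex/reasoning_runs/,
# which is an illustrative layout rather than a documented contract.
import json
from collections import Counter
from pathlib import Path

counts = Counter()
for trace_file in Path(".codex/reasoning_runs").glob("*.ndjson"):
    for line in trace_file.read_text().splitlines():
        record = json.loads(line)
        counts[record.get("phase_id", "<missing>")] += 1

total = sum(counts.values())
for phase_id, n in counts.most_common():
    print(f"{phase_id}: {n} markers ({n / total:.1%})")
```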
Evaluation pipeline¶
- Materialise evaluation packs:

  ```bash
  codex datasets materialize --preset reasoning/first_principles
  ```

- Run evaluators with curriculum tags:

  ```bash
  codex evaluate --config configs/evaluation/reasoning.yaml \
    curriculum.id=first_principles \
    --log-metrics .codex/metrics/reasoning.ndjson
  ```

- Compare against baselines via `codex metrics compare --metric reasoning.win_rate --reference baseline` (see the gate-check sketch after this list).
- Capture feedback loops in `status_updates/first_principles.md`.
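The milestone gate from the design principles (M1 win rate ≥0.55) can also be spot-checked from the logged metrics file. The sketch below assumes records shaped like `{"metric": "reasoning.win_rate", "value": 0.57}`; the real schema may differ, so use `codex metrics compare` for formal sign-off.

```python
# Spot-check the M1 gate (win rate >= 0.55) against logged metrics.
# Assumes records shaped like {"metric": "reasoning.win_rate", "value": 0.57};
# the real schema may differ, so treat this as a convenience check only.
import json
import sys
from pathlib import Path

GATE = 0.55
values = [
    rec["value"]
    for line in Path(".codex/metrics/reasoning.ndjson").read_text().splitlines()
    for rec in [json.loads(line)]
    if rec.get("metric") == "reasoning.win_rate"
]
if not values:
    sys.exit("no reasoning.win_rate records found")

latest = values[-1]
status = "PASS" if latest >= GATE else "FAIL"
print(f"win rate {latest:.3f} vs gate {GATE} -> {status}")
```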
Deployment pipeline¶
- Verify the bundle:

  ```bash
  codex deploy --config configs/deploy/reasoning_pod.yaml \
    --model artifacts/runs/first-principles:last \
    --dry-run
  ```

- Shadow the host until latency and judge deltas meet the milestone gate (see the sketch after this list).
- Register the model with curriculum metadata:

  ```bash
  codex register --bundle artifacts/runs/first-principles:last \
    --tag reasoning/m1/first-principles \
    --notes "Curriculum M1 rollout"
  ```

- Update rollout docs under `../deployment/` with observed risks or mitigations.
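Shadow sign-off is easier to audit when the delta check is scripted. The sketch below uses placeholder thresholds because this guide does not pin numeric latency or judge-delta gates; substitute the actual milestone values before relying on it.

```python
# Shadow sign-off sketch: both thresholds are placeholders, not the
# actual milestone gate values; substitute the real gate before use.
def shadow_gate(
    latency_delta_ms: float,
    judge_delta: float,
    max_latency_ms: float = 50.0,   # placeholder threshold
    max_judge_delta: float = 0.02,  # placeholder threshold
) -> bool:
    """Return True when the shadow host sits within the (assumed) gate."""
    return latency_delta_ms <= max_latency_ms and abs(judge_delta) <= max_judge_delta

# Example: 35 ms slower than baseline, judge delta inside the placeholder gate.
print(shadow_gate(latency_delta_ms=35.0, judge_delta=0.01))  # True
```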
Troubleshooting¶
- Trace gaps — Re-run with `logging.reasoning_trace=true` and verify `.codex/reasoning_runs/` contains phase markers.
- Evaluator regressions — Use `codex evaluate --metrics-only` to isolate metric drift before replaying full judge sweeps.
- Deployment parity — If bespoke hosts diverge, capture diffs in `status_updates/` and raise an action item against the M2 gate.
Curricula evolve with the roadmap. Submit updates alongside milestone retrospectives and link the diff in your status report.