Reasoning overview¶
This guide orients you to the systems, checkpoints, and metrics that define the reasoning roadmap. Keep it close when triaging milestones or proposing architectural changes.
Milestone guardrails¶
| Milestone | Gate | Acceptance notes |
|---|---|---|
| M0 | Trace coverage ≥95% on curated templates | Validate with `codex metrics summarize --metric reasoning.trace_coverage`. |
| M1 | Curriculum win rate ≥0.55 on `benchmarks/cot-lite` | Run the curriculum smoke in First principles curricula. |
| M2 | Shadow latency p95 ≤700 ms | Capture with `codex deploy --dry-run --latency-report`. |
| M3 | Weekly redeploy cadence with zero manual overrides | Enforced by the deployment checklist in `templates/`. |
Milestones build sequentially: do not advance without closing action items or documenting explicit risk trade-offs in `status_updates/`.
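The gates above can be checked mechanically before advancing a milestone. A minimal sketch, assuming a flat dict of observed metric values (the metric names and thresholds paraphrase the table; the dict shape is hypothetical, not an actual `codex` output format):

```python
# Gate predicates paraphrasing the milestone table above (M3 is a process
# gate, so it is omitted from this metric-driven check).
GATES = {
    "M0": ("reasoning.trace_coverage", lambda v: v >= 0.95),
    "M1": ("curriculum.win_rate", lambda v: v >= 0.55),
    "M2": ("shadow.latency_p95_ms", lambda v: v <= 700),
}

def passed_gates(observed: dict) -> list:
    """Return the milestones whose gate is satisfied by the observed metrics."""
    out = []
    for milestone, (metric, ok) in GATES.items():
        if metric in observed and ok(observed[metric]):
            out.append(milestone)
    return out

print(passed_gates({"reasoning.trace_coverage": 0.97,
                    "curriculum.win_rate": 0.52,
                    "shadow.latency_p95_ms": 640}))
# → ['M0', 'M2']  (win rate 0.52 misses the M1 gate)
```

A check like this belongs in CI rather than in reviewer heads: a missed gate blocks promotion instead of relying on someone re-reading the table.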
Systems topology¶
- **Authoring** — Hydra defaults stitch reasoning templates from `configs/training/reasoning/` with classical knobs. Updating a template requires bumping the manifest digest and notifying deployment partners.
- **Training** — Training and trace capture are coordinated by the unified training stack: `src/codex_ml/training/unified_training.py` exposes configuration for curriculum phases, continual replay, and resume strategy; `src/codex_ml/train_loop.py` executes a single run, attaches the reasoning harness, and logs traces / checkpoints. When these docs refer to "the trainer", they mean this pair of modules (plus the Hydra overlays in `configs/training/reasoning/*`), not a class literally named `ReasoningTrainer`. Trace payloads mirror the schema described in `../reference/reasoning_trace.md`.
- **Evaluation** — Evaluators register under `codex_ml.eval.registry`. The reasoning profile uses tiered NDJSON ledgers (`.codex/metrics/reasoning.ndjson`) that feed status reports.
- **Deployment** — Serving pods mount bespoke model bundles and rely on `codex deploy` to enforce manifest parity.
When proposing topology changes, update `../diagrams/architecture.svg` and include a short rationale in `status_updates/`.
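Tools that consume the evaluation ledger can treat it as plain NDJSON: one JSON object per non-empty line. A minimal reader sketch — the record fields (`metric`, `value`) are assumptions here, since the authoritative schema lives in `../reference/reasoning_trace.md`:

```python
import json
import io

def load_ledger(stream):
    """Parse an NDJSON metrics ledger: one JSON object per non-empty line."""
    return [json.loads(line) for line in stream if line.strip()]

# Simulated ledger content; real records live in .codex/metrics/reasoning.ndjson.
sample = io.StringIO(
    '{"metric": "reasoning.trace_coverage", "value": 0.96}\n'
    '{"metric": "curriculum.win_rate", "value": 0.58}\n'
)
records = load_ledger(sample)
print(len(records), records[0]["metric"])
# → 2 reasoning.trace_coverage
```

Because each line is an independent object, appends are atomic at the line level and partial reads stay valid, which is what makes the ledger safe to tail from status reports.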
Training pipeline¶
- Select a template: `codex reasoning-templates list` → choose an entry (for example `baseline`).
- Compose overrides.
- Inspect traces.
- Promote artifacts via `codex register --bundle ... --tag reasoning/<milestone>`.
Use the `curriculum.phase_schedule` knob to align experiment duration with milestone targets. For ablations, document the variant name in `training.output_dir` so trace comparisons remain legible.
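One way to keep trace comparisons legible is to derive `training.output_dir` deterministically from the variant name. A sketch under the assumption of a `runs/train_loop/<slug>` layout (the layout itself is illustrative, not a mandated convention):

```python
import re

def variant_output_dir(base: str, variant: str) -> str:
    """Slugify the variant name so ablation runs sort next to each other."""
    slug = re.sub(r"[^a-z0-9]+", "-", variant.lower()).strip("-")
    return f"{base}/{slug}"

print(variant_output_dir("runs/train_loop", "No Replay (ablation #2)"))
# → runs/train_loop/no-replay-ablation-2
```

Deterministic naming means two people running the same ablation land on the same directory, so trace diffs line up without a lookup table.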
Evaluation pipeline¶
- Generate evaluation inputs with `codex datasets materialize --preset reasoning/baseline`.
- Run the evaluator.
- Append commentary to `status_updates/<milestone>.md` summarising regressions or deltas.
- Trigger the optional smoke: `codex evaluate --config ... --metrics-only` for dashboard-friendly output.
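Commentary in `status_updates/<milestone>.md` is easier to write when the win rate is computed from evaluator output rather than eyeballed. A sketch against the M1 gate (the per-task record shape below is hypothetical; substitute whatever the evaluator actually emits):

```python
# Hypothetical per-task evaluation records; real ones come from the evaluator.
results = [
    {"task": "cot-001", "win": True},
    {"task": "cot-002", "win": True},
    {"task": "cot-003", "win": False},
    {"task": "cot-004", "win": True},
]

# Booleans sum as 0/1, so this is the fraction of winning tasks.
win_rate = sum(r["win"] for r in results) / len(results)
print(f"win_rate={win_rate:.2f} gate_met={win_rate >= 0.55}")
# → win_rate=0.75 gate_met=True
```

Recording the computed number (not just "looks better") keeps milestone deltas comparable across status updates.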
Deployment pipeline¶
- Validate manifests:

  ```bash
  codex deploy --config configs/deploy/reasoning_pod.yaml \
    --run-metadata-dir runs/train_loop/latest \
    --dry-run
  ```

  The deploy command consumes the offline `run_metadata.json` emitted by the training pipeline. Point `--run-metadata-dir` at the directory containing that file (for example, `runs/train_loop/latest`).
- Shadow-host in the target environment and confirm p95 latency ≤700 ms.
- Update `../deployment/reasoning_pod.md` with any override notes.
- Promote the template via `codex reasoning-templates explain <name>` and store the explanation alongside the rollout PR.
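The p95 confirmation in the shadow step can be reproduced from raw latency samples; a minimal nearest-rank sketch (the sample list is simulated — real samples come from the `--latency-report` capture):

```python
import math

def p95(samples_ms):
    """Nearest-rank p95: the value at rank ceil(0.95 * n) in sorted order."""
    ordered = sorted(samples_ms)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

# Simulated shadow-latency samples in milliseconds.
samples = [480, 510, 520, 530, 550, 560, 580, 600, 620, 690]
print(p95(samples), p95(samples) <= 700)
# → 690 True
```

Note that percentile definitions differ (nearest-rank vs. interpolated); whichever the latency report uses, apply the same definition when auditing the ≤700 ms gate so the numbers are comparable.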
Observability¶
- **Trace ledger** — `.codex/metrics/reasoning.ndjson` (mirrors the evaluation ledger for quick correlation).
- **Model registry** — `artifacts/runs/<experiment>` seeded by `codex register`.
- **Redeploy dashboard** — Link your dashboards in `../status_updates/README.md` so releases can reference the same views.
Keep observability wiring hermetic: do not rely on third-party plugins without documenting mocks or fallbacks.
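One pattern for keeping the wiring hermetic is an explicit, documented fallback when an optional plugin is absent. A sketch — `some_dashboard_plugin` is a hypothetical third-party exporter, and the in-process list stands in for the ledger mock:

```python
try:
    import some_dashboard_plugin  # hypothetical third-party exporter
    HAVE_PLUGIN = True
except ImportError:
    HAVE_PLUGIN = False

def emit_metric(name: str, value: float, sink: list):
    """Prefer the plugin when present; otherwise fall back to a local ledger mock."""
    if HAVE_PLUGIN:
        some_dashboard_plugin.emit(name, value)  # assumed plugin API
    else:
        sink.append({"metric": name, "value": value})  # documented fallback

ledger = []
emit_metric("reasoning.trace_coverage", 0.96, ledger)
print(ledger)
```

The point is that the fallback path is real code that is exercised in hermetic environments, not a comment promising one exists.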