
Codex Symbolic Pipeline — Example

The codex_symbolic_pipeline module is a self-contained, dependency-free reference implementation of the Pretraining → SFT → RLHF training workflow.
It uses deterministic bag-of-words models so that unit tests can exercise real token counting, loss computation, and PPO-like updates without any external ML library.

Source: docs/examples/codex_symbolic_pipeline.py
Production module: src/codex_ml/symbolic_pipeline.py


Public API

| Symbol | Kind | Purpose |
| --- | --- | --- |
| `tokenize(text)` | function | Deterministic regex tokeniser (lowercase) |
| `pretrain(corpus, cfg)` | function | Stage 1 — next-token unigram pretraining |
| `sft(model, demos, cfg)` | function | Stage 2 — supervised fine-tuning on demos |
| `train_reward_model(prefs, cfg)` | function | Train logistic reward model from preference pairs |
| `rlhf_ppo(model, rm, cfg)` | function | Stage 3 — PPO update against reward model |
| `Weights` | dataclass | Mutable token probability table |
| `PretrainCfg` | dataclass | Pretraining hyperparameters (epochs, lr, seed) |
| `SFTCfg` | dataclass | SFT hyperparameters |
| `RewardModelCfg` | dataclass | Reward model training config |
| `RLHFCfg` | dataclass | PPO hyperparameters (kl_coef, steps, …) |
| `ModelHandle` | dataclass | Wraps `Weights` + disallowed-token set |
| `RewardModelHandle` | dataclass | Wraps reward model weights |
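The table describes `tokenize` only as a deterministic, lowercasing regex tokeniser. A minimal sketch of such a function (the exact regex used in `codex_symbolic_pipeline.py` may differ) is:

```python
import re

def tokenize(text: str) -> list[str]:
    # Hypothetical sketch: lowercase the input, then extract runs of
    # word characters. Deterministic by construction — same input,
    # same token list, no state or randomness.
    return re.findall(r"[a-z0-9_]+", text.lower())

print(tokenize("def Foo(): pass"))  # ['def', 'foo', 'pass']
```

Because the tokeniser is pure and deterministic, token counts and losses computed on top of it are exactly reproducible across runs.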

Quickstart

```python
from docs.examples.codex_symbolic_pipeline import (
    pretrain, sft, train_reward_model, rlhf_ppo,
    PretrainCfg, SFTCfg, RewardModelCfg, RLHFCfg,
)

# --- Stage 1: Pretrain ---
corpus = ["def foo(): pass", "x = 1 + 2", "print('hello')"]
m0 = pretrain(corpus, PretrainCfg(epochs=3, lr=0.1, seed=0))

# --- Stage 2: SFT ---
demos = [{"prompt": "def add(a, b):", "response": "return a + b"}]
m1 = sft(m0, demos, SFTCfg(epochs=2, lr=0.05, seed=0))

# --- Stage 3: RLHF ---
prefs = [{"prompt": "def add", "chosen": "return a + b", "rejected": "pass"}]
rm = train_reward_model(prefs, RewardModelCfg(epochs=3, lr=0.1, seed=0))
m2 = rlhf_ppo(m1, rm, RLHFCfg(steps=5, lr=0.01, kl_coef=0.1, seed=0))

print("Pipeline complete. Final model token count:", len(m2.weights.probs))
```

Running the pipeline end-to-end

```shell
# From repo root
python deploy/deploy_codex_pipeline.py \
  --corpus  data/corpus.jsonl \
  --demos   data/demos.jsonl \
  --prefs   data/prefs.jsonl \
  --output-dir runs/exp1

# Validate reproducibility
pytest tests/test_deploy_codex_pipeline.py -v
```

Pipeline stages

Stage 1 — Pretraining

Corpus (text/code) ──► tokenize ──► unigram LM ──► M₀ Weights

Builds a unigram probability table via maximum-likelihood estimation across cfg.epochs passes over the corpus. Token probabilities are L1-normalised after each epoch; a safety penalty suppresses disallowed tokens.
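The counting, penalty, and normalisation steps above can be sketched as follows. This is an illustrative reimplementation, not the module's exact code; the `disallowed` set and `penalty` value are assumptions:

```python
from collections import Counter

def unigram_probs(corpus, disallowed=frozenset(), penalty=1e-6):
    # Count token frequencies across the whole corpus (maximum likelihood),
    # replace the count of any disallowed token with a tiny penalty mass,
    # then L1-normalise so the probabilities sum to 1.
    counts = Counter(tok for line in corpus for tok in line.lower().split())
    raw = {t: (penalty if t in disallowed else float(c)) for t, c in counts.items()}
    total = sum(raw.values())
    return {t: v / total for t, v in raw.items()}

probs = unigram_probs(["def foo pass", "def bar pass"])
# 'def' and 'pass' each occur twice out of six tokens
```

Running this over `cfg.epochs` passes changes nothing for pure MLE counting (the counts are the same each pass), which is why the epoch loop in the real module only matters once learning-rate updates and penalties interact.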

Stage 2 — Supervised Fine-Tuning (SFT)

M₀ + Demos ──► teacher-forcing cross-entropy ──► M₁ Weights

Updates the pretrained weights by increasing probability mass on demonstration response tokens. Uses cfg.lr as a simple additive step size.
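The additive update can be sketched as below. This is a hedged approximation of the described behaviour (add `lr` of mass to each demonstration token, then renormalise); the real `sft` may weight tokens differently:

```python
def sft_step(probs, demo_tokens, lr=0.05):
    # Additive SFT update: bump each demonstration response token by lr,
    # then renormalise so the table remains a valid distribution.
    updated = dict(probs)
    for tok in demo_tokens:
        updated[tok] = updated.get(tok, 0.0) + lr
    total = sum(updated.values())
    return {t: v / total for t, v in updated.items()}

probs = sft_step({"return": 0.5, "pass": 0.5}, ["return"], lr=0.1)
# mass shifts toward 'return' at the expense of 'pass'
```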

Stage 3 — RLHF (PPO)

M₁ + Reward Model ──► PPO update (KL-regularised) ──► M₂ Weights

Samples prompts from the demonstration set, scores each response with the reward model, applies a PPO-style gradient step, and regularises with a KL-divergence penalty to the pretrained model to prevent reward hacking.
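The KL-regularised objective that this stage ascends can be written as a small sketch. The function names and the exact reward interface are assumptions for illustration; the module's PPO step additionally clips updates, which is omitted here:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) over a shared token vocabulary; eps guards log(0).
    return sum(pv * math.log((pv + eps) / (q.get(t, eps) + eps))
               for t, pv in p.items())

def rlhf_objective(policy, reference, reward, kl_coef=0.1):
    # Expected reward under the policy, minus a KL penalty that keeps
    # the policy close to the pretrained reference distribution —
    # the term that discourages reward hacking.
    expected_reward = sum(pv * reward(t) for t, pv in policy.items())
    return expected_reward - kl_coef * kl_divergence(policy, reference)
```

With a reward that favours one token, a policy that drifts toward that token scores higher than the reference until the KL penalty outweighs the reward gain; `kl_coef` controls where that balance lands.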


Objective function (schematic)

$$ \min_{M}\; \mathcal{L}(M) = \alpha\,\mathcal{L}_{\text{SFT}}(M;\,D) + \beta\,\mathcal{L}_{\text{RLHF}}(M;\,R) + \gamma\,\Omega(M) $$

where $\Omega(M)$ is the safety regulariser and $R$ is the trained reward model.
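As code, the schematic objective is just a weighted sum of the three terms. The coefficient values below are illustrative, not constants taken from the module:

```python
def total_loss(l_sft, l_rlhf, omega, alpha=1.0, beta=1.0, gamma=0.1):
    # Schematic combined objective: alpha * L_SFT + beta * L_RLHF + gamma * Omega,
    # where omega is the safety regulariser value for the current model.
    return alpha * l_sft + beta * l_rlhf + gamma * omega
```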


Tests

```shell
pytest tests/ -k "symbolic" -v
```

The test suite covers deterministic reproducibility (seed=0), empty-corpus edge cases, errors raised for mis-specified configs, and safety-penalty enforcement.


See also