
Monitoring & Experiment Tracking

This project provides optional integration for:

  • TensorBoard (scalars, histograms): logs under runs/<run-name>/tensorboard/
  • Weights & Biases (W&B): enable with --enable-wandb and run with WANDB_MODE=offline (or disabled) plus WANDB_PROJECT=<your_project>
  • MLflow (local file store): enable with --mlflow-enable, optionally set --mlflow-tracking-uri and --mlflow-experiment; logs to runs/<run-name>/mlruns/

Both deploy/deploy_codex_pipeline.py and the Hydra CLI (python -m codex_ml.cli.main) honour these flags to stream metrics and persist run artifacts.
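
The per-run directory layout these integrations write to can be sketched with a small stdlib helper. This is illustrative only (the function name is hypothetical; the real CLIs create these paths themselves), but it mirrors the documented runs/<run-name>/ layout:

```python
from pathlib import Path

def prepare_run_dirs(run_name, root="runs"):
    """Create the per-run layout described above (sketch only).

    The real CLIs create these paths themselves; the subdirectory
    names mirror the documented runs/<run-name>/ layout.
    """
    base = Path(root) / run_name
    dirs = {
        "tensorboard": base / "tensorboard",  # TensorBoard event files
        "mlruns": base / "mlruns",            # MLflow local file store
    }
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)
    return dirs
```

The returned paths are the same ones the Viewing section points `tensorboard --logdir` and `mlflow ui --backend-store-uri` at.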

Quickstart

python tools/monitoring_integrate.py --run-name demo --enable-tensorboard --enable-mlflow
# With Weights & Biases
WANDB_MODE=offline WANDB_PROJECT=myproj python tools/monitoring_integrate.py --run-name demo --enable-tensorboard --enable-wandb
# Pipeline example
python deploy/deploy_codex_pipeline.py --corpus data.jsonl --demos demos.jsonl --prefs prefs.jsonl --output-dir out --enable-wandb --mlflow-enable
# Hydra CLI example
python -m codex_ml.cli.main --enable-wandb --mlflow-enable
# Functional trainer example with system metrics logging (writes to <checkpoint_dir>/system_metrics.jsonl)
python -m codex_ml.cli train-model --config configs/training/base.yaml --system-metrics AUTO --system-metrics-interval 15

Test coverage

  • tests/cli/test_monitoring_cli.py exercises the Typer commands (inspect and export) against temporary NDJSON data to keep the CLI working offline. Companion coverage in tests/cli/test_plugins_cli.py verifies plugin registry inspection commands.
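
The temporary NDJSON fixtures those tests rely on follow a simple one-object-per-line convention; a stdlib sketch of the round trip (helper names are illustrative, not part of the test suite):

```python
import json
from pathlib import Path

def write_ndjson(path, records):
    """Write one JSON object per line, the format the CLI consumes."""
    with Path(path).open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")

def read_ndjson(path):
    """Read the records back, skipping blank lines."""
    with Path(path).open("r", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```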

Viewing

  • TensorBoard: tensorboard --logdir runs/demo/tensorboard
  • MLflow UI: mlflow ui --backend-store-uri file:runs/demo/mlruns

All executions run locally via the CLI. Do NOT enable any GitHub Actions workflows.

System metrics logging

  • codex_ml.monitoring.system_metrics.SystemMetricsLogger uses psutil to capture CPU utilisation, memory statistics, load averages, and per-process usage.
  • When psutil is missing or disabled, the module emits structured system_metrics.psutil_missing and system_metrics.logger_disabled warnings (alongside system_metrics.dependency_missing during import failures) and the background logger becomes a no-op. Callers can still invoke sample_system_metrics() to retrieve a lightweight pure-Python snapshot (load averages, a heuristic CPU %, and process RSS where available) and inspect the SYSTEM_METRICS_DEGRADED flag to detect the reduced capability.
  • GPU telemetry is gated behind NVIDIA's NVML bindings; when NVML is absent or fails to initialise, the sampler records system_metrics.nvml_missing and continues streaming CPU-only payloads.
  • Enable the logger via training CLI flag --system-metrics. Passing AUTO (or omitting a value) writes to <checkpoint_dir>/system_metrics.jsonl; provide a relative or absolute path to redirect output.
  • Control sampling cadence with --system-metrics-interval <seconds> (minimum 0.1 s). Records are newline-delimited JSON objects.
  • Feature flags:
      - Set CODEX_MONITORING_ENABLE_PSUTIL=0 to skip psutil entirely.
      - GPU telemetry is opt-in via CODEX_MONITORING_ENABLE_GPU=1 (optionally CODEX_MONITORING_ENABLE_NVML=1 for NVML-backed metrics); force-disable it with CODEX_MONITORING_DISABLE_GPU=1 or configure_system_metrics(poll_gpu=False).
      - Set CODEX_DISABLE_NVML=1 to skip NVML imports altogether; system_metrics.nvml_disabled is logged at INFO level and the sampler remains CPU-only.
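
As a rough illustration of the degraded, pure-Python snapshot described above (the field names here are assumptions, not the module's actual schema):

```python
import os
import sys
import time

def degraded_snapshot():
    """Sketch of a psutil-free sample, loosely mirroring what the docs
    describe for sample_system_metrics() in degraded mode.

    Field names are illustrative only.
    """
    sample = {"timestamp": time.time()}
    # Load averages are available on POSIX without psutil.
    if hasattr(os, "getloadavg"):
        load1, load5, load15 = os.getloadavg()
        sample.update({"load_1m": load1, "load_5m": load5, "load_15m": load15})
    # Process RSS via the stdlib resource module (POSIX only; units vary
    # by platform: kilobytes on Linux, bytes on macOS).
    if sys.platform != "win32":
        import resource
        sample["process_rss"] = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return sample
```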

NVML fallback (CPU-only environments)

  • GPU metrics are gathered via NVIDIA NVML (pynvml) when the bindings are present.
  • When NVML is unavailable or fails to initialise, the callback emits stable GPU keys with zeroed values so downstream schemas remain deterministic.
  Environment    Keys emitted                Notes
  With NVML      gpu{i}_util, gpu{i}_mem     One pair per visible device
  Without NVML   gpu0_util=0, gpu0_mem=0     Keeps downstream schemas stable
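
A minimal sketch of that fallback behaviour (the function name and the zero placeholders are illustrative; real values would come from the NVML bindings):

```python
def gpu_metric_keys(num_devices):
    """Emit stable GPU keys whether or not NVML is available.

    Pass None to simulate NVML being missing or failing to initialise;
    the single zeroed gpu0 pair keeps downstream schemas deterministic.
    """
    if num_devices is None:  # NVML unavailable
        return {"gpu0_util": 0, "gpu0_mem": 0}
    metrics = {}
    for i in range(num_devices):
        # With NVML present these would be read from the device handles.
        metrics[f"gpu{i}_util"] = 0.0
        metrics[f"gpu{i}_mem"] = 0.0
    return metrics
```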

Test the fallback

pytest -q tests/monitoring/test_system_metrics_cpu_fallback.py

Prometheus (optional)

Flag reference

Flags:

  • --enable-wandb
  • --mlflow-enable / --mlflow-tracking-uri / --mlflow-experiment
  • --system-metrics / --system-metrics-interval

Behavior:

  • TensorBoard: logs to <output>/tb
  • Weights & Biases: enabled when flag set (honours WANDB_MODE for offline/disabled)
  • MLflow: wraps mlflow.* via codex_ml.tracking.mlflow_utils.*; artifacts/runs tracked where configured
  • System metrics: writes newline-delimited JSON samples (CPU %, memory, load averages, process stats) under <checkpoint_dir>/system_metrics.jsonl or the path supplied to the flag.
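
Because the samples are newline-delimited JSON, they are easy to post-process with the stdlib. A sketch that averages one numeric field across a run (cpu_percent is an assumed key, not a documented one):

```python
import json
from pathlib import Path

def summarize_cpu(path, field="cpu_percent"):
    """Average one numeric field across system_metrics.jsonl samples.

    "cpu_percent" is an assumed key; substitute whatever field names
    the logger actually emits.
    """
    values = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        value = record.get(field)
        if isinstance(value, (int, float)):
            values.append(value)
    return sum(values) / len(values) if values else None
```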

Hardware metrics

codex_ml.monitoring.system_metrics provides the CPU/memory sampler. When the --system-metrics flag is active, the functional trainer launches SystemMetricsLogger in the background to append samples during training.

GPU telemetry is opt-in: set CODEX_MONITORING_ENABLE_GPU=1 (and, if necessary, CODEX_MONITORING_ENABLE_NVML=1) to initialise NVML, or use CODEX_MONITORING_DISABLE_GPU=1 / configure_system_metrics(poll_gpu=False) to keep sampling quiet in CPU-only environments. Administrators can also set CODEX_DISABLE_NVML=1 to short-circuit NVML probing.

When dependencies are missing, the sampler degrades gracefully with structured warnings (system_metrics.psutil_missing, system_metrics.logger_disabled, system_metrics.nvml_missing / system_metrics.nvml_disabled), and minimal telemetry remains available via sample_system_metrics(). The module exposes SYSTEM_METRICS_DEGRADED so callers can detect when psutil-backed sampling is unavailable.
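
The flag handling above might resolve roughly as follows. This is a sketch under assumptions: the environment variable names come straight from the docs, but the disable-wins precedence is not verified against the implementation.

```python
import os

def gpu_polling_enabled(env=None):
    """Resolve the documented GPU-telemetry flags (disable wins).

    The flag names come from the docs above; the precedence order is
    an assumption, not verified against the implementation.
    """
    env = os.environ if env is None else env
    if env.get("CODEX_MONITORING_DISABLE_GPU") == "1":
        return False
    if env.get("CODEX_DISABLE_NVML") == "1":
        return False  # NVML imports are skipped, so sampling stays CPU-only
    return env.get("CODEX_MONITORING_ENABLE_GPU") == "1"
```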