Monitoring & Experiment Tracking¶
This project provides optional integrations for:

- TensorBoard (scalars, histograms): logs under `runs/<run-name>/tensorboard/`
- Weights & Biases (W&B): enable with `--enable-wandb` and run with `WANDB_MODE=offline` (or `disabled`) plus `WANDB_PROJECT=<your_project>`
- MLflow (local file store): enable with `--mlflow-enable`, optionally set `--mlflow-tracking-uri` and `--mlflow-experiment`; logs to `runs/<run-name>/mlruns/`
Both deploy/deploy_codex_pipeline.py and the Hydra CLI (python -m codex_ml.cli.main) honor these flags to stream metrics and persist run artifacts.
Quickstart¶
```bash
python tools/monitoring_integrate.py --run-name demo --enable-tensorboard --enable-mlflow

# With Weights & Biases
WANDB_MODE=offline WANDB_PROJECT=myproj python tools/monitoring_integrate.py --run-name demo --enable-tensorboard --enable-wandb

# Pipeline example
python deploy/deploy_codex_pipeline.py --corpus data.jsonl --demos demos.jsonl --prefs prefs.jsonl --output-dir out --enable-wandb --mlflow-enable

# Hydra CLI example
python -m codex_ml.cli.main --enable-wandb --mlflow-enable

# Functional trainer example with system metrics logging (writes to <checkpoint_dir>/system_metrics.jsonl)
python -m codex_ml.cli train-model --config configs/training/base.yaml --system-metrics AUTO --system-metrics-interval 15
```
Test coverage¶
`tests/cli/test_monitoring_cli.py` exercises the Typer commands (`inspect` and `export`) against temporary NDJSON data to keep the CLI working offline. Companion coverage in `tests/cli/test_plugins_cli.py` verifies the plugin registry inspection commands.
Viewing¶
- TensorBoard: `tensorboard --logdir runs/demo/tensorboard`
- MLflow UI: `mlflow ui --backend-store-uri file:runs/demo/mlruns`
All executions run locally via the CLI. Do NOT activate any online GitHub Actions workflows.
System metrics logging¶
`codex_ml.monitoring.system_metrics.SystemMetricsLogger` uses `psutil` to capture CPU utilisation, memory statistics, load averages, and per-process usage. When `psutil` is missing or disabled, the module emits structured `system_metrics.psutil_missing` and `system_metrics.logger_disabled` warnings (alongside `system_metrics.dependency_missing` during import failures) and the background logger becomes a no-op. Callers can still invoke `sample_system_metrics()` to retrieve a lightweight pure-Python snapshot (load averages, heuristic CPU %, and process RSS where available) and inspect the `SYSTEM_METRICS_DEGRADED` flag to detect the reduced capability. Requested GPU telemetry is gated behind NVIDIA's NVML bindings; when NVML is absent or fails to initialise, the sampler records `system_metrics.nvml_missing` and continues streaming CPU-only payloads.

- Enable the logger via the training CLI flag `--system-metrics`. Passing `AUTO` (or omitting a value) writes to `<checkpoint_dir>/system_metrics.jsonl`; provide a relative or absolute path to redirect output.
- Control sampling cadence with `--system-metrics-interval <seconds>` (minimum 0.1 s). Records are newline-delimited JSON objects.
- Feature flags: set `CODEX_MONITORING_ENABLE_PSUTIL=0` to skip psutil entirely. GPU telemetry is opt-in via `CODEX_MONITORING_ENABLE_GPU=1` (optionally `CODEX_MONITORING_ENABLE_NVML=1` for NVML-backed metrics); force-disable it with `CODEX_MONITORING_DISABLE_GPU=1` or `configure_system_metrics(poll_gpu=False)`. Set `CODEX_DISABLE_NVML=1` to skip NVML imports altogether; `system_metrics.nvml_disabled` is logged at INFO level and the sampler remains CPU-only.
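To make the degraded-mode behaviour concrete, here is a minimal, self-contained sketch of a background NDJSON metrics logger. It is not the real `SystemMetricsLogger` — the `MetricsLogger` class and the fields it emits are illustrative — but it mirrors the documented contract: a pure-Python snapshot (load averages and process RSS where available), a minimum 0.1 s cadence, and newline-delimited JSON output.

```python
import json
import os
import threading
import time


def sample_metrics() -> dict:
    """Illustrative pure-Python snapshot, loosely modelled on the documented
    degraded-mode fallback: load averages and process RSS where available."""
    sample = {"ts": time.time(), "pid": os.getpid()}
    if hasattr(os, "getloadavg"):  # not available on Windows
        la1, la5, la15 = os.getloadavg()
        sample.update({"load_1m": la1, "load_5m": la5, "load_15m": la15})
    try:
        import resource  # POSIX-only; peak RSS for the current process
        sample["rss_kb"] = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    except ImportError:
        pass
    return sample


class MetricsLogger:
    """Hypothetical background logger appending NDJSON records to a file."""

    def __init__(self, path: str, interval: float = 15.0):
        self.path = path
        self.interval = max(interval, 0.1)  # documented minimum cadence
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        with open(self.path, "a", encoding="utf-8") as fh:
            while not self._stop.is_set():
                fh.write(json.dumps(sample_metrics()) + "\n")
                fh.flush()
                self._stop.wait(self.interval)  # interruptible sleep

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
```

Using the logger as a context manager keeps the start/stop lifecycle explicit, which is one common way such a background sampler is wired into a training loop.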
NVML fallback (CPU-only environments)¶
- GPU metrics are gathered via NVIDIA NVML (`pynvml`) when the bindings are present.
- When NVML is unavailable or fails to initialise, the callback emits stable GPU keys with zeroed values so downstream schemas remain deterministic.
| Environment | Keys Emitted | Notes |
|---|---|---|
| With NVML | `gpu{i}_util`, `gpu{i}_mem` | One row per visible device |
| Without NVML | `gpu0_util=0`, `gpu0_mem=0` | Ensures downstream schemas remain stable |
Test the fallback¶
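The zero-fill fallback can be exercised locally without a GPU. The sketch below is not the project's callback — `gpu_metrics` is a hypothetical stand-in — but it shows the documented shape: real per-device values when `pynvml` initialises, and stable zeroed keys otherwise.

```python
def gpu_metrics() -> dict:
    """Illustrative sketch of the documented NVML fallback behaviour."""
    try:
        import pynvml  # NVIDIA NVML bindings; absent on CPU-only hosts

        pynvml.nvmlInit()
        metrics = {}
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            metrics[f"gpu{i}_util"] = float(util.gpu)
            metrics[f"gpu{i}_mem"] = float(mem.used)
        pynvml.nvmlShutdown()
        return metrics
    except Exception:
        # Fallback: stable keys with zeroed values keep downstream
        # schemas deterministic on CPU-only environments.
        return {"gpu0_util": 0.0, "gpu0_mem": 0.0}
```

On a CPU-only machine the `import pynvml` (or `nvmlInit`) step fails and the zeroed dictionary is returned, which is exactly the "Without NVML" row in the table above.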
Prometheus (optional)¶
Monitoring & Experiment Tracking¶
Flags:

- `--enable-wandb`
- `--mlflow-enable` / `--mlflow-tracking-uri` / `--mlflow-experiment`
- `--system-metrics` / `--system-metrics-interval`
Behavior:

- TensorBoard: logs to `<output>/tb`
- Weights & Biases: enabled when the flag is set (honours `WANDB_MODE` for offline/disabled modes)
- MLflow: wraps `mlflow.*` via `codex_ml.tracking.mlflow_utils.*`; artifacts/runs tracked where configured
- System metrics: writes newline-delimited JSON samples (CPU %, memory, load averages, process stats) under `<checkpoint_dir>/system_metrics.jsonl` or the path supplied to the flag
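Because the samples are newline-delimited JSON, they are easy to summarise offline. A minimal sketch (the `cpu` field name below is illustrative; inspect your own `system_metrics.jsonl` for the actual schema):

```python
import json


def summarise(path: str, field: str) -> dict:
    """Compute min/mean/max of a numeric field across NDJSON samples,
    skipping blank lines and records where the field is absent."""
    values = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            if isinstance(rec.get(field), (int, float)):
                values.append(float(rec[field]))
    if not values:
        return {}
    return {"min": min(values), "mean": sum(values) / len(values), "max": max(values)}
```

Skipping records that lack the field matters here: degraded-mode samples may omit metrics that full psutil-backed samples include.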
Hardware metrics¶
`codex_ml.monitoring.system_metrics` provides the CPU/memory sampler. When the `--system-metrics` flag is active, the functional trainer launches `SystemMetricsLogger` in the background to append samples during training. GPU telemetry is opt-in: set `CODEX_MONITORING_ENABLE_GPU=1` (and, if necessary, `CODEX_MONITORING_ENABLE_NVML=1`) to initialise NVML, or use `CODEX_MONITORING_DISABLE_GPU=1` / `configure_system_metrics(poll_gpu=False)` to keep sampling quiet in CPU-only environments. Administrators can also set `CODEX_DISABLE_NVML=1` to short-circuit NVML probing. When dependencies are missing, the sampler degrades gracefully with structured warnings (`system_metrics.psutil_missing`, `system_metrics.logger_disabled`, `system_metrics.nvml_missing`/`system_metrics.nvml_disabled`), and minimal telemetry is still available via `sample_system_metrics()`. The module exposes `SYSTEM_METRICS_DEGRADED` so callers can detect when psutil-backed sampling is unavailable.
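The interaction of these environment switches can be summarised as a small predicate. This is a sketch of my reading of the documentation — the precedence (disable switches win over opt-in switches) is an assumption, and `gpu_polling_enabled` is a hypothetical helper, not part of the `codex_ml` API:

```python
import os


def gpu_polling_enabled(env=os.environ) -> bool:
    """Sketch of the documented GPU-telemetry gating.

    Assumed precedence: force-disable flags win, then the NVML
    short-circuit, then the explicit opt-in; GPU polling is off by default.
    """
    if env.get("CODEX_MONITORING_DISABLE_GPU") == "1":
        return False  # force-disabled
    if env.get("CODEX_DISABLE_NVML") == "1":
        return False  # NVML probing short-circuited; sampler stays CPU-only
    return env.get("CODEX_MONITORING_ENABLE_GPU") == "1"  # opt-in only
```

Keeping GPU polling off by default matches the doc's framing of GPU telemetry as opt-in, so CPU-only deployments need no configuration at all.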