# Codex ML Architecture (v0.1.0)

**Version:** v0.1.0 Pre-Release | **Last Updated:** 2026-02-24 | **Status:** Living Document | **Managed By:** AI Assistant Autonomous System
AI-Managed Repository Notice: This repository is designed for and managed by AI Assistants and Agents. All architectural decisions, reviews, and updates are performed autonomously by AI systems.
Package Name: codex-ml (PyPI/Distribution) | Repository: _codex_
This document provides a comprehensive architectural overview of the _codex_ ML training, evaluation, and plugin framework using C4-lite modeling.
## Table of Contents

- System Context
- Container Architecture
- Component Architecture
- Data Flow
- Operational Concerns
- Technology Choices
- Roadmap
- Architecture Decision Records
## System Context (current)
The Codex ML system provides a comprehensive framework for ML model training, evaluation, and deployment with emphasis on reproducibility, observability, and extensibility. It includes the MCP ecosystem, Cognitive Brain system, and 218+ autonomous agents.
```mermaid
graph TB
    User["Data Scientist / ML Engineer<br/>Platform User"]
    Copilot["GitHub Copilot<br/>AI Coding Agent"]
    Agents["218+ Autonomous Agents<br/>MCP-enabled"]
    Codex["codex-ml<br/>Production-Ready ML Platform<br/>15,640+ Tests / ~17% Coverage"]
    Brain["Cognitive Brain<br/>kβ=0.35 / 2.86x Advantage<br/>289 patterns learned"]
    MCP["MCP System<br/>Model Context Protocol<br/>133 active workflows"]
    Pipeline["Python Ingestion<br/>Ingest → Analyze → Transform → Verify"]
    HF["Hugging Face Hub<br/>Models + Datasets"]
    MLflow["MLflow Tracking Server<br/>Experiments + Registry"]
    Storage["Cloud Storage<br/>S3 / Azure / GCS"]
    Compute["GPU Compute<br/>Ray Cluster / Distributed"]
    GitHub["GitHub<br/>Actions + PR Automation"]

    User -->|Configure & Train| Codex
    Copilot -->|Code Generation & Review| Codex
    Agents -->|Autonomous Operations| Codex
    Codex --> Brain
    Codex --> MCP
    Codex --> Pipeline
    Brain -->|Pattern-guided Decisions| Agents
    MCP -->|Context Protocol| Agents
    Codex -->|Load Models & Data| HF
    Codex -->|Track Experiments| MLflow
    Codex -->|Store Artifacts| Storage
    Codex -->|Distribute Training| Compute
    Codex -->|CI/CD Automation| GitHub

    style Codex fill:#3b82f6,stroke:#fff,stroke-width:4px,color:#fff
    style Brain fill:#8b5cf6,stroke:#fff,stroke-width:3px,color:#fff
    style MCP fill:#10b981,stroke:#fff,stroke-width:3px,color:#fff
    style Agents fill:#f59e0b,stroke:#fff,stroke-width:2px,color:#fff
```
### External Actors (current)
- Data Scientists / ML Engineers: Primary users who configure, train, and evaluate models
- GitHub Copilot: AI coding agent that autonomously fixes CI failures, fills coverage gaps, and implements features
- 218+ Autonomous Agents: Specialized domain agents for testing, documentation, security, and operations
- CI/CD Systems: 133 active GitHub Actions workflows for testing, deployment, and self-healing
### External Systems
- Hugging Face Hub: Model and dataset repository
- MLflow: Experiment tracking and model registry
- Cloud Storage: Artifact storage (checkpoints, logs, data) - S3, Azure, GCS
- Ray Cluster: Distributed compute for training and serving
- GitHub: PR automation, Actions workflows, agent orchestration
## Container Architecture (current)
The system is organized into several logical containers (processes or deployable units). Version 0.1.0 introduces the MCP system, the Cognitive Brain, and autonomous agent orchestration.
```mermaid
graph TB
    subgraph "codex-ml v0.1.0 System"
        subgraph "Core ML Platform"
            CLI["CLI Interface<br/>Typer/Click<br/>Main Entry Point"]
            Training["Training Engine<br/>PyTorch + Transformers<br/>Distributed Training"]
            Eval["Evaluation Engine<br/>lm-eval + custom metrics<br/>15,640+ Tests"]
            Serve["Model Serving<br/>Ray Serve + FastAPI<br/>Production API"]
            Config["Configuration<br/>Hydra + OmegaConf<br/>Hierarchical"]
            Logging["Session Logging<br/>SQLite + Telemetry<br/>Complete Audit"]
        end
        subgraph "Cognitive Brain (kβ=0.35)"
            Brain["Decision Engine<br/>Superposition + Entanglement<br/>2.86x Advantage"]
            Memory["Memory Manager<br/>STM/LTM + Patterns<br/>60% Compression"]
            Optimizer["Adaptive Scoring<br/>ML-inspired Weights<br/>Self-optimizing"]
        end
        subgraph "MCP Ecosystem"
            MCPCore["MCP Core<br/>Model Context Protocol<br/>Standardized"]
            Adapters["Adapters<br/>Pinecone/Mock/Custom<br/>Extensible"]
            Workers["Background Workers<br/>Embeddings + Checkpoints<br/>Async"]
            Metrics["MCP Metrics<br/>Telemetry + Monitoring<br/>Observability"]
        end
        subgraph "Python Ingestion Pipeline"
            Ingest["Ingest Module<br/>File/ZIP/Git/URL<br/>Multi-source"]
            Analyze["Analysis Module<br/>AST + Runtime<br/>Static + Dynamic"]
            Transform["Transform Module<br/>Tier A/B/C<br/>LLM-guided"]
            Verify["Verify Module<br/>Behavior Compare<br/>Test Gen"]
        end
        subgraph "Agent System (218+ Agents)"
            AgentCore["Agent Core<br/>RAG + RAGIndexer<br/>Autonomous"]
            ToolRegistry["Tool Registry<br/>Centralized Discovery<br/>Dynamic"]
            AgentMemory["Agent Memory<br/>SQLite Persistent<br/>Pattern Library"]
        end
        subgraph "Infrastructure"
            Security["Security Layer<br/>48 CVEs Fixed<br/>Production"]
            CICD["CI/CD Automation<br/>Auto-Fix + Self-Heal<br/>Time Savings"]
            Plugins["Plugin Framework<br/>Dynamic Loading<br/>Extensible"]
        end
    end
    subgraph "External Services"
        MLflow["MLflow Server<br/>Experiments + Registry"]
        Storage["Object Storage<br/>S3/Azure/GCS"]
        HF["Hugging Face<br/>Models + Datasets"]
        GitHub["GitHub<br/>Actions + API"]
    end

    %% Core Flow
    CLI --> Config
    CLI --> Training
    CLI --> Eval
    CLI --> Serve
    CLI --> Ingest
    Config -.configures.-> Training
    Config -.configures.-> Eval
    Config -.configures.-> Brain
    Training --> Logging
    Eval --> Logging
    Serve --> Logging

    %% Cognitive Brain
    Brain --> Memory
    Brain --> Optimizer
    AgentCore --> Brain

    %% MCP System
    MCPCore --> Adapters
    MCPCore --> Workers
    MCPCore --> Metrics
    AgentCore --> MCPCore

    %% Pipeline
    Ingest --> Analyze
    Analyze --> Transform
    Transform --> Verify

    %% Agent System
    AgentCore --> ToolRegistry
    AgentCore --> AgentMemory
    AgentCore --> CICD

    %% Infrastructure
    Security -.protects.-> Training
    Security -.protects.-> MCPCore
    CICD -.automates.-> GitHub
    Plugins -.extends.-> Training
    Plugins -.extends.-> Eval

    %% External
    Training --> MLflow
    Eval --> MLflow
    Training --> Storage
    Training --> HF
    Eval --> HF
    Serve --> HF
    AgentCore --> GitHub

    %% Styling
    style CLI fill:#3b82f6,stroke:#1e40af,stroke-width:2px,color:#fff
    style Brain fill:#8b5cf6,stroke:#6d28d9,stroke-width:2px,color:#fff
    style MCPCore fill:#10b981,stroke:#059669,stroke-width:2px,color:#fff
    style Ingest fill:#f59e0b,stroke:#d97706,stroke-width:2px,color:#fff
    style AgentCore fill:#ef4444,stroke:#dc2626,stroke-width:2px,color:#fff
    style Security fill:#dc2626,stroke:#991b1b,stroke-width:2px,color:#fff
```
### Container Descriptions
| Container | Technology | Purpose | Dependencies |
|---|---|---|---|
| CLI Interface | Typer, Click | Entry point for all user interactions | Config, Training, Eval, Serve |
| Training Engine | PyTorch, Transformers, PEFT, Accelerate | Model training with LoRA/QLoRA support | Config, Logging, MLflow, Storage |
| Evaluation Engine | lm-eval, custom metrics | Model evaluation and benchmarking | Config, Logging, HF Hub |
| Model Serving | Ray Serve, FastAPI | Production model inference API | Config, Logging, HF Hub |
| Logging & Telemetry | SQLite, custom session logger | Conversation tracking, session management | None |
| Configuration | Hydra, OmegaConf | Hierarchical configuration management | None |
| Plugin Framework | Python importlib | Dynamic plugin loading and extension | Config |
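The Plugin Framework row above relies on Python's `importlib` for dynamic loading. A minimal sketch of what such loading can look like; the `load_plugin` helper and the `module:attribute` spec format here are illustrative, not the repository's actual API:

```python
import importlib

def load_plugin(spec: str):
    """Load a plugin given a hypothetical 'module:attribute' spec."""
    module_name, _, attr = spec.partition(":")
    module = importlib.import_module(module_name)
    # Return the named attribute, or the module itself if no attribute given.
    return getattr(module, attr) if attr else module

# Demo with a stdlib module standing in for a real plugin:
dumps = load_plugin("json:dumps")
print(dumps({"ok": True}))  # {"ok": true}
```

The benefit of this pattern is that plugins are referenced by string in configuration, so new behavior can be wired in without importing it at module load time.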
---

## Component Architecture

### Core Components
```mermaid
graph TB
    subgraph "Training Engine"
        Trainer["Trainer<br/>Main orchestrator"]
        DataLoader["DataLoader<br/>Dataset preparation"]
        ModelInit["Model Initializer<br/>Load/create models"]
        Optimizer["Optimizer & Scheduler<br/>Training optimization"]
        Checkpoint["Checkpoint Manager<br/>Save/resume training"]
    end
    subgraph "Evaluation Engine"
        EvalRunner[Evaluation Runner]
        Metrics[Metrics Calculator]
        Benchmarks[Benchmark Suite]
        Reporter[Results Reporter]
    end
    subgraph "Configuration Management"
        HydraConfig[Hydra Config Loader]
        Validator["Config Validator<br/>Pydantic schemas"]
        Defaults[Default Configs]
    end
    subgraph "Logging Infrastructure"
        SessionLogger["Session Logger<br/>SQLite backend"]
        QueryEngine["Query Engine<br/>Search transcripts"]
        Viewer["Log Viewer<br/>CLI interface"]
    end

    Trainer --> DataLoader
    Trainer --> ModelInit
    Trainer --> Optimizer
    Trainer --> Checkpoint
    Trainer --> SessionLogger
    EvalRunner --> Metrics
    EvalRunner --> Benchmarks
    EvalRunner --> Reporter
    EvalRunner --> SessionLogger
    HydraConfig --> Validator
    HydraConfig --> Defaults

    style Trainer fill:#ff6b6b
    style EvalRunner fill:#51cf66
    style HydraConfig fill:#ff8787
    style SessionLogger fill:#845ef7
```
### Component Responsibilities

#### Training Engine Components

- Trainer: Orchestrates the training loop; manages epochs, batching, and gradient accumulation
- DataLoader: Prepares datasets from Hugging Face, local files, or custom sources
- Model Initializer: Loads pre-trained models or creates new architectures
- Optimizer & Scheduler: Manages learning-rate schedules and optimization algorithms
- Checkpoint Manager: Handles model checkpointing, resumption, and artifact storage

#### Evaluation Engine Components

- Evaluation Runner: Coordinates evaluation tasks across different benchmarks
- Metrics Calculator: Computes accuracy, perplexity, BLEU, and custom metrics
- Benchmark Suite: Integrates lm-eval and custom evaluation tasks
- Results Reporter: Formats and outputs evaluation results

#### Configuration Management

- Hydra Config Loader: Composes configurations from multiple sources
- Config Validator: Validates configurations using Pydantic schemas
- Default Configs: Provides sensible defaults for common scenarios
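As a dependency-free illustration of the validation step, here is a sketch using a stdlib dataclass in place of the actual Pydantic schemas; the field names and defaults are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    # Hypothetical fields; the real schemas live in the codebase.
    learning_rate: float = 3e-4
    batch_size: int = 8
    epochs: int = 3

    def __post_init__(self):
        # Fail fast on invalid values, as a Pydantic validator would.
        if self.learning_rate <= 0:
            raise ValueError("learning_rate must be positive")
        if self.batch_size < 1:
            raise ValueError("batch_size must be >= 1")

cfg = TrainingConfig(learning_rate=1e-4, batch_size=16)
print(cfg)
```

The point of the validator component is exactly this: a composed config is rejected at startup, not mid-training.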
#### Logging Infrastructure

- Session Logger: Records conversation events and training sessions to SQLite
- Query Engine: Enables searching through conversation transcripts
- Log Viewer: CLI tool for viewing and analyzing logs
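A minimal sketch of what an SQLite-backed session logger can look like; the table layout and function names are illustrative, not the repository's actual schema:

```python
import sqlite3

def open_session_log(path=":memory:"):
    """Open (or create) a session log database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "  id INTEGER PRIMARY KEY, role TEXT, content TEXT)"
    )
    return conn

def log_event(conn, role, content):
    """Append one conversation event (system/user/assistant/tool)."""
    conn.execute("INSERT INTO events (role, content) VALUES (?, ?)", (role, content))
    conn.commit()

conn = open_session_log()
log_event(conn, "user", "start training")
log_event(conn, "assistant", "training started")
rows = conn.execute("SELECT role, content FROM events").fetchall()
print(rows)
```

Because the log is plain SQLite, the Query Engine and Log Viewer reduce to SQL over a local file.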
---

## Data Flow

### Training Data Flow

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant Config
    participant Trainer
    participant DataLoader
    participant Model
    participant MLflow
    participant Storage

    User->>CLI: Run training command
    CLI->>Config: Load Hydra config
    Config-->>CLI: Resolved configuration
    CLI->>Trainer: Initialize with config
    Trainer->>DataLoader: Load dataset
    DataLoader-->>Trainer: Batched data
    Trainer->>Model: Forward pass
    Model-->>Trainer: Loss
    Trainer->>Trainer: Backward pass & optimize
    loop Every N steps
        Trainer->>MLflow: Log metrics
        Trainer->>Storage: Save checkpoint
    end
    Trainer-->>CLI: Training complete
    CLI-->>User: Results & artifact paths
```
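The "Every N steps" loop in the training flow can be sketched as a plain-Python skeleton, with the forward/backward pass elided and the logging and checkpoint hooks injected; all names here are illustrative:

```python
def train(steps, log_every, save_every, log_metric, save_checkpoint):
    """Skeleton of the periodic logging/checkpointing loop."""
    loss = 10.0
    for step in range(1, steps + 1):
        loss *= 0.99  # stand-in for a real forward/backward/optimize step
        if step % log_every == 0:
            log_metric("loss", loss, step)   # e.g. forwarded to MLflow
        if step % save_every == 0:
            save_checkpoint(step)            # e.g. written to object storage

metrics, checkpoints = [], []
train(100, log_every=25, save_every=50,
      log_metric=lambda k, v, s: metrics.append((k, round(v, 4), s)),
      save_checkpoint=checkpoints.append)
print(checkpoints)  # [50, 100]
```

Injecting the hooks keeps the loop testable without an MLflow server or storage backend attached.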
### Evaluation Data Flow

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant Config
    participant EvalRunner
    participant Model
    participant Benchmarks
    participant Reporter

    User->>CLI: Run evaluation command
    CLI->>Config: Load Hydra config
    Config-->>CLI: Resolved configuration
    CLI->>EvalRunner: Initialize evaluator
    EvalRunner->>Model: Load checkpoint
    EvalRunner->>Benchmarks: Run tasks
    loop For each task
        Benchmarks->>Model: Generate predictions
        Model-->>Benchmarks: Outputs
        Benchmarks->>Benchmarks: Compute metrics
    end
    Benchmarks-->>EvalRunner: Aggregated results
    EvalRunner->>Reporter: Format results
    Reporter-->>CLI: Formatted report
    CLI-->>User: Evaluation results
```
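The aggregation step that happens before the Reporter formats results can be sketched as follows; the task names and scores are made up for illustration:

```python
def aggregate(results):
    """Average per-task scores into one report, as an aggregator might."""
    report = {task: sum(scores) / len(scores) for task, scores in results.items()}
    # Add an overall mean across tasks after per-task averages are computed.
    report["mean"] = sum(report.values()) / len(report)
    return report

report = aggregate({"hellaswag": [0.8, 0.9], "arc": [0.6, 0.7]})
print(report)
```

Real benchmark suites weight and normalize tasks differently; this only shows the shape of the fan-in from per-task metrics to a single report.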
### Configuration Resolution Flow

```mermaid
flowchart LR
    Defaults["Default Configs<br/>config/"]
    User["User Overrides<br/>CLI args"]
    Env["Environment Variables<br/>CODEX_*"]
    Hydra[Hydra Composer]
    Validator[Pydantic Validator]
    Final[Final Config Object]

    Defaults --> Hydra
    User --> Hydra
    Env --> Hydra
    Hydra --> Validator
    Validator --> Final

    style Final fill:#51cf66
```
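The precedence implied by the diagram (defaults, then environment variables, then CLI overrides) can be illustrated with a plain dictionary merge; this is not Hydra's actual composition logic, only the ordering:

```python
def resolve(defaults, env, cli):
    """Later sources win: defaults < environment < CLI overrides."""
    merged = dict(defaults)
    merged.update(env)
    merged.update(cli)
    return merged

cfg = resolve(
    defaults={"lr": 3e-4, "batch_size": 8, "device": "cuda"},
    env={"device": "cpu"},   # e.g. set via a CODEX_* environment variable
    cli={"lr": 1e-4},        # e.g. a command-line override
)
print(cfg)  # {'lr': 0.0001, 'batch_size': 8, 'device': 'cpu'}
```

Hydra additionally handles config groups, interpolation, and structured schemas, but the override order shown here is the behavior users rely on day to day.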
---

## Operational Concerns

### Deployment Patterns

#### Local Development

- Run training on a local GPU
- Use SQLite for session logging
- Store artifacts locally or in cloud storage

#### Cloud Training

- Distribute training across a Ray cluster
- Use MLflow for experiment tracking
- Store artifacts in S3/GCS
#### Model Serving

- Deploy with Ray Serve for horizontal scaling
- FastAPI endpoints for inference
- Health checks and monitoring
### Observability

```mermaid
graph LR
    App[Codex ML]
    Logs["Session Logs<br/>SQLite"]
    Metrics["MLflow Metrics<br/>Training/Eval"]
    Traces["Conversation Traces<br/>Query Engine"]
    Viewer[Log Viewer CLI]
    MLflowUI[MLflow UI]
    QueryCLI[Query CLI]

    App --> Logs
    App --> Metrics
    App --> Traces
    Logs --> Viewer
    Metrics --> MLflowUI
    Traces --> QueryCLI

    style App fill:#326ce5,color:#fff
```
Logging Levels:

- Session events (system, user, assistant, tool roles)
- Training metrics (loss, learning rate, throughput)
- Evaluation results (accuracy, perplexity, custom metrics)
- Error tracking and stack traces

Key Metrics:

- Training: loss, learning rate, gradient norm, samples/sec
- Evaluation: accuracy, F1, perplexity, BLEU
- Infrastructure: GPU utilization, memory usage, I/O throughput
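Of the metrics listed, perplexity is the one most often derived rather than measured directly: it is the exponential of the mean negative log-likelihood (the cross-entropy loss). A one-liner makes the relationship concrete:

```python
import math

def perplexity(mean_nll):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    return math.exp(mean_nll)

# A model with mean per-token loss 2.0 has perplexity e^2:
print(round(perplexity(2.0), 3))
```

This is why a falling training loss and a falling perplexity are the same signal on two scales.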
### Security Considerations

- Secrets Management: Use environment variables; never commit secrets
- Input Validation: Validate all configurations and user inputs
- Dependency Scanning: Automated vulnerability scanning via Dependabot
- Code Analysis: Bandit for Python security issues

See SECURITY.md for vulnerability reporting.
### Scalability

- Horizontal Scaling: Ray for distributed training and serving
- Vertical Scaling: Multi-GPU support via Accelerate
- Data Parallelism: Sharded datasets for large-scale training
- Model Parallelism: Support for large models via FSDP/DeepSpeed
### Reliability

- Checkpointing: Automatic checkpoint saving and resumption
- Fault Tolerance: Ray's fault-tolerant execution
- Graceful Degradation: Fallback to CPU if GPU unavailable
- Validation: Pydantic-based configuration validation
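The save-and-resume pattern behind the checkpointing bullet can be sketched with stdlib JSON files; the file naming and state layout here are illustrative, and real checkpoints would hold model and optimizer state rather than a small dict:

```python
import json
import os
import tempfile

def save_checkpoint(state, directory):
    """Write one checkpoint file, keyed by training step."""
    path = os.path.join(directory, f"step_{state['step']}.json")
    with open(path, "w") as f:
        json.dump(state, f)
    return path

def latest_checkpoint(directory):
    """Return the most recent checkpoint's state, or None if empty."""
    files = sorted(os.listdir(directory))
    if not files:
        return None
    with open(os.path.join(directory, files[-1])) as f:
        return json.load(f)

with tempfile.TemporaryDirectory() as d:
    save_checkpoint({"step": 100, "loss": 1.5}, d)
    save_checkpoint({"step": 200, "loss": 0.9}, d)
    resumed = latest_checkpoint(d)
print(resumed)  # {'step': 200, 'loss': 0.9}
```

Resumption then amounts to restoring the latest state and continuing the loop from `resumed["step"]`.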
---

## Technology Choices

### Core Technologies

| Category | Technology | Rationale |
|---|---|---|
| ML Framework | PyTorch | Industry standard, excellent ecosystem |
| Transformers | Hugging Face Transformers | De facto standard for NLP models |
| Configuration | Hydra + OmegaConf | Composable configs, CLI overrides |
| Experiment Tracking | MLflow | Open source, model registry, UI |
| Distributed Compute | Ray | Scalable, fault tolerant, Python native |
| Model Serving | Ray Serve + FastAPI | Scalable inference, familiar API patterns |
| CLI Framework | Typer | Modern, type safe, auto-generated docs |
| Data Validation | Pydantic | Type safety, automatic validation |
| Testing | pytest | Powerful, extensive plugin ecosystem |
| Linting | Ruff + Black + mypy | Fast, comprehensive, type checked |
### Design Patterns

- Dependency Injection: Hydra provides configs to all components
- Plugin Architecture: Dynamic loading for extensibility
- Factory Pattern: Model and dataset creation
- Strategy Pattern: Different training strategies (LoRA, full fine-tuning)
- Observer Pattern: Event logging throughout training
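The Strategy pattern entry can be illustrated with two interchangeable fine-tuning strategies; the classes and the 1% adapter fraction below are illustrative, not measured from the codebase:

```python
class FullFineTune:
    """Strategy: update every weight in the model."""
    def trainable_parameters(self, n_params):
        return n_params

class LoRA:
    """Strategy: train only low-rank adapters (tiny fraction of weights)."""
    def __init__(self, rank=8):
        self.rank = rank
    def trainable_parameters(self, n_params):
        return int(n_params * 0.01)  # illustrative fraction, not a real measurement

def run_training(strategy, n_params):
    # The caller depends only on the strategy interface, not the concrete class.
    return strategy.trainable_parameters(n_params)

print(run_training(FullFineTune(), 7_000_000_000))
print(run_training(LoRA(), 7_000_000_000))
```

Because both strategies expose the same interface, switching between LoRA and full fine-tuning becomes a configuration choice rather than a code change.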
---

## Roadmap

### Current Capabilities (v0.x)

- ✅ LoRA/QLoRA fine-tuning
- ✅ Hydra-based configuration
- ✅ MLflow experiment tracking
- ✅ Session logging to SQLite
- ✅ CLI interface
- ✅ Evaluation with lm-eval
- ✅ Plugin framework

### Near-Term (Phase 1, current cycle)

- 🚧 Enhanced model serving with caching
- 🚧 Advanced evaluation metrics
- 🚧 Automated hyperparameter tuning
- 🚧 Better documentation and tutorials
- 🚧 Distributed training optimizations

### Medium-Term (Cycle 2 through Phase 3)

- 📋 Multi-modal support (vision + language)
- 📋 Reinforcement learning from human feedback (RLHF)
- 📋 Model compression and quantization
- 📋 Automated dataset curation
- 📋 Enhanced monitoring and alerting

### Long-Term (beyond the current cycle)

- 💡 AutoML capabilities
- 💡 Federated learning support
- 💡 Edge deployment
- 💡 Advanced privacy-preserving techniques

Legend: ✅ Complete | 🚧 In Progress | 📋 Planned | 💡 Under Consideration
---

## Architecture Decision Records

For detailed architectural decisions and their rationale, see:

- ADR Directory - All architecture decision records
- ADR-0001: Record Architecture Decisions - Meta-ADR about the ADR process

### Key Decisions

1. ADR-0001: Use Architecture Decision Records for documenting significant decisions
2. Use Hydra for Configuration: Enables composable, overridable configurations
3. SQLite for Session Logging: Lightweight, local-first, queryable logs
4. Ray for Distribution: Python-native, supports both training and serving
5. Plugin-Based Extensibility: Allow users to extend without forking

---
## Fence Validation Architecture (Legacy)

> Note: This section documents the fence validation tooling used for Markdown quality checks.

The tools/validate_fences.py script traverses Markdown inputs and surfaces fence issues for local contributors.

### Component Overview

- Target discovery (iter_files): Walks requested roots while skipping generated locations
- Line preparation (_prepare_line): Strips diff prefixes and indentation
- Fence analysis (_scan_file): Maintains FenceState metadata to validate symmetry
- Public entry points: validate_file (Python API), main (CLI)
```mermaid
flowchart TD
    A[CLI or caller] -->|argv / path list| B[_parse_args + _gather_targets]
    B --> C{Targets?}
    C -- none --> D["Emit '[fence-check] No matching files'"]
    C -- files --> E[iter_files]
    E --> F[_scan_file]
    F -->|errors| G[[STDOUT error lines]]
    F -->|warnings| H[[STDOUT warning lines]]
    F -->|ok state| I[["[fence-check] OK"]]
```
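The fence-symmetry check at the heart of _scan_file can be sketched as follows; this is a toy reimplementation for illustration, not the actual tools/validate_fences.py logic:

```python
def scan_fences(lines):
    """Report unclosed code fences in a list of Markdown lines."""
    open_fence = None  # line number where the current fence opened, if any
    errors = []
    for lineno, line in enumerate(lines, 1):
        if line.strip().startswith("```"):
            if open_fence is None:
                open_fence = lineno   # opening fence
            else:
                open_fence = None     # matching closing fence
    if open_fence is not None:
        errors.append(f"unclosed fence opened at line {open_fence}")
    return errors

print(scan_fences(["```python", "print('hi')", "```"]))  # []
print(scan_fences(["```python", "print('hi')"]))
```

The real tool additionally strips diff prefixes and indentation before matching, and distinguishes warnings from errors.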
### Running Locally

```bash
python -m pip install -r requirements-dev.txt
pytest -q tests/test_validate_fences.py
```
## Contributing to Architecture

When proposing architectural changes:

- Create an ADR: Document the decision in docs/decision_records/
- Update diagrams: Keep Mermaid diagrams current
- AI Assistant autonomous review: Automated architectural validation and feedback
- Update this document: Reflect changes in this ARCHITECTURE.md
- Update related docs: Keep API docs, guides, and README in sync
## References

Questions or suggestions? Open a discussion or submit for AI Assistant autonomous review.