# MCP Observability

Last Updated: 2026-01-23T11:45:00Z
Status: ✅ Prototype Implementation
Priority: P2 (Supporting Documentation)
MCP Protocol Version: 2024-11-05
## 🎯 Mission Overview

Objective: Establish a comprehensive observability framework for MCP operations through FastAPI middleware, JSON-RPC logging, metrics collection, and tracing hooks, while maintaining the offline-first architecture.

Energy Level: ⚡⚡⚡ (3/5) - Essential monitoring infrastructure supporting MCP reliability.

Operational Status:

- ✅ FastAPI middleware integrated
- ✅ JSON-RPC logging operational
- ✅ Health endpoint exposes status payload
- 🔄 Prometheus scraping endpoint placeholder ready
- 🔮 OpenTelemetry tracing hooks disabled by default (offline-first)
## ⚖️ Verification Checklist

Observability Infrastructure:

- [ ] Health endpoint returns valid JSON status
- [ ] Logs capture all unknown JSON-RPC notifications
- [ ] Logs capture transport errors with stack traces
- [ ] HTTP smoke tests exercise health and query endpoints
- [ ] Prometheus /metrics endpoint ready for instrumentation

Monitoring Validation:

- [ ] Log rotation configured (size/time-based)
- [ ] Log levels adjustable via configuration
- [ ] Metrics collection has minimal performance impact (<5% overhead)
- [ ] Tracing can be enabled without code changes (env var)

Production Readiness:

- [ ] Structured logging (JSON format) enabled
- [ ] Log aggregation pipeline tested (if applicable)
- [ ] Alert rules defined for critical errors
- [ ] Dashboard templates created (Grafana/similar)
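The structured-logging item above can be sketched with the standard library alone. The field names (`ts`, `level`, `logger`, `message`, `exc`) are illustrative, not the project's actual log schema:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Attach the traceback so transport errors keep their stack trace.
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("mcp-server")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("health check passed")
```

Swapping the formatter per environment keeps JSON output in production and plain text in development, as the Configuration section at the end of this document describes.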
## 📈 Success Metrics

| Metric | Target | Current | Status |
|---|---|---|---|
| Health Endpoint Availability | 99.9% | - | 🔮 Pending Monitoring |
| Log Write Latency (p95) | <10ms | - | 🔮 Pending Monitoring |
| Metrics Scrape Interval | 15s | - | 🔮 Pending Config |
| Tracing Overhead (when enabled) | <8% | - | 🔮 Pending Measurement |
| Alert Response Time | <5min | - | 🔮 Pending Ops Setup |

Observability Coverage KPIs:

- Critical paths instrumented: 100% (target)
- Error scenarios logged: 100%
- Performance metrics tracked: >80%
- Distributed traces captured: >70% (when enabled)
## ⚛️ Physics Alignment

### Path 🛤️ (Observability Flow)

Monitoring Path: Event → Log/Metric → Collection → Aggregation → Analysis → Alert/Dashboard

```mermaid
graph TD
    A[MCP Event] --> B{Event Type}
    B -->|HTTP Request| C[FastAPI Middleware]
    B -->|JSON-RPC Call| D[JSON-RPC Logger]
    B -->|VectorStore Op| E[Query Tracer]
    C --> F[Structured Log]
    D --> F
    E --> F
    C --> G[Prometheus Metric]
    E --> G
    F --> H[Log Aggregator]
    G --> I[Metrics Scraper]
    E --> J[Trace Collector]
    H --> K[Analysis/Search]
    I --> L[Dashboard]
    J --> L
    K --> M{Anomaly?}
    M -->|Yes| N[Alert]
    M -->|No| O[Archive]
```
### Fields 🔄 (Observability States)

Instrumentation States:

1. Uninstrumented: No observability hooks
2. Logging Only: Basic log output
3. Metrics Enabled: Prometheus scraping active
4. Tracing Enabled: OpenTelemetry spans captured
5. Fully Observable: Logs + Metrics + Traces + Alerts
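The five states form an ordered scale, which suggests a simple enum for gating features on the current level. `InstrumentationState` and `meets` are illustrative names, not part of the codebase:

```python
from enum import IntEnum


class InstrumentationState(IntEnum):
    """Ordered observability levels; each level implies the ones below it."""

    UNINSTRUMENTED = 0
    LOGGING_ONLY = 1
    METRICS_ENABLED = 2
    TRACING_ENABLED = 3
    FULLY_OBSERVABLE = 4


def meets(current: InstrumentationState, required: InstrumentationState) -> bool:
    """Check whether the current level satisfies a required minimum."""
    return current >= required
```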
### Patterns 👁️ (Observable Patterns)
- Request Duration Pattern: p50/p95/p99 latency tracking
- Error Rate Pattern: 4xx/5xx HTTP status codes
- Throughput Pattern: Requests per second (RPS)
- Resource Usage Pattern: Memory/CPU per operation
- Dependency Health Pattern: VectorStore response times
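The request-duration pattern reduces to percentile math over recorded samples. A minimal stdlib sketch (no Prometheus dependency assumed):

```python
import statistics


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Compute p50/p95/p99 from recorded request durations (milliseconds)."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Uniform 1..100 ms samples: p50 lands between 50 and 51.
durations = [float(i) for i in range(1, 101)]
print(latency_percentiles(durations))
```

A production metrics pipeline would typically pre-bucket samples (histogram) rather than store them all; this sketch only shows the math being tracked.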
### Redundancy 🔀 (Observability Resilience)

Multi-Layer Monitoring:

- Application logs: Immediate issue detection
- Metrics: Trend analysis and capacity planning
- Traces: Root cause analysis for slow requests
- Health checks: Service availability confirmation

Fallback Mechanisms:

- If metrics scraping fails: Logs still capture events
- If tracing overhead is high: Disable via env var
- If log aggregation is down: Local file logs persist
### Balance ⚖️ (Overhead vs. Insight)

Performance Trade-offs:

- Verbose logging vs. disk I/O
- Tracing granularity vs. CPU overhead
- Metric cardinality vs. memory usage
Default Configuration:
- Logging: INFO level (adjustable to DEBUG for troubleshooting)
- Metrics: Enabled, /metrics endpoint ready
- Tracing: Disabled (enable with OTEL_ENABLED=1)
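A sketch of how these defaults might be resolved from the environment. `LOG_LEVEL`, `OTEL_ENABLED`, and `OTEL_SAMPLING_RATE` match the rollback recipes later in this document; `METRICS_ENABLED` is an assumed variable name:

```python
import logging
import os


def load_observability_config(env: dict[str, str]) -> dict:
    """Resolve observability defaults from environment variables.

    Defaults match the offline-first posture: INFO logs, metrics on,
    tracing off, full sampling when tracing is enabled.
    """
    return {
        "log_level": getattr(
            logging, env.get("LOG_LEVEL", "INFO").upper(), logging.INFO
        ),
        "metrics_enabled": env.get("METRICS_ENABLED", "1") == "1",  # assumed name
        "tracing_enabled": env.get("OTEL_ENABLED", "0") == "1",
        "sampling_rate": float(env.get("OTEL_SAMPLING_RATE", "1.0")),
    }


print(load_observability_config(dict(os.environ)))
```

Because everything is environment-driven, tracing can be toggled without code changes, satisfying the checklist item above.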
## ⚡ Energy Distribution

### P0 Critical (40%)
- Health endpoint reliability (15%)
- Error logging coverage (15%)
- Log write performance (10%)
### P1 High (35%)
- Metrics collection accuracy (15%)
- Structured log format (12%)
- Alert integration (8%)
### P2 Medium (15%)
- Tracing instrumentation (10%)
- Dashboard templates (5%)
### P3 Low (10%)
- Advanced analytics
- Custom metric exporters
## 🧠 Redundancy Patterns

### Rollback Strategies

Scenario 1: Logging Overhead Excessive

```bash
# Rollback: Reduce log level
export LOG_LEVEL=WARNING  # From DEBUG

# Or disable verbose middleware logging
export MIDDLEWARE_LOGGING=false

# Restart service
systemctl restart mcp-server
```
Scenario 2: Metrics Scraping Causes Performance Degradation

```bash
# Rollback: Disable the Prometheus endpoint by commenting out
# the metrics middleware in http.py

# Temporary: Block /metrics at the firewall
# iptables -A INPUT -p tcp --dport 8000 -m string --string "/metrics" --algo bm -j DROP

# Fix: Optimize metric collection, reduce cardinality
```
Scenario 3: Tracing Overhead Unacceptable

```bash
# Rollback: Disable OpenTelemetry (the default state)
export OTEL_ENABLED=0

# Or reduce the sampling rate
export OTEL_SAMPLING_RATE=0.01  # From 1.0 (100%)

# Restart with minimal tracing
```
### Recovery Procedures

Log Rotation Failure:

```bash
# Detect: Disk full, logs not rotating
df -h | grep /var/log

# Recover: Force a manual rotation
logrotate -f /etc/logrotate.d/mcp-server

# Verify: Check log file sizes
ls -lh /var/log/mcp-server/
```
Metrics Endpoint 500 Error:

```bash
# Detect: /metrics returns 500
# Diagnose: Check middleware errors
journalctl -u mcp-server | grep metrics

# Recover: Restart the metrics registry
# (Implementation-specific; may require a service restart)
```
Tracing Data Loss:

```bash
# Detect: Missing spans in the trace backend
# Diagnose: Check OTEL collector connectivity
curl -I http://otel-collector:4318/v1/traces

# Recover: Point the exporter at the collector and restart the service
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
systemctl restart mcp-server
```
### Circuit Breakers

Log Write Circuit:

- If log write latency >100ms: Buffer logs in memory
- If buffer full: Drop oldest logs (with a warning metric)
- Resume normal logging when latency normalizes
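The log-write circuit maps naturally onto a bounded deque that discards the oldest entry when full. This is an illustrative sketch under those rules, not the server's implementation:

```python
from collections import deque


class LogWriteCircuit:
    """Buffer log lines while writes are slow; drop oldest when the buffer fills."""

    def __init__(self, max_buffered: int = 1000, latency_threshold_ms: float = 100.0):
        self.buffer: deque = deque(maxlen=max_buffered)
        self.latency_threshold_ms = latency_threshold_ms
        self.dropped = 0  # would be exposed as the warning metric

    def record(self, line: str, last_write_latency_ms: float, sink) -> None:
        if last_write_latency_ms > self.latency_threshold_ms:
            if len(self.buffer) == self.buffer.maxlen:
                self.dropped += 1  # deque silently discards the oldest entry below
            self.buffer.append(line)
        else:
            # Latency normalized: flush the backlog, then write directly.
            while self.buffer:
                sink(self.buffer.popleft())
            sink(line)
```

`sink` stands in for the real write path (file handler, aggregator client); injecting it keeps the circuit testable.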
Metrics Scrape Circuit:

- If scrape duration >1s: Reduce metric granularity
- If the scraper fails 3 times: Disable non-critical metrics
- Resume full metrics after a successful scrape

Trace Export Circuit:

- If trace export fails: Queue spans locally
- If queue size >10MB: Drop low-priority spans
- Resume export when the backend is available
## Observability Framework

Observability is built around FastAPI middleware and JSON-RPC logging hooks.

### Metrics and logs

- The HTTP prototype exposes a status payload via `/mcp/v1/health`.
- Add Prometheus scraping by mounting `/metrics` (a placeholder in `src/mcp/server/http.py` is ready for instrumentation hooks).
- The JSON-RPC server logs unknown notifications and transport errors via `logging`.
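A minimal sketch of the kind of payload `/mcp/v1/health` might return; the exact field names in `src/mcp/server/http.py` may differ, and `vectorstore_ok`/`started_at` are illustrative parameters:

```python
import time


def health_payload(vectorstore_ok: bool, started_at: float) -> dict:
    """Build a JSON-serializable health body: overall status,
    per-dependency checks, and uptime."""
    return {
        "status": "ok" if vectorstore_ok else "degraded",
        "checks": {"vectorstore": "ok" if vectorstore_ok else "error"},
        "uptime_seconds": round(time.monotonic() - started_at, 1),
    }
```

Keeping the payload shape stable matters because the smoke tests in the Validation section assert on it.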
### Tracing hooks

- Wrap `InMemoryVectorStore.query` with tracing decorators when enabling OpenTelemetry; keep them disabled by default to stay offline-first.
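One way to keep tracing opt-in is a decorator that no-ops unless `OTEL_ENABLED=1`. This sketch only times the call and prints a span-like line; a real integration would start an OpenTelemetry span instead, and `query` below is a stand-in, not the actual `InMemoryVectorStore.query`:

```python
import functools
import os
import time


def traced(name: str):
    """Wrap a function in a timing 'span' only when OTEL_ENABLED=1."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if os.environ.get("OTEL_ENABLED", "0") != "1":
                # Offline-first default: zero tracing overhead.
                return fn(*args, **kwargs)
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                print(f"span name={name} duration_ms={elapsed_ms:.2f}")

        return wrapper

    return decorator


@traced("vectorstore.query")
def query(text: str) -> list:
    """Stand-in for InMemoryVectorStore.query."""
    return [text.upper()]
```

Because the toggle is read per call, tracing can be enabled or rolled back via the environment without code changes.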
### Validation

- `python scripts/validate_mcp.py --run-http-smoke` exercises the health and query endpoints.
- `PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest tests/mcp/test_http_server.py -q` ensures the health payload shape is stable.
Document Version: 2.0.0
Last Updated: 2026-01-23T11:45:00Z
Implementation: src/mcp/server/http.py (FastAPI middleware)
Validation: scripts/validate_mcp.py --run-http-smoke
Iteration Alignment: Phase 12.3+ compatible
MCP Protocol: 2024-11-05 specification
Related Documentation:

- MCP API Schema
- MCP Capabilities Reference
- Traversal Workflow
Configuration:
- Default log level: INFO
- Metrics endpoint: /metrics (placeholder)
- Tracing: Disabled by default (OTEL_ENABLED=0)
- Log format: Structured JSON (production), plain text (development)