Production Deployment Runbook¶
Prerequisites¶
- Kubernetes cluster with GPU node pool available
- Helm 3.x installed locally
- Docker registry credentials configured via
docker login
Deployment Steps¶
- Build image:
- Push to registry:
- Deploy Helm chart:
- Verify health probes:
- Run smoke tests:
Rollback Procedure¶
- Identify last known good chart version using
helm history codex. - Roll back:
- Validate
/readyendpoint until status returns 200.
Observability¶
| Metric | SLO | Monitoring |
|---|---|---|
| Availability | 99.9% | Prometheus + Alertmanager |
| Latency p95 | < 200ms | Grafana dashboards |
| Error Rate | < 0.1% | Sentry alerts |
Incident Response¶
Escalation and communication steps are documented in docs/security/incident_response.md.