Distributed Training Troubleshooting¶
Guide for diagnosing and resolving distributed training initialization issues
Overview¶
This guide helps troubleshoot common issues when using distributed training features, including PyTorch DDP and Hugging Face Accelerate.
Quick Diagnostics¶
Check Distributed Availability¶
```python
from codex_ml.distributed import is_distributed_available

if is_distributed_available():
    print("✓ Distributed training is available")
else:
    print("✗ Distributed training not available (CPU-only mode)")
```
Check Accelerate Installation¶
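A quick way to verify the package is importable is `importlib.util.find_spec`; a minimal standard-library sketch:

```python
import importlib.util

def is_installed(package: str) -> bool:
    """Return True if `package` can be imported in this environment."""
    return importlib.util.find_spec(package) is not None

print("accelerate installed:", is_installed("accelerate"))
```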
Check CUDA Availability¶
```python
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA devices: {torch.cuda.device_count()}")
```
Common Issues¶
1. ImportError: No module named 'accelerate'¶
Symptom: `ImportError: No module named 'accelerate'`
Solution: Install accelerate with the appropriate extras:

```shell
# For CPU-only
pip install accelerate

# For GPU with CUDA
pip install "accelerate>=0.20"

# Or install codex with training extras
pip install -e ".[train]"
```
2. Accelerate Version Compatibility¶
Symptom:
Cause: Mixing accelerate API versions (pre-0.30 vs 0.30+).

Solution: The codebase includes compatibility shims. Ensure you're using a supported version:
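To see which API generation is installed, read the package version with the standard library; a hedged sketch (the 0.30 boundary is taken from the cause above):

```python
from importlib import metadata

def parse_major_minor(raw: str) -> tuple:
    """Parse 'MAJOR.MINOR[.PATCH...]' into an (int, int) tuple."""
    parts = raw.split(".")
    return (int(parts[0]), int(parts[1]))

def accelerate_api_generation() -> str:
    """Report which accelerate API generation this environment has."""
    try:
        raw = metadata.version("accelerate")
    except metadata.PackageNotFoundError:
        return "not installed"
    return "0.30+ API" if parse_major_minor(raw) >= (0, 30) else "pre-0.30 API"

print(accelerate_api_generation())
```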
3. NCCL Backend Errors on CPU¶
Symptom:
Solution: Use the `gloo` backend for CPU-only distributed training:
Or in your training config:
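At the environment level, the `CODEX_DIST_BACKEND` variable from the reference table below selects the backend; a hedged sketch (the launch command mirrors the examples later in this guide):

```shell
# Select the gloo backend via the env var this repository reads
export CODEX_DIST_BACKEND=gloo

# Then launch training as usual, e.g.:
# python -m codex_ml.cli.train --config configs/training/base.yaml
```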
4. Distributed Initialization Timeout¶
Symptom:
Solutions:

- Check network connectivity between nodes
- Increase the process-group timeout
- Verify that `MASTER_ADDR`, `MASTER_PORT`, `RANK`, and `WORLD_SIZE` are set consistently on every node
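For the timeout bullet, `torch.distributed.init_process_group` accepts a `timeout` argument; a hedged sketch that builds the kwargs without importing torch (the 30-minute value is illustrative, not a recommendation):

```python
from datetime import timedelta

def init_kwargs(backend: str = "gloo", minutes: int = 30) -> dict:
    """Build kwargs for torch.distributed.init_process_group with a longer timeout."""
    return {"backend": backend, "timeout": timedelta(minutes=minutes)}

kwargs = init_kwargs()
print(kwargs["timeout"])  # 0:30:00

# Pass the kwargs where the process group is created:
# import torch.distributed as dist
# dist.init_process_group(**kwargs)
```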
5. Mixed Precision Errors¶
Symptom:
Solution: Ensure consistent dtype usage, or disable mixed precision (e.g. `mixed_precision="no"` in Accelerate) if needed.

6. Out of Memory (OOM) in Distributed Training¶
Solutions:

- Reduce the per-device batch size
- Enable gradient checkpointing
- Use CPU offloading
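When reducing the per-device batch size, gradient accumulation keeps the effective batch unchanged; a small helper sketch using the formula from the tuning section below (effective = per_device * num_gpus * accumulation):

```python
def accumulation_steps(effective_batch: int, per_device: int, num_gpus: int) -> int:
    """Accumulation steps needed to hold the effective batch size constant."""
    steps, rem = divmod(effective_batch, per_device * num_gpus)
    if rem:
        raise ValueError("effective batch must divide evenly across devices")
    return steps

print(accumulation_steps(32, per_device=4, num_gpus=2))  # 4
```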
7. Uneven Batch Distribution¶
Symptom: Some GPUs idle while others process data.
Solution: Ensure `even_batches=True` and `split_batches` are configured appropriately:

```python
from accelerate import Accelerator

accelerator = Accelerator(
    even_batches=True,
    split_batches=False,
)
```
CPU-Only Fallback¶
The codebase is designed to gracefully fall back to CPU-only mode when distributed training is unavailable.
Testing CPU Fallback¶
```python
import os

os.environ["CUDA_VISIBLE_DEVICES"] = ""  # Hide GPUs

from codex_ml.distributed import (
    init_distributed_if_needed,
    get_rank,
    get_world_size,
)

# Should return False and provide safe defaults
assert init_distributed_if_needed() is False
assert get_rank() == 0
assert get_world_size() == 1
```
Skip Distributed Tests¶
When running tests in CI or minimal environments:
```shell
pytest tests/ -k "not distributed"

# Or set environment variable
export CODEX_SKIP_DISTRIBUTED_TESTS=1
pytest tests/
```
Environment Variables Reference¶
| Variable | Default | Description |
|---|---|---|
| `CODEX_DDP` | `0` | Enable DDP mode (`1`=enabled) |
| `CODEX_DIST_BACKEND` | `nccl` | Distributed backend (`nccl`/`gloo`) |
| `CODEX_SKIP_DISTRIBUTED_TESTS` | `0` | Skip distributed tests |
| `MASTER_ADDR` | `localhost` | Master node address |
| `MASTER_PORT` | `29500` | Master node port |
| `RANK` | `0` | Process rank |
| `WORLD_SIZE` | `1` | Total number of processes |
| `LOCAL_RANK` | `0` | Local process rank (per node) |
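The defaults in the table translate directly into safe environment reads; a minimal sketch:

```python
import os

# Read the distributed environment with the defaults from the table above
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
master_addr = os.environ.get("MASTER_ADDR", "localhost")
master_port = int(os.environ.get("MASTER_PORT", "29500"))

print(f"rank {rank}/{world_size} (local {local_rank}) rendezvous {master_addr}:{master_port}")
```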
Multi-GPU Training¶
Single Node, Multiple GPUs¶
```shell
# Using torchrun (recommended)
torchrun --nproc_per_node=2 -m codex_ml.cli.train \
    --config configs/training/base.yaml

# Using accelerate
accelerate launch --num_processes=2 -m codex_ml.cli.train \
    --config configs/training/base.yaml
```
Multi-Node Training¶
```shell
# On master node (rank 0)
torchrun \
    --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr=master.example.com \
    --master_port=29500 \
    -m codex_ml.cli.train --config configs/training/base.yaml

# On worker node (rank 1)
torchrun \
    --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=1 \
    --master_addr=master.example.com \
    --master_port=29500 \
    -m codex_ml.cli.train --config configs/training/base.yaml
```
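With the flags above, torchrun assigns global ranks in contiguous blocks per node (`node_rank * nproc_per_node + local_rank`); a tiny helper to sanity-check a launch layout:

```python
def global_rank(node_rank: int, nproc_per_node: int, local_rank: int) -> int:
    """Global rank as torchrun derives it: contiguous blocks per node."""
    return node_rank * nproc_per_node + local_rank

# For the two-node, four-process-per-node launch above:
for node in range(2):
    ranks = [global_rank(node, 4, lr) for lr in range(4)]
    print(f"node {node}: ranks {ranks}")
```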
Debugging Tips¶
1. Enable Verbose Logging¶
```python
import logging

logging.basicConfig(level=logging.DEBUG)

from codex_ml.distributed import init_distributed_if_needed

init_distributed_if_needed()  # Will log detailed initialization steps
```
2. Check Process Group Status¶
```python
import torch.distributed as dist

if dist.is_initialized():
    print(f"Rank: {dist.get_rank()}/{dist.get_world_size()}")
    print(f"Backend: {dist.get_backend()}")
else:
    print("Distributed not initialized")
```
3. Test Communication¶
```python
import torch
import torch.distributed as dist

if dist.is_initialized():
    # Test all-reduce
    tensor = torch.ones(1)
    dist.all_reduce(tensor)
    assert tensor.item() == dist.get_world_size()
    print("✓ Communication test passed")
```
4. Monitor GPU Usage¶
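During a run, `nvidia-smi` shows per-GPU utilization and memory; two common invocations (the `--query-gpu` flags are standard nvidia-smi options):

```shell
# Poll overall GPU state once per second
watch -n 1 nvidia-smi

# Or a compact, log-friendly stream
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv -l 1
```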
Safe Accelerate Initialization¶
The repository includes a safe initialization guard that handles CPU-only environments and missing dependencies gracefully.
Using the Accelerate Init Guard¶
```python
from training.accelerate_init_guard import safe_accelerate_init

# Try to initialize accelerate with graceful fallback
result = safe_accelerate_init(cpu_fallback=True)

if result.success:
    print(f"✓ Accelerate initialized with {result.backend}")
    print(f"  World size: {result.world_size}, Rank: {result.rank}")
elif result.skip_reason:
    print(f"⊘ Skipped: {result.skip_reason}")
    # Continue with CPU-only training
else:
    print(f"✗ Error: {result.error}")
    # Handle error or fall back to CPU
```
Diagnostic Mode¶
Run the guard in diagnostic mode to check your environment:
Output example:
```text
============================================================
Accelerate Init Guard - Diagnostic Mode
============================================================

Distributed Environment Variables:
  MASTER_ADDR: <not set>
  MASTER_PORT: <not set>
  WORLD_SIZE: <not set>
  RANK: <not set>
  LOCAL_RANK: <not set>
  ACCELERATE_TEST: <not set>
  CUDA_VISIBLE_DEVICES: <not set>

Availability Checks:
  Accelerate available: True
  GPU available: False

Initialization Test:
  Result: AccelerateInitResult(skipped, reason=cpu_only)
```
Skip-Safe Integration Tests¶
The integration tests use pytest markers to skip on CPU-only runners:
```shell
# Run all tests (skips distributed on CPU-only)
pytest tests/integration/test_distributed_init.py

# Run with GPU and ACCELERATE_TEST flag
ACCELERATE_TEST=1 pytest tests/integration/test_distributed_init.py

# Skip integration tests entirely
pytest -m "not integration"
```
Environment Variable Reference for Safe Init¶
| Variable | Purpose | Example |
|---|---|---|
| `ACCELERATE_TEST` | Enable GPU-gated distributed tests | `ACCELERATE_TEST=1` |
| `CUDA_VISIBLE_DEVICES` | Control GPU visibility | `CUDA_VISIBLE_DEVICES=0,1` |
| `WORLD_SIZE` | Number of processes | `WORLD_SIZE=4` |
| `RANK` | Process rank | `RANK=0` |
| `MASTER_ADDR` | Master node address | `MASTER_ADDR=localhost` |
| `MASTER_PORT` | Master node port | `MASTER_PORT=29500` |
Common Init Guard Results¶
- **CPU-Only Environment:** the guard skips initialization (e.g. `reason=cpu_only`, as in the diagnostic output above).
- **Accelerate Not Installed:** the guard reports the missing dependency via `skip_reason`/`error` instead of raising.
- **Successful Initialization:** `result.success` is `True`, with `backend`, `world_size`, and `rank` populated.
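The three outcomes can be dispatched on the result fields shown earlier; a hedged sketch using a stand-in dataclass (the real `AccelerateInitResult` lives in `training.accelerate_init_guard`):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InitResult:
    """Stand-in for AccelerateInitResult; field names follow the guard usage above."""
    success: bool = False
    skip_reason: Optional[str] = None
    error: Optional[str] = None

def describe(result: InitResult) -> str:
    """Map a result to a short human-readable status string."""
    if result.success:
        return "initialized"
    if result.skip_reason:
        return f"skipped: {result.skip_reason}"
    return f"error: {result.error}"

print(describe(InitResult(skip_reason="cpu_only")))  # skipped: cpu_only
```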
Performance Optimization¶
1. Choose the Right Backend¶
- NCCL: Best for GPU-to-GPU communication
- Gloo: Best for CPU or mixed CPU/GPU
- MPI: Enterprise HPC environments
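The first two bullets reduce to a one-line decision; a minimal helper sketch:

```python
def pick_backend(gpu_available: bool) -> str:
    """NCCL for GPU-to-GPU collectives, gloo for CPU (per the guidance above)."""
    return "nccl" if gpu_available else "gloo"

print(pick_backend(False))  # gloo
```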
2. Tune Batch Size and Accumulation¶
```yaml
training:
  # Effective batch size = per_device * num_gpus * accumulation
  per_device_batch_size: 4
  gradient_accumulation_steps: 4
  # Effective batch = 4 * 2 GPUs * 4 = 32
```
3. Enable Compilation (PyTorch 2.0+)¶
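`torch.compile` is the PyTorch 2.x entry point for compilation; a hedged sketch with a guard so older installs fall back to the uncompiled callable:

```python
def maybe_compile(fn):
    """Compile `fn` with torch.compile when available; otherwise return it unchanged."""
    try:
        import torch
    except ImportError:
        return fn
    if hasattr(torch, "compile"):  # torch.compile exists on PyTorch 2.0+
        return torch.compile(fn)
    return fn
```

Applied to a model, this is typically `model = maybe_compile(model)` before the training loop.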
Related Documentation¶
Support¶
If issues persist:
- Check the GitHub Issues
- Review the test suite in `tests/distributed/` and `tests/training/`
- Enable debug logging and share the output
- Report the issue with environment details