Distributed Training Troubleshooting

Guide for diagnosing and resolving distributed training initialization issues

Overview

This guide helps troubleshoot common issues when using distributed training features, including PyTorch DDP and Hugging Face Accelerate.

Quick Diagnostics

Check Distributed Availability

from codex_ml.distributed import is_distributed_available

if is_distributed_available():
    print("✓ Distributed training is available")
else:
    print("✗ Distributed training not available (CPU-only mode)")

Check Accelerate Installation

python -c "import accelerate; print(f'Accelerate {accelerate.__version__}')"

Check CUDA Availability

import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA devices: {torch.cuda.device_count()}")

Common Issues

1. ImportError: No module named 'accelerate'

Symptom:

ImportError: No module named 'accelerate'

Solution: Install accelerate with the appropriate extras:

# Same package for CPU and GPU; GPU support comes from your PyTorch build
pip install "accelerate>=0.20"

# Or install codex with training extras
pip install -e ".[train]"

2. Accelerate Version Compatibility

Symptom:

TypeError: Accelerator.__init__() got an unexpected keyword argument 'dataloader_config'

Cause: Mixing accelerate API versions (pre-0.30 vs 0.30+).

Solution: The codebase includes compatibility shims. Ensure you're using a supported version:

pip install "accelerate>=0.20,<1.0"

3. NCCL Backend Errors on CPU

Symptom:

RuntimeError: NCCL is not available on CPU-only builds

Solution: Use the gloo backend for CPU-only distributed training:

export CODEX_DIST_BACKEND=gloo

Or in your training config:

training:
  distributed:
    backend: gloo
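To confirm the gloo backend works in your environment, a single-process smoke test can be run without any GPUs. This is a minimal sketch; the address and port values are placeholders:

```python
import os
import torch.distributed as dist

# Single-process process group over gloo; no GPU required
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

backend = dist.get_backend()
world_size = dist.get_world_size()

dist.destroy_process_group()
```

If this completes without raising, CPU-only distributed initialization is functional on the machine.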

4. Distributed Initialization Timeout

Symptom:

RuntimeError: Timed out initializing process group

Solutions:

  1. Check network connectivity between nodes
  2. Increase timeout:
    export NCCL_TIMEOUT=1800  # 30 minutes
    
  3. Verify environment variables:
    echo $MASTER_ADDR
    echo $MASTER_PORT
    echo $RANK
    echo $WORLD_SIZE
    
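The variable checks above can be automated with a small helper. This is a sketch; `check_dist_env` is a hypothetical name, not part of the codebase:

```python
import os

REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE")

def check_dist_env(env=os.environ):
    """Return a list of problems found in the distributed env vars."""
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    rank, world = env.get("RANK"), env.get("WORLD_SIZE")
    if rank and world and rank.isdigit() and world.isdigit() and int(rank) >= int(world):
        problems.append("RANK must be smaller than WORLD_SIZE")
    return problems

# Example: a rank outside the world size is flagged before initialization hangs
print(check_dist_env({"MASTER_ADDR": "localhost", "MASTER_PORT": "29500",
                      "RANK": "4", "WORLD_SIZE": "4"}))
```

Running this on every node before launching often catches misconfiguration that would otherwise surface only as a timeout.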

5. Mixed Precision Errors

Symptom:

RuntimeError: expected scalar type Float but found Half

Solution: Ensure consistent dtype usage. Disable mixed precision if needed:

from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="no")
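The error typically means a half-precision tensor reached a float32 operation. If disabling mixed precision is not an option, aligning dtypes explicitly also works. A minimal sketch:

```python
import torch

model = torch.nn.Linear(4, 4)   # parameters default to float32
x = torch.randn(2, 4).half()    # a half-precision input triggers the mismatch

# Cast the input up to the model's parameter dtype before the forward pass
y = model(x.float())
```

`torch.autocast` achieves the same alignment automatically on a per-op basis when mixed precision is enabled deliberately.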

6. Out of Memory (OOM) in Distributed Training

Solutions:

  1. Reduce batch size:

    training:
      per_device_batch_size: 1
      gradient_accumulation_steps: 8
    

  2. Enable gradient checkpointing:

    training:
      gradient_checkpointing: true
    

  3. Use CPU offloading (Accelerator itself takes no cpu_offload argument; offloading is configured through a plugin such as DeepSpeed ZeRO, which requires the deepspeed package):

    from accelerate import Accelerator, DeepSpeedPlugin

    accelerator = Accelerator(
        deepspeed_plugin=DeepSpeedPlugin(
            zero_stage=2,
            offload_optimizer_device="cpu",
        ),
    )

7. Uneven Batch Distribution

Symptom: Some GPUs idle while others process data.

Solution: Ensure even_batches=True and split_batches are configured appropriately:

from accelerate import Accelerator

accelerator = Accelerator(
    even_batches=True,
    split_batches=False,
)
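With even_batches enabled, accelerate pads the final shards (by reusing samples) so every rank sees the same number of batches; otherwise ranks with smaller shards finish early and sit idle. A pure-Python illustration of the idea (not accelerate's actual implementation; `shard_indices` is a hypothetical helper):

```python
def shard_indices(indices, rank, world_size, even_batches=True):
    """Split sample indices across ranks, optionally padding for evenness."""
    if even_batches and len(indices) % world_size:
        # Wrap around so every rank ends up with the same count
        pad = world_size - len(indices) % world_size
        indices = indices + indices[:pad]
    return indices[rank::world_size]

samples = list(range(10))
shards = [shard_indices(samples, r, 4) for r in range(4)]
print([len(s) for s in shards])  # every rank gets 3 samples
```

Without padding, 10 samples across 4 ranks would split 3/3/2/2, leaving two ranks idle during the final step.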

CPU-Only Fallback

The codebase is designed to gracefully fall back to CPU-only mode when distributed training is unavailable.

Testing CPU Fallback

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # Hide GPUs

from codex_ml.distributed import (
    init_distributed_if_needed,
    get_rank,
    get_world_size,
)

# Should return False and provide safe defaults
assert init_distributed_if_needed() is False
assert get_rank() == 0
assert get_world_size() == 1

Skip Distributed Tests

When running tests in CI or minimal environments:

pytest tests/ -k "not distributed"

# Or set environment variable
export CODEX_SKIP_DISTRIBUTED_TESTS=1
pytest tests/

Environment Variables Reference

Variable                       Default     Description
CODEX_DDP                      0           Enable DDP mode (1=enabled)
CODEX_DIST_BACKEND             nccl        Distributed backend (nccl/gloo)
CODEX_SKIP_DISTRIBUTED_TESTS   0           Skip distributed tests
MASTER_ADDR                    localhost   Master node address
MASTER_PORT                    29500       Master node port
RANK                           0           Process rank
WORLD_SIZE                     1           Total number of processes
LOCAL_RANK                     0           Local process rank (per node)
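The defaults in the table translate directly into safe fallbacks when reading the environment. A sketch mirroring the single-process behavior of get_rank/get_world_size (the `env_int` helper is illustrative, not part of the codebase):

```python
import os

def env_int(name, default):
    """Read an integer env var, falling back to the documented default."""
    return int(os.environ.get(name, default))

rank = env_int("RANK", 0)
world_size = env_int("WORLD_SIZE", 1)
local_rank = env_int("LOCAL_RANK", 0)
master_port = env_int("MASTER_PORT", 29500)
master_addr = os.environ.get("MASTER_ADDR", "localhost")
```

With nothing set, this yields rank 0 in a world of size 1, which is exactly the CPU-only fallback described below.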

Multi-GPU Training

Single Node, Multiple GPUs

# Using torchrun (recommended)
torchrun --nproc_per_node=2 -m codex_ml.cli.train \
    --config configs/training/base.yaml

# Using accelerate
accelerate launch --num_processes=2 -m codex_ml.cli.train \
    --config configs/training/base.yaml

Multi-Node Training

# On master node (rank 0)
torchrun \
    --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr=master.example.com \
    --master_port=29500 \
    -m codex_ml.cli.train --config configs/training/base.yaml

# On worker node (rank 1)
torchrun \
    --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=1 \
    --master_addr=master.example.com \
    --master_port=29500 \
    -m codex_ml.cli.train --config configs/training/base.yaml

Debugging Tips

1. Enable Verbose Logging

import logging
logging.basicConfig(level=logging.DEBUG)

from codex_ml.distributed import init_distributed_if_needed
init_distributed_if_needed()  # Will log detailed initialization steps

2. Check Process Group Status

import torch.distributed as dist

if dist.is_initialized():
    print(f"Rank: {dist.get_rank()}/{dist.get_world_size()}")
    print(f"Backend: {dist.get_backend()}")
else:
    print("Distributed not initialized")

3. Test Communication

import torch
import torch.distributed as dist

if dist.is_initialized():
    # NCCL requires GPU tensors; gloo works with CPU tensors
    device = "cuda" if dist.get_backend() == "nccl" else "cpu"
    tensor = torch.ones(1, device=device)
    dist.all_reduce(tensor)  # defaults to SUM
    assert tensor.item() == dist.get_world_size()
    print("✓ Communication test passed")

4. Monitor GPU Usage

# In a separate terminal
watch -n 1 nvidia-smi

# Or use nvtop for better visualization
nvtop

Safe Accelerate Initialization

The repository includes a safe initialization guard that handles CPU-only environments and missing dependencies gracefully.

Using the Accelerate Init Guard

from training.accelerate_init_guard import safe_accelerate_init

# Try to initialize accelerate with graceful fallback
result = safe_accelerate_init(cpu_fallback=True)

if result.success:
    print(f"✓ Accelerate initialized with {result.backend}")
    print(f"  World size: {result.world_size}, Rank: {result.rank}")
elif result.skip_reason:
    print(f"⊘ Skipped: {result.skip_reason}")
    # Continue with CPU-only training
else:
    print(f"✗ Error: {result.error}")
    # Handle error or fall back to CPU

Diagnostic Mode

Run the guard in diagnostic mode to check your environment:

python training/accelerate_init_guard.py

Output example:

============================================================
Accelerate Init Guard - Diagnostic Mode
============================================================

Distributed Environment Variables:
  MASTER_ADDR: <not set>
  MASTER_PORT: <not set>
  WORLD_SIZE: <not set>
  RANK: <not set>
  LOCAL_RANK: <not set>
  ACCELERATE_TEST: <not set>
  CUDA_VISIBLE_DEVICES: <not set>

Availability Checks:
  Accelerate available: True
  GPU available: False

Initialization Test:
  Result: AccelerateInitResult(skipped, reason=cpu_only)

Skip-Safe Integration Tests

The integration tests use pytest markers to skip on CPU-only runners:

# Run all tests (skips distributed on CPU-only)
pytest tests/integration/test_distributed_init.py

# Run with GPU and ACCELERATE_TEST flag
ACCELERATE_TEST=1 pytest tests/integration/test_distributed_init.py

# Skip integration tests entirely
pytest -m "not integration"

Environment Variable Reference for Safe Init

Variable               Purpose                              Example
ACCELERATE_TEST        Enable GPU-gated distributed tests   ACCELERATE_TEST=1
CUDA_VISIBLE_DEVICES   Control GPU visibility               CUDA_VISIBLE_DEVICES=0,1
WORLD_SIZE             Number of processes                  WORLD_SIZE=4
RANK                   Process rank                         RANK=0
MASTER_ADDR            Master node address                  MASTER_ADDR=localhost
MASTER_PORT            Master node port                     MASTER_PORT=29500

Common Init Guard Results

CPU-Only Environment:

AccelerateInitResult(
    success=False,
    skip_reason='cpu_only',
    gpu_available=False,
    ...
)

Accelerate Not Installed:

AccelerateInitResult(
    success=False,
    skip_reason='no_accelerate',
    accelerate_available=False,
    ...
)

Successful Initialization:

AccelerateInitResult(
    success=True,
    backend='nccl',
    world_size=2,
    rank=0,
    gpu_available=True,
    ...
)
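For reference, the result objects above can be modeled roughly as a dataclass like the following. This is an illustrative sketch, not the repository's actual definition:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AccelerateInitResult:
    success: bool
    backend: Optional[str] = None
    world_size: int = 1
    rank: int = 0
    gpu_available: bool = False
    accelerate_available: bool = True
    skip_reason: Optional[str] = None
    error: Optional[str] = None

# The CPU-only case from above
result = AccelerateInitResult(success=False, skip_reason="cpu_only")
print(result.skip_reason)  # cpu_only
```

Modeling the outcome as data rather than raising lets callers branch on success, skip_reason, and error without try/except scaffolding.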

Performance Optimization

1. Choose the Right Backend

  • NCCL: Best for GPU-to-GPU communication
  • Gloo: Best for CPU or mixed CPU/GPU
  • MPI: Enterprise HPC environments
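A small helper can encode this choice, honoring the CODEX_DIST_BACKEND override described earlier. This is a sketch; `pick_backend` is a hypothetical name:

```python
import os

def pick_backend(gpu_available: bool, env=os.environ) -> str:
    """Choose a torch.distributed backend, honoring CODEX_DIST_BACKEND."""
    explicit = env.get("CODEX_DIST_BACKEND")
    if explicit:
        return explicit
    return "nccl" if gpu_available else "gloo"

print(pick_backend(False, env={}))  # gloo
```

An explicit env var always wins, so operators can force gloo on GPU machines for debugging.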

2. Tune Batch Size and Accumulation

training:
  # Effective batch size = per_device * num_gpus * accumulation
  per_device_batch_size: 4
  gradient_accumulation_steps: 4
  # Effective batch = 4 * 2 GPUs * 4 = 32
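The arithmetic in the comment can be checked directly; when scaling to more GPUs, keep the effective batch constant by adjusting accumulation (plain Python mirroring the config above):

```python
def effective_batch(per_device: int, num_gpus: int, accumulation: int) -> int:
    """Effective batch size = per-device batch * GPU count * accumulation steps."""
    return per_device * num_gpus * accumulation

print(effective_batch(4, 2, 4))  # 32

# Doubling the GPU count lets you halve accumulation for the same effective batch
print(effective_batch(4, 4, 2))  # 32
```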

3. Enable Compilation (PyTorch 2.0+)

import torch

# Compile model for faster execution
model = torch.compile(model)

Support

If issues persist:

  1. Check the GitHub Issues
  2. Review the test suite in tests/distributed/ and tests/training/
  3. Enable debug logging and share output
  4. Report the issue with environment details:
python -c "import torch; print(f'PyTorch: {torch.__version__}')"
python -c "import accelerate; print(f'Accelerate: {accelerate.__version__}')"
nvidia-smi  # If using GPU