
Evaluation Metrics Guide

This guide documents the built-in metrics available in the codex_ml evaluation framework and how to use them for model assessment.

Overview

The metrics registry (codex_ml.metrics.registry) provides a collection of deterministic, reproducible metrics for evaluating model performance. All metrics are designed to work offline without requiring external API calls.

Available Metrics

Token-Level Metrics

accuracy@token / token_accuracy

Computes token-level accuracy for sequence predictions.

Usage:

```python
from codex_ml.metrics.registry import get_metric

metric = get_metric("accuracy@token")
score = metric(preds=[1, 2, 3], targets=[1, 2, 4], ignore_index=-100)
# Returns: ~0.667 (2 of 3 tokens correct)
```

Parameters:

- `preds`: Sequence of predicted token IDs
- `targets`: Sequence of target token IDs
- `ignore_index`: Token ID to ignore (default: `-100`)
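For intuition, the computation can be sketched in a few lines of plain Python (an illustrative standalone version, not the library's actual code):

```python
def token_accuracy(preds, targets, ignore_index=-100):
    """Fraction of non-ignored positions where the prediction matches the target."""
    pairs = [(p, t) for p, t in zip(preds, targets) if t != ignore_index]
    if not pairs:
        return 0.0
    return sum(p == t for p, t in pairs) / len(pairs)

token_accuracy([1, 2, 3], [1, 2, 4])  # ~0.667
```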

Text-Level Metrics

exact_match

Computes exact string match after normalization (lowercase, whitespace collapse).

Usage:

```python
metric = get_metric("exact_match")
score = metric(
    preds=["hello world", "test"],
    targets=["Hello World", "test"],
)
# Returns: 1.0 (both match after normalization)
```

Parameters:

- `remove_punct`: Whether to remove punctuation before comparison (default: `False`)
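The normalization described above (lowercase, whitespace collapse, optional punctuation removal) can be sketched as follows; this is an illustrative version, and the library's exact punctuation handling may differ:

```python
import re

def normalize(text, remove_punct=False):
    """Lowercase, optionally strip punctuation, collapse whitespace."""
    text = text.lower()
    if remove_punct:
        text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())

def exact_match(preds, targets, remove_punct=False):
    """Fraction of pairs that are identical after normalization."""
    pairs = list(zip(preds, targets))
    if not pairs:
        return 0.0
    return sum(normalize(p, remove_punct) == normalize(t, remove_punct)
               for p, t in pairs) / len(pairs)

exact_match(["hello  World", "test"], ["Hello World", "test"])  # 1.0
```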

f1

Computes average per-example F1 score over whitespace-tokenized words (bag-of-words).

Usage:

```python
metric = get_metric("f1")
score = metric(
    preds=["the cat sat on mat"],
    targets=["the cat sat on the mat"],
)
# Returns F1 based on token overlap
```
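The per-example bag-of-words F1 can be sketched like this (a standalone illustration using clipped token counts, not the library's implementation):

```python
from collections import Counter

def f1_example(pred, target):
    """Bag-of-words F1 over whitespace tokens for a single example."""
    pred_tokens, target_tokens = pred.split(), target.split()
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(target_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(target_tokens)
    return 2 * precision * recall / (precision + recall)

f1_example("the cat sat on mat", "the cat sat on the mat")  # 10/11 ~ 0.909
```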

Generative Metrics

â„šī¸ When running via the Typer evaluation CLI, provide --prediction-transform / --target-transform (or configure evaluation.prediction_transform) so that raw model outputs are decoded to text before these metrics are computed.

bleu

Computes corpus-level BLEU using a built-in, offline implementation with uniform n-gram weighting and exponential brevity penalty.

Usage:

```python
metric = get_metric("bleu")
score = metric(
    preds=["the cat sat on the mat", "hello world"],
    targets=["a cat sat on a mat", "hello there"],
)
# Returns: BLEU score (0.0-1.0)
```

Notes:

- No external dependencies required.
- Suitable for deterministic offline evaluation.
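A dependency-free corpus BLEU of this shape can be sketched as follows (uniform n-gram weights and exponential brevity penalty as described above; the "no smoothing" choice and clipping details here are assumptions and may differ from the library's implementation):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_bleu(preds, targets, max_n=4):
    """Corpus BLEU: uniform n-gram weights, exponential brevity penalty."""
    matches = [0] * max_n
    totals = [0] * max_n
    pred_len = ref_len = 0
    for pred, ref in zip(preds, targets):
        p, r = pred.split(), ref.split()
        pred_len += len(p)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            pc, rc = Counter(ngrams(p, n)), Counter(ngrams(r, n))
            matches[n - 1] += sum((pc & rc).values())  # clipped n-gram matches
            totals[n - 1] += max(len(p) - n + 1, 0)
    if min(matches) == 0 or min(totals) == 0:
        return 0.0  # unsmoothed: any empty precision level zeroes the score
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if pred_len >= ref_len else math.exp(1 - ref_len / pred_len)
    return bp * math.exp(log_prec)

corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"])  # 1.0
```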

rouge_l

Computes ROUGE-L F-measure using an internal longest-common-subsequence implementation.

Usage:

```python
metric = get_metric("rouge_l")
score = metric(
    preds=["the quick brown fox jumps"],
    targets=["the quick brown fox jumped"],
)
# Returns: ROUGE-L F-measure (0.0-1.0)
```

Notes:

- Offline and dependency-free.
- Captures fluency-sensitive overlap via LCS.
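The LCS-based F-measure can be sketched per example like this (an illustrative version; the library's aggregation across examples may differ):

```python
def lcs_length(a, b):
    """Dynamic-programming longest-common-subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_example(pred, target):
    """ROUGE-L F-measure over whitespace tokens for a single example."""
    p, t = pred.split(), target.split()
    lcs = lcs_length(p, t)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(p), lcs / len(t)
    return 2 * prec * rec / (prec + rec)

rouge_l_example("the quick brown fox jumps", "the quick brown fox jumped")  # 0.8
```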

chrf

Character-level F-score metric.

Optional dependencies: sacrebleu (preferred) or nltk

Usage:

```python
metric = get_metric("chrf")
score = metric(preds=["hello"], targets=["helo"])
# Returns: chrF score, or None if no backend is available
```
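For intuition, the character n-gram idea looks roughly like this. This is a simplified sketch (averaged per-order precision/recall, β=2) and will not match `sacrebleu`'s chrF exactly:

```python
from collections import Counter

def chrf_sketch(pred, target, max_n=6, beta=2.0):
    """Simplified chrF: averaged character n-gram precision/recall, F-beta."""
    precs, recs = [], []
    for n in range(1, max_n + 1):
        pc = Counter(pred[i:i + n] for i in range(len(pred) - n + 1))
        tc = Counter(target[i:i + n] for i in range(len(target) - n + 1))
        if sum(pc.values()) and sum(tc.values()):
            overlap = sum((pc & tc).values())
            precs.append(overlap / sum(pc.values()))
            recs.append(overlap / sum(tc.values()))
    if not precs:
        return 0.0
    p, r = sum(precs) / len(precs), sum(recs) / len(recs)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```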

Diversity Metrics

dist-1 / dist-2

Measures lexical diversity as the ratio of unique unigrams (dist-1) or bigrams (dist-2) to total tokens.

Usage:

```python
metric = get_metric("dist-1")
score = metric(preds=["the cat the dog", "test test"], targets=None)
# Returns: proportion of unique unigrams
```

Notes:

- Higher values indicate more diverse vocabulary
- Useful for assessing generation quality
- The `targets` parameter is ignored (not used for diversity)
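The ratio itself can be sketched as (illustrative, pooling n-grams across all outputs; the library may pool or average differently):

```python
def distinct_n(texts, n=1):
    """Unique n-grams divided by total n-grams across all outputs."""
    grams = []
    for text in texts:
        tokens = text.split()
        grams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(grams)) / len(grams) if grams else 0.0

distinct_n(["the cat the dog", "test test"], n=1)  # 4 unique / 6 total ~ 0.667
```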

Language Model Metrics

ppl / perplexity

Computes perplexity from negative log-likelihood values.

Usage:

```python
metric = get_metric("ppl")

# From a sequence of NLL values
score = metric([2.3, 1.8, 2.1])  # Returns: exp(mean(nll))

# From a sum and token count
score = metric(nll_sum=100.0, n_tokens=50)  # Returns: exp(100/50)
```
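As a sanity check, the arithmetic can be reproduced in a few lines (an illustrative sketch, not the library's implementation):

```python
import math

def perplexity(nll_values=None, nll_sum=None, n_tokens=None):
    """Perplexity is exp of the mean per-token negative log-likelihood."""
    if nll_values is not None:
        return math.exp(sum(nll_values) / len(nll_values))
    return math.exp(nll_sum / n_tokens)

perplexity([2.3, 1.8, 2.1])             # exp(2.0667) ~ 7.90
perplexity(nll_sum=100.0, n_tokens=50)  # exp(2.0) ~ 7.39
```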

Offline Metrics

offline:weighted-accuracy

Weighted accuracy that loads class weights from a local JSON fixture.

Configuration: Set the `CODEX_ML_WEIGHTED_ACCURACY_PATH` or `CODEX_ML_OFFLINE_METRICS_DIR` environment variable to point to a JSON file with class weights:

```json
{
  "class_a": 1.0,
  "class_b": 2.0,
  "class_c": 0.5
}
```

Usage:

```python
metric = get_metric("offline:weighted-accuracy")
score = metric(
    preds=["class_a", "class_b"],
    targets=["class_a", "class_c"],
    weights_path="/path/to/weights.json",
)
```
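The intended semantics can be sketched as follows. Note the assumptions: weights are keyed by the target class, and unknown classes default to weight 1.0; the library's actual resolution rules may differ:

```python
import json
import tempfile

def weighted_accuracy(preds, targets, weights_path):
    """Accuracy where each example counts with its target class's weight."""
    with open(weights_path) as f:
        weights = json.load(f)
    total = correct = 0.0
    for p, t in zip(preds, targets):
        w = weights.get(t, 1.0)  # assumption: unknown classes weigh 1.0
        total += w
        if p == t:
            correct += w
    return correct / total if total else 0.0

# Demo with a temporary weights file (hypothetical classes and weights):
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"class_a": 1.0, "class_b": 2.0, "class_c": 0.5}, f)

score = weighted_accuracy(["class_a", "class_b"], ["class_a", "class_c"], f.name)
# 1.0 / 1.5 ~ 0.667: only class_a (weight 1.0) matched, out of 1.0 + 0.5 total weight
```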

Using Metrics in Evaluation

With run_evaluation

```python
from codex_ml.config import EvaluationConfig
from codex_ml.eval.runner import run_evaluation

cfg = EvaluationConfig(
    dataset_path="data/eval.jsonl",
    dataset_format="jsonl",
    metrics=["exact_match", "f1", "bleu", "rougeL"],
    output_dir="results/eval_001",
    seed=42,
    prediction_field="prediction",
    target_field="target",
    text_field="text",
)

results = run_evaluation(cfg)
print(results["metrics"])
```

Listing Available Metrics

```python
from codex_ml.metrics.registry import list_metrics

available = list_metrics()
print(available)
# ['accuracy@token', 'token_accuracy', 'ppl', 'exact_match', 'f1', 'bleu', 'rougeL', ...]
```

Reproducibility

All built-in metrics are deterministic and produce identical results given the same inputs. Key features:

  1. Deterministic text normalization: Consistent lowercase, whitespace handling
  2. Fixed random seeds: Not applicable (metrics are deterministic functions)
  3. No external API calls: All computations are local
  4. Offline-first: Optional dependencies gracefully degrade to None returns

Custom Metrics

Registering a Custom Metric

```python
from codex_ml.metrics.registry import register_metric

@register_metric("custom_score")
def my_custom_metric(preds, targets):
    """Compute custom metric."""
    # Your implementation
    return score
```

Plugin-Based Metrics

For distributable custom metrics, use entry points in your pyproject.toml:

```toml
[project.entry-points."codex_ml.metrics"]
my_metric = "my_package.metrics:my_metric_function"
```

The metric will be automatically discovered and registered.

Testing Metrics

When adding new metrics, include tests that verify:

  1. Correctness: Known inputs produce expected outputs
  2. Edge cases: Empty inputs, None values, identical pred/target
  3. Graceful degradation: Missing optional dependencies return None
  4. Determinism: Multiple calls with same inputs return same result

Example test:

```python
def test_bleu_metric_correctness():
    from codex_ml.metrics.registry import get_metric

    metric = get_metric("bleu")

    # Perfect match
    score = metric(preds=["hello"], targets=["hello"])
    assert score == 1.0 or score is None  # None only if the metric is unavailable

    # No match
    score = metric(preds=["hello"], targets=["goodbye"])
    assert score is None or score < 0.3
```

Troubleshooting

Metric returns None

Some metrics require optional dependencies. Install them:

```bash
pip install nltk rouge_score sacrebleu
```

ImportError for metric dependencies

The registry gracefully handles missing dependencies. Check which dependencies are needed:

  • bleu / rougeL: use built-in offline implementations; no external packages required
  • chrf: requires sacrebleu or nltk

Different results across runs

Ensure all metrics are called with the same:

  • Input normalization settings
  • Text preprocessing
  • Seed values (though built-in metrics are deterministic)

See Also