Evaluation Metrics Guide¶
This guide documents the built-in metrics available in the codex_ml evaluation framework and how to use them for model assessment.
Overview¶
The metrics registry (codex_ml.metrics.registry) provides a collection of deterministic, reproducible metrics for evaluating model performance. All metrics are designed to work offline without requiring external API calls.
Available Metrics¶
Token-Level Metrics¶
accuracy@token / token_accuracy¶
Computes token-level accuracy for sequence predictions.
Usage:
```python
from codex_ml.metrics.registry import get_metric

metric = get_metric("accuracy@token")
score = metric(preds=[1, 2, 3], targets=[1, 2, 4], ignore_index=-100)
# Returns: 0.667 (2 out of 3 tokens correct)
```
Parameters:
- preds: Sequence of predicted token IDs
- targets: Sequence of target token IDs
- ignore_index: Token ID to ignore (default: -100)
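To make the ignore_index behavior concrete, here is a minimal reference sketch (illustrative only; the registry's actual implementation may differ):

```python
def token_accuracy(preds, targets, ignore_index=-100):
    """Fraction of non-ignored positions where pred equals target.

    Positions whose target equals ignore_index (e.g. padding) are
    excluded from both numerator and denominator.
    """
    kept = [(p, t) for p, t in zip(preds, targets) if t != ignore_index]
    if not kept:
        return 0.0
    return sum(p == t for p, t in kept) / len(kept)

# Padding positions (target == -100) are skipped entirely,
# so this scores 2 correct out of 3 counted positions
print(token_accuracy([1, 2, 3, 9], [1, 2, 4, -100]))
```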
Text-Level Metrics¶
exact_match¶
Computes exact string match after normalization (lowercase, whitespace collapse).
Usage:
```python
metric = get_metric("exact_match")
score = metric(
    preds=["hello world", "test"],
    targets=["Hello World", "test"],
)
# Returns: 1.0 (both match after normalization)
```
Parameters:
- remove_punct: Whether to remove punctuation before comparison (default: False)
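The normalization described above can be sketched as follows (an illustrative approximation, not the framework's exact code):

```python
import re
import string

def normalize(text, remove_punct=False):
    # Lowercase, optionally strip punctuation, collapse whitespace runs
    text = text.lower()
    if remove_punct:
        text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def exact_match(preds, targets, remove_punct=False):
    pairs = list(zip(preds, targets))
    matches = sum(
        normalize(p, remove_punct) == normalize(t, remove_punct)
        for p, t in pairs
    )
    return matches / len(pairs) if pairs else 0.0
```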
f1¶
Computes average per-example F1 score over whitespace-tokenized words (bag-of-words).
Usage:
```python
metric = get_metric("f1")
score = metric(
    preds=["the cat sat on mat"],
    targets=["the cat sat on the mat"],
)
# Returns F1 based on token overlap
```
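The bag-of-words F1 computation can be sketched like this (illustrative; the registry's version may handle edge cases differently):

```python
from collections import Counter

def bow_f1(pred, target):
    """Per-example F1 over whitespace tokens (multiset overlap)."""
    p, t = Counter(pred.split()), Counter(target.split())
    overlap = sum((p & t).values())  # clipped common token counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(t.values())
    return 2 * precision * recall / (precision + recall)

def f1(preds, targets):
    # Average per-example F1 over the corpus
    scores = [bow_f1(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores) if scores else 0.0
```

On the example above, the prediction matches 5 of its 5 tokens (precision 1.0) and 5 of the target's 6 tokens (recall 5/6), giving F1 = 10/11.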
Generative Metrics¶
ℹ️ When running via the Typer evaluation CLI, provide --prediction-transform/--target-transform (or configure evaluation.prediction_transform) so that raw model outputs are decoded to text before these metrics are computed.
bleu¶
Computes corpus-level BLEU using a built-in, offline implementation with uniform n-gram weighting and exponential brevity penalty.
Usage:
```python
metric = get_metric("bleu")
score = metric(
    preds=["the cat sat on the mat", "hello world"],
    targets=["a cat sat on a mat", "hello there"],
)
# Returns: BLEU score (0.0-1.0)
```
Notes:
- No external dependencies required.
- Suitable for deterministic offline evaluation.
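A simplified corpus-BLEU sketch with uniform weights over 1- to 4-grams and an exponential brevity penalty is shown below (illustrative only; the built-in implementation may differ in smoothing and edge-case handling):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(preds, targets, max_n=4):
    """Simplified corpus BLEU: uniform n-gram weights, exponential BP."""
    pred_len = ref_len = 0
    clipped = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # candidate n-gram counts per order
    for pred, ref in zip(preds, targets):
        p_tok, r_tok = pred.split(), ref.split()
        pred_len += len(p_tok)
        ref_len += len(r_tok)
        for n in range(1, max_n + 1):
            p_ng, r_ng = ngrams(p_tok, n), ngrams(r_tok, n)
            clipped[n - 1] += sum((p_ng & r_ng).values())
            totals[n - 1] += sum(p_ng.values())
    if 0 in totals or 0 in clipped:
        return 0.0  # no smoothing in this sketch
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    bp = 1.0 if pred_len > ref_len else math.exp(1 - ref_len / max(pred_len, 1))
    return bp * math.exp(log_prec)
```

A perfect match scores 1.0; with no overlapping n-grams this sketch returns 0.0.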
rouge_l¶
Computes ROUGE-L F-measure using an internal longest-common-subsequence implementation.
Usage:
```python
metric = get_metric("rouge_l")
score = metric(
    preds=["the quick brown fox jumps"],
    targets=["the quick brown fox jumped"],
)
# Returns: ROUGE-L F-measure (0.0-1.0)
```
Notes:
- Offline and dependency-free.
- Captures fluency-sensitive overlap via LCS.
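The LCS-based computation can be sketched as follows (illustrative, using an F-measure with beta = 1; the internal implementation may weight recall differently):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(pred, target):
    p, t = pred.split(), target.split()
    lcs = lcs_len(p, t)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(t)
    return 2 * precision * recall / (precision + recall)
```

On the usage example above, the LCS is "the quick brown fox" (4 tokens), so precision and recall are both 4/5 and the F-measure is 0.8.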
chrf¶
Character-level F-score metric.
Optional dependencies: sacrebleu (preferred) or nltk
Usage:
```python
metric = get_metric("chrf")
score = metric(preds=["hello"], targets=["helo"])
# Returns: chrF score, or None if no backend is installed
```
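The idea behind chrF can be illustrated with a single-order character n-gram F-score (a deliberate simplification: real chrF averages orders 1-6, and chrF++ adds word n-grams):

```python
from collections import Counter

def char_ngram_f(pred, target, n=2, beta=2.0):
    """Simplified chrF-like score over character n-grams of one order."""
    p = Counter(pred[i:i + n] for i in range(len(pred) - n + 1))
    t = Counter(target[i:i + n] for i in range(len(target) - n + 1))
    overlap = sum((p & t).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(t.values())
    b2 = beta ** 2  # chrF weights recall more heavily (beta = 2)
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```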
Diversity Metrics¶
dist-1 / dist-2¶
Measures lexical diversity as the ratio of unique unigrams (dist-1) or bigrams (dist-2) to total tokens.
Usage:
```python
metric = get_metric("dist-1")
score = metric(preds=["the cat the dog", "test test"], targets=None)
# Returns: proportion of unique unigrams
```
Notes:
- Higher values indicate more diverse vocabulary
- Useful for assessing generation quality
- targets parameter is ignored (not used for diversity)
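A minimal sketch of the computation, assuming n-grams are pooled across all outputs (a per-sentence average is another common variant, so treat the pooling choice as an assumption):

```python
def distinct_n(preds, n=1):
    """Unique n-grams divided by total n-grams, pooled over all outputs."""
    grams = []
    for text in preds:
        tokens = text.split()
        grams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(grams)) / len(grams) if grams else 0.0
```

For the usage example above, the pooled tokens are [the, cat, the, dog, test, test]: 4 unique out of 6 total, so dist-1 is 2/3.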
Language Model Metrics¶
ppl / perplexity¶
Computes perplexity from negative log-likelihood values.
Usage:
```python
metric = get_metric("ppl")

# From a sequence of NLL values
score = metric([2.3, 1.8, 2.1])  # Returns: exp(mean(nll))

# From a sum and a token count
score = metric(nll_sum=100.0, n_tokens=50)  # Returns: exp(100/50)
```
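Both calling conventions reduce to the same formula, exp of the mean per-token NLL, as this sketch shows (illustrative; the registry function's exact signature may differ):

```python
import math

def perplexity(nlls=None, nll_sum=None, n_tokens=None):
    """Perplexity = exp(mean per-token negative log-likelihood)."""
    if nlls is not None:
        nll_sum, n_tokens = sum(nlls), len(nlls)
    return math.exp(nll_sum / n_tokens)
```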
Offline Metrics¶
offline:weighted-accuracy¶
Weighted accuracy that loads class weights from a local JSON fixture.
Configuration:
Set CODEX_ML_WEIGHTED_ACCURACY_PATH or CODEX_ML_OFFLINE_METRICS_DIR environment variable to point to a JSON file with class weights:
```json
{
  "class_a": 1.0,
  "class_b": 2.0,
  "class_c": 0.5
}
```
Usage:
```python
metric = get_metric("offline:weighted-accuracy")
score = metric(
    preds=["class_a", "class_b"],
    targets=["class_a", "class_c"],
    weights_path="/path/to/weights.json",
)
```
Using Metrics in Evaluation¶
With run_evaluation¶
```python
from codex_ml.config import EvaluationConfig
from codex_ml.eval.runner import run_evaluation

cfg = EvaluationConfig(
    dataset_path="data/eval.jsonl",
    dataset_format="jsonl",
    metrics=["exact_match", "f1", "bleu", "rougeL"],
    output_dir="results/eval_001",
    seed=42,
    prediction_field="prediction",
    target_field="target",
    text_field="text",
)

results = run_evaluation(cfg)
print(results["metrics"])
# Output: mapping of metric names to scores
```
Listing Available Metrics¶
```python
from codex_ml.metrics.registry import list_metrics

available = list_metrics()
print(available)
# ['accuracy@token', 'token_accuracy', 'ppl', 'exact_match', 'f1', 'bleu', 'rougeL', ...]
```
Reproducibility¶
All built-in metrics are deterministic and produce identical results given the same inputs. Key features:
- Deterministic text normalization: Consistent lowercase, whitespace handling
- Fixed random seeds: Not applicable (metrics are deterministic functions)
- No external API calls: All computations are local
- Offline-first: Optional dependencies gracefully degrade to None returns
Custom Metrics¶
Registering a Custom Metric¶
```python
from codex_ml.metrics.registry import register_metric

@register_metric("custom_score")
def my_custom_metric(preds, targets):
    """Compute custom metric."""
    # Your implementation
    return score
```
Plugin-Based Metrics¶
For distributable custom metrics, use entry points in your pyproject.toml:
```toml
[project.entry-points."codex_ml.metrics"]
my_metric = "my_package.metrics:my_metric_function"
```
The metric will be automatically discovered and registered.
Testing Metrics¶
When adding new metrics, include tests that verify:
- Correctness: Known inputs produce expected outputs
- Edge cases: Empty inputs, None values, identical pred/target
- Graceful degradation: Missing optional dependencies return None
- Determinism: Multiple calls with same inputs return same result
Example test:
```python
def test_bleu_metric_correctness():
    from codex_ml.metrics.registry import get_metric

    metric = get_metric("bleu")

    # Perfect match (bleu is built-in, so it never returns None)
    score = metric(preds=["hello"], targets=["hello"])
    assert score == 1.0

    # No match
    score = metric(preds=["hello"], targets=["goodbye"])
    assert score is not None and score < 0.3
```
Troubleshooting¶
Metric returns None¶
Only chrf relies on an optional dependency; install one of its backends:
```bash
pip install sacrebleu  # preferred; nltk also works
```
ImportError for metric dependencies¶
The registry gracefully handles missing dependencies. Check which dependencies are needed:
- chrf: requires sacrebleu (preferred) or nltk
- bleu, rouge_l: built-in implementations, no dependencies required
Different results across runs¶
Ensure all metrics are called with the same:
- Input normalization settings
- Text preprocessing
- Seed values (though built-in metrics are deterministic)