Metrics

Metric	Meaning
Success rate	Whether the task met its expected outcome.
Recall@K	Whether relevant items appear in the top K results.
MRR	Mean reciprocal rank: how early the first relevant result appears, averaged over queries.
nDCG	Normalized discounted cumulative gain: whether useful results rank near the top.
Assertion recall	Whether expected factual assertions were recovered.
Token cost	Model context or generation cost used by a run.
Latency	How long retrieval, answer generation, or eval work took.
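
As a rough illustration of the three ranking metrics in the table, here is a minimal sketch using their common textbook definitions. The function names (`recall_at_k`, `reciprocal_rank`, `mean_reciprocal_rank`, `ndcg_at_k`) and the input shapes are assumptions made for this example, not the suite's actual API, and the suite may compute these values differently (for instance, treating Recall@K as a binary hit rather than a fraction).

```python
import math
from typing import Mapping, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is found."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average the reciprocal rank over (ranked, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked: Sequence[str], gains: Mapping[str, float], k: int) -> float:
    """DCG of the top k results divided by the DCG of an ideal ordering.

    `gains` maps an item id to its graded relevance; missing items count as 0.
    Uses the linear-gain form: gain / log2(rank + 1).
    """
    dcg = sum(
        gains.get(item, 0.0) / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical example data: two of the four results are relevant.
ranked = ["a", "b", "c", "d"]
relevant = {"b", "d"}
gains = {"b": 1.0, "d": 1.0}

r_at_2 = recall_at_k(ranked, relevant, k=2)
rr = reciprocal_rank(ranked, relevant)
ndcg = ndcg_at_k(ranked, gains, k=4)
```

In this example only one of the two relevant items is in the top 2, and the first relevant item sits at rank 2, so both Recall@2 and the reciprocal rank are 0.5. nDCG is below 1.0 because the relevant items are not at the top of the list.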

Interpret carefully

Metric improvement is evidence for a bounded claim about the suite, model, and configuration used. It is not universal proof that every future agent task will improve.

Read Reproducibility.
