Chapter 1: Evaluating pass@k, and what it doesn't tell you

Problem

Every modern ML benchmark paper reports pass@1, pass@5, or pass@10. The reader is meant to understand intuitively that a higher number is better. The subtler question is what the metric is measuring, when it reflects real ability, and when it overstates it.

This chapter builds a small reproducible benchmark where we control the generators and the "solvers", and uses it to answer three questions with numbers:

What does pass@k measure, and how is it computed correctly?
Why does pass@k depend on whether samples from the solver are independent?
How much does sample correlation change the answer in practice?

The rig is under 400 lines of Python, uses only the standard library, and runs in about one second. The public copy ships with this book as run.py; it does not require access to the private genass workspace.

Eval

The pass@k estimator

Given a solver that can be sampled repeatedly on a problem and a way to check whether a sample is correct, pass@k is the probability that at least one of $k$ independent samples is correct. The naive estimator is:

\widehat{\text{pass@}k}_{\text{naive}} \;=\; \mathbb{1}\!\left[\exists\, i \in \{1, \ldots, k\} : \text{correct}(s_i)\right]

which is 0 or 1 per problem, and then averaged across problems. The problem with the naive estimator is variance: a single run of $k$ samples is a very noisy estimate of the underlying probability.

The standard fix, from the HumanEval paper (Chen et al., 2021), is to draw $n \gg k$ samples per problem, count the number correct $c$ , and use the unbiased estimator

\widehat{\text{pass@}k} \;=\; 1 \;-\; \frac{\binom{n - c}{k}}{\binom{n}{k}} \;=\; 1 \;-\; \prod_{i=0}^{k-1} \frac{n - c - i}{n - i}

when $n - c \geq k$ , and 1 otherwise. The right-hand product form is what we implement — it is numerically stable and does not overflow for the $n$ and $k$ we care about.

The math home for this estimator is Chapter 7 of Mathematical Awakening: it is a straight application of sampling without replacement and the hypergeometric distribution. There is nothing ML-specific about it.

Reference implementation

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k from n samples, c of which are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

Three lines. Copy it into any evaluation harness and you will be reporting a better number than most ML papers that predate 2021.

Method

The benchmark

We build 120 grade-school math word problems across three difficulty tiers:

Easy (40 problems): single-operation problems, e.g.\ "Anna has 15 apples and picks 3 more. How many does she have?"
Medium (40 problems): two-operation problems, e.g.\ "Anna has 12 apples. She gives 4 to Ben, then picks 3 more. How many does she have?"
Hard (40 problems): multi-step with multiplication and division, e.g.\ "Anna picks 4 apples from each of 3 trees. She gives 2 to Ben, then splits the rest equally with 3 friends. How many apples does each person get?"

Each problem is generated deterministically from a seed and carries a canonical operator program so we can verify correctness exactly — there is no grading ambiguity.

The solvers

Four solvers exercise different points in the quality–stochasticity space.

regex_baseline — extracts all numbers from the problem, guesses the operation from a keyword ("gives", "picks"), and returns. It is mostly deterministic; small randomisation only kicks in when no numbers are found. This is the "what if we ignored the problem structure?" baseline.
parser_solver — executes the canonical operator program. On easy and medium, it is perfect. On hard it simulates a 50/50 chance of misidentifying the division step (a real failure mode when parsing grade-school word problems with regex-style tooling). That simulated 50/50 is what makes this solver stochastic on hard problems.
noisy_oracle_p70_corr0 — a synthetic LLM substitute with per-tier accuracy {easy: 0.85, medium: 0.70, hard: 0.55} and independent samples. Each draw is unconditional on prior draws. This is the "ideal iid sampler" baseline.
noisy_oracle_p70_corr50 — same accuracy as (3), but with correlation 0.5: if the first sample on a problem was wrong, subsequent samples have a 50% probability of repeating that specific wrong answer. This simulates the "stuck on a mode" failure of real LLM samplers, where the model has a wrong strong prior and temperature > 0 fails to escape it.

No actual LLMs are called. That is deliberate: the chapter is about how the metric behaves, and the synthetic oracles expose the behaviour without introducing API-key or compute confounds. A reader with access to a real LLM can swap in a one-line adapter and re-run the same harness.

Rig

A single public script, run.py. Save it anywhere and run:

python3 run.py --out results

The relevant loop:

for solver_name, solver_fn in solvers.items():
    rng = random.Random(seed + SOLVER_SEEDS[solver_name])
    per_problem = {"easy": [], "medium": [], "hard": []}
    for problem in problems:
        correct = 0
        for _ in range(n_samples):
            guess = solver_fn(problem, rng)
            if guess == problem.answer:
                correct += 1
        per_problem[problem.difficulty].append((n_samples, correct))
    # Per tier and overall: apply pass_at_k across problems,
    # then bootstrap a 95% CI.

Default configuration: 120 problems, 20 samples per problem, seed 42, 1000 bootstrap resamples. All tunable from the command line.

The rig emits two files:

results/metrics.json — full per-solver, per-tier, per-k results with bootstrap CIs.
results/headline.json — a compact summary of the numbers that land in the chapter intro and on the aresalab web page.

Numbers

All numbers below are from the rig as of the most recent run.

Overall pass@k across the benchmark

Solver	pass@1	pass@5	pass@10
`regex_baseline`	0.383	0.383	0.383
`parser_solver`	0.835	0.990	1.000
`noisy_oracle_p70_corr0` (iid)	0.702	0.993	1.000
`noisy_oracle_p70_corr50` (sticky)	0.580	0.968	0.998

Three observations fall out immediately.

pass@k = pass@1 for a deterministic solver. The regex baseline's pass@k is flat. This is the first gotcha of the metric: if you report pass@10 on a greedy-decoded model (temperature = 0), you are reporting pass@1 with ten times more compute spent. Many 2023–2024 papers did this and reported "pass@k improvements" that were just decoding-budget increases.

pass@k scales dramatically under independence. At pass@1 = 0.702, the iid oracle reaches pass@5 ≈ 0.993 and pass@10 = 1.000 on this benchmark. The classical bound $\text{pass@}k \approx 1 - (1 - p)^k$ predicts $1 - 0.3^{10} \approx 0.99994$ for $p = 0.7$ , which matches the empirical number within bootstrap noise. If the samples are genuinely independent and the base rate is decent, pass@10 saturates.

Correlation breaks the classical bound. The sticky oracle, with the same per-sample accuracy on the first draw, loses about 2.6 percentage points at pass@5 relative to the iid version, while pass@10 is already near saturation for both solvers. More strikingly, sticky's pass@1 itself drops — from 0.701 to 0.580 — because once the oracle commits to a wrong answer, subsequent draws on the same problem reinforce it, dragging the empirical per-problem accuracy down.

By difficulty tier

Where the degradation is concentrated:

Solver	Easy pass@1	Medium pass@1	Hard pass@1
`regex_baseline`	1.000	0.025	0.125
`parser_solver`	1.000	1.000	0.504
`noisy_oracle_p70_corr0` (iid)	0.855	0.700	0.549
`noisy_oracle_p70_corr50` (sticky)	0.763	0.563	0.415

The numbers clarify a separate point: the headline pass@1 is a mix of tiers. On easy problems the regex baseline is perfect; on medium it is essentially zero (it cannot handle two-operation problems). Headline pass@1 of 0.38 looks like "about a third right"; the breakdown shows the solver has a capability cliff, which is a very different thing. Any benchmark that averages across difficulty tiers without breaking them out is hiding information the reader needs.

Bootstrap confidence intervals

The full metrics.json carries 95% bootstrap CIs on every number. Representative widths from the iid oracle run, at 40 problems per tier with 1000 bootstrap resamples:

Tier	pass@1 95% CI width	pass@5 95% CI width
Easy	±2.3 pp	±0.0 pp (saturated)
Medium	±3.4 pp	±0.2 pp
Hard	±3.8 pp	±0.6 pp

A useful rule of thumb from this rig: 40 problems per tier gives roughly ±3 percentage points on pass@1 and well under ±1 percentage point on pass@5 at 95% confidence. The pass@5 band narrows dramatically because the metric is already saturated for any solver with decent per-sample accuracy, which is itself an argument for reporting pass@1 (or an unsaturated pass@k) alongside any headline pass@5.

What fails, and why

This is the section of every chapter that matters most.

pass@k ≠ test-time ability

pass@10 = 1.000 on this benchmark for the iid oracle does not mean the oracle has solved math. It means that if you are willing to spend 10× inference, the oracle gets there most of the time. That trade is useful to know about when budgeting compute for a production system, but it is not a claim about reasoning ability. The right way to read a paper that reports pass@10 is: "how much does the extra sampling help, and would I have actually deployed it this way?"

The iid assumption is almost never true

Real LLM samples at temperature > 0 are not iid. Models have mode-seeking behaviour: once a prompt tips the model into one reasoning path, subsequent samples often follow the same path with small variations. The sticky-oracle run shows that this is not a small effect — a correlation of 0.5 costs 2.6 percentage points at pass@5 in this synthetic setup. In practice, the effective sample-level correlation can be substantially higher on hard problems, and the classical $1 - (1 - p)^k$ bound routinely overstates observed pass@k.

The diagnostic the chapter recommends: compute pass@k empirically on a tier where the samples should be diverse, and compare to $1 - (1 - \text{pass@1})^k$ . If the empirical number is materially below the bound, you are seeing correlated sampling, and your scaling story is weaker than the bound suggests.

Grading is the silent confound

This rig grades by exact integer match against a known answer — the cleanest possible setting. Every other benchmark has grading noise: the string-match pass rate on HumanEval is sensitive to formatting; rouge-L scores on summarisation benchmarks disagree with human judgement at the margins; even exact-match pass rates on MATH depend on how boxed expressions are extracted. A pass@k number without a grading-noise estimate is half a number. The extension exercise at the end of this chapter is to add 2% random grading noise to this rig and see what happens to the CIs.

"Strong baselines" cost more to compute than strong models

The parser solver reaches pass@10 = 1.000 with zero LLM calls. On the easy and medium tiers, a 30-line Python parser is better than any language model anyone has shipped. The lesson is older than LLMs — for tasks with clean structure, the strongest baseline is often not a model at all — but it keeps getting forgotten. Before reporting pass@k on a new benchmark, build the parser-solver analogue. If it wins, the benchmark is not measuring what you want it to.

pass@k tells you nothing about which problems are hard

The overall pass@1 of 0.70 for the iid oracle is an average. The per-tier breakdown (0.85 / 0.71 / 0.55) tells a different story. A lower-variance benchmark would report stratified pass@k by problem category, by solution length, by required operation type, or by failure-mode taxonomy. This rig does the first of those; the next project in this book (retrieval for QA) will do the others.

Extensions

The ways to take this rig further, in increasing order of effort:

Plug in a real LLM. Swap the noisy_oracle with a one-line adapter that calls any chat-completion API. The harness, grading, and CIs all work unchanged. Expect pass@1 to be similar to the synthetic oracle on easy and medium, and substantially lower on hard — in the 0.35–0.55 range for a strong current model — with much higher sample correlation than the sticky-50% synthetic.
Add grading noise. Introduce a configurable probability of grading error and regenerate the CIs. Observe how quickly the "signal" of pass@k on a small benchmark is drowned out by grading variance.
Vary $n$ and $k$ jointly. The HumanEval estimator is unbiased for any $n \geq k$ , but the variance of the estimator is not. Plot variance of pass@k as a function of $n$ for fixed $k$ . This is a Chapter 8 estimator-theory exercise.
Correlation-aware pass@k. Propose an estimator that discounts for observed sample-level correlation on a per-problem basis. Compare to the naive iid bound. This is an open problem worth a short paper.
Swap in a retrieval layer. Many of the errors the synthetic oracle makes are formally correct answers to misread problems. Hand the solver a short reference passage (e.g.\ a reminder that "split equally with $n$ friends" means dividing by $n + 1$ ) and remeasure. That is the retrieval baseline the next chapter will build on.

Key Takeaways

pass@k is a probability estimator, not a rank statistic. The unbiased HumanEval form requires $n \gg k$ and comes straight out of sampling without replacement (Chapter 7 of Mathematical Awakening).
pass@k = pass@1 for a deterministic solver. Reporting pass@10 on a greedy-decoded run is a compute multiplier dressed up as a capability number.
The iid bound $1 - (1 - p)^k$ is an upper bound on real LLM pass@k. Sample correlation is the gap. A 0.5 stickiness costs ~2.6 percentage points at pass@5 in this rig; in practice it can be larger.
Always break pass@k out by difficulty tier. A headline pass@1 averaged across easy and hard hides capability cliffs that matter for deployment.
Bootstrap 95% CIs on every number, and report them. 40 problems per tier gives ±2 percentage points at pass@5; smaller benchmarks give ±5 or worse.
A strong non-ML baseline is the first thing to build. On structured tasks, a 30-line parser routinely beats a model. Before shipping pass@k on a benchmark, check whether the benchmark is even measuring what you want.
Reproducibility is cheap when the rig is small. The whole pipeline — generation, four solvers, estimator, bootstrap — fits in under 400 lines of Python and runs in one second. That is the bar the rest of this book holds itself to.