Chapter 10: Reading a Modern ML Paper — DeepSeek-R1 and the Return of RL
Why this chapter is here
The first nine chapters built a toolkit — calculus, linear algebra, probability, statistics, and the algorithmic ideas that build on them — and walked through the math of the training recipes you see in ML today. This chapter does one thing: it takes a single paper published in 2025 and reads it end to end, tagging every significant equation back to a section of this book.
The paper is DeepSeek-R1 (DeepSeek, January 2025). It was chosen for three reasons.
- It uses almost every idea from Chapter 9. The training recipe is group-relative PPO (GRPO), the alignment step is DPO-flavoured, the distillation step is cross-entropy SFT, the attention in the base model is multi-head with RoPE and grouped-query caching, and the overall compute allocation sits on the Chinchilla scaling law.
- It is self-contained and honest about what it does. The paper reports a lot of negative and partial results alongside the headline numbers, which makes it a good read for learning how to read papers.
- It was the paper that made strong reasoning-model training affordable for labs outside the largest three or four, so the recipe has been copied, stress-tested, and adapted widely in the year since. Whatever you read next on reasoning, alignment, or reward modelling will almost certainly cite this paper's recipe as a baseline.
The goal of this chapter is not to teach you DeepSeek-R1 specifically. It is to demonstrate that, with the math from the first nine chapters, you can read a current frontier paper and follow the equations without guessing. That is the capability worth having.
How to read this chapter
Each section names a claim in the paper, reproduces the relevant equation, and maps it back to the chapter where the math lives. You do not need to have read the paper to follow along. If you have not, read this chapter first; it will make the paper much cheaper to read afterwards.
10.1 The one-paragraph summary
DeepSeek-R1 demonstrates that a base language model (DeepSeek-V3-Base, 671B parameters, MoE) can develop strong chain-of-thought reasoning primarily through reinforcement learning, with comparatively little supervised fine-tuning. The paper introduces two artifacts:
- R1-Zero: the base model trained with pure RL (no SFT at all), using rule-based rewards for correctness and output format. It reaches frontier performance on maths and coding benchmarks but has readability problems — it mixes languages and formats erratically.
- R1: the same base model trained with a multi-stage recipe that mixes a small amount of cold-start SFT, reasoning-focused RL (as in R1-Zero), rejection sampling on the RL model's own outputs, and a final RL stage that includes preference data. The result reads cleanly and reaches comparable benchmark scores.
The recipe also shows that the reasoning capability distils cleanly into smaller open models via supervised fine-tuning on R1's generated traces. This is the part of the paper that spread through the field in the weeks after release.
Two claims in the abstract are worth holding on to. First: reasoning behaviour emerges without being explicitly trained for — longer chains of thought, self-verification, backtracking. Second: the training recipe is modest by frontier-lab standards — no custom reward model, no PPO critic, no heavy infrastructure. Both claims are consequences of the math in Chapter 9, not accidents.
10.2 The R1-Zero training loop
The paper's headline loss is GRPO. For each prompt $q$, sample a group of completions $\{o_1, \dots, o_G\}$ from the current policy $\pi_{\theta_{\text{old}}}$, receive rule-based rewards $\{r_1, \dots, r_G\}$, and compute the advantage by z-scoring within the group:

$$A_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}$$
Then update the policy with the standard PPO clipped objective plus a KL penalty back to a frozen reference policy $\pi_{\text{ref}}$:

$$\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}_{q,\,\{o_i\}}\!\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\big(\rho_i A_i,\ \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i\big) \;-\; \beta\, D_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)\right]$$

where $\rho_i = \pi_\theta(o_i \mid q) \,/\, \pi_{\theta_{\text{old}}}(o_i \mid q)$ is the probability ratio between the current and the rollout-time policy.
Take this equation apart by what each piece is doing:
- The clip. This is the standard PPO trust-region surrogate from Section 9.1. It forbids any single update from moving too far from $\pi_{\theta_{\text{old}}}$ on the samples you just collected. Without the clip, a single lucky or unlucky rollout can wreck the policy. Math home: Chapters 2–4 (optimisation), Chapter 9.1 (PPO).
- The advantage $A_i$. Purely statistical. It is the within-group z-score of the reward. This is a control variate: subtracting the group mean removes the largest component of the variance of the gradient estimator, while dividing by the standard deviation rescales to unit variance. Math home: Chapter 8 (variance reduction, regression fundamentals).
- The KL penalty. A constraint that keeps the policy from drifting too far from the base model over many updates. KL is just the expected log-ratio, $D_{\mathrm{KL}}(p \,\|\, q) = \mathbb{E}_p[\log p - \log q]$, which is Chapter 7 material. Math home: Chapter 7 (information-theoretic divergences).
- The expectation $\mathbb{E}$. Over prompts and sampled completions. In practice a minibatch average, with all the caveats Chapter 8 spends time on (bias, variance, finite-sample noise).
There is no learned value function here. That is the first noteworthy choice the paper makes. Standard PPO uses a critic to estimate the advantage; GRPO replaces it with the within-group statistic. This is a direct cost reduction — a 671B-parameter critic would double training memory — but it only works because the group-relative advantage is a good enough estimator when rewards are crisp. If rewards were noisy and continuous (a learned reward model, for example), the value-function critic would be worth the cost. Chapter 8's framing of "when to trust the mean and variance of your sample" is the right lens for this decision.
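The two statistical moves above — the within-group z-score and the clipped ratio — fit in a few lines. The following is an illustrative NumPy sketch for a single prompt's group, not the paper's implementation; the KL penalty to the reference policy is omitted for brevity, and all inputs are toy stand-ins.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Within-group z-score: the GRPO advantage estimate."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def grpo_surrogate(logp_new, logp_old, rewards, clip_eps=0.2):
    """Clipped PPO-style surrogate with group-relative advantages.
    (KL penalty to the reference policy omitted for brevity.)"""
    A = group_advantages(rewards)
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * A
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * A
    return np.minimum(unclipped, clipped).mean()  # objective to maximise

# Group of 4 rollouts for one prompt: two correct (reward 1), two not (reward 0).
rewards = [1.0, 1.0, 0.0, 0.0]
print(group_advantages(rewards))  # correct rollouts get +1, incorrect get -1
```

Note what the binary reward does to the advantage: with half the group correct, every correct rollout gets exactly $+1$ and every incorrect one exactly $-1$. The z-score carries all the signal there is.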
The reward function
The R1-Zero reward has two components:

$$r = r_{\text{accuracy}} + r_{\text{format}}$$
Both are rule-based, which is the second noteworthy choice. For maths problems, $r_{\text{accuracy}}$ is 1 if the final answer matches the ground truth and 0 otherwise — a simple regex or symbolic match. For coding, it is whether the generated program passes the unit tests. The format reward $r_{\text{format}}$ checks that the output contains the expected <think>...</think> tags, giving the model a stable structural scaffold to reason inside.
From a statistical standpoint, rule-based rewards are a gift. They are unbiased by construction (the ground truth is the ground truth), low-variance (the same answer always gets the same reward), and cheap (a regex, not a forward pass through a 10B reward model). That combination — unbiased, low-variance, cheap — is what makes GRPO work without a critic. If you tried the same trick with a noisy learned reward model, the group-relative z-score would not carry enough signal, and you would need a critic to recover stability. The design space of "when do I need a value function" maps directly onto Chapter 8's design space of "when is the sample mean a good estimator".
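A reward of this kind really is a few lines of code. The sketch below is in the spirit of the paper's checker, not a reproduction of it: the `<think>` tags follow the paper, but the `answer:` extraction convention is a made-up stand-in for the paper's actual answer-matching logic.

```python
import re

def format_reward(output: str) -> float:
    """1 if the output contains a well-formed <think>...</think> block."""
    return 1.0 if re.search(r"<think>.*?</think>", output, re.DOTALL) else 0.0

def accuracy_reward(output: str, ground_truth: str) -> float:
    """1 if the extracted final answer exactly matches the ground truth.
    The 'answer:' convention here is a toy stand-in, not the paper's format."""
    m = re.search(r"answer:\s*(\S+)\s*$", output.strip(), re.IGNORECASE)
    return 1.0 if m and m.group(1) == ground_truth else 0.0

out = "<think>7 * 6 = 42</think>\nanswer: 42"
print(accuracy_reward(out, "42") + format_reward(out))  # → 2.0
```

Note the properties the text describes: the same string always gets the same reward (zero variance per answer), and the ground truth enters only through an exact match (no bias from a learned judge).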
10.3 Why reasoning emerges, mechanistically
One of the paper's most-quoted claims is that reasoning behaviour emerges from GRPO on the rule-based reward: longer chains of thought, self-verification steps ("wait, let me double-check..."), backtracking when a dead-end is reached. The paper calls one of these moments an "aha moment" and displays it as a milestone during training.
You do not need a theory of mind to explain this. The math of the policy gradient explains it.
Consider the advantage $A_i$. Within a group of rollouts of the same prompt, completions that reach the correct final answer get advantage $A_i > 0$; those that do not get advantage $A_i < 0$. The policy gradient step pushes up the log-probability of tokens in the high-advantage rollouts and pushes down the log-probability of tokens in the low-advantage ones. If longer chains of thought correlate with correctness on hard problems — which they do, empirically — then tokens that produce longer chains are disproportionately present in the high-advantage rollouts, and their probability goes up. Over enough gradient steps, the policy learns to produce longer chains on harder prompts.
That is the mechanism. Chapter 9.1 showed this in the abstract — policy gradient as $\nabla_\theta J(\theta) = \mathbb{E}\big[A \,\nabla_\theta \log \pi_\theta(o \mid q)\big]$ — and here is the concrete consequence. "Emergent" in this paper means "the reward did not specify it and the policy found it anyway", not "the model discovered a new cognitive faculty". Reading emergence claims this way is a useful discipline.
The corollary is also useful: rewards do not care about process. The R1-Zero reward only checks final answers and format. If a shorter, uglier chain of thought produced the same correct answer, it would get the same reward. The reason R1-Zero produces chains of reasoning at all is that short chains fail more often on hard problems, so the gradient steers away from them. If the reward explicitly rewarded short chains (faster inference, lower cost), you would get a different equilibrium — one the paper briefly explores in its ablations.
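The push-up/push-down mechanism can be seen in a toy setting. Below, a three-action softmax policy stands in for the token distribution; one REINFORCE-style step per rollout uses the closed-form gradient of $A \log \pi(a)$ with respect to the logits, which is $A\,(\text{onehot}(a) - \pi)$. This is an illustration of the mechanism only, not the paper's setup.

```python
import numpy as np

def probs(logits):
    """Softmax over logits."""
    z = np.exp(logits - logits.max())
    return z / z.sum()

def pg_step(logits, action, advantage, lr=0.5):
    """One policy-gradient step: grad of A * log pi(action) wrt logits."""
    onehot = np.eye(len(logits))[action]
    return logits + lr * advantage * (onehot - probs(logits))

logits = np.zeros(3)                                # uniform initial policy
p0 = probs(logits)[0]                               # 1/3
logits = pg_step(logits, action=0, advantage=+1.0)  # action 0 was in a winning rollout
logits = pg_step(logits, action=1, advantage=-1.0)  # action 1 was in a losing rollout
p = probs(logits)
print(p)  # action 0 up-weighted above 1/3, action 1 pushed below 1/3
```

Nothing in the update "knows" why action 0 is good; it only knows that action 0 appeared in a positive-advantage rollout. That is the entirety of the emergence story at the level of a single step.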
10.4 The R1 recipe: cold start, rejection sampling, final RL
R1-Zero's language-mixing and format fragility are real problems, so the paper also introduces R1, which uses a multi-stage recipe:
- Cold-start SFT. A small amount (on the order of thousands of examples) of high-quality, human-curated reasoning traces is used to SFT the base model. This is plain maximum-likelihood estimation (Chapter 8): minimise cross-entropy between the model's distribution and the curated distribution over tokens. The cold-start data primes the model for the format it will be rewarded on during RL.
- Reasoning-focused GRPO. The same loss as R1-Zero, with the SFT-primed model as the starting point. Everything from 10.2 applies.
- Rejection sampling to build a larger SFT corpus. Once the RL model is strong, generate many completions per prompt for a large pool of prompts, keep only those that are correct and well-formatted, and use those as a new SFT dataset. Train a fresh copy of the base model (or the intermediate checkpoint) on this data.
Rejection sampling is importance sampling with an indicator weight: draw $o \sim \pi_{\text{RL}}(\cdot \mid q)$, keep $o$ iff $r(q, o) = 1$. The resulting distribution is $p(o \mid q, r{=}1) \propto \pi_{\text{RL}}(o \mid q)\, \mathbf{1}[r(q, o) = 1]$, which is a reward-weighted posterior. That posterior is the target you would like to SFT onto, and you can approximate samples from it for free by just throwing away the failures. Math home: Chapter 7 (Bayes' rule, importance sampling), Chapter 8 (MLE).
- Final RL with preference data. A last round of training that also incorporates DPO-style preference pairs for helpfulness and harmlessness. This is Section 9.2 of this book, in production.
The interesting fact about this recipe is that each stage is doing a different math job. Stage 1 (cold start) is SFT = MLE. Stage 2 (reasoning RL) is policy gradient with a crisp reward. Stage 3 (rejection-sample SFT) is sampling from the reward-weighted posterior via importance sampling. Stage 4 is DPO. A reader armed with Chapter 8 can look at each stage and understand why it is there: not because of some opaque design intuition, but because each stage is the right statistical tool for what that stage is trying to do.
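Stage 3's "keep only the winners" loop is worth seeing in code. The sketch below is a toy pipeline: `sample_completion` and `reward` are stubs standing in for the RL model and the rule-based checker (the real thing calls a 671B model and a unit-test harness), but the filtering logic is the whole trick.

```python
import random

def sample_completion(prompt, rng):
    # Stub for the RL policy: "generates" a correct answer 30% of the time.
    return f"answer to {prompt}" if rng.random() < 0.3 else "wrong"

def reward(prompt, completion):
    # Stub for the rule-based checker from 10.2.
    return 1.0 if completion.startswith("answer") else 0.0

def build_sft_corpus(prompts, samples_per_prompt=8, seed=0):
    """Keep only reward-1 rollouts: samples from the reward-weighted posterior."""
    rng = random.Random(seed)
    corpus = []
    for q in prompts:
        for _ in range(samples_per_prompt):
            o = sample_completion(q, rng)
            if reward(q, o) == 1.0:
                corpus.append((q, o))
    return corpus

corpus = build_sft_corpus([f"q{i}" for i in range(100)])
print(len(corpus))  # roughly 30% of the 800 rollouts survive the filter
```

The acceptance rate is the importance-sampling cost: prompts the policy rarely solves contribute few or no examples, which is why this stage only works once the RL model is already strong.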
10.5 Distillation: getting reasoning into smaller models
The paper's most broadly cited contribution is that the reasoning behaviour distils into smaller models cleanly. The recipe is a one-liner:
- Take R1 as a teacher. Run it on a large pool of prompts, collecting its reasoning traces.
- Take a smaller open model (7B, 14B, 32B dense; or a smaller MoE).
- SFT the smaller model on the teacher's traces with cross-entropy loss.
That is it. No further RL, no reward model. The loss is

$$\mathcal{L}(\phi) = -\,\mathbb{E}_{(q,\, o) \sim \mathcal{D}_{\text{teacher}}} \left[ \sum_{t} \log p_\phi(o_t \mid q,\, o_{<t}) \right]$$
This is standard cross-entropy on the teacher's samples. It is not knowledge distillation in the KL-on-soft-logits sense — it is behavioural cloning. The reason it works is that the teacher has already discovered good reasoning chains via RL, and the student only needs to imitate them; the student does not need to discover them from scratch with its smaller compute budget. The variance cost of RL is paid once, by the teacher. Math home: Chapter 8 (MLE, cross-entropy loss).
The practical consequence of this step is what made R1 important for the field. A 32B dense model trained on R1's traces beats considerably larger models on reasoning benchmarks. The reasoning capability was genuinely transferable, and open releases of the distilled models let every lab and startup use that capability without owning a 671B-parameter training run.
10.6 Four questions to ask when reading any RL paper
The sections above showed how the math in this book lets you read DeepSeek-R1. That specific skill generalises. When you open any RL paper published from 2024 onwards, the questions below will get you 90% of the way to understanding it.
- What is the reward? Rule-based, learned, preference-derived, model-based? A learned reward model is usually a large hidden cost and the source of most instabilities. Rule-based rewards are a tell that the authors had a clean task where correctness was cheap to check.
- What is the variance-reduction strategy? Learned value function (classical PPO), group-relative statistics (GRPO), or something else? The answer almost always correlates with how noisy the rewards are.
- What is the exploration mechanism? Entropy bonus, KL penalty to a reference, temperature scheduling, sampling strategy. The absence of an explicit exploration bonus usually means the task is one where the base model already has reasonable coverage.
- How does the reference policy evolve? Frozen for the whole run, refreshed periodically, or updated every step? This controls how far the trained policy can drift from the base.
These four questions cover PPO, GRPO, DPO, RLAIF, and the various reasoning-focused variants that appeared in 2025. They will not tell you if a paper is right, but they will tell you what kind of paper it is and what to expect from its experiments.
10.7 What the paper does not claim
A good habit when reading ML papers is to notice what the authors are not claiming. DeepSeek-R1 is a useful example because the abstract is careful in a way that its social-media reception was not.
The paper does not claim that pure RL from a base model beats all alternatives — the R1 recipe exists precisely because R1-Zero has real problems that SFT cold-start fixes. It does not claim that rule-based rewards replace learned reward models in general — it works because the chosen tasks (maths, code) have cheap correctness oracles. It does not claim that the distilled models equal R1 — it claims they inherit most of the reasoning behaviour at a fraction of the inference cost.
Reading a paper well means separating the headline chart from the ablations and the caveats. The math chapters in this book are a tool for that: if the paper's claim depends on an unusual variance-reduction trick or an unusual reward design, you will spot it, and you will know what question to ask next.
10.8 What to read after this
This book stops here, but the field does not. The directions that are worth your time — and that this book has left you equipped to read — branch out from the ideas of the last two chapters.
- If the RL sections interested you. Read the original PPO paper (Schulman et al., 2017), then the RLAIF / Constitutional AI lineage, then the GRPO paper (Shao et al., 2024). Then read the R1 paper itself with the notes from this chapter in hand. After that, any new reasoning-model paper will read as a variation on themes you already know.
- If attention and transformers interested you. Read the original Attention is All You Need, then the FlashAttention series (Dao et al., 2022, 2023), then the RoPE paper (Su et al., 2021), then one of the long-context papers (YaRN, LongRoPE, or whichever is current when you read this). The architectural knobs are small in number; most papers tune one and report.
- If generative modelling and diffusion interested you. Read DDPM (Ho et al., 2020), then the score-matching line (Song et al., 2020), then the flow-matching paper (Lipman et al., 2022). The mathematical centre of gravity has shifted from denoising to velocity fields over the past three years, and flow matching is the cleanest way in.
- If the statistics chapter interested you. Read the Chinchilla paper (Hoffmann et al., 2022), then the broken-scaling-law paper (Caballero et al., 2022), then any of the 2024–2025 compute-optimal-for-inference analyses. Scaling laws are the most statistically mature part of modern ML and they read like the cleanest chapter of an applied statistics textbook.
None of these are textbooks. They are papers. That is deliberate: current research lives in papers, and this book was written to let you read them.
Closing
The promise at the start of this book was to teach the math that holds up. Ten years from now, the frontier model will not be R1 — the frontier model changes every six months. But the math that describes it will still be here: policy gradients, KL divergence, softmax, cross-entropy, least squares, power laws, Gaussian regression, rotation matrices. The algorithms that win tomorrow will be recombinations of these ideas, not replacements of them.
The reason to learn this math is not because it will let you build the next frontier model — most of us will not. The reason is that, when the next frontier model is announced, you will be able to read the paper, see what it is doing, and know what to believe. That is a durable skill, and it is the one thing a textbook can give you that a news cycle cannot.
If you made it this far — thank you for reading. The book was a gift first and a reference second, and it has been written so that the reference part ages well.
Key Takeaways
- A current frontier paper (DeepSeek-R1, 2025) can be read end to end using only the math in this book — calculus, linear algebra, probability, statistics, and the algorithms of Chapter 9.
- GRPO's defining move is removing PPO's learned value function and replacing it with the group-sample z-score as the advantage estimate. This is a statistical trick (control variate + variance normalisation) that only works when rewards are crisp; read it as Chapter 8, not as a new learning algorithm.
- Rule-based rewards are a gift: unbiased, low-variance, cheap. Their availability is what makes simplified training recipes possible. When a paper uses a learned reward model, expect more machinery around it (value functions, heavier KL control, larger groups).
- Emergent reasoning from RL has a mechanical explanation in the policy gradient: if longer chains correlate with correctness, the gradient promotes tokens in those chains. "Emergent" = "not specified by the reward, found anyway".
- Multi-stage training (cold-start SFT → reasoning RL → rejection-sample SFT → final RL with preferences) is an application of the right statistical tool for each job: MLE, policy gradient, importance sampling, DPO. Each stage is in the book already.
- Distillation into smaller open models is cross-entropy SFT on the teacher's rollouts. The variance cost of RL is paid once, by the teacher; the student imitates. This is why capabilities spread quickly across the field once a recipe is published.
- Four questions to ask of any RL paper: what is the reward, what is the variance-reduction strategy, what is the exploration mechanism, how does the reference policy evolve. They cover most of the 2025 literature.
- Notice what a paper does not claim. The social-media summary of R1 was much stronger than the paper's careful abstract. Math literacy plus careful reading is the combination that protects you from hype.
- The math in this book — PPO clipping (2017), DPO (2023), attention (2017), scaling laws (2020) — is already 5–10 years old and will still be on the table in 2030. The algorithms will recombine; the math does not get replaced.