Adaptive speculative decoding: picking draft lengths at runtime
Last time we discussed the changing economics of speculative decoding. The strategy for getting the most tokens out of a running model has become more complex as the “market” for tokens in the running inference engine has become more dynamic. The price of dropped draft tokens is nonzero, and even verified draft tokens don’t come for free. The result is that there is space for mechanisms that choose how far we speculate at runtime, depending on dynamic, online policies.
In this post, we want to take some steps to figure out what the optimal policy is for speculating in this fast-changing environment.
First, a new model
Let’s swap the model from the last post, for variety’s sake. Qwen3.6-35B-A3B is a hybrid mixture-of-experts model from the Qwen team.
The expert half is pretty much the same as we worked out for DeepSeek Flash: see the last post for the full expert maths. Every layer routes each token to a small subset of the routed experts plus one shared expert. (Compared with DeepSeek Flash, the knee where all experts are active, such as it is, arrives sooner.) This is the same coupon-collector picture from last time: at small batch each token tends to drag in its own fresh experts and amortises almost nothing, and the marginal token only rides resident experts for free once the batch size is large enough to have triggered most of them.
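The coupon-collector picture can be sketched numerically. This is a minimal sketch, not the post’s expert model: it assumes each token picks its experts uniformly at random and independently of every other token (real routers are not uniform), and the expert counts in the usage lines are hypothetical placeholders, not Qwen’s configuration.

```python
def expected_active_experts(batch_tokens: int, n_experts: int, top_k: int) -> float:
    """Expected number of distinct routed experts touched by one batch.

    Assumption (not from the post): every token selects `top_k` distinct
    experts uniformly at random, independently of the other tokens.
    """
    if not 0 < top_k <= n_experts:
        raise ValueError("need 0 < top_k <= n_experts")
    # Probability that one given expert is NOT selected by one given token.
    p_miss_one_token = 1.0 - top_k / n_experts
    # An expert is resident if at least one token in the batch selected it.
    return n_experts * (1.0 - p_miss_one_token ** batch_tokens)


if __name__ == "__main__":
    # Hypothetical expert counts, chosen only to show the shape of the curve.
    for batch in (1, 8, 64, 512):
        active = expected_active_experts(batch, n_experts=64, top_k=4)
        print(batch, round(active / 64, 3))
```

At small batch the active fraction grows almost linearly with the batch (each token brings its own experts); once it approaches one, extra tokens ride experts that are already resident.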
The attention half is pretty different. Recent Qwen models have bet on ‘hybrid attention’: mixing novel linear attention mechanisms (specifically, GatedDeltaNet) with traditional (GQA) attention. Qwen alternates its layers three to one: thirty of the forty are GatedDeltaNet linear-attention layers, and only ten are conventional full attention. (This is another path to KV cache compression, different from DeepSeek's maybe more ambitious modifications of the standard attention mechanism.) The upshot is that the result from last time, that MLA becomes compute bound when speculating, doesn’t apply: both the linear-attention and GQA layers have an arithmetic intensity that doesn’t saturate at any reasonable draft length, so speculation keeps paying for long sequences.
So for Qwen, it comes down to: the expert tax from last time, plus an attention bill that is quartered and, for most layers, flat in context. The full roofline maths is in the appendix.
What we’ve left out, as we did before, is the cost of producing the speculated tokens.
The changing face of speculation
There are two research threads that have changed how draft models are built:
One major boost in the performance of draft models has been to condition them on richer outputs from the target model. Conditioning eases the training objective of the speculator, making higher accept lengths easier to achieve. (Conditioning on the hidden states has a drawback, though: the speculator must run in series with the target model, since it needs the hidden states in order to run. Generally, hidden states from closer to the end of the target model are more useful than those from the start, and the speculator can only run once they're available, so there's little potential for overlap between speculator and target.)
The other half of the story powering the step change in speculative decoding is the hardware sympathy of the drafter. DFlash makes use of the conditioning on the hidden states to make diffusion, which has generally given poor performance for pure text generation, work for draft generation. The drafter workload is then much closer to its ridge point and produces its own tokens much faster, and the result is higher throughput for the same accept length.
Both factors are driving massive improvements in throughput (see this great work from the Modal, SGLang, and Z Lab teams).
We discussed last time that there are two costs to pay during speculation: the cost of the draft model, and the cost of the verify. We focussed on the cost of the verify, and held the draft cost as a constant fraction of the target model.
This is bad modelling.
The drafter has its own roofline
There are two different draft model architectures widely used at the moment.
The MTP head Qwen ships is a single transformer layer that drafts autoregressively. This is the EAGLE lineage, but in this case pretrained alongside the model. To propose a draft of $k$ tokens it runs $k$ times in sequence, each pass taking the last token and producing the next, and each pass consisting of a single layer followed by a projection through the vocabulary head. So the drafter’s cost is linear in $k$.
DFlash, on the other hand, is an eight-layer block-diffusion model that drafts a fixed block in one forward pass, every position at once. (Diffusion is a bit of an overloaded term here: the architecture and training methodology are pretty similar to the MLM loss from the BERT paper.) In the setup here that block is . A policy can choose to verify fewer than those positions, but it has still paid the draft cost of the full block.
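The two cost shapes described above (serial passes versus one fixed block pass) can be written down directly. This is a sketch under stated assumptions: `seconds_per_pass` and `seconds_per_block` are hypothetical inputs that a roofline model would supply, and the comparison covers drafting cost only, not how many tokens each drafter gets accepted.

```python
def mtp_draft_seconds(depth: int, seconds_per_pass: float) -> float:
    """Autoregressive drafter: one serial pass per drafted token."""
    return depth * seconds_per_pass


def dflash_draft_seconds(depth: int, seconds_per_block: float) -> float:
    """Block drafter: the whole block is paid for once, whatever depth is used."""
    return seconds_per_block if depth > 0 else 0.0


def cheaper_drafter(depth: int, seconds_per_pass: float, seconds_per_block: float) -> str:
    """Name of the drafter with the lower draft cost at this depth."""
    mtp = mtp_draft_seconds(depth, seconds_per_pass)
    dflash = dflash_draft_seconds(depth, seconds_per_block)
    return "mtp" if mtp <= dflash else "dflash"


if __name__ == "__main__":
    # Hypothetical timings, in arbitrary units.
    for depth in (1, 2, 4, 8):
        print(depth, cheaper_drafter(depth, seconds_per_pass=1.0, seconds_per_block=3.0))
```

With these made-up timings the serial drafter wins at shallow depth and the block drafter wins once the depth exceeds the ratio of the two pass times.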
Each has a cost to run at each point in batch-size and usable-depth space. At some points in that space, DFlash forward passes are cheaper, at some points EAGLE/MTP is cheaper:
The chart shows us only the cost of running the drafter. But different drafters are also different in their ability to generate acceptable tokens. In order to model it correctly, we curate datasets by running Qwen3.6-35B-A3B over the qualitative split of SPEED-Bench on Modal GPUs.
We capture fine-grained per-round acceptance/rejection data, which we’ll explore below. The headline: drafting against matched prompts at temperature , DFlash commits more: about tokens a round to MTP’s . (Matched collection on Qwen3.6-35B-A3B, over half a million draft rounds each: MTP commits of drafted, DFlash of , with DFlash ahead across the board, to on predictable material (code, retrieval, maths, multilingual) and only roughly level on the hardest, highest-entropy categories (reasoning, STEM, humanities, roleplay). Details are in the acceptance appendix.) This is very dependent on training for the DFlash head: in between gathering the data and publishing this post, the Modal team retrained the DFlash drafter model, promising longer accept lengths and higher throughput.
A study in simulation
With the cost model for the drafter in hand, and a detailed model of acceptance, we can begin to explore where it’s possible to profit from adaptive speculation. To do so without building the feature first, this kind of cost model has to go inside an engine, with a scheduler, a queue, and a stream of requests that come and go.
I do that in a discrete-event simulator of a vLLM-shaped engine, inference-lab. It’s designed to split the difference between two different ways of experimenting with inference systems. Pen-and-paper analysis gives a per-step cost but no system: no batching, no queueing, no sense of load or data dependence. A real engine has all of that, but it also carries confounds that are frustrating to litigate case by case: boot-to-boot variation in which kernel shapes got captured, CUDA graphs, availability of specific GPU SKUs, and the million other incidental but important parts of a running inference engine.
In inference-lab, at each scheduler step a batch is composed from a facsimile of vLLM’s scheduler logic: for aggregated, chunked prefill we follow their decode-preference scheduling, but we also support disaggregated prefill. Then the simulator prices that batch according to the modelling and increments a timing counter. Then eviction, token generation, admission, KV cache management, etc. kick in to produce the next step. It’s purely event-driven, so it’s really fast. (It also compiles to WASM, so you can do fun stuff like this: https://inference-lab.doubleword.ai/.)
The resulting performance is a target to hit. It tells us, if the idea is right, what we should achieve once we sit down and write it. These kinds of simulations are increasingly powerful as agents get better at software development and research, since we can use them to ground our intuitions on what should and shouldn’t matter, and so communicate the achievable and unachievable constraints of our ideas to agents.
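To make the compose, price, advance loop concrete, here is a toy closed-loop, decode-only version. It is not inference-lab’s code or API: the function name, the `step_cost` callback, and the single Bernoulli `accept_prob` are all assumptions made for illustration, and it ignores prefill, KV-cache limits and eviction entirely.

```python
import random
from typing import Callable, Tuple


def simulate_decode(concurrency: int, osl: int, depth: int,
                    step_cost: Callable[[int, int], float],
                    accept_prob: float, n_requests: int,
                    seed: int = 0) -> Tuple[int, float]:
    """Toy event loop: returns (committed tokens, simulated seconds)."""
    rng = random.Random(seed)
    live = [osl] * min(concurrency, n_requests)   # tokens left per live request
    waiting = n_requests - len(live)
    clock, committed = 0.0, 0
    while live:
        # Price the whole batch once, then advance the simulated clock.
        clock += step_cost(len(live), depth)
        next_live = []
        for left in live:
            # Draft tokens are accepted left to right until the first rejection.
            accepted = 0
            while accepted < depth and rng.random() < accept_prob:
                accepted += 1
            tokens = min(left, accepted + 1)      # +1: the target's own token
            committed += tokens
            left -= tokens
            if left > 0:
                next_live.append(left)
            elif waiting > 0:                     # closed loop: admit the next request
                waiting -= 1
                next_live.append(osl)
        live = next_live
    return committed, clock


if __name__ == "__main__":
    toy_cost = lambda batch, k: 1.0 + 0.05 * batch * (k + 1)  # hypothetical pricing
    tokens, seconds = simulate_decode(8, 256, 3, toy_cost, 0.7, 32)
    print(tokens / seconds)
```

Swapping `toy_cost` for a roofline-derived price and the Bernoulli draw for replayed acceptance rounds is the step that turns a toy like this into something closer to the study described here.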
The result measures some things, and models some things. Both the performance of the hardware, and the constraints of the inference engine’s software are modelled, not measured. The acceptance is measured, and supplied as an input to the simulator. And then standard measures of achievable performance — TPOT, TTFT, throughput, end to end latency — are measured as the output of the simulation.
The headroom
How to pick a draft length at runtime
When we pick a draft length $k$, we’re taking a bet: we pay in wall-clock time to produce and verify $k$ speculated tokens for the whole batch, and get back however many the target accepts. The right bet maximises total committed output per unit time, which is just steady-state throughput. Let $T(B, k)$ be the wall-clock cost of the whole scheduler step for a batch of $B$ decode sequences, and let $A_k$ be the per-sequence accepted draft tokens. Then a homogeneous draft depth $k$ has

$$\text{throughput}(k) = \frac{B\,\bigl(1 + \mathbb{E}[A_k]\bigr)}{T(B, k)}.$$

Each step the policy takes $k^\star = \arg\max_k \text{throughput}(k)$, over $k$ from $0$ (don’t speculate) up to the drafter’s usable depth.
This is basically just the speedup objective from last time. The idea is that now we’re optimizing it directly, at runtime, by searching over the draft depth to find the optimum. The numerator is the average number of tokens that actually commit per sequence at each depth. (We did all simulations with real per-position acceptance rates, i.e. the acceptance rate after one token, after two tokens, etc. Turns out we needn't have bothered: a single geometric rate is close enough to the measured curve that the throughput barely moves, even though the measured curves do bend off geometric; see the appendix.) The denominator is the real roofline cost of the whole step: the expert tax and the hybrid-attention bill for the live batch of sequences, each verifying the drafted tokens, plus the drafter charged according to its shape: serial passes for MTP, or the fixed block pass for DFlash.
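The search over draft depths can be sketched in a few lines. This is a minimal sketch and not the simulator’s implementation: it assumes a single geometric acceptance rate `alpha`, and `step_cost(batch, depth)` is a hypothetical stand-in for the roofline price of one scheduler step (verifier plus drafter).

```python
from typing import Callable


def expected_committed(depth: int, alpha: float) -> float:
    """Expected tokens committed per sequence per step.

    One token from the target itself, plus draft position i surviving with
    probability alpha**i (geometric acceptance, an assumption).
    """
    return 1.0 + sum(alpha ** i for i in range(1, depth + 1))


def pick_depth(batch: int, max_depth: int, alpha: float,
               step_cost: Callable[[int, int], float]) -> int:
    """Depth in [0, max_depth] with the highest committed tokens per second."""
    best_depth, best_rate = 0, float("-inf")
    for depth in range(max_depth + 1):
        rate = batch * expected_committed(depth, alpha) / step_cost(batch, depth)
        if rate > best_rate:
            best_depth, best_rate = depth, rate
    return best_depth


if __name__ == "__main__":
    # Hypothetical step cost: a fixed weight read plus work that grows with
    # the number of verified positions in the batch.
    toy_cost = lambda batch, depth: 1.0 + 0.01 * batch * (depth + 1)
    for batch in (1, 16, 256):
        print(batch, pick_depth(batch, max_depth=8, alpha=0.8, step_cost=toy_cost))
```

Depth zero is always in the search range, so the policy can decline to speculate when the step cost grows faster than the expected commits.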
Simulated priced policy matches best-in-hindsight tuning
The ultimate goal for a dynamic policy is to be better than hindsight. For each concurrency, the best fixed draft length is the one you would have chosen if you knew that exact operating point ahead of time. The simulator, running on the policy we’ve described, shows we can win on this metric. (The best dynamic policy could switch between the heads at runtime and achieve the envelope of these curves, at the cost of having both sets of weights on disk. The actual weights of the spec models are pretty small: most of them are shared with each other and with the target model (i.e. the vocab head), so this is somewhat reasonable. The thing that would make it difficult is that both drafters need to 'keep up', i.e. extend their KV cache over time with new tokens, even when they're not speculating.)
Represented as the standard throughput-latency curve, top right is good:
We’re doing input sequence length (ISL) of one token, output sequence length (OSL) of 1024. (Chunked prefill adds some interesting quirks, since you have to figure out how to trade off additional speculation against additional prefill chunks, which are required in order for requests to progress into the decode phase. ISL=1 OSL=1024 sidesteps this by representing the disaggregated prefill setup, i.e. assuming a perfectly paced and scaled prefiller on separate GPUs.) The chart is also a nice illustration of why DFlash is preferred even for similar accept lengths: the TPOT (per-user latency) at matched throughput is substantially higher, because the diffusion model sits further up its roofline.
This is nice! It tells us that if we build this speculation policy perfectly, we need never mistune our speculation parameter against the workload. The throughput-latency curve shows us (lighter curves) how wrong this can go: choosing not to speculate (close to optimal at batch size ) can decrease our TPOT at fixed throughput by a factor of 4 or more at batch size .
What to build
To make it real, you have to figure out how to build this policy in real time. There are a couple of issues:
- You don’t know the acceptance. This one’s pretty easy: you just keep an EMA estimate of the target acceptance rate over time, or use the drafter confidence.
- You don’t know the step cost. This one’s much harder. You can just use the roofline model: since, hopefully, the performance in the real engine is at least correlated with the modelled performance, you can pick the draft length that maximises modelled throughput as a stand-in for the one that maximises real throughput, without going too far off-policy. You can also build a model of the real engine on real hardware, pointwise in batch size and draft length, either with an online EMA or with a profiling step.
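The EMA idea from the first bullet could look something like the sketch below. The class name, the decay and the prior are hypothetical choices, not anything a real engine exposes. It tracks one conditional acceptance rate per draft position and only updates positions the verifier actually tested in a round.

```python
class AcceptanceEma:
    """Per-position EMA of P(position accepted | previous position accepted)."""

    def __init__(self, max_depth: int, decay: float = 0.95, prior: float = 0.5):
        self.decay = decay
        self.rate = [prior] * max_depth

    def update(self, drafted: int, accepted: int) -> None:
        """Fold in one round: `accepted` of `drafted` positions survived."""
        for pos in range(min(drafted, len(self.rate))):
            if pos < accepted:
                observation = 1.0      # this position was accepted
            elif pos == accepted:
                observation = 0.0      # first rejection
            else:
                break                  # later positions were never tested
            self.rate[pos] = self.decay * self.rate[pos] + (1 - self.decay) * observation

    def expected_accepted(self, depth: int) -> float:
        """Expected accepted draft tokens if we draft `depth` positions."""
        total, survive = 0.0, 1.0
        for pos in range(min(depth, len(self.rate))):
            survive *= self.rate[pos]  # probability of reaching and passing pos
            total += survive
        return total


if __name__ == "__main__":
    estimator = AcceptanceEma(max_depth=4)
    for drafted, accepted in [(4, 2), (4, 4), (4, 0), (4, 3)]:
        estimator.update(drafted, accepted)
    print(estimator.expected_accepted(4))
```

One design consequence: deep positions are only updated when drafts actually reach them, so a policy that settles on shallow depths stops learning about deep ones unless it occasionally explores.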
Once you’ve done that what do you get? A couple of things:
- For homogeneous workloads: you don’t ever have to tune the speculation length, and so speculation can be enabled by default. ‘Configuration tax’ is a real thing in inference engines: no matter what features you add to push performance, they’re useless unless people can find and enable them, and even AI coding agents aren’t very good at doing that.
- There’s a win across the board for heterogeneous, bursty, dynamic workloads. If your system lives between two operating points (i.e. half at concurrency , half at concurrency ), any static draft length will be mistuned. Even better, you can use the dynamism you’ve afforded yourself to accept more flexible workloads: switch the same deployment from Claude fast mode to regular speed just by increasing batch size, without having to relaunch or reconfigure how you’re doing speculation.
It’s gonna be hard to build though. The benefit of simulation is you don’t have to think about implementation difficulty; the drawback is you haven’t thought about implementation difficulty. If you can run speculation at any static length up to , you need to figure out how to do CUDA graphs for each speculative shape. Estimators need to be updated online and accurately, and must account for hysteresis, outliers, warmups, JIT compilation, etc. Plus, in the model I’ve picked here, speculation has some gnarly complications: rewinding linear layers when you mis-speculate is not simple. (Some nice work on making this easier: https://dao-lab.ai/blog/2026/replayssm/.)
But, on the balance of the simulation evidence at least, it’s worth it.
Appendix: the setup
Verifier
Qwen3.6-35B-A3B: 3 linear layers then 1 full-attention layer, repeated: 30 GatedDeltaNet linear and 10 GQA full-attention; every layer MoE ( routed shared, per token); hidden , vocab ; bf16 weights and KV.
Drafters
Both use the target’s B vocab head rather than shipping their own.
| | MTP | DFlash |
|---|---|---|
| Drafter | autoregressive, 1 MoE layer + EAGLE fusion ( M active) | block-diffusion, 8 dense SwiGLU- layers + 5-layer fusion ( M) |
| Draft | tokens over passes | fixed -block, one pass; verify may truncate |
Hardware
Simulated: one B200, TP1/EP1: bf16 , fp8 , fp4 PFLOP/s; TB/s HBM; GB.
Appendix: the verifier roofline
These are computed costs, derived from the specs above and the hardware: the price the simulator charges each decode step.
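As a reminder of what a roofline price means, here is the generic form: a step costs whichever is slower, its compute or its memory traffic. This is a textbook sketch, not the simulator’s pricing code, and the hardware numbers in the usage lines are round hypothetical values rather than B200 specifications.

```python
def roofline_seconds(flops: float, bytes_moved: float,
                     peak_flops: float, bandwidth: float) -> float:
    """Time for a kernel limited by either compute or memory traffic."""
    return max(flops / peak_flops, bytes_moved / bandwidth)


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOPs per byte) where the two limits meet."""
    return peak_flops / bandwidth


def is_memory_bound(flops: float, bytes_moved: float,
                    peak_flops: float, bandwidth: float) -> bool:
    """True when the work's intensity sits below the ridge point."""
    return flops / bytes_moved < ridge_point(peak_flops, bandwidth)


if __name__ == "__main__":
    PEAK = 1.0e15       # hypothetical: 1 PFLOP/s
    BANDWIDTH = 1.0e12  # hypothetical: 1 TB/s
    # A decode-like step: few FLOPs per byte of weights read.
    print(is_memory_bound(2.0e9, 1.0e9, PEAK, BANDWIDTH))
    print(roofline_seconds(2.0e9, 1.0e9, PEAK, BANDWIDTH))
```

Below the ridge, extra verify tokens add FLOPs without adding time, which is the regime where speculation is close to free for the verifier.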
The expert term is the same as last time: coupon-collector routing, of , knee at . The attention is where the model differs.
The ten full-attention layers behave in the standard attention way. They are grouped-query query heads, key/value heads, head dimension , and the KV cache is re-read on every decode step and grows linearly with context. So the arithmetic intensity of the standard attention layers goes as , with the verify length.
The thirty linear layers do not keep a growing cache at all. A GatedDeltaNet layer carries a fixed-size recurrent state, a small matrix per head (128 by 128, across 32 value heads), that it updates in place as it consumes each token. The state summarises the whole history in constant space, so there is nothing that grows with context to re-read. Across the thirty layers that state comes to roughly 33 megabytes per sequence, re-read every decode step no matter how long the sequence is.
At short context the fixed megabytes of recurrent state dominate the attention memory reads, and the growing KV cache of the ten full-attention layers only overtakes it past about tokens of context. Above that, the bill still climbs, but on only a quarter of the layers, so it climbs four times more slowly than full dense-attention.
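The crossover between fixed recurrent state and growing KV cache is a one-line division once both byte counts are written down. The sketch below takes the 128 by 128 state, 32 value heads, 30 linear layers and 10 full-attention layers from the text and assumes the state is stored in bf16. The key/value head count and head dimension in the usage lines are hypothetical placeholders, so the printed crossover illustrates the calculation and is not the post’s figure.

```python
BYTES_BF16 = 2


def gdn_state_bytes(n_layers: int = 30, n_value_heads: int = 32,
                    state_dim: int = 128, bytes_per_el: int = BYTES_BF16) -> int:
    """Fixed recurrent-state bytes per sequence (matrix state only)."""
    return n_layers * n_value_heads * state_dim * state_dim * bytes_per_el


def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_el: int = BYTES_BF16) -> int:
    """KV-cache bytes added per context token across the full-attention layers."""
    return n_layers * 2 * n_kv_heads * head_dim * bytes_per_el  # 2 = keys + values


def crossover_context(state_bytes: int, kv_per_token: int) -> float:
    """Context length at which the KV read equals the fixed state read."""
    return state_bytes / kv_per_token


if __name__ == "__main__":
    state = gdn_state_bytes()
    # Hypothetical GQA shape: NOT taken from the post.
    per_token = kv_bytes_per_token(n_layers=10, n_kv_heads=2, head_dim=256)
    print(state / 1e6, crossover_context(state, per_token))
```

Under these assumptions the matrix state alone comes to about 31.5 MB, a little under the roughly 33 megabytes quoted above; the difference would presumably be the small convolution state and any non-bf16 parts, which this sketch ignores.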
Details: the full-attention layers
These are ordinary softmax attention, so last post’s arithmetic-intensity model applies. For verify tokens over context tokens,
with the FLOPs per query-context pair, the bytes read per context token, and the bytes per query token. The layers are grouped-query ( query heads, key/value heads, head dimension ), and Qwen is not trained for an fp8 cache, so KV stays in bf16 () or we pay an accuracy cost:
At decode the KV read dominates the denominator (), so the intensity rises with context to the plateau
Against a B200’s bf16 ridge of these layers stay memory-bound until . This is different from MLA, which crossed the ridge at .
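The intensity model this subsection describes can be written out generically. This is a sketch with assumed symbol names, none of which are the post’s own notation: $q$ for the verify tokens, $L$ for the context tokens, $c$ for the FLOPs per query-context pair, $b_{\mathrm{ctx}}$ for the bytes read per context token, $b_{q}$ for the bytes per query token, and $R$ for the hardware ridge point.

```latex
% Arithmetic intensity of one full-attention layer: FLOPs over bytes moved.
I(q, L) = \frac{c \, q \, L}{b_{\mathrm{ctx}} \, L + b_{q} \, q}

% When the KV read dominates (b_ctx L >> b_q q), the context length cancels
% and the intensity approaches a plateau that is linear in the verify width:
I(q, L) \;\longrightarrow\; \frac{c \, q}{b_{\mathrm{ctx}}}

% The layer therefore stays memory-bound while the plateau is below the ridge:
\frac{c \, q}{b_{\mathrm{ctx}}} < R
\quad\Longleftrightarrow\quad
q < \frac{R \, b_{\mathrm{ctx}}}{c}
```

Read this way, the verify width at which these layers cross the ridge scales with the bytes stored per context token, so a bf16 cache pushes the crossing further out than a compressed cache would.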
Details: the GatedDeltaNet layers
Per layer, there is fixed state and per-token recurrence over it. The state is a matrix per value head, read and updated once per step:
independent of . The full simulator state also includes the small convolution state. The compute is the gated delta-rule update plus the read-out:
counting the rank-one write , the delta correction , and the output (the exact constant follows from the GatedDeltaNet update equations). The intensity is the ratio,
with no in it.
Comparing the two
For speculation, it works out the same: both layers amortise a per-sequence memory read over verify width . Neither intensity falls with context, the softmax one rises with towards and the linear one sits flat at .
The difference is their maintained state. Per sequence, the attention-memory read per step for Qwen3.6-35B-A3B is:
Below the fixed GatedDeltaNet state dominates the read; above it the dense-attention model does.
Appendix: the drafter roofline
MTP
The MTP head is a single decoder layer: an EAGLE-style fusion projection, grouped-query attention ( M params), and the layer’s MoE MLP. It drafts autoregressively: a draft of tokens is forward passes in series, one token produced per pass.
The dense part read on every pass is the output head plus attention and fusion, B. The active expert work adds params, with routed experts, one shared expert, and params per expert, so the active per-token pass is still B. But the resident expert read grows with batch, with coupon-collector:
So its per-pass roofline is
which is close to only while the resident expert set is still small. Across the batch range in the chart it remains below the B200 bf16 ridge of , so the weight read dominates and the cost of a -deep draft is
At this is .
DFlash
DFlash is an eight-layer block-diffusion drafter. Its layers are dense (SwiGLU width , grouped-query attention with query and key/value heads), so the eight come to B, and a small fusion that ingests hidden states from five of the target’s layers () adds another B. It carries no head of its own either, borrowing the target’s, so B. The deployed head here drafts a fixed block in one forward pass. Choosing a shallower changes the verifier width, not the draft pass, so positions ride that one read:
At decode () the block does not cross the ridge until , so the read dominates for every depth one would ever draft and the cost is
flat in the requested verify depth for this fixed block size.
Appendix: the acceptance data (measured)
This is the only thing we measure on real hardware. The banks come from draft rounds, collected by running Qwen3.6-35B-A3B (on Modal) as the target with the two drafters above, over the qualitative split of SPEED-Bench at temperature , and logging every speculation round. The banks are published as the Doubleword/qwen3.6-specdec-calibration dataset.
Each round records both the drafter’s per-depth confidence (read off its draft-time logprobs) and the outcome: the number of tokens the target actually committed. The curve below is the outcome side: how many drafted tokens survive verification at each depth. On a log axis a single geometric acceptance rate is a straight line, so the bend in each measured curve is its distance from geometric: both decay faster than geometric early, and DFlash’s deep tail is heavier than geometric, not lighter.
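The outcome-side curve can be computed from logged rounds roughly as follows. This is a sketch: it assumes each round has been reduced to a single integer, the number of accepted draft tokens, which is a hypothetical simplification and not the published dataset’s schema.

```python
from typing import List, Sequence


def survival_curve(accepted_per_round: Sequence[int], max_depth: int) -> List[float]:
    """Fraction of rounds in which draft position d (1-based) was accepted."""
    n_rounds = len(accepted_per_round)
    return [sum(1 for a in accepted_per_round if a >= d) / n_rounds
            for d in range(1, max_depth + 1)]


def conditional_rates(curve: Sequence[float]) -> List[float]:
    """Per-position acceptance given the previous position survived.

    A purely geometric curve has the same value at every position, so any
    variation here is the bend away from a straight line on a log axis.
    """
    rates, previous = [], 1.0
    for survive in curve:
        rates.append(survive / previous if previous > 0 else 0.0)
        previous = survive
    return rates


if __name__ == "__main__":
    rounds = [0, 1, 2, 2, 4, 4, 4, 1]  # made-up accepted counts, depth-4 drafts
    curve = survival_curve(rounds, max_depth=4)
    print(curve)
    print(conditional_rates(curve))
```

The mean number of accepted tokens per round is just the sum of the survival curve, which is why the curve is the natural input to a throughput objective.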
Appendix: the simulation
Everything in The headroom is simulator output — not measured.
The figure in The headroom is a closed-loop sweep over one setup: decode-only (ISL , OSL , so prefill is disaggregated), a single request size, concurrency to in powers of two, fixed seed. Acceptance is replayed from the banks above, one round per decode step, so it carries the real per-round variation rather than a single .
The sweep pits the best fixed draft length, chosen by simulating the E2E benchmark at every draft length and picking the best in hindsight, against the priced policy run once, choosing the draft length live. The code is in inference-lab; step-by-step reproduction of every figure here (pull the dataset, export the banks, run the sweep) is in examples/specdec/README.md.
