Behind the Stack, Ep 13: Faster Inference: Speculative Decoding for Batched Workloads

Jamie Dborin
Founder & Member of Technical Staff, Doubleword

Speculative decoding is usually framed as a technique for making real-time LLM interactions feel faster. But the story changes when you move from single-request latency to high-throughput, batched inference. Modern sparse MoE models shift the bottlenecks, and speculative decoding becomes useful in places where it was previously dismissed.

In this episode of Behind the Stack, Jamie unpacks:

  • How speculative decoding works
  • Why dense models stop benefiting at high batch size
  • Why MoE architectures keep the bandwidth bound regime open much longer
  • A batch-API-specific variant that uses queues for offline draft generation

A quick recap

Inference has two phases:

  • Prefill: parallel, compute bound
  • Decoding: sequential, bandwidth bound

Decoding requires repeatedly loading model weights from high bandwidth memory. Speculative decoding reintroduces parallelism by using a small draft model to generate several likely tokens ahead of time. The large model then verifies them in one forward pass. As long as you are bandwidth bound, you can get multiple tokens for roughly the price of one.
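
As a rough illustration, here is a minimal sketch of one draft-then-verify step with greedy acceptance. The callables, shapes, and acceptance rule are assumptions for the example; production systems typically use rejection sampling over the full distributions and manage KV caches explicitly.

    import torch

    def speculative_decode_step(target_model, draft_model, input_ids, k=4):
        """One speculative decoding step (greedy acceptance, for illustration).

        target_model / draft_model: callables mapping a token tensor of shape
        [1, seq_len] to logits of shape [1, seq_len, vocab]; both are assumed
        to share a tokenizer.
        """
        # 1. Draft: the small model proposes k tokens autoregressively.
        draft_ids = input_ids
        for _ in range(k):
            logits = draft_model(draft_ids)
            next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
            draft_ids = torch.cat([draft_ids, next_token], dim=-1)
        proposed = draft_ids[:, input_ids.shape[1]:]          # the k draft tokens

        # 2. Verify: one forward pass of the large model over prompt + drafts.
        target_logits = target_model(draft_ids)
        # The target's prediction at each draft position (what it would have generated).
        target_preds = target_logits[:, input_ids.shape[1] - 1 : -1, :].argmax(dim=-1)

        # 3. Accept the longest prefix where draft and target agree, then append
        #    the target's own next token (so every step gains at least one token).
        matches = (proposed == target_preds).squeeze(0)
        n_accept = int(matches.long().cumprod(dim=0).sum())
        accepted = proposed[:, :n_accept]
        bonus = target_logits[:, input_ids.shape[1] - 1 + n_accept, :].argmax(dim=-1, keepdim=True)
        return torch.cat([input_ids, accepted, bonus], dim=-1)

While the regime is bandwidth bound, the extra positions in the verify pass ride along with the weight loads the step was already paying for, which is where the "multiple tokens for roughly the price of one" comes from.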

Why dense models lose the benefit at high batch

High-batch inference changes the shape of the computation. The key concept is arithmetic intensity: the ratio of compute performed to bytes moved from memory. At small token counts per forward pass, dense models are bandwidth bound. But as batch size grows, MLP compute grows linearly with the number of tokens while the weight-loading cost stays fixed, so compute eventually overtakes weight-loading time.

This transition usually happens at a few hundred tokens per decode step. Beyond that, dense models become compute bound. In this regime, every extra speculative token adds real compute cost, and speculative decoding no longer offers a speedup.
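
As a back-of-envelope check on that number, here is the crossover for a dense MLP layer under assumed H100-class figures (roughly 1e15 FLOP/s of dense BF16 compute and 3.35e12 bytes/s of HBM bandwidth) with 2-byte weights:

    # Back-of-envelope crossover for a dense MLP layer (all numbers are assumptions).
    peak_flops = 1.0e15      # ~H100-class dense BF16 throughput, FLOP/s
    hbm_bandwidth = 3.35e12  # ~H100-class HBM bandwidth, bytes/s
    bytes_per_param = 2      # BF16 weights

    # Per decode step, each token performs ~2 FLOPs per weight (multiply + add),
    # while the weights are read from HBM once regardless of batch size:
    #   compute_time = batch * 2 * n_params / peak_flops
    #   load_time    = bytes_per_param * n_params / hbm_bandwidth
    # Setting them equal, n_params cancels out:
    crossover_batch = bytes_per_param * peak_flops / (2 * hbm_bandwidth)
    print(f"compute bound above ~{crossover_batch:.0f} tokens per decode step")  # ~300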

This is why dense models like Llama 3 or Gemma rarely benefit from speculative decoding in high-batch settings.

Why MoE models are different

Sparse MoE architectures activate only a few experts per token. This changes the arithmetic intensity picture: the weights that must be loaded scale with the total number of experts, while the compute per token scales only with the active experts. Relative to a dense model of the same total size, the token count at which the model turns compute bound is therefore roughly:

  • Multiplied by the total number of experts
  • Divided by the number of active experts per token

Because expert counts are large and active experts are few, MoE models remain bandwidth bound at much higher token counts. The threshold moves from hundreds of tokens to several thousand.
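
Continuing the same back-of-envelope reasoning, the crossover scales with total experts divided by active experts per token. The expert counts below are assumptions chosen for illustration (roughly Mixtral-like and DeepSeek-V3-like):

    # How the compute-bound crossover shifts for MoE layers (illustrative numbers).
    dense_crossover = 300  # ~tokens per decode step, from the dense estimate above

    configs = {
        "dense":                 (1, 1),     # (total experts, active experts per token)
        "8 experts, 2 active":   (8, 2),     # roughly Mixtral-like
        "256 experts, 8 active": (256, 8),   # roughly DeepSeek-V3-like
    }

    for name, (total, active) in configs.items():
        # Weights loaded scale with total experts; compute per token with active experts.
        crossover = dense_crossover * total / active
        print(f"{name:>23}: compute bound above ~{crossover:,.0f} tokens per step")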

The result: MoEs continue benefiting from speculative decoding even at batch sizes where dense models would have flipped into the compute bound regime.

When KV cache becomes the real bottleneck

In high throughput workloads, the limiter is often KV cache capacity, not compute. A GPU might hold a million tokens of KV state. That could mean:

  • 1000 concurrent requests at 1k tokens each
  • Only 10 concurrent requests at 100k tokens each
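
To see where a figure like a million tokens of KV state can come from, here is a rough sizing sketch. The model shape (32 layers, 4 KV heads under grouped-query attention, head dimension 128, FP16 cache) and the 60 GB memory budget are assumptions for illustration:

    # Rough KV-cache sizing (model shape and memory budget are assumptions).
    n_layers = 32
    n_kv_heads = 4          # grouped-query attention
    head_dim = 128
    bytes_per_value = 2     # FP16 KV cache

    # K and V vectors per token, across all layers.
    kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")       # 64 KiB

    budget_bytes = 60e9     # HBM left over after weights, activations, etc.
    max_tokens = budget_bytes / kv_bytes_per_token
    print(f"~{max_tokens / 1e6:.1f}M tokens of KV state")                   # ~0.9M
    print(f"~{max_tokens / 1_000:.0f} requests at 1k tokens, "
          f"or ~{max_tokens / 100_000:.0f} at 100k tokens")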

Once the cache is full, you cannot raise batch size any further. On top of that, the bandwidth cost of moving KV cache can dominate inference at long context lengths.

This context is exactly where speculative decoding helps. If you are stuck with a fixed, KV bound batch size and each decode step only produces, say, 10 tokens (one per request), speculative decoding might raise that to 30. You amortize the KV cache movement over more useful work per step.
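
A sketch of that amortization when the batch size is pinned by KV capacity; the draft length and acceptance rate are illustrative assumptions:

    # Tokens produced per decode step when batch size is pinned by KV capacity.
    batch_size = 10               # concurrent long-context requests (KV bound)
    baseline = batch_size * 1     # one token per request per step

    draft_len = 4                 # speculative tokens proposed per request
    acceptance_rate = 0.5         # fraction of draft tokens accepted (illustrative)
    # Each step yields the accepted drafts plus the target model's own next token.
    speculative = batch_size * (draft_len * acceptance_rate + 1)

    print(f"baseline:    {baseline:.0f} tokens per decode step")
    print(f"speculative: {speculative:.0f} tokens per decode step")   # 3x more tokens
    # Weights and KV cache are streamed from HBM once per step either way,
    # so the same memory traffic now buys ~3x as many generated tokens.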

Dense models rarely benefit here because they would already be compute bound at these token counts. MoEs stay bandwidth bound, so the opportunity remains.

A batch-API-specific variant

Traditional speculative decoding blocks the large model while the draft model runs, and vice versa. That is fine for low-latency inference, but inefficient for batch systems where queue depth is high and SLAs are relaxed.

A better pattern for batch APIs:

  • While requests wait in the queue, run a small model or prefix tree to produce draft tokens.
  • Attach those tokens to the queued request.
  • When the main model finally processes it, it verifies those draft tokens without needing to wait for them to be generated.

This asynchronous design means:

  • The big model never waits
  • The draft model runs continuously on idle requests
  • You pipeline both systems to maximise global throughput

The draft mechanism does not need to be GPU heavy. It can run on cheaper hardware or even CPU, depending on size and workload.
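
As one possible shape for this pipeline, here is a minimal sketch using Python threads and in-memory queues. The function names, the BatchRequest layout, and the stand-in draft_fn / verify_fn callables are assumptions, not a specific product API:

    import queue
    import threading
    from dataclasses import dataclass, field

    @dataclass
    class BatchRequest:
        prompt: str
        draft_tokens: list[int] = field(default_factory=list)

    pending: "queue.Queue[BatchRequest]" = queue.Queue()   # waiting for the big model
    ready: "queue.Queue[BatchRequest]" = queue.Queue()     # drafts attached, ready to verify

    def draft_worker(draft_fn, k: int = 8) -> None:
        """Runs continuously on idle requests; can live on cheap hardware or CPU."""
        while True:
            req = pending.get()
            req.draft_tokens = draft_fn(req.prompt, k)     # e.g. small model or prefix tree
            ready.put(req)

    def target_worker(verify_fn, batch_size: int = 32) -> None:
        """The big model only verifies pre-computed drafts; it never waits on drafting."""
        while True:
            batch = [ready.get()]                          # block for at least one request
            while len(batch) < batch_size and not ready.empty():
                batch.append(ready.get_nowait())
            verify_fn(batch)                               # one batched verify-and-extend pass

    # Wiring (draft_fn / verify_fn are stand-ins for real model calls):
    # threading.Thread(target=draft_worker, args=(draft_fn,), daemon=True).start()
    # threading.Thread(target=target_worker, args=(verify_fn,), daemon=True).start()

Because the two workers only communicate through queues, the drafting and verification stages scale and fail independently, which is the property that makes the pattern fit relaxed-SLA batch APIs.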

Takeaways

For high throughput MoE inference:

  • Dense models hit the compute bound wall early
  • MoEs stay bandwidth bound at much larger token counts
  • KV cache capacity and bandwidth often dominate at scale
  • Speculative decoding can be used to amortize KV cache movement
  • Queue based draft generation is a natural fit for batched APIs

If you are working on batched LLM workloads and want to explore this architecture in practice, we are opening a private preview of our batched inference API in January. If you would like early access or want to compare notes, sign up today.