Anatomy of a Diffusion Language Model

Diffusion Language Models (dLLMs) are a very different paradigm for text generation than the more familiar auto-regressive (AR) LLMs. They are a relatively new paradigm, and as such there is much more variety in the approaches taken to dLLM development than with AR LLMs which have mostly converged on a standard recipe. Diffusion language models are interesting for a simple reason: they offer a more flexible way to spend inference compute. Autoregressive models generate text one token at a time. Diffusion language models try to generate multiple tokens at once, and then resolve the mistakes that come from doing that.

This post looks at that tradeoff through three recent innovations in dLLMs: DFlash, DiffusionGemma, and Nemotron-Labs-Diffusion. The goal is to understand the main design choices that are emerging and comparing their relative merits.

What are Diffusion Language Models

I am going to use a broad definition of diffusion models. dLLMs are language models where the next input is not the previous sampled token, for example it might be a mask token or a noise token, and the input is chosen in a way that you can inference multiple tokens in parallel ¹. In a standard autoregressive model, the model samples one token, then that sampled token becomes part of the input for the next step, alonside the cached KV tensors. It is inherently serial and makes parallel sampling of future tokens impossible.

Diffusion Language models can start from a block of uncertain tokens and predict many positions in the block at the same time. The starting point might be all mask tokens, random noise, or random tokens sampled from the vocabulary. The important part is that the model is not forced to generate strictly left to right.

Why Diffusion LLMs Are Interesting

The main attraction is speed. An autoregressive model needs one forward pass per token that it generates. A diffusion model can spend one forward pass on a whole block of tokens. If it can accept more than one of those tokens per forward pass then the model will be faster than its autoregressive cousins. The difficulty is that language is highly conditional. Later words often depend strongly on earlier choices. If a model predicts several positions at once, it can make choices that are individually plausible but jointly inconsistent.

Problem with Consistency

The core problem is that parallel sampling does not automatically preserve consistency. Consider a prompt that asks the model to repeat exactly either Sentence A or Sentence B. Sentence A has Prefix A and Suffix A. Sentence B has Prefix B and Suffix B. Since both are valid completions, a diffusion model may have high confidence in both options at once.

consistencyimage Parallel sampling can choose locally plausible pieces from different completions, producing a globally inconsistent answer.

When each position is sampled separately, the result can combine pieces from different completions. The model might sample Prefix A and Suffix B. Each part came from a high probability completion, but the final sequence is inconsistent.

Autoregressive generation doesn’t have the same issue. Once the prefix is sampled, the suffix is conditioned on that prefix. The model no longer has to keep both completions alive. The earlier sample removes the ambiguity.

Diffusion models usually handle this by turning generation into an iterative process. The model predicts a block, accepts the positions it is confident about, and then runs again while conditioning on those accepted positions. Easy tokens can be fixed early. Hard tokens can receive more compute.

This improves quality, but it also loses some of the speed advantage. Single shot generation is fastest but least consistent. Many denoising steps are more consistent but slower. The useful operating point is somewhere between those extremes.

Diffusion Models as Speculators

This tradeoff makes diffusion models especially natural as speculative models. A speculator does not need to be perfect. It only needs to propose tokens that the target autoregressive model will often accept.

That changes the tradeoffs in design space. If the target model is going to verify the output, then the diffusion model can prioritize speed. Some inconsistencies are acceptable because the verifier will reject bad tokens. The diffusion model is being used to cheaply propose a block, not to be the final authority on the sequence.

This matters most when the expected acceptance length is large. An autoregressive speculator still has to generate its proposed tokens one at a time. A diffusion speculator can propose the whole block in one pass, at least in principle. If many of those tokens are accepted, the cost profile is much better. When using speculative decoding we usually choose a very small speculator relative to the verifier, so we tend not to care about the cost profile of the speculator, but if we are generating ~20 tokens to target > 10 accepted tokens at a time, the overhead can become significant.

This was a motivator for the original DFlash paper:

DFlash frames diffusion drafting as a way to reduce speculative decoding overhead when many proposed tokens are accepted. From The DFlash Paper.

Three dLLM Approaches

The three dLLM approaches we will look at here use the same broad idea in different ways. DFlash is built specifically for speculation. DiffusionGemma is a standalone encoder-decoder diffusion model, designed in such a way as to enable the model to correct itself during diffusion. Nemotron-Labs-Diffusion combines autoregressive and diffusion objectives inside one model.

The differences are useful because they show the main open questions. What should the diffusion token starting point be? Are diffusion models good enough on their own or should they be paired with an AR model? How many tokens per forward pass can you safely target?

DFlash

DFlash takes the narrowest version of the diffusion idea. It is not trying to replace the target model. It is trying to make the target model faster.

The target autoregressive model still does the prefill and produces hidden states that act as the KV Cache for the DFlash model. DFlash uses that cache and predicts a block of future tokens from mask tokens in a single forward pass. Those predicted tokens are then verified by the target model.

This keeps the diffusion model focused on one task: generate a useful speculation quickly. Since the target model checks the result, DFlash can tolerate lower standalone quality than a model that has to produce final text by itself.

DiffusionGemma

DiffusionGemma is a more complete diffusion language model, designed as a standalone model. It uses an encoder-decoder structure. The encoder handles the prefix and is trained with an autoregressive loss. The decoder is the diffusion part, and it generates the continuation.

One important choice is that DiffusionGemma does not start from mask tokens. It starts from randomly sampled vocabulary tokens. That means the decoder is trained to decide both which input tokens are noise and what they should become.

This gives the model a way to revise earlier choices. If a token looked good in one step but becomes uncertain after more context is available, it can be turned back into noise and resampled. That is harder for mask-based diffusion, because a mask-based model is mostly trained to map from mask tokens into text tokens, not from one text token into another.

DiffusionGemma also uses self-conditioning. The model feeds information from the previous prediction into the next denoising step. This gives the decoder a memory of its own earlier guesses while it continues refining the block.

The cost is that generation takes many steps. In the described setup, the model generates a block of 256 tokens over ~50 denoising steps. That is still parallel across the block, but it is not a single forward pass.

Nemotron-Labs-Diffusion

Nemotron-Labs-Diffusion tries to combine the two modes inside one backbone. The same model is trained with a normal autoregressive next-token loss and a block-wise diffusion denoising loss. It can use diffusion to propose blocks, and it can use autoregressive mode to verify.

The joint objective is meant to make the two modes help each other. The autoregressive loss teaches the model strong left-to-right structure. The diffusion loss teaches it to use lookahead and make parallel predictions across a block.

The most interesting inference idea is Quadratic Self Speculation. During verification, the model also receives masked positions for future tokens. This means that the verification pass can verify the current speculation and generate the next speculation at the same time.

In ordinary speculative decoding, there is a separate cost for producing the speculative tokens. Quadratic Self Speculation tries to hide that cost inside the verification pass. The version described in the work does not keep the bonus token from the verification step, so it is not exactly the same as speculative-speculative decoding, but the motivation is very close.

Quadratic Self Speculation decoding flow — Quadratic Self Speculation verifies the current draft while simultaneously preparing masked positions for the next draft. Left: the figure from the original paper. Right: a formulation of quadratic self speculation closer to Speculative Speculative Decoding. Rather than throwing away the bonus tokens, you try to predict the bonus token and speculate from that.

How Good Are They?

The key metric to judge the models are how many tokens they can reliably denoise per forward pass. The standalone diffusion models tend to perform worse on this metric, as they have higher thresholds for accepting tokens, to avoid accuracy degredation. The speculator models can denoise more aggressively with the confidence that their AR partner will correct the mistakes of their over exuberance.

results Across these approaches, the practical question is how many useful tokens each forward pass can produce without unacceptable quality loss.

In standalone mode, both Nemotron Diffusion and Diffusion Gemma show an approx 3x speedup over their equivalent-sized AR model, where the tradeoff is slightly degraded benchmark performance.

DiffusionGemma accuracy and speed comparison

Standalone diffusion models are faster than comparable AR baselines, but their quality tradeoff makes speculation the more immediately compelling use case.

Top: DiffusionGemma has slightly lower benchmark scores than its AR equivalent model, but it is 3x faster. Adapted from the original blog post.

Bottom: Nemotron-Diffusion has slightly lower benchmark scores in diffusion mode compared to AR mode, but is 2.7x faster. In the speculative mode it is 6x faster without benchmark loss because of the AR verification. Adapted from the original paper.

It doesn’t look like standalone diffusion models are quite ready to replace AR models. They are significantly faster but their competition is distillation of large models into smaller models, which themselves can use speculative decoding. So being 3x faster with degraded performance is a tricky place to find yourself. However as speculators they are extremely promising. DFlash is already becoming something of an industry standard, and it is already about as simple as a dLLM could be. If you add in multiple forward passes, self-correction, self-conditioning, and the other tricks from the standalone diffusion models you should be able to push up canvas sizes, acceptance rates, and tokens-per-forward and make even better speculators.