Discrete Diffusion vs GPT: same architecture, same data, different generation
Click Generate to start the race
| # | What Changes | GPT | Diffusion |
|---|---|---|---|
| 1 | Vocabulary | Standard chars | + 1 MASK token |
| 2 | Attention | Causal (sees left only) | Bidirectional (sees all positions) |
| 3 | Training | Predict next token | Denoise masked tokens |
| 4 | Loss scope | All positions | Masked positions only |
| 5 | Generation | Sequential L→R | Parallel by confidence |
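
Rows 2 and 4 of the table can be made concrete with a small sketch. It is illustrative only: the `MASK` symbol and all function names are hypothetical and not taken from the demo's code. It shows which positions each attention pattern may look at, and which positions contribute to each loss.

```python
MASK = "_"  # hypothetical stand-in for the single extra MASK token


def causal_mask(n):
    """GPT: position i may attend to positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Diffusion: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def gpt_loss_positions(inputs):
    """GPT: every input position is scored on predicting its next token."""
    return list(range(len(inputs)))


def diffusion_loss_positions(corrupted):
    """Diffusion: only positions currently holding MASK are scored."""
    return [i for i, tok in enumerate(corrupted) if tok == MASK]


if __name__ == "__main__":
    print(causal_mask(3))
    print(bidirectional_mask(3))
    print(gpt_loss_positions(list("hell")))
    print(diffusion_loss_positions(list("h_l_o")))
```

The two mask builders differ in a single comparison, which is consistent with the claim that the architecture itself is shared.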
Starts from all masks. At each step, the model sees the entire sequence (bidirectional) and predicts all positions simultaneously. It reveals the most confident predictions first: spaces and common patterns emerge before ambiguous positions.
~39 steps
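
The reveal-by-confidence loop described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the demo's implementation: the toy model simply peeks at a fixed target string and reports higher confidence for spaces, and revealing a fixed number of tokens per step (`tokens_per_step`) is an assumed schedule.

```python
MASK = None  # hypothetical marker for a still-masked position


def make_toy_model(target):
    """Stand-in for a bidirectional denoiser (hypothetical).

    Returns one (token, confidence) pair per position. It ignores the
    input and reads from `target`, with spaces given higher confidence.
    """
    def model(seq):
        return [(ch, 0.9 if ch == " " else 0.5) for ch in target]
    return model


def diffusion_decode(model, length, tokens_per_step):
    seq = [MASK] * length  # start from all masks
    steps = 0
    while MASK in seq:
        preds = model(seq)  # one forward pass predicts every position
        masked = [i for i, tok in enumerate(seq) if tok is MASK]
        # Most confident masked positions first.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:tokens_per_step]:
            seq[i] = preds[i][0]
        steps += 1
    return "".join(seq), steps


if __name__ == "__main__":
    text, steps = diffusion_decode(make_toy_model("to be or not"), 12, 4)
    print(text, steps)
```

Because several positions are filled per forward pass, the number of steps is roughly the sequence length divided by `tokens_per_step`, rather than the sequence length itself.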
Starts from a prompt. The model can only see tokens to its left (causal attention) and generates strictly one token at a time, left to right. Like typing on a keyboard.
~225 steps
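
For contrast, the left-to-right loop can be sketched in the same toy setting. The `next_token` function is a hypothetical stand-in for a causal model: it receives only the tokens generated so far and returns one new token, so each generated token costs one forward pass.

```python
def make_toy_next_token(target):
    """Stand-in for a causal model (hypothetical).

    Given the tokens so far, returns the next character of `target`.
    """
    def next_token(seq):
        return target[len(seq)]
    return next_token


def gpt_decode(next_token, prompt, n_new):
    seq = list(prompt)  # start from the prompt
    steps = 0
    for _ in range(n_new):
        seq.append(next_token(seq))  # the model sees only tokens to its left
        steps += 1
    return "".join(seq), steps


if __name__ == "__main__":
    text, steps = gpt_decode(make_toy_next_token("to be or not"), "to ", 9)
    print(text, steps)
```

Here the step count always equals the number of new tokens.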
GPT produces better text quality than diffusion at this scale. This is expected: diffusion gets a training signal from only ~50% of the tokens in each batch (the masked ones) and has a train-test distribution gap. The quality gap narrows at larger scale (MDLM, SEDD).
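
One way the ~50% figure could arise, shown here as an assumption rather than a description of the demo's training code: if each training sequence draws its masking rate uniformly from [0, 1], then on average half of the tokens are masked and therefore scored. The small simulation below checks that average; the function name and parameters are hypothetical.

```python
import random


def average_masked_fraction(seq_len, n_sequences, seed=0):
    """Average share of masked tokens when the rate is drawn from U(0, 1)."""
    rng = random.Random(seed)
    total_masked = 0
    for _ in range(n_sequences):
        rate = rng.random()  # assumed: masking rate sampled per sequence
        total_masked += sum(rng.random() < rate for _ in range(seq_len))
    return total_masked / (seq_len * n_sequences)


if __name__ == "__main__":
    print(average_masked_fraction(seq_len=64, n_sequences=2000))
```

A causal model trained on the same sequences is scored on every position, which is the comparison the paragraph above draws.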
What diffusion does demonstrate: ~6× fewer forward passes, parallel decoding, and a fundamentally different approach to language modeling.
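
The forward-pass ratio follows from the two approximate step counts shown in the panels above (~225 for GPT, ~39 for diffusion). A quick check, using those rounded numbers as inputs:

```python
gpt_steps = 225        # approximate count from the GPT panel
diffusion_steps = 39   # approximate count from the diffusion panel

ratio = gpt_steps / diffusion_steps
tokens_per_pass = gpt_steps / diffusion_steps  # if both produce the same number of tokens

print(f"about {ratio:.1f}x fewer forward passes")
```

Under the assumption that both models generate the same number of tokens, the same ratio is also the average number of tokens diffusion reveals per forward pass.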