🔬 MicroDiffusion LM

Discrete Diffusion vs GPT: same architecture, same data, different generation

PyTorch 2.0+ · 10.7M params · Tiny Shakespeare · 6L / 6H / 384E

The 5 Changes from GPT → Diffusion

| # | What Changes | GPT | Diffusion |
|---|--------------|-----|-----------|
| 1 | Vocabulary | Standard chars | + 1 MASK token |
| 2 | Attention | Causal (sees left ←) | Bidirectional (sees all ↔) |
| 3 | Training | Predict next token | Denoise masked tokens |
| 4 | Loss scope | All positions | Masked positions only |
| 5 | Generation | Sequential L→R | Parallel by confidence |
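
The first four rows of the table can be sketched in a few lines of framework-free Python. This is a minimal illustration, not the project's PyTorch code: every function name (`build_vocab`, `attention_mask`, `gpt_example`, `diffusion_example`, `mean_nll`) is hypothetical, and the per-token independent masking with a fixed `mask_ratio` is an assumption made for the example.

```python
import math
import random

def build_vocab(text, add_mask):
    """Change 1: character vocabulary; diffusion appends one extra MASK id."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    mask_id = len(chars) if add_mask else None
    vocab_size = len(chars) + (1 if add_mask else 0)
    return stoi, mask_id, vocab_size

def attention_mask(n, causal):
    """Change 2: allowed[i][j] is True when position i may attend to position j."""
    return [[(j <= i) or not causal for j in range(n)] for i in range(n)]

def gpt_example(ids):
    """Changes 3-4 (GPT): predict token i+1 from tokens 0..i; every position is scored."""
    inputs, targets = ids[:-1], ids[1:]
    return inputs, targets, list(range(len(targets)))

def diffusion_example(ids, mask_id, mask_ratio, rng):
    """Changes 3-4 (diffusion): hide a random subset behind MASK and recover it;
    only the masked positions are scored. Independent masking is an assumption."""
    masked = [rng.random() < mask_ratio for _ in ids]
    if not any(masked):
        masked[rng.randrange(len(ids))] = True  # guarantee at least one target
    inputs = [mask_id if m else t for t, m in zip(ids, masked)]
    loss_positions = [i for i, m in enumerate(masked) if m]
    return inputs, list(ids), loss_positions

def mean_nll(probs, targets, loss_positions):
    """Mean negative log-likelihood over the scored positions only.
    probs[i] is the predicted distribution (list of floats) at position i."""
    total = sum(-math.log(probs[i][targets[i]]) for i in loss_positions)
    return total / len(loss_positions)

# Usage: build one diffusion training example from a short string.
text = "to be or not to be"
stoi, mask_id, vocab_size = build_vocab(text, add_mask=True)
ids = [stoi[ch] for ch in text]
inputs, targets, loss_positions = diffusion_example(ids, mask_id, 0.5, random.Random(0))
```

The only difference between the two loss computations is the list of scored positions: `gpt_example` returns every position, while `diffusion_example` returns only the ones that were replaced by MASK.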

How It Works

🎨 Diffusion

Starts from all masks. At each step, the model sees the entire sequence (bidirectional) and predicts all positions simultaneously. It reveals the most confident predictions first, so spaces and common patterns emerge before ambiguous positions.

~39 steps
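
The unmasking loop described above can be sketched as follows. This is an illustration under stated assumptions, not the project's sampler: `predict` is a hypothetical stand-in for the bidirectional model that returns a `(token_id, confidence)` pair for every position, and the schedule (spreading the remaining masks evenly over a fixed number of steps) is an assumption, since the source only says the most confident predictions are revealed first.

```python
def diffusion_generate(predict, length, mask_id, num_steps):
    """Iteratively unmask a fully masked sequence, most confident positions first.

    `predict(seq)` is a hypothetical model stand-in returning one
    (token_id, confidence) pair per position of `seq`.
    """
    seq = [mask_id] * length
    steps_used = 0
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(seq) if tok == mask_id]
        if not masked:
            break
        preds = predict(seq)  # one forward pass scores every position at once
        # Assumed schedule: spread the remaining masks evenly over remaining steps.
        remaining_steps = num_steps - step
        k = -(-len(masked) // remaining_steps)  # ceiling division
        # Reveal the k masked positions with the highest confidence.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:k]:
            seq[i] = preds[i][0]
        steps_used += 1
    return seq, steps_used
```

Each loop iteration costs one forward pass, so the number of passes is bounded by `num_steps` rather than by the sequence length.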

⌨️ GPT

Starts from a prompt. The model can only see tokens to its left (causal attention) and generates strictly one token at a time, left to right. Like typing on a keyboard.

~225 steps
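
For comparison, the sequential loop can be sketched as below. Again this is a minimal illustration rather than the project's code: `next_token` is a hypothetical stand-in for the causal model, which sees only the tokens to its left and returns the id of the next one.

```python
def gpt_generate(next_token, prompt, num_new_tokens):
    """Append tokens strictly left to right.

    `next_token(prefix)` is a hypothetical model stand-in that returns the
    next token id given everything generated so far.
    """
    seq = list(prompt)
    steps_used = 0
    for _ in range(num_new_tokens):
        seq.append(next_token(seq))  # one forward pass per generated token
        steps_used += 1
    return seq, steps_used
```

Here the number of forward passes equals the number of generated tokens, which is why the step count grows with sequence length instead of staying near the ~39 steps quoted for diffusion.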

⚠️ Honest Assessment

GPT produces better text quality than diffusion at this scale. This is expected: diffusion gets training signal from only ~50% of the tokens in each batch (the masked ones), and it has a train-test distribution gap. The quality gap narrows at larger scale, as reported for MDLM and SEDD.
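
One way to see where a figure of about 50% could come from, assuming (this is not stated in the source) that each training batch draws its masking ratio $t$ uniformly from $[0, 1]$ and masks every token independently with probability $t$: the expected fraction of positions that contribute to the loss is

```latex
\mathbb{E}[\text{masked fraction}]
  = \mathbb{E}_{t \sim \mathcal{U}(0,1)}[t]
  = \int_0^1 t \, dt
  = \frac{1}{2}
```

whereas next-token training scores every position in the batch.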

What diffusion does demonstrate: ~6× fewer forward passes (~39 steps vs ~225), parallel decoding, and a fundamentally different approach to language modeling.