🔬 MicroDiffusion LM — Diffusion vs GPT Race

🎨 DIFFUSION

Parallel unmasking — reveals by confidence

0 / 224

⌨️ GPT

Sequential left-to-right — one token at a time

0 / 224

The 5 Changes from GPT → Diffusion

#	What Changes	GPT	Diffusion
1	Vocabulary	Standard chars	+ 1 MASK token
2	Attention	Causal (sees left ←)	Bidirectional (sees all ↔)
3	Training	Predict next token	Denoise masked tokens
4	Loss scope	All positions	Masked positions only
5	Generation	Sequential L→R	Parallel by confidence

Both models use the same transformer, the same RoPE, the same RMSNorm, and the same ReluSquared MLP. About 80% of the code is shared.
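Changes 1, 2 and 4 from the table are small enough to sketch in plain Python. This is an illustration only, not the demo's actual code: the `MASK` symbol, the helper names, and the probabilities are all made up for the example.

```python
import math

MASK = "_"  # hypothetical mask symbol; the table only says one MASK token is added


def build_vocab(chars, with_mask):
    """Change 1: character vocabulary, plus one MASK entry for diffusion."""
    vocab = sorted(set(chars))
    return vocab + [MASK] if with_mask else vocab


def attention_mask(n, causal):
    """Change 2: allowed[i][j] is True if position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]


def mean_nll(prob_of_target, positions):
    """Average negative log-likelihood over the chosen positions only."""
    positions = list(positions)
    return -sum(math.log(prob_of_target[i]) for i in positions) / len(positions)


gpt_vocab = build_vocab("abca", with_mask=False)   # ['a', 'b', 'c']
diff_vocab = build_vocab("abca", with_mask=True)   # ['a', 'b', 'c', '_']

causal = attention_mask(4, causal=True)    # lower-triangular: sees left only
bidir = attention_mask(4, causal=False)    # all True: sees every position

# Probability assigned to the correct token at each position (made-up numbers).
p = [0.9, 0.5, 0.8, 0.25]
masked_positions = [1, 3]

gpt_loss = mean_nll(p, range(len(p)))            # change 4, GPT: all positions
diffusion_loss = mean_nll(p, masked_positions)   # change 4, diffusion: masked only
```

Everything else (embedding, attention blocks, MLP) would be untouched by these three switches, which is in line with the claim that most of the code is shared.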

How It Works

🎨 Diffusion

Starts from all masks. At each step, the model sees the entire sequence (bidirectional) and predicts all positions simultaneously. It then reveals only its most confident predictions and leaves the rest masked for the next step, so spaces and common patterns emerge before ambiguous positions.

~39 steps
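The reveal-by-confidence loop can be sketched as below. This is a toy under loud assumptions: `toy_model` is a hypothetical stand-in that already knows the target text and fakes its confidences (spaces get the highest score), and the fixed `per_step` reveal count is invented for the example. The demo's real unmasking schedule is not stated here.

```python
MASK = None  # hypothetical stand-in for the MASK token


def toy_model(seq, target):
    """Hypothetical stand-in for the network.

    Returns one (token, confidence) pair per position. A real model would
    compute these from `seq`; this toy ignores `seq` and reads `target`.
    """
    out = []
    for i, ch in enumerate(target):
        conf = 0.99 if ch == " " else 0.5 + 0.4 * ((i * 7) % 10) / 10
        out.append((ch, conf))
    return out


def diffusion_generate(target, per_step):
    """Start fully masked; each step reveals the `per_step` most confident positions."""
    seq = [MASK] * len(target)
    history = []
    while any(t is MASK for t in seq):
        preds = toy_model(seq, target)  # one forward pass over the whole sequence
        masked = [i for i, t in enumerate(seq) if t is MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:     # commit only the most confident predictions
            seq[i] = preds[i][0]
        history.append("".join("_" if t is MASK else t for t in seq))
    return history


history = diffusion_generate("to be or not", per_step=3)
for step, snapshot in enumerate(history, start=1):
    print(step, snapshot)
```

In this toy the spaces appear in the first step, and the number of forward passes equals the number of snapshots rather than the number of characters.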

⌨️ GPT

Starts from a prompt. The model can only see tokens to its left (causal attention) and generates strictly one token at a time, left to right. Like typing on a keyboard.

~225 steps
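For contrast, sequential decoding can be sketched the same way. Again this is a toy: `toy_next_token` is a hypothetical stand-in that looks up the next character of a known target, where a real model would sample it from a distribution conditioned on the prefix.

```python
def toy_next_token(prefix, target):
    """Hypothetical stand-in for the network: uses only the left context length."""
    return target[len(prefix)]


def gpt_generate(prompt, target):
    """Append one token per forward pass until the target length is reached."""
    seq = list(prompt)
    steps = 0
    while len(seq) < len(target):
        seq.append(toy_next_token(seq, target))  # one forward pass per new token
        steps += 1
    return "".join(seq), steps


text, steps = gpt_generate("t", "to be or not")
print(text, steps)
```

The step count grows linearly with the number of generated tokens, which is why the GPT panel needs roughly one step per position.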

⚠️ Honest Assessment

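GPT produces better text quality than diffusion at this scale. This is expected, for two reasons. First, the diffusion loss covers only the masked tokens, so it gets training signal from roughly 50% of the positions in each batch, while GPT learns from every position. Second, diffusion has a train-test distribution gap. The quality gap narrows at larger scale (MDLM, SEDD).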
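One way to arrive at a figure near 50% is to assume each training sequence draws its mask rate uniformly from 0 to 1, so that half of all positions are masked on average. That sampling scheme is an assumption made for this sketch; the page does not say how the mask rate is chosen.

```python
import random


def masked_fraction(num_sequences, seq_len, seed=0):
    """Fraction of positions that contribute to a masked-only loss.

    Assumes (hypothetically) a per-sequence mask rate drawn from Uniform(0, 1).
    """
    rng = random.Random(seed)
    masked = 0
    for _ in range(num_sequences):
        rate = rng.random()  # assumed: one mask rate per sequence
        masked += sum(rng.random() < rate for _ in range(seq_len))
    return masked / (num_sequences * seq_len)


frac = masked_fraction(num_sequences=2000, seq_len=64)
print(round(frac, 2))
```

Under that assumption about half of the positions carry a loss term in a typical batch, whereas a next-token objective scores every position.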

What diffusion does demonstrate: ~6× fewer forward passes (~39 steps vs. ~225), parallel decoding, and a fundamentally different approach to language modeling.