Attention Distribution
Full n×n attention distribution with causal masking. Each row shows how one token attends to itself and to all earlier tokens (autoregressive); masked future positions are shown as "—".
"John"
"sat"
"then"
"he"
"ran"
"John"
1.00
John → John: 1.000
—
—
—
—
"sat"
0.64
sat → John: 0.640
0.36
sat → sat: 0.360
—
—
—
"then"
0.45
then → John: 0.450
0.32
then → sat: 0.320
0.23
then → then: 0.230
—
—
"he"
0.72
he → John: 0.720
0.15
he → sat: 0.150
0.08
he → then: 0.080
0.05
he → he: 0.050
—
"ran"
0.22
ran → John: 0.220
0.28
ran → sat: 0.280
0.25
ran → then: 0.250
0.15
ran → he: 0.150
0.10
ran → ran: 0.100
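For reference, weights like these come from a row-wise softmax over masked query-key scores. Below is a minimal NumPy sketch of that computation, not the code behind this particular figure: `causal_attention_weights` is a hypothetical helper, and the random scores stand in for real QKᵀ/√d_k values, so the printed weights will not match the table above.

```python
import numpy as np

def causal_attention_weights(scores: np.ndarray) -> np.ndarray:
    """Apply a causal mask to raw attention scores, then softmax each row.

    scores: (n, n) matrix of raw (pre-softmax) query-key scores.
    Returns an (n, n) row-stochastic matrix; entry [i, j] is how much
    token i attends to token j, with all j > i masked out.
    """
    n = scores.shape[0]
    # Causal mask: True above the diagonal, i.e. the future positions.
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(mask, -np.inf, scores)
    # Numerically stable row-wise softmax; exp(-inf) = 0, so masked
    # positions get exactly zero weight and each row sums to 1.
    shifted = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

tokens = ["John", "sat", "then", "he", "ran"]
n = len(tokens)
rng = np.random.default_rng(0)
scores = rng.normal(size=(n, n))  # stand-in for Q @ K.T / sqrt(d_k)
A = causal_attention_weights(scores)

# Print in the same layout as the table above, with "—" for masked cells.
for i, tok in enumerate(tokens):
    row = "  ".join(f"{A[i, j]:.2f}" if j <= i else " — " for j in range(n))
    print(f"{tok:>6}  {row}")
```

Because the mask is applied before the softmax rather than after, no renormalization step is needed: zeroing out future positions first and then normalizing what remains is exactly what the masked softmax does in one pass.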