When generating text word by word, the model needs to remember everything it's already said. The KV cache is like a scratchpad that saves the computed key (K) and value (V) tensors so we don't have to recalculate them for every new token.
When generating text, each new token needs to attend to ALL previous tokens.
Without caching, every step recomputes K and V for ALL previous tokens, which can make generation up to ~100x slower on long sequences.
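Where does that "up to 100x" come from? A back-of-the-envelope count of K/V projections makes it concrete (a toy tally, not a benchmark; the 200-token sequence length is just an illustrative choice):

```python
# Toy count of K/V projection work during generation.
# Without a cache, step t recomputes K and V for all t tokens seen so far;
# with a cache, each token's K and V are computed exactly once.
def kv_projections(num_tokens: int, cached: bool) -> int:
    if cached:
        return num_tokens                               # one K/V projection per token
    return sum(t for t in range(1, num_tokens + 1))     # recompute everything each step

n = 200
without = kv_projections(n, cached=False)   # 20100 projections
with_cache = kv_projections(n, cached=True) # 200 projections
print(without / with_cache)                 # ~100x more K/V work without the cache
```

The redundant work grows quadratically with sequence length, so the speedup from caching grows with it: at ~200 tokens the ratio is already about 100x.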
Prefill: process the entire prompt in one pass, computing K and V for every prompt token and saving them.
Decode: generate one token at a time, reusing the cached K and V and computing them only for the new token.
Each block = the saved K and V for one token. The cache grows by one entry with each new word!
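The prefill/decode loop above can be sketched with a single toy attention head (NumPy, random weights, made-up dimensions; real implementations cache per layer and per head):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                          # toy head dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Single-query attention over all cached keys/values."""
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# --- Prefill: project the whole prompt at once and cache K, V ---
prompt = rng.standard_normal((5, d))           # 5 "token" embeddings
K_cache = prompt @ Wk
V_cache = prompt @ Wv

# --- Decode: one token at a time, appending one K/V row per step ---
x = rng.standard_normal(d)                     # embedding of the latest token
for _ in range(3):
    K_cache = np.vstack([K_cache, (x @ Wk)[None, :]])   # cache grows by one entry
    V_cache = np.vstack([V_cache, (x @ Wv)[None, :]])
    out = attend(x @ Wq, K_cache, V_cache)     # reuses all cached K, V
    x = out                                    # stand-in for the next token's embedding

print(K_cache.shape)                           # (8, 8): 5 prompt + 3 generated tokens
```

Note that decode only ever projects the single new token; everything older comes straight out of the cache.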
GQA (grouped-query attention) shares each K and V head across a group of query heads, dramatically reducing cache size.
| Context length | Standard (fp16) | With GQA | GQA + 4-bit |
|---|---|---|---|
| 4K | 2 GB | 0.25 GB | 0.06 GB |
| 8K | 4 GB | 0.5 GB | 0.13 GB |
| 32K | 16 GB | 2 GB | 0.5 GB |
| 128K | 64 GB | 8 GB | 2 GB |
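The table's numbers can be reproduced from the per-token formula, assuming a model configuration the table itself doesn't state (32 layers, head dimension 128, 32 KV heads for "Standard" in fp16, 4 KV heads for GQA, and 4-bit quantization applied on top of GQA); treat these as illustrative assumptions:

```python
def kv_cache_gib(context: int, num_kv_heads: int = 32, head_dim: int = 128,
                 layers: int = 32, bytes_per_val: float = 2.0) -> float:
    """KV cache size in GiB: 2 (K and V) x layers x kv_heads x head_dim x tokens x bytes."""
    return 2 * layers * num_kv_heads * head_dim * context * bytes_per_val / 2**30

for ctx in (4096, 8192, 32768, 131072):
    std = kv_cache_gib(ctx)                                      # fp16, 32 KV heads
    gqa = kv_cache_gib(ctx, num_kv_heads=4)                      # GQA: 4 KV heads
    q4 = kv_cache_gib(ctx, num_kv_heads=4, bytes_per_val=0.5)    # GQA + 4-bit values
    print(f"{ctx // 1024:>4}K: {std:6.2f}  {gqa:5.2f}  {q4:5.2f} GiB")
```

Under these assumptions the formula lands exactly on the table: 4K context gives 2 GiB standard, 0.25 GiB with GQA, and about 0.06 GiB with GQA plus 4-bit quantization, scaling linearly with context length up to 64 GiB / 8 GiB / 2 GiB at 128K.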