When generating text word by word, the model needs to remember everything it's already said. The KV cache is like a scratchpad that saves the computed key (K) and value (V) tensors so we don't have to recalculate them for every new token.
When generating text, each new token needs to attend to ALL previous tokens.
Without caching, every step recomputes K and V for ALL previous tokens, which can make generation up to ~100x slower on long sequences.
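Where does that "up to 100x" come from? A back-of-the-envelope count of K/V projections makes it concrete (a toy tally, not a benchmark; the 200-token sequence length is just an illustrative choice):

```python
# Toy count of K/V projection work during generation.
# Without a cache, step t recomputes K and V for all t tokens seen so far;
# with a cache, each token's K and V are computed exactly once.
def kv_projections(num_tokens: int, cached: bool) -> int:
    if cached:
        return num_tokens                               # one K/V projection per token
    return sum(t for t in range(1, num_tokens + 1))     # recompute everything each step

n = 200
without = kv_projections(n, cached=False)   # 20100 projections
with_cache = kv_projections(n, cached=True) # 200 projections
print(without / with_cache)                 # ~100x more K/V work without the cache
```

The redundant work grows quadratically with sequence length, so the speedup from caching grows with it: at ~200 tokens the ratio is already about 100x.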
Prefill: process the entire prompt in one pass, computing K and V for every prompt token and saving them.
Decode: generate one token at a time, reusing the cached K and V and computing them only for the new token.
Each block = the saved K and V for one token. The cache grows by one entry with each new word!
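The prefill/decode loop above can be sketched with a single toy attention head (NumPy, random weights, made-up dimensions; real implementations cache per layer and per head):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                          # toy head dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Single-query attention over all cached keys/values."""
    scores = q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

# --- Prefill: project the whole prompt at once and cache K, V ---
prompt = rng.standard_normal((5, d))           # 5 "token" embeddings
K_cache = prompt @ Wk
V_cache = prompt @ Wv

# --- Decode: one token at a time, appending one K/V row per step ---
x = rng.standard_normal(d)                     # embedding of the latest token
for _ in range(3):
    K_cache = np.vstack([K_cache, (x @ Wk)[None, :]])   # cache grows by one entry
    V_cache = np.vstack([V_cache, (x @ Wv)[None, :]])
    out = attend(x @ Wq, K_cache, V_cache)     # reuses all cached K, V
    x = out                                    # stand-in for the next token's embedding

print(K_cache.shape)                           # (8, 8): 5 prompt + 3 generated tokens
```

Note that decode only ever projects the single new token; everything older comes straight out of the cache.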
GQA (grouped-query attention) shares each K and V head across a group of query heads, dramatically reducing cache size.
| Context length | Standard (fp16) | With GQA | GQA + 4-bit |
|---|---|---|---|
| 4K | 2 GB | 0.25 GB | 0.06 GB |
| 8K | 4 GB | 0.5 GB | 0.13 GB |
| 32K | 16 GB | 2 GB | 0.5 GB |
| 128K | 64 GB | 8 GB | 2 GB |
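The table's numbers can be reproduced from the per-token formula, assuming a model configuration the table itself doesn't state (32 layers, head dimension 128, 32 KV heads for "Standard" in fp16, 4 KV heads for GQA, and 4-bit quantization applied on top of GQA); treat these as illustrative assumptions:

```python
def kv_cache_gib(context: int, num_kv_heads: int = 32, head_dim: int = 128,
                 layers: int = 32, bytes_per_val: float = 2.0) -> float:
    """KV cache size in GiB: 2 (K and V) x layers x kv_heads x head_dim x tokens x bytes."""
    return 2 * layers * num_kv_heads * head_dim * context * bytes_per_val / 2**30

for ctx in (4096, 8192, 32768, 131072):
    std = kv_cache_gib(ctx)                                      # fp16, 32 KV heads
    gqa = kv_cache_gib(ctx, num_kv_heads=4)                      # GQA: 4 KV heads
    q4 = kv_cache_gib(ctx, num_kv_heads=4, bytes_per_val=0.5)    # GQA + 4-bit values
    print(f"{ctx // 1024:>4}K: {std:6.2f}  {gqa:5.2f}  {q4:5.2f} GiB")
```

Under these assumptions the formula lands exactly on the table: 4K context gives 2 GiB standard, 0.25 GiB with GQA, and about 0.06 GiB with GQA plus 4-bit quantization, scaling linearly with context length up to 64 GiB / 8 GiB / 2 GiB at 128K.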