Building agents · 4 of 7

Memory — the agent's notebook

An agent that forgets what it did yesterday is severely limited. An agent that remembers everything is expensive, slow, and brittle. Choosing the right memory architecture is the second most impactful decision after tool design, and it is far less well understood.

Where the binding constraint sits today

The context window is the agent's only true working memory. Everything else — vector stores, structured DBs, document indexes — is a workaround. The question is which workaround fits which job.

The context window is the only true working memory

When an agent reasons, it reasons over the contents of its context window. Anything not in the window does not exist for the duration of that decision. This is the single most important fact about agent memory and the one most often forgotten when teams design memory systems. A vector store is not memory. A database is not memory. Both are sources from which content is retrieved into the context window. Memory, for the model, is exactly what is in the window when it generates the next token.

Two implications follow. First, every memory architecture is a retrieval problem: how does the right information end up in the window at the right moment? Second, every memory architecture is a cost problem: putting more into the window costs more, runs slower, and may degrade reasoning quality if irrelevant context pushes out relevant context. Good memory design is about getting the right slice in, not about storing more.

When to just put it in the prompt

The default memory architecture in 2026 is no memory architecture at all. Put everything the agent needs into the system prompt or the conversation history, send the model the whole picture, and let the long-context capability of the underlying model do the work. Frontier models with one-million-token context windows can absorb a remarkable amount before performance degrades. For agents that handle a single bounded task per execution, this is usually enough.

The decision rule: if the relevant context fits comfortably below the model's effective working window (which is meaningfully smaller than the advertised maximum context — typically 50-70% of the stated window in practice), and if the latency and cost are tolerable, do not build a memory system. Save the engineering effort. The longer the model's context capability grows, the broader the set of agents this rule covers.
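
As a rough illustration, the rule can be written down as a small check. This is a sketch, not a calibrated test: the function name, the `effective_fraction` default of 0.5 (the conservative end of the 50-70% range above), and the `cost_and_latency_ok` flag are all assumptions made for the example.

```python
def can_skip_memory_system(context_tokens: int,
                           advertised_window: int,
                           effective_fraction: float = 0.5,
                           cost_and_latency_ok: bool = True) -> bool:
    """Return True when the 'just put it in the prompt' default applies.

    effective_fraction is an assumed share of the advertised window that
    the model can use reliably; 0.5 is the low end of the 50-70% range.
    """
    effective_window = advertised_window * effective_fraction
    return cost_and_latency_ok and context_tokens <= effective_window


# Hypothetical numbers: a 1,000,000-token model and two candidate agents.
small_agent_ok = can_skip_memory_system(400_000, 1_000_000)
large_agent_ok = can_skip_memory_system(600_000, 1_000_000)
print(small_agent_ok, large_agent_ok)
```

The fraction is the part worth measuring for a given model and task; the check itself is deliberately trivial.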

Where the default fails: when the relevant context grows over time (an agent that has talked to a user for months), when it varies dramatically per request (an agent that may need any of thousands of documents), or when the cost of putting everything in the window every call is prohibitive at scale. In those cases, build a memory system. Otherwise, the simpler the better.

Episodic vs semantic memory

A useful conceptual split, borrowed from cognitive science: episodic memory is what the agent did, semantic memory is what the agent learned. The two have different storage requirements and different retrieval patterns.

Episodic memory is the log of past conversations, decisions, and tool calls. It is timestamp-indexed, ordered, and naturally fits a structured store (a database, an append-only log, a JSON file). Retrieval into the context window is usually by recency, by user, or by topic. A customer-support agent's memory of past tickets for this user is episodic.
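
A minimal sketch of such a log, kept in memory for brevity. The `Episode` fields and the `EpisodicLog` class are hypothetical names chosen for this example; a production version would sit on a database or an append-only file rather than a Python list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass(frozen=True)
class Episode:
    user_id: str
    timestamp: datetime
    kind: str      # e.g. "message", "decision", "tool_call"
    content: str


class EpisodicLog:
    """Append-only record of what the agent did, retrievable by user and recency."""

    def __init__(self) -> None:
        self._episodes: List[Episode] = []

    def append(self, episode: Episode) -> None:
        # Entries are only ever added, never edited in place.
        self._episodes.append(episode)

    def recent_for_user(self, user_id: str, limit: int = 5) -> List[Episode]:
        # Filter by user, then return the newest entries first.
        matches = [e for e in self._episodes if e.user_id == user_id]
        matches.sort(key=lambda e: e.timestamp, reverse=True)
        return matches[:limit]


# Usage with made-up support-ticket events.
t0 = datetime(2026, 1, 1, tzinfo=timezone.utc)
log = EpisodicLog()
log.append(Episode("u1", t0, "message", "asked about an invoice"))
log.append(Episode("u2", t0 + timedelta(hours=1), "message", "asked about login"))
log.append(Episode("u1", t0 + timedelta(hours=2), "tool_call", "looked up the invoice"))
latest = log.recent_for_user("u1", limit=1)
```

Only the slice returned by `recent_for_user` would be placed into the context window; the rest of the log stays in the store.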

Semantic memory is the distilled knowledge an agent has accumulated: the result of summarizing many episodes into the curated facts the agent should remember about a user or a domain. It often fits a key-value or embedding-indexed store. Retrieval is usually by semantic similarity to the current question. A personal assistant's memory of 'this user prefers concise replies' or 'this customer is on the enterprise tier' is semantic.

Most production agents need a small amount of both. The episodic side is usually built first because the log naturally exists. The semantic side is usually built by an offline process that periodically reads episodes and writes consolidated facts to a separate store. The agent reads from both at the start of each new task.
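
The offline consolidation step might look something like the following. Everything here is illustrative: `consolidate` and `toy_extractor` are hypothetical names, and the keyword matcher is only a stand-in for whatever summarizer (often a model call) a real job would use.

```python
from typing import Callable, Dict, Iterable, Tuple

Facts = Dict[str, str]


def consolidate(episodes: Iterable[Tuple[str, str]],
                extract_facts: Callable[[str], Facts]) -> Dict[str, Facts]:
    """Fold (user_id, episode_text) pairs into one fact record per user."""
    store: Dict[str, Facts] = {}
    for user_id, text in episodes:
        # Episodes are assumed to arrive oldest first, so newer facts win.
        store.setdefault(user_id, {}).update(extract_facts(text))
    return store


def toy_extractor(text: str) -> Facts:
    """Keyword stand-in for a summarizer; not a realistic extractor."""
    found: Facts = {}
    lowered = text.lower()
    if "shorter" in lowered or "concise" in lowered:
        found["reply_style"] = "concise"
    if "enterprise" in lowered:
        found["tier"] = "enterprise"
    return found


episodes = [
    ("u1", "User asked for shorter answers."),
    ("u1", "Confirmed the account is on the Enterprise plan."),
    ("u2", "User asked about a password reset."),
]
semantic_store = consolidate(episodes, toy_extractor)
```

The point of the shape is the separation: the episodic log is the input, the semantic store is a separate output, and the agent never has to reread raw episodes to recover a preference.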

Vector stores — when they help, when they don't

Vector stores, and the retrieval-augmented-generation (RAG) pattern more broadly, are the most over-applied tool in agent design as of 2026. The pattern works well when the corpus is large, the queries are semantic (not exact-match), and the agent's job is to surface relevant passages or facts. It works badly when the corpus is small (just put it in the prompt), when queries are structured (use a database), or when the agent needs the entire corpus, not a slice (use long context).

The common failure mode is to default to a vector store for every memory need. The cost is high: a vector store requires an embedding pipeline, serving infrastructure, and retrieval tuning, and it introduces its own latency and its own failure modes. A vector store should be a conscious choice for a specific job, not a default. The decision rule: if putting the documents directly in the prompt fails on size or cost, and the queries are genuinely semantic, then a vector store is the right answer. Otherwise, choose a simpler structure.

When a vector store is the right answer, the embedding model and the chunking strategy matter more than the vector database choice. Most production retrieval failures trace to bad chunking (splitting documents at the wrong granularity), not to which vector store was selected. The store is a commodity. The chunking and embedding are the engineering work.
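
To make "granularity" concrete, here is a deliberately simple word-window chunker with overlap. It is a sketch under stated assumptions: the function name and the default sizes are invented for the example, and real pipelines usually split on document structure (headings, paragraphs) and count model tokens rather than words.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into word windows of chunk_size, sharing `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Tiny example so the windows are easy to inspect by eye.
sample = " ".join(f"w{i}" for i in range(10))
pieces = chunk_words(sample, chunk_size=4, overlap=1)
```

Both parameters are tuning knobs: windows that are too small cut facts away from their context, windows that are too large dilute the embedding, and the overlap reduces the chance that an answer straddles a boundary.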

Structured memory — the underrated default

For most production agents, the right memory backend is a structured database with explicit schemas, not a vector store. User preferences, account details, conversation summaries, named entities the agent has seen, dates and amounts from past transactions — these all fit a structured store and retrieve faster and more reliably than they would from semantic search.

The pattern: a per-user (or per-account) record in a normal SQL database, updated by the agent at the end of relevant conversations, queried at the start of new ones. The agent's prompt includes the user's record at the top. The record is small (kilobytes, not megabytes) and retrieval is a single keyed lookup that takes milliseconds at most.
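
A minimal sketch of that pattern using Python's built-in `sqlite3` module. The table name, the columns, and the helper functions are assumptions chosen for illustration; a real schema would reflect whatever the agent actually needs to remember.

```python
import sqlite3
from typing import Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS user_memory ("
        "user_id TEXT PRIMARY KEY, tier TEXT, reply_style TEXT, summary TEXT)"
    )
    return conn


def save_record(conn, user_id: str, tier: str, reply_style: str, summary: str) -> None:
    # Called at the end of a conversation; replaces any earlier record.
    conn.execute(
        "INSERT OR REPLACE INTO user_memory (user_id, tier, reply_style, summary) "
        "VALUES (?, ?, ?, ?)",
        (user_id, tier, reply_style, summary),
    )
    conn.commit()


def load_record(conn, user_id: str) -> Optional[dict]:
    # Called at the start of a conversation; a primary-key lookup.
    row = conn.execute(
        "SELECT tier, reply_style, summary FROM user_memory WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    if row is None:
        return None
    return {"tier": row[0], "reply_style": row[1], "summary": row[2]}


def build_prompt(record: Optional[dict], task: str) -> str:
    # The user's record goes at the top of the prompt, ahead of the task.
    if record is None:
        header = "Known about this user: nothing yet."
    else:
        header = (f"Known about this user: tier={record['tier']}; "
                  f"reply style={record['reply_style']}; summary={record['summary']}")
    return f"{header}\n\nTask: {task}"


conn = open_store()
save_record(conn, "u1", "enterprise", "concise", "Asked about invoices twice.")
prompt = build_prompt(load_record(conn, "u1"), "Answer the new billing question.")
```

Explicit columns are the design choice here: each remembered fact has a name and a type, so it can be queried, audited, and corrected without any similarity search.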

This pattern works because most of what an agent needs to remember about a user is small, well-structured, and updates infrequently. Treating it as a vector retrieval problem would add complexity without benefit. The vector-store-by-default instinct came from the early RAG era when document QA was the canonical use case; agent memory is mostly not document QA.

Does long context obsolete vector RAG?

Frontier models in 2026 advertise context windows in the millions of tokens. Anthropic, Google, OpenAI, and several open-weights labs all ship long-context capability. The honest assessment of whether this obsoletes vector RAG: mostly no, but the boundary moves.

For agents whose total knowledge base fits in the window — a small company's internal docs, a single user's history, a focused codebase — long context plus careful selection is now usually simpler and more reliable than building RAG. For agents whose knowledge base is too large to fit (a multi-hundred-thousand-document enterprise knowledge base, the full public web, very long-running user history), retrieval is still required.

The practical impact is that the threshold at which RAG is worth building has shifted upward. Many agents that would have warranted a vector store in 2023 can now skip it. The agents that still need retrieval are larger-scope, and the engineering of retrieval is therefore more concentrated in fewer, more sophisticated systems.

The cost dimension

Memory is not free. A long-context prompt costs more per call. A vector retrieval adds latency. A structured database adds an extra hop. At low scale, none of this matters. At high scale, the cost of running an agent is dominated by inference cost, and inference cost is dominated by input token count, and input token count is dominated by how much context the agent is loading per call. A memory architecture that puts 50,000 tokens of context into every call is materially more expensive to operate than one that puts 5,000.
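
The arithmetic is easy to run for a specific deployment. In the sketch below the price and the call volume are placeholders, not any vendor's actual pricing; only the 50,000 and 5,000 token figures come from the text above.

```python
def monthly_input_cost(tokens_per_call: int,
                       calls_per_month: int,
                       price_per_million_tokens: float) -> float:
    """Input-token spend per month, in the same currency as the price."""
    return tokens_per_call * calls_per_month * price_per_million_tokens / 1_000_000


# Hypothetical deployment: 1,000,000 calls a month at 3.00 per million input tokens.
CALLS = 1_000_000
PRICE = 3.0

heavy = monthly_input_cost(50_000, CALLS, PRICE)
lean = monthly_input_cost(5_000, CALLS, PRICE)
print(f"heavy: {heavy:,.0f}  lean: {lean:,.0f}  ratio: {heavy / lean:.0f}x")
```

Whatever the real price is, the ratio between the two architectures stays the same: ten times the context per call is ten times the input-token bill.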

The discipline is to load only what the current task needs, not the maximum the agent might conceivably need. This is the practical reason most production agents have a memory system at all — not because the model needs the memory, but because feeding the model a smaller window per call is cheaper. The memory system is, in effect, a cost-optimization layer that decides what to retrieve into context.
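
One way to picture that layer is a greedy packer that fills a fixed token budget with the most relevant items first. This is a sketch only: `Snippet`, the relevance scores, and the word-count token estimate are assumptions for the example, and a real system would use the model's tokenizer and its own retriever's scores.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Snippet:
    text: str
    relevance: float  # hypothetical score from whatever retriever is in use


def estimate_tokens(text: str) -> int:
    # Crude stand-in: counts whitespace-separated words, not model tokens.
    return len(text.split())


def select_for_window(snippets: List[Snippet],
                      budget_tokens: int) -> Tuple[List[Snippet], int]:
    """Greedily keep the most relevant snippets that fit within the budget."""
    chosen: List[Snippet] = []
    used = 0
    for snippet in sorted(snippets, key=lambda s: s.relevance, reverse=True):
        cost = estimate_tokens(snippet.text)
        if used + cost <= budget_tokens:
            chosen.append(snippet)
            used += cost
    return chosen, used


candidates = [
    Snippet("user prefers concise replies", 0.9),
    Snippet("full transcript of an unrelated conversation from last spring", 0.5),
    Snippet("enterprise tier", 0.4),
]
chosen, used = select_for_window(candidates, budget_tokens=6)
```

Greedy selection is not optimal packing, but it makes the trade explicit: the budget is fixed first, and memory competes for it by relevance.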

Strategic read

The pattern most production agents converge to in 2026: a small structured user record, an episodic log of past actions, and (if the corpus warrants it) a focused vector retrieval over a specific document base. Most do not need vector retrieval at all. Most do benefit from a structured memory record. All of them benefit from being conservative about how much they load into the window per call.

For a team building agents, the rule is to start with the simplest possible memory architecture (no memory, just prompt) and add components only when a specific failure forces them to. For a buyer evaluating agents, ask the vendor what memory architecture they use and why. A vendor whose answer is 'we use a vector database' without further specifics is not engineering memory; they are bolting on a default. The teams that have thought hard about memory can describe what they chose, what they rejected, and why.
