Why Your AI Chatbot Forgets Everything (And How to Fix It)
Architecture of persistent user memory using vector search for LLM applications. Embedding strategies, retrieval patterns, memory decay, and per-user scoping.
Transactional Team
Feb 26, 2026
The Goldfish Problem
Every AI chatbot has the same infuriating limitation. Tell it something in conversation one, and by conversation two it has completely forgotten. Your preferences. Your context. Your name. Gone.
This is not a model limitation. GPT-4, Claude, Gemini -- they can all maintain context brilliantly within a single conversation. The problem is architectural. When the conversation ends, the context window is discarded. The next conversation starts from zero.
Persistent user memory solves this problem. A user who has told a support bot their account ID, their tech stack, and their deployment environment should not have to repeat that information ever again.
Here is the architecture.
Memory Architecture Key Metrics
- User satisfaction (after): 83%
- Avg. messages per conversation: 5.1
- Max memories per user: 500
- Min confidence threshold: 0.7
Why Session-Based Memory Fails
The naive approach to memory is simple: store the conversation history and include it in the next conversation's context.
System Prompt + Previous Conversations + Current Message → LLM
This falls apart immediately:
Context window limits. A user with 50 previous conversations has hundreds of messages. You cannot fit them all into a context window. Even if you could, the cost would be absurd.
Relevance decay. A conversation from three months ago about a billing issue is not relevant to today's technical question. Including it wastes tokens and can confuse the model.
Information density. Conversations are verbose. The user's preference for Python over JavaScript is buried in a 20-message conversation about SDK setup. Retrieving the full conversation to get that one fact is wildly inefficient.
Cross-conversation synthesis. The user mentioned their company uses Kubernetes in conversation 3, and asked about webhook reliability in conversation 12. These facts should be connected -- but session-based memory treats each conversation as an isolated document.
The Architecture: Vector-Based Memory
Our memory system extracts, embeds, and retrieves facts -- not conversations. Here is the flow:
User Message
|
v
Memory Retrieval ──── Search vector DB for relevant memories
| using the current message as query
v
Context Assembly ──── System Prompt + Retrieved Memories
| + Conversation History + User Message
v
LLM Response
|
v
Memory Extraction ──── Extract new facts from the conversation
| and store as vector embeddings
v
Memory Store ────────── Vector DB (scoped per user)
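The flow above can be sketched end to end. This is a minimal, synchronous stand-in: the function names (`retrieveMemories`, `extractFacts`, `handleMessage`) and the substring-based "extraction" are illustrative only; a real system would use embedding similarity for retrieval and an LLM call for extraction.

```typescript
// End-to-end sketch of the pipeline: retrieve -> assemble -> respond -> extract -> store.
// All names here are illustrative, not a real API.

type Memory = { userId: string; content: string };

const memoryStore: Memory[] = [];

// Stand-in retrieval: real systems use vector similarity, not a scan.
function retrieveMemories(userId: string, _message: string): Memory[] {
  return memoryStore.filter((m) => m.userId === userId); // per-user scoping
}

function assembleContext(memories: Memory[], message: string): string {
  const known = memories.map((m) => `- ${m.content}`).join("\n");
  return `What you know about this user:\n${known}\n\nUser: ${message}`;
}

// Stand-in extraction: a real system makes an LLM call here.
function extractFacts(userId: string, message: string): Memory[] {
  if (message.includes("I use Python")) {
    return [{ userId, content: "Uses Python" }];
  }
  return [];
}

function handleMessage(userId: string, message: string): string {
  const memories = retrieveMemories(userId, message);
  const context = assembleContext(memories, message);
  const response = `(LLM response given ${context.length} chars of context)`;
  memoryStore.push(...extractFacts(userId, message)); // persist new facts
  return response;
}

handleMessage("u1", "I use Python for my backend");
// A later conversation now retrieves the stored fact.
const recalled = retrieveMemories("u1", "unrelated follow-up question");
```

The key property to preserve in any real implementation is the same: extraction writes to the store *after* the response, so a fact learned in one conversation is available to the next.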
Memory Extraction
After each conversation turn, we extract discrete facts from the exchange:
interface MemoryFact {
  id: string;
  userId: string;
  organizationId: string;
  content: string;        // The fact as a natural language statement
  embedding: number[];    // Vector embedding of the content
  source: {
    conversationId: string;
    messageId: string;
    extractedAt: Date;
  };
  category: MemoryCategory;
  confidence: number;     // How confident we are this is a real fact
  lastAccessedAt: Date;   // For memory decay
  accessCount: number;    // How often this memory has been retrieved
}

enum MemoryCategory {
  PREFERENCE = "preference",   // "Prefers Python over JavaScript"
  CONTEXT = "context",         // "Uses Kubernetes for deployment"
  IDENTITY = "identity",       // "Works at Acme Corp"
  HISTORY = "history",         // "Had a billing issue in January"
  TECHNICAL = "technical",     // "Running PostgreSQL 15"
  BEHAVIORAL = "behavioral",   // "Prefers concise responses"
}
The extraction itself uses an LLM call with a specialized prompt:
const extractionPrompt = `Analyze this conversation turn and extract discrete, factual
statements about the user. Each fact should be:
- Self-contained (understandable without conversation context)
- Specific (not vague generalizations)
- Useful for future conversations

Conversation:
User: ${userMessage}
Assistant: ${assistantResponse}

Previously known facts about this user:
${existingMemories.map(m => `- ${m.content}`).join('\n')}

Extract new facts not already captured above.
Return as JSON array of { content: string, category: string, confidence: number }.
Do not re-extract facts that are already known.`;
The confidence field filters out uncertain extractions. "The user uses Python" (high confidence, directly stated) versus "the user might be interested in machine learning" (low confidence, inferred) -- we only store facts above 0.7 confidence.
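The post-processing step this implies can be sketched as follows. `parseExtraction` is a hypothetical helper; the raw JSON string stands in for an actual LLM response, and a real system would also validate the category field.

```typescript
// Parse the extraction LLM's JSON output and apply the 0.7 confidence floor.

interface ExtractedFact {
  content: string;
  category: string;
  confidence: number;
}

const MIN_CONFIDENCE = 0.7;

function parseExtraction(raw: string): ExtractedFact[] {
  let facts: ExtractedFact[];
  try {
    facts = JSON.parse(raw);
  } catch {
    return []; // malformed LLM output: store nothing rather than garbage
  }
  // Keep only high-confidence, directly stated facts.
  return facts.filter((f) => f.confidence >= MIN_CONFIDENCE);
}

const raw = JSON.stringify([
  { content: "Uses Python", category: "technical", confidence: 0.95 },
  { content: "Might be interested in ML", category: "context", confidence: 0.4 },
]);

const kept = parseExtraction(raw);
// Only the directly stated Python fact survives the filter.
```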
Memory Retrieval
When a new message arrives, we search the user's memory store for relevant context:
async function retrieveRelevantMemories(
  userId: string,
  currentMessage: string,
  limit: number = 10
): Promise<MemoryFact[]> {
  // Embed the current message
  const queryEmbedding = await embed(currentMessage);

  // Search for similar memories
  const results = await vectorDb.search({
    vector: queryEmbedding,
    topK: limit,
    filter: { userId },
    minScore: 0.7,
  });

  // Update access timestamps
  await updateAccessTimestamps(results.map(r => r.id));

  return results.map(r => r.metadata as MemoryFact);
}
We retrieve the top 10 most relevant memories and include them in the system prompt:
You are a support assistant for Transactional.
What you know about this user:
- Works at Acme Corp as a backend engineer
- Uses Python with FastAPI for their API
- Deploys on Kubernetes (GKE)
- Running PostgreSQL 15
- Had a webhook reliability issue in January (resolved)
- Prefers concise, technical responses
Use this context naturally. Do not explicitly reference
"your memory" unless the user asks.
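Assembling that prompt from retrieved memories is straightforward string construction. `buildSystemPrompt` is an illustrative helper, not part of any real SDK:

```typescript
// Turn a list of retrieved memory contents into the system prompt shown above.

function buildSystemPrompt(memories: string[]): string {
  const lines = [
    "You are a support assistant for Transactional.",
    "",
    "What you know about this user:",
    ...memories.map((m) => `- ${m}`),
    "",
    "Use this context naturally. Do not explicitly reference",
    '"your memory" unless the user asks.',
  ];
  return lines.join("\n");
}

const prompt = buildSystemPrompt([
  "Works at Acme Corp as a backend engineer",
  "Uses Python with FastAPI for their API",
]);
```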
Memory Decay
Not all memories are equally valuable over time. A preference stated yesterday is more relevant than a technical detail from six months ago. We implement memory decay using a combination of recency and access frequency.
Memories that are never accessed gradually fade. Memories that are frequently relevant stay prominent. This mimics how human memory works -- the details you use often stay sharp, while rarely accessed information fades.
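One way to combine recency and access frequency into a single score is an exponential recency term damped less for frequently used memories. The half-life and log weighting below are illustrative assumptions; the article does not pin down exact constants.

```typescript
// Sketch of a decay score: recency halves every HALF_LIFE_DAYS without
// access, and frequent retrieval slows the fade (log damping).

const HALF_LIFE_DAYS = 30;

function decayScore(
  lastAccessedAt: Date,
  accessCount: number,
  now: Date = new Date()
): number {
  const ageDays =
    (now.getTime() - lastAccessedAt.getTime()) / (1000 * 60 * 60 * 24);
  const recency = Math.pow(0.5, ageDays / HALF_LIFE_DAYS);
  const frequency = Math.log2(1 + accessCount);
  return recency * (1 + frequency);
}

const now = new Date("2026-02-26");
// Accessed yesterday, retrieved 5 times: stays prominent.
const fresh = decayScore(new Date("2026-02-25"), 5, now);
// Never accessed in six months: fades toward zero.
const stale = decayScore(new Date("2025-08-26"), 0, now);
```

Memories whose score falls below a floor become candidates for pruning.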
Memory Consolidation
Over time, a user accumulates redundant or overlapping memories. We run periodic consolidation:
async function consolidateMemories(userId: string): Promise<void> {
  const allMemories = await getMemoriesForUser(userId);

  // Find clusters of similar memories
  const clusters = await clusterByEmbedding(allMemories, { threshold: 0.9 });

  for (const cluster of clusters) {
    if (cluster.length <= 1) continue;

    // Merge cluster into a single, updated memory
    const merged = await mergeMemoryCluster(cluster);

    // Replace cluster with merged memory
    await deleteMemories(cluster.map(m => m.id));
    await storeMemory(merged);
  }
}
For example, three separate memories -- "Uses Python", "Prefers Python over JavaScript", "Backend is in Python with FastAPI" -- consolidate into "Uses Python with FastAPI for backend development, prefers Python over JavaScript."
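The merge step itself is another LLM call. Since the article leaves the merge model unspecified, this sketch only shows the prompt construction; `buildMergePrompt` is a hypothetical helper.

```typescript
// Build the prompt a consolidation pass would send to an LLM to merge a
// cluster of overlapping facts into one self-contained statement.

function buildMergePrompt(contents: string[]): string {
  return [
    "Merge these overlapping facts about a user into a single,",
    "self-contained statement. Preserve all specifics; drop duplication.",
    "",
    ...contents.map((c) => `- ${c}`),
  ].join("\n");
}

const mergePrompt = buildMergePrompt([
  "Uses Python",
  "Prefers Python over JavaScript",
  "Backend is in Python with FastAPI",
]);
```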
Scoping: Who Remembers What
Memory scoping determines the boundaries of what gets remembered and for whom:
Organization Level
└── Bot Level (support bot vs. sales bot)
└── User Level (individual user)
└── Conversation Level (session context)
Organization scoping means memories never leak between tenants. A user's data in Organization A is invisible to Organization B, even if it is the same user email.
Bot scoping means the support bot and the sales bot maintain separate memory stores. The support bot remembers your technical issues. The sales bot remembers your pricing discussions. They do not cross-contaminate.
User scoping is the primary boundary. Each user has their own memory store that persists across conversations.
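In practice, the hierarchy is enforced at query time: every lookup filters on all three IDs, so a memory outside the caller's organization, bot, or user scope is simply never returned. The field names below are illustrative.

```typescript
// Scope enforcement sketch: memories are only visible when organization,
// bot, and user IDs all match the caller's scope.

interface ScopedMemory {
  organizationId: string;
  botId: string;
  userId: string;
  content: string;
}

function scopeFilter(orgId: string, botId: string, userId: string) {
  return (m: ScopedMemory) =>
    m.organizationId === orgId && m.botId === botId && m.userId === userId;
}

const store: ScopedMemory[] = [
  { organizationId: "org-a", botId: "support", userId: "u1", content: "Uses K8s" },
  { organizationId: "org-b", botId: "support", userId: "u1", content: "Other tenant" },
  { organizationId: "org-a", botId: "sales", userId: "u1", content: "Pricing talk" },
];

// Same user, different org or bot: invisible.
const visible = store.filter(scopeFilter("org-a", "support", "u1"));
```

In a real vector database this filter would be pushed down into the search call (as in the `filter` argument of `retrieveRelevantMemories` above), not applied client-side.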
Privacy and Control
Persistent memory raises immediate privacy concerns. Users must have control:
// User-facing memory management API
interface MemoryAPI {
  // View what the AI remembers about you
  listMemories(userId: string): Promise<MemoryFact[]>;

  // Delete a specific memory
  deleteMemory(userId: string, memoryId: string): Promise<void>;

  // Delete all memories
  clearAllMemories(userId: string): Promise<void>;

  // Opt out of memory entirely
  disableMemory(userId: string): Promise<void>;

  // Temporary amnesia (memories exist but aren't used)
  pauseMemory(userId: string): Promise<void>;
}
Automatic PII detection is also important. If a memory extraction contains what looks like a credit card number, SSN, or password, it should be automatically discarded before storage.
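A minimal screen of that kind can be pattern-based. These regexes are illustrative and intentionally conservative; a production system would use a dedicated PII detection service rather than hand-rolled patterns.

```typescript
// Discard memory candidates that look like they contain PII.

const PII_PATTERNS: RegExp[] = [
  /\b(?:\d[ -]?){13,16}\b/,     // credit-card-like digit runs
  /\b\d{3}-\d{2}-\d{4}\b/,      // US SSN format
  /\bpassword\s*[:=]\s*\S+/i,   // "password: hunter2"
];

function containsPII(content: string): boolean {
  return PII_PATTERNS.some((p) => p.test(content));
}

// A benign technical fact passes; a card number is blocked before storage.
const safe = containsPII("Uses PostgreSQL 15") === false;
const blocked = containsPII("My card is 4111 1111 1111 1111");
```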
The Results

Before persistent memory:

Support bot asked for account ID in 78% of conversations
Average conversation length: 8.2 messages
User satisfaction (thumbs up rate): 64%
After persistent memory:
Account ID asked in 12% of conversations (new users only)
Average conversation length: 5.1 messages
User satisfaction: 83%
The reduction in conversation length is the clearest signal. Users spend fewer messages providing context the bot should already know.
Edge Cases
Contradictory Memories
A user says "I use PostgreSQL" in January and "We switched to MySQL" in March. The memory system needs to handle this:
async function handleContradiction(
  newMemory: MemoryFact,
  existingMemory: MemoryFact
): Promise<void> {
  // If the new memory is more recent, supersede the old one
  if (newMemory.source.extractedAt > existingMemory.source.extractedAt) {
    await deleteMemory(existingMemory.id);
    await storeMemory(newMemory);
  }
}
We detect contradictions by checking if a new memory has high embedding similarity to an existing memory but different factual content. When detected, the newer memory wins.
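The detection check can be sketched with a plain cosine similarity over the stored embeddings. The 0.85 similarity threshold and the two-dimensional toy embeddings below are illustrative assumptions, not values from the article.

```typescript
// Contradiction resolution sketch: high embedding similarity plus
// different content means the facts compete, and the newer one wins.

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface Mem { content: string; embedding: number[]; extractedAt: Date }

function resolve(existing: Mem, incoming: Mem, threshold = 0.85): Mem {
  const similar =
    cosineSimilarity(existing.embedding, incoming.embedding) >= threshold;
  const differs = existing.content !== incoming.content;
  if (similar && differs && incoming.extractedAt > existing.extractedAt) {
    return incoming; // newer fact supersedes the old one
  }
  return existing;
}

const oldMem: Mem = {
  content: "Uses PostgreSQL",
  embedding: [0.9, 0.1],
  extractedAt: new Date("2026-01-01"),
};
const newMem: Mem = {
  content: "Uses MySQL",
  embedding: [0.88, 0.15], // near-identical topic, different fact
  extractedAt: new Date("2026-03-01"),
};
const winner = resolve(oldMem, newMem);
```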
Memory Hallucinations
The extraction LLM can occasionally "infer" facts that were not actually stated. A user asking about Redis does not mean they use Redis. We mitigate this by requiring high confidence scores and only extracting explicitly stated facts, not inferences.
Memory Bloat
Power users can accumulate thousands of memories. We cap at 500 active memories per user and use the decay function to prune the least relevant ones.
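The cap can be enforced with a simple score-and-slice pass. For brevity this sketch ranks only by recency of last access; a real pruning pass would use the full decay score.

```typescript
// Enforce the per-user memory cap: keep the highest-ranked memories,
// drop the rest.

const MAX_MEMORIES = 500;

interface Scored { id: string; lastAccessedAt: Date }

function prune(memories: Scored[], cap = MAX_MEMORIES): Scored[] {
  if (memories.length <= cap) return memories;
  return [...memories]
    .sort((a, b) => b.lastAccessedAt.getTime() - a.lastAccessedAt.getTime())
    .slice(0, cap);
}

// 600 memories, each last accessed one day apart: pruning keeps the 500 newest.
const many: Scored[] = Array.from({ length: 600 }, (_, i) => ({
  id: `m${i}`,
  lastAccessedAt: new Date(Date.UTC(2026, 0, 1) + i * 86_400_000),
}));
const keptMemories = prune(many);
```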
Key Takeaway
The difference between a useful AI assistant and an annoying one is memory. Stateless chatbots force users to repeat themselves endlessly. Persistent memory, built on vector search with proper scoping and decay, makes AI feel like it actually knows you.
The architecture is not complicated: extract facts, embed them, retrieve what is relevant, forget what is not. The hard part is the scoping, privacy, and contradiction handling. Get those right and your bot stops being a goldfish.