Why Your AI Chatbot Forgets Everything (And How to Fix It)
Architecture of persistent user memory using vector search for LLM applications. Embedding strategies, retrieval patterns, memory decay, and per-user scoping.
Transactional Team
Feb 26, 2026
The Goldfish Problem
Every AI chatbot has the same infuriating limitation. Tell it something in conversation one, and by conversation two it has completely forgotten. Your preferences. Your context. Your name. Gone.
This is not a model limitation. GPT-4, Claude, Gemini -- they can all maintain context brilliantly within a single conversation. The problem is architectural. When the conversation ends, the context window is discarded. The next conversation starts from zero.
Persistent user memory solves this problem. A user who has told a support bot their account ID, their tech stack, and their deployment environment should not have to repeat that information ever again.
Here is the architecture.
Memory Architecture Key Metrics
- User satisfaction (after): 83%
- Avg. messages per conversation: 5.1
- Max memories per user: 500
- Min confidence threshold: 0.7
Why Session-Based Memory Fails
The naive approach to memory is simple: store the conversation history and include it in the next conversation's context.
System Prompt + Previous Conversations + Current Message → LLM
This falls apart immediately:
Context window limits. A user with 50 previous conversations has hundreds of messages. You cannot fit them all into a context window. Even if you could, the cost would be absurd.
Relevance decay. A conversation from three months ago about a billing issue is not relevant to today's technical question. Including it wastes tokens and can confuse the model.
Information density. Conversations are verbose. The user's preference for Python over JavaScript is buried in a 20-message conversation about SDK setup. Retrieving the full conversation to get that one fact is wildly inefficient.
Cross-conversation synthesis. The user mentioned their company uses Kubernetes in conversation 3, and asked about webhook reliability in conversation 12. These facts should be connected -- but session-based memory treats each conversation as an isolated document.
The Architecture: Vector-Based Memory
Our memory system extracts, embeds, and retrieves facts -- not conversations. Here is the flow:
User Message
|
v
Memory Retrieval ──── Search vector DB for relevant memories
| using the current message as query
v
Context Assembly ──── System Prompt + Retrieved Memories
| + Conversation History + User Message
v
LLM Response
|
v
Memory Extraction ──── Extract new facts from the conversation
| and store as vector embeddings
v
Memory Store ────────── Vector DB (scoped per user)
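The flow above can be sketched end to end. This is a minimal, synchronous stand-in: the function names (`retrieveMemories`, `extractFacts`, `handleMessage`) and the substring-based "extraction" are illustrative only; a real system would use embedding similarity for retrieval and an LLM call for extraction.

```typescript
// End-to-end sketch of the pipeline: retrieve -> assemble -> respond -> extract -> store.
// All names here are illustrative, not a real API.

type Memory = { userId: string; content: string };

const memoryStore: Memory[] = [];

// Stand-in retrieval: real systems use vector similarity, not a scan.
function retrieveMemories(userId: string, _message: string): Memory[] {
  return memoryStore.filter((m) => m.userId === userId); // per-user scoping
}

function assembleContext(memories: Memory[], message: string): string {
  const known = memories.map((m) => `- ${m.content}`).join("\n");
  return `What you know about this user:\n${known}\n\nUser: ${message}`;
}

// Stand-in extraction: a real system makes an LLM call here.
function extractFacts(userId: string, message: string): Memory[] {
  if (message.includes("I use Python")) {
    return [{ userId, content: "Uses Python" }];
  }
  return [];
}

function handleMessage(userId: string, message: string): string {
  const memories = retrieveMemories(userId, message);
  const context = assembleContext(memories, message);
  const response = `(LLM response given ${context.length} chars of context)`;
  memoryStore.push(...extractFacts(userId, message)); // persist new facts
  return response;
}

handleMessage("u1", "I use Python for my backend");
// A later conversation now retrieves the stored fact.
const recalled = retrieveMemories("u1", "unrelated follow-up question");
```

The key property to preserve in any real implementation is the same: extraction writes to the store *after* the response, so a fact learned in one conversation is available to the next.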
Memory Extraction
After each conversation turn, we extract discrete facts from the exchange:
interface MemoryFact {
  id: string;
  userId: string;
  organizationId: string;
  content: string;        // The fact as a natural language statement
  embedding: number[];    // Vector embedding of the content
  source: {
    conversationId: string;
    messageId: string;
    extractedAt: Date;
  };
  category: MemoryCategory;
  confidence: number;     // How confident we are this is a real fact
  lastAccessedAt: Date;   // For memory decay
  accessCount: number;    // How often this memory has been retrieved
}

enum MemoryCategory {
  PREFERENCE = "preference",   // "Prefers Python over JavaScript"
  CONTEXT = "context",         // "Uses Kubernetes for deployment"
  IDENTITY = "identity",       // "Works at Acme Corp"
  HISTORY = "history",         // "Had a billing issue in January"
  TECHNICAL = "technical",     // "Running PostgreSQL 15"
  BEHAVIORAL = "behavioral",   // "Prefers concise responses"
}
The extraction itself uses an LLM call with a specialized prompt:
const extractionPrompt = `Analyze this conversation turn and extract discrete, factual
statements about the user. Each fact should be:
- Self-contained (understandable without conversation context)
- Specific (not vague generalizations)
- Useful for future conversations

Conversation:
User: ${userMessage}
Assistant: ${assistantResponse}

Previously known facts about this user:
${existingMemories.map(m => `- ${m.content}`).join('\n')}

Extract new facts not already captured above.
Return as JSON array of { content: string, category: string, confidence: number }.
Do not re-extract facts that are already known.`;
The confidence field filters out uncertain extractions. "The user uses Python" (high confidence, directly stated) versus "the user might be interested in machine learning" (low confidence, inferred) -- we only store facts above 0.7 confidence.
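The post-processing step this implies can be sketched as follows. `parseExtraction` is a hypothetical helper; the raw JSON string stands in for an actual LLM response, and a real system would also validate the category field.

```typescript
// Parse the extraction LLM's JSON output and apply the 0.7 confidence floor.

interface ExtractedFact {
  content: string;
  category: string;
  confidence: number;
}

const MIN_CONFIDENCE = 0.7;

function parseExtraction(raw: string): ExtractedFact[] {
  let facts: ExtractedFact[];
  try {
    facts = JSON.parse(raw);
  } catch {
    return []; // malformed LLM output: store nothing rather than garbage
  }
  // Keep only high-confidence, directly stated facts.
  return facts.filter((f) => f.confidence >= MIN_CONFIDENCE);
}

const raw = JSON.stringify([
  { content: "Uses Python", category: "technical", confidence: 0.95 },
  { content: "Might be interested in ML", category: "context", confidence: 0.4 },
]);

const kept = parseExtraction(raw);
// Only the directly stated Python fact survives the filter.
```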
Memory Retrieval
When a new message arrives, we search the user's memory store for relevant context:
async function retrieveRelevantMemories(
  userId: string,
  currentMessage: string,
  limit: number = 10
): Promise<MemoryFact[]> {
  // Embed the current message
  const queryEmbedding = await embed(currentMessage);

  // Search for similar memories
  const results = await vectorDb.search({
    vector: queryEmbedding,
    topK: limit,
    filter: { userId },
    minScore: 0.7,
  });

  // Update access timestamps
  await updateAccessTimestamps(results.map(r => r.id));

  return results.map(r => r.metadata as MemoryFact);
}
We retrieve the top 10 most relevant memories and include them in the system prompt:
You are a support assistant for Transactional.
What you know about this user:
- Works at Acme Corp as a backend engineer
- Uses Python with FastAPI for their API
- Deploys on Kubernetes (GKE)
- Running PostgreSQL 15
- Had a webhook reliability issue in January (resolved)
- Prefers concise, technical responses
Use this context naturally. Do not explicitly reference
"your memory" unless the user asks.
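Assembling that prompt from retrieved memories is straightforward string construction. `buildSystemPrompt` is an illustrative helper, not part of any real SDK:

```typescript
// Turn a list of retrieved memory contents into the system prompt shown above.

function buildSystemPrompt(memories: string[]): string {
  const lines = [
    "You are a support assistant for Transactional.",
    "",
    "What you know about this user:",
    ...memories.map((m) => `- ${m}`),
    "",
    "Use this context naturally. Do not explicitly reference",
    '"your memory" unless the user asks.',
  ];
  return lines.join("\n");
}

const prompt = buildSystemPrompt([
  "Works at Acme Corp as a backend engineer",
  "Uses Python with FastAPI for their API",
]);
```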
Memory Decay
Not all memories are equally valuable over time. A preference stated yesterday is more relevant than a technical detail from six months ago. We implement memory decay using a combination of recency and access frequency.
Memories that are never accessed gradually fade. Memories that are frequently relevant stay prominent. This mimics how human memory works -- the details you use often stay sharp, while rarely accessed information fades.
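One way to combine recency and access frequency into a single score is an exponential recency term damped less for frequently used memories. The half-life and log weighting below are illustrative assumptions; the article does not pin down exact constants.

```typescript
// Sketch of a decay score: recency halves every HALF_LIFE_DAYS without
// access, and frequent retrieval slows the fade (log damping).

const HALF_LIFE_DAYS = 30;

function decayScore(
  lastAccessedAt: Date,
  accessCount: number,
  now: Date = new Date()
): number {
  const ageDays =
    (now.getTime() - lastAccessedAt.getTime()) / (1000 * 60 * 60 * 24);
  const recency = Math.pow(0.5, ageDays / HALF_LIFE_DAYS);
  const frequency = Math.log2(1 + accessCount);
  return recency * (1 + frequency);
}

const now = new Date("2026-02-26");
// Accessed yesterday, retrieved 5 times: stays prominent.
const fresh = decayScore(new Date("2026-02-25"), 5, now);
// Never accessed in six months: fades toward zero.
const stale = decayScore(new Date("2025-08-26"), 0, now);
```

Memories whose score falls below a floor become candidates for pruning.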
Memory Consolidation
Over time, a user accumulates redundant or overlapping memories. We run periodic consolidation:
async function consolidateMemories(userId: string): Promise<void> {
  const allMemories = await getMemoriesForUser(userId);

  // Find clusters of similar memories
  const clusters = await clusterByEmbedding(allMemories, { threshold: 0.9 });

  for (const cluster of clusters) {
    if (cluster.length <= 1) continue;

    // Merge cluster into a single, updated memory
    const merged = await mergeMemoryCluster(cluster);

    // Replace cluster with merged memory
    await deleteMemories(cluster.map(m => m.id));
    await storeMemory(merged);
  }
}
For example, three separate memories -- "Uses Python", "Prefers Python over JavaScript", "Backend is in Python with FastAPI" -- consolidate into "Uses Python with FastAPI for backend development, prefers Python over JavaScript."
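The merge step itself is another LLM call. Since the article leaves the merge model unspecified, this sketch only shows the prompt construction; `buildMergePrompt` is a hypothetical helper.

```typescript
// Build the prompt a consolidation pass would send to an LLM to merge a
// cluster of overlapping facts into one self-contained statement.

function buildMergePrompt(contents: string[]): string {
  return [
    "Merge these overlapping facts about a user into a single,",
    "self-contained statement. Preserve all specifics; drop duplication.",
    "",
    ...contents.map((c) => `- ${c}`),
  ].join("\n");
}

const mergePrompt = buildMergePrompt([
  "Uses Python",
  "Prefers Python over JavaScript",
  "Backend is in Python with FastAPI",
]);
```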
Scoping: Who Remembers What
Memory scoping determines the boundaries of what gets remembered and for whom:
Organization Level
└── Bot Level (support bot vs. sales bot)
└── User Level (individual user)
└── Conversation Level (session context)
Organization scoping means memories never leak between tenants. A user's data in Organization A is invisible to Organization B, even if it is the same user email.
Bot scoping means the support bot and the sales bot maintain separate memory stores. The support bot remembers your technical issues. The sales bot remembers your pricing discussions. They do not cross-contaminate.
User scoping is the primary boundary. Each user has their own memory store that persists across conversations.
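In practice, the hierarchy is enforced at query time: every lookup filters on all three IDs, so a memory outside the caller's organization, bot, or user scope is simply never returned. The field names below are illustrative.

```typescript
// Scope enforcement sketch: memories are only visible when organization,
// bot, and user IDs all match the caller's scope.

interface ScopedMemory {
  organizationId: string;
  botId: string;
  userId: string;
  content: string;
}

function scopeFilter(orgId: string, botId: string, userId: string) {
  return (m: ScopedMemory) =>
    m.organizationId === orgId && m.botId === botId && m.userId === userId;
}

const store: ScopedMemory[] = [
  { organizationId: "org-a", botId: "support", userId: "u1", content: "Uses K8s" },
  { organizationId: "org-b", botId: "support", userId: "u1", content: "Other tenant" },
  { organizationId: "org-a", botId: "sales", userId: "u1", content: "Pricing talk" },
];

// Same user, different org or bot: invisible.
const visible = store.filter(scopeFilter("org-a", "support", "u1"));
```

In a real vector database this filter would be pushed down into the search call (as in the `filter` argument of `retrieveRelevantMemories` above), not applied client-side.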
Privacy and Control
Persistent memory raises immediate privacy concerns. Users must have control:
// User-facing memory management API
interface MemoryAPI {
  // View what the AI remembers about you
  listMemories(userId: string): Promise<MemoryFact[]>;

  // Delete a specific memory
  deleteMemory(userId: string, memoryId: string): Promise<void>;

  // Delete all memories
  clearAllMemories(userId: string): Promise<void>;

  // Opt out of memory entirely
  disableMemory(userId: string): Promise<void>;

  // Temporary amnesia (memories exist but aren't used)
  pauseMemory(userId: string): Promise<void>;
}
Automatic PII detection is also important. If a memory extraction contains what looks like a credit card number, SSN, or password, it should be automatically discarded before storage.
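A minimal screen of that kind can be pattern-based. These regexes are illustrative and intentionally conservative; a production system would use a dedicated PII detection service rather than hand-rolled patterns.

```typescript
// Discard memory candidates that look like they contain PII.

const PII_PATTERNS: RegExp[] = [
  /\b(?:\d[ -]?){13,16}\b/,     // credit-card-like digit runs
  /\b\d{3}-\d{2}-\d{4}\b/,      // US SSN format
  /\bpassword\s*[:=]\s*\S+/i,   // "password: hunter2"
];

function containsPII(content: string): boolean {
  return PII_PATTERNS.some((p) => p.test(content));
}

// A benign technical fact passes; a card number is blocked before storage.
const safe = containsPII("Uses PostgreSQL 15") === false;
const blocked = containsPII("My card is 4111 1111 1111 1111");
```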
The Results

Before persistent memory:

Support bot asked for account ID in 78% of conversations
Average conversation length: 8.2 messages
User satisfaction (thumbs up rate): 64%
After persistent memory:
Account ID asked in 12% of conversations (new users only)
Average conversation length: 5.1 messages
User satisfaction: 83%
The reduction in conversation length is the clearest signal. Users spend fewer messages providing context the bot should already know.
Edge Cases
Contradictory Memories
A user says "I use PostgreSQL" in January and "We switched to MySQL" in March. The memory system needs to handle this:
async function handleContradiction(
  newMemory: MemoryFact,
  existingMemory: MemoryFact
): Promise<void> {
  // If the new memory is more recent, supersede the old one
  if (newMemory.source.extractedAt > existingMemory.source.extractedAt) {
    await deleteMemory(existingMemory.id);
    await storeMemory(newMemory);
  }
}
We detect contradictions by checking if a new memory has high embedding similarity to an existing memory but different factual content. When detected, the newer memory wins.
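The detection check can be sketched with a plain cosine similarity over the stored embeddings. The 0.85 similarity threshold and the two-dimensional toy embeddings below are illustrative assumptions, not values from the article.

```typescript
// Contradiction resolution sketch: high embedding similarity plus
// different content means the facts compete, and the newer one wins.

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface Mem { content: string; embedding: number[]; extractedAt: Date }

function resolve(existing: Mem, incoming: Mem, threshold = 0.85): Mem {
  const similar =
    cosineSimilarity(existing.embedding, incoming.embedding) >= threshold;
  const differs = existing.content !== incoming.content;
  if (similar && differs && incoming.extractedAt > existing.extractedAt) {
    return incoming; // newer fact supersedes the old one
  }
  return existing;
}

const oldMem: Mem = {
  content: "Uses PostgreSQL",
  embedding: [0.9, 0.1],
  extractedAt: new Date("2026-01-01"),
};
const newMem: Mem = {
  content: "Uses MySQL",
  embedding: [0.88, 0.15], // near-identical topic, different fact
  extractedAt: new Date("2026-03-01"),
};
const winner = resolve(oldMem, newMem);
```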
Memory Hallucinations
The extraction LLM can occasionally "infer" facts that were not actually stated. A user asking about Redis does not mean they use Redis. We mitigate this by requiring high confidence scores and only extracting explicitly stated facts, not inferences.
Memory Bloat
Power users can accumulate thousands of memories. We cap at 500 active memories per user and use the decay function to prune the least relevant ones.
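The cap can be enforced with a simple score-and-slice pass. For brevity this sketch ranks only by recency of last access; a real pruning pass would use the full decay score.

```typescript
// Enforce the per-user memory cap: keep the highest-ranked memories,
// drop the rest.

const MAX_MEMORIES = 500;

interface Scored { id: string; lastAccessedAt: Date }

function prune(memories: Scored[], cap = MAX_MEMORIES): Scored[] {
  if (memories.length <= cap) return memories;
  return [...memories]
    .sort((a, b) => b.lastAccessedAt.getTime() - a.lastAccessedAt.getTime())
    .slice(0, cap);
}

// 600 memories, each last accessed one day apart: pruning keeps the 500 newest.
const many: Scored[] = Array.from({ length: 600 }, (_, i) => ({
  id: `m${i}`,
  lastAccessedAt: new Date(Date.UTC(2026, 0, 1) + i * 86_400_000),
}));
const keptMemories = prune(many);
```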
Key Takeaway
The difference between a useful AI assistant and an annoying one is memory. Stateless chatbots force users to repeat themselves endlessly. Persistent memory, built on vector search with proper scoping and decay, makes AI feel like it actually knows you.
The architecture is not complicated: extract facts, embed them, retrieve what is relevant, forget what is not. The hard part is the scoping, privacy, and contradiction handling. Get those right and your bot stops being a goldfish.