
How Semantic Caching Can Cut Your LLM Costs by 60%

A practical guide to implementing semantic caching with vector embeddings to reduce LLM API costs. Covers architecture, similarity thresholds, cache invalidation, and production considerations.

Transactional Team
Feb 24, 2026
9 min read

The Math Behind LLM Costs

Consider a typical LLM-powered application processing 800,000 requests per month at an average cost of $0.015 per request. That is a $12,000 monthly bill -- and it climbs with every new feature.

The standard engineering response is "cache it." Exact-match caching helps, but barely. It might bring the bill down to $11,400 -- a 5% improvement. Hardly worth celebrating.

The problem is that LLM requests are almost never identical. "What is your refund policy?" and "What's the refund policy?" and "How do I get a refund?" are all the same question. Exact-match caching sees three distinct requests. Semantic caching sees one.

With a well-tuned semantic cache, you can achieve a 60%+ hit rate on repetitive workloads. That means only ~40% of requests actually hit the LLM provider: 320,000 requests at $0.015 = roughly $4,800/month plus ~$130 in embedding and vector DB overhead. That is a 60% cost reduction from a single infrastructure change.

LLM Cost Impact: Exact-Match vs Semantic Caching

                                Exact-Match Only    With Semantic Cache
Effective Cache Hit Rate        ~5%                 ~60%
Monthly LLM Spend (800K reqs)   ~$11,400            ~$4,900
Avg Response Latency            2.3s                ~1.0s (hits: 50ms)
Infrastructure Overhead         ~$0                 ~$130/mo

Why Exact-Match Caching Fails for LLMs

Traditional caching works by hashing the request and checking if that exact hash exists in the cache. For LLM requests, this requires the exact same model, messages, temperature, and all other parameters.

The hit rate is low because:

  1. Natural language variation. Humans phrase the same question differently every time.
  2. Conversation context. Even identical questions have different conversation histories.
  3. Whitespace and formatting. A trailing space creates a different hash.
  4. Parameter sensitivity. Changing temperature from 0.7 to 0.71 is a cache miss.

Exact-match hit rates vary dramatically by use case. Automated classification pipelines with fixed prompts can see 20%+ hit rates, while support bots handling free-form user input typically see low single digits. Content generation and code assistance are nearly zero.

The only scenario where exact matching works well is automated pipelines that send identical requests. For anything involving human input, it is nearly useless.
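For contrast, exact-match keying in its entirety is just a hash of the serialized request. A minimal sketch (the request shape here is illustrative) shows why it is so brittle:

import { createHash } from "node:crypto";

// Exact-match cache key: hash the serialized request. Any byte-level
// difference -- a trailing space, temperature 0.71 vs 0.7 -- yields a
// different key and therefore a cache miss.
function exactCacheKey(request: {
  model: string;
  messages: { role: string; content: string }[];
  temperature: number;
}): string {
  return createHash("sha256").update(JSON.stringify(request)).digest("hex");
}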

How Semantic Caching Works

Semantic caching replaces the hash comparison with a similarity comparison. Instead of asking "is this the exact same request?" it asks "is this request similar enough to a cached request that the cached response would be appropriate?"

The architecture has three components:

Incoming Request
       |
       v
  Embed the prompt ──── Generate vector embedding
       |                 of the user's message
       v
  Vector Search ──────── Find nearest neighbors
       |                 in the cache
       |
  ┌────┴────┐
  │  Score  │
  │ > 0.95? │
  └──┬───┬──┘
     │   │
    Yes  No
     │   │
     v   v
 Return  Forward to
 Cached  LLM Provider
 Response    │
             v
        Store response
        + embedding
        in cache

Step 1: Embedding Generation

When a request arrives, generate a vector embedding of the user's message (not the full request including system prompt -- just the user-facing content that varies between requests).

interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

async function getRequestEmbedding(messages: Message[]): Promise<number[]> {
  // Extract user messages only -- system prompts are constant
  const userContent = messages
    .filter(m => m.role === "user")
    .map(m => m.content)
    .join("\n");

  const response = await embeddingModel.embed(userContent);
  return response.embedding;  // e.g. a 1536-dimensional vector
}

A lightweight embedding model costs roughly $0.0001 per request -- negligible compared to the LLM call it might save.

Step 2: Vector Similarity Search

The embedding is compared against all cached embeddings using cosine similarity. Use a vector database for this lookup:

interface CachedResponse {
  response: string;
  similarity: number;
  originalPrompt: string;
  cachedAt: string;
}

// `vectorDb` stands in for any vector database client that supports
// filtered nearest-neighbor search (Pinecone, Qdrant, pgvector, etc.).
async function findSimilarCachedRequest(
  embedding: number[],
  threshold: number,
  scope: CacheScope
): Promise<CachedResponse | null> {
  const results = await vectorDb.search({
    vector: embedding,
    topK: 1,
    filter: {
      organizationId: scope.organizationId,
      model: scope.model,
      systemPromptHash: scope.systemPromptHash,
    },
    minScore: threshold,
  });

  if (results.length === 0) return null;

  return {
    response: results[0].metadata.response,
    similarity: results[0].score,
    originalPrompt: results[0].metadata.prompt,
    cachedAt: results[0].metadata.cachedAt,
  };
}
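For intuition, this is the score the vector database computes for each candidate. A sketch -- production databases use approximate nearest-neighbor indexes rather than a linear scan:

// Cosine similarity: 1.0 for identical direction, near 0 for unrelated vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}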

Step 3: Threshold Decision

The similarity threshold determines the trade-off between cache hit rate and response accuracy. This is the most important tunable parameter.

Threshold 0.99: Very conservative. Only near-identical phrasings match.
                 Hit rate: ~15%. Almost no incorrect cache hits.

Threshold 0.95: Balanced. Different phrasings of the same question match.
                 Hit rate: ~45%. Rare incorrect cache hits.

Threshold 0.90: Aggressive. Related but not identical questions may match.
                 Hit rate: ~65%. Some incorrect cache hits.

Threshold 0.85: Too aggressive. Unrelated questions start matching.
                 Hit rate: ~80%. Frequent incorrect cache hits.

A good default is 0.95, adjustable per use case. Classification pipelines can safely go lower (0.92). Creative generation should go higher (0.98) or disable semantic caching entirely.
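Putting the three steps together, the full lookup path is a short function. This is a sketch: `callLlm` and `storeInCache` are hypothetical helpers standing in for your provider client and cache writer, and `CacheScope` is defined in the next section:

async function handleRequest(
  messages: Message[],
  scope: CacheScope,
  threshold = 0.95
): Promise<string> {
  const embedding = await getRequestEmbedding(messages);                      // Step 1
  const cached = await findSimilarCachedRequest(embedding, threshold, scope); // Step 2

  if (cached) return cached.response;               // Step 3: hit above threshold

  const response = await callLlm(messages);         // miss: pay for the LLM call
  await storeInCache(embedding, messages, response, scope);
  return response;
}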

Cache Scoping

A cache hit must match on more than just semantic similarity. Scope caches along several dimensions:

interface CacheScope {
  organizationId: string;     // Never cross tenant boundaries
  model: string;              // GPT-4 and GPT-4o-mini give different answers
  systemPromptHash: string;   // Different system prompts = different behavior
  temperature: number;        // Only cache deterministic requests (temp <= 0.1)
}

Temperature gating is critical. If temperature is above 0.1, skip semantic caching entirely. High-temperature requests are intentionally non-deterministic -- caching them defeats their purpose.

Tenant isolation is non-negotiable in multi-tenant systems. One organization's cache should never serve another's requests, even if the questions are identical. Different organizations may have different knowledge bases, policies, and contexts.
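Both rules can be enforced before the cache is ever consulted. A minimal sketch, where the request shape is illustrative and returning null means "do not cache":

import { createHash } from "node:crypto";

function buildCacheScope(req: {
  organizationId: string;
  model: string;
  temperature: number;
  messages: Message[];
}): CacheScope | null {
  // Temperature gate: non-deterministic requests bypass the cache entirely.
  if (req.temperature > 0.1) return null;

  const systemPrompt = req.messages
    .filter(m => m.role === "system")
    .map(m => m.content)
    .join("\n");

  return {
    organizationId: req.organizationId, // tenant isolation
    model: req.model,
    systemPromptHash: createHash("sha256").update(systemPrompt).digest("hex"),
    temperature: req.temperature,
  };
}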

Cache Invalidation

LLM caches have unique invalidation challenges:

Time-Based Expiration

Set a default TTL of 24 hours. This handles cases where the underlying data changes (product updates, policy changes, price adjustments).

interface CacheEntry { cachedAt: Date; response: string }

const DEFAULT_CACHE_TTL = 24 * 60 * 60 * 1000; // 24 hours

function isCacheValid(entry: CacheEntry): boolean {
  return Date.now() - entry.cachedAt.getTime() < DEFAULT_CACHE_TTL;
}

Event-Based Invalidation

When a knowledge base is updated, invalidate all cache entries for that organization. This prevents stale answers to questions about content that has changed.
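A minimal sketch, assuming the vector database client exposes a delete-by-filter operation alongside the search call used earlier:

// Drop every cached entry for one tenant when its knowledge base changes.
async function invalidateOrgCache(organizationId: string): Promise<void> {
  await vectorDb.delete({ filter: { organizationId } });
}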

Feedback-Based Invalidation

If a user rates a cached response negatively (thumbs down, reported as wrong), invalidate that specific cache entry. This is a powerful self-correcting mechanism.
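Assuming each cache write returns an entry ID that travels with the served response, the hook is equally small:

// Remove one specific entry when a user flags its cached response as wrong.
async function handleNegativeFeedback(cacheEntryId: string): Promise<void> {
  await vectorDb.delete({ ids: [cacheEntryId] });
}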

What to Expect

The actual savings depend entirely on your workload. Applications with high query repetition benefit the most:

Workload Type             Expected Semantic Hit Rate    Estimated Cost Savings
FAQ / Support bot         50-70%                        40-60%
Classification pipeline   60-80%                        50-70%
Code assistance           10-20%                        5-15%
Content generation        5-15%                         3-10%
Document summarization    15-30%                        10-25%

Latency improves as well. Cached responses return in under 50ms, versus 2-3 seconds for a round trip to the LLM provider. For support bots and FAQ systems, this makes the AI feel instant.

Edge Cases and Pitfalls

The Synonym Trap

"Cancel my subscription" and "delete my account" have high semantic similarity but are different actions with different consequences. Handle this by including the action context (not just the question) in the embedding:

// Instead of embedding only the user message:
//   const content = userMessage;
// embed the message together with its detected action context:
const content = `[intent: ${detectedIntent}] ${userMessage}`;

The Temporal Trap

"What time is it?" should never be cached. "What are your business hours?" can be cached. Maintain a list of temporal indicators that bypass caching:

const temporalPatterns = [
  /what time/i, /right now/i, /currently/i,
  /today's date/i, /this moment/i,
];
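The bypass check itself is a one-liner:

// Skip the semantic cache when the prompt looks time-sensitive.
function isTemporal(prompt: string): boolean {
  return temporalPatterns.some(pattern => pattern.test(prompt));
}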

The Multi-Turn Trap

In conversations, the same question in different contexts should get different answers. "Tell me more" after "Explain Kubernetes" is different from "Tell me more" after "Explain DNS." Include the last N turns of conversation context in the embedding to differentiate these cases.
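A minimal sketch of that idea, where `CONTEXT_TURNS` is an assumption to tune per workload:

const CONTEXT_TURNS = 3; // assumption: how much history disambiguates a follow-up

// Embed the last few turns plus the current message, so "Tell me more"
// produces different vectors in different conversations.
function embeddingInput(messages: Message[]): string {
  return messages
    .filter(m => m.role !== "system")
    .slice(-CONTEXT_TURNS)
    .map(m => `${m.role}: ${m.content}`)
    .join("\n");
}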

Implementation Recommendations

Start with exact-match caching. It is simpler and handles the easy cases. Add semantic caching once your traffic data shows the exact-match hit rate is low.

Monitor incorrect cache hits obsessively. Even a low incorrect rate adds up at scale. Track user feedback on cached responses specifically.

Let teams set their own thresholds. The right threshold depends entirely on the use case. Provide a sensible default but make it configurable.

Do not cache streaming responses. Cache the final assembled response and serve it as a complete response. Simulating streaming from cache adds complexity for minimal benefit.

Key Takeaway

If your LLM application handles any kind of repetitive query pattern -- support, FAQ, classification, analysis -- you are paying full price for answers you have already generated. Semantic caching is not a premature optimization. At $0.01-0.03 per request, it pays for itself quickly.

The embedding and vector search overhead is negligible compared to the LLM calls you avoid. And the latency improvement for cached responses makes your application feel fundamentally faster.

Explore semantic caching with AI Gateway.

Written by

Transactional Team

Tags:
ai
caching
cost-optimization
