How Semantic Caching Can Cut Your LLM Costs by 60%
A practical guide to implementing semantic caching with vector embeddings to reduce LLM API costs. Covers architecture, similarity thresholds, cache invalidation, and production considerations.
Transactional Team
Feb 24, 2026
The Math Behind LLM Costs
Consider a typical LLM-powered application processing 800,000 requests per month at an average cost of $0.015 per request. That is a $12,000 monthly bill -- and it climbs with every new feature.
The standard engineering response is "cache it." Exact-match caching helps, but barely. It might bring the bill down to $11,400 -- a 5% improvement. Hardly worth celebrating.
The problem is that LLM requests are almost never identical. "What is your refund policy?" and "What's the refund policy?" and "How do I get a refund?" are all the same question. Exact-match caching sees three distinct requests. Semantic caching sees one.
With a well-tuned semantic cache, you can achieve a 60%+ hit rate on repetitive workloads. That means only ~40% of requests actually hit the LLM provider: 320,000 requests at $0.015 = roughly $4,800/month plus ~$130 in embedding and vector DB overhead. That is a 60% cost reduction from a single infrastructure change.
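As a sanity check, the arithmetic behind those figures can be written out directly (the constants below simply mirror the numbers above):

```typescript
// Cost model reproducing the figures above.
const REQUESTS_PER_MONTH = 800_000;
const COST_PER_REQUEST = 0.015;  // USD per LLM call
const SEMANTIC_HIT_RATE = 0.60;  // well-tuned semantic cache
const OVERHEAD = 130;            // embeddings + vector DB, USD/month

const baseline = REQUESTS_PER_MONTH * COST_PER_REQUEST;       // $12,000/mo
const llmSpend =
  REQUESTS_PER_MONTH * (1 - SEMANTIC_HIT_RATE) * COST_PER_REQUEST; // $4,800/mo
const total = llmSpend + OVERHEAD;                            // $4,930/mo
const savings = 1 - total / baseline;                         // ~59%
```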
LLM Cost Impact: Exact-Match vs Semantic Caching
| | Exact-Match Only | With Semantic Cache |
| --- | --- | --- |
| Effective Cache Hit Rate | ~5% | ~60% |
| Monthly LLM Spend (800K reqs) | ~$11,400 | ~$4,800 |
| Avg Response Latency | 2.3s | ~1.0s (hits: ~50ms) |
| Infrastructure Overhead | ~$0 | ~$130/mo |
Why Exact-Match Caching Fails for LLMs
Traditional caching works by hashing the request and checking if that exact hash exists in the cache. For LLM requests, this requires the exact same model, messages, temperature, and all other parameters.
The hit rate is low because:
Natural language variation. Humans phrase the same question differently every time.
Conversation context. Even identical questions have different conversation histories.
Whitespace and formatting. A trailing space creates a different hash.
Parameter sensitivity. Changing temperature from 0.7 to 0.71 is a cache miss.
Exact-match hit rates vary dramatically by use case. Automated classification pipelines with fixed prompts can see 20%+ hit rates, while support bots handling free-form user input typically see low single digits. Content generation and code assistance are nearly zero.
The only scenario where exact matching works well is automated pipelines that send identical requests. For anything involving human input, it is nearly useless.
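For contrast, a conventional exact-match key is just a hash of the serialized request. The sketch below (types and names are illustrative) shows why two phrasings of the same question can never collide:

```typescript
import { createHash } from "node:crypto";

// Illustrative request shape -- real providers have more parameters,
// every one of which becomes part of the hash.
interface LlmRequest {
  model: string;
  messages: { role: string; content: string }[];
  temperature: number;
}

// Exact-match cache key: any change to any field -- even a trailing
// space or temperature 0.7 vs 0.71 -- produces a different key.
function exactMatchKey(req: LlmRequest): string {
  return createHash("sha256").update(JSON.stringify(req)).digest("hex");
}

const a = exactMatchKey({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "What is your refund policy?" }],
  temperature: 0.7,
});
const b = exactMatchKey({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "What's the refund policy?" }],
  temperature: 0.7,
});
// a !== b: semantically identical questions are cache misses
```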
How Semantic Caching Works
Semantic caching replaces the hash comparison with a similarity comparison. Instead of asking "is this the exact same request?" it asks "is this request similar enough to a cached request that the cached response would be appropriate?"
The architecture has three components:
```
Incoming Request
       |
       v
 Embed the prompt ──── generate a vector embedding
       |               of the user's message
       v
 Vector Search ─────── find nearest neighbors
       |               in the cache
       v
  Score > 0.95?
   |         |
  Yes        No
   |         |
   v         v
 Return    Forward to
 cached    LLM provider
 response     |
              v
        Store response
        + embedding
        in cache
```
Step 1: Embedding Generation
When a request arrives, generate a vector embedding of the user's message (not the full request including system prompt -- just the user-facing content that varies between requests).
```typescript
async function getRequestEmbedding(messages: Message[]): Promise<number[]> {
  // Extract user messages only -- system prompts are constant
  const userContent = messages
    .filter(m => m.role === "user")
    .map(m => m.content)
    .join("\n");

  const response = await embeddingModel.embed(userContent);
  return response.embedding; // 1536-dimensional vector
}
```
A lightweight embedding model costs roughly $0.0001 per request -- negligible compared to the LLM call it might save.
Step 2: Vector Similarity Search
The embedding is compared against cached embeddings using cosine similarity. Use a vector database for this lookup.
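Conceptually, the lookup reduces to a nearest-neighbor search over embeddings. Here is a minimal in-memory sketch; a production system would use a vector database with an approximate-nearest-neighbor index rather than a linear scan:

```typescript
// Illustrative cache entry shape.
interface CacheEntry {
  embedding: number[];
  response: string;
}

// Cosine similarity: dot product of the vectors divided by the
// product of their magnitudes. Ranges from -1 to 1.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the best-scoring entry at or above the threshold, or null.
function lookup(
  query: number[],
  cache: CacheEntry[],
  threshold = 0.95,
): CacheEntry | null {
  let best: CacheEntry | null = null;
  let bestScore = threshold;
  for (const entry of cache) {
    const score = cosineSimilarity(query, entry.embedding);
    if (score >= bestScore) {
      bestScore = score;
      best = entry;
    }
  }
  return best;
}
```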
The similarity threshold determines the trade-off between cache hit rate and response accuracy. This is the most important tunable parameter.
Threshold 0.99: Very conservative. Only near-identical phrasings match.
Hit rate: ~15%. Almost no incorrect cache hits.
Threshold 0.95: Balanced. Different phrasings of the same question match.
Hit rate: ~45%. Rare incorrect cache hits.
Threshold 0.90: Aggressive. Related but not identical questions may match.
Hit rate: ~65%. Some incorrect cache hits.
Threshold 0.85: Too aggressive. Unrelated questions start matching.
Hit rate: ~80%. Frequent incorrect cache hits.
A good default is 0.95, adjustable per use case. Classification pipelines can safely go lower (0.92). Creative generation should go higher (0.98) or disable semantic caching entirely.
Cache Scoping
A cache hit must match on more than just semantic similarity. Scope caches along several dimensions:
```typescript
interface CacheScope {
  organizationId: string;   // Never cross tenant boundaries
  model: string;            // GPT-4 and GPT-4o-mini give different answers
  systemPromptHash: string; // Different system prompts = different behavior
  temperature: number;      // Only cache deterministic requests (temp <= 0.1)
}
```
Temperature gating is critical. If temperature is above 0.1, skip semantic caching entirely. High-temperature requests are intentionally non-deterministic -- caching them defeats their purpose.
Tenant isolation is non-negotiable in multi-tenant systems. One organization's cache should never serve another's requests, even if the questions are identical. Different organizations may have different knowledge bases, policies, and contexts.
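Putting the temperature gate and the scoping dimensions together, a sketch (the interface mirrors CacheScope above; function names are illustrative):

```typescript
interface CacheScope {
  organizationId: string;
  model: string;
  systemPromptHash: string;
  temperature: number;
}

// Gate: skip the semantic cache entirely for non-deterministic requests.
function shouldUseSemanticCache(scope: CacheScope): boolean {
  return scope.temperature <= 0.1;
}

// Partition key: a hit can only come from the same tenant, model,
// and system prompt -- similarity alone is never enough.
function scopeKey(scope: CacheScope): string {
  return [scope.organizationId, scope.model, scope.systemPromptHash].join(":");
}
```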
Cache Invalidation
LLM caches have unique invalidation challenges:
Time-Based Expiration
Set a default TTL of 24 hours. This handles cases where the underlying data changes (product updates, policy changes, price adjustments).
Knowledge-Base Invalidation
When a knowledge base is updated, invalidate all cache entries for that organization. This prevents stale answers to questions about content that has changed.
Feedback-Based Invalidation
If a user rates a cached response negatively (thumbs down, reported as wrong), invalidate that specific cache entry. This is a powerful self-correcting mechanism.
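The three invalidation paths can be sketched over a simple in-memory store (class and method names are illustrative, not a real API):

```typescript
interface CachedAnswer {
  id: string;
  organizationId: string;
  response: string;
  createdAt: number; // epoch milliseconds
}

class SemanticCacheStore {
  private entries = new Map<string, CachedAnswer>();
  private ttlMs = 24 * 60 * 60 * 1000; // 24-hour default TTL

  put(entry: CachedAnswer): void {
    this.entries.set(entry.id, entry);
  }

  // Time-based expiration: entries past the TTL are dropped on read.
  get(id: string, now = Date.now()): CachedAnswer | null {
    const entry = this.entries.get(id);
    if (!entry) return null;
    if (now - entry.createdAt > this.ttlMs) {
      this.entries.delete(id);
      return null;
    }
    return entry;
  }

  // Knowledge-base update: drop every entry for one organization.
  invalidateOrganization(organizationId: string): void {
    for (const [id, entry] of this.entries) {
      if (entry.organizationId === organizationId) this.entries.delete(id);
    }
  }

  // Feedback-based invalidation: a thumbs-down removes one entry.
  invalidateEntry(id: string): void {
    this.entries.delete(id);
  }
}
```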
What to Expect
The actual savings depend entirely on your workload. Applications with high query repetition benefit the most:
| Workload Type | Expected Semantic Hit Rate | Estimated Cost Savings |
| --- | --- | --- |
| FAQ / Support bot | 50-70% | 40-60% |
| Classification pipeline | 60-80% | 50-70% |
| Code assistance | 10-20% | 5-15% |
| Content generation | 5-15% | 3-10% |
| Document summarization | 15-30% | 10-25% |
The latency improvement is a bonus on top of the savings. Cached responses return in under 50ms versus 2-3 seconds for a round trip to the LLM provider. For support bots and FAQ systems, this makes the AI feel instant.
Edge Cases and Pitfalls
The Synonym Trap
"Cancel my subscription" and "delete my account" have high semantic similarity but are different actions with different consequences. Handle this by including the action context (not just the question) in the embedding:
```typescript
// Instead of just embedding the user message:
// const content = userMessage;

// Embed the message with its action context:
const content = `[intent: ${detectedIntent}] ${userMessage}`;
```
The Temporal Trap
"What time is it?" should never be cached. "What are your business hours?" can be cached. Maintain a list of temporal indicators that bypass caching.
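A minimal sketch of such a bypass gate, with an illustrative (not exhaustive) indicator list that real deployments should tune:

```typescript
// Illustrative temporal indicators -- extend per domain.
const TEMPORAL_INDICATORS = [
  "now", "today", "current", "currently", "latest",
  "right now", "this week", "at the moment",
];

// Word-boundary matching avoids false positives on substrings
// (e.g. "now" inside "knowledge").
function bypassCacheForTemporalQuery(userMessage: string): boolean {
  const normalized = userMessage.toLowerCase();
  return TEMPORAL_INDICATORS.some(indicator =>
    new RegExp(`\\b${indicator}\\b`).test(normalized)
  );
}
```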
The Context Trap
In conversations, the same question in different contexts should get different answers. "Tell me more" after "Explain Kubernetes" is different from "Tell me more" after "Explain DNS." Include the last N turns of conversation context in the embedding to differentiate these cases.
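One way to do this is to fold the last N turns into the text that gets embedded (a sketch; the turn count of 3 is an assumption to tune):

```typescript
interface Message {
  role: "user" | "assistant" | "system";
  content: string;
}

// Build the embedding input from the last N non-system turns, so
// "Tell me more" embeds differently after different topics.
function embeddingText(messages: Message[], lastN = 3): string {
  return messages
    .filter(m => m.role !== "system")
    .slice(-lastN)
    .map(m => `${m.role}: ${m.content}`)
    .join("\n");
}
```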
Implementation Recommendations
Start with exact-match caching. It is simpler and handles the easy cases. Add semantic caching when you have volume data showing low exact-match hit rates.
Monitor incorrect cache hits obsessively. Even a low incorrect rate adds up at scale. Track user feedback on cached responses specifically.
Let teams set their own thresholds. The right threshold depends entirely on the use case. Provide a sensible default but make it configurable.
Do not cache streaming responses. Cache the final assembled response and serve it as a complete response. Simulating streaming from cache adds complexity for minimal benefit.
Key Takeaway
If your LLM application handles any kind of repetitive query pattern -- support, FAQ, classification, analysis -- you are paying full price for answers you have already generated. Semantic caching is not a premature optimization. At $0.01-0.03 per request, it pays for itself quickly.
The embedding and vector search overhead is negligible compared to the LLM calls you avoid. And the latency improvement for cached responses makes your application feel fundamentally faster.