
A Fintech Team Cut Their LLM Bill by 70%. Here is Exactly How.

How a fintech company processing loan applications with LLMs reduced their monthly AI spend from $15K to $4.5K using semantic caching, fallback routing, and token optimization.

Transactional Team
Mar 1, 2026

$15,000 a Month and Growing

A mid-size fintech company -- let us call them LendFlow -- came to us with a problem that was getting worse every month. (What follows is a composite scenario illustrating common cost-optimization patterns, not a specific customer engagement.) They were processing loan applications using LLMs and spending $15,000 a month on AI API calls, and that number was growing 20% month over month.

Their pipeline was straightforward. Every loan application went through three AI steps:

  1. Document classification: Determine document type (pay stub, bank statement, tax return, etc.)
  2. Data extraction: Pull structured data from the document
  3. Risk assessment: Generate a preliminary risk score with reasoning

Each step used GPT-4o. Each application submitted an average of 8 documents. They processed roughly 12,000 applications per month. The math was ugly.

12,000 applications x 8 documents x 3 AI steps = 288,000 LLM calls per month.

At an average of $0.05 per call, that is $14,400. And because application volume was growing, the bill was growing with it.
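That arithmetic is easy to reproduce. A quick sketch using the volumes stated above (the per-call figure is the article's blended average, not a published price):

```javascript
// Reproduce the cost math from the stated volumes.
const applicationsPerMonth = 12000;
const docsPerApplication = 8;
const aiStepsPerDocument = 3;   // classification, extraction, risk assessment
const avgCostPerCallUSD = 0.05; // blended average stated in the article

const monthlyCalls = applicationsPerMonth * docsPerApplication * aiStepsPerDocument;
const monthlyCostUSD = monthlyCalls * avgCostPerCallUSD;

console.log(monthlyCalls);   // 288000
console.log(monthlyCostUSD); // 14400
```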

LendFlow: Before vs. After AI Gateway Optimization

| Metric | Before | After |
|---|---|---|
| Monthly LLM Cost | $14,400 | $4,850 |
| Avg Latency | 2,100ms | 890ms |
| Cost per Application | $1.20 | $0.29 |
| Classification Cost | $2,880/mo | $230/mo |
| Cache Hit Rate | 0% | 64% |

The Diagnosis

When LendFlow connected their pipeline to our AI Gateway and turned on LLM Observability, the problems became obvious within the first hour.

Problem 1: Identical Documents, Zero Caching

Loan applicants frequently submit the same types of documents. A Chase bank statement from January 2026 looks structurally identical to another Chase bank statement from January 2026. The classification step was sending nearly identical inputs to GPT-4o every time.

Observability showed that 62% of document classification requests had a semantic similarity score above 0.97 with a previous request. That is roughly 60,000 classification calls per month (62% of 96,000) that could have been served from cache.

Problem 2: GPT-4o for Everything

Document classification does not need GPT-4o. Deciding whether a document is a pay stub or a bank statement is a straightforward classification task. GPT-4o-mini handles it with 99.2% accuracy. Mistral Small handles it with 98.7% accuracy.

LendFlow was using GPT-4o for all three steps because it was the first model they integrated, and nobody had tested cheaper alternatives. The cost difference is massive: GPT-4o charges $2.50 per million input tokens. GPT-4o-mini charges $0.15. That is a nearly 17x difference.
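Those per-million-token rates translate directly into per-call costs. A sketch; the 1,500-token input size is an illustrative assumption, not a number from LendFlow's pipeline:

```javascript
// Per-call input cost at the quoted rates (USD per 1M input tokens).
const GPT_4O_RATE = 2.50;      // quoted above
const GPT_4O_MINI_RATE = 0.15; // quoted above
const inputTokens = 1500;      // hypothetical classification prompt size

const costGpt4o = (inputTokens / 1_000_000) * GPT_4O_RATE;          // ≈ $0.00375 per call
const costGpt4oMini = (inputTokens / 1_000_000) * GPT_4O_MINI_RATE; // ≈ $0.000225 per call
const ratio = GPT_4O_RATE / GPT_4O_MINI_RATE;                       // ≈ 16.7
```

At any prompt size, the ratio between the two models' input costs is the same ~16.7x, because it depends only on the rates.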

Problem 3: Bloated Prompts

The data extraction prompt included 4,200 tokens of instructions, examples, and edge case handling. Observability showed that 1,800 of those tokens were examples for document types that represented less than 2% of their volume. Every single request paid for those 1,800 tokens whether they were relevant or not.

The Fix

We worked with LendFlow to implement three changes over a two-week period. No application code was rewritten. Everything was configured through AI Gateway.

Week 1: Semantic Caching

We enabled semantic caching for the document classification step with a similarity threshold of 0.96.

// AI Gateway configuration for classification step
{
  model: "openai/gpt-4o",
  cache: {
    enabled: true,
    type: "semantic",
    similarityThreshold: 0.96,
    ttl: 86400 // 24 hours
  }
}

Results were immediate. On day one, the cache hit rate for classification was 58%. By day three, as the cache warmed up, it stabilized at 64%.

That is 64% of 96,000 classification calls per month that no longer hit the LLM. At $0.03 per classification call, that saved roughly $1,840 per month from this single change.
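For readers unfamiliar with semantic caching: instead of matching request text exactly, the gateway embeds each prompt and reuses a cached response when a previous prompt's embedding is close enough. A minimal sketch of the lookup logic, assuming embeddings are already computed (a real gateway would call an embedding model and use an approximate nearest-neighbor index rather than a linear scan):

```javascript
// Minimal semantic cache sketch: cosine similarity over prompt embeddings.
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

class SemanticCache {
  constructor(threshold) {
    this.threshold = threshold; // e.g. 0.96, as in the config above
    this.entries = [];          // { embedding, response, expiresAt }
  }
  get(embedding, now = Date.now()) {
    for (const e of this.entries) {
      if (e.expiresAt > now && cosineSimilarity(embedding, e.embedding) >= this.threshold) {
        return e.response; // cache hit: skip the LLM call entirely
      }
    }
    return null; // cache miss: caller invokes the LLM, then calls set()
  }
  set(embedding, response, ttlMs, now = Date.now()) {
    this.entries.push({ embedding, response, expiresAt: now + ttlMs });
  }
}
```

The hit/miss logic is the whole trick: near-duplicate documents produce near-identical embeddings, so they land above the threshold and never reach the model.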

Week 1: Fallback Routing to Cheaper Models

We reconfigured the classification step to use GPT-4o-mini as the primary model with GPT-4o as a fallback for low-confidence responses.

// Classification step configuration
{
  model: "openai/gpt-4o-mini",
  fallbacks: ["openai/gpt-4o"],
  routing: {
    strategy: "confidence",
    confidenceThreshold: 0.90,
    // If GPT-4o-mini confidence < 90%, retry with GPT-4o
  }
}

GPT-4o-mini handled 94% of classifications without needing the fallback. The 6% that fell through to GPT-4o were edge cases like handwritten documents or poor quality scans. Classification accuracy stayed above 99%.

Cost per classification call dropped from $0.03 to $0.002. Combined with caching, the classification step went from $2,880/month to $230/month.
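The routing behavior amounts to plain control flow. A sketch; `callModel` and its `{ label, confidence }` return shape are stand-ins for whatever your gateway or SDK exposes, not a real API:

```javascript
// Confidence-gated fallback: try the cheap model first, escalate on low confidence.
// callModel(modelId, doc) -> { label, confidence } is a stand-in interface.
function classifyWithFallback(doc, callModel, threshold = 0.9) {
  const primary = callModel("openai/gpt-4o-mini", doc);
  if (primary.confidence >= threshold) {
    return { ...primary, model: "openai/gpt-4o-mini" };
  }
  // Low confidence (e.g. handwritten documents, poor scans): escalate.
  const fallback = callModel("openai/gpt-4o", doc);
  return { ...fallback, model: "openai/gpt-4o" };
}
```

Because the expensive model is only consulted on the low-confidence tail, the blended cost per call stays close to the cheap model's price.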

Week 2: Token Optimization

We analyzed the data extraction prompt using our token usage dashboard. The prompt breakdown showed:

  • System instructions: 1,200 tokens (necessary)
  • Output format specification: 800 tokens (necessary)
  • Common examples: 400 tokens (necessary)
  • Rare edge case examples: 1,800 tokens (removable)

We split the prompt into a base prompt (2,400 tokens) and document-type-specific appendices. The gateway dynamically includes only the relevant appendix based on the classification result from step 1.

// Dynamic prompt assembly
{
  model: "openai/gpt-4o",
  systemPrompt: baseExtractionPrompt,
  dynamicContext: {
    source: "classification_result",
    appendices: {
      "pay_stub": payStubExamples,        // 200 tokens
      "bank_statement": bankStatementExamples,  // 250 tokens
      "tax_return": taxReturnExamples,    // 300 tokens
      // ... other types
    }
  }
}

Average prompt size dropped from 4,200 tokens to 2,700 tokens, a 36% reduction in input tokens on every extraction call.
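The assembly itself is just a lookup keyed on the step-1 classification result. A sketch with placeholder prompt text; names like `buildExtractionPrompt` are illustrative, not a gateway API:

```javascript
// Assemble the extraction prompt: shared base + one document-type appendix.
const basePrompt = "You are a loan-document data extractor. ..."; // ~2,400 tokens in practice
const appendices = {
  pay_stub: "Pay stub examples ...",             // ~200 tokens
  bank_statement: "Bank statement examples ...", // ~250 tokens
  tax_return: "Tax return examples ...",         // ~300 tokens
};

function buildExtractionPrompt(documentType) {
  const appendix = appendices[documentType];
  // Unknown or rare types fall back to the base prompt alone.
  return appendix ? basePrompt + "\n\n" + appendix : basePrompt;
}
```

Every request pays for the base plus at most one appendix, instead of the base plus every edge case.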

The Numbers

Here is the before and after, measured over a 30-day period at the same application volume (12,000 applications):

Document Classification

| Metric | Before | After | Change |
|---|---|---|---|
| Monthly calls | 96,000 | 34,560 | -64% (caching) |
| Model | GPT-4o | GPT-4o-mini | -94% cost/call |
| Monthly cost | $2,880 | $230 | -92% |

Data Extraction

| Metric | Before | After | Change |
|---|---|---|---|
| Monthly calls | 96,000 | 96,000 | No change |
| Avg prompt tokens | 4,200 | 2,700 | -36% |
| Monthly cost | $7,200 | $4,600 | -36% |

Risk Assessment

| Metric | Before | After | Change |
|---|---|---|---|
| Monthly calls | 96,000 | 96,000 | No change |
| Model | GPT-4o | GPT-4o | No change |
| Monthly cost | $4,320 | $4,320 | No change |

Risk assessment stayed on GPT-4o. This step requires the strongest reasoning capability, and the cost was justified by the quality requirements.

Total

| Metric | Before | After | Change |
|---|---|---|---|
| Total monthly cost | $14,400 | $4,850 | -66% |
| Avg latency | 2,100ms | 890ms | -58% |
| Classification accuracy | 99.1% | 99.2% | +0.1% |
| Extraction accuracy | 97.8% | 97.6% | -0.2% |

The final cost reduction was 66%, not quite 70%. But the following month, as cache hit rates improved and they optimized a few more prompts, the number hit 71%.

The Timeline

  • Day 1: Connected pipeline to AI Gateway, enabled observability
  • Day 1-3: Analyzed cost breakdown, identified optimization targets
  • Day 4-5: Enabled semantic caching for classification
  • Day 6-8: Configured model fallback routing
  • Day 9-12: Refactored extraction prompts, tested accuracy
  • Day 14: Full rollout, monitoring

Two weeks of configuration changes. No application code rewritten. No models retrained. No infrastructure changes.

What LendFlow Did Next

With their AI costs under control, LendFlow actually expanded their AI usage. They added a fourth step -- document fraud detection -- that they had previously rejected as too expensive. With the cost savings from optimization, the new step was budget-neutral.

Their LLM bill is now stable at around $5,500/month despite processing 40% more applications than when they started. Cost per application dropped from $1.20 to $0.29.

The Takeaway

Most teams overspend on LLMs not because the technology is expensive, but because they have no visibility into what is expensive and why. They use the same model for every task because testing alternatives requires rewriting integration code. They send bloated prompts because nobody has profiled token usage. They make redundant calls because there is no caching layer.

The fix is not cheaper models. It is visibility and routing. Know what each call costs, send each call to the right model, cache what you can, and trim what you do not need.

Start with the AI Gateway to get the routing and caching. Add LLM Observability to see where the money goes. The rest follows.

