
A Fintech Team Cut Their LLM Bill by 70%. Here is Exactly How.

How a fintech company processing loan applications with LLMs reduced their monthly AI spend from $15K to $4.5K using semantic caching, fallback routing, and token optimization.

Transactional Team
Mar 1, 2026

$15,000 a Month and Growing

A mid-size fintech company -- let us call them LendFlow -- came to us with a problem that was getting worse every month. (What follows is a composite scenario illustrating common cost-optimization patterns, not a specific customer engagement.) They were processing loan applications using LLMs and spending $15,000 a month on AI API calls, and that number was growing 20% month over month.

Their pipeline was straightforward. Every loan application went through three AI steps:

  1. Document classification: Determine document type (pay stub, bank statement, tax return, etc.)
  2. Data extraction: Pull structured data from the document
  3. Risk assessment: Generate a preliminary risk score with reasoning

Each step used GPT-4o. Each application submitted an average of 8 documents. They processed roughly 12,000 applications per month. The math was ugly.

12,000 applications x 8 documents x 3 AI steps = 288,000 LLM calls per month.

At an average of $0.05 per call, that is $14,400. And because application volume was growing, the bill was growing with it.
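That arithmetic is easy to reproduce. A quick sketch using the volumes stated above (the per-call figure is the article's blended average, not a published price):

```javascript
// Reproduce the cost math from the stated volumes.
const applicationsPerMonth = 12000;
const docsPerApplication = 8;
const aiStepsPerDocument = 3;   // classification, extraction, risk assessment
const avgCostPerCallUSD = 0.05; // blended average stated in the article

const monthlyCalls = applicationsPerMonth * docsPerApplication * aiStepsPerDocument;
const monthlyCostUSD = monthlyCalls * avgCostPerCallUSD;

console.log(monthlyCalls);   // 288000
console.log(monthlyCostUSD); // 14400
```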

LendFlow: Before vs. After AI Gateway Optimization

| Metric | Before | After |
|---|---|---|
| Monthly LLM Cost | $14,400 | $4,850 |
| Avg Latency | 2,100ms | 890ms |
| Cost per Application | $1.20 | $0.29 |
| Classification Cost | $2,880/mo | $230/mo |
| Cache Hit Rate | 0% | 64% |

The Diagnosis

When LendFlow connected their pipeline to our AI Gateway and turned on LLM Observability, the problems became obvious within the first hour.

Problem 1: Identical Documents, Zero Caching

Loan applicants frequently submit the same types of documents. A Chase bank statement from January 2026 looks structurally identical to another Chase bank statement from January 2026. The classification step was sending nearly identical inputs to GPT-4o every time.

Observability showed that 62% of document classification requests had a semantic similarity score above 0.97 with a previous request. That is roughly 60,000 classification calls per month (62% of 96,000) that could have been served from cache.

Problem 2: GPT-4o for Everything

Document classification does not need GPT-4o. Deciding whether a document is a pay stub or a bank statement is a straightforward classification task. GPT-4o-mini handles it with 99.2% accuracy. Mistral Small handles it with 98.7% accuracy.

LendFlow was using GPT-4o for all three steps because it was the first model they integrated, and nobody had tested cheaper alternatives. The cost difference is massive: GPT-4o charges $2.50 per million input tokens. GPT-4o-mini charges $0.15. That is a nearly 17x difference.
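Those per-million-token rates translate directly into per-call costs. A sketch; the 1,500-token input size is an illustrative assumption, not a number from LendFlow's pipeline:

```javascript
// Per-call input cost at the quoted rates (USD per 1M input tokens).
const GPT_4O_RATE = 2.50;      // quoted above
const GPT_4O_MINI_RATE = 0.15; // quoted above
const inputTokens = 1500;      // hypothetical classification prompt size

const costGpt4o = (inputTokens / 1_000_000) * GPT_4O_RATE;          // ≈ $0.00375 per call
const costGpt4oMini = (inputTokens / 1_000_000) * GPT_4O_MINI_RATE; // ≈ $0.000225 per call
const ratio = GPT_4O_RATE / GPT_4O_MINI_RATE;                       // ≈ 16.7
```

At any prompt size, the ratio between the two models' input costs is the same ~16.7x, because it depends only on the rates.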

Problem 3: Bloated Prompts

The data extraction prompt included 4,200 tokens of instructions, examples, and edge case handling. Observability showed that 1,800 of those tokens were examples for document types that represented less than 2% of their volume. Every single request paid for those 1,800 tokens whether they were relevant or not.

The Fix

We worked with LendFlow to implement three changes over a two-week period. No application code was rewritten. Everything was configured through AI Gateway.

Week 1: Semantic Caching

We enabled semantic caching for the document classification step with a similarity threshold of 0.96.

// AI Gateway configuration for classification step
{
  model: "openai/gpt-4o",
  cache: {
    enabled: true,
    type: "semantic",
    similarityThreshold: 0.96,
    ttl: 86400 // 24 hours
  }
}

Results were immediate. On day one, the cache hit rate for classification was 58%. By day three, as the cache warmed up, it stabilized at 64%.

That is 64% of 96,000 classification calls per month that no longer hit the LLM. At $0.03 per classification call, that saved roughly $1,840 per month from this single change.
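For readers unfamiliar with semantic caching: instead of matching request text exactly, the gateway embeds each prompt and reuses a cached response when a previous prompt's embedding is close enough. A minimal sketch of the lookup logic, assuming embeddings are already computed (a real gateway would call an embedding model and use an approximate nearest-neighbor index rather than a linear scan):

```javascript
// Minimal semantic cache sketch: cosine similarity over prompt embeddings.
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

class SemanticCache {
  constructor(threshold) {
    this.threshold = threshold; // e.g. 0.96, as in the config above
    this.entries = [];          // { embedding, response, expiresAt }
  }
  get(embedding, now = Date.now()) {
    for (const e of this.entries) {
      if (e.expiresAt > now && cosineSimilarity(embedding, e.embedding) >= this.threshold) {
        return e.response; // cache hit: skip the LLM call entirely
      }
    }
    return null; // cache miss: caller invokes the LLM, then calls set()
  }
  set(embedding, response, ttlMs, now = Date.now()) {
    this.entries.push({ embedding, response, expiresAt: now + ttlMs });
  }
}
```

The hit/miss logic is the whole trick: near-duplicate documents produce near-identical embeddings, so they land above the threshold and never reach the model.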

Week 1: Fallback Routing to Cheaper Models

We reconfigured the classification step to use GPT-4o-mini as the primary model with GPT-4o as a fallback for low-confidence responses.

// Classification step configuration
{
  model: "openai/gpt-4o-mini",
  fallbacks: ["openai/gpt-4o"],
  routing: {
    strategy: "confidence",
    confidenceThreshold: 0.90,
    // If GPT-4o-mini confidence < 90%, retry with GPT-4o
  }
}

GPT-4o-mini handled 94% of classifications without needing the fallback. The 6% that fell through to GPT-4o were edge cases like handwritten documents or poor quality scans. Classification accuracy stayed above 99%.

Cost per classification call dropped from $0.03 to $0.002. Combined with caching, the classification step went from $2,880/month to $230/month.
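The routing behavior amounts to plain control flow. A sketch; `callModel` and its `{ label, confidence }` return shape are stand-ins for whatever your gateway or SDK exposes, not a real API:

```javascript
// Confidence-gated fallback: try the cheap model first, escalate on low confidence.
// callModel(modelId, doc) -> { label, confidence } is a stand-in interface.
function classifyWithFallback(doc, callModel, threshold = 0.9) {
  const primary = callModel("openai/gpt-4o-mini", doc);
  if (primary.confidence >= threshold) {
    return { ...primary, model: "openai/gpt-4o-mini" };
  }
  // Low confidence (e.g. handwritten documents, poor scans): escalate.
  const fallback = callModel("openai/gpt-4o", doc);
  return { ...fallback, model: "openai/gpt-4o" };
}
```

Because the expensive model is only consulted on the low-confidence tail, the blended cost per call stays close to the cheap model's price.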

Week 2: Token Optimization

We analyzed the data extraction prompt using our token usage dashboard. The prompt breakdown showed:

  • System instructions: 1,200 tokens (necessary)
  • Output format specification: 800 tokens (necessary)
  • Common examples: 400 tokens (necessary)
  • Rare edge case examples: 1,800 tokens (removable)

We split the prompt into a base prompt (2,400 tokens) and document-type-specific appendices. The gateway dynamically includes only the relevant appendix based on the classification result from step 1.

// Dynamic prompt assembly
{
  model: "openai/gpt-4o",
  systemPrompt: baseExtractionPrompt,
  dynamicContext: {
    source: "classification_result",
    appendices: {
      "pay_stub": payStubExamples,        // 200 tokens
      "bank_statement": bankStatementExamples,  // 250 tokens
      "tax_return": taxReturnExamples,    // 300 tokens
      // ... other types
    }
  }
}

Average prompt size dropped from 4,200 tokens to 2,700 tokens, a 36% reduction in input tokens on every extraction call.
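The assembly itself is just a lookup keyed on the step-1 classification result. A sketch with placeholder prompt text; names like `buildExtractionPrompt` are illustrative, not a gateway API:

```javascript
// Assemble the extraction prompt: shared base + one document-type appendix.
const basePrompt = "You are a loan-document data extractor. ..."; // ~2,400 tokens in practice
const appendices = {
  pay_stub: "Pay stub examples ...",             // ~200 tokens
  bank_statement: "Bank statement examples ...", // ~250 tokens
  tax_return: "Tax return examples ...",         // ~300 tokens
};

function buildExtractionPrompt(documentType) {
  const appendix = appendices[documentType];
  // Unknown or rare types fall back to the base prompt alone.
  return appendix ? basePrompt + "\n\n" + appendix : basePrompt;
}
```

Every request pays for the base plus at most one appendix, instead of the base plus every edge case.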

The Numbers

Here is the before and after, measured over a 30-day period at the same application volume (12,000 applications):

Document Classification

| Metric | Before | After | Change |
|---|---|---|---|
| Monthly calls | 96,000 | 34,560 | -64% (caching) |
| Model | GPT-4o | GPT-4o-mini | -94% cost/call |
| Monthly cost | $2,880 | $230 | -92% |

Data Extraction

| Metric | Before | After | Change |
|---|---|---|---|
| Monthly calls | 96,000 | 96,000 | No change |
| Avg prompt tokens | 4,200 | 2,700 | -36% |
| Monthly cost | $7,200 | $4,600 | -36% |

Risk Assessment

| Metric | Before | After | Change |
|---|---|---|---|
| Monthly calls | 96,000 | 96,000 | No change |
| Model | GPT-4o | GPT-4o | No change |
| Monthly cost | $4,320 | $4,320 | No change |

Risk assessment stayed on GPT-4o. This step requires the strongest reasoning capability, and the cost was justified by the quality requirements.

Total

| Metric | Before | After | Change |
|---|---|---|---|
| Total monthly cost | $14,400 | $4,850 | -66% |
| Avg latency | 2,100ms | 890ms | -58% |
| Classification accuracy | 99.1% | 99.2% | +0.1% |
| Extraction accuracy | 97.8% | 97.6% | -0.2% |

The final cost reduction was 66%, not quite 70%. But the following month, as cache hit rates improved and they optimized a few more prompts, the number hit 71%.

The Timeline

  • Day 1: Connected pipeline to AI Gateway, enabled observability
  • Day 1-3: Analyzed cost breakdown, identified optimization targets
  • Day 4-5: Enabled semantic caching for classification
  • Day 6-8: Configured model fallback routing
  • Day 9-12: Refactored extraction prompts, tested accuracy
  • Day 14: Full rollout, monitoring

Two weeks of configuration changes. No application code rewritten. No models retrained. No infrastructure changes.

What LendFlow Did Next

With their AI costs under control, LendFlow actually expanded their AI usage. They added a fourth step -- document fraud detection -- that they had previously rejected as too expensive. With the cost savings from optimization, the new step was budget-neutral.

Their LLM bill is now stable at around $5,500/month despite processing 40% more applications than when they started. Cost per application dropped from $1.20 to $0.29.

The Takeaway

Most teams overspend on LLMs not because the technology is expensive, but because they have no visibility into what is expensive and why. They use the same model for every task because testing alternatives requires rewriting integration code. They send bloated prompts because nobody has profiled token usage. They make redundant calls because there is no caching layer.

The fix is not cheaper models. It is visibility and routing. Know what each call costs, send each call to the right model, cache what you can, and trim what you do not need.

Start with the AI Gateway to get the routing and caching. Add LLM Observability to see where the money goes. The rest follows.

