A Fintech Team Cut Their LLM Bill by 70%. Here is Exactly How.
How a fintech company processing loan applications with LLMs reduced their monthly AI spend from $15K to $4.5K using semantic caching, fallback routing, and token optimization.
Transactional Team
Mar 1, 2026
8 min read
$15,000 a Month and Growing
A mid-size fintech company -- let us call them LendFlow -- came to us with a problem that was getting worse every month. (What follows is a composite scenario illustrating common cost-optimization patterns, not a specific customer engagement.) They were processing loan applications with LLMs and spending $15,000 a month on AI API calls, and that number was growing 20% month over month.
Their pipeline was straightforward. Every loan application went through three AI steps:
Document classification: Determine document type (pay stub, bank statement, tax return, etc.)
Data extraction: Pull structured data from the document
Risk assessment: Generate a preliminary risk score with reasoning
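The three steps above can be sketched as a simple per-document pipeline. This is an illustrative sketch, not LendFlow's actual code: the function names and prompts are assumptions, and `llm` stands in for any chat-completion client.

```python
# Illustrative per-document pipeline: 3 LLM calls per document.
# Function names and prompts are assumptions for illustration.

def classify_document(llm, doc_text):
    """Step 1: label the document type (pay stub, bank statement, ...)."""
    return llm(f"Classify this document:\n{doc_text}")

def extract_data(llm, doc_text, doc_type):
    """Step 2: pull structured fields for the classified type."""
    return llm(f"Extract fields from this {doc_type}:\n{doc_text}")

def assess_risk(llm, doc_text, fields):
    """Step 3: generate a preliminary risk score with reasoning."""
    return llm(f"Assess risk for this document given fields: {fields}")

def process_application(llm, documents):
    """Run all three steps for every document in the application."""
    results = []
    for doc in documents:
        doc_type = classify_document(llm, doc)
        fields = extract_data(llm, doc, doc_type)
        risk = assess_risk(llm, doc, fields)
        results.append({"type": doc_type, "fields": fields, "risk": risk})
    return results
```

With this shape, an 8-document application makes 24 LLM calls, which is where the call volume below comes from.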
Each step used GPT-4o. Each application submitted an average of 8 documents. They processed roughly 12,000 applications per month. The math was ugly.
12,000 applications x 8 documents x 3 AI steps = 288,000 LLM calls per month.
At an average of $0.05 per call, that is $14,400. And because application volume was growing, the bill was growing with it.
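The arithmetic above can be written as a quick sanity check (numbers taken directly from the article):

```python
# Back-of-envelope model of the monthly LLM spend.
applications_per_month = 12_000
docs_per_application = 8
ai_steps_per_doc = 3
avg_cost_per_call = 0.05  # dollars

calls_per_month = applications_per_month * docs_per_application * ai_steps_per_doc
monthly_cost = calls_per_month * avg_cost_per_call

print(calls_per_month, monthly_cost)  # 288000 14400.0
```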
LendFlow: Before vs. After AI Gateway Optimization
| Metric | Before | After |
| --- | --- | --- |
| Monthly LLM Cost | $14,400 | $4,850 |
| Avg Latency | 2,100ms | 890ms |
| Cost per Application | $1.20 | $0.29 |
| Classification Cost | $2,880/mo | $230/mo |
| Cache Hit Rate | 0% | 64% |
The Diagnosis
When LendFlow connected their pipeline to our AI Gateway and turned on LLM Observability, the problems became obvious within the first hour.
Problem 1: Identical Documents, Zero Caching
Loan applicants frequently submit the same types of documents. A Chase bank statement from January 2026 looks structurally identical to another Chase bank statement from January 2026. The classification step was sending nearly identical inputs to GPT-4o every time.
Observability showed that 62% of document classification requests had a semantic similarity score above 0.97 with a previous request. That is roughly 60,000 of the 96,000 monthly classification calls that could have been served from cache.
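Detecting these near-duplicates comes down to comparing embedding vectors. A minimal sketch of the similarity check, assuming requests have already been embedded (the embedding step itself is omitted):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def find_cache_hit(query_vec, cache, threshold=0.97):
    """Return the cached response most similar to the query, if its
    similarity clears the threshold; otherwise None (a cache miss).
    `cache` is a list of (embedding, response) pairs."""
    best_score, best_response = 0.0, None
    for cached_vec, response in cache:
        score = cosine_similarity(query_vec, cached_vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None
```

A production gateway would use an approximate-nearest-neighbor index rather than a linear scan, but the threshold logic is the same.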
Problem 2: GPT-4o for Everything
Document classification does not need GPT-4o. Deciding whether a document is a pay stub or a bank statement is a straightforward classification task. GPT-4o-mini handles it with 99.2% accuracy. Mistral Small handles it with 98.7% accuracy.
LendFlow was using GPT-4o for all three steps because it was the first model they integrated, and nobody had tested cheaper alternatives. The cost difference is massive: GPT-4o charges $2.50 per million input tokens. GPT-4o-mini charges $0.15. That is a 16x difference.
Problem 3: Bloated Prompts
The data extraction prompt included 4,200 tokens of instructions, examples, and edge case handling. Observability showed that 1,800 of those tokens were examples for document types that represented less than 2% of their volume. Every single request paid for those 1,800 tokens whether they were relevant or not.
The Fix
We worked with LendFlow to implement three changes over a two-week period. No application code was rewritten. Everything was configured through AI Gateway.
Week 1: Semantic Caching
We enabled semantic caching for the document classification step with a similarity threshold of 0.96.
Results were immediate. On day one, the cache hit rate for classification was 58%. By day three, as the cache warmed up, it stabilized at 64%.
That is 64% of 96,000 classification calls per month that no longer hit the LLM. At $0.03 per classification call, that saved roughly $1,840 per month from this single change.
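The cache-or-call flow behind those numbers can be sketched as follows. This is an assumed interface, not the gateway's actual implementation: `embed` maps text to a vector and `similarity` compares vectors on a 0-1 scale.

```python
class SemanticCache:
    """Minimal semantic-cache sketch. On a miss, the model is called and
    the (embedding, response) pair is stored for future lookups."""

    def __init__(self, embed, similarity, threshold=0.96):
        self.embed = embed
        self.similarity = similarity
        self.threshold = threshold
        self.entries = []  # list of (embedding, response)

    def lookup(self, text):
        vec = self.embed(text)
        for cached_vec, response in self.entries:
            if self.similarity(vec, cached_vec) >= self.threshold:
                return response
        return None

    def get_or_call(self, text, call_model):
        hit = self.lookup(text)
        if hit is not None:
            return hit  # served from cache: no LLM spend, no LLM latency
        response = call_model(text)
        self.entries.append((self.embed(text), response))
        return response
```

Because cache hits skip the LLM round-trip entirely, caching reduces latency as well as cost, which is part of the latency drop in the totals below.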
Week 1: Fallback Routing to Cheaper Models
We reconfigured the classification step to use GPT-4o-mini as the primary model with GPT-4o as a fallback for low-confidence responses.
GPT-4o-mini handled 94% of classifications without needing the fallback. The 6% that fell through to GPT-4o were edge cases like handwritten documents or poor quality scans. Classification accuracy stayed above 99%.
Cost per classification call dropped from $0.03 to $0.002. Combined with caching, the classification step went from $2,880/month to $230/month.
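The routing rule is simple: try the cheap model, escalate only on low confidence. A sketch under the assumption that each model callable returns a `(label, confidence)` pair (in practice, confidence might come from token logprobs or a self-reported score):

```python
def classify_with_fallback(primary, fallback, doc_text, confidence_threshold=0.9):
    """Route to the cheap model first; escalate to the strong model only
    when the primary's confidence falls below the threshold. Returns the
    label and which route handled the call."""
    label, confidence = primary(doc_text)
    if confidence >= confidence_threshold:
        return label, "primary"
    label, _ = fallback(doc_text)
    return label, "fallback"
```

With 94% of calls resolved by the primary, the blended cost per call sits close to the cheap model's price while accuracy is protected by the fallback.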
Week 2: Token Optimization
We analyzed the data extraction prompt using our token usage dashboard. The prompt breakdown showed:
System instructions: 1,200 tokens (necessary)
Output format specification: 800 tokens (necessary)
Common examples: 400 tokens (necessary)
Rare edge case examples: 1,800 tokens (removable)
We split the prompt into a base prompt (2,400 tokens) and document-type-specific appendices. The gateway dynamically includes only the relevant appendix based on the classification result from step 1.
Average prompt size dropped from 4,200 tokens to 2,700 tokens -- a 36% reduction in input tokens on every extraction call.
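The split-prompt idea can be sketched as a small assembly function. The prompt contents and document-type keys here are placeholders, not LendFlow's real prompts:

```python
# Shared base prompt: system instructions, output format, common examples
# (~2,400 tokens in the article's breakdown).
BASE_PROMPT = "System instructions, output format spec, and common examples."

# Document-type-specific appendices; only one is ever included per call.
APPENDICES = {
    "pay_stub": "Edge-case handling for pay stubs.",
    "bank_statement": "Edge-case handling for bank statements.",
    "tax_return": "Edge-case handling for tax returns.",
}

def build_extraction_prompt(doc_type, doc_text):
    """Assemble the base prompt plus only the appendix matching the
    classification result from step 1, instead of shipping every rare
    edge case on every call."""
    appendix = APPENDICES.get(doc_type, "")
    return "\n\n".join(part for part in (BASE_PROMPT, appendix, doc_text) if part)
```

The classification result from step 1 selects the appendix, so the pipeline's existing ordering does the routing for free.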
The Numbers
Here is the before and after, measured over a 30-day period at the same application volume (12,000 applications):
Document Classification
| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Monthly calls | 96,000 | 34,560 | -64% (caching) |
| Model | GPT-4o | GPT-4o-mini | -94% cost/call |
| Monthly cost | $2,880 | $230 | -92% |
Data Extraction
| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Monthly calls | 96,000 | 96,000 | No change |
| Avg prompt tokens | 4,200 | 2,700 | -36% |
| Monthly cost | $7,200 | $4,600 | -36% |
Risk Assessment
| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Monthly calls | 96,000 | 96,000 | No change |
| Model | GPT-4o | GPT-4o | No change |
| Monthly cost | $4,320 | $4,320 | No change |
Risk assessment stayed on GPT-4o. This step requires the strongest reasoning capability, and the cost was justified by the quality requirements.
Total
| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Total monthly cost | $14,400 | $4,850 | -66% |
| Avg latency | 2,100ms | 890ms | -58% |
| Classification accuracy | 99.1% | 99.2% | +0.1% |
| Extraction accuracy | 97.8% | 97.6% | -0.2% |
The final cost reduction was 66%, not quite 70%. But the following month, as cache hit rates improved and they optimized a few more prompts, the number hit 71%.
The Timeline
Day 1: Connected pipeline to AI Gateway, enabled observability
Day 1-3: Analyzed cost breakdown, identified optimization targets
Day 4-5: Enabled semantic caching for classification
Day 6-8: Configured model fallback routing
Day 9-12: Refactored extraction prompts, tested accuracy
Day 14: Full rollout, monitoring
Two weeks of configuration changes. No application code rewritten. No models retrained. No infrastructure changes.
What LendFlow Did Next
With their AI costs under control, LendFlow actually expanded their AI usage. They added a fourth step -- document fraud detection -- that they had previously rejected as too expensive. With the cost savings from optimization, the new step was budget-neutral.
Their LLM bill is now stable at around $5,500/month despite processing 40% more applications than when they started. Cost per application dropped from $1.20 to $0.29.
The Takeaway
Most teams overspend on LLMs not because the technology is expensive, but because they have no visibility into what is expensive and why. They use the same model for every task because testing alternatives requires rewriting integration code. They send bloated prompts because nobody has profiled token usage. They make redundant calls because there is no caching layer.
The fix is not cheaper models. It is visibility and routing. Know what each call costs, send each call to the right model, cache what you can, and trim what you do not need.
Start with the AI Gateway to get the routing and caching. Add LLM Observability to see where the money goes. The rest follows.