AI API costs can grow fast — especially when your application handles thousands of requests, uses large contexts, or calls expensive models for every query. Most applications can cut costs 50-80% with little or no quality loss.
The key techniques are caching (avoid repeat work), routing (use the cheapest model that handles each query well), and prompt optimisation (send fewer tokens). This guide covers all three with practical implementation patterns.
Quick answer
Cache responses for repeated or similar queries, route simple queries to cheaper models and complex ones to expensive models, optimise prompts to use fewer tokens, and batch similar requests. These techniques can reduce costs 50-80% with minimal quality impact.
This guide is for you if:
- Your AI API bill is growing faster than your revenue.
- You are using an expensive model for every request regardless of complexity.
- Many of your queries are similar or identical.
Response caching
The simplest cost reduction: if the same question comes in again, return the cached response instead of calling the API.
Exact-match caching is easy to implement. Semantic caching (matching queries that are similar but not identical) is more complex but can catch more cache hits.
- Hash the prompt to create cache keys for exact matches
- Set TTL (time-to-live) based on how quickly your data changes
- For semantic caching, embed queries and match against cached query embeddings
- Cache at the application layer, not the API layer — you control the cache key
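The bullets above can be sketched as a minimal exact-match cache. Everything here is illustrative: `call_model` is a stand-in for your real API client, and the default TTL is an assumption you would tune to how quickly your data changes.

```python
import hashlib
import time


class ResponseCache:
    """Exact-match response cache keyed by a hash of the full prompt."""

    def __init__(self, ttl_seconds: int = 3600):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (timestamp, response)

    def _key(self, prompt: str) -> str:
        # Hash the prompt so the cache key is small and uniform.
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str):
        entry = self._store.get(self._key(prompt))
        if entry is None:
            return None
        timestamp, response = entry
        if time.time() - timestamp > self.ttl:
            return None  # expired: treat as a miss
        return response

    def put(self, prompt: str, response: str) -> None:
        self._store[self._key(prompt)] = (time.time(), response)


def cached_call(cache: ResponseCache, prompt: str, call_model) -> str:
    """Return a cached response when available, otherwise call the API."""
    cached = cache.get(prompt)
    if cached is not None:
        return cached
    response = call_model(prompt)  # your real API client goes here
    cache.put(prompt, response)
    return response
```

For semantic caching, the hash lookup would be replaced by an embedding similarity search against cached query embeddings, as described above.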
Model routing
Not every query needs your most expensive model. Simple questions ('What is X?'), formatting tasks, and classification can be handled by cheaper, faster models.
Build a router that classifies incoming queries and sends them to the appropriate model. Route 60-70% of queries to a cheap model and only the complex ones to the expensive model.
```python
# Simple model router. is_simple_question and requires_reasoning are
# application-specific classifiers you would implement yourself.
def route_query(query: str) -> str:
    # Classify query complexity and pick the cheapest adequate model
    if len(query) < 100 and is_simple_question(query):
        return "haiku"   # cheap model
    elif requires_reasoning(query):
        return "opus"    # expensive model
    else:
        return "sonnet"  # mid-tier model
```

Prompt optimisation
Tokens cost money. Shorter prompts that produce the same output quality are free cost savings.
Review your prompts for unnecessary repetition, verbose instructions, and context that does not help the model. A prompt that went through three rounds of 'just add this too' is usually bloated.
- Remove redundant instructions — say it once, clearly
- Use system prompts for persistent instructions instead of repeating in every user message
- Trim context to only what is relevant — do not dump entire documents when a paragraph will do
- Measure the result: the shorter prompt must produce the same output quality, otherwise the savings are illusory
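To quantify the "send fewer tokens" point, compare prompt sizes before and after trimming. The sketch below uses a rough four-characters-per-token heuristic, which is an approximation, not your provider's real tokenizer, and the price argument is illustrative.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    # Use your provider's tokenizer for exact counts.
    return max(1, len(text) // 4)


def estimate_savings(old_prompt: str, new_prompt: str,
                     requests_per_day: int,
                     price_per_1k_tokens: float) -> tuple[int, int, float]:
    """Return (old tokens, new tokens, estimated daily $ saving)."""
    old_t = estimate_tokens(old_prompt)
    new_t = estimate_tokens(new_prompt)
    daily_saving = (old_t - new_t) / 1000 * price_per_1k_tokens * requests_per_day
    return old_t, new_t, daily_saving
```

Running this before and after a prompt rewrite gives a concrete dollar figure to weigh against any measured quality change.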
Batching requests
If you process many independent items (documents, support tickets, products), batch them into a single API call with batch pricing. Most providers offer 50% discounts for batch processing.
Even without batch pricing, combining multiple items into a single prompt reduces per-request overhead and can be cheaper than individual calls.
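Folding several items into one prompt can be sketched as below. This is generic: the numbered format and line-based parsing are one convention for combining items, not a provider feature, and the instruction text is a placeholder.

```python
def build_batch_prompt(instruction: str, items: list[str]) -> str:
    """Combine independent items into one numbered prompt."""
    numbered = "\n".join(f"{i + 1}. {item}" for i, item in enumerate(items))
    return (
        f"{instruction}\n"
        "Answer each numbered item on its own line, prefixed with its number.\n\n"
        f"{numbered}"
    )


def parse_batch_response(response: str, n_items: int) -> list[str]:
    """Split a numbered response back into per-item answers."""
    answers = [""] * n_items
    for line in response.splitlines():
        prefix, _, rest = line.strip().partition(".")
        if prefix.isdigit() and 1 <= int(prefix) <= n_items:
            answers[int(prefix) - 1] = rest.strip()
    return answers
```

The parser is deliberately defensive: if the model skips an item, the corresponding slot stays empty rather than shifting every answer by one.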
Monitoring and optimising
Track cost per request, per feature, and per user. Know where your money goes before trying to reduce it.
Set up alerts for cost spikes — a bug that sends the full document instead of a summary can cost hundreds in a day.
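Tracking cost per request and per feature can start as simple as an in-process accumulator with a daily threshold. The budget value and `alert` callback here are assumptions; in production you would feed a metrics system and page on the threshold instead.

```python
from collections import defaultdict


class CostTracker:
    """Accumulate per-feature API spend and alert on spikes."""

    def __init__(self, daily_budget: float, alert):
        self.daily_budget = daily_budget
        self.alert = alert  # callback, e.g. Slack/pager in production
        self.by_feature = defaultdict(float)
        self.total = 0.0

    def record(self, feature: str, input_tokens: int, output_tokens: int,
               price_in_per_1k: float, price_out_per_1k: float) -> float:
        """Record one request's cost; fire the alert if over budget."""
        cost = (input_tokens / 1000 * price_in_per_1k
                + output_tokens / 1000 * price_out_per_1k)
        self.by_feature[feature] += cost
        self.total += cost
        if self.total > self.daily_budget:
            self.alert(f"Daily AI spend ${self.total:.2f} exceeded "
                       f"budget ${self.daily_budget:.2f}")
        return cost
```

Breaking spend down by feature is what makes the "know where your money goes" step actionable: a single noisy endpoint usually dominates.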
Worked example: reducing costs for a content moderation system
A content moderation system processes 100K messages daily using an expensive model at $0.03 per message ($3,000/day). After: simple messages route to a cheap model ($0.001), exact duplicates hit cache (30% of volume), and prompts are optimised (20% fewer tokens). New cost: $600/day — an 80% reduction with the same accuracy.
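One way the worked example's numbers can add up, assuming roughly 23K of the 70K non-cached messages still need the expensive model (an illustrative split, not stated in the original):

```python
messages_per_day = 100_000
baseline_price = 0.03                         # $ per message, expensive model
baseline = messages_per_day * baseline_price  # $3,000/day

cached = int(messages_per_day * 0.30)         # exact duplicates served free
remaining = messages_per_day - cached         # 70,000 API calls

# Illustrative split of the remaining traffic (assumption):
cheap_calls, expensive_calls = 47_000, 23_000
assert cheap_calls + expensive_calls == remaining

cheap_price = 0.001
expensive_price = baseline_price * 0.80       # 20% fewer tokens per call

new_cost = cheap_calls * cheap_price + expensive_calls * expensive_price
print(round(new_cost))                             # prints 599, i.e. ~$600/day
print(round(100 * (1 - new_cost / baseline)))      # prints 80 (% reduction)
```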
Common mistakes
- Routing all queries to the cheapest model — quality drops and users notice.
- Not monitoring costs until the bill arrives.
- Optimising prompts without measuring quality impact.
When to use something else
For choosing between open models and API models overall, see choosing between open and API models. For running models locally to eliminate API costs entirely, see running Gemma 4 locally.
How to apply this in a real AI project
Caching and routing deliver the most value when they are tied into the rest of your workflow. In real projects, the cost outcome depends on model selection, prompt design, tool integration, evaluation, and the operational reality of shipping AI features, not on any single technique applied correctly in isolation.
That is why the biggest win rarely comes from one clever move. It comes from making the surrounding process easier to review, easier to repeat, and easier to hand over when another person inherits the codebase later.
- Test with realistic inputs before shipping, not just the examples that inspired the idea.
- Keep the human review step visible so the workflow stays trustworthy as it scales.
- Measure what matters for your use case instead of relying on general benchmarks.
How to extend the workflow after this guide
Once the core technique works, the next leverage usually comes from standardising it: naming inputs clearly, keeping one review checklist, and pairing this page with the neighbouring guides so the process becomes repeatable rather than person-dependent.
The follow-on guides below are the most natural next steps from this one:
- How to Choose Between Open Models and API Models, for deciding which tiers your router should span.
- How to Run Gemma 4 on Your Own Machine, for workloads where local models can eliminate API costs entirely.
- How to Use Background Jobs in AI Apps for Long Tasks, for pairing batching with asynchronous processing.
Related guides on this site
These guides cover model selection, local alternatives, and production AI patterns.
- How to Choose Between Open Models and API Models
- How to Run Gemma 4 on Your Own Machine
- How to Use Background Jobs in AI Apps for Long Tasks
- How to Use Reasoning Summaries in Production AI Apps
- 5 Real-World Tasks Where Gemma 4 Beats Paid AI Models
Want to use AI tools more effectively?
My courses cover practical AI workflows, from spreadsheet automation to app development, with real projects and honest tool comparisons.
Browse AI courses