How to Cut Your AI API Costs by 60% Without Switching Models
The Cost Problem You're Probably Facing
Your AI API bill just hit $500 this month. Last month it was $250. You're using the same models, running similar workloads — but the costs keep climbing.
This is the reality for anyone building with AI agents, coding tools, or applications that make repeated API calls. A single agentic task might make 50–100 API calls as the model reasons through problems, refines code, or processes documents. Each call costs money. Scale this across even a modest user base and you're looking at $1000+ per month fast.
The bad news: there's no magic button that makes this problem disappear.
The good news: most teams are leaving 40–60% in cost savings on the table, completely unnecessarily. You don't need to switch models, reduce capability, or accept lower quality. You just need to be smarter about how you use the API.
Here are five concrete strategies that actually work.
Strategy 1: Prompt Caching — Cache Reads Are Essentially Free
Prompt caching is the single most effective cost-cutting technique available, and it's embarrassingly underused.
Here's how it works: when you send a prompt to an AI API, the provider can cache the prompt's prefix (the parts that stay identical across requests). If a later request starts with that same prefix, and arrives within the cache's lifetime (typically minutes to an hour), the API reuses the cached portion instead of reprocessing it. On most providers, cache reads cost around 90% less than fresh input tokens.
On Apertis, it's even better — cache reads are completely free. You only pay for the cache write (25% of input token cost) and nothing for subsequent reads.
Real numbers: let's say you're using Claude Opus (an expensive, highly capable model) with a 200K-token prompt:

- Input tokens: 200,000 tokens × $0.015/1K = $3.00
- Cache write: 200,000 tokens × $0.00375/1K (25% of the input rate) = $0.75
- Cache read: $0.00 (free on Apertis)

Now imagine 50 users analyzing the same document, one request each per day. Without caching:

- 50 users × $3.00 per request = $150 per day

With caching (after the first request):

- First request: $3.00 + $0.75 (cache write) = $3.75
- Next 49 requests: $0.00 each
- Total per day: $3.75 instead of $150, a savings of 97.5%
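These savings are easy to sanity-check. The sketch below uses the per-1K-token rates quoted above (illustrative Apertis pricing, not official figures):

```python
# Cost of 50 users analyzing the same 200K-token document.
# Rates are the per-1K-token prices quoted above (illustrative, not official).
INPUT_PER_1K = 0.015          # Claude Opus input
CACHE_WRITE_PER_1K = 0.00375  # 25% of the input rate
TOKENS = 200_000
USERS = 50

uncached = USERS * (TOKENS / 1000) * INPUT_PER_1K
# With caching: the first request pays input + cache write, the rest read for free.
cached = (TOKENS / 1000) * (INPUT_PER_1K + CACHE_WRITE_PER_1K)

print(f"Without caching: ${uncached:.2f}")              # $150.00
print(f"With caching:    ${cached:.2f}")                # $3.75
print(f"Savings:         {1 - cached / uncached:.1%}")  # 97.5%
```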
To enable caching, add `cache_control` to the prompt component you want cached:

```python
import openai

client = openai.OpenAI(
    api_key="sk-xxxx",
    base_url="https://api.apertis.ai/v1",
)

response = client.chat.completions.create(
    model="claude-opus-4-6-20250514",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an expert code reviewer. You analyze code and provide detailed feedback.",
                    "cache_control": {"type": "ephemeral"},  # enable caching
                }
            ],
        },
        {
            "role": "user",
            "content": "Review this code:\n\n[large codebase here]",
        },
    ],
)
```
The key insight: any prompt component that repeats across requests should be cached. System prompts, large documents, code repositories, knowledge bases — these are all perfect for caching.
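The same pattern extends beyond system prompts: cache the large shared component and leave the per-user part uncached. This sketch assumes Apertis accepts Anthropic-style content blocks with `cache_control` (the document and question are placeholders):

```python
def build_messages(shared_document: str, question: str) -> list:
    """Cache the large, repeated document; leave the per-user question uncached."""
    return [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": shared_document,
                    "cache_control": {"type": "ephemeral"},  # reused across requests
                },
                {"type": "text", "text": question},  # varies per request, not cached
            ],
        }
    ]
```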
Strategy 2: Context Compression — Cut Token Usage by 65%
Caching reduces repeated requests. But what about single requests that consume massive token counts?
Context compression is the answer. Apertis automatically compresses large contexts using advanced algorithms that summarize information while preserving critical details.
Add `:compress` to your model name and Apertis handles the rest:
```python
response = client.chat.completions.create(
    model="claude-opus-4-6-20250514:compress",  # enable compression
    messages=[
        {
            "role": "user",
            "content": "Summarize this 500-page technical document and answer my questions",
        }
    ],
)
```
Real impact: a developer analyzing a large codebase (2M tokens uncompressed) saw these results:

- Without compression: 2,000,000 input tokens × $0.015/1K = $30.00
- With compression: 700,000 tokens × $0.015/1K = $10.50
- Savings per request: $19.50 (a 65% reduction)
Compression works best for:
- Code review tasks (entire repositories)
- Document analysis (long PDFs, user documentation)
- Bulk content summarization
- Log analysis (application logs, server traces)
The tradeoff: compression adds a tiny bit of latency (~200ms) as the context is processed. For background tasks, this is invisible. For interactive chat, it's barely noticeable.
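To avoid paying that latency cost on small requests, you can gate the suffix on an estimated token count. The 4-characters-per-token heuristic and the 100K threshold here are assumptions, not Apertis guidance:

```python
def maybe_compress(model: str, context: str, threshold_tokens: int = 100_000) -> str:
    """Append ':compress' only for contexts large enough to benefit."""
    estimated_tokens = len(context) / 4  # rough heuristic: ~4 chars per token
    if estimated_tokens > threshold_tokens:
        return model + ":compress"
    return model

maybe_compress("claude-opus-4-6-20250514", "short prompt")  # → "claude-opus-4-6-20250514"
```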
Strategy 3: Smart Model Routing — Use the Right Model for Each Task
Not every task needs Claude Opus. Using your most expensive model for every request is like taking a private jet to buy groceries.
Apertis makes it trivial to route different tasks to different models:
```python
def get_ai_response(task_type, user_input):
    if task_type == "code_review":
        # Complex reasoning needed → Claude Opus ($0.015/1K input)
        model = "claude-opus-4-6-20250514"
    elif task_type == "code_completion":
        # Speed matters more than depth → Claude Sonnet ($0.003/1K input)
        model = "claude-sonnet-4-5-20250514"
    elif task_type == "quick_question":
        # Simple task → GPT-4o mini ($0.00015/1K input)
        model = "gpt-4o-mini"
    else:
        # Basic task → DeepSeek (free)
        model = "deepseek-v3-2"

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": user_input}],
    )
    return response
```
Here's how this looks in practice for a typical agentic workflow:
Request → Classify complexity → Route to appropriate model → Return response
- Simple API question → GPT-4o mini ($0.00015 per 1K input)
- Medium documentation → Claude Sonnet ($0.003 per 1K input)
- Complex reasoning → Claude Opus ($0.015 per 1K input)
- Free tier / prototype → DeepSeek V3.2 (free)
A typical SaaS that processes 10,000 requests per day might break down like:
- 40% simple requests (4,000 × $0.001 avg) = $4.00
- 40% medium requests (4,000 × $0.015 avg) = $60.00
- 15% complex requests (1,500 × $0.040 avg) = $60.00
- 5% free tier (500 × $0.00) = $0.00
- Daily cost: $124.00
- Monthly cost: $3,720.00
But if you were using Claude Opus for everything:
10,000 requests × $0.040 avg (Opus) = $400/day = $12,000/month
Smart routing saves you $8,280 per month with zero quality loss for most tasks.
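The daily and monthly figures above can be reproduced in a few lines (the request counts and per-request averages are the illustrative numbers from this article):

```python
# Daily cost under smart routing vs. Opus-for-everything,
# using the per-request averages quoted above (illustrative figures).
tiers = [
    (4_000, 0.001),  # simple → GPT-4o mini
    (4_000, 0.015),  # medium → Claude Sonnet
    (1_500, 0.040),  # complex → Claude Opus
    (500,   0.0),    # free tier → DeepSeek
]
routed_daily = sum(count * cost for count, cost in tiers)
opus_daily = 10_000 * 0.040

print(f"Routed:   ${routed_daily:.2f}/day, ${routed_daily * 30:,.2f}/month")
print(f"All-Opus: ${opus_daily:.2f}/day, ${opus_daily * 30:,.2f}/month")
print(f"Monthly savings: ${(opus_daily - routed_daily) * 30:,.2f}")  # $8,280.00
```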
Strategy 4: Coding Plans — 2x More Value Than Pay-As-You-Go
If you're using models for code generation, debugging, or anything related to software development, Apertis's Coding Plans offer exceptional value.
Here's the math:
PAYG Pricing (Claude Opus):
- Input: $0.015 per 1K tokens
- Output: $0.075 per 1K tokens
Coding Plan Lite ($12/month):
- Input: $0.0075 per 1K tokens (50% discount)
- Output: $0.030 per 1K tokens (60% discount)
- Monthly allowance: 100M input tokens + 25M output tokens
For a developer using Claude for coding tasks:
Typical coding request:
- Input: 50,000 tokens
- Output: 10,000 tokens
PAYG cost:
- Input: 50,000 × $0.015 / 1000 = $0.75
- Output: 10,000 × $0.075 / 1000 = $0.75
- Total: $1.50 per request
Coding Plan Lite cost:
- The $12/month fee covers the first 100M input + 25M output tokens, so typical usage is effectively flat-rate
- At 20 requests per month, that works out to $0.60 per request
- Breakeven vs. PAYG: 8 requests per month ($12 ÷ $1.50)
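The breakeven figure is just the monthly fee divided by the PAYG cost of a typical request (token counts and prices as quoted above):

```python
import math

# PAYG cost of one typical coding request (prices per 1K tokens, as quoted above).
payg = 50_000 / 1000 * 0.015 + 10_000 / 1000 * 0.075  # $0.75 input + $0.75 output
plan_fee = 12.00  # Coding Plan Lite, monthly

breakeven = math.ceil(plan_fee / payg)
print(f"PAYG per request: ${payg:.2f}")       # $1.50
print(f"Breakeven: {breakeven} requests/mo")  # 8
```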
For teams with active development:
- Lite Plan ($12/mo): Best for individuals, small teams, or occasional use
- Pro Plan ($25/mo): Typical for startup technical teams (unlimited tokens + priority)
- Max Plan ($200/mo): For large teams, enterprises, or power users
Most teams hit ROI within the first week.
Strategy 5: Use Free Models for Prototyping and Development
Why pay anything while you're building and testing?
Apertis offers 9 completely free models:
- DeepSeek V3.2 (Strong general reasoning)
- Gemini 3 Flash (Fast multimodal)
- GPT-5.1 Codex Mini (Coding-specific)
- Claude 3.2 Haiku (Ultra-fast)
- Qwen 2.5 72B (Strong reasoning)
- Mixtral 8x7B (Open-source quality)
- Llama 3.1 70B (General purpose)
- Grok 2 Mini (Web knowledge)
- Mistral 7B (Efficient reasoning)
Use these for:
- Prototyping: Build your feature with free models, then optimize
- Testing: Verify API integration before paying
- Development: Internal tooling, documentation generation
- Cheap tasks: Simple classification, templated content, basic Q&A
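A simple way to keep development and CI on the free tier is to pick the model from the environment. The `APP_ENV` variable and the model identifiers here are assumptions for illustration:

```python
import os

def pick_model() -> str:
    """Use a free model everywhere except production."""
    if os.environ.get("APP_ENV") == "production":
        return "claude-sonnet-4-5-20250514"  # paid, production-grade
    return "deepseek-v3-2"  # free tier for dev, CI, and prototyping
```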
One team we talked to built their entire MVP on free models, then switched to paid models when they needed production-grade reliability, saving roughly $2,000 in early-stage costs.
Real Example: From $150/month to $55/month
Here's how a real developer optimized their costs:
Before optimization:
- Using Claude Opus for everything
- No caching implemented
- 100 API calls per day
- Average 80K tokens per request
- Cost: $150/month
Optimization applied:
1. Added prompt caching for system prompts
→ Saved 40% on repeated prompts
2. Implemented context compression for document analysis
→ Reduced token usage by 65% on heavy tasks
3. Switched simple tasks to GPT-4o mini
→ 60% of requests don't need Opus
4. Added `:web` suffix for live data queries
→ Eliminated redundant information requests
5. Subscribed to Coding Plan Lite
→ Got 50% discount on all tokens
After optimization:
- Mixed model approach (Opus + Sonnet + mini + DeepSeek)
- Caching active on 40% of requests
- Compression on document tasks
- Coding Plan discount applied
- Same quality, same capability
- Cost: $55/month
- Savings: $95/month (63% reduction)
Same features. Same user experience. 63% cheaper.
Putting It All Together
The most cost-effective setup combines all five strategies:
```python
def get_optimized_response(task_type, user_input, large_context=None):
    # 1. Select model based on task (Strategy 3)
    if task_type == "code_generation":
        model = "claude-sonnet-4-5-20250514"
    elif task_type == "simple_query":
        model = "gpt-4o-mini"
    else:
        model = "deepseek-v3-2"  # free fallback (Strategy 5)

    # 2. Enable compression for large contexts (Strategy 2)
    if large_context and len(large_context) > 100_000:
        model = model + ":compress"

    # Prepend the large context (if any) to the user's input
    user_content = f"{large_context}\n\n{user_input}" if large_context else user_input

    # 3. Build messages with cache_control on the repeated system prompt (Strategy 1)
    messages = [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an expert developer assistant.",
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": user_content},
    ]

    # 4. Make the request (Strategy 4 plan billing is applied automatically)
    response = client.chat.completions.create(
        model=model,
        messages=messages,
    )
    return response
```
The Bottom Line
You don't need to sacrifice quality or capability to cut your AI API costs significantly. Most teams are overspending due to suboptimal routing, missing caching, and not leveraging subscription plans.
Start with caching (biggest impact, easiest to implement). Then add smart routing. Then evaluate a Coding Plan if you do any development work. Finally, compress heavy contexts.
The realistic expectation: a 40–60% cost reduction within a week, with only small code changes (a model suffix here, a `cache_control` field there) and no loss of quality.
If you're paying more than $100/month for AI APIs, you're probably leaving money on the table.
Ready to optimize? Sign up for Apertis AI and start using these strategies today. With 500+ models and built-in caching, compression, and routing, you have all the tools you need.