
What Is Prompt Caching?
How to Save 90% on Repeated AI Context

Prompt caching lets you save a prefix of your prompt (system instructions, documents, examples) so that subsequent calls reuse it at a fraction of the cost. Anthropic's implementation cuts cache-read costs by 90% vs standard input pricing. Last verified: 2026-04-01.

8 min read·Updated April 2026
Prompt Caching Quick Reference (Claude, April 2026)
  • 90% off: cache read vs standard input
  • $0.10/M: Haiku 4.5 cache reads
  • 5 min: default cache TTL
  • 1,024 tok: minimum cacheable prefix

How Prompt Caching Works

Without caching, every API call processes your entire prompt from scratch — even if the first 10,000 tokens are identical across all calls (e.g., the same system prompt or document).

With caching, you mark a prefix as cacheable. The first call writes it to cache (cache write, charged at a small premium). All subsequent calls within the TTL read from cache (cache read, charged at 90% off standard input price).

# Standard call — pays full input price for the system prompt every time
# (in the Messages API, `system` is a top-level parameter, not a message role)
{"system": "You are a helpful assistant. [2,000 tokens of instructions]..."}
# With caching — cache_control marks the system prompt as a cacheable prefix
{"system": [{"type": "text", "text": "...", "cache_control": {"type": "ephemeral"}}]}

Cache Pricing: All Claude Models

Model | Standard input | Cache write (5-min) | Cache write (1-hour) | Cache read | Savings on read
Claude Haiku 4.5 | $1.00/M | $1.25/M | $2.00/M | $0.10/M | 90%
Claude Sonnet 4.6 | $3.00/M | $3.75/M | $6.00/M | $0.30/M | 90%
Claude Opus 4.6 | $5.00/M | $6.25/M | $10.00/M | $0.50/M | 90%

Cache write costs 25% more than standard input (5-min TTL) or 2× more (1-hour TTL). Cache reads break even after just a few calls.

The Break-Even Calculation

Cache writes cost 25% more than standard input; cache reads cost 10% of it. The first call pays a one-time premium of 0.25× per cached token, and every subsequent read saves 0.90× per token. Break-even is the number of reads needed to repay that premium.

Formula: break-even reads = 0.25 / (1 − 0.10) ≈ 0.28 → the premium is recovered on the very first cache read.

From the first cache read onward, every call saves 90% vs standard pricing. Any prefix reused at least once within the TTL makes caching profitable.
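The crossover can be checked directly by treating the first call as the cache write. Costs below are in standard-input-token units, using the 5-minute-TTL multipliers from the pricing table:

```python
# Break-even check: write = 1.25x standard input, read = 0.10x.

WRITE_MULT, READ_MULT = 1.25, 0.10

def cached_cost(n_calls: int) -> float:
    """Relative cost of n_calls on one prefix: one write, then cache reads."""
    return WRITE_MULT + (n_calls - 1) * READ_MULT

def standard_cost(n_calls: int) -> float:
    """Relative cost without caching: full input price every call."""
    return float(n_calls)

# Write premium (0.25x) divided by per-read saving (0.90x):
break_even_reads = (WRITE_MULT - 1.0) / (1.0 - READ_MULT)  # ~0.28

assert cached_cost(1) > standard_cost(1)  # one call only: caching costs more
assert cached_cost(2) < standard_cost(2)  # a single cache read already wins
```

Two calls cost 1.35 units with caching vs 2.0 without, so caching is ahead as soon as the prefix is reused once.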

Worked Example: SaaS Chatbot with 100K Calls/Month

System prompt: 2,000 tokens (company context, instructions, persona). Each call also includes user message (200 tokens) + history (300 tokens).

Scenario | Input tokens/call | Cost/call | Monthly cost (100K calls)
Without caching (Haiku) | 2,500 total input | $0.00250 | $250
With caching (Haiku), system prompt cached | 2,000 cache read + 500 uncached | $0.00070 | $70 + ~$3 (writes) = $73
Monthly savings | | | $177/month (71% saved)

Cache writes: with a 5-minute TTL, the prefix must be rewritten whenever a traffic gap lets the cache expire; at roughly 1,200 rewrites per month, that's 1,200 × 2,000 tokens × $1.25/M ≈ $3. Effective savings are ~70-72% once write overhead is included.
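The monthly figures can be reproduced with a few lines of arithmetic. The ~1,200 monthly cache rewrites are an assumption chosen to match the ~$3 write overhead in the table; the other numbers come straight from the scenario:

```python
# Worked example: Haiku 4.5 pricing, 2,000-token cached system prompt,
# 500 uncached tokens per call, 100K calls/month.

INPUT, READ, WRITE = 1.00e-6, 0.10e-6, 1.25e-6  # $/token

calls = 100_000
without = calls * 2_500 * INPUT                  # all 2,500 tokens at full price
reads = calls * (2_000 * READ + 500 * INPUT)     # cached prefix + uncached tail
writes = 1_200 * 2_000 * WRITE                   # ~$3 of assumed cache rewrites
with_cache = reads + writes

savings = without - with_cache                   # dollars saved per month
pct = savings / without                          # fraction saved
```

Running this gives $250 without caching vs $73 with it, i.e. about 71% saved after write overhead.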

Haiku cache reads ($0.10/M) are cheaper than GPT-5.4 nano's standard input ($0.20/M). If your system prompt is large and reused heavily, Claude Haiku 4.5 with caching active can become the lowest-cost option in the market — even cheaper than nano on a per-effective-token basis.

When Caching Helps Most

  • Large system prompts (500+ tokens): Instructions, persona, product context, examples — anything repeated across all calls
  • Document-in-context: The same policy document, knowledge base, or legal text injected into every query
  • Agent memory: A persistent memory block representing the agent's accumulated context
  • Few-shot examples: A bank of 10–20 labeled examples included in every prompt for consistency

When Caching Does NOT Help

  • Dynamic prefixes: If every call has a unique user-specific prefix, there's nothing to cache
  • Low-volume endpoints: Below ~3 calls/hour on the same prefix, cache may expire before it's useful
  • Short system prompts (<1,024 tokens): Anthropic requires at least 1,024 cacheable tokens — short prompts don't qualify
  • Output-dominated costs: If you're generating long outputs (reports, content), the savings on cached input tokens are less impactful relative to output cost
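The two lists above boil down to a rough eligibility check. This is a heuristic sketch: the 1,024-token minimum is Anthropic's documented floor, while the calls-per-hour cutoff mirrors the article's "~3 calls/hour" rule of thumb and is not an API limit:

```python
# Rough "is caching worth it?" check distilled from the lists above.

def caching_worthwhile(prefix_tokens: int, calls_per_hour: float) -> bool:
    """True if a shared prefix plausibly benefits from caching."""
    if prefix_tokens < 1024:   # below the minimum cacheable prefix
        return False
    if calls_per_hour < 3:     # cache likely expires between calls
        return False
    return True

assert caching_worthwhile(2_000, 50)     # large, heavily reused system prompt
assert not caching_worthwhile(600, 50)   # prefix too short to cache
assert not caching_worthwhile(2_000, 1)  # too infrequent for the TTL
```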

TTL Options: 5-Minute vs 1-Hour Cache

Anthropic offers two cache durations:

  • Ephemeral (5 minutes): Write costs 25% premium. Best for high-frequency chatbots where the same user makes multiple turns quickly
  • 1-hour cache: Write costs 2× standard input. Best for document Q&A sessions where context needs to persist across a longer working session

For most chatbots, ephemeral (5-min) caching is sufficient — users rarely take more than 5 minutes between messages. Use 1-hour for document analysis workflows.
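To see when the 2× write premium pays off, consider a document-analysis session whose queries are spaced more than five minutes apart: a 5-minute cache expires and is rewritten before every call, while a 1-hour cache is written once. A sketch in relative standard-input-token units:

```python
# Compare TTLs for a session of n queries spread over an hour, with gaps
# longer than 5 minutes between queries (worst case for the ephemeral cache).

def session_cost(n_queries: int, prefix_tokens: int, ttl: str) -> float:
    """Relative cost of the cached prefix across the session."""
    if ttl == "5m":
        # Every query re-pays the 1.25x write premium (cache expired).
        return n_queries * prefix_tokens * 1.25
    if ttl == "1h":
        # One 2x write, then 0.10x cache reads for the rest.
        return prefix_tokens * 2.0 + (n_queries - 1) * prefix_tokens * 0.10
    raise ValueError(f"unknown ttl: {ttl}")

# Six queries over an hour against a 10,000-token document:
five_min = session_cost(6, 10_000, "5m")
one_hour = session_cost(6, 10_000, "1h")
```

Here the 1-hour cache costs 25,000 token-units vs 75,000 for the repeatedly expiring 5-minute cache, which is why slower-paced document workflows justify the larger write premium.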

See Your Prompt Caching Savings

Enter your system prompt size and monthly call volume to calculate exact savings with caching.

AI API Cost Calculator