Prompt Caching Explained:
How to Cut LLM Costs by Up to 90%
Prompt caching lets you reuse repeated context (system prompts, documents, knowledge bases) across API calls at a fraction of the input token cost. Anthropic offers the most mature implementation in 2026, with cache reads discounted 90% below standard input pricing. Last verified: 2026-04-01.
What Is Prompt Caching?
Normally, every API call sends the full prompt (system prompt + context + user message) to the model. The provider charges for every token on every call — even if most of the prompt is identical across requests.
With prompt caching, the provider processes the static portion of your prompt once, stores it server-side, and on subsequent calls reads the cached version at a steeply discounted rate. You pay full price to write the cache once, then a fraction to read it on every subsequent call.
Anthropic Prompt Caching Pricing (April 2026)
| Model | Standard Input ($/MTok) | Cache Write, 5m ($/MTok) | Cache Write, 1h ($/MTok) | Cache Read ($/MTok) | Read Savings |
|---|---|---|---|---|---|
| Claude Haiku 4.5 | $1.00 | $1.25 | $2.00 | $0.10 | 90% |
| Claude Sonnet 4.6 | $3.00 | $3.75 | $6.00 | $0.30 | 90% |
| Claude Opus 4.6 | $5.00 | $6.25 | $10.00 | $0.50 | 90% |
Cache TTL options: 5-minute cache is sufficient for high-frequency applications where the same context is reused within a short window. 1-hour cache (higher write cost) is appropriate for knowledge bases and documents that change rarely but need to be available longer between reads.
When Prompt Caching Saves Money
Caching is most valuable when:
- Large static context: System prompts, knowledge bases, or documents that are the same across many user requests
- High call volume: The more calls reuse the cached context, the more you save
- Repeated document processing: Sending the same document to multiple users or processing it multiple times
- RAG with shared context: A fixed set of documents retrieved for many queries
- Agent frameworks: Large agent system prompts repeated across every tool call in a multi-step workflow
Savings Calculator
Formula for monthly savings from prompt caching (all rates in $/MTok):
Monthly savings = (calls × prompt tokens × standard rate) − (cache writes × prompt tokens × write rate + cache reads × prompt tokens × read rate)
Example: SaaS with 1,000-token system prompt, 100,000 calls/month on Sonnet
- Standard: 100K × 1K tokens × $3.00/M = $300/month
- With 5-min cache: ~200 writes × 1K tokens × $3.75/M + 99,800 reads × 1K × $0.30/M = $0.75 + $29.94 = $30.69/month
- Savings: $269.31/month (89.8%)
Example: Document analysis app with 10,000-token document, 5,000 analyses/month on Sonnet
- Standard: 5K × 10K tokens × $3.00/M = $150/month
- With 5-min cache (assuming a ~90% cache hit rate: 500 writes, 4,500 reads): 500 writes × 10K × $3.75/M + 4,500 reads × 10K × $0.30/M = $18.75 + $13.50 = $32.25/month
- Savings: $117.75/month (78.5%)
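The two worked examples above can be reproduced with a small helper. The rates are the Sonnet figures from the pricing table, and the write counts are the assumptions stated in each example.

```python
def caching_savings(calls, prompt_tokens, cache_writes,
                    standard_rate=3.00, write_rate=3.75, read_rate=0.30):
    """Monthly cost with and without prompt caching.

    Rates are $ per million tokens (Sonnet defaults from the table).
    Every call that doesn't write the cache is assumed to read it.
    """
    mtok = prompt_tokens / 1_000_000
    standard_cost = calls * mtok * standard_rate
    cached_cost = (cache_writes * mtok * write_rate
                   + (calls - cache_writes) * mtok * read_rate)
    return standard_cost, cached_cost, standard_cost - cached_cost

# Example 1: 1,000-token system prompt, 100K calls/month, ~200 cache writes
std1, cached1, saved1 = caching_savings(100_000, 1_000, 200)
# std1 = 300.00, cached1 = 30.69, saved1 = 269.31

# Example 2: 10,000-token document, 5,000 analyses/month, 500 cache writes
std2, cached2, saved2 = caching_savings(5_000, 10_000, 500)
# std2 = 150.00, cached2 = 32.25, saved2 = 117.75
```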
When Caching Does NOT Help
- Short or unique prompts: If every user sends a different prompt with little shared content, there's nothing to cache
- Low call volume: The 25% cache write premium is recovered after roughly the first cache read, but with only a handful of reads the absolute dollar savings are too small to justify the added complexity
- Rapidly changing context: If your system prompt or document changes frequently, each change invalidates the cache, requiring a new write at full price
- Budget-tier models with cheap standard pricing: If you're already on GPT-5.4 nano at $0.20/M, the absolute savings from caching are smaller even if the percentage is similar
How to Implement (Claude API)
The `cache_control: {"type": "ephemeral"}` flag signals Anthropic to cache that content block. On the first call, it writes the cache. On subsequent calls within the TTL window, you'll see `cache_read_input_tokens` in the usage object confirming the cache hit.
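A minimal sketch of the request shape, as a helper that builds the keyword arguments for the SDK's `messages.create` call. The helper name and model ID are illustrative; the `cache_control` block and the `ttl` field follow Anthropic's documented syntax.

```python
def build_cached_request(static_system_text, user_message,
                         model="claude-sonnet-4-6", ttl="5m"):
    """Build kwargs for client.messages.create() with the static system
    prompt marked for caching. Helper name and model ID are illustrative."""
    cache_control = {"type": "ephemeral"}  # default 5-minute TTL
    if ttl == "1h":
        cache_control["ttl"] = "1h"  # extended-TTL variant, higher write cost
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [{
            "type": "text",
            "text": static_system_text,  # must clear the minimum cacheable length
            "cache_control": cache_control,
        }],
        "messages": [{"role": "user", "content": user_message}],
    }

# Usage with the official SDK (requires ANTHROPIC_API_KEY):
# import anthropic
# client = anthropic.Anthropic()
# resp = client.messages.create(**build_cached_request(BIG_PROMPT, "Hi"))
# resp.usage.cache_read_input_tokens  # nonzero on a cache hit
```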
Practical Architecture Patterns
Pattern 1 — Large system prompt caching
Place all static instructions, persona definition, and tool descriptions in the system array with cache_control. Keep only the dynamic user message uncached. Most appropriate for: customer support bots, internal assistants, chatbots with rich personas.
Pattern 2 — Document pre-loading
Load a document or knowledge base into the first user message with cache_control, then send multiple follow-up questions about the same document. The document is cached; only each question incurs full input cost. Most appropriate for: legal review, financial analysis, multi-query document processing.
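A sketch of this pattern, assuming a hypothetical helper that assembles the message list: the document is the first content block with `cache_control`, and the per-call question follows it uncached (the cache breakpoint covers everything up to and including the flagged block).

```python
def document_session_messages(document_text, question):
    """Build a messages array where the document is cached and only the
    question changes per call. Helper name is illustrative."""
    return [{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": document_text,
                "cache_control": {"type": "ephemeral"},  # cached document
            },
            {"type": "text", "text": question},  # uncached, swapped each call
        ],
    }]

# Each follow-up question pays full input price only for the question text:
# client.messages.create(model=..., max_tokens=1024,
#                        messages=document_session_messages(CONTRACT, q))
```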
Pattern 3 — Agent context caching
Cache the agent's large system prompt (tool definitions, instructions, examples). Only the current conversation turn and tool results are uncached. Each step in a multi-step agent workflow then pays only for the new tokens. Most appropriate for: coding agents, research agents, workflow orchestration.
Frequently Asked Questions
Does OpenAI offer prompt caching?
OpenAI has introduced some caching capabilities but does not offer the same explicit cache_control API with 90% discounts on cache reads as Anthropic. For workloads requiring explicit, predictable prompt caching economics, Anthropic is the current leader in 2026.
What's the minimum content size to cache?
Anthropic's prompt caching requires a minimum cacheable length: 1,024 tokens on Sonnet and Opus models (2,048 on Haiku). Content shorter than the threshold is simply processed without caching, so design your cached blocks to clear it.
Does caching affect response quality?
No. A cached prompt is processed identically to a non-cached one — the model sees the same context. Caching is purely an infrastructure optimization; it has no effect on model behavior, response quality, or output determinism.
Can I cache multiple sections?
Yes. You can apply cache_control to multiple content blocks in the same request (Anthropic supports up to four cache breakpoints): system prompt, document 1, document 2, and so on, each with its own cache TTL. Each block is cached and billed independently.
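A sketch of a system array with three independently cached blocks, mixing TTLs: the stable persona on the 1-hour tier, two documents on the 5-minute default. The helper name is illustrative; the `ttl` field follows Anthropic's extended-TTL syntax.

```python
def multi_block_system(persona_text, doc_a, doc_b):
    """System array with three cache breakpoints. Helper name is
    illustrative; each flagged block is cached and billed separately."""
    return [
        {"type": "text", "text": persona_text,
         "cache_control": {"type": "ephemeral", "ttl": "1h"}},  # rarely changes
        {"type": "text", "text": doc_a,
         "cache_control": {"type": "ephemeral"}},  # 5-minute default
        {"type": "text", "text": doc_b,
         "cache_control": {"type": "ephemeral"}},
    ]
```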
Calculate Your Caching Savings
Enter your system prompt size and call volume to see how much prompt caching saves per month.