Prompt Caching Explained:
How to Cut LLM Costs by Up to 90%
Prompt caching lets you reuse repeated context (system prompts, documents, knowledge bases) across API calls at a fraction of the input token cost. Anthropic offers the most mature implementation in 2026, with cache reads discounted 90% below standard input pricing. Last verified: 2026-04-01.
What Is Prompt Caching?
Normally, every API call sends the full prompt (system prompt + context + user message) to the model. The provider charges for every token on every call — even if most of the prompt is identical across requests.
With prompt caching, the provider processes the static portion of your prompt once, stores it server-side, and on subsequent calls reads the cached version at a steeply discounted rate. You pay full price to write the cache once, then a fraction to read it on every subsequent call.
Anthropic Prompt Caching Pricing (April 2026)
| Model | Standard Input ($/MTok) | Cache Write, 5m ($/MTok) | Cache Write, 1h ($/MTok) | Cache Read ($/MTok) | Read Savings |
|---|---|---|---|---|---|
| Claude Haiku 4.5 | $1.00 | $1.25 | $2.00 | $0.10 | 90% |
| Claude Sonnet 4.6 | $3.00 | $3.75 | $6.00 | $0.30 | 90% |
| Claude Opus 4.6 | $5.00 | $6.25 | $10.00 | $0.50 | 90% |
Cache TTL options: 5-minute cache is sufficient for high-frequency applications where the same context is reused within a short window. 1-hour cache (higher write cost) is appropriate for knowledge bases and documents that change rarely but need to be available longer between reads.
When Prompt Caching Saves Money
Caching is most valuable when:
- Large static context: System prompts, knowledge bases, or documents that are the same across many user requests
- High call volume: The more calls reuse the cached context, the more you save
- Repeated document processing: Sending the same document to multiple users or processing it multiple times
- RAG with shared context: A fixed set of documents retrieved for many queries
- Agent frameworks: Large agent system prompts repeated across every tool call in a multi-step workflow
Savings Calculator
Formula for monthly savings from prompt caching (all rates in $/MTok):
Monthly savings = (calls × prompt tokens × standard rate) − (cache writes × prompt tokens × write rate + cache reads × prompt tokens × read rate)
Example: SaaS with 1,000-token system prompt, 100,000 calls/month on Sonnet
- Standard: 100K × 1K tokens × $3.00/M = $300/month
- With 5-min cache: ~200 writes × 1K tokens × $3.75/M + 99,800 reads × 1K × $0.30/M = $0.75 + $29.94 = $30.69/month
- Savings: $269.31/month (89.8%)
Example: Document analysis app with 10,000-token document, 5,000 analyses/month on Sonnet
- Standard: 5K × 10K tokens × $3.00/M = $150/month
- With 5-min cache (assuming a ~90% cache hit rate: 500 writes, 4,500 reads): 500 writes × 10K × $3.75/M + 4,500 reads × 10K × $0.30/M = $18.75 + $13.50 = $32.25/month
- Savings: $117.75/month (78.5%)
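The two worked examples above can be reproduced with a small helper. The rates are the Sonnet figures from the pricing table, and the write counts are the assumptions stated in each example.

```python
def caching_savings(calls, prompt_tokens, cache_writes,
                    standard_rate=3.00, write_rate=3.75, read_rate=0.30):
    """Monthly cost with and without prompt caching.

    Rates are $ per million tokens (Sonnet defaults from the table).
    Every call that doesn't write the cache is assumed to read it.
    """
    mtok = prompt_tokens / 1_000_000
    standard_cost = calls * mtok * standard_rate
    cached_cost = (cache_writes * mtok * write_rate
                   + (calls - cache_writes) * mtok * read_rate)
    return standard_cost, cached_cost, standard_cost - cached_cost

# Example 1: 1,000-token system prompt, 100K calls/month, ~200 cache writes
std1, cached1, saved1 = caching_savings(100_000, 1_000, 200)
# std1 = 300.00, cached1 = 30.69, saved1 = 269.31

# Example 2: 10,000-token document, 5,000 analyses/month, 500 cache writes
std2, cached2, saved2 = caching_savings(5_000, 10_000, 500)
# std2 = 150.00, cached2 = 32.25, saved2 = 117.75
```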
When Caching Does NOT Help
- Short or unique prompts: If every user sends a different prompt with little shared content, there's nothing to cache
- Low call volume: The 25% cache write premium is recovered after roughly the first cache read, but with only a handful of reads the absolute dollar savings are too small to justify the added complexity
- Rapidly changing context: If your system prompt or document changes frequently, each change invalidates the cache, requiring a new write at full price
- Budget-tier models with cheap standard pricing: If you're already on GPT-5.4 nano at $0.20/M, the absolute savings from caching are smaller even if the percentage is similar
How to Implement (Claude API)
The `cache_control: {"type": "ephemeral"}` flag signals Anthropic to cache that content block. On the first call, it writes the cache. On subsequent calls within the TTL window, you'll see `cache_read_input_tokens` in the usage object confirming the cache hit.
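A minimal sketch of the request shape, as a helper that builds the keyword arguments for the SDK's `messages.create` call. The helper name and model ID are illustrative; the `cache_control` block and the `ttl` field follow Anthropic's documented syntax.

```python
def build_cached_request(static_system_text, user_message,
                         model="claude-sonnet-4-6", ttl="5m"):
    """Build kwargs for client.messages.create() with the static system
    prompt marked for caching. Helper name and model ID are illustrative."""
    cache_control = {"type": "ephemeral"}  # default 5-minute TTL
    if ttl == "1h":
        cache_control["ttl"] = "1h"  # extended-TTL variant, higher write cost
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [{
            "type": "text",
            "text": static_system_text,  # must clear the minimum cacheable length
            "cache_control": cache_control,
        }],
        "messages": [{"role": "user", "content": user_message}],
    }

# Usage with the official SDK (requires ANTHROPIC_API_KEY):
# import anthropic
# client = anthropic.Anthropic()
# resp = client.messages.create(**build_cached_request(BIG_PROMPT, "Hi"))
# resp.usage.cache_read_input_tokens  # nonzero on a cache hit
```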
Practical Architecture Patterns
Pattern 1 — Large system prompt caching
Place all static instructions, persona definition, and tool descriptions in the system array with cache_control. Keep only the dynamic user message uncached. Most appropriate for: customer support bots, internal assistants, chatbots with rich personas.
Pattern 2 — Document pre-loading
Load a document or knowledge base into the first user message with cache_control, then send multiple follow-up questions about the same document. The document is cached; only each question incurs full input cost. Most appropriate for: legal review, financial analysis, multi-query document processing.
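A sketch of this pattern, assuming a hypothetical helper that assembles the message list: the document is the first content block with `cache_control`, and the per-call question follows it uncached (the cache breakpoint covers everything up to and including the flagged block).

```python
def document_session_messages(document_text, question):
    """Build a messages array where the document is cached and only the
    question changes per call. Helper name is illustrative."""
    return [{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": document_text,
                "cache_control": {"type": "ephemeral"},  # cached document
            },
            {"type": "text", "text": question},  # uncached, swapped each call
        ],
    }]

# Each follow-up question pays full input price only for the question text:
# client.messages.create(model=..., max_tokens=1024,
#                        messages=document_session_messages(CONTRACT, q))
```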
Pattern 3 — Agent context caching
Cache the agent's large system prompt (tool definitions, instructions, examples). Only the current conversation turn and tool results are uncached. Each step in a multi-step agent workflow then pays only for the new tokens. Most appropriate for: coding agents, research agents, workflow orchestration.
Frequently Asked Questions
Does OpenAI offer prompt caching?
OpenAI has introduced some caching capabilities but does not offer the same explicit cache_control API with 90% discounts on cache reads as Anthropic. For workloads requiring explicit, predictable prompt caching economics, Anthropic is the current leader in 2026.
What's the minimum content size to cache?
Anthropic's prompt caching requires a minimum cacheable length: 1,024 tokens on Sonnet and Opus models (2,048 on Haiku). Content shorter than the threshold is simply processed without caching, so design your cached blocks to clear it.
Does caching affect response quality?
No. A cached prompt is processed identically to a non-cached one — the model sees the same context. Caching is purely an infrastructure optimization; it has no effect on model behavior, response quality, or output determinism.
Can I cache multiple sections?
Yes. You can apply cache_control to multiple content blocks in the same request (Anthropic supports up to four cache breakpoints): system prompt, document 1, document 2, and so on, each with its own cache TTL. Each block is cached and billed independently.
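A sketch of a system array with three independently cached blocks, mixing TTLs: the stable persona on the 1-hour tier, two documents on the 5-minute default. The helper name is illustrative; the `ttl` field follows Anthropic's extended-TTL syntax.

```python
def multi_block_system(persona_text, doc_a, doc_b):
    """System array with three cache breakpoints. Helper name is
    illustrative; each flagged block is cached and billed separately."""
    return [
        {"type": "text", "text": persona_text,
         "cache_control": {"type": "ephemeral", "ttl": "1h"}},  # rarely changes
        {"type": "text", "text": doc_a,
         "cache_control": {"type": "ephemeral"}},  # 5-minute default
        {"type": "text", "text": doc_b,
         "cache_control": {"type": "ephemeral"}},
    ]
```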
Calculate Your Caching Savings
Enter your system prompt size and call volume to see how much prompt caching saves per month.