
What Is the Batch API?
50% Off AI Inference for Async Workloads Explained

The Batch API is a pricing tier offered by Anthropic and OpenAI that cuts AI inference costs by 50% for asynchronous, non-real-time jobs. Jobs complete within 24 hours. This guide explains how it works, when to use it, and how to implement it. Last verified: 2026-04-01.

7 min read · Updated April 2026
Batch API Pricing (2026)

  • 50% off: discount vs standard API
  • <24 hrs: job completion time
  • $0.50/M: Claude Haiku 4.5 batch input
  • $1.25/M: GPT-5.4 nano batch input

How the Batch API Works

Instead of sending API requests one at a time and waiting for each response (real-time inference), the Batch API lets you:

  1. Submit a JSONL file containing many requests in a single job
  2. Receive a job ID immediately
  3. Poll for completion (or use a webhook) — jobs complete within 24 hours
  4. Download results as a JSONL file when complete

The provider can schedule your batch during off-peak capacity, which is why it can pass the savings on to you as a 50% discount. The trade-off is latency: you don't get responses instantly.
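Step 1 above can be sketched in code. The snippet below builds the JSONL payload in memory using OpenAI's documented batch line shape (`custom_id`, `method`, `url`, `body`); the model name and prompts are placeholder values for illustration.

```python
import json

def build_batch_jsonl(prompts, model="gpt-5.4-nano"):
    """Serialize one batch request per line in OpenAI's JSONL batch format."""
    lines = []
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"req-{i}",  # used later to match results to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        lines.append(json.dumps(request))
    return "\n".join(lines)

jsonl = build_batch_jsonl(["Summarize doc A", "Summarize doc B"])
```

Each line is an independent request, which is what lets the provider fan the job out across off-peak capacity and return results keyed by `custom_id`.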

Batch vs Real-Time Pricing: All Major Models

| Model | Standard input /1M | Batch input /1M | Standard output /1M | Batch output /1M |
|---|---|---|---|---|
| Claude Haiku 4.5 | $1.00 | $0.50 | $5.00 | $2.50 |
| Claude Sonnet 4.6 | $3.00 | $1.50 | $15.00 | $7.50 |
| Claude Opus 4.6 | $5.00 | $2.50 | $25.00 | $12.50 |
| GPT-5.4 nano | $0.20 | $0.10 | $1.25 | $0.625 |
| GPT-5.4 mini | $0.75 | $0.375 | $4.50 | $2.25 |
| GPT-5.4 | $2.50 | $1.25 | $15.00 | $7.50 |

Gemini 2.5 models do not currently offer a Batch API equivalent — their standard pricing is already lower than batch pricing on comparable OpenAI/Anthropic models for many use cases.
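The pricing table above can be turned into a quick cost check. A minimal sketch, with rates hardcoded from the table (two models shown; extend the dict for the rest):

```python
# Per-million-token rates (USD), copied from the pricing table above
RATES = {
    "claude-haiku-4.5": {"standard_in": 1.00, "batch_in": 0.50,
                         "standard_out": 5.00, "batch_out": 2.50},
    "gpt-5.4-nano":     {"standard_in": 0.20, "batch_in": 0.10,
                         "standard_out": 1.25, "batch_out": 0.625},
}

def monthly_cost(model, m_in, m_out, batch=False):
    """Cost in USD for m_in / m_out millions of input / output tokens."""
    r = RATES[model]
    if batch:
        return m_in * r["batch_in"] + m_out * r["batch_out"]
    return m_in * r["standard_in"] + m_out * r["standard_out"]

# 100M input + 20M output tokens on Haiku 4.5:
standard = monthly_cost("claude-haiku-4.5", 100, 20)              # $200
batched = monthly_cost("claude-haiku-4.5", 100, 20, batch=True)   # $100
```

Because every batch rate in the table is exactly half the standard rate, the batch total is always half the standard total regardless of your input/output mix.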

When to Use the Batch API

| Use case | Good for batch? | Why |
|---|---|---|
| Nightly document enrichment | Yes | All documents processed overnight, results ready by morning |
| Bulk content generation | Yes | Product descriptions, email sequences, article outlines — no user waiting |
| Training data labeling | Yes | Thousands of samples can be labeled in a single batch job |
| Contract/document analysis | Yes | Upload 1,000 contracts, get structured analysis by end of day |
| Embeddings at scale | Yes | Large corpus indexing can run overnight in batch |
| Real-time chatbot responses | No | Users expect sub-second responses — batch doesn't work for interactive UX |
| Live customer support | No | Tickets need resolution in minutes, not hours |
| On-demand code completion | No | Latency requirement is <200ms — incompatible with batch model |

Batch API Implementation (Anthropic)

Anthropic's Message Batches API accepts up to 10,000 requests per batch:

# Create a batch
import time

import anthropic

client = anthropic.Anthropic()
batch = client.beta.messages.batches.create(
    requests=[
        {"custom_id": "req-1", "params": {
            "model": "claude-haiku-4-5",
            "max_tokens": 500,
            "messages": [{"role": "user", "content": "Summarize: ..."}]
        }},
        # ... up to 10,000 requests
    ]
)

# Poll until the batch leaves the in_progress state
while batch.processing_status == "in_progress":
    time.sleep(60)
    batch = client.beta.messages.batches.retrieve(batch.id)
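The fixed 60-second sleep above works, but polling with exponential backoff wastes fewer requests on long jobs. A sketch of a generic poller: it takes any `retrieve` callable and `is_done` predicate, so it is not tied to a particular SDK, and the `sleep` function is injectable (these parameter names are illustrative, not from any library).

```python
import time

def poll_until_done(retrieve, is_done, initial=1.0, factor=2.0,
                    max_wait=600.0, sleep=time.sleep):
    """Call retrieve() with exponentially growing waits until is_done() is true.

    `retrieve` returns the latest batch status; `sleep` is injectable for testing.
    """
    wait = initial
    status = retrieve()
    while not is_done(status):
        sleep(wait)
        wait = min(wait * factor, max_wait)  # cap the backoff at max_wait
        status = retrieve()
    return status
```

With the Anthropic batch above, `retrieve` would be `lambda: client.beta.messages.batches.retrieve(batch.id)` and `is_done` would check that `processing_status` is no longer `"in_progress"`.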

Stacking Batch API with Prompt Caching

Batch API and prompt caching stack multiplicatively for maximum savings:

  • Standard Haiku input: $1.00/M
  • Batch API only: $0.50/M (50% off)
  • Prompt caching only: $0.10/M for cache reads (90% off)
  • Batch + cache reads: $0.05/M — 95% off standard rate

Example: 100K document summarizations sharing a 2,000-token cached system prompt. With batch processing plus prompt caching on that prompt, your blended input cost can approach $0.05–$0.10/M when most input tokens are cache reads, the lowest effective input rate on any Anthropic model.
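The compounding above is worth making explicit. A sketch of the blended-rate arithmetic, assuming (per this article's figures) that the batch discount also halves the cache-read rate:

```python
# Haiku 4.5 input rates from this article (USD per million tokens)
STANDARD = 1.00
BATCH_DISCOUNT = 0.50   # batch halves whatever rate applies
CACHE_READ = 0.10       # 90% off standard for cache hits

def blended_input_rate(cached_fraction):
    """Effective batch $/M input when `cached_fraction` of tokens are cache reads."""
    cache_rate = CACHE_READ * BATCH_DISCOUNT   # $0.05/M: batch + cache read
    fresh_rate = STANDARD * BATCH_DISCOUNT     # $0.50/M: batch only
    return cached_fraction * cache_rate + (1 - cached_fraction) * fresh_rate
```

At 90% cache hits the blended rate is about $0.095/M, which is how a shared system prompt that dominates each request's input pushes costs toward the bottom of the $0.05–$0.10/M range.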

Batch API Limits

  • Anthropic: 10,000 requests per batch, 256MB JSONL file limit
  • OpenAI: Up to 50,000 requests per batch, 200MB file limit
  • Completion time: Results guaranteed within 24 hours (usually much faster)
  • Rate limits: Batch jobs share the same token-per-minute limits as real-time — large batches may take longer if you're rate-limited
  • Partial results: Both providers return results for completed requests even if the batch partially fails
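Given the per-batch caps listed above, jobs larger than the limit have to be split into multiple batch submissions. A minimal sketch (count limit only; the JSONL file-size limit would need a separate byte-length check):

```python
def chunk_requests(requests, max_per_batch=10_000):
    """Split a request list into batches no larger than the provider's cap.

    10,000 is the Anthropic per-batch limit quoted above; pass 50_000 for OpenAI.
    """
    if max_per_batch < 1:
        raise ValueError("max_per_batch must be >= 1")
    return [requests[i:i + max_per_batch]
            for i in range(0, len(requests), max_per_batch)]
```

Each chunk is then submitted as its own batch job; keep `custom_id`s globally unique so results from all chunks can be merged afterward.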
