
What Is the Batch API?
50% Off AI Inference for Async Workloads Explained

The Batch API is a pricing tier offered by Anthropic and OpenAI that cuts AI inference costs by 50% for asynchronous, non-real-time jobs. Jobs complete within 24 hours. This guide explains how it works, when to use it, and how to implement it. Last verified: 2026-04-01.

7 min read · Updated April 2026
Batch API Pricing (2026)

  • 50% off: discount vs standard API
  • <24 hrs: job completion time
  • $0.50/M: Claude Haiku 4.5 batch input
  • $1.25/M: GPT-5.4 nano batch input

How the Batch API Works

Instead of sending API requests one at a time and waiting for each response (real-time inference), the Batch API lets you:

  1. Submit a JSONL file containing many requests in a single job
  2. Receive a job ID immediately
  3. Poll for completion (or use a webhook) — jobs complete within 24 hours
  4. Download results as a JSONL file when complete

The provider can schedule your batch during off-peak capacity, which is why it can pass the savings on to you as a 50% discount. The trade-off is latency: you don't get responses instantly.
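Step 1 above can be sketched in code. The snippet below builds the JSONL payload in memory using OpenAI's documented batch line shape (`custom_id`, `method`, `url`, `body`); the model name and prompts are placeholder values for illustration.

```python
import json

def build_batch_jsonl(prompts, model="gpt-5.4-nano"):
    """Serialize one batch request per line in OpenAI's JSONL batch format."""
    lines = []
    for i, prompt in enumerate(prompts):
        request = {
            "custom_id": f"req-{i}",  # used later to match results to inputs
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        lines.append(json.dumps(request))
    return "\n".join(lines)

jsonl = build_batch_jsonl(["Summarize doc A", "Summarize doc B"])
```

Each line is an independent request, which is what lets the provider fan the job out across off-peak capacity and return results keyed by `custom_id`.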

Batch vs Real-Time Pricing: All Major Models

| Model | Standard input /1M | Batch input /1M | Standard output /1M | Batch output /1M |
|---|---|---|---|---|
| Claude Haiku 4.5 | $1.00 | $0.50 | $5.00 | $2.50 |
| Claude Sonnet 4.6 | $3.00 | $1.50 | $15.00 | $7.50 |
| Claude Opus 4.6 | $5.00 | $2.50 | $25.00 | $12.50 |
| GPT-5.4 nano | $0.20 | $0.10 | $1.25 | $0.625 |
| GPT-5.4 mini | $0.75 | $0.375 | $4.50 | $2.25 |
| GPT-5.4 | $2.50 | $1.25 | $15.00 | $7.50 |

Gemini 2.5 models do not currently offer a Batch API equivalent — their standard pricing is already lower than batch pricing on comparable OpenAI/Anthropic models for many use cases.
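The pricing table above can be turned into a quick cost check. A minimal sketch, with rates hardcoded from the table (two models shown; extend the dict for the rest):

```python
# Per-million-token rates (USD), copied from the pricing table above
RATES = {
    "claude-haiku-4.5": {"standard_in": 1.00, "batch_in": 0.50,
                         "standard_out": 5.00, "batch_out": 2.50},
    "gpt-5.4-nano":     {"standard_in": 0.20, "batch_in": 0.10,
                         "standard_out": 1.25, "batch_out": 0.625},
}

def monthly_cost(model, m_in, m_out, batch=False):
    """Cost in USD for m_in / m_out millions of input / output tokens."""
    r = RATES[model]
    if batch:
        return m_in * r["batch_in"] + m_out * r["batch_out"]
    return m_in * r["standard_in"] + m_out * r["standard_out"]

# 100M input + 20M output tokens on Haiku 4.5:
standard = monthly_cost("claude-haiku-4.5", 100, 20)              # $200
batched = monthly_cost("claude-haiku-4.5", 100, 20, batch=True)   # $100
```

Because every batch rate in the table is exactly half the standard rate, the batch total is always half the standard total regardless of your input/output mix.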

When to Use the Batch API

| Use case | Good for batch? | Why |
|---|---|---|
| Nightly document enrichment | Yes | All documents processed overnight, results ready by morning |
| Bulk content generation | Yes | Product descriptions, email sequences, article outlines — no user waiting |
| Training data labeling | Yes | Thousands of samples can be labeled in a single batch job |
| Contract/document analysis | Yes | Upload 1,000 contracts, get structured analysis by end of day |
| Embeddings at scale | Yes | Large corpus indexing can run overnight in batch |
| Real-time chatbot responses | No | Users expect sub-second responses — batch doesn't work for interactive UX |
| Live customer support | No | Tickets need resolution in minutes, not hours |
| On-demand code completion | No | Latency requirement is <200ms — incompatible with batch model |

Batch API Implementation (Anthropic)

Anthropic's Message Batches API accepts up to 10,000 requests per batch:

# Create a batch
import time

import anthropic

client = anthropic.Anthropic()
batch = client.beta.messages.batches.create(
    requests=[
        {"custom_id": "req-1", "params": {
            "model": "claude-haiku-4-5",
            "max_tokens": 500,
            "messages": [{"role": "user", "content": "Summarize: ..."}]
        }},
        # ... up to 10,000 requests
    ]
)

# Poll until the batch leaves the in_progress state
while batch.processing_status == "in_progress":
    time.sleep(60)
    batch = client.beta.messages.batches.retrieve(batch.id)
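The fixed 60-second sleep above works, but polling with exponential backoff wastes fewer requests on long jobs. A sketch of a generic poller: it takes any `retrieve` callable and `is_done` predicate, so it is not tied to a particular SDK, and the `sleep` function is injectable (these parameter names are illustrative, not from any library).

```python
import time

def poll_until_done(retrieve, is_done, initial=1.0, factor=2.0,
                    max_wait=600.0, sleep=time.sleep):
    """Call retrieve() with exponentially growing waits until is_done() is true.

    `retrieve` returns the latest batch status; `sleep` is injectable for testing.
    """
    wait = initial
    status = retrieve()
    while not is_done(status):
        sleep(wait)
        wait = min(wait * factor, max_wait)  # cap the backoff at max_wait
        status = retrieve()
    return status
```

With the Anthropic batch above, `retrieve` would be `lambda: client.beta.messages.batches.retrieve(batch.id)` and `is_done` would check that `processing_status` is no longer `"in_progress"`.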

Stacking Batch API with Prompt Caching

Batch API and prompt caching stack multiplicatively for maximum savings:

  • Standard Haiku input: $1.00/M
  • Batch API only: $0.50/M (50% off)
  • Prompt caching only: $0.10/M for cache reads (90% off)
  • Batch + cache reads: $0.05/M — 95% off standard rate

Example: 100K document summarizations sharing a 2,000-token cached system prompt. With batch processing plus prompt caching on that prompt, your blended input cost can approach $0.05–$0.10/M when most input tokens are cache reads, the lowest effective input rate on any Anthropic model.
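The compounding above is worth making explicit. A sketch of the blended-rate arithmetic, assuming (per this article's figures) that the batch discount also halves the cache-read rate:

```python
# Haiku 4.5 input rates from this article (USD per million tokens)
STANDARD = 1.00
BATCH_DISCOUNT = 0.50   # batch halves whatever rate applies
CACHE_READ = 0.10       # 90% off standard for cache hits

def blended_input_rate(cached_fraction):
    """Effective batch $/M input when `cached_fraction` of tokens are cache reads."""
    cache_rate = CACHE_READ * BATCH_DISCOUNT   # $0.05/M: batch + cache read
    fresh_rate = STANDARD * BATCH_DISCOUNT     # $0.50/M: batch only
    return cached_fraction * cache_rate + (1 - cached_fraction) * fresh_rate
```

At 90% cache hits the blended rate is about $0.095/M, which is how a shared system prompt that dominates each request's input pushes costs toward the bottom of the $0.05–$0.10/M range.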

Batch API Limits

  • Anthropic: 10,000 requests per batch, 256MB JSONL file limit
  • OpenAI: Up to 50,000 requests per batch, 200MB file limit
  • Completion time: Results guaranteed within 24 hours (usually much faster)
  • Rate limits: Batch jobs share the same token-per-minute limits as real-time — large batches may take longer if you're rate-limited
  • Partial results: Both providers return results for completed requests even if the batch partially fails
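Given the per-batch caps listed above, jobs larger than the limit have to be split into multiple batch submissions. A minimal sketch (count limit only; the JSONL file-size limit would need a separate byte-length check):

```python
def chunk_requests(requests, max_per_batch=10_000):
    """Split a request list into batches no larger than the provider's cap.

    10,000 is the Anthropic per-batch limit quoted above; pass 50_000 for OpenAI.
    """
    if max_per_batch < 1:
        raise ValueError("max_per_batch must be >= 1")
    return [requests[i:i + max_per_batch]
            for i in range(0, len(requests), max_per_batch)]
```

Each chunk is then submitted as its own batch job; keep `custom_id`s globally unique so results from all chunks can be merged afterward.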
