What Is Inference Cost?
AI Compute Costs, Latency & the Make-vs-Buy Decision
Inference cost is what you pay to run an AI model on new inputs — the cost of generating each prediction or response. Unlike training cost (a one-time expense), inference is the ongoing operational cost of every API call your application makes. Last verified: 2026-04-01.
Inference vs Training: What's the Difference?
| Stage | What it is | Who pays for it | When |
|---|---|---|---|
| Training | Teaching the model on massive datasets — billions of GPU hours | Anthropic, OpenAI, Google — baked into pricing | One-time, before release |
| Inference | Running the model on your input to generate a response | You — every API call | Every request your app makes |
When people talk about "AI costs" in a production application, they almost always mean inference costs — the per-call expense charged by API providers per token.
How Inference Cost Is Calculated
API providers charge per token, with separate rates for input tokens (your prompt, context, and conversation history) and output tokens (the model's response), typically quoted in dollars per million tokens:
Inference Cost by Model — Full Table 2026
| Model | Input ($/1M tok) | Output ($/1M tok) | Cost per 1K queries (500+500 tok) |
|---|---|---|---|
| Gemini 2.5 Flash-Lite | $0.10 | $0.40 | $0.25 |
| Mistral Small 3.2 | $0.10 | $0.30 | $0.20 |
| GPT-5.4 nano | $0.20 | $1.25 | $0.73 |
| Gemini 2.5 Flash | $0.30 | $2.50 | $1.40 |
| Mistral Large 3 | $0.50 | $1.50 | $1.00 |
| GPT-5.4 mini | $0.75 | $4.50 | $2.63 |
| Claude Haiku 4.5 | $1.00 | $5.00 | $3.00 |
| Gemini 2.5 Pro | $1.25 | $10.00 | $5.63 |
| GPT-5.4 | $2.50 | $15.00 | $8.75 |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $9.00 |
| Claude Opus 4.6 | $5.00 | $25.00 | $15.00 |
Cost per 1K queries assumes 500 input + 500 output tokens per query.
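The per-query column follows from simple arithmetic on the listed rates. A minimal sketch in Python (prices hardcoded from the table above; check current provider pricing before relying on them):

```python
def inference_cost(input_tokens: int, output_tokens: int,
                   input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one request, given per-million-token prices."""
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

# 1K queries of 500 input + 500 output tokens on Claude Haiku 4.5 ($1.00 / $5.00)
cost = 1_000 * inference_cost(500, 500, 1.00, 5.00)
print(f"${cost:.2f}")  # $3.00
```

Swapping in any other row's prices reproduces that row's per-1K-query figure.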
Factors That Drive Inference Cost Up
- Model size — larger, more capable models cost more per token
- Long inputs — document analysis, RAG with many retrieved chunks, or long conversation history
- Verbose outputs — output tokens typically cost 3–10× more than input tokens
- Agent loops — AI agents that call tools and re-process results can make 5–20 API calls per user action
- Retry logic — every retried request is billed again in full, so validate requests client-side to keep predictable failures from ever reaching the API
- Reasoning tokens — some models (like Claude with extended thinking) generate hidden reasoning tokens that are billed
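Several of these factors compound. A back-of-the-envelope sketch of what an agent loop costs per user action (the call count and token averages are hypothetical):

```python
def agent_action_cost(calls_per_action: int,
                      avg_input_tokens: int, avg_output_tokens: int,
                      input_price_per_m: float, output_price_per_m: float) -> float:
    """Total cost of one user action that triggers several chained API calls."""
    per_call = (avg_input_tokens * input_price_per_m
                + avg_output_tokens * output_price_per_m) / 1_000_000
    return calls_per_action * per_call

# Hypothetical agent: 10 tool-calling rounds, with re-processed context
# averaging 3,000 input / 400 output tokens per call, on Claude Sonnet 4.6
print(f"${agent_action_cost(10, 3_000, 400, 3.00, 15.00):.3f}")  # $0.150
```

A single chat turn with the same token counts would cost $0.015; the loop multiplies it tenfold.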
Ways to Reduce Inference Cost
1. Prompt caching
For repeated system prompts or document context, Anthropic's prompt caching prices cache reads at 90% off the base input rate. Cache writes cost about 25% more than uncached input, so a cached prefix pays for itself after a single reuse. Best for: large system prompts reused across thousands of calls.
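The break-even arithmetic can be sketched directly. A minimal example, assuming Anthropic's published multipliers of 1.25× for cache writes and 0.10× for cache reads on the base input price (verify current pricing docs before relying on these):

```python
def caching_savings(prefix_tokens: int, reads: int, base_price_per_m: float,
                    write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Dollars saved vs. resending the prefix uncached on every call."""
    base = prefix_tokens * base_price_per_m / 1_000_000
    uncached = (1 + reads) * base                      # prefix billed on every call
    cached = write_mult * base + reads * read_mult * base
    return uncached - cached

# 20K-token system prompt at $3/M input, reused 100 times
print(f"${caching_savings(20_000, 100, 3.00):.2f}")
```

With these multipliers, each read saves 0.9× the base prefix cost while the write premium is only 0.25×, which is why a single read already recoups it.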
2. Batch API (50% discount)
Both Anthropic and OpenAI offer ~50% off inference costs for batch (async) processing. Jobs complete within 24 hours. Best for: document summarization, report generation, overnight enrichment runs, any non-realtime workload.
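Sizing a batch job is the same per-token arithmetic with the discount applied. A sketch (the document counts and token averages are hypothetical; the ~50% discount is the figure above):

```python
def batch_job_cost(docs: int, in_tok: int, out_tok: int,
                   input_price_per_m: float, output_price_per_m: float,
                   batch_discount: float = 0.50) -> float:
    """Cost of an async batch job at the discounted rate."""
    full = docs * (in_tok * input_price_per_m + out_tok * output_price_per_m) / 1_000_000
    return full * (1 - batch_discount)

# 100K document summaries (2,000 in / 300 out each) on Haiku 4.5 ($1 / $5)
print(f"${batch_job_cost(100_000, 2_000, 300, 1.00, 5.00):,.2f}")  # $175.00
```

The same run at real-time rates would cost $350, so batching the overnight portion of a workload is often the single easiest saving.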
3. Model routing
Run a fast, cheap classifier (e.g., GPT-5.4 nano or Gemini 2.5 Flash-Lite) to categorize incoming requests, then route: simple requests to the cheap model, complex requests to the premium model. A well-tuned router can cut blended inference cost by 60–75%.
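A sketch of the blended math, using the per-1K-query figures from the table above (the 80/20 split and the router's overhead are assumptions):

```python
def blended_cost_per_1k(cheap_share: float, cheap_cost: float,
                        premium_cost: float, router_cost: float) -> float:
    """Blended cost per 1K queries when a router splits traffic.
    Every query pays the router; only the hard ones pay the premium model."""
    return router_cost + cheap_share * cheap_cost + (1 - cheap_share) * premium_cost

# 80% to Gemini 2.5 Flash-Lite ($0.25/1K), 20% to Claude Sonnet 4.6 ($9.00/1K),
# with ~$0.20/1K of router overhead for tiny classification calls
blended = blended_cost_per_1k(0.80, 0.25, 9.00, 0.20)
print(f"${blended:.2f} per 1K queries vs $9.00 all-premium")  # $2.20 ...
```

Under these assumptions the blend runs about 76% below sending everything to the premium model, consistent with the 60–75% range above.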
4. Output length control
Output tokens are expensive. Instruct models to be concise, return JSON instead of markdown prose, skip preambles ("Certainly! Here's..."), and stop generating after the key information. This typically cuts output token counts by 20–40%.
5. Self-hosting at very high scale
At very high volume, self-hosting an open-weight model (Mistral, Llama) on H100 GPUs can undercut API pricing: at $2–3/hour, a fully utilized H100 can process roughly 1–2M tokens/minute. Measured against Haiku 4.5 API pricing, break-even lands at roughly 2–5 billion tokens/month once GPU ops and engineering time are counted; below that, the API wins on total cost of ownership.
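A rough break-even sketch, ignoring utilization gaps, redundancy, and engineering cost (which push the practical threshold higher); the GPU count, rental rate, and blended API price are all assumptions:

```python
def self_host_breakeven(gpu_hourly: float, gpus: int,
                        api_price_per_m: float) -> float:
    """Monthly token volume at which GPU rental matches API spend.
    Assumes fully utilized GPUs; real thresholds are higher."""
    monthly_gpu_cost = gpu_hourly * 24 * 30 * gpus   # ~720 hours/month
    return monthly_gpu_cost / api_price_per_m * 1_000_000

# 8×H100 node at $2.50/hr vs Haiku 4.5's blended rate (~$3 per 1M tokens
# at a 50/50 input/output mix)
tokens = self_host_breakeven(2.50, 8, 3.00)
print(f"{tokens / 1e9:.1f}B tokens/month")  # 4.8B tokens/month
```

That idealized figure sits at the top of the 2–5B range quoted above; overheads in practice move the threshold, not the shape of the comparison.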
Cloud vs API Inference: The Make-vs-Buy Decision
| Option | Best for | Risks |
|---|---|---|
| API (Anthropic/OpenAI/Google) | Most teams under $50K/month AI spend. Zero infra overhead, latest models, no GPU ops. | Cost scales linearly; no control over pricing changes |
| Self-hosted (Mistral, Llama) | Very high volume (>5B tokens/month) OR strict data sovereignty requirements | GPU ops burden, model lag vs frontier, engineering cost |
| Dedicated capacity (Bedrock, Azure) | Enterprise compliance, predictable latency SLAs, moderate-to-high volume | Minimum commit, more complex setup than direct API |
Calculate Your Inference Cost
Enter your token volume and model to get an exact monthly inference cost projection.
AI API Cost Calculator