
What Is Inference Cost?
AI Compute Costs, Latency & the Make-vs-Buy Decision

Inference cost is what you pay to run an AI model on new inputs — the cost of generating each prediction or response. Unlike training cost (a one-time expense), inference is the ongoing operational cost of every API call your application makes. Last verified: 2026-04-01.

Inference Cost Reference (per 1M tokens, 2026)

  • Cheapest input: $0.10 (Gemini 2.5 Flash-Lite, Mistral Small 3.2)
  • Mid-range input: $3.00 (Claude Sonnet 4.6)
  • Premium output: $25.00 (Claude Opus 4.6)
  • Batch API discount: 50% off (async processing)

Inference vs Training: What's the Difference?

Stage     | What it is                                                     | Who pays for it                                | When
Training  | Teaching the model on massive datasets (billions of GPU hours) | Anthropic, OpenAI, Google (baked into pricing) | One-time, before release
Inference | Running the model on your input to generate a response         | You, on every API call                         | Every request your app makes

When people talk about "AI costs" in a production application, they almost always mean inference costs — the per-call expense charged by API providers per token.

How Inference Cost Is Calculated

API providers charge by token volume:

inference_cost =
(input_tokens × input_price_per_million / 1,000,000)
+ (output_tokens × output_price_per_million / 1,000,000)
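As a quick sketch, the formula above in Python; the prices used in the example are the Claude Sonnet 4.6 rates from the table below:

```python
def inference_cost(input_tokens: int, output_tokens: int,
                   input_price_per_million: float,
                   output_price_per_million: float) -> float:
    """Dollar cost of one API call, with prices quoted per 1M tokens."""
    return (input_tokens * input_price_per_million / 1_000_000
            + output_tokens * output_price_per_million / 1_000_000)

# Example: 1,000 queries of 500 input + 500 output tokens at
# Claude Sonnet 4.6 pricing ($3.00 in / $15.00 out per 1M tokens).
cost = sum(inference_cost(500, 500, 3.00, 15.00) for _ in range(1000))
print(f"${cost:.2f}")  # $9.00
```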

Inference Cost by Model — Full Table 2026

Model                 | Input /1M | Output /1M | Cost per 1K queries (500+500 tok)
Gemini 2.5 Flash-Lite | $0.10     | $0.40      | $0.25
Mistral Small 3.2     | $0.10     | $0.30      | $0.20
GPT-5.4 nano          | $0.20     | $1.25      | $0.73
Gemini 2.5 Flash      | $0.30     | $2.50      | $1.40
Mistral Large 3       | $0.50     | $1.50      | $1.00
GPT-5.4 mini          | $0.75     | $4.50      | $2.63
Claude Haiku 4.5      | $1.00     | $5.00      | $3.00
Gemini 2.5 Pro        | $1.25     | $10.00     | $5.63
GPT-5.4               | $2.50     | $15.00     | $8.75
Claude Sonnet 4.6     | $3.00     | $15.00     | $9.00
Claude Opus 4.6       | $5.00     | $25.00     | $15.00

Cost per 1K queries assumes 500 input + 500 output tokens per query.

Factors That Drive Inference Cost Up

  • Model size — larger, more capable models cost more per token
  • Long inputs — document analysis, RAG with many retrieved chunks, or long conversation history
  • Verbose outputs — output tokens cost 3–10× more than input tokens
  • Agent loops — AI agents that call tools and re-process results can make 5–20 API calls per user action
  • Retry logic — each retried API call is billed again (unless the failure is caught before the request is sent)
  • Reasoning tokens — some models (like Claude with extended thinking) generate hidden reasoning tokens that are billed
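To see how these multipliers compound, here is a back-of-the-envelope sketch; the call count and retry rate below are hypothetical illustrations, not measured figures:

```python
def effective_cost_per_action(base_cost_per_call: float,
                              calls_per_action: int = 1,
                              retry_rate: float = 0.0) -> float:
    """Expected cost of one user action: each action triggers
    `calls_per_action` API calls, and a fraction `retry_rate` of
    those calls fails and is billed a second time."""
    return base_cost_per_call * calls_per_action * (1 + retry_rate)

single = effective_cost_per_action(0.009)  # one plain chat call
agent = effective_cost_per_action(0.009,   # tool-using agent loop
                                  calls_per_action=12, retry_rate=0.05)
print(f"{agent / single:.1f}x")  # 12.6x the cost of a single call
```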

Ways to Reduce Inference Cost

1. Prompt caching

For repeated system prompts or document context, Claude's prompt caching charges 90% less on cache reads. Cache writes cost slightly more, but break even after ~3 cache reads on the same prefix. Best for: large system prompts reused across thousands of calls.
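A sketch of the caching arithmetic; the write premium (1.25×) and read discount (0.10×, i.e. 90% off) used here are assumed multipliers for illustration — check your provider's current pricing:

```python
def cached_vs_uncached(n_reads: int, write_mult: float = 1.25,
                       read_mult: float = 0.10) -> float:
    """Fraction of the uncached cost you pay when a prefix is cached
    once and then read n_reads times, in units of the plain input
    price for that prefix."""
    uncached = 1 + n_reads                     # prefix resent fresh every call
    cached = write_mult + n_reads * read_mult  # one write, then cheap reads
    return cached / uncached

print(f"{cached_vs_uncached(1000):.0%}")  # ~10% of the uncached cost
```

The ratio approaches the read discount as reuse grows, which is why caching pays off most for large prompts reused across thousands of calls.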

2. Batch API (50% discount)

Both Anthropic and OpenAI offer ~50% off inference costs for batch (async) processing. Jobs complete within 24 hours. Best for: document summarization, report generation, overnight enrichment runs, any non-realtime workload.

3. Model routing

Run a fast, cheap classifier (GPT-5.4 nano or Flash-Lite) to categorize incoming requests, then route: simple requests to the cheap model, complex requests to the premium model. A well-tuned router cuts blended inference cost by 60–75%.
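A minimal sketch of the routing idea. The prices come from the table above, but `classify` here is a trivial keyword-based placeholder standing in for a real call to a cheap classifier model:

```python
CHEAP = {"name": "Gemini 2.5 Flash-Lite", "in": 0.10, "out": 0.40}
PREMIUM = {"name": "Claude Opus 4.6", "in": 5.00, "out": 25.00}

def classify(request: str) -> str:
    """Placeholder: a production router would call a cheap model
    (GPT-5.4 nano, Flash-Lite) to label the request instead."""
    hard = ("analyze", "prove", "architect", "debug")
    return "complex" if any(w in request.lower() for w in hard) else "simple"

def route(request: str) -> dict:
    """Send complex requests to the premium model, the rest to the cheap one."""
    return PREMIUM if classify(request) == "complex" else CHEAP

print(route("What are your opening hours?")["name"])  # Gemini 2.5 Flash-Lite
print(route("Debug this race condition")["name"])     # Claude Opus 4.6
```

The savings come from the fact that most production traffic is simple: if 80% of requests route to the cheap model, the blended per-token price falls accordingly.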

4. Output length control

Output tokens are expensive. Instruct models to be concise, return JSON instead of markdown prose, skip preambles ("Certainly! Here's..."), and stop generating after the key information. This typically cuts output token counts by 20–40%.

5. Self-hosting at very high scale

At very high scale, self-hosting an open-weight model (Mistral, Llama) on H100 GPUs can undercut API pricing: at $2–3/hour, one H100 can process roughly 1–2M tokens/minute. Break-even against the Claude Haiku 4.5 API occurs at roughly 2–5 billion tokens/month; below that, API pricing wins on total cost of ownership.
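A rough break-even sketch using the figures above; the GPU hourly rate, throughput, and the fixed monthly ops/engineering cost are assumptions to replace with your own numbers:

```python
def monthly_api_cost(tokens_per_month: float,
                     blended_price_per_m: float) -> float:
    """API spend at a blended (input+output average) price per 1M tokens."""
    return tokens_per_month / 1_000_000 * blended_price_per_m

def monthly_selfhost_cost(tokens_per_month: float,
                          gpu_hourly: float = 2.50,
                          tokens_per_gpu_hour: float = 90e6,  # ~1.5M tok/min
                          fixed_ops: float = 10_000.0) -> float:
    """GPU rental plus an assumed fixed ops/engineering overhead."""
    gpu_hours = tokens_per_month / tokens_per_gpu_hour
    return gpu_hours * gpu_hourly + fixed_ops

# Haiku 4.5 at an even 500/500 split: blended (1.00 + 5.00) / 2 = $3.00 per 1M.
for billions in (1, 3, 5):
    tokens = billions * 1e9
    print(f"{billions}B tok/mo: API ${monthly_api_cost(tokens, 3.00):,.0f} "
          f"vs self-host ${monthly_selfhost_cost(tokens):,.0f}")
# Under these assumptions break-even lands between 3B and 5B tokens/month,
# inside the 2–5B range quoted above.
```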

Cloud vs API Inference: The Make-vs-Buy Decision

Option                              | Best for                                                                             | Risks
API (Anthropic/OpenAI/Google)       | Most teams under $50K/month AI spend: zero infra overhead, latest models, no GPU ops | Cost scales linearly; no control over pricing changes
Self-hosted (Mistral, Llama)        | Very high volume (>5B tokens/month) or strict data sovereignty requirements          | GPU ops burden, model lag vs frontier, engineering cost
Dedicated capacity (Bedrock, Azure) | Enterprise compliance, predictable latency SLAs, moderate-to-high volume             | Minimum commitment, more complex setup than direct API

Calculate Your Inference Cost

Enter your token volume and model to get a monthly inference cost projection.
