
What Is RAG Cost?
Retrieval-Augmented Generation Pricing Explained

A RAG (Retrieval-Augmented Generation) system has three cost layers: embedding, vector database, and LLM inference. The LLM is 95–99% of total RAG cost. Here's what each layer actually costs and how to minimize the biggest driver. Last verified: 2026-04-01.

8 min read · Updated April 2026
RAG Cost by Layer (2026)
  • $0.02/M tokens: Embeddings (text-embedding-3-small)
  • $0–$70/mo: Vector DB (pgvector free → Pinecone)
  • 95–99%: LLM share of total RAG cost
  • $0.50: Cost to embed 50K documents (500 tokens each)

RAG Architecture: Three Cost Layers

| Layer | What it does | Typical cost | % of total |
| --- | --- | --- | --- |
| Embedding (ingestion) | Converts documents to vectors at ingestion time (one-time cost) | $0.02/M tokens (OpenAI text-embedding-3-small) | ~0% |
| Embedding (query) | Converts the user query to a vector for similarity search (per query) | ~10 tokens × $0.02/M = $0.0000002/query | <0.1% |
| Vector DB | Stores and searches vectors; returns top-K relevant chunks | $0 (pgvector) to $70+/mo (Pinecone Starter) | 1–10% |
| LLM inference | Receives retrieved chunks + user query; generates answer | $0.001–$0.05+ per query depending on model | 90–99% |

Optimize your LLM model choice first. Embedding and vector DB costs are nearly irrelevant — the LLM is where all your money goes.
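The split above can be sketched as arithmetic. The prices below are illustrative assumptions (text-embedding-3-small at $0.02/M tokens; a Haiku-class LLM at $1.00/M input and $5.00/M output), not quotes from any provider:

```python
# Sketch: per-query cost of each RAG layer. All prices are illustrative
# assumptions: text-embedding-3-small at $0.02/M tokens, and a Haiku-class
# LLM at $1.00/M input and $5.00/M output tokens.

EMBED_PRICE = 0.02 / 1_000_000    # $/token for the query embedding
LLM_IN_PRICE = 1.00 / 1_000_000   # $/input token (assumed)
LLM_OUT_PRICE = 5.00 / 1_000_000  # $/output token (assumed)

def rag_query_cost(query_tokens=10, context_tokens=2_000, output_tokens=300):
    """Return (embedding_cost, llm_cost) in dollars for one RAG query."""
    embed = query_tokens * EMBED_PRICE
    llm = ((query_tokens + context_tokens) * LLM_IN_PRICE
           + output_tokens * LLM_OUT_PRICE)
    return embed, llm

embed, llm = rag_query_cost()
print(f"embedding ${embed:.7f}, LLM ${llm:.4f}, LLM share {llm/(embed+llm):.2%}")
```

Even with a cheap model, the LLM's share of a single query rounds to effectively 100%.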

Layer 1: Embedding Cost (Negligible)

Embeddings are the cheapest part of any RAG system:

| Embedding model | Price / 1M tokens | 50K docs (500 tok avg) | 1M queries/day |
| --- | --- | --- | --- |
| OpenAI text-embedding-3-small | $0.020 | $0.50 one-time | $6.00/mo |
| OpenAI text-embedding-3-large | $0.130 | $3.25 one-time | $39.00/mo |
| Cohere Embed v3 | $0.100 | $2.50 one-time | $30.00/mo |
| Mistral Embed | $0.100 | $2.50 one-time | $30.00/mo |

(The query column assumes ~10 tokens per query over a 30-day month.)

Use text-embedding-3-small for almost everything. The small vs large quality difference is marginal for most retrieval tasks, and the cost difference is 6.5×.
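As a sanity check on the table, one-time ingestion cost is just total tokens × price. A minimal sketch, assuming text-embedding-3-small's $0.02 per 1M tokens (verify against current pricing):

```python
# Sketch: one-time corpus embedding cost = total tokens x price per 1M tokens.
# Assumes text-embedding-3-small at $0.02/M tokens.

def ingestion_cost(num_docs, avg_tokens_per_doc, price_per_m_tokens=0.02):
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * price_per_m_tokens

print(f"${ingestion_cost(50_000, 500):.2f}")   # 50K docs x 500 tokens -> $0.50
```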

Layer 2: Vector Database Cost

| Vector DB | Monthly cost | Vectors stored | Best for |
| --- | --- | --- | --- |
| pgvector (self-hosted) | $0 | Unlimited (RAM-limited) | Teams already on Postgres; up to ~1M vectors |
| Supabase (pgvector) | $25/mo | ~1M vectors | Managed Postgres + vector search; easy setup |
| Pinecone Starter | $70/mo | 5M vectors | Production RAG at small-to-mid scale |
| Pinecone Standard | $0.096/1M vectors | Scales | Large-scale retrieval (10M+ vectors) |
| Weaviate Cloud | $25–$300+/mo | Scales | Enterprise, hybrid search, multi-tenancy |
| ChromaDB (local) | $0 | Limited by disk | Dev/testing; not production at scale |
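One way to gauge the "up to ~1M vectors" self-hosting limit is raw memory footprint: vectors × dimensions × bytes per component. A rough sketch, assuming 1536-dimension float32 embeddings and ignoring index overhead (both are assumptions, not pgvector internals):

```python
# Sketch: rough RAM footprint of holding vectors in memory, one reason
# self-hosted pgvector is often capped around ~1M vectors on modest hardware.
# Assumes 1536-dim float32 embeddings (4 bytes/dim); ignores index overhead.

def vector_ram_gb(num_vectors, dims=1536, bytes_per_dim=4):
    return num_vectors * dims * bytes_per_dim / 1024**3

print(f"{vector_ram_gb(1_000_000):.1f} GB")   # 1M vectors -> ~5.7 GB
```

At 10M+ vectors the footprint reaches tens of gigabytes, which is where the managed, scale-priced tiers start to make sense.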

Layer 3: LLM Inference Cost (The Big One)

RAG retrieves 3–5 chunks (typically 300–500 tokens each) and passes them to the LLM alongside the user query. Typical RAG query: 1,500–3,000 input tokens + 200–500 output tokens.

| Model | Cost per RAG query (2K in / 300 out) | 10K queries/day | 100K queries/day |
| --- | --- | --- | --- |
| Gemini 2.5 Flash-Lite | $0.000320 | $96/mo | $960/mo |
| GPT-5.4 nano | $0.000775 | $233/mo | $2,325/mo |
| Claude Haiku 4.5 | $0.003500 | $1,050/mo | $10,500/mo |
| Claude Sonnet 4.6 | $0.010500 | $3,150/mo | $31,500/mo |

At 100K queries/day, Flash-Lite ($960/mo) saves $30,540/mo vs Sonnet ($31,500/mo). Model selection is the only decision that matters at scale.
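The per-query figures in the table scale linearly with volume, so the monthly comparison is a one-liner. A sketch using the table's own illustrative per-query costs:

```python
# Sketch: monthly LLM spend = per-query cost x queries/day x 30 days.
# Per-query costs are the table's illustrative figures (2K in / 300 out).

PER_QUERY = {
    "gemini-2.5-flash-lite": 0.000320,
    "gpt-5.4-nano":          0.000775,
    "claude-haiku-4.5":      0.003500,
    "claude-sonnet-4.6":     0.010500,
}

def monthly_llm_cost(model, queries_per_day, days=30):
    return PER_QUERY[model] * queries_per_day * days

print(round(monthly_llm_cost("claude-sonnet-4.6", 100_000)))   # -> 31500
```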

Full RAG Stack Cost: Small App Example

| Component | Monthly cost (1K queries/day) |
| --- | --- |
| Embedding (initial ingestion of 10K docs) | $0.10 one-time |
| Embedding (1K queries/day × 10 tokens) | $0.006/mo |
| Vector DB (pgvector on existing Postgres) | $0 |
| LLM inference (Claude Haiku 4.5, 2K in / 300 out tokens) | $105/mo |
| App hosting | $20/mo |
| Total | ~$125/mo |

Embeddings + vector DB = $0.006/mo total. Haiku LLM = $105/mo. The ratio is 99.9% LLM, 0.1% everything else.
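The whole small-app example reduces to a few lines of arithmetic; the defaults below are the table's illustrative figures (Haiku-class LLM, free pgvector), not real quotes:

```python
# Sketch: the small-app example as arithmetic. Defaults are the table's
# illustrative figures: $0.0035/query LLM, $0.02/M embedding tokens,
# free self-hosted pgvector, $20/mo hosting.

def monthly_stack_cost(queries_per_day=1_000,
                       llm_cost_per_query=0.0035,   # 2K in / 300 out
                       query_tokens=10,
                       embed_price_per_m=0.02,
                       vector_db=0.0,
                       hosting=20.0):
    llm = llm_cost_per_query * queries_per_day * 30
    embeddings = (queries_per_day * 30 * query_tokens
                  / 1_000_000 * embed_price_per_m)
    return llm + embeddings + vector_db + hosting

print(round(monthly_stack_cost(), 2))   # -> 125.01
```

Swapping in a cheaper model via `llm_cost_per_query` is the only knob that moves the total meaningfully.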

RAG Cost Optimization

  • Reduce retrieved chunk count: Retrieving 3 chunks instead of 5 cuts input tokens by ~600. At Sonnet, that's $0.0018 saved per query — $1,800/month at 1M queries/month.
  • Chunk size optimization: Smaller, more precise chunks mean fewer tokens passed to LLM. 256-token chunks vs 512-token chunks halves retrieval context.
  • Prompt caching on system prompt: Cache your RAG instructions and context template as a system prompt prefix. Saves 90% on the fixed-cost portion of every query.
  • Model routing: Use a cheap classifier to determine query complexity. Simple factual queries → Flash-Lite; complex multi-hop reasoning → Haiku or Sonnet.
  • Response caching: Cache LLM responses for common or repeated questions. FAQ-type RAG systems can cache 20–40% of queries, dramatically reducing LLM calls.
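The response-caching bullet can be sketched with a plain dict keyed by a hash of the normalized query; `call_llm` here is a hypothetical stand-in for your inference client, not a real API:

```python
# Sketch of response caching: key a dict by a hash of the normalized query
# and skip the LLM call on a hit. `call_llm` is a hypothetical stand-in
# for an inference client; production systems would add TTLs/eviction.

import hashlib

cache = {}

def cached_answer(query, call_llm):
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key not in cache:
        cache[key] = call_llm(query)   # miss: pay for one LLM call
    return cache[key]                  # hit: zero marginal LLM cost

# Usage: the second, cosmetically different query never reaches the LLM.
calls = []
def fake_llm(q):
    calls.append(q)
    return f"answer to: {q}"

cached_answer("What is RAG?", fake_llm)
cached_answer("  what is rag?", fake_llm)
print(len(calls))   # -> 1
```

At the 20–40% hit rates mentioned above, this directly removes that fraction of LLM spend.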

Calculate Your RAG System Cost

Enter your query volume and model to see your monthly LLM inference cost for your RAG application.

AI API Cost Calculator