What Is RAG Cost?
Retrieval-Augmented Generation Pricing Explained
A RAG (Retrieval-Augmented Generation) system has three cost layers: embedding, vector database, and LLM inference. The LLM typically accounts for 90–99% of total RAG cost. Here's what each layer actually costs and how to minimize the biggest driver. Last verified: 2026-04-01.
RAG Architecture: Three Cost Layers
| Layer | What it does | Typical cost | % of total |
|---|---|---|---|
| Embedding (ingestion) | Converts documents to vectors at ingestion time (one-time cost) | $0.02/M tokens (OpenAI text-embedding-3-small) | ~0% |
| Embedding (query) | Converts user query to vector for similarity search (per query) | ~10 tokens × $0.02/M = $0.0000002/query | <0.1% |
| Vector DB | Stores and searches vectors; returns top-K relevant chunks | $0 (pgvector) to $70+/mo (Pinecone Starter) | 1–10% |
| LLM inference | Receives retrieved chunks + user query; generates answer | $0.001–$0.05+ per query depending on model | 90–99% |
Optimize your LLM model choice first. Embedding and vector DB costs are nearly irrelevant by comparison: the LLM is where almost all the money goes.
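The table above can be turned into a back-of-envelope cost model. This is a sketch using illustrative prices from this article (text-embedding-3-small for query embeddings, Haiku-class LLM rates); `monthly_rag_cost` and its parameters are hypothetical names, not any vendor's API:

```python
# Back-of-envelope monthly RAG cost by layer. Prices are the
# article's illustrative figures, not live vendor quotes.
EMBED_PRICE_PER_M = 0.02   # $/1M tokens (text-embedding-3-small)
LLM_IN_PER_M = 1.00        # $/1M input tokens (Haiku-class, assumed)
LLM_OUT_PER_M = 5.00       # $/1M output tokens (assumed)

def monthly_rag_cost(queries_per_day, in_tokens=2000, out_tokens=300,
                     query_embed_tokens=10, vector_db=0.0, days=30):
    """Return (embedding, vector_db, llm, total) in dollars per month."""
    q = queries_per_day * days
    embed = q * query_embed_tokens * EMBED_PRICE_PER_M / 1_000_000
    llm = q * (in_tokens * LLM_IN_PER_M + out_tokens * LLM_OUT_PER_M) / 1_000_000
    return embed, vector_db, llm, embed + vector_db + llm

embed, db, llm, total = monthly_rag_cost(1_000)
print(f"embedding ${embed:.3f}  vector DB ${db:.2f}  "
      f"LLM ${llm:.2f}  total ${total:.2f}")
```

At 1K queries/day this reproduces the split in the table: roughly half a cent of embeddings against $105 of LLM inference.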
Layer 1: Embedding Cost (Negligible)
Embeddings are the cheapest part of any RAG system:
| Embedding model | Price/1M tokens | 50K docs (500 tok avg) | 100K queries/day |
|---|---|---|---|
| OpenAI text-embedding-3-small | $0.020 | $0.50 one-time | $0.60/mo |
| OpenAI text-embedding-3-large | $0.130 | $3.25 one-time | $3.90/mo |
| Cohere Embed v3 | $0.100 | $2.50 one-time | $3.00/mo |
| Mistral Embed | $0.100 | $2.50 one-time | $3.00/mo |
Use text-embedding-3-small for almost everything. The small vs large quality difference is marginal for most retrieval tasks, and the cost difference is 6.5×.
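To see how negligible one-time ingestion is, you can compute it directly. A minimal sketch, using the corpus size and prices from the table above (the function name is ours):

```python
# One-time cost to embed a corpus at ingestion time.
def ingestion_cost(num_docs, avg_tokens_per_doc, price_per_m_tokens):
    """Total dollars to embed the whole corpus once."""
    return num_docs * avg_tokens_per_doc * price_per_m_tokens / 1_000_000

small = ingestion_cost(50_000, 500, 0.020)  # text-embedding-3-small
large = ingestion_cost(50_000, 500, 0.130)  # text-embedding-3-large
print(f"small: ${small:.2f}, large: ${large:.2f}")  # small: $0.50, large: $3.25
```

Even the 6.5× pricier large model costs under $4 to embed 25M tokens of documents, which is why this layer rounds to ~0% of total.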
Layer 2: Vector Database Cost
| Vector DB | Monthly cost | Vectors stored | Best for |
|---|---|---|---|
| pgvector (self-hosted) | $0 | RAM-limited | Teams already on Postgres; up to ~1M vectors |
| Supabase (pgvector) | $25/mo | ~1M vectors | Managed Postgres + vector search; easy setup |
| Pinecone Starter | $70/mo | 5M vectors | Production RAG at small-to-mid scale |
| Pinecone Standard | $0.096/1M vectors | Scales | Large-scale retrieval (10M+ vectors) |
| Weaviate Cloud | $25–$300+/mo | Scales | Enterprise, hybrid search, multi-tenancy |
| ChromaDB (local) | $0 | Limited by disk | Dev/testing; not production at scale |
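The "~1M vectors" ceiling for self-hosted pgvector comes down to RAM. A rough sizing sketch, assuming 1536-dim float4 vectors and ignoring index overhead (HNSW/IVFFlat indexes add more, so treat this as a floor; the per-row header size is an approximation):

```python
# Rough RAM floor for self-hosted pgvector: dims float4 values
# per vector plus a small assumed per-row header. Index structures
# (HNSW/IVFFlat) add further overhead not counted here.
def pgvector_ram_gb(num_vectors, dims=1536, row_overhead_bytes=8):
    bytes_per_vector = dims * 4 + row_overhead_bytes
    return num_vectors * bytes_per_vector / 1024**3

print(f"~{pgvector_ram_gb(1_000_000):.1f} GB for 1M x 1536-dim vectors")
```

At roughly 6 GB for 1M vectors, a modest Postgres instance handles it; past ~10M vectors, managed options like Pinecone Standard start to make sense.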
Layer 3: LLM Inference Cost (The Big One)
RAG retrieves 3–5 chunks (typically 300–500 tokens each) and passes them to the LLM alongside the user query. Typical RAG query: 1,500–3,000 input tokens + 200–500 output tokens.
| Model | Cost per RAG query (2K in / 300 out) | 10K queries/day | 100K queries/day |
|---|---|---|---|
| Gemini 2.5 Flash-Lite | $0.000320 | $96/mo | $960/mo |
| GPT-5.4 nano | $0.000775 | $233/mo | $2,325/mo |
| Claude Haiku 4.5 | $0.003500 | $1,050/mo | $10,500/mo |
| Claude Sonnet 4.6 | $0.010500 | $3,150/mo | $31,500/mo |
At 100K queries/day, Flash-Lite saves $30,540/mo vs Sonnet ($31,500 − $960). Model selection is the single biggest cost decision at scale.
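The per-query figures in the table follow from per-token rates. A sketch that recomputes them; the per-million-token prices here are inferred to match the table's per-query numbers, so verify them against current vendor pricing before relying on them:

```python
# Per-query and monthly LLM cost for the RAG profile above
# (2K input / 300 output tokens). Rates inferred from the table.
PRICES = {                       # ($/1M input, $/1M output)
    "gemini-2.5-flash-lite": (0.10, 0.40),
    "claude-haiku-4.5":      (1.00, 5.00),
    "claude-sonnet-4.6":     (3.00, 15.00),
}

def per_query_cost(model, in_tok=2000, out_tok=300):
    pin, pout = PRICES[model]
    return (in_tok * pin + out_tok * pout) / 1_000_000

for model in PRICES:
    q = per_query_cost(model)
    monthly = q * 100_000 * 30   # 100K queries/day for 30 days
    print(f"{model:24s} ${q:.6f}/query  ${monthly:>9,.0f}/mo")
```

This reproduces the table exactly: $0.00032, $0.0035, and $0.0105 per query, i.e. $960, $10,500, and $31,500 per month at 100K queries/day.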
Full RAG Stack Cost: Small App Example
| Component | Monthly cost (1K queries/day) |
|---|---|
| Embedding (initial ingestion of 10K docs) | $0.10 one-time |
| Embedding (1K queries/day × 10 tokens) | $0.006/mo |
| Vector DB (pgvector on existing Postgres) | $0 |
| LLM inference (Claude Haiku 4.5, 2K+300 tokens) | $105/mo |
| App hosting | $20/mo |
| Total | ~$125/mo |
Embeddings + vector DB = $0.006/mo total; the Haiku LLM alone is $105/mo. Excluding hosting, the AI spend is 99.9% LLM and 0.1% everything else.
RAG Cost Optimization
- Reduce retrieved chunk count: Retrieving 3 chunks instead of 5 cuts input tokens by ~600. At Sonnet, that's $0.0018 saved per query — $1,800/month at 1M queries/month.
- Chunk size optimization: Smaller, more precise chunks mean fewer tokens passed to LLM. 256-token chunks vs 512-token chunks halves retrieval context.
- Prompt caching on system prompt: Cache your RAG instructions and context template as a system prompt prefix. Saves 90% on the fixed-cost portion of every query.
- Model routing: Use a cheap classifier to determine query complexity. Simple factual queries → Flash-Lite; complex multi-hop reasoning → Haiku or Sonnet.
- Response caching: Cache LLM responses for common or repeated questions. FAQ-type RAG systems can cache 20–40% of queries, dramatically reducing LLM calls.
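The response-caching tactic above needs only a few lines. A minimal in-memory sketch: `cached_answer` and `call_llm` are hypothetical names standing in for your actual RAG pipeline, and normalizing by lowercasing/stripping is a deliberately naive matching strategy (production systems often use semantic similarity instead):

```python
import hashlib

# Minimal exact-match response cache: normalize the query, hash it,
# and only pay for an LLM call on a cache miss.
_cache = {}

def cached_answer(query, call_llm):
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(query)  # paid LLM call happens only here
    return _cache[key]

# Demo with a stub in place of a real LLM call.
calls = 0
def fake_llm(q):
    global calls
    calls += 1
    return f"answer to: {q}"

cached_answer("What is RAG?", fake_llm)
cached_answer("what is rag?  ", fake_llm)  # normalized duplicate: cache hit
print(calls)  # 1
```

If 20–40% of traffic is repeated questions, this alone cuts the dominant LLM line item by the same fraction.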
Calculate Your RAG System Cost
Enter your query volume and model to see your monthly LLM inference cost for your RAG application.
AI API Cost Calculator