What Is RAG Cost?
Retrieval-Augmented Generation Pricing Explained
A RAG (Retrieval-Augmented Generation) system has three cost layers: embedding, vector database, and LLM inference. The LLM typically accounts for 90–99% of total RAG cost. Here's what each layer actually costs and how to minimize the biggest driver. Last verified: 2026-04-01.
RAG Architecture: Three Cost Layers
| Layer | What it does | Typical cost | % of total |
|---|---|---|---|
| Embedding (ingestion) | Converts documents to vectors at ingestion time (one-time cost) | $0.02/M tokens (OpenAI text-embedding-3-small) | ~0% |
| Embedding (query) | Converts user query to vector for similarity search (per query) | ~10 tokens × $0.02/M = $0.0000002/query | <0.1% |
| Vector DB | Stores and searches vectors; returns top-K relevant chunks | $0 (pgvector) to $70+/mo (Pinecone Starter) | 1–10% |
| LLM inference | Receives retrieved chunks + user query; generates answer | $0.001–$0.05+ per query depending on model | 90–99% |
Optimize your LLM model choice first. Embedding and vector DB costs are nearly irrelevant by comparison: the LLM is where almost all the money goes.
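The table above can be turned into a back-of-envelope cost model. This is a sketch using illustrative prices from this article (text-embedding-3-small for query embeddings, Haiku-class LLM rates); `monthly_rag_cost` and its parameters are hypothetical names, not any vendor's API:

```python
# Back-of-envelope monthly RAG cost by layer. Prices are the
# article's illustrative figures, not live vendor quotes.
EMBED_PRICE_PER_M = 0.02   # $/1M tokens (text-embedding-3-small)
LLM_IN_PER_M = 1.00        # $/1M input tokens (Haiku-class, assumed)
LLM_OUT_PER_M = 5.00       # $/1M output tokens (assumed)

def monthly_rag_cost(queries_per_day, in_tokens=2000, out_tokens=300,
                     query_embed_tokens=10, vector_db=0.0, days=30):
    """Return (embedding, vector_db, llm, total) in dollars per month."""
    q = queries_per_day * days
    embed = q * query_embed_tokens * EMBED_PRICE_PER_M / 1_000_000
    llm = q * (in_tokens * LLM_IN_PER_M + out_tokens * LLM_OUT_PER_M) / 1_000_000
    return embed, vector_db, llm, embed + vector_db + llm

embed, db, llm, total = monthly_rag_cost(1_000)
print(f"embedding ${embed:.3f}  vector DB ${db:.2f}  "
      f"LLM ${llm:.2f}  total ${total:.2f}")
```

At 1K queries/day this reproduces the split in the table: roughly half a cent of embeddings against $105 of LLM inference.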
Layer 1: Embedding Cost (Negligible)
Embeddings are the cheapest part of any RAG system:
| Embedding model | Price/1M tokens | 50K docs (500 tok avg) | 100K queries/day |
|---|---|---|---|
| OpenAI text-embedding-3-small | $0.020 | $0.50 one-time | $0.60/mo |
| OpenAI text-embedding-3-large | $0.130 | $3.25 one-time | $3.90/mo |
| Cohere Embed v3 | $0.100 | $2.50 one-time | $3.00/mo |
| Mistral Embed | $0.100 | $2.50 one-time | $3.00/mo |
Use text-embedding-3-small for almost everything. The small vs large quality difference is marginal for most retrieval tasks, and the cost difference is 6.5×.
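To see how negligible one-time ingestion is, you can compute it directly. A minimal sketch, using the corpus size and prices from the table above (the function name is ours):

```python
# One-time cost to embed a corpus at ingestion time.
def ingestion_cost(num_docs, avg_tokens_per_doc, price_per_m_tokens):
    """Total dollars to embed the whole corpus once."""
    return num_docs * avg_tokens_per_doc * price_per_m_tokens / 1_000_000

small = ingestion_cost(50_000, 500, 0.020)  # text-embedding-3-small
large = ingestion_cost(50_000, 500, 0.130)  # text-embedding-3-large
print(f"small: ${small:.2f}, large: ${large:.2f}")  # small: $0.50, large: $3.25
```

Even the 6.5× pricier large model costs under $4 to embed 25M tokens of documents, which is why this layer rounds to ~0% of total.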
Layer 2: Vector Database Cost
| Vector DB | Monthly cost | Vectors stored | Best for |
|---|---|---|---|
| pgvector (self-hosted) | $0 | RAM-limited | Teams already on Postgres; up to ~1M vectors |
| Supabase (pgvector) | $25/mo | ~1M vectors | Managed Postgres + vector search; easy setup |
| Pinecone Starter | $70/mo | 5M vectors | Production RAG at small-to-mid scale |
| Pinecone Standard | $0.096/1M vectors | Scales | Large-scale retrieval (10M+ vectors) |
| Weaviate Cloud | $25–$300+/mo | Scales | Enterprise, hybrid search, multi-tenancy |
| ChromaDB (local) | $0 | Limited by disk | Dev/testing; not production at scale |
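The "~1M vectors" ceiling for self-hosted pgvector comes down to RAM. A rough sizing sketch, assuming 1536-dim float4 vectors and ignoring index overhead (HNSW/IVFFlat indexes add more, so treat this as a floor; the per-row header size is an approximation):

```python
# Rough RAM floor for self-hosted pgvector: dims float4 values
# per vector plus a small assumed per-row header. Index structures
# (HNSW/IVFFlat) add further overhead not counted here.
def pgvector_ram_gb(num_vectors, dims=1536, row_overhead_bytes=8):
    bytes_per_vector = dims * 4 + row_overhead_bytes
    return num_vectors * bytes_per_vector / 1024**3

print(f"~{pgvector_ram_gb(1_000_000):.1f} GB for 1M x 1536-dim vectors")
```

At roughly 6 GB for 1M vectors, a modest Postgres instance handles it; past ~10M vectors, managed options like Pinecone Standard start to make sense.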
Layer 3: LLM Inference Cost (The Big One)
RAG retrieves 3–5 chunks (typically 300–500 tokens each) and passes them to the LLM alongside the user query. Typical RAG query: 1,500–3,000 input tokens + 200–500 output tokens.
| Model | Cost per RAG query (2K in / 300 out) | 10K queries/day | 100K queries/day |
|---|---|---|---|
| Gemini 2.5 Flash-Lite | $0.000320 | $96/mo | $960/mo |
| GPT-5.4 nano | $0.000775 | $233/mo | $2,325/mo |
| Claude Haiku 4.5 | $0.003500 | $1,050/mo | $10,500/mo |
| Claude Sonnet 4.6 | $0.010500 | $3,150/mo | $31,500/mo |
At 100K queries/day, Flash-Lite saves $30,540/mo vs Sonnet ($31,500 − $960). Model selection is the single biggest cost decision at scale.
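The per-query figures in the table follow from per-token rates. A sketch that recomputes them; the per-million-token prices here are inferred to match the table's per-query numbers, so verify them against current vendor pricing before relying on them:

```python
# Per-query and monthly LLM cost for the RAG profile above
# (2K input / 300 output tokens). Rates inferred from the table.
PRICES = {                       # ($/1M input, $/1M output)
    "gemini-2.5-flash-lite": (0.10, 0.40),
    "claude-haiku-4.5":      (1.00, 5.00),
    "claude-sonnet-4.6":     (3.00, 15.00),
}

def per_query_cost(model, in_tok=2000, out_tok=300):
    pin, pout = PRICES[model]
    return (in_tok * pin + out_tok * pout) / 1_000_000

for model in PRICES:
    q = per_query_cost(model)
    monthly = q * 100_000 * 30   # 100K queries/day for 30 days
    print(f"{model:24s} ${q:.6f}/query  ${monthly:>9,.0f}/mo")
```

This reproduces the table exactly: $0.00032, $0.0035, and $0.0105 per query, i.e. $960, $10,500, and $31,500 per month at 100K queries/day.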
Full RAG Stack Cost: Small App Example
| Component | Monthly cost (1K queries/day) |
|---|---|
| Embedding (initial ingestion of 10K docs) | $0.10 one-time |
| Embedding (1K queries/day × 10 tokens) | $0.006/mo |
| Vector DB (pgvector on existing Postgres) | $0 |
| LLM inference (Claude Haiku 4.5, 2K+300 tokens) | $105/mo |
| App hosting | $20/mo |
| Total | ~$125/mo |
Embeddings + vector DB = $0.006/mo total; the Haiku LLM alone is $105/mo. Excluding hosting, the AI spend is 99.9% LLM and 0.1% everything else.
RAG Cost Optimization
- Reduce retrieved chunk count: Retrieving 3 chunks instead of 5 cuts input tokens by ~600. At Sonnet, that's $0.0018 saved per query — $1,800/month at 1M queries/month.
- Chunk size optimization: Smaller, more precise chunks mean fewer tokens passed to LLM. 256-token chunks vs 512-token chunks halves retrieval context.
- Prompt caching on system prompt: Cache your RAG instructions and context template as a system prompt prefix. Saves 90% on the fixed-cost portion of every query.
- Model routing: Use a cheap classifier to determine query complexity. Simple factual queries → Flash-Lite; complex multi-hop reasoning → Haiku or Sonnet.
- Response caching: Cache LLM responses for common or repeated questions. FAQ-type RAG systems can cache 20–40% of queries, dramatically reducing LLM calls.
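The response-caching tactic above needs only a few lines. A minimal in-memory sketch: `cached_answer` and `call_llm` are hypothetical names standing in for your actual RAG pipeline, and normalizing by lowercasing/stripping is a deliberately naive matching strategy (production systems often use semantic similarity instead):

```python
import hashlib

# Minimal exact-match response cache: normalize the query, hash it,
# and only pay for an LLM call on a cache miss.
_cache = {}

def cached_answer(query, call_llm):
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(query)  # paid LLM call happens only here
    return _cache[key]

# Demo with a stub in place of a real LLM call.
calls = 0
def fake_llm(q):
    global calls
    calls += 1
    return f"answer to: {q}"

cached_answer("What is RAG?", fake_llm)
cached_answer("what is rag?  ", fake_llm)  # normalized duplicate: cache hit
print(calls)  # 1
```

If 20–40% of traffic is repeated questions, this alone cuts the dominant LLM line item by the same fraction.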
Calculate Your RAG System Cost
Enter your query volume and model to see your monthly LLM inference cost for your RAG application.
AI API Cost Calculator