AI RAG System Cost 2026: Embeddings, Vector DB, LLM Retrieval Pricing

What Makes Up a RAG System's Cost?

A RAG pipeline has 4 cost components:

Embedding (ingestion): Converting documents to vectors — one-time + incremental
Vector database: Storing and searching vectors — monthly recurring
Retrieval (query embeddings): Embedding each user query — per-query cost
LLM generation: Generating the final answer with retrieved context — per-query cost

Model	Price per 1M tokens	Dimensions	Notes
OpenAI text-embedding-3-small	$0.020	1,536	Best price/performance ratio
OpenAI text-embedding-3-large	$0.130	3,072	Higher quality for complex search
Google text-embedding-004	$0.000	768	Free via Vertex AI (limits apply)
Cohere embed-english-v3	$0.100	1,024	Optimized for semantic search
Voyage AI voyage-3	$0.060	1,024	Strong multilingual support
BAAI/bge-large (self-hosted)	$0.000	1,024	Free, GPU/CPU inference required
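
As a rough illustration of how the per-token prices above turn into a one-time ingestion bill, the sketch below multiplies corpus size by the listed price. The prices are copied from the table. The function name `ingestion_cost` and the corpus size (50,000 documents of 500 tokens each) are assumptions made for this example.

```python
# Prices in USD per 1M tokens, copied from the table above.
EMBEDDING_PRICE_PER_1M = {
    "text-embedding-3-small": 0.020,
    "text-embedding-3-large": 0.130,
    "embed-english-v3": 0.100,
    "voyage-3": 0.060,
}


def ingestion_cost(num_docs: int, tokens_per_doc: int, price_per_1m: float) -> float:
    """Return the one-time cost in USD of embedding a whole corpus."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1_000_000 * price_per_1m


if __name__ == "__main__":
    # Hypothetical corpus: 50,000 documents averaging 500 tokens each.
    for model, price in EMBEDDING_PRICE_PER_1M.items():
        print(f"{model}: ${ingestion_cost(50_000, 500, price):.2f}")
```

For this corpus of 25M tokens, the gap between the cheapest and most expensive hosted model is $0.50 versus $3.25, so at small scale the choice of embedding model barely moves the total.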

Database	Free Tier	Starter	Growth	Free tier limits
Pinecone	Yes	$70/mo	$700+/mo	1 index, 100K vectors
Weaviate Cloud	Yes	$25/mo	Custom	1M vectors (sandbox)
Qdrant Cloud	Yes	$9/mo	$25+/mo	1GB storage free
Chroma (self-hosted)	Free	$0	$0	Unlimited (own infrastructure)
pgvector (PostgreSQL)	Free	$0	$0	Unlimited (own infrastructure)
Supabase pgvector	Yes	$25/mo	$599+/mo	500MB storage free
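
Free tiers are quoted partly in vectors and partly in storage, so it can help to convert between the two. The sketch below estimates raw vector size only. It assumes 4-byte float32 values and a decimal gigabyte (10^9 bytes), and it ignores index structures, metadata, and payloads, all of which add overhead in practice. The function names are made up for this example, and providers may measure storage differently.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw size of the vectors alone (assumes float32, no index or metadata)."""
    return num_vectors * dimensions * bytes_per_value


def vectors_that_fit(storage_bytes: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """How many raw vectors of a given dimensionality fit in a storage budget."""
    return storage_bytes // (dimensions * bytes_per_value)


if __name__ == "__main__":
    one_gb = 10**9  # assumption: decimal gigabyte
    for dims in (768, 1024, 1536, 3072):
        print(f"{dims} dims: about {vectors_that_fit(one_gb, dims):,} vectors per GB")
    size_gb = raw_vector_bytes(1_000_000, 1536) / one_gb
    print(f"1M vectors at 1,536 dims: about {size_gb:.2f} GB raw")
```

Under these assumptions, a 1GB allowance holds roughly 160K vectors at 1,536 dimensions, and 1M such vectors need about 6GB before any index overhead.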

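Example: 50K documents, 10K queries per month

Embeddings (initial, text-embedding-3-small): 50K docs × 500 tokens = 25M tokens × $0.020/M = $0.50 one-time
Query embeddings: 10K queries × 200 tokens = 2M tokens × $0.020/M = $0.04/month
Vector DB: Qdrant free tier = $0/month
LLM generation (Gemini 2.5 Flash-Lite, ~1K tokens/query avg): 10K queries × 1K tokens = 10M tokens × $0.40/M = $4/month (every token priced at the $0.40/M output rate, so this is an upper bound)
Total: ~$4/month, plus $0.50 one-time for ingestion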
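
The recurring part of the arithmetic above can be reproduced with a small helper. This is a minimal sketch, not a pricing tool: the names `MonthlyCost` and `estimate_monthly_cost` are invented for this example, and the inputs are the figures already used in the text (10K queries, 200 tokens per query embedding, about 1K LLM tokens per query at $0.40/M).

```python
from dataclasses import dataclass


@dataclass
class MonthlyCost:
    """Recurring monthly cost in USD, split by component."""
    query_embeddings: float
    vector_db: float
    llm_generation: float

    @property
    def total(self) -> float:
        return self.query_embeddings + self.vector_db + self.llm_generation


def estimate_monthly_cost(
    queries_per_month: int,
    embed_tokens_per_query: int,
    embed_price_per_1m: float,
    llm_tokens_per_query: int,
    llm_price_per_1m: float,
    vector_db_fee: float,
) -> MonthlyCost:
    """Estimate recurring costs; one-time ingestion is not included."""
    embed_tokens = queries_per_month * embed_tokens_per_query
    llm_tokens = queries_per_month * llm_tokens_per_query
    return MonthlyCost(
        query_embeddings=embed_tokens / 1_000_000 * embed_price_per_1m,
        vector_db=vector_db_fee,
        llm_generation=llm_tokens / 1_000_000 * llm_price_per_1m,
    )


if __name__ == "__main__":
    cost = estimate_monthly_cost(10_000, 200, 0.020, 1_000, 0.40, 0.0)
    print(f"query embeddings: ${cost.query_embeddings:.2f}")
    print(f"LLM generation:   ${cost.llm_generation:.2f}")
    print(f"total:            ${cost.total:.2f}")
```

The breakdown shows that LLM generation accounts for nearly all of the recurring cost at this scale, while query embeddings cost only a few cents.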

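Example: 1M documents, 100K queries per month

Embeddings (initial, text-embedding-3-small): 1M docs × 500 tokens = 500M tokens × $0.020/M = $10 one-time
Query embeddings: 100K queries × 200 tokens = 20M tokens × $0.020/M = $0.40/month
Vector DB: Pinecone Starter = $70/month
LLM generation (Gemini 2.5 Flash-Lite, $0.10/M input, $0.40/M output): 100K queries × ~1K tokens = 100M tokens ≈ $25/month at a blended rate of about $0.25/M
Total: ~$95/month, plus $10 one-time for ingestion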
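
The ≈$25 LLM figure depends on how the 100M tokens divide between input and output, which the text does not state. The sketch below shows the blended-rate arithmetic. An even 50/50 split reproduces $25, but that split is an assumption of this example. The 90/10 case is a second hypothetical, included because RAG prompts tend to be input-heavy. The function name `llm_cost` is invented for this example.

```python
def llm_cost(
    total_tokens: int,
    input_share: float,
    input_price_per_1m: float,
    output_price_per_1m: float,
) -> float:
    """Monthly LLM cost in USD for a given input/output token split."""
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens - input_tokens
    return (
        input_tokens * input_price_per_1m + output_tokens * output_price_per_1m
    ) / 1_000_000


if __name__ == "__main__":
    total = 100_000_000  # 100M tokens per month, as in the example above
    for share in (0.5, 0.9):  # assumed input shares, not stated in the text
        cost = llm_cost(total, share, 0.10, 0.40)
        print(f"{share:.0%} input: ${cost:.2f}/month")
```

With the listed prices, a 90% input share gives about $13/month rather than $25, so the split is worth measuring on real traffic before budgeting.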

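Ways to Reduce RAG Costs

Use pgvector instead of Pinecone: removes the $70–$700+/month managed-database fee listed above, although you still pay for the PostgreSQL instance that hosts it
Cache frequent queries: in many workloads 30–50% of queries are identical, so cache both the query embedding and the LLM answer
Reduce retrieved chunks: fetching top-3 instead of top-10 cuts retrieved-context tokens by 70% (3 chunks instead of 10, assuming similar chunk sizes)
Use small embedding models: text-embedding-3-small ($0.020/M) costs about 85% less than text-embedding-3-large ($0.130/M), usually with minimal quality loss
Re-rank with a cheap model: use Cohere Rerank or a cross-encoder to trim context before the expensive LLM call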
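
The caching tip can be sketched as an in-memory, exact-match cache placed in front of the pipeline. This is a minimal illustration under stated assumptions: `QueryCache` and `answer_fn` are hypothetical names, `answer_fn` stands in for the full embed, retrieve, and generate path, and a production setup would more likely use a shared store with expiry.

```python
import hashlib
from typing import Callable, Dict


class QueryCache:
    """Exact-match cache keyed on a normalized form of the query text."""

    def __init__(self, answer_fn: Callable[[str], str]) -> None:
        self._answer_fn = answer_fn  # hypothetical: the expensive RAG call
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str) -> str:
        # Lowercase and collapse whitespace so trivial variants share a key.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def answer(self, query: str) -> str:
        key = self._key(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._answer_fn(query)  # only paid for on a cache miss
        self._store[key] = result
        return result


if __name__ == "__main__":
    cache = QueryCache(lambda q: f"answer to: {q}")  # stand-in pipeline
    for q in ["What is RAG?", "what is  rag?", "How much does it cost?"]:
        cache.answer(q)
    print(f"hits={cache.hits} misses={cache.misses}")
```

Each hit skips the query embedding, the vector search, and the LLM call, so the saving scales directly with the share of repeated queries.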