AI RAG System Cost 2026: Embeddings, Vector DBs, and LLM Retrieval Pricing
Complete cost breakdown for Retrieval-Augmented Generation (RAG) systems in 2026: embedding API costs, vector database pricing (Pinecone, Weaviate, pgvector), retrieval costs, and full-stack monthly budgets.
Updated March 2026
RAG Cost Components
- $0.02 per 1M tokens: OpenAI embeddings (text-embedding-3-small)
- $70–$700/mo: Pinecone vector DB
- $0 software cost: pgvector (self-hosted; you still pay for the server)
- $50–$500/mo: full RAG system (typical)
What Makes Up a RAG System's Cost?
A RAG pipeline has 4 cost components:
- Embedding (ingestion): Converting documents to vectors — one-time + incremental
- Vector database: Storing and searching vectors — monthly recurring
- Retrieval (query embeddings): Embedding each user query — per-query cost
- LLM generation: Generating the final answer with retrieved context — per-query cost
Embedding API Pricing (2026)
| Model | Price per 1M tokens | Dimensions | Notes |
|---|---|---|---|
| OpenAI text-embedding-3-small | $0.020 | 1,536 | Best price/performance ratio |
| OpenAI text-embedding-3-large | $0.130 | 3,072 | Higher quality for complex search |
| Google text-embedding-004 | $0.000 (free tier) | 768 | Free tier via the Gemini API (rate limits apply); Vertex AI usage is billed |
| Cohere embed-english-v3 | $0.100 | 1,024 | Optimized for semantic search |
| Voyage AI voyage-3 | $0.060 | 1,024 | Strong multilingual support |
| BAAI/bge-large (self-hosted) | $0.000 | 1,024 | No API fee; you pay for your own GPU/CPU inference |
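Because every price in the table is quoted per 1M tokens, the one-time cost of embedding a corpus is just total tokens divided by one million, times the rate. A minimal sketch, assuming you know (or estimate) the average tokens per document; the prices are copied from the table above, and `embedding_cost` is a hypothetical helper name, not part of any SDK.

```python
# USD per 1M tokens, copied from the table above. Self-hosted models are
# omitted because their real cost is your own GPU/CPU bill, not an API fee.
PRICE_PER_M_TOKENS = {
    "text-embedding-3-small": 0.020,
    "text-embedding-3-large": 0.130,
    "embed-english-v3": 0.100,
    "voyage-3": 0.060,
}


def embedding_cost(num_docs: int, tokens_per_doc: int, price_per_m: float) -> float:
    """One-time API cost (USD) of embedding num_docs documents."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1_000_000 * price_per_m


if __name__ == "__main__":
    # Assumed corpus: 50K documents averaging 500 tokens each (25M tokens).
    for model, price in PRICE_PER_M_TOKENS.items():
        print(f"{model:24s} ${embedding_cost(50_000, 500, price):.2f}")
```

At this corpus size the spread between the cheapest and most expensive hosted model is a few dollars, so model choice matters more for quality and re-embedding frequency than for the initial ingestion bill.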
Vector Database Pricing (2026)
| Database | Free tier | Starter | Growth | Free-tier limit |
|---|---|---|---|---|
| Pinecone | Yes | $70/mo | $700+/mo | 1 index, 100K vectors |
| Weaviate Cloud | Yes | $25/mo | Custom | 1M vectors (sandbox) |
| Qdrant Cloud | Yes | $9/mo | $25+/mo | 1 GB storage |
| Chroma (self-hosted) | Open source | $0 + infrastructure | $0 + infrastructure | No limit beyond your own infrastructure |
| pgvector (PostgreSQL) | Open source | $0 + infrastructure | $0 + infrastructure | No limit beyond your own infrastructure |
| Supabase pgvector | Yes | $25/mo | $599+/mo | 500 MB storage |
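Some free tiers cap vector count and others cap storage. A quick way to see which cap you will hit is to estimate the raw size of your embeddings: each float32 dimension takes 4 bytes. A minimal sketch, assuming uncompressed float32 vectors and decimal megabytes; it ignores index structures (e.g. HNSW graphs), IDs, and metadata, which all add overhead, so the result is a lower bound. The helper names are hypothetical.

```python
BYTES_PER_FLOAT32 = 4
MB = 1_000_000      # decimal megabyte (assumption; some providers use MiB)
GB = 1_000 * MB


def raw_vector_bytes(num_vectors: int, dimensions: int) -> int:
    """Uncompressed float32 payload only: no index, IDs, or metadata."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32


def fits_in_limit(num_vectors: int, dimensions: int, limit_bytes: int) -> bool:
    """False means the raw vectors alone exceed the limit.

    True only means the raw vectors fit; index overhead may still push
    the real footprint over the limit.
    """
    return raw_vector_bytes(num_vectors, dimensions) <= limit_bytes


if __name__ == "__main__":
    for dims in (768, 1_024, 1_536, 3_072):
        size_mb = raw_vector_bytes(100_000, dims) / MB
        print(f"100K vectors x {dims:>5} dims: {size_mb:8.1f} MB raw, "
              f"fits 500 MB: {fits_in_limit(100_000, dims, 500 * MB)}, "
              f"fits 1 GB: {fits_in_limit(100_000, dims, 1 * GB)}")
```

For example, 100K vectors from text-embedding-3-small (1,536 dimensions) already need about 614 MB before any index overhead, more than a 500 MB free tier, while the same count at 768 dimensions needs about 307 MB.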
Full RAG System Monthly Cost Examples
Small RAG app (10,000 queries/month, 50K documents indexed)
- Embeddings (initial): 50K docs × 500 tokens = 25M tokens × $0.020 = $0.50 one-time
- Query embeddings: 10K × 200 tokens = 2M tokens × $0.020 = $0.04/month
- Vector DB: Qdrant free tier = $0/month
- LLM generation (GPT-4o mini, 1K tokens/query billed at the $0.15/M input rate): 10K × 1K = 10M tokens = $1.50/month (output tokens, billed at $0.60/M, add to this)
- Total: ~$2/month
Medium RAG app (100K queries/month, 1M documents)
- Embeddings (initial): 1M × 500 tokens = 500M tokens × $0.020 = $10 one-time
- Query embeddings: 100K × 200 tokens = 20M tokens × $0.020 = $0.40/month
- Vector DB: Pinecone Starter = $70/month
- LLM generation (GPT-4o mini, 1K tokens/query): 100K × 1K = 100M tokens × $0.15/M = $15/month
- Total: ~$85/month
Enterprise RAG (1M queries/month, 10M documents)
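- Vector DB: Pinecone or Weaviate dedicated = $500–$2,000/month
- Query embeddings: 1M × 200 tokens = 200M tokens × $0.020 = $4/month
- LLM generation (GPT-4o mini, 1K tokens/query): 1M × 1K = 1B tokens × $0.15/M = $150/month
- Embeddings refresh: ~$20/month (incremental)
- Total: ~$674–$2,174/month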
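All three scenarios use the same formula: query volume times tokens per query times price, plus flat monthly items. A minimal sketch that reproduces them, assuming text-embedding-3-small for query embeddings and billing all 1K LLM tokens per query at $0.15 per 1M (the GPT-4o mini input rate the examples imply; output tokens, which cost more, are not modeled). `RagWorkload` and `monthly_cost` are hypothetical names, and one-time initial ingestion is left out.

```python
from dataclasses import dataclass

# USD per 1M tokens (assumptions taken from the examples above).
EMBED_PRICE_PER_M = 0.020       # text-embedding-3-small
LLM_INPUT_PRICE_PER_M = 0.15    # GPT-4o mini input rate; output tokens not modeled


@dataclass
class RagWorkload:
    queries_per_month: int
    query_tokens: int = 200                 # tokens embedded per user query
    llm_tokens_per_query: int = 1_000       # prompt + retrieved chunks, billed as input
    vector_db_monthly: float = 0.0          # flat vector database bill
    embedding_refresh_monthly: float = 0.0  # incremental re-embedding of changed docs


def monthly_cost(w: RagWorkload) -> float:
    """Recurring monthly cost in USD (excludes one-time initial ingestion)."""
    query_embeddings = w.queries_per_month * w.query_tokens / 1_000_000 * EMBED_PRICE_PER_M
    llm_generation = w.queries_per_month * w.llm_tokens_per_query / 1_000_000 * LLM_INPUT_PRICE_PER_M
    return query_embeddings + llm_generation + w.vector_db_monthly + w.embedding_refresh_monthly


if __name__ == "__main__":
    scenarios = {
        "small": RagWorkload(queries_per_month=10_000),
        "medium": RagWorkload(queries_per_month=100_000, vector_db_monthly=70.0),
        "enterprise (low)": RagWorkload(
            queries_per_month=1_000_000, vector_db_monthly=500.0, embedding_refresh_monthly=20.0
        ),
    }
    for name, workload in scenarios.items():
        print(f"{name:18s} ${monthly_cost(workload):,.2f}/month")
```

With these defaults the small app comes to $1.54/month and the medium app to $85.40/month. For the enterprise tier the sketch also counts about $4/month of query embeddings (1M queries × 200 tokens), giving $674 at the $500 vector-DB end of the range. Note how the LLM line dominates everything except the vector database, which is why the optimizations below focus on tokens per query.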
Cost Optimization for RAG Systems
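- Use pgvector instead of Pinecone: avoids the $70–$700/month managed-database bill at most scales, though you still pay for the PostgreSQL server that hosts it
- Cache frequent queries: in many apps 30–50% of queries are identical, so caching the query embedding and the LLM answer skips both calls on a repeat
- Reduce retrieved chunks: fetching the top 3 chunks instead of the top 10 cuts retrieved-context tokens by 70% (the system prompt and user question stay the same size)
- Use small embedding models: text-embedding-3-small costs about 85% less than text-embedding-3-large ($0.020 vs $0.130 per 1M tokens), usually with only a small loss in retrieval quality
- Re-rank with a cheap model: use Cohere Rerank or a cross-encoder to trim the context before the expensive LLM call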
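Exact-match caching is the simplest form of the caching tip. A minimal sketch, assuming your whole pipeline (embed query, retrieve, generate) is wrapped in one function that takes a query string and returns an answer; `answer_fn` stands in for that function, and `AnswerCache` is a hypothetical helper, not a library class. Keys are normalized for case and whitespace only; semantic caching, which matches paraphrases by embedding similarity, is a different technique and is not shown.

```python
from collections import OrderedDict
from typing import Callable


class AnswerCache:
    """Exact-match LRU cache in front of a RAG pipeline (hypothetical helper)."""

    def __init__(self, answer_fn: Callable[[str], str], max_entries: int = 10_000):
        self.answer_fn = answer_fn          # placeholder: embed + retrieve + generate
        self.max_entries = max_entries
        self._store = OrderedDict()         # normalized query -> cached answer
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str) -> str:
        # Lowercase and collapse whitespace; no semantic matching.
        return " ".join(query.lower().split())

    def answer(self, query: str) -> str:
        key = self._key(query)
        if key in self._store:
            self._store.move_to_end(key)    # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self.answer_fn(query)      # only misses pay for embedding + LLM
        self._store[key] = result
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used entry
        return result


if __name__ == "__main__":
    calls = []

    def fake_pipeline(q: str) -> str:
        calls.append(q)
        return f"answer to {q!r}"

    cache = AnswerCache(fake_pipeline)
    for q in ["What is RAG?", "what is  rag?", "How much does Pinecone cost?"]:
        cache.answer(q)
    print(cache.hits, cache.misses, len(calls))  # prints: 1 2 2
```

Only cache misses pay for the query embedding and the LLM call, so the hit rate translates roughly one-for-one into savings on those two line items; the vector database bill is unaffected.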