What Is a Context Window?
LLM Memory Limits and Cost Implications Explained
A context window is the maximum number of tokens an AI model can process in a single request — its working memory. This guide compares context window limits across production models, explains why context size drives cost, and shows how to work within or around those limits. Last verified: 2026-04-01.
What Is a Context Window?
Think of the context window as the AI model's working memory for a single conversation or task. It encompasses everything the model can "see" and reason over at once:
- Your system prompt (instructions and persona)
- The full conversation history (all previous messages in a chat)
- Any documents or data you inject (via RAG or direct paste)
- The current user message
- The model's response (which also consumes tokens)
When the total exceeds the context window limit, the request fails — or older content gets truncated, degrading quality.
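A rough pre-flight check catches this before the request fails. The sketch below assumes the common ~4-characters-per-token rule of thumb for English prose; for exact counts, use your provider's tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English prose."""
    return len(text) // 4

def fits_in_context(system_prompt: str, history: list[str],
                    documents: list[str], user_message: str,
                    max_output_tokens: int, context_limit: int) -> bool:
    """Check that every component of the request, plus the budget
    reserved for the model's response, fits in the context window."""
    total = (estimate_tokens(system_prompt)
             + sum(estimate_tokens(m) for m in history)
             + sum(estimate_tokens(d) for d in documents)
             + estimate_tokens(user_message)
             + max_output_tokens)  # the response consumes tokens too
    return total <= context_limit
```

Note that the output budget counts against the limit: a 127K-token input with a 4K-token `max_output_tokens` will not fit in a 128K window.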
Context Window Sizes by Model (2026)
| Model | Context window | Approx. pages of text | Max document size |
|---|---|---|---|
| Gemini 2.5 Flash-Lite | 1M tokens | ~750 pages | Full codebase / book-length |
| Gemini 2.5 Flash | 1M tokens | ~750 pages | Full codebase / book-length |
| Gemini 2.5 Pro | 1M tokens | ~750 pages | Full codebase / book-length |
| Claude Haiku 4.5 | 200K tokens | ~150 pages | Long reports, mid-size codebases |
| Claude Sonnet 4.6 | 200K tokens | ~150 pages | Long reports, mid-size codebases |
| Claude Opus 4.6 | 200K tokens | ~150 pages | Long reports, mid-size codebases |
| GPT-5.4 nano | 128K tokens | ~96 pages | Short to medium documents |
| GPT-5.4 mini | 128K tokens | ~96 pages | Short to medium documents |
| GPT-5.4 | 1M tokens | ~750 pages | Full codebase / book-length |
| Mistral Small 3.2 | 128K tokens | ~96 pages | Short to medium documents |
Context Window vs Use Case
| Use Case | Tokens needed | Minimum context window | Which models work |
|---|---|---|---|
| Chatbot (5 turns) | ~3,500 | Any model | All models |
| 10-page PDF analysis | ~8,000 | 8K+ | All models |
| 50-page report | ~40,000 | 40K+ | All models (well within any) |
| 100-page report | ~80,000 | 80K+ | All (at 63% of 128K — approaching GPT/Mistral limit) |
| Full legal contract review (200 pages) | ~150,000 | 150K+ | Claude (200K) ✓, Gemini (1M) ✓ — GPT nano/mini ✗ |
| Full codebase (1,000 files) | ~500,000 | 500K+ | Gemini 2.5 Flash/Pro, GPT-5.4 (1M) only |
| Book-length analysis | ~400,000 | 400K+ | Gemini 2.5 Flash/Pro, GPT-5.4 only |
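The fit check in the table can be expressed as a small helper. The catalog below mirrors the context sizes listed above; the 10% headroom default is an illustrative safety margin for the system prompt and response, not a provider recommendation.

```python
# Context window sizes (tokens) from the comparison table above.
MODEL_CONTEXT = {
    "Gemini 2.5 Flash": 1_000_000,
    "Gemini 2.5 Pro": 1_000_000,
    "Claude Haiku 4.5": 200_000,
    "Claude Sonnet 4.6": 200_000,
    "GPT-5.4 nano": 128_000,
    "GPT-5.4": 1_000_000,
    "Mistral Small 3.2": 128_000,
}

def models_that_fit(tokens_needed: int, headroom: float = 0.1) -> list[str]:
    """Return models whose window covers the request plus a safety
    margin (default 10%) for the system prompt and the response."""
    required = int(tokens_needed * (1 + headroom))
    return [name for name, limit in MODEL_CONTEXT.items() if limit >= required]
```

For the 200-page contract review (~150K tokens), this returns the 200K and 1M models but excludes the 128K ones, matching the table.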
Context Window and Cost: The Relationship
A larger context window doesn't change your per-token price — but it changes how much you can spend per request. Sending 100K tokens of document context to Claude Sonnet 4.6 costs $0.30 just for the input, before any output.
In practice:
- Large context = large input cost — a 100K-token document at $3/M = $0.30/call in input alone
- Chatbot context grows with turns — a 30-turn conversation may accumulate 15K+ input tokens from history
- RAG limits context cost — instead of sending full documents, retrieve only the 3–5 relevant chunks (~2,000 tokens) via vector search
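The input-cost arithmetic above is one line of code:

```python
def input_cost(tokens: int, price_per_million: float) -> float:
    """Input cost in dollars for a single request."""
    return tokens / 1_000_000 * price_per_million

# 100K-token document sent to a $3/M-input model (e.g. Claude Sonnet 4.6):
cost = input_cost(100_000, 3.0)  # -> 0.30 per call, before any output
```

At 1,000 calls a day, that single document is $300/day in input tokens alone — which is why the strategies below focus on shrinking what you send.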
Context Window Strategies
1. Truncate conversation history
For chatbots, keep only the last N turns (3–5) in the context. Older turns rarely affect answer quality for most use cases, and keeping them makes input cost grow with every turn.
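A minimal sketch of this strategy, assuming the common messages-as-dicts chat format where each turn is one user message plus one assistant reply:

```python
def truncate_history(messages: list[dict], keep_turns: int = 4) -> list[dict]:
    """Keep the system prompt (if present) plus the last N turns;
    older turns are dropped to cap per-request input tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # One turn = one user message + one assistant reply (2 messages).
    return system + rest[-keep_turns * 2:]
```

The system prompt is always retained: dropping it would change the model's instructions, not just its memory of the conversation.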
2. Use RAG instead of full-document injection
Rather than injecting 50 pages into the context, use embeddings to retrieve the 3–5 most relevant passages (~2,000 tokens). This keeps context small, cost low, and often improves relevance vs. overwhelming the model with noise.
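The retrieval step reduces to ranking chunks by similarity to the query. The sketch below assumes embeddings have already been computed (by whichever embedding model you use) and ranks with plain cosine similarity; production systems typically use a vector database for this.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_chunks(query_vec: list[float],
                 chunks: list[tuple[str, list[float]]],
                 k: int = 3) -> list[str]:
    """chunks: (text, precomputed_embedding) pairs.
    Return the k chunk texts most similar to the query."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

Only the returned k chunks (roughly 2,000 tokens total) go into the prompt, instead of the full 50 pages.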
3. Match model to document size
Don't use Claude Sonnet 4.6 ($3/M) for short chatbot turns — use Claude Haiku 4.5 ($1/M) or GPT-5.4 nano ($0.20/M). Reserve large-context models for tasks that actually need it.
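This routing rule can be a simple dispatch on estimated input size. The thresholds and model names below are illustrative assumptions, not provider guidance; tune them against your own quality requirements.

```python
def pick_model(estimated_input_tokens: int) -> str:
    """Route requests to the cheapest model whose context comfortably
    covers the input; thresholds leave headroom below each window."""
    if estimated_input_tokens < 8_000:
        return "GPT-5.4 nano"      # $0.20/M input, 128K context
    if estimated_input_tokens < 180_000:
        return "Claude Haiku 4.5"  # $1/M input, 200K context
    return "Gemini 2.5 Flash"      # 1M context for very large inputs
```

A short chatbot turn (~3,500 tokens) routes to the cheapest tier; a full contract (~150K tokens) routes to a 200K-window model; only truly large inputs pay for a 1M-window model.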
4. Prompt caching for large repeated contexts
Claude's prompt caching lets you pay 90% less for re-reading the same context. If you inject the same 10,000-token document into every call for a given user session, caching that prefix at $0.10/M (vs $1.00/M uncached on Haiku) saves $0.009 per call — significant at high volume.
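The savings arithmetic, using the Haiku prices quoted above:

```python
def caching_savings_per_call(cached_tokens: int,
                             base_price_per_m: float,
                             cache_read_price_per_m: float) -> float:
    """Dollars saved per call by reading a cached prefix instead of
    paying the full input price for those tokens."""
    return cached_tokens / 1_000_000 * (base_price_per_m - cache_read_price_per_m)

# 10K-token prefix on Haiku: $1.00/M uncached vs $0.10/M cache read.
saved = caching_savings_per_call(10_000, 1.00, 0.10)  # -> 0.009 per call
```

At 100K calls/month against the same cached document, that is roughly $900/month saved on that prefix alone. (Cache writes typically cost more than plain input, so caching pays off only when the prefix is reused enough times — check your provider's pricing for the exact write surcharge and cache lifetime.)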
Calculate Your Context Cost
See exactly what your document size or conversation length will cost across all major models.
AI API Cost Calculator