Context Window

Ryan Rutan

The context window is the maximum number of tokens a large language model can process in a single request (prompt plus output combined). It is determined by the model's architecture and training. Everything the model can "see" for a query (instructions, examples, context, conversation history, reference documents) must fit within this token budget, which makes the limit one of the most consequential constraints in designing LLM applications. Think of it as the size of the model's working memory for any given request.

The token math:

1 token ≈ 0.75 English words (rough approximation).
1 token ≈ 4 characters of English text.

So a 100,000-token context window holds roughly 75,000 words, or about 300 pages of a typical book.
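The token math above can be sketched as a few helper functions. This is a rough estimator built only on the two rules of thumb from the text; the function names are made up for illustration, and a real tokenizer will give different counts, especially for code or non-English text.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb from the text, English only
CHARS_PER_TOKEN = 4     # rule of thumb from the text, English only


def estimate_tokens_from_words(word_count: int) -> int:
    """Estimate tokens from an English word count, rounded up."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


def estimate_tokens_from_text(text: str) -> int:
    """Estimate tokens from raw English text using the 4-characters rule."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_words_from_tokens(token_count: int) -> int:
    """Estimate how many English words fit in a token budget."""
    return round(token_count * WORDS_PER_TOKEN)


if __name__ == "__main__":
    # A 100,000-token window holds roughly 75,000 words.
    print(estimate_words_from_tokens(100_000))
```

Use estimates like these for planning only. For billing or hard limits, count tokens with the tokenizer your model provider actually uses.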

How context windows have grown (2020-2026):

Year  Notable model    Context window
2020  GPT-3            2,048 tokens
2022  GPT-3.5          4,096 tokens
2023  GPT-4            8K, then 32K, then 128K tokens
2023  Claude 2         100K tokens
2024  Claude 3.5       200K tokens
2024  Gemini 1.5 Pro   1M tokens (later 2M)
2025  Llama 4 Scout    10M tokens (open-weights leader)
2026  Frontier models  200K-10M+ tokens; window size no longer constrains most apps

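The growth: roughly 1,000x from GPT-3's 2,048 tokens to the 2M-token windows of today's frontier models in about five years, and nearly 5,000x if you count Llama 4 Scout's 10M. Context window size is no longer a major constraint for most applications.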

Why context window matters:

Document analysis: can the entire document fit, or do you need to chunk?

Conversation history: how far back can the model remember?

Few-shot examples: how many examples fit?

Multi-document reasoning: can the model see multiple sources at once?

Long-form output: output also counts against the context window.
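Two of the points above, conversation history and output counting against the window, come down to simple budgeting. Here is a minimal sketch, assuming you already have a token count for each message; the function names and the drop-oldest-first policy are illustrative choices, not a standard API.

```python
def max_output_tokens(context_window: int, prompt_tokens: int) -> int:
    """Tokens left for the model's reply once the prompt is counted."""
    return max(context_window - prompt_tokens, 0)


def trim_history(message_tokens: list[int], budget: int) -> list[int]:
    """Pick the most recent messages that fit in a token budget.

    message_tokens is ordered oldest to newest. Returns the indices of
    the messages to keep, in their original order.
    """
    kept = []
    used = 0
    # Walk backwards from the newest message and stop at the first
    # message that would push us over the budget.
    for index in range(len(message_tokens) - 1, -1, -1):
        if used + message_tokens[index] > budget:
            break
        used += message_tokens[index]
        kept.append(index)
    return sorted(kept)


if __name__ == "__main__":
    # A 3,500-token prompt in a 4,096-token window leaves 596 for output.
    print(max_output_tokens(4096, 3500))
    # Four messages, oldest first, with a 1,000-token history budget.
    print(trim_history([1000, 500, 300, 200], 1000))
```

Dropping the oldest messages first is the simplest policy. Summarizing old turns instead of dropping them is a common alternative when early context still matters.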

What fits in various context windows:

  • 4K tokens: short conversations, single emails.
  • 32K tokens: legal documents, code files, longer reports.
  • 128K tokens: book chapters, multi-file codebases, full conversations.
  • 200K tokens: full books, large legal documents, extended conversations.
  • 1M+ tokens: entire codebases, video transcripts, multiple long documents.

The cost vs context window tradeoff:

Filling a larger context window costs more per query, because every input token contributes to inference cost.

Typical pricing (2025 frontier model):

  • Input tokens: $1-$10 per million tokens.
  • Output tokens: $5-$50 per million tokens.

A 200K-token input at $3/M input tokens = $0.60 per query just for the input. Significant for high-volume apps.
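The arithmetic above is easy to turn into a small calculator. This is a sketch with hypothetical function names; the $3 per million input price comes from the example above, and the $15 per million output price is an assumed figure picked from the range quoted earlier, not any vendor's actual rate.

```python
def query_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Dollar cost of one query, given prices per million tokens."""
    input_cost = input_tokens / 1_000_000 * input_price_per_m
    output_cost = output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


def daily_cost(cost_per_query: float, queries_per_day: int) -> float:
    """Dollar cost per day at a steady query volume."""
    return cost_per_query * queries_per_day


if __name__ == "__main__":
    # 200K input tokens at $3/M, ignoring output: about $0.60 per query.
    per_query = query_cost(200_000, 0, 3.0, 15.0)
    print(f"${per_query:.2f} per query")
    # At an assumed 10,000 queries per day, that is about $6,000 per day.
    print(f"${daily_cost(per_query, 10_000):,.0f} per day")
```

Run the numbers at your real volume before committing to a design. A cost that looks trivial per query can dominate the bill at scale.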

The context window vs RAG decision:

With large context windows, you might wonder: do I still need RAG?

Use a large context window directly when: all the relevant information fits; you are willing to pay the per-query cost; and the model attends well to the full context on your task.

Use RAG when: the corpus is too large for the context window; you want to cite specific sources; you need to update knowledge frequently; or per-query cost matters.

Combine: even with large context windows, RAG is often more cost-effective than stuffing entire corpora into every prompt.
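The decision criteria above can be written down as a first-pass heuristic. This is a sketch, not a rule: the function name, the parameters, and the 4,000-token output reserve are all assumptions made for illustration, and it does not model attention quality, which you still have to test empirically.

```python
def choose_strategy(
    corpus_tokens: int,
    context_window: int,
    input_price_per_m: float,
    max_cost_per_query: float,
    needs_citations: bool = False,
    updates_frequently: bool = False,
    reserved_output_tokens: int = 4_000,  # assumed headroom for the reply
) -> str:
    """Return "full_context" or "rag" using the criteria from the text."""
    fits = corpus_tokens + reserved_output_tokens <= context_window
    input_cost = corpus_tokens / 1_000_000 * input_price_per_m

    # The corpus is too large, or the per-query cost is too high.
    if not fits or input_cost > max_cost_per_query:
        return "rag"
    # Source citation and fast-changing knowledge both favor retrieval.
    if needs_citations or updates_frequently:
        return "rag"
    return "full_context"


if __name__ == "__main__":
    # 150K tokens in a 200K window at $3/M, with a $1.00 per-query budget.
    print(choose_strategy(150_000, 200_000, 3.0, 1.00))
    # Same corpus, but the budget is only $0.10 per query.
    print(choose_strategy(150_000, 200_000, 3.0, 0.10))
```

In production the answer is often both: retrieve a generous set of relevant chunks, then let a large window hold all of them at once.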

Performance vs context window:

"Lost in the middle" effect: models attend more to information at the beginning and end of the context and less to information in the middle. Putting critical info at the start or end makes it more likely the model actually uses it.

Quality degradation near the limit: output quality sometimes drops as the input approaches the maximum window size. Test empirically on your own data before relying on the full advertised window.

Needle-in-haystack tests: models are tested on retrieving a specific fact planted somewhere in a long context. Modern models perform well but not perfectly.
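One way to act on the "lost in the middle" effect is to reorder retrieved chunks so the strongest ones sit at the edges of the prompt. Here is a minimal sketch, assuming the chunks arrive already sorted from most to least relevant; the function name is hypothetical, and whether this helps on your task is something to measure, not assume.

```python
def order_for_edges(chunks_by_relevance: list[str]) -> list[str]:
    """Place the most relevant chunks at the start and end of the prompt.

    Input is sorted most relevant first. Chunks are dealt alternately to
    the front and the back, so the least relevant end up in the middle.
    """
    front = []
    back = []
    for rank, chunk in enumerate(chunks_by_relevance):
        if rank % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the second-best chunk is the very last one.
    return front + back[::-1]


if __name__ == "__main__":
    # "A" is the most relevant chunk, "E" the least.
    print(order_for_edges(["A", "B", "C", "D", "E"]))
    # prints ['A', 'C', 'E', 'D', 'B']
```

The best chunk lands first, the second-best lands last, and the weakest material is pushed into the middle where attention tends to be lowest.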

The economics implication for AI applications:

Context economy: how you use the context window directly affects per-query economics. Wasted context = wasted money.

Prompt compression: techniques to compress context (summarization, hierarchical organization, semantic compression).

Caching: prompt caching (offered by Anthropic and OpenAI) reuses the repeated portions of common prompts, cutting the cost of that repeated context by 50-90%.

Context engineering: discipline of optimizing what goes into the context window for cost and quality.
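To see what caching does to the per-query math, here is a rough sketch. The function name is hypothetical, the discount is a parameter you set from the 50-90% range quoted above, and the model is simplified: it ignores any extra charge for writing to the cache and any cache expiry rules, both of which vary by provider.

```python
def input_cost_with_cache(
    cached_tokens: int,
    fresh_tokens: int,
    input_price_per_m: float,
    cache_discount: float,
) -> float:
    """Dollar input cost when part of the prompt is served from cache.

    cache_discount is the fraction taken off the normal input price for
    cached tokens, for example 0.9 for a 90% discount.
    """
    if not 0.0 <= cache_discount <= 1.0:
        raise ValueError("cache_discount must be between 0 and 1")
    cached_price = input_price_per_m * (1 - cache_discount)
    total = cached_tokens * cached_price + fresh_tokens * input_price_per_m
    return total / 1_000_000


if __name__ == "__main__":
    # A 200K-token prompt at $3/M where 180K tokens are a cached prefix.
    uncached = input_cost_with_cache(0, 200_000, 3.0, 0.9)
    cached = input_cost_with_cache(180_000, 20_000, 3.0, 0.9)
    print(f"uncached: ${uncached:.3f}  cached: ${cached:.3f}")
```

Under these assumptions the input cost falls from about $0.60 to about $0.11 per query. Caching pays off most when a large, stable prefix, such as instructions or reference documents, is shared across many calls.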

Ryan's Take

Context window was THE constraint in 2022 and 2023. By 2025 it's just a knob you manage. Know your model's window, design your prompts and retrieval to fit it, and cache where you can. Put the critical stuff at the start or end, because models genuinely lose the middle. And don't pay to stuff giant context into every call when RAG would do the same job cheaper.

What founders get wrong: Stuffing entire documents into context windows just because the model can handle it. The right discipline: optimize context for both cost and quality; use prompt caching; combine large context with RAG; treat context as an economic resource.

Related: Large Language Model · Transformer Architecture · Inference Cost · Retrieval-Augmented Generation · Prompt Engineering

FAQ

What is a context window?
The maximum number of tokens an LLM can process in a single request (prompt + output). It determines how much the model can "see" for any given query: instructions, examples, context, conversation history, and reference documents must all fit.

How big are modern context windows?
Frontier models in 2025: 200K-2M+ tokens, with Llama 4 Scout at 10M. GPT-4 (Turbo): 128K. Claude 3.5: 200K. Gemini 1.5/2: 1M-2M. Roughly 1,000x growth from GPT-3's 2K context in about five years. Most applications are no longer constrained by context window size.

How many words is 100K tokens?
~75,000 words (1 token ≈ 0.75 English words). Roughly 300 pages of a typical book. 1M tokens ≈ 750,000 words, or about 3,000 book pages.

Do I still need RAG with large context windows?
Often yes. Per-query cost makes massive context economically unsustainable for many apps. RAG fetches only relevant context per query. Combining large context windows with RAG is common in production. Use full context when info fits and cost is OK; RAG when corpus is large or cost matters.
