The Context Window

What it is

The context window is the fixed-size span of tokens a model can attend to on a single turn — everything it can "see" at once. Your instructions, the conversation so far, any files or search results you pulled in, and the answer the model is about to write all have to fit inside it.

Figure: The context window grew ~5,000× in five years. The chart plots the largest advertised window at each landmark release, from GPT-3's 2K tokens (2020) to Llama 4's 10M (2025).

The window is measured in tokens, not words or characters; a token is roughly ¾ of a word in English. As the chart shows, advertised windows have grown enormously, but the size is a hard ceiling: if a turn's tokens would exceed it, something has to be dropped, summarized, or truncated before the model can run.
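The ¾-of-a-word rule of thumb can be turned into a quick estimate. The sketch below is only a heuristic built on that ratio: the function name `estimate_tokens` and the constant `WORDS_PER_TOKEN` are hypothetical, and it counts whitespace-separated words rather than running a real tokenizer, so actual counts will differ by model and by language.

```python
import math

# Rule of thumb from the text: one token is about 3/4 of an English word.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count (heuristic, not a tokenizer)."""
    words = len(text.split())
    # Round up so the estimate errs toward using more of the budget.
    return math.ceil(words / WORDS_PER_TOKEN)


print(estimate_tokens("The context window is measured in tokens"))
```

For anything that has to stay under the ceiling reliably, count with the tokenizer that belongs to the model you are calling; an estimate like this is only suitable for early sizing.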

What is a token?

A token is the unit a model reads and writes. It can be a full word, a syllable, a single character, or a fragment of a longer word. Common words are usually one token; rarer or longer ones get split into several.

What lives in it

It is tempting to picture the window as "my prompt." In practice, your message is usually a small slice of it. On any given turn, the window holds:

  • The system prompt: the instructions and persona the model runs under.
  • Tool definitions: the schema for every tool or function the model can call.
  • The conversation history: every previous user and assistant turn.
  • Retrieved context: files, documents, or search results loaded for the task.
  • The current user message: what you just asked.
  • Reserved output space: room set aside for the model's reply.

Figure: One window, many tenants. Everything the model can use on a turn shares a single fixed-size token budget: your message is only a sliver of it, and the reply is carved from the same space (shown reserved at the right of the diagram). Proportions are illustrative.

Every one of these draws from the same budget. Deciding what earns a place is the subject of Token Budgeting.
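To make the shared budget concrete, here is a minimal sketch that tallies the six tenants listed above against one window. The class name `WindowBudget`, its field names, and every number in the example are made up for illustration; they do not describe any particular model or API.

```python
from dataclasses import dataclass


@dataclass
class WindowBudget:
    """Token counts for each tenant of the window (all figures hypothetical)."""
    window_size: int
    system_prompt: int = 0
    tool_definitions: int = 0
    history: int = 0
    retrieved_context: int = 0
    user_message: int = 0
    reserved_output: int = 0

    def used(self) -> int:
        # Every tenant draws from the same budget, so usage is a plain sum.
        return (self.system_prompt + self.tool_definitions + self.history
                + self.retrieved_context + self.user_message
                + self.reserved_output)

    def remaining(self) -> int:
        # A negative value means something must be dropped, summarized, or truncated.
        return self.window_size - self.used()


budget = WindowBudget(
    window_size=200_000,
    system_prompt=2_000,
    tool_definitions=5_000,
    history=40_000,
    retrieved_context=100_000,
    user_message=500,
    reserved_output=8_000,
)
print(budget.used(), budget.remaining())
print(f"user message share: {budget.user_message / budget.window_size:.2%}")
```

With these invented numbers, 155,500 tokens are spoken for, 44,500 remain, and the user message accounts for 0.25% of the window.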

Input and output share the budget

The advertised number is the total, not an input allowance. The tokens the model generates come out of the same window as the tokens you put in, so the two trade off against each other: reserving room for a long answer leaves less room for context, and a window packed with input leaves less room to respond. Most APIs let you cap output explicitly (a max_tokens setting), and some require it, precisely because output competes with the input for the same space.
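The trade-off is simple subtraction, sketched below under the shared-budget model this section describes. The helper names `max_input_tokens` and `output_room` are hypothetical, as are the window sizes in the example; check your provider's documentation for how it actually accounts for input and output.

```python
def max_input_tokens(window_size: int, max_output_tokens: int) -> int:
    """Tokens left for input once max_output_tokens is reserved for the reply."""
    if not 0 <= max_output_tokens <= window_size:
        raise ValueError("max_output_tokens must be between 0 and window_size")
    return window_size - max_output_tokens


def output_room(window_size: int, input_tokens: int, requested_output: int) -> int:
    """Largest output cap that still fits beside input_tokens, never above the request."""
    free = max(window_size - input_tokens, 0)
    return min(requested_output, free)


# Reserving 4,000 tokens for the reply in a 128,000-token window:
print(max_input_tokens(128_000, 4_000))
# A window nearly full of input leaves less room than was requested:
print(output_room(128_000, 126_500, 4_000))
```

The first call shows how a larger output reservation shrinks the room for context; the second shows the reverse, where packed input squeezes the reply.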