Context Rot

The effects

Context rot is the performance degradation that LLMs experience as the length of the input context grows.

Repeated words — performance by input length. Every model's output fidelity collapses as the context grows, despite the task itself staying trivial. Data from Chroma's Context Rot research.

Although leading models now advertise context windows of a million tokens or more, performance degrades well before that limit, so in practice you work with far fewer tokens than the window nominally allows. Past a certain threshold, hallucinations and errors become more frequent. Cost compounds the problem: every token is reprocessed on every turn, so a bloated context is slower and more expensive too.
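To make the cost point concrete, here is a minimal arithmetic sketch. It assumes a simplified chat where each turn adds a fixed number of tokens and the full history is resent every turn with no prompt caching; the function name and the figures are illustrative, not taken from any provider's pricing.

```python
def cumulative_tokens_processed(tokens_per_turn: int, turns: int) -> int:
    """Total input tokens processed over a conversation.

    Assumption: the whole history is resent on every turn and nothing
    is cached, so turn k processes k * tokens_per_turn tokens.
    """
    total = 0
    context = 0
    for _ in range(turns):
        context += tokens_per_turn  # the history grows by one turn
        total += context            # the entire context is processed again
    return total


if __name__ == "__main__":
    for turns in (10, 100):
        print(turns, cumulative_tokens_processed(1_000, turns))
```

Under these assumptions the total is tokens_per_turn * n * (n + 1) / 2, so it grows quadratically with the number of turns: 10 turns process 55,000 input tokens, while 100 turns process 5,050,000, roughly 92 times as many for 10 times the turns. Prompt caching, where available, lowers the cost of the repeated prefix but is ignored here.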

The causes

Lost in the Middle

LLMs perform best when the relevant information sits at the beginning or end of the input. When it sits in the middle of a long input, performance degrades considerably, even in models specifically designed for long contexts (Liu et al., 2023).

Lost in the middle. Retrieval accuracy is highest when the answer sits at the start or end of the context and sags when it falls in the middle — a U-shaped curve that holds even for long-context models. After Liu et al., 2023 (GPT-3.5, 20-document QA).
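One way to check for this effect on your own model is to place a single answer-bearing document at different positions among distractors and compare accuracy. The sketch below only builds the prompts. The function names, the prompt layout, and the choice of three positions are assumptions for illustration, not the setup used by Liu et al.

```python
def build_probe_prompt(question: str, gold_doc: str,
                       distractors: list[str], position: int) -> str:
    """Place the answer-bearing document at a chosen index among distractors.

    Hypothetical helper: the "Document [n]" layout is an assumption.
    """
    if not 0 <= position <= len(distractors):
        raise ValueError("position must be between 0 and len(distractors)")
    docs = list(distractors)          # copy so the caller's list is untouched
    docs.insert(position, gold_doc)   # gold document lands at `position`
    numbered = [f"Document [{i + 1}] {doc}" for i, doc in enumerate(docs)]
    return "\n".join(numbered) + f"\n\nQuestion: {question}\nAnswer:"


def probe_positions(total_docs: int) -> list[int]:
    """Indices for the start, middle, and end of a context of total_docs documents."""
    return [0, total_docs // 2, total_docs - 1]


if __name__ == "__main__":
    distractors = [f"Unrelated passage {i}." for i in range(19)]
    for pos in probe_positions(20):
        prompt = build_probe_prompt("Who wrote the report?",
                                    "The report was written by Ada.",
                                    distractors, pos)
        print(pos, len(prompt))
```

Sending each prompt to the model and scoring the answers by position would show whether accuracy dips when the gold document is in the middle.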

Distraction by Irrelevant Context

Irrelevant context significantly degrades performance when it forces the model to take an extra step: separating the relevant information from the distractors before it can solve the task. In other words, what matters is not just how many tokens there are, but how much noise the model has to filter out (Shi et al., 2023).