The 1M Context Window Is Here. I Stopped Chunking My Docs.

Two weeks ago I had a half-built RAG pipeline on my desk. A vector store, an embedding job, a chunking strategy I'd already rewritten three times, and a re-ranker I was about to bolt on top because the recall was bad. I was going to spend the next weekend wiring it into the API. I never did. Anthropic moved 1M-token context to general availability on Claude Opus 4.6 and Sonnet 4.6 at standard API pricing — the previous long-context surcharge is gone — and the math on the whole project changed in one afternoon. I deleted the repo on a Tuesday.

This is the post I wanted to read before I started building that pipeline. What you actually save when context windows get this big, where you still want chunking, and how to think about the trade in concrete numbers instead of vibes.

What 1M tokens actually buys you

A million tokens is not an abstract number. Concretely, it is roughly the entire codebase of a medium-sized solo project, or a 700-page PDF, or about six months of a Slack channel, or every blog post on this site so far with room to spare. It is enough to put the whole context of "what I am working on" inside a single prompt and ask the model to reason across all of it at once.

Two things change when this becomes affordable. First, you stop thinking about retrieval as a separate engineering problem. The retrieval layer was always the hardest part of RAG to get right — chunk size, overlap strategy, embedding model selection, top-k, re-rankers, hybrid search with BM25, query rewriting. All of that exists to compensate for the fact that you couldn't show the model the whole document. When you can, most of that scaffolding becomes unnecessary.

Second, you stop losing the connections between sections. The single biggest failure mode of a chunked RAG pipeline is that it retrieves three semantically similar chunks but misses the chunk that actually answers the question because the answer was phrased differently. With a million tokens of context, you don't have to guess what the answer chunk looks like. You include everything and let the model find it.

The pipeline I deleted

The RAG pipeline I was building had five moving parts. An ingestion script that chunked source documents into 800-token windows with 200-token overlap. An embedding step using a small model. A vector store. A retrieval function that pulled top-10 chunks at query time. And a small re-ranker on top because top-10 retrieval was missing the right chunk about a third of the time.

Five pieces. Each one with its own bugs, its own configuration knobs, and its own ongoing maintenance cost. The replacement is one function: read the source files into a string, prepend the question, send it to the model. That is not a simplification on a slide deck — that is the actual code in production today.
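For reference, here is roughly what that one function looks like: a minimal sketch using the Anthropic Python SDK. The file extensions, default path, and model id are placeholders for whatever your project actually uses, not the exact values from mine.

```python
# Minimal sketch of the "one function" replacement: read everything,
# prepend the question, send a single request. Paths, file extensions,
# and model id are placeholders.
from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def ask_codebase(question: str, root: str = "src") -> str:
    # Concatenate every source file into one big context string.
    corpus = "\n\n".join(
        f"# {path}\n{path.read_text(errors='ignore')}"
        for path in sorted(Path(root).rglob("*"))
        if path.is_file() and path.suffix in {".py", ".ts", ".astro", ".md"}
    )
    response = client.messages.create(
        model="claude-sonnet-4-6",  # placeholder id; use whichever model you're on
        max_tokens=1024,
        messages=[{"role": "user", "content": f"{corpus}\n\nQuestion: {question}"}],
    )
    return response.content[0].text
```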

I am not saying this to brag about deletion. I am saying it because the engineering effort I was about to spend is now spendable on something else. For a solo dev, that swap is the entire point. The hours you don't pour into infrastructure are hours you put into things customers actually see.

Where chunking still wins

I want to be honest about where the long-context approach loses, because the answer isn't "always use 1M tokens." There are three places where chunking is still the right call.

The first is cost. A million-token prompt costs more than a 10K-token retrieved-chunks prompt. If you're answering a thousand questions a day against the same document, you're paying for the full context every time, and that adds up fast. Caching helps — Anthropic's prompt caching brings repeated long contexts down significantly — but if your pattern is "many short questions against one giant document," classic RAG with embeddings is still cheaper at scale.
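The caching call itself is not exotic. Here is a sketch of the pattern, assuming the Anthropic SDK's cache_control content blocks: the big, unchanging document goes in a cacheable system block and only the question changes between requests. The model id is a placeholder.

```python
# Sketch of prompt caching for the "many questions, one giant document" pattern.
# The static document sits in a cacheable system block; only the question
# varies per request, so repeat calls hit the cache for the expensive part.
import anthropic

client = anthropic.Anthropic()


def ask_cached(question: str, document: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-6",  # placeholder id
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": document,
                "cache_control": {"type": "ephemeral"},  # cache the big, static part
            }
        ],
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
```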

The second is latency. A 1M-token prompt takes longer to process than a 10K-token one, even with all the optimizations. If you're building something that needs to feel instant, like an autocomplete or an inline answer, you don't want to round-trip a million tokens.

The third is genuinely huge corpora. A million tokens is a lot, but it is not infinite. If your knowledge base is actually a hundred million tokens — every issue ticket your company has ever filed, every email in your inbox, the whole internal wiki — you still need a retrieval step. The interesting question is what kind. The new pattern I see emerging is "do a coarse retrieval to narrow down to the right million tokens, then dump the whole million tokens into the model." That's a much cheaper retrieval problem than chunked RAG because you're retrieving documents, not chunks, and the model handles the within-document reasoning.

A quick benchmark from my own desk

I ran the same task two ways on my own setup last week: answer a question about my Astro blog by querying its own source code. The codebase is about 80K tokens including content. Easy fit for the long-context path.

The chunked RAG version took longer to set up and missed the right answer twice because the chunking broke a function definition across two chunks. The dump-the-whole-codebase version answered correctly on the first try, took roughly five seconds, and cost about ten cents.

Ten cents is not nothing if you're running it a thousand times. But for the ad-hoc "I need to understand this codebase" query I run maybe twice a day, it's irrelevant. I would happily pay ten cents to save the engineering hours I was about to spend on a chunker that loses function boundaries.

What I'd actually do today

If you're a solo dev about to build a RAG pipeline, here's the decision tree I'd use.

If your knowledge base fits in 1M tokens, don't build RAG. Just include the whole thing in the prompt. Use prompt caching to keep the cost reasonable on repeat queries. You'll ship in a day instead of a month, and the recall will be better.

If your knowledge base is 1M to 100M tokens, build a coarse document-level retrieval step — embeddings on the document level, not the chunk level — and then dump the top one or two whole documents into the model. Skip the chunking, skip the re-ranker, skip the hybrid search. Document-level retrieval is dramatically simpler than chunk-level retrieval and good enough when the model handles the within-document reasoning.
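Here is a sketch of that pattern, assuming a small off-the-shelf embedding model (sentence-transformers, purely as an example) plus the Anthropic SDK. The model ids are placeholders, and the similarity is plain cosine over normalized vectors.

```python
# Sketch of coarse, document-level retrieval: one embedding per whole document,
# cosine similarity against the query, then hand the top documents to the model
# untouched. Model names are examples only.
import numpy as np
from sentence_transformers import SentenceTransformer

import anthropic

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any small embedding model works
client = anthropic.Anthropic()


def build_index(documents: list[str]) -> np.ndarray:
    # One vector per document, not per chunk.
    return embedder.encode(documents, normalize_embeddings=True)


def answer(question: str, documents: list[str], index: np.ndarray, top_k: int = 2) -> str:
    query_vec = embedder.encode([question], normalize_embeddings=True)[0]
    best = np.argsort(index @ query_vec)[::-1][:top_k]  # cosine similarity via dot product
    context = "\n\n---\n\n".join(documents[i] for i in best)
    response = client.messages.create(
        model="claude-sonnet-4-6",  # placeholder id
        max_tokens=1024,
        messages=[{"role": "user", "content": f"{context}\n\nQuestion: {question}"}],
    )
    return response.content[0].text
```

The index here is just a NumPy array because at document granularity you rarely have enough vectors to justify a real vector store.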

If your knowledge base is 100M+ tokens, that's an enterprise problem and this post is not for you. Hire someone who has done it before.

The biggest mindset shift is not technical. It's that the right answer for most solo projects is now "no RAG." The infrastructure tax of building and maintaining a retrieval pipeline almost never pays for itself when a million tokens of context is on the table at standard pricing. Ship the simple version, see if it's actually too slow or too expensive in production, and add complexity only when the numbers force you to.
