[ ] commonplace
Note Wednesday, May 6, 2026

“RAG Is Dead” Is a Lazy Take

Every few months, someone declares RAG is dead.

Usually right after a new model launches with a larger context window.

128k.

1M.

“Why retrieve anything? Just dump everything into the prompt.”

And people act like information retrieval is now obsolete.

It isn’t.

And the argument gets weaker the more you think about it.

Bigger context windows are not a free lunch

LLMs have an inherent constraint: they need to process whatever you give them.

Today, with transformer-based models, larger contexts typically mean:

higher inference costs

slower responses

more memory usage

That’s obvious.

But the common rebuttal is:

“That’s just a temporary transformer problem.”

Maybe.

Let’s assume transformers aren’t even the final architecture.

Maybe we get:

sub-quadratic attention

Joint Embedding Predictive Architecture (JEPA)-style systems

diffusion language models

entirely new architectures we haven’t discovered yet

Great.

The argument still doesn’t hold.

Because this was never fundamentally about transformers.

It’s about computation.

No matter how efficient models become, processing more information on the same system will take more work than processing less.

Maybe the curve improves dramatically.

Maybe context becomes effectively unlimited.

That still doesn’t make brute force efficient.

Processing an entire knowledge base will always cost more than processing the 3 documents that actually matter.

At scale, that difference compounds fast.

Millions of requests later:

unnecessary computation still costs money

unnecessary computation still adds latency

unnecessary computation still creates infrastructure overhead
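A back-of-the-envelope sketch of how that gap compounds. Every constant here is a made-up assumption for illustration (the price, the document size, the corpus size), and it ignores prompt caching and output tokens; only the shape of the arithmetic matters.

```python
# Illustrative numbers only: every constant below is an assumption, not a real price.
PRICE_PER_MILLION_INPUT_TOKENS = 1.00  # hypothetical price in dollars
TOKENS_PER_DOCUMENT = 1_000            # hypothetical average document size
KNOWLEDGE_BASE_DOCUMENTS = 500         # hypothetical corpus size
RETRIEVED_DOCUMENTS = 3                # the documents that actually matter
REQUESTS = 1_000_000


def input_cost(documents_per_request: int, requests: int) -> float:
    """Dollar cost of the input tokens summed over all requests."""
    tokens = documents_per_request * TOKENS_PER_DOCUMENT * requests
    return tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS


full_context_cost = input_cost(KNOWLEDGE_BASE_DOCUMENTS, REQUESTS)
retrieval_cost = input_cost(RETRIEVED_DOCUMENTS, REQUESTS)

print(f"full context: ${full_context_cost:,.0f}")   # $500,000
print(f"retrieval:    ${retrieval_cost:,.0f}")      # $3,000
print(f"ratio:        {full_context_cost / retrieval_cost:.0f}x")  # 167x
```

A cheaper model or a better architecture lowers the price constant. The ratio between the two lines stays the same.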

Better architectures may reduce the penalty.

They don’t eliminate the incentive to be efficient.

“Just put everything in context” isn’t elegant engineering

Even if context limitations disappeared tomorrow, why would loading everything be the default design?

That’s not architecture.

That’s avoiding architecture.

It sounds like:

“Storage got cheaper, so we don’t need database indexes anymore.”

Nobody serious would say that.

We build systems that retrieve the right data at the right time because brute force breaks at scale.

AI systems are no different.
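The index analogy can be made concrete with a toy example. This is only a sketch with invented records: a plain dictionary stands in for a database index, and a loop over a list stands in for brute force.

```python
# Hypothetical data: 100,000 toy records held in memory.
records = [{"id": i, "name": f"user-{i}"} for i in range(100_000)]


def scan(rows, target_id):
    """Brute force: look at rows one by one until the target turns up."""
    comparisons = 0
    for row in rows:
        comparisons += 1
        if row["id"] == target_id:
            return row, comparisons
    return None, comparisons


# A dict keyed by id plays the role of an index: one lookup instead of a scan.
index = {row["id"]: row for row in records}

found_by_scan, comparisons = scan(records, 99_999)
found_by_index = index[99_999]

print(comparisons)                      # 100000
print(found_by_scan is found_by_index)  # True
```

Both paths return the same record. Only one of them had to touch every row to get there.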

More context can make answers worse

People also assume more context automatically improves quality.

That’s often false.

Too much context can introduce:

irrelevant information

lower signal-to-noise ratio

conflicting sources

harder debugging

less predictable outputs

Giving a model 5 relevant documents often works better than dumping 500 vaguely related ones into a prompt.

The challenge isn’t maximizing context.

It’s maximizing relevance.
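A minimal sketch of what "maximizing relevance" means in practice: rank the documents against the query and keep only the top few. The word-count cosine similarity below is a deliberately crude stand-in for a real embedding model, and the documents and the `top_k` helper are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words counts; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query, best first."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda doc: cosine(q, tokenize(doc)), reverse=True)
    return ranked[:k]


documents = [
    "Refunds are issued within 14 days of a return request.",
    "Our office is closed on public holidays.",
    "To request a refund, open the order page and choose Return.",
    "The cafeteria menu changes every week.",
]

context = top_k("How do I request a refund?", documents, k=2)
prompt = "Answer using only these notes:\n" + "\n".join(context)
print(prompt)
```

The holiday and cafeteria notes never reach the prompt, so the model has nothing irrelevant to be distracted by.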

Retrieval solves a real problem

Retrieval is often framed as a temporary workaround until models get “good enough.”

That misses the point.

Retrieval exists because selecting the right information is fundamentally better than passing along everything.

That remains true regardless of model architecture.

Retrieval improves:

latency

cost efficiency

relevance

scalability

maintainability

That’s not a hack.

That’s system design.

“RAG” also became an overhyped buzzword

That said: the backlash didn’t come from nowhere.

“RAG” became one of the most overused terms in AI.

A lot of companies built:

vector database wrappers

naive chunking pipelines

basic prompt templates

…and called it groundbreaking infrastructure.

That criticism is fair.

A lot of “RAG startups” were mostly wrappers with good branding.

But bad implementations don’t invalidate the underlying idea.

That’s like saying databases are dead because someone built a terrible dashboard product.

The future is selective systems

The future probably looks like:

stronger models

larger context windows

better retrieval systems

better memory layers

better routing/orchestration

better tool usage

Sometimes full-context approaches will make sense.

Sometimes retrieval-first systems will win.

Most real-world systems will likely be hybrid.

That’s how engineering usually works: tradeoffs, not absolutes.
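One way a hybrid could look, as a rough sketch: send everything when the corpus fits a token budget, and fall back to retrieval when it doesn't. The `Plan` type, the `plan_context` function, the four-characters-per-token heuristic, and the `first_two` placeholder retriever are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    strategy: str         # "full_context" or "retrieval"
    documents: list[str]  # what actually goes into the prompt


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def plan_context(documents: list[str], token_budget: int, retrieve) -> Plan:
    """Send everything when it fits the budget, otherwise fall back to retrieval."""
    total = sum(estimate_tokens(doc) for doc in documents)
    if total <= token_budget:
        return Plan("full_context", documents)
    return Plan("retrieval", retrieve(documents))


def first_two(docs: list[str]) -> list[str]:
    """Hypothetical placeholder; a real system would rank by relevance."""
    return docs[:2]


small_corpus = ["short note"] * 3
large_corpus = ["x" * 4000] * 50

small_plan = plan_context(small_corpus, token_budget=1000, retrieve=first_two)
large_plan = plan_context(large_corpus, token_budget=1000, retrieve=first_two)

print(small_plan.strategy)  # full_context
print(large_plan.strategy)  # retrieval
```

The budget is the tradeoff knob: raise it as context gets cheaper, and the same system shifts toward full context without a rewrite.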

Stop confusing brute force with progress

Bigger context windows are useful.

Model improvements are real.

New architectures may radically change what’s possible.

None of that means efficient information retrieval disappears.

“Just dump everything into the prompt” is often brute force wearing innovation branding.

And history is pretty consistent here:

As systems scale, efficiency matters more—not less.
