Here's a scenario that's playing out across hundreds of businesses right now: a company deploys an AI chatbot on their website. It handles questions well during the demo. Then, two weeks into production, a customer asks about the refund policy. The bot confidently cites a policy that stopped existing eight months ago. The customer requests the refund. Support has to explain what actually happened. The customer is unhappy. The company wishes they'd never deployed the chatbot.

This is an AI hallucination problem. It's not theoretical, and it's not going away when models get bigger.

What does AI hallucination actually mean?

Large language models don't retrieve information — they predict what text should come next given the input. When a model is asked a question, it generates an answer that looks like what an answer to that question should look like, based on patterns in its training data.
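
To make "predict what comes next" concrete, here is a toy sketch. It assumes a tiny made-up corpus and uses simple word-pair counts instead of a neural network, so it is an analogy for the behaviour rather than a description of how any real model is built. The corpus and the `complete` function are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical training text: the 30-day wording is more common than the 14-day one.
corpus = (
    "refunds are accepted within 30 days . "
    "refunds are accepted within 30 days . "
    "returns are accepted within 14 days ."
).split()

# Count which word follows each word (a bigram model).
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def complete(prompt: str, max_words: int = 6) -> str:
    """Greedily append the most frequent next word. Nothing here checks facts."""
    words = prompt.split()
    for _ in range(max_words):
        candidates = following.get(words[-1])
        if not candidates:
            break
        next_word = candidates.most_common(1)[0][0]
        words.append(next_word)
        if next_word == ".":
            break
    return " ".join(words)

print(complete("returns are"))  # returns are accepted within 30 days .
```

The only sentence about returns in the corpus says 14 days, yet the completion says 30, because "30" is the statistically likelier continuation of "within". The output is fluent and wrong, which is the failure mode in miniature.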

This works remarkably well for general knowledge. It fails predictably for:

Current specifics (pricing, policies, product features that changed after the training cutoff)
Private information (your internal docs, your specific offer, your configuration options)
Precise details (specific version numbers, exact limits, contract terms)

The model doesn't know what it doesn't know. It fills gaps with confident-sounding plausibility. This is not a bug in the usual sense — it's fundamental to how the architecture works. You don't fix it by asking the model to "be more careful."

What does RAG do about hallucinations?

Retrieval-Augmented Generation changes the contract.

Instead of asking the model to answer from memory, RAG first retrieves relevant content from a trusted external source, then passes that content to the model as context for generating the answer. The model's job shifts from "know the answer" to "synthesise this retrieved content into a coherent response."

The pipeline looks like this:

1. User submits a query
2. Query is embedded as a vector and compared against your indexed content
3. The top-k most semantically relevant chunks are retrieved
4. Retrieved chunks are passed to the LLM as context
5. LLM generates a response grounded in those specific chunks
6. Response includes citations pointing back to the source content

The model is now working from your documents, not from training data. If the answer isn't in your documents, a well-configured RAG system says so rather than speculating.
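
The pipeline steps above can be sketched in a few lines. This is a minimal illustration under loud assumptions: word counts stand in for a real embedding model, the three indexed chunks are invented, and the final LLM call is left out, so the sketch stops at the prompt that would be sent. All names (`embed`, `retrieve`, `build_prompt`, `INDEX`) are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in embedding: lowercase word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical indexed chunks: (source, text).
INDEX = [
    ("/refund-policy", "Refund policy: refunds are available within 14 days of purchase."),
    ("/pricing", "The Pro plan costs 49 dollars per month."),
    ("/shipping", "Orders ship within two business days."),
]

def retrieve(query: str, k: int = 2) -> list[tuple[float, str, str]]:
    """Score every chunk against the query and return the top k, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), source, text) for source, text in INDEX]
    return sorted(scored, reverse=True)[:k]

def build_prompt(query: str, chunks: list[tuple[float, str, str]]) -> str:
    """Number each chunk so the model can cite it as [1], [2], ..."""
    context = "\n".join(
        f"[{i}] ({source}) {text}" for i, (_, source, text) in enumerate(chunks, 1)
    )
    return (
        "Answer using only the numbered context. Cite sources like [1]. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

query = "What is the refund policy?"
top = retrieve(query)
prompt = build_prompt(query, top)
print(prompt)  # this prompt is what would be passed to the LLM
```

The design point is that the model never sees the question alone. It sees the question next to the specific passages it is expected to answer from, each labelled with its source.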

Why isn't "generic LLM + your documents" enough?

A common misconception: "we can just upload our docs to ChatGPT and ask it questions." This works for casual exploration. It doesn't work as production infrastructure.

The problems:

Context window truncation. LLMs have token limits. If your documentation is large, you can't fit all of it in a single prompt. Something gets cut. Often the thing that gets cut is the thing the user needed.
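
A rough sketch of how truncation loses the important part. It assumes whitespace-separated words as a stand-in for tokens (real tokenizers count differently) and an invented document where the key fact sits at the end. `truncate_to_budget` is a hypothetical helper.

```python
def truncate_to_budget(text: str, max_tokens: int) -> str:
    """Keep only the first max_tokens words (a rough proxy for a token limit)."""
    return " ".join(text.split()[:max_tokens])

# Hypothetical documentation: 250 words of filler, then the fact the user needs.
docs = "Welcome to our product documentation. " * 50 + "Refunds are available within 14 days."

prompt_context = truncate_to_budget(docs, max_tokens=200)

print("14 days" in docs)            # True
print("14 days" in prompt_context)  # False
```

Nothing fails visibly here. The model simply never receives the refund sentence, so whatever it says about refunds comes from somewhere else.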

No retrieval — just injection. Uploading a PDF to ChatGPT is not retrieval. The model is reading a document in its context window. There's no embedding, no semantic indexing, no ranking by relevance. You're relying on the model's attention to find the right passage in a wall of text.

No systematic citation. Generic LLMs don't reliably cite which part of which document an answer came from. You can't verify accuracy. You can't give users confidence.

Stale data. Documents uploaded to a chatbot session aren't updated automatically. Your pricing page changes. Your API docs get revised. The session doesn't know.

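A proper RAG architecture addresses each of these. It sends the model only the most relevant chunks instead of the whole corpus, ranks them by semantic similarity, keeps track of which source each chunk came from, and maintains a live, queryable index of your content that updates as your content changes.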

What makes a RAG pipeline strict?

Not all RAG is equal. "Strict" RAG — the kind that actually behaves predictably in production — has these properties:

Source control. The retrieval system knows exactly which sources it can draw from. It doesn't supplement with training data or web results unless you explicitly instruct it to. If you've indexed your documentation, it answers from your documentation and nothing else.
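
One simple way to enforce source control is an allowlist applied before anything reaches the model. The sketch below assumes each chunk carries the URL it was indexed from. The `Chunk` type, the example URLs, and the prefix list are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    source: str  # where the text was indexed from
    text: str

# Hypothetical allowlist: only chunks from these URL prefixes may be used as context.
ALLOWED_PREFIXES = ("https://docs.example.com/", "https://example.com/pricing")

def allowed(chunks: list[Chunk]) -> list[Chunk]:
    """Drop every chunk whose source is not on the allowlist."""
    return [c for c in chunks if c.source.startswith(ALLOWED_PREFIXES)]

candidates = [
    Chunk("https://docs.example.com/refunds", "Refunds are available within 14 days."),
    Chunk("https://forum.example.org/thread/9", "I heard refunds take 90 days."),
]

context = allowed(candidates)
print([c.source for c in context])  # ['https://docs.example.com/refunds']
```

Prefix matching is deliberately crude. A production system would more likely restrict sources at indexing time, so that untrusted content never enters the index at all.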

Chunk relevance thresholds. Not all retrieved chunks are equally relevant. A well-configured system applies a minimum similarity score before including a chunk in context. If nothing meets the threshold, the system replies that it couldn't find relevant information — it doesn't fall back to guessing.
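
A minimal sketch of a relevance threshold with a no-answer fallback. It assumes the retriever already returns `(similarity, text)` pairs. The 0.75 cutoff is an arbitrary illustration, not a recommended value, and `fake_llm` is a stand-in for a real model call.

```python
NO_ANSWER = "I couldn't find that in the indexed content."

def select_context(scored_chunks, min_score: float = 0.75, k: int = 3):
    """Keep at most k chunks whose similarity is at least min_score, best first."""
    kept = [c for c in scored_chunks if c[0] >= min_score]
    return sorted(kept, key=lambda c: c[0], reverse=True)[:k]

def answer(scored_chunks, generate) -> str:
    context = select_context(scored_chunks)
    if not context:
        return NO_ANSWER  # refuse instead of calling the model with weak context
    return generate([text for _, text in context])

def fake_llm(chunks: list[str]) -> str:
    """Stand-in for a model call; echoes the best chunk."""
    return f"Based on {len(chunks)} source(s): {chunks[0]}"

# Hypothetical similarity scores, as a retriever might return them.
on_topic = [(0.91, "Refunds are available within 14 days."), (0.42, "Orders ship in two days.")]
off_topic = [(0.31, "Orders ship in two days."), (0.18, "The Pro plan is billed monthly.")]

print(answer(on_topic, fake_llm))   # Based on 1 source(s): Refunds are available within 14 days.
print(answer(off_topic, fake_llm))  # I couldn't find that in the indexed content.
```

The refusal happens in ordinary code, before the model is called. That is more dependable than asking the model in the prompt to decline when it is unsure.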

Attribution. Every claim in the generated response is tied to a retrievable source. The user sees where the answer came from and can click through to verify.
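
Attribution can be checked mechanically after generation. The sketch below assumes the model was asked to cite with bracketed numbers such as `[1]` that refer to the retrieved sources in order. That citation format, the example URLs, and both helper functions are assumptions made for illustration.

```python
import re

def cited_sources(answer: str, sources: list[str]) -> list[str]:
    """Map [n] markers in the answer to source URLs; raise if a marker has no source."""
    cited = []
    for marker in re.findall(r"\[(\d+)\]", answer):
        index = int(marker)
        if not 1 <= index <= len(sources):
            raise ValueError(f"citation [{index}] does not match any retrieved source")
        if sources[index - 1] not in cited:
            cited.append(sources[index - 1])
    return cited

def has_uncited_sentence(answer: str) -> bool:
    """True if any sentence carries no [n] marker at all."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    return any(not re.search(r"\[\d+\]", s) for s in sentences)

sources = ["https://docs.example.com/refunds", "https://docs.example.com/billing"]
good = "Refunds are available within 14 days [1]. The Pro plan is billed monthly [2]."
bad = "Refunds are available within 14 days [1]. Shipping is always free."

print(cited_sources(good, sources))
print(has_uncited_sentence(good))  # False
print(has_uncited_sentence(bad))   # True
```

A check like this only confirms that citations exist and point at real sources. It does not prove that the cited passage supports the claim, which is why the click-through matters.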

Freshness. The index is kept current. Crawls run on a schedule. When source content changes, the vectors update. The model's answers reflect your current documentation, not a snapshot from six months ago.
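
One common way to keep an index fresh without re-embedding everything is to compare content hashes between crawls. This is a sketch of that idea, not a description of any particular product's crawler. The page paths and function names are hypothetical.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Stable content hash used to detect changes between crawls."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def pages_to_reindex(crawled: dict[str, str], stored_hashes: dict[str, str]) -> list[str]:
    """URLs whose content is new or has changed since the last crawl."""
    return sorted(
        url for url, text in crawled.items() if stored_hashes.get(url) != fingerprint(text)
    )

def pages_to_delete(crawled: dict[str, str], stored_hashes: dict[str, str]) -> list[str]:
    """URLs that are still in the index but no longer exist on the site."""
    return sorted(url for url in stored_hashes if url not in crawled)

# Hashes saved from the previous crawl (hypothetical pages).
stored = {
    "/pricing": fingerprint("Pro plan: 39 dollars per month"),
    "/refunds": fingerprint("Refunds within 14 days"),
    "/old-promo": fingerprint("Spring sale"),
}
# What the current crawl found.
crawled = {
    "/pricing": "Pro plan: 49 dollars per month",
    "/refunds": "Refunds within 14 days",
    "/faq": "Frequently asked questions",
}

print(pages_to_reindex(crawled, stored))  # ['/faq', '/pricing']
print(pages_to_delete(crawled, stored))   # ['/old-promo']
```

Deleting vectors for removed pages matters as much as updating changed ones. A retired promotion that stays in the index produces the same outdated-policy answer described at the top of this article.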

What does grounded RAG look like in a business?

If your business has specific facts — pricing, policies, procedures, product specifications, legal terms — the only AI behaviour that's acceptable in customer-facing contexts is one that won't invent those facts.

A generic LLM on your website is a liability. Every conversation is a potential support ticket or a trust-eroding moment.

A RAG-powered assistant grounded in your content is the opposite. It answers accurately from what you've told it, cites the source, and admits when it doesn't know. Customers get accurate information. Your team doesn't spend time correcting AI errors. Support volume stays flat or drops.

The difference in outcome between these two deployments is not marginal. It's the difference between an AI that helps your business and one that creates incidents.

What's the first question to ask before deploying AI?

Before choosing a model, a framework, or a deployment platform: ask "what is this AI allowed to draw from?"

If the answer is "its training data", that's a generic LLM. Expect hallucinations on specifics.

If the answer is "only our indexed content, with citations", that's grounded RAG. Expect answers that are traceable to a source, auditable, and far less likely to invent specifics.

For anything customer-facing where accuracy matters, only the second answer is acceptable. That isn't just a product preference; it aligns with the trustworthiness and risk-management principles in the NIST AI Risk Management Framework.

Frequently Asked Questions

What is AI hallucination?

AI hallucination is when a language model generates a response that is confident and plausible-sounding but factually incorrect. It happens because models predict statistically likely text rather than retrieving verified facts — so they fill knowledge gaps with authoritative-sounding guesses.

Can RAG completely eliminate hallucinations?

No, but it substantially reduces them for in-scope questions. In a strict RAG system, if the answer isn't found in the indexed content, the model says so rather than guessing. The remaining risk is highest when a query falls outside the indexed scope and loosely matching chunks are still passed to the model, which is why source control and retrieval thresholds matter.

What is the difference between RAG and a generic chatbot?

A generic chatbot answers from its training data — which doesn't include your private documents, current pricing, or recent policy changes. A RAG system retrieves from a live semantic index of your actual content before generating any response. The answer is grounded in what you've published, not in what the model learned months ago.

How does Surfable keep the RAG index current?

Scheduled crawls re-index your source content automatically. When a page on your site changes, the vectors update on the next crawl cycle. Your AI's answers reflect your current documentation rather than a static snapshot.

What happens when a user asks a question that isn't in my content?

A properly configured RAG system returns a no-answer response rather than speculating. This is the correct behaviour — an honest "I don't have that information" is significantly better for user trust than a confident wrong answer. Zero-result queries also appear in your dashboard as content gaps to fill.

Your AI representative is only as trustworthy as the content it's grounded in. Start there.