
RAG for Small Business: When a Custom Knowledge Base Beats a Chatbot


RAG without the jargon

Retrieval-Augmented Generation describes a simple idea behind a mouthful of a name. When a user asks a question, look up relevant information from your own content first, then ask an AI model to answer using only that information. The retrieval step grounds the model in your business reality. The generation step turns those facts into a natural-sounding response.

RAG bolts a custom knowledge base onto a general-purpose language model. The model brings fluency and reasoning. Your content brings truth. Neither works as well alone.

Why a generic chatbot falls short

A chatbot built on ChatGPT or Claude with no grounding is working off its training data. It knows plenty about the world as of its cutoff date, and nothing specifically about your business. Asked about your products, pricing or policies, it either refuses or fabricates plausible-sounding answers. Both are bad outcomes on a business website.

Decision-tree chatbots (the scripted kind) avoid fabrication but only answer the specific questions you anticipated. Users phrasing things slightly differently get dropped into fallback flows. Maintenance is constant as your offering changes.

RAG sits in the pragmatic middle. The model stays flexible and conversational. The knowledge base stays accurate and current. Neither has to do the other's job.

When a small business actually benefits from RAG

RAG is not cheap to run well. The signal that it is the right fit is usually one of the following.

A substantial body of support content. If you have 50 or more knowledge base articles, policy documents or product spec sheets, and customers keep asking questions already answered somewhere in that content, RAG closes the gap. We have seen 40-70 per cent ticket deflection for clients with mature documentation.

A product catalogue that is hard to navigate. When users want to describe what they need in their own words rather than filter by category, RAG over product data helps them self-serve. It works well for trade suppliers, B2B distributors and professional services with complex offerings.

Regulated or precise information. Industries where a wrong answer matters (allied health clinics, financial services, legal practices) benefit from grounded responses with citations. You still need human review and disclaimers, but RAG at least stops the model inventing answers.

Internal staff portals. The strongest ROI case in our experience. Onboarding, HR policies, procurement and IT helpdesk queries are all repetitive and answerable from documents. Internal RAG deployments are lower risk because the audience is known.

If you do not fit one of these shapes, RAG is probably premature. A well-structured FAQ page, better site search or a simple scripted chatbot may cover the need at a fraction of the cost.

An implementation sketch

The architecture has only a handful of parts, though each has choices behind it.

Content ingestion

We pull content from the source of truth (CMS, SharePoint, Notion, help desk, PDFs) through a scheduled sync. Each document is cleaned, stripped of navigation cruft, and split into chunks of 300-800 tokens. Chunk size is one of the few parameters that genuinely matters. Too small and you lose context. Too large and retrieval gets noisy.
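The chunking step can be sketched in a few lines. This is a simplified word-based splitter; the one-token-per-0.75-words ratio is a rough rule of thumb, not a real tokenizer, and `chunk_text` is an illustrative helper rather than a library API. A production pipeline would count tokens with the embedding model's own tokenizer and split on semantic boundaries such as headings.

```python
def chunk_text(text: str, max_tokens: int = 500, overlap_tokens: int = 50) -> list[str]:
    """Split text into overlapping chunks, approximating 1 token ~= 0.75 words.

    The overlap means a sentence cut at a chunk boundary still appears
    intact in the next chunk, which helps retrieval quality.
    """
    words = text.split()
    max_words = int(max_tokens * 0.75)
    overlap_words = int(overlap_tokens * 0.75)
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + max_words, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap_words  # step back to create the overlap
    return chunks
```

In practice the overlap size and split boundaries matter as much as the raw chunk length, which is why this is one of the few parameters worth tuning per client.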

Embeddings and vector storage

Every chunk is converted to a vector using an embedding model (OpenAI text-embedding-3-small is our default, occasionally Cohere or a local model for data residency). The vectors go into a database that can do similarity search. For small businesses we almost always use Postgres with the pgvector extension because most clients already run Postgres, and the cost is effectively zero at modest scale. Pinecone and Qdrant are alternatives when scale or hosted convenience is worth the cost.
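What the database does at query time is conceptually just nearest-neighbour search. The pure-Python sketch below shows the brute-force equivalent of pgvector's cosine-distance ordering (`ORDER BY embedding <=> query LIMIT k`); the two-dimensional vectors and the `(text, embedding)` tuple shape are assumptions for illustration, as real embeddings run to hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 5) -> list[str]:
    """Return the k chunk texts most similar to the query vector.

    This is the brute-force version of what a vector database does with
    an index; at small-business scale, brute force is often fast enough.
    """
    scored = sorted(chunks, key=lambda c: cosine_similarity(query_vec, c[1]), reverse=True)
    return [text for text, _ in scored[:k]]
```

At a few thousand chunks a sequential scan like this completes in milliseconds, which is part of why pgvector on an existing Postgres instance is usually sufficient.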

Query pipeline

At query time, the user's question is embedded, the top 5-10 most similar chunks are retrieved, and those chunks are passed to the language model alongside a carefully written system prompt. The model is instructed to answer only from the provided context and to say so when it does not know. Inline citations back to source URLs are generated so users can verify.
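Prompt assembly is the simplest part of the pipeline to show concretely. The sketch below is one way to wire retrieved chunks and citation markers into a grounded prompt; the dict field names and the exact instruction wording are illustrative, not a fixed template.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt from retrieved chunks.

    Each chunk dict carries 'text' and 'url' so the model can cite
    sources inline and users can verify the answer.
    """
    context = "\n\n".join(
        f"[Source {i + 1}: {c['url']}]\n{c['text']}" for i, c in enumerate(chunks)
    )
    return (
        "Answer the question using ONLY the sources below. "
        "Cite sources as [Source N]. If the sources do not contain the "
        "answer, say you do not know.\n\n"
        f"{context}\n\nQuestion: {question}"
    )
```

The resulting string becomes the user message (or part of the system prompt) in the API call to the language model.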

Guardrails

We implement a minimum of three: refusing to answer when retrieval confidence is low, filtering queries that are obviously off-topic or adversarial, and logging every interaction for review. None is optional.

The whole stack runs comfortably on modest infrastructure. A Next.js or FastAPI backend, a Postgres instance, and API calls to Anthropic or OpenAI. No specialised hardware, no dedicated ML team.

The maintenance reality

This is the part every vendor glosses over. A RAG system is a living thing and needs ongoing attention.

Content freshness. When documents change in the source system, the index needs to update. We run a nightly sync with webhook-triggered updates for critical changes. Stale answers about pricing or policies erode trust faster than anything else.
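The nightly sync only needs to re-embed what actually changed, which a content hash makes cheap to detect. The dict shapes below are assumptions for illustration; a real sync would read hashes from the index table and document content from the source system's API.

```python
import hashlib

def changed_docs(source_docs: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    """Return IDs of documents whose content no longer matches the index.

    Only these documents get re-chunked and re-embedded, keeping the
    nightly sync fast and the embedding bill small.
    """
    stale = []
    for doc_id, content in source_docs.items():
        digest = hashlib.sha256(content.encode()).hexdigest()
        if indexed_hashes.get(doc_id) != digest:
            stale.append(doc_id)
    return stale
```

Webhook-triggered updates for critical pages use the same check, just fired per document instead of on a schedule.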

Eval and regression testing. We maintain a bank of 50-200 golden questions with expected answers and run them whenever the prompt, model or retrieval stack changes. Without this, improvements to one query silently break another.
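A minimal eval harness can be a keyword check over the golden questions. The dict shape and the pass criterion below are simplifications for illustration; production evals usually use exact-match rules or an LLM grader, but even this crude version catches regressions a demo never will.

```python
def run_evals(golden: list[dict], answer_fn) -> float:
    """Run every golden question through the pipeline; return the pass rate.

    A case passes when the answer contains all of its expected keywords.
    Run this on every prompt, model or retrieval change.
    """
    passed = 0
    for case in golden:
        answer = answer_fn(case["question"]).lower()
        if all(kw.lower() in answer for kw in case["expected_keywords"]):
            passed += 1
    return passed / len(golden)
```

Tracking the pass rate over time is the point: a change that lifts one query and silently breaks three others shows up as a falling number, not a vague feeling.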

Prompt drift. As the business changes (tone shifts, new product lines launch, legal language updates), the system prompt needs updating. Treat it as code. Version-controlled, reviewed, and deployed deliberately.

Cost monitoring. Token usage can drift upward as conversations get longer or retrieval returns larger chunks. A weekly review of cost per conversation catches this early.
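The metric worth watching is cost per conversation, which is simple arithmetic over token counts. The function below assumes per-million-token pricing; the rates in the test are placeholders, so check your provider's current price sheet rather than trusting any figure here.

```python
def cost_per_conversation(prompt_tokens: int, completion_tokens: int,
                          in_price_per_m: float, out_price_per_m: float) -> float:
    """Cost of one conversation, given per-million-token prices.

    Prompt tokens include the system prompt and all retrieved chunks,
    which is why over-retrieval shows up directly in this number.
    """
    return (prompt_tokens * in_price_per_m
            + completion_tokens * out_price_per_m) / 1_000_000
```

If the weekly average creeps up, the usual culprits are longer conversations or retrieval returning larger chunks, both easy to fix once spotted.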

Budget for 2-4 hours per month of maintenance on a stable deployment, more in the first three months after launch.

What it costs to run

For a small business deployment handling 1,000-3,000 conversations a month, monthly running costs in 2026 look like this:

  • Model calls (Claude Sonnet or GPT-4o class): AUD 80-250
  • Embedding generation: AUD 5-15
  • Vector storage (pgvector on existing Postgres): effectively zero
  • Hosting for API and sync jobs: AUD 20-60
  • Monitoring and logging: AUD 10-30

So roughly AUD 150-350 per month for the running stack, plus the initial build. Compare that to the loaded cost of a support team member handling the same volume and the ROI is usually obvious by month two or three.

Where it can go wrong

The failure modes we have seen most often.

Garbage in, garbage out. If your underlying documentation is outdated or inconsistent, RAG will confidently surface the bad information. Cleaning the source content is half the project.

Over-retrieval. Stuffing 15 chunks of context into every query wastes tokens and confuses the model. Fewer, higher-quality chunks consistently outperform a larger, noisier set.

Skipping evaluation. A system that looks good in a demo can be 70 per cent accurate on real queries. Without a rigorous eval harness you will not know, and the support team quietly loses confidence in it.

Treating it as set-and-forget. RAG is not a plug-in. It is a product. Plan for the ongoing work before you start.

If RAG sounds like the right shape for your business, or you are not sure, we are happy to help you work out whether it fits before you commit to building.
