RAG vs fine-tuning vs prompting: choosing the right architecture

A decision framework for when retrieval-augmented generation earns its keep, when fine-tuning is the right answer, and when prompting alone is enough, with the cost-and-latency reality of each in 2026.

Quantum Associates

· 14 min read

Three years into the generative-AI era, “RAG vs fine-tuning vs prompting” has become the standard architecture decision for any non-trivial GenAI system. The vendor literature on each is enormous, partial, and usually directed at the architecture the vendor sells. This piece is the decision framework we actually use inside engagements — opinionated, no vendor agenda, and grounded in the cost and latency reality of running each in production in 2026.

The short version: most production systems use two of the three, not one. Prompting is almost always involved. RAG and fine-tuning are the architectural questions worth thinking carefully about. The most consequential failure mode we see is over-engineering — reaching for fine-tuning when better prompting would have done, or reaching for RAG when the corpus was small enough to fit in context.

The three approaches, fast

Prompting is the baseline: an LLM call with a prompt designed to elicit the behaviour you want. Includes few-shot examples in the prompt, structured output instructions, chain-of-thought scaffolding, system messages defining role and constraints. No external system involved beyond the model call.

RAG (retrieval-augmented generation) is prompting plus a retrieval step. Before each LLM call, relevant context is fetched from an external knowledge store (typically a vector database, sometimes a hybrid search system) and injected into the prompt. The model doesn’t memorise the knowledge; it reads it freshly on each query.
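
To make the "prompting plus a retrieval step" shape concrete, here is a minimal sketch. The corpus, the word-overlap scoring and the function names are all illustrative assumptions; a production system would replace `score` with embedding similarity or hybrid search, and would send the resulting prompt to a model.

```python
from collections import Counter

# Hypothetical in-memory corpus; a real system would use a vector or hybrid store.
DOCS = {
    "leave-policy": "Employees accrue annual leave at four weeks per year.",
    "expenses": "Expense claims must be submitted within 30 days with receipts.",
    "security": "Laptops must use full-disk encryption and a screen lock.",
}


def tokenize(text: str) -> list[str]:
    """Lowercase words with trailing punctuation stripped."""
    return [w.strip(".,?!").lower() for w in text.split()]


def score(query: str, doc: str) -> int:
    """Toy relevance: number of shared words (a stand-in for embedding similarity)."""
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    return sum((q & d).values())


def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Return the k highest-scoring (doc_id, text) pairs."""
    ranked = sorted(DOCS.items(), key=lambda kv: score(query, kv[1]), reverse=True)
    return ranked[:k]


def build_prompt(query: str, k: int = 2) -> str:
    """Inject the retrieved passages, tagged with their ids, ahead of the question."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in retrieve(query, k))
    return (
        "Answer using only the sources below and cite the source id.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )


print(build_prompt("When must expense claims be submitted?"))
```

Tagging each passage with its id is also what makes source citations straightforward later on.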

Fine-tuning modifies the model itself. A base foundation model is further trained on a curated dataset of input-output examples, producing a derivative model that has the desired behaviour patterns baked into the weights. The fine-tuned model is then prompted normally; the “knowledge” of how to respond is inside the model.

These aren’t mutually exclusive. You can fine-tune a model that uses RAG. You can run RAG with a base model. You can prompt a fine-tuned model. The architecture question is which of these capabilities you actually need.

When prompting alone is enough

Prompting is the right answer more often than people assume. The pattern recognition is roughly this: if the task can be specified in a one-page prompt, and the model can read the data freshly each time, you probably don’t need RAG or fine-tuning.

Concrete examples where prompting alone is the right answer:

  • Text transformation tasks — summarisation, translation, format conversion, structured extraction from short documents
  • Single-document analysis — Q&A over one document, sentiment analysis on one piece of text
  • Drafting tasks with brief context — drafting an email given the relevant facts in the prompt
  • Classification tasks with a small label set and few-shot examples in the prompt
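
As an illustration of the last case, a few-shot classification prompt can be assembled with plain string handling. This is a sketch only: the label set, the example messages and the fallback to "other" are hypothetical choices, not a prescribed design.

```python
LABELS = ["billing", "technical", "other"]  # hypothetical label set

FEW_SHOT = [  # hypothetical labelled examples
    ("I was charged twice this month.", "billing"),
    ("The app crashes when I upload a file.", "technical"),
    ("Do you have an office in Perth?", "other"),
]


def build_classification_prompt(text: str) -> str:
    """Instructions, then the few-shot examples, then the message to classify."""
    lines = [
        "Classify the message into exactly one label.",
        f"Allowed labels: {', '.join(LABELS)}",
        "Reply with the label only.",
        "",
    ]
    for example, label in FEW_SHOT:
        lines += [f"Message: {example}", f"Label: {label}", ""]
    lines += [f"Message: {text}", "Label:"]
    return "\n".join(lines)


def parse_label(model_output: str) -> str:
    """Normalise the model's reply; anything off-list falls back to 'other'."""
    candidate = model_output.strip().lower()
    return candidate if candidate in LABELS else "other"


print(build_classification_prompt("My invoice shows the wrong amount."))
```

Validating the reply against the allowed labels keeps a malformed answer from leaking into downstream code.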

Prompting cost: one model call per task. Latency: typically 1–5 seconds depending on output length. Operational complexity: low — just the model API and your application code. No vector database to manage, no fine-tuning pipeline, no model versioning beyond the foundation model’s own versions.

The reason to NOT default to prompting: when the context you need is too large to fit in the prompt (more than ~50 documents at any meaningful detail), or when the behaviour pattern you need is too subtle to encode in instructions. Those are the cases that push you toward RAG or fine-tuning respectively.

When RAG earns its keep

RAG is the right architecture when you have a large corpus of knowledge and you need the model to reason over the relevant subset for each query.

The clearest signals you need RAG:

  1. The knowledge base is too large to fit in a prompt. Even with 200k+ context windows, putting your entire corporate document corpus into every prompt is expensive and degrades model performance through context dilution. RAG retrieves only what’s relevant.
  2. The knowledge base changes frequently. Fine-tuning bakes knowledge into weights; updating requires retraining. RAG retrieves at query time, so updates to the corpus are available as soon as they have been re-indexed.
  3. The answer needs to be grounded in source documents. Compliance, regulatory, customer-support and legal use cases typically require traceable citations. RAG produces citations naturally; fine-tuning doesn’t.

The architecture work for production RAG is real and often underestimated:

  • Chunking strategy — how the source documents are split into retrievable units. Naive chunking (fixed token sizes) produces poor retrieval; semantic chunking or document-structure-aware chunking is usually worth the effort.
  • Embedding model selection — the embedding model determines what “relevance” means. Pick badly and retrieval quality is capped, no matter how good the LLM downstream.
  • Hybrid retrieval — vector similarity alone is often insufficient. Production RAG systems frequently combine vector search with keyword (BM25) and metadata filtering.
  • Reranking — a second-pass model that reranks the initial retrieval results. Cheap to add, often substantially improves answer quality.
  • Evaluations harness — RAG quality is determined by retrieval quality and generation quality, both of which need separate evaluation. Without evals, you don’t know if a change improved or regressed the system.
  • Observability — every retrieval traced, every generation traced. Drift detection, cost monitoring, latency monitoring.
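
For the hybrid retrieval item above, one widely used way to merge a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). The sketch below assumes each retriever returns an ordered list of document ids; choosing RRF and the constant `k = 60` is an assumption for illustration, not something this framework mandates.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each document scores sum(1 / (k + rank))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["a", "b", "c"]   # hypothetical vector-search ranking
keyword_hits = ["b", "d", "a"]  # hypothetical BM25 ranking
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))  # ['b', 'a', 'd', 'c']
```

Documents that rank well in both lists rise to the top, without any need to put vector and keyword scores on a shared scale.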

Production RAG cost: typically 2–4x prompting cost (the retrieval step itself plus the longer context). Latency: typically 2–8 seconds depending on retrieval system. Operational complexity: meaningfully higher — vector database, embedding pipeline, retrieval evaluation, ongoing corpus maintenance.

The most common RAG failure mode in 2026 is insufficient evaluation discipline. Teams ship RAG, deploy to production, and don’t measure retrieval quality systematically. Six months later quality has drifted and the team has no visibility into why. The fix is to build evals first, then build RAG.
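
A retrieval eval does not need to be elaborate to be useful. The sketch below assumes a small hand-labelled "golden set" of queries with known relevant document ids, and a `retriever` callable that returns ranked ids; both are hypothetical stand-ins for your own data and retrieval stack.

```python
from statistics import mean
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(
    cases: list[tuple[str, set[str]]],
    retriever: Callable[[str], list[str]],
    k: int = 5,
) -> dict[str, float]:
    """Average recall@k and MRR over a non-empty golden set."""
    recalls, rranks = [], []
    for query, relevant in cases:
        retrieved = retriever(query)
        recalls.append(recall_at_k(retrieved, relevant, k))
        rranks.append(reciprocal_rank(retrieved, relevant))
    return {f"recall@{k}": mean(recalls), "mrr": mean(rranks)}
```

Running this on every change to chunking, embeddings or reranking gives a before-and-after number instead of an impression.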

When fine-tuning earns its keep

Fine-tuning is the right answer less often than the vendor literature suggests, but where it fits, it fits well.

The clearest signals fine-tuning is the right choice:

  1. The desired behaviour is too subtle to specify in a prompt. Specific tone of voice for a brand, specific formatting conventions for an industry, specific reasoning patterns that few-shot examples can’t reliably elicit.
  2. You need to compress prompts at scale. A fine-tuned model can produce behaviour that a base model requires a long prompt to elicit. If you’re making millions of calls a month, the prompt-token savings compound.
  3. Latency is critical. A smaller fine-tuned model can sometimes match a larger base model’s quality on a narrow task, at substantially lower latency and cost.
  4. You have proprietary patterns the foundation model doesn’t know. Specific document formats, internal code conventions, domain-specific reasoning that’s not represented in foundation model training data.
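
The prompt-compression argument in signal 2 is easy to sanity-check with arithmetic. The sketch below is a back-of-envelope model; every number in the example (call volume, tokens saved, token price, training cost, inference premium) is a hypothetical placeholder to be replaced with your own figures.

```python
from typing import Optional


def monthly_prompt_savings(
    calls_per_month: int,
    tokens_saved_per_call: int,
    price_per_million_input_tokens: float,
) -> float:
    """Input-token spend avoided each month by shortening the prompt."""
    return calls_per_month * tokens_saved_per_call * price_per_million_input_tokens / 1_000_000


def breakeven_months(
    training_cost: float,
    monthly_savings: float,
    monthly_inference_premium: float = 0.0,
) -> Optional[float]:
    """Months until training cost is recovered; None if the savings never cover it."""
    net = monthly_savings - monthly_inference_premium
    if net <= 0:
        return None
    return training_cost / net


# Hypothetical workload: 2M calls/month, 1,500 prompt tokens saved per call,
# input tokens priced at 4.00 per million.
savings = monthly_prompt_savings(2_000_000, 1_500, 4.0)
print(savings)                                   # 12000.0
print(breakeven_months(3_000, savings, 6_000))   # 0.5
```

If the fine-tuned model is billed at a premium per token, that premium can cancel the savings entirely, which is why the function returns `None` rather than a negative payback period.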

The work for production fine-tuning:

  • Training data curation — usually the hardest and most expensive part. High-quality, representative input-output pairs. Several hundred to several thousand examples is typical; quality matters more than quantity.
  • Base model selection — fine-tuning is supported on different base models in different ways. OpenAI’s fine-tuning API works on selected GPT models; Anthropic offers fine-tuning via partnerships; open-weights models (Llama, Mistral) can be fine-tuned in your own infrastructure.
  • Evaluation against the base model — fine-tuning is only worth doing if the fine-tuned model meaningfully outperforms the base model with appropriate prompting. We see fine-tuning projects abandoned at this evaluation step about as often as we see them succeed.
  • Ongoing maintenance — base model upgrades, corpus drift, periodic re-fine-tuning. Fine-tuned models accumulate technical debt; the maintenance posture needs to be established upfront.
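
Much of the data-curation effort is catching malformed records before they reach a training job. The sketch below assumes a chat-style JSONL layout (one JSON object per line with a `messages` list of `role`/`content` pairs), which is the general shape several hosted fine-tuning APIs accept; the specific checks are illustrative, so confirm the exact schema against your provider's documentation.

```python
import json


def validate_example(line: str) -> list[str]:
    """Return the problems found in one JSONL training record (empty list = OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    if not all(isinstance(m, dict) for m in messages):
        return ["every message must be an object"]

    problems = []
    roles = [m.get("role") for m in messages]
    if "user" not in roles:
        problems.append("no user message")
    if roles[-1] != "assistant":
        problems.append("last message is not the assistant target")
    if any(not isinstance(m.get("content"), str) or not m["content"].strip() for m in messages):
        problems.append("empty or non-string content")
    return problems


def validate_file(lines: list[str]) -> dict[int, list[str]]:
    """Map 1-based line numbers to their problems, skipping clean records."""
    report = {}
    for number, line in enumerate(lines, start=1):
        problems = validate_example(line)
        if problems:
            report[number] = problems
    return report
```

Structural checks like these are the cheap part; reviewing whether each target output is actually the behaviour you want still needs a human.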

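Production fine-tuning cost: highly variable. Training is a one-time cost per fine-tune (typically AUD $500–$5,000 depending on dataset size and model). Inference cost depends on where the model runs: hosted fine-tuned models are typically billed at a premium over the base model rate, while self-hosted open-weights models cost whatever your serving infrastructure costs. Operational complexity: meaningful, spanning the training pipeline, evaluation infrastructure, model versioning and regression testing.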

The most common fine-tuning failure mode: fine-tuning instead of better prompting. We’ve seen organisations spend months on fine-tuning that a two-week prompt-engineering effort would have matched or beaten. The decision rule we use: try sophisticated prompting first. Fine-tune only when prompting plateaus below the quality threshold.

The decision framework

The flowchart we use inside engagements is roughly:

Step 1: Can the task be specified in a prompt?

  • If the answer is yes — knowledge fits, instructions fit, examples fit — start with prompting and engineer it well before considering anything else.
  • If no — proceed.

Step 2: Is the gap “knowledge the model doesn’t have” or “behaviour the model doesn’t exhibit”?

  • Knowledge gap → RAG
  • Behaviour gap → fine-tuning

Step 3: For RAG: is the corpus stable enough and small enough that you could fit it in context?

  • If yes — try long-context prompting first. Sometimes simpler than RAG.
  • If no — RAG.

Step 4: For fine-tuning: have you exhausted prompting?

  • If no — go back to prompting. Better prompts, better few-shot examples, better chain-of-thought scaffolding.
  • If yes — fine-tune.

Step 5: Does the system need to combine knowledge AND behaviour gaps?

  • If yes: RAG + fine-tuning. The fine-tuned model uses RAG-retrieved context. Most expensive option; most powerful where it fits.
  • If no: stay with the single approach that Steps 2–4 selected.
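
The five steps above can be sketched as a small function. The field names are hypothetical, and real engagements involve judgement that booleans cannot capture, so treat this as a restatement of the flowchart rather than a decision engine. The fallback to prompting when no gap is flagged is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    fits_in_prompt: bool          # Step 1: knowledge, instructions and examples fit
    knowledge_gap: bool           # Step 2: the model lacks knowledge
    behaviour_gap: bool           # Step 2: the model lacks a behaviour
    corpus_fits_in_context: bool  # Step 3: corpus is stable and small enough
    prompting_exhausted: bool     # Step 4: prompting has plateaued


def recommend(uc: UseCase) -> list[str]:
    """Return the architecture components suggested by Steps 1 to 5."""
    if uc.fits_in_prompt:                      # Step 1
        return ["prompting"]
    choice: list[str] = []
    if uc.knowledge_gap:                       # Steps 2 and 3
        choice.append("long-context prompting" if uc.corpus_fits_in_context else "RAG")
    if uc.behaviour_gap:                       # Steps 2 and 4
        choice.append("fine-tuning" if uc.prompting_exhausted else "better prompting")
    # Step 5 falls out naturally: both gaps yield two components.
    return choice or ["prompting"]             # assumption: no gap flagged means prompting


print(recommend(UseCase(False, True, True, False, True)))  # ['RAG', 'fine-tuning']
```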

This is the framework. The application of it to your specific situation is what the discovery + design phases of an engagement do.

Cost reality, AUD, 2026

Approximate cost ranges for a moderate production workload — say, 100,000 LLM-driven workflows per month:

  • Prompting only, GPT-4o or Claude 3.7 Sonnet: AUD $500–$3,000 / month, scaling roughly with prompt+completion token count
  • RAG, same base model: AUD $1,000–$6,000 / month (the base prompting cost plus the retrieval/embedding overhead plus the vector database — Pinecone, Weaviate, or Postgres+pgvector)
  • Fine-tuning, OpenAI: training AUD $500–$5,000 one-time; inference cost typically 1.5–3x base model rate
  • Fine-tuning, open-weights on your own infrastructure: training cost is GPU rental (AUD $50–$2,000); inference cost depends on your infrastructure
  • RAG + fine-tuning: combined costs from both

These ranges depend heavily on the specific implementation — prompt length, retrieval system, model, infrastructure. The right way to get accurate figures is to model the specific workload against the specific providers. We do this in week 2 of a Generative AI Pilot as the “two cost models” deliverable (build cost and twelve-month run cost).
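
A first-pass run-cost model is a few lines of arithmetic. In the sketch below the token counts, per-million-token prices and the fixed vector-database fee are hypothetical inputs chosen only to show the shape of the calculation; substitute your provider's current price list and your measured token counts.

```python
def monthly_cost(
    workflows: int,
    input_tokens: int,
    output_tokens: int,
    input_price_per_m: float,
    output_price_per_m: float,
    fixed_monthly: float = 0.0,
) -> float:
    """Token spend for one model call per workflow, plus any fixed monthly fees."""
    per_call_token_price = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return workflows * per_call_token_price / 1_000_000 + fixed_monthly


# Hypothetical workload: 100,000 workflows/month, prices of 4.00 (input)
# and 16.00 (output) per million tokens.
prompting = monthly_cost(100_000, 2_000, 500, 4.0, 16.0)
# Same workload with 4,000 extra tokens of retrieved context per call
# and a hypothetical 300/month vector database fee.
rag = monthly_cost(100_000, 6_000, 500, 4.0, 16.0, fixed_monthly=300.0)

print(prompting)  # 1600.0
print(rag)        # 3500.0
```

The model omits embedding and reranking calls; adding them as extra per-workflow terms is straightforward once the retrieval design is known.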

The 2026 considerations that change the calculus

A few specific things have shifted in 2026 that change the decision calculus from what was true even 12 months ago:

Long-context models are more capable. Claude’s 1M-token context window, and similar offerings from competitors, mean that some use cases that would once have required RAG can now use long-context prompting. The decision isn’t free, since long-context calls are slower and more expensive per call, but for stable, moderately sized corpora it’s a real alternative to RAG infrastructure.
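
A quick feasibility check for the long-context route is to estimate whether the corpus fits at all. The sketch below uses a rough heuristic of about four characters per token for English text, which is an assumption; use the provider's tokenizer for real sizing. The reserved-token budget for instructions and output is likewise a placeholder.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; the 4 chars/token ratio is a heuristic, not a tokenizer."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(
    documents: list[str],
    context_window: int,
    reserved_tokens: int = 8_000,
) -> bool:
    """True if all documents fit after reserving room for instructions and output."""
    budget = context_window - reserved_tokens
    needed = sum(estimate_tokens(doc) for doc in documents)
    return needed <= budget


corpus = ["x" * 400_000] * 5  # five hypothetical documents, ~100k tokens each
print(fits_in_context(corpus, context_window=1_000_000))  # True
print(fits_in_context(corpus, context_window=200_000))    # False
```

Fitting is necessary but not sufficient: per-call cost and latency still scale with the tokens sent.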

Prompt caching is mainstream. Anthropic, OpenAI and Google all offer prompt caching at meaningful discounts. This dramatically improves the economics of prompting and RAG architectures that have stable system prompts or stable context blocks. It also reduces (but doesn’t eliminate) the “compress the prompt” argument for fine-tuning.
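
The effect of caching on input-token spend can be estimated as follows. This is a simplified sketch: the cached-read multiplier, hit rate and prices are hypothetical parameters, and the model ignores cache-write surcharges and cache expiry, both of which vary by provider.

```python
def input_cost_with_cache(
    calls: int,
    stable_tokens: int,
    variable_tokens: int,
    price_per_m: float,
    cached_read_multiplier: float,
    hit_rate: float,
) -> float:
    """Monthly input-token cost when a stable prefix is cached on a share of calls."""
    cached_calls = calls * hit_rate
    uncached_calls = calls - cached_calls
    # The stable prefix is billed in full on a miss and at a discount on a hit.
    stable_token_units = (uncached_calls + cached_calls * cached_read_multiplier) * stable_tokens
    variable_token_units = calls * variable_tokens
    return (stable_token_units + variable_token_units) * price_per_m / 1_000_000


# Hypothetical: 100,000 calls, a 3,000-token stable prefix, 500 variable tokens,
# input priced at 4.00 per million, cached reads billed at 25% of that price.
no_cache = input_cost_with_cache(100_000, 3_000, 500, 4.0, 0.25, hit_rate=0.0)
cached = input_cost_with_cache(100_000, 3_000, 500, 4.0, 0.25, hit_rate=0.9)
print(round(no_cache, 2), round(cached, 2))
```

The larger the stable share of the prompt, the more caching narrows the gap that fine-tuning for prompt compression was meant to close.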

Smaller, more capable fine-tuneable base models have arrived: Llama 3.x at 70B and below, Mistral’s newer releases, Qwen and others. The case for fine-tuning a smaller model for narrow tasks is stronger than it was, particularly where latency or cost matter and you have the operations capability to host your own inference.

Evaluations tooling has matured. OpenAI Evals, custom evaluation harnesses, LangSmith and similar platforms — the tooling for systematically measuring RAG and fine-tuning quality is meaningfully better than it was. Less excuse for shipping un-evaluated systems.

What we actually do

Most production GenAI engagements we run end up combining:

  • Prompting for the orchestration and behaviour layer — always
  • RAG for any non-trivial corporate knowledge access — common
  • Fine-tuning for narrow tone/format/reasoning patterns where prompting plateaus — occasional, less common than RAG

The diagnostic for which combination is right for a specific use case is the Generative AI Pilot — three weeks from kickoff to a working system in production. Week 1 of the pilot is exactly this architecture decision; weeks 2 and 3 are the build.

If you’re earlier in the journey — not yet committed to a specific use case, multiple AI ideas in the air — the AI Readiness Sprint is the right starting point. That’s the two-week assessment that produces the prioritised backlog before any architecture decisions need to be made.

For everything else, the discovery call is 30 minutes and conversational.
