Generative AI Implementation · Cornerstone
A decision framework for when retrieval-augmented generation earns its keep, when fine-tuning is the right answer, and when prompting alone is enough. With the cost-and-latency reality of each in 2026.
Quantum Associates — Quantum Associates
· 14 min read
Three years into the generative-AI era, “RAG vs fine-tuning vs prompting” has become the standard architecture decision for any non-trivial GenAI system. The vendor literature on each is enormous, partial, and usually directed at the architecture the vendor sells. This piece is the decision framework we actually use inside engagements — opinionated, no vendor agenda, and grounded in the cost and latency reality of running each in production in 2026.
The short version: most production systems use two of the three, not one. Prompting is almost always involved. RAG and fine-tuning are the architectural questions worth thinking carefully about. The most consequential failure mode we see is over-engineering — reaching for fine-tuning when better prompting would have done, or reaching for RAG when the corpus was small enough to fit in context.
Prompting is the baseline: an LLM call with a prompt designed to elicit the behaviour you want. Includes few-shot examples in the prompt, structured output instructions, chain-of-thought scaffolding, system messages defining role and constraints. No external system involved beyond the model call.
RAG (retrieval-augmented generation) is prompting plus a retrieval step. Before each LLM call, relevant context is fetched from an external knowledge store (typically a vector database, sometimes a hybrid search system) and injected into the prompt. The model doesn’t memorise the knowledge; it reads it freshly on each query.
Fine-tuning modifies the model itself. A base foundation model is further trained on a curated dataset of input-output examples, producing a derivative model that has the desired behaviour patterns baked into the weights. The fine-tuned model is then prompted normally; the “knowledge” of how to respond is inside the model.
These aren’t mutually exclusive. You can fine-tune a model that uses RAG. You can run RAG with a base model. You can prompt a fine-tuned model. The architecture question is which of these capabilities you actually need.
Prompting is the right answer more often than people assume. The pattern recognition is roughly this: if the task can be specified in a one-page prompt, and the model can read the data freshly each time, you probably don’t need RAG or fine-tuning.
Concrete examples where prompting alone is the right answer:
Prompting cost: one model call per task. Latency: typically 1–5 seconds depending on output length. Operational complexity: low — just the model API and your application code. No vector database to manage, no fine-tuning pipeline, no model versioning beyond the foundation model’s own versions.
The reason to NOT default to prompting: when the context you need is too large to fit in the prompt (more than ~50 documents at any meaningful detail), or when the behaviour pattern you need is too subtle to encode in instructions. Those are the cases that push you toward RAG or fine-tuning respectively.
RAG is the right architecture when you have a large corpus of knowledge and you need the model to reason over the relevant subset for each query.
The clearest signals you need RAG:
The architecture work for production RAG is real and often underestimated:
Production RAG cost: typically 2–4x prompting cost (the retrieval step itself plus the longer context). Latency: typically 2–8 seconds depending on retrieval system. Operational complexity: meaningfully higher — vector database, embedding pipeline, retrieval evaluation, ongoing corpus maintenance.
The most common RAG failure mode in 2026 is insufficient evaluation discipline. Teams ship RAG, deploy to production, and don’t measure retrieval quality systematically. Six months later quality has drifted and the team has no visibility into why. The fix is to build evals first, then build RAG.
Fine-tuning is the right answer less often than the vendor literature suggests, but where it fits, it fits well.
The clearest signals fine-tuning is the right choice:
The work for production fine-tuning:
Production fine-tuning cost: highly variable. The training is one-time per fine-tune (typically AUD $500–$5,000 depending on dataset size and model). Inference cost is the same as the base model. Operational complexity: meaningful — training pipeline, evaluation infrastructure, model versioning, regression testing.
The most common fine-tuning failure mode: fine-tuning instead of better prompting. We’ve seen organisations spend months on fine-tuning that a two-week prompt-engineering effort would have matched or beaten. The decision rule we use: try sophisticated prompting first. Fine-tune only when prompting plateaus below the quality threshold.
The flowchart we use inside engagements is roughly:
Step 1: Can the task be specified in a prompt?
Step 2: Is the gap “knowledge the model doesn’t have” or “behaviour the model doesn’t exhibit”?
Step 3: For RAG: is the corpus stable enough and small enough that you could fit it in context?
Step 4: For fine-tuning: have you exhausted prompting?
Step 5: Does the system need to combine knowledge AND behaviour gaps?
This is the framework. The application of it to your specific situation is what the discovery + design phases of an engagement do.
Approximate cost ranges for a moderate production workload — say, 100,000 LLM-driven workflows per month:
These ranges depend heavily on the specific implementation — prompt length, retrieval system, model, infrastructure. The right way to get accurate figures is to model the specific workload against the specific providers. We do this in week 2 of a Generative AI Pilot as the “two cost models” deliverable (build cost and twelve-month run cost).
A few specific things have shifted in 2026 that change the decision calculus from what was true even 12 months ago:
Long-context models are more capable. Claude’s 1M-token context window and similar from competitors mean some use cases that would have required RAG can now use long-context prompting. The decision isn’t free — long-context calls are slower and more expensive per call — but for stable, moderately-sized corpora it’s a real alternative to RAG infrastructure.
Prompt caching is mainstream. Anthropic, OpenAI and Google all offer prompt caching at meaningful discounts. This dramatically improves the economics of prompting and RAG architectures that have stable system prompts or stable context blocks. It also reduces (but doesn’t eliminate) the “compress the prompt” argument for fine-tuning.
Smaller, more capable fine-tuneable base models have arrived. Llama 3.x at 70B and below, Mistral’s newer releases, Qwen, others. The fine-tuning case for using a smaller, fine-tuned model for narrow tasks is stronger than it was — particularly where latency or cost matter and you have the operations capability to host your own inference.
Evaluations tooling has matured. OpenAI Evals, custom evaluation harnesses, LangSmith and similar platforms — the tooling for systematically measuring RAG and fine-tuning quality is meaningfully better than it was. Less excuse for shipping un-evaluated systems.
Most production GenAI engagements we run end up combining:
The diagnostic for which combination is right for a specific use case is the Generative AI Pilot — three weeks from kickoff to a working system in production. Week 1 of the pilot is exactly this architecture decision; weeks 2 and 3 are the build.
If you’re earlier in the journey — not yet committed to a specific use case, multiple AI ideas in the air — the AI Readiness Sprint is the right starting point. That’s the two-week assessment that produces the prioritised backlog before any architecture decisions need to be made.
For everything else, the discovery call is 30 minutes and conversational.
Related insights
AI Agents & Agentic Automation
When agentic AI earns its keep over RPA, when RPA is still the right answer, and how to evaluate a workflow before committing to either. Honest, opinionated, no vendor agenda.
AI Strategy & Roadmapping
The pattern is consistent enough to be useful. What separates AI pilots that ship into production from the ones that quietly die six months in.
Next step
30 minutes, no pitch, no deck — just a working conversation about how this applies to your situation.