Build vs buy vs fine-tune: an AI implementation decision framework

The single biggest source of wasted AI spend in the AU mid-market in 2026 is the decision to build when buying would have been better — closely followed by the decision to fine-tune when prompting would have been enough. Both errors are made by smart people, usually because the trade-offs aren’t legible until you’ve been through the work once.

This piece is the decision framework we walk through with clients in the first week of any AI engagement. It covers the three implementation patterns (buy / build / fine-tune), the questions that point cleanly to one or another, and the cost-and-effort reality of each at mid-market AU scale in 2026.

The three patterns

Buy means using an off-the-shelf AI product — Copilot for Microsoft 365, Einstein for Salesforce, a vertical SaaS tool with AI features built in, a commercial RAG product like Glean. You’re paying a per-seat or per-call fee and accepting that the product’s opinions about how the work should be done are mostly fixed.

Build means assembling your own system from foundation-model APIs and your own data. You’re typically writing prompts, designing a retrieval pipeline, integrating with your data sources, and shipping the result as a custom application. The model itself is usually frontier-tier and unmodified — the value you add is the integration and the prompting.

Fine-tune means taking an existing foundation model and training it on your own data to change its behaviour. You’re paying compute costs to update the model weights and accepting the operational complexity of running a model that’s yours, not the vendor’s.

These are three different bets, with different cost profiles, time-to-value, and risk profiles. The right one depends on the use case — and most organisations need a mix.

The buy-first heuristic

The default in 2026 should be buy unless you have a specific reason to choose otherwise. Three reasons it’s the right default:

Cost-of-time. A SaaS AI product is live the day you turn it on. A custom build is 8–16 weeks. The opportunity cost of those weeks is usually larger than the customisation gap.
Vendor compounding. The major AI products are improving rapidly. A Copilot you buy in May 2026 is meaningfully better than a Copilot you bought in November 2025. Custom builds don’t get that compounding for free — every improvement requires engineering effort.
Operational depth. A vendor like Microsoft or Salesforce has invested billions in the operational substrate around the AI — monitoring, security, access control, audit logs, regional compliance. You’re not buying just the AI — you’re buying that substrate.

The exceptions to the buy-first default are specific:

The use case is core to the business and differentiating — the way you do this work is the reason customers choose you, and a generic product can’t express it.
The use case has specific compliance requirements that no off-the-shelf product satisfies — sovereignty, audit-trail, integration with a regulator-specific system.
The relevant data is structurally inaccessible to the vendor’s integration patterns — it lives in a legacy system, an unusual format, or a place the SaaS connectors can’t reach.
The unit economics don’t work at the vendor’s pricing — you have 5,000 users who each need a feature that costs $30/user/month from the SaaS vendor, but it’d cost you a fraction of that to build on raw foundation-model APIs.
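The unit-economics exception is easiest to judge with the arithmetic written out. The sketch below uses the 5,000 users and $30/user/month from the example above; the annual cost of the custom build is a purely hypothetical figure chosen for illustration, not a quote.

```python
def annual_saas_cost(users: int, price_per_user_per_month: float) -> float:
    """Annual licence cost for a per-seat SaaS product."""
    return users * price_per_user_per_month * 12


def equivalent_seat_price(annual_build_cost: float, users: int) -> float:
    """Per-user monthly cost of a custom build spread across the same users."""
    return annual_build_cost / (users * 12)


# Figures from the example above.
saas_cost = annual_saas_cost(users=5_000, price_per_user_per_month=30.0)

# ASSUMPTION: hypothetical annual cost of an equivalent custom build.
assumed_build_cost = 700_000
build_seat_price = equivalent_seat_price(assumed_build_cost, users=5_000)

print(f"SaaS licences per year: ${saas_cost:,.0f}")
print(f"Custom build per user per month: ${build_seat_price:,.2f}")
```

Under that assumption the build works out at well under half the vendor's per-seat price, which is the kind of gap that justifies looking past the buy-first default. At a few hundred users the same calculation usually points the other way.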

If none of those apply, buy. The longer answer to “why most organisations should buy first” sits inside our Why most enterprise AI pilots fail piece — most pilot failures we see are custom builds that should have been off-the-shelf.

When to build

Build is right when one or more of the four exception criteria above clearly applies, and the case is strongest when several do. In practice that’s typically:

A workflow that’s specific to your industry (insurance claims triage, legal contract review at a particular firm’s standard, clinical documentation against a specific health system’s templates) that off-the-shelf doesn’t address.
A workflow with compliance and integration requirements (the data must stay in AU, the audit log must match your existing risk framework, the system must integrate with a legacy COBOL system) that exceed what SaaS can offer.
An internal platform play — you’re building an internal tool that 200+ employees will use, and the per-seat pricing of a SaaS equivalent is materially higher than the custom build amortised over 3 years.

The build pattern that’s working in 2026 is what we’d call thin-wrapper-on-frontier-model:

Use a frontier foundation model (Claude Sonnet, GPT, Gemini Pro at the tier your use case actually needs)
Build a retrieval layer over your own data (RAG architecture)
Wrap with strong prompt engineering and evaluation infrastructure
Ship as a focused application, not a platform
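A minimal sketch of the thin-wrapper shape, assuming a toy word-overlap retriever and a model client passed in as a plain function. Every name here (`Passage`, `retrieve`, `build_prompt`, `answer`, `echo_model`) is hypothetical, and a real build would swap in a proper search index and a vendor SDK call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Passage:
    source: str
    text: str


def retrieve(query: str, corpus: List[Passage], k: int = 2) -> List[Passage]:
    """Toy retrieval: rank passages by the number of words shared with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda p: len(query_words & set(p.text.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, passages: List[Passage]) -> str:
    """Put the retrieved passages in front of the question, with their sources."""
    context = "\n".join(f"[{p.source}] {p.text}" for p in passages)
    return (
        "Answer using only the context below and cite the source in brackets.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


def answer(query: str, corpus: List[Passage], call_model: Callable[[str], str]) -> str:
    """Retrieve, build the prompt, and hand it to whichever model client is injected."""
    return call_model(build_prompt(query, retrieve(query, corpus)))


corpus = [
    Passage("claims-guide.txt", "Water damage claims require a plumber report before approval"),
    Passage("leave-policy.txt", "Annual leave requests need manager approval two weeks ahead"),
    Passage("fraud-guide.txt", "Suspected fraud claims are escalated to the investigations team"),
]


def echo_model(prompt: str) -> str:
    # Stand-in for a real model call: returns the first context line it was given.
    return prompt.split("Context:\n")[1].splitlines()[0]


print(answer("what report do water damage claims require", corpus, echo_model))
```

Keeping the model behind a single injected function is the design choice that matters: the retrieval, prompt, and evaluation work carries over when the underlying model is upgraded or swapped.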

Build cost in AU mid-market for a focused build in this pattern is typically:

Discovery + design: $40K–$80K (2–4 weeks, see our Generative AI Pilot)
Build + ship: $80K–$200K (8–12 weeks)
Year-1 run cost: $50K–$200K depending on usage volume and model selection
Year-1 operate cost (internal owner + ongoing tuning): $100K–$250K

Total year-1 envelope: roughly $270K–$730K across the four line items above, for a single production workflow at mid-market scale. Year-2 cost drops significantly because the discovery and build costs are one-off.

If those numbers don’t pencil out against the value of the use case, the right call is usually buy, not “build at a smaller budget”. Under-budgeted builds are what fund most of the AU AI consulting industry’s rework engagements.

When to fine-tune (rarely)

Fine-tuning is the most over-prescribed pattern in 2026 AU AI consulting. The pitch is intuitive — “we’ll train a model on your data so it understands your business” — but the reality is that fine-tuning is rarely the right answer for the problem the buyer is actually trying to solve.

Fine-tuning is the right answer when:

You have a specific output format that a frontier model with good prompting can’t reliably produce. Examples: a particular code style that’s idiosyncratic to your codebase; a very specific clinical documentation format; a domain-specific structured-output format with rules a prompt can’t fully express.
You have latency or cost constraints that mean a smaller, specialised model is meaningfully cheaper to run at your volume — and the lift from fine-tuning is large enough to make the trade-off work.
You’ve already exhausted prompting and retrieval and have measurable evidence that the gap can’t be closed by improving them.

Fine-tuning is the wrong answer when:

You want the model to “know your data” — that’s a retrieval problem, not a fine-tuning problem. RAG, not weights.
You want the model to use your terminology — that’s a prompting and few-shot examples problem.
You think your data is so unique that a foundation model couldn’t reason over it — almost always false; frontier models in 2026 reason over Australian regulatory text, clinical literature, legal frameworks, and most domain-specific corpora well, given the right prompts.

The cost-and-effort reality of fine-tuning:

Frontier-model fine-tuning (where vendors expose it): $50K–$200K to design, train, and evaluate a meaningful run, often plus a premium per-call inference price for the fine-tuned variant.
Open-source fine-tuning (Llama, Mistral, etc): $80K–$300K to set up the training infrastructure, run iterations, and stand up production-grade hosting and observability.
Ongoing cost: re-fine-tuning at each model upgrade, evaluation infrastructure, the operational complexity of running model weights that are yours.

The cleaner alternative for most “we want the model to behave specifically” cases: invest the same money in better prompts, better retrieval, and better evaluation — you’ll typically get 80% of the benefit at 20% of the cost, and the work is portable across model upgrades.

The decision framework

Walk through the following in order. A “yes” at Q1 or Q2 settles the buy-or-build decision; Q3 only comes into play once a build exists.

Q1. Is there a credible off-the-shelf SaaS product that solves 80% of this use case? → Yes: buy it. Then evaluate after 6 months whether the remaining 20% justifies a custom build or whether the SaaS vendor has closed the gap.

Q2. Is the use case core to your differentiation, compliance-restricted in ways SaaS can’t serve, structurally inaccessible to SaaS connectors, or at a scale where the unit economics don’t work? → Yes: build, using the thin-wrapper-on-frontier-model pattern. → No: revisit Q1 — you may have been too restrictive about what SaaS options were available.

Q3. After building, have you exhausted prompting and retrieval improvements and still have a measurable behavioural gap? → Yes: consider fine-tuning, with a clear specification of what you’re trying to achieve. → No: keep iterating on prompts and retrieval; the marginal cost is lower.
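The three questions can be written as a small function, which makes the ordering explicit. This is a sketch of the framework as described above; the `UseCase` fields and the returned strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    saas_covers_80_percent: bool            # Q1
    differentiating: bool = False           # Q2 criteria
    compliance_restricted: bool = False
    data_inaccessible_to_saas: bool = False
    unit_economics_fail: bool = False
    already_built: bool = False             # Q3 only applies after a build
    prompting_and_retrieval_exhausted: bool = False
    measurable_gap_remains: bool = False


def decide(uc: UseCase) -> str:
    """Return the recommendation for one use case."""
    if uc.already_built:
        # Q3: fine-tuning is only on the table once cheaper fixes are exhausted.
        if uc.prompting_and_retrieval_exhausted and uc.measurable_gap_remains:
            return "consider fine-tuning"
        return "keep iterating on prompts and retrieval"
    if uc.saas_covers_80_percent:
        return "buy"                        # Q1
    if any([
        uc.differentiating,
        uc.compliance_restricted,
        uc.data_inaccessible_to_saas,
        uc.unit_economics_fail,
    ]):
        return "build"                      # Q2
    return "revisit Q1"


# Example: no suitable SaaS product, and a compliance restriction applies.
print(decide(UseCase(saas_covers_80_percent=False, compliance_restricted=True)))
```

Note that there is no path from Q1 straight to fine-tuning: the function cannot return "consider fine-tuning" unless a build already exists and its prompting and retrieval have been worked over.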

The most common error is jumping from Q1 to Q3 — “we want a model that understands our business, let’s fine-tune one.” That conflates “the model needs access to our data” (a retrieval problem) with “the model needs to behave differently” (a fine-tuning problem). They’re different problems with different solutions.

A practical example

A mid-market AU insurer asked us to scope an AI engagement to “fine-tune a model on our claims handling guidelines so adjusters can ask it questions during triage.”

We walked through the framework:

Q1: Could Copilot Studio plus their existing SharePoint guidelines library serve this? Possibly — but the adjusters needed specific structured outputs (a recommended next-action with rationale), and the audit log requirements were specific. So we provisionally said no.
Q2: Was the use case differentiating? Not really. Was it compliance-restricted? Yes — needed AU sovereignty and an audit trail tied to the policy admin system. So: build.
The build that shipped: a thin wrapper on a frontier model, with retrieval against the guidelines library, with prompt-engineered structured outputs and a strong evaluation framework. ~12 weeks, ~$180K all-in.
The fine-tune that didn’t ship: we didn’t need it. The model — with good retrieval, careful prompting, and a few-shot example set — answered adjuster questions with the right structure 94% of the time at first evaluation, climbing to 97% after three rounds of prompt iteration. The fine-tune that would have been $120K of additional spend would have lifted that to maybe 98% at the cost of a much more brittle deployment.
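A figure like "the right structure 94% of the time" comes from an evaluation that can be as simple as the sketch below. It assumes the structured output is JSON with a next action and a rationale; the field names and the sample outputs are hypothetical, not the insurer's actual schema or data.

```python
import json
from typing import List

# ASSUMPTION: hypothetical field names for the structured output.
REQUIRED_FIELDS = ("next_action", "rationale")


def has_right_structure(raw: str) -> bool:
    """True if the output is a JSON object with non-empty string values for each required field."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(
        isinstance(data.get(field), str) and bool(data[field].strip())
        for field in REQUIRED_FIELDS
    )


def structure_pass_rate(outputs: List[str]) -> float:
    """Share of model outputs that match the required structure."""
    if not outputs:
        raise ValueError("need at least one output to evaluate")
    return sum(has_right_structure(o) for o in outputs) / len(outputs)


# Made-up sample outputs for illustration.
sample_outputs = [
    '{"next_action": "Request plumber report", "rationale": "Guideline requires it"}',
    '{"next_action": "Escalate to investigations", "rationale": "Fraud indicators present"}',
    '{"next_action": "Approve"}',
    "Approve the claim.",
]

print(f"Structure pass rate: {structure_pass_rate(sample_outputs):.0%}")
```

Running the same check after each round of prompt changes is what turns "it seems better" into a number you can weigh against the price of a fine-tune.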

That’s the canonical pattern. The first-instinct answer was fine-tune. The right answer was thin-wrapper build. The work it took to figure that out was three days of structured discovery — a fraction of the cost of starting on the wrong path.

The bottom line

In 2026 AU mid-market:

Buy is right for the majority of use cases. It’s the default.
Build is right when the use case is differentiating, compliance-restricted, structurally inaccessible to SaaS, or at the wrong unit economics — and it works best as thin-wrapper-on-frontier-model.
Fine-tune is right occasionally — when output format, cost, or specific behaviour can’t be addressed by prompting and retrieval. It is rarely the first thing to try.

If you’re mid-decision: the AI Readiness Sprint is the productised engagement we run to walk through this framework against your specific use-case backlog. Three weeks, fixed price, with the explicit deliverable of a scored decision per workflow.

Adjacent reading.

RAG vs fine-tuning vs prompting: choosing the right architecture

Why most enterprise AI pilots fail (and the pattern that works)
