Measuring AI ROI: a CFO-grade framework for Australian businesses

The hardest single conversation in any AI engagement we run is the one with the CFO. Not because CFOs are obstructive — they’re typically the clearest thinkers in the room — but because most AI business cases are built on the wrong unit economics. They optimise for headline savings that look impressive at the kickoff, then quietly collapse when the run cost shows up on the invoice in month four.

This piece is the framework we use to construct AI business cases that hold up to CFO scrutiny. It covers the four cost-and-value categories a defensible business case needs, the three measurement disciplines that keep the case honest after the system ships, and the most common errors we see in AI ROI calculations across the AU market.

Why most AI business cases fail CFO review

Three patterns we see consistently:

Pattern A: only the build cost is modelled. The proposal includes a clear engagement cost ($X to build the system) and a forecast value ($Y per year in savings or revenue). It doesn’t include the run cost — the foundation model API charges, the additional infrastructure, the ongoing evaluation work, the operator time. When the run cost arrives in production, the actual margin collapses and the project goes from “saved us $400K/year” to “saved us $80K/year after costs.”

Pattern B: savings are gross, not net. “This system will save 4 FTE worth of work per year — that’s $400K in salary costs.” The assumption: those 4 FTE will be eliminated or redeployed to higher-value work, and the saving will show up on the P&L. The reality, almost always: the 4 FTE remain in their roles; the time saved is absorbed by other work that was previously deferred; the P&L impact is zero. A CFO sees through this in 10 seconds.

Pattern C: the comparison case is unrealistic. “Without AI, this work would take 40 hours per week of senior analyst time. With AI, it takes 6 hours.” The unspoken comparison is “AI vs do nothing.” The real comparison is usually “AI vs better process design, or AI vs hiring a junior analyst, or AI vs accepting the current state.” If those alternatives are cheaper, the AI ROI calculation needs to show the differential, not the gross.

The fix for all three is to construct the business case across four explicit categories with honest assumptions in each.

The four categories of a defensible AI business case

1. Build cost

Straightforward. What you’ll pay for the engagement that designs and ships the system. Should be a fixed-price band, not an open-ended time-and-materials estimate. For our productised offers, this is the published price band.

Typical AU bands by engagement size:

Productised offer (sprint / pilot / governance): $15K–$80K
Mid-scope custom build: $80K–$300K
Large enterprise deployment: $250K–$2M+

The number should be a single line item with a 10–20% contingency for scope adjustments inside the agreed envelope.

2. Run cost (the one most often missed)

What it costs to operate the system per month / per year. Includes:

Foundation model API charges: prompt tokens + completion tokens, multiplied by per-1K-token pricing, multiplied by volume. For a system making 100,000 model calls per month at moderate complexity, expect $1K–$8K per month on the model bill alone. For 1M calls per month, expect $10K–$80K.
Vector database / retrieval infrastructure (if RAG): managed Pinecone, hosted Weaviate, or self-hosted pgvector. Typically $200–$2,000 per month depending on scale.
Compute for orchestration: serverless function executions, container hosting, or VM time for the application that wraps the AI. Typically $100–$2,000 per month.
Observability + evaluation infrastructure: production tracing, automated evaluation runs, monitoring dashboards. Typically $200–$1,500 per month.
Periodic re-training / re-evaluation: if the system uses fine-tuning, the cost of periodic re-fine-tuning. Typically $500–$5,000 per training run, frequency depends on drift.

For a moderately sized production AI system in 2026, expect total run cost in the $3K–$20K per month range; larger systems run roughly $10K–$80K per month. Run cost scales roughly linearly with usage: doubling the user base typically doubles the run cost.
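The run-cost arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a pricing tool: the token counts per call, the per-1K-token prices, and the infrastructure line items are all hypothetical placeholders that you would replace with your provider's current rate card and your own usage data.

```python
from dataclasses import dataclass


@dataclass
class ModelUsage:
    calls_per_month: int
    prompt_tokens_per_call: int
    completion_tokens_per_call: int
    prompt_price_per_1k: float      # hypothetical price, dollars per 1K prompt tokens
    completion_price_per_1k: float  # hypothetical price, dollars per 1K completion tokens


def monthly_model_cost(u: ModelUsage) -> float:
    """Model bill = (tokens / 1000) x per-1K price x call volume."""
    per_call = (
        u.prompt_tokens_per_call / 1000 * u.prompt_price_per_1k
        + u.completion_tokens_per_call / 1000 * u.completion_price_per_1k
    )
    return per_call * u.calls_per_month


def monthly_run_cost(u: ModelUsage, other_lines: dict) -> float:
    """Model bill plus the other monthly run-cost lines (retrieval, compute, observability)."""
    return monthly_model_cost(u) + sum(other_lines.values())


# Illustrative inputs only: 100,000 calls per month at assumed token counts and prices.
usage = ModelUsage(100_000, 2_000, 500, 0.01, 0.03)
infra = {"vector_db": 500.0, "compute": 400.0, "observability": 600.0}

print(round(monthly_model_cost(usage)))        # model bill for the month
print(round(monthly_run_cost(usage, infra)))   # model bill plus infrastructure lines
```

With these assumed inputs the model bill lands inside the $1K–$8K band quoted for 100,000 calls per month. The useful part is not the number but the structure: volume is an explicit input, so the forecast can be re-run when usage doubles.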

3. Operational + ongoing-people cost

Build-cost-only business cases miss this category entirely.

Engineering maintenance: ongoing work to fix issues, respond to provider-side changes, update integrations. Typically 1–4 days per month of senior engineering time, depending on system complexity. At AU senior engineering rates ($1,200–$2,000/day), that’s $1.2K–$8K per month.
AI champion / coordinator time: the named human owner of the system. For non-trivial production AI, this is typically 4–8 hours per week of an internal employee’s time. At a loaded cost of $80K–$160K/year for a senior employee, 4 hours per week (about 10% of their time) comes to $8K–$16K/year; 8 hours per week doubles that.
Evaluation review: someone has to look at the automated evaluation results and respond when they drift. Typically 2–8 hours per month.
Incident response capacity: low frequency but high impact. Should be modelled at an expected value (e.g. one significant incident per year requiring 20 hours of response = ~$5K).

Total ongoing-people cost is typically $20K–$80K per year for a moderately complex production AI system. For larger systems with dedicated AI operations functions, this scales meaningfully.
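As a rough check on the people-cost line items, the sketch below adds them up for one low-end scenario. The 40-hour week, the 8-hour day, and pricing evaluation review and incident response at the engineering day rate are our assumptions for illustration; the function and parameter names are hypothetical.

```python
def annual_people_cost(
    eng_days_per_month: float,
    eng_day_rate: float,
    champion_hours_per_week: float,
    champion_loaded_salary: float,
    eval_hours_per_month: float,
    incident_hours_per_year: float,
    hours_per_week: float = 40.0,  # assumption: full-time week
    hours_per_day: float = 8.0,    # assumption: billable day
) -> dict:
    """Return each ongoing-people line item per year, plus the total."""
    hourly = eng_day_rate / hours_per_day
    lines = {
        "engineering_maintenance": eng_days_per_month * eng_day_rate * 12,
        "champion_time": champion_hours_per_week / hours_per_week * champion_loaded_salary,
        # Assumption: review and incident hours are priced at the engineering hourly rate.
        "evaluation_review": eval_hours_per_month * hourly * 12,
        "incident_response": incident_hours_per_year * hourly,
    }
    lines["total"] = sum(lines.values())
    return lines


# Low end of each range quoted above, with one 20-hour incident per year.
low_end = annual_people_cost(
    eng_days_per_month=1,
    eng_day_rate=1_200,
    champion_hours_per_week=4,
    champion_loaded_salary=80_000,
    eval_hours_per_month=2,
    incident_hours_per_year=20,
)
for name, value in low_end.items():
    print(f"{name:25s} ${value:,.0f}")
```

Under these assumptions the low-end scenario totals a little under $30K per year, inside the typical range. Running the same function with your own upper-end inputs is a quick way to see which line item dominates.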

4. Value (with honest assumptions)

The trickiest category. The four sub-categories that actually move CFO conversations:

a. Time saved → real P&L impact (rare but most credible)

The savings are real if and only if the time-saved capacity is actually removed or redeployed. Examples where this works:

AI triage allows a customer service team to handle 2x volume without hiring additional staff. Real saving: avoided hire of $80K/year salary + on-costs.
AI document review eliminates the need to backfill a departing role. Real saving: avoided hire.
AI automation absorbs a third-party contract that was being renewed annually. Real saving: contract cancellation, e.g. $200K/year.

The test: can a CFO point to a specific line item that gets smaller because of the AI system? If yes, the saving is real. If the answer is “the team will get more done in the same time” without a measurable downstream effect, the saving is theoretical.

b. Revenue impact

Often more credible than cost savings because revenue is observable. Examples:

AI-driven personalisation lifts conversion rate by 8%. Measurable: pre/post conversion rate × traffic × average order value.
AI-augmented sales team closes 12% more deals at the same effort level. Measurable: closed deals before/after, controlled for other variables where possible.
AI-enabled new product line creates net-new revenue stream. Measurable: revenue from products that didn’t exist before the AI capability.

Revenue cases need controlled measurement (ideally A/B testing, at minimum well-instrumented before/after analysis) to be credible past month three.
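The conversion-rate example reduces to one line of arithmetic. The sketch below assumes the 8% figure is a relative lift on the baseline rate (not 8 percentage points) and uses hypothetical baseline, traffic, and order-value numbers; it does not replace a controlled test, it only shows what the measured rates feed into.

```python
def incremental_revenue(
    baseline_rate: float,
    relative_lift: float,
    sessions: int,
    avg_order_value: float,
) -> float:
    """(post rate - pre rate) x traffic x average order value."""
    post_rate = baseline_rate * (1 + relative_lift)
    return (post_rate - baseline_rate) * sessions * avg_order_value


# Hypothetical inputs: 2% baseline conversion, 8% relative lift,
# 500,000 sessions per year, $120 average order value.
uplift = incremental_revenue(0.02, 0.08, 500_000, 120.0)
print(f"${uplift:,.0f} incremental revenue per year")
```

Note how sensitive the result is to the interpretation of the lift: the same inputs read as 8 percentage points would give a figure fifty times larger, which is exactly the kind of ambiguity a CFO will probe.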

c. Risk reduction (hard to quantify, sometimes most important)

Hard to put in dollars but sometimes the dominant value driver:

AI-enabled fraud detection that catches $X/year of fraud previously missed. Quantifiable if the historical fraud baseline is known.
AI compliance monitoring that reduces regulatory exposure. Quantifiable as the avoided cost of regulatory action (probability × magnitude).
AI safety monitoring that prevents incidents. Quantifiable via expected-value analysis on historical incident rates and severity.

Risk-reduction value is real but should be modelled conservatively in the business case: multiply the headline avoided-loss number by a factor of 0.3–0.7 rather than counting it in full.
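A minimal sketch of the expected-value approach, assuming the conservatism adjustment is applied as a multiplier on the probability-weighted loss. The probability, loss magnitude, and function name are hypothetical illustrations, not benchmarks.

```python
def risk_reduction_value(
    annual_probability: float,
    loss_magnitude: float,
    conservatism_factor: float = 0.5,
) -> float:
    """Expected avoided loss (probability x magnitude), scaled by a 0.3-0.7 factor."""
    if not 0.0 <= annual_probability <= 1.0:
        raise ValueError("annual_probability must be between 0 and 1")
    if not 0.3 <= conservatism_factor <= 0.7:
        raise ValueError("conservatism_factor is expected to sit in the 0.3-0.7 range")
    return annual_probability * loss_magnitude * conservatism_factor


# Hypothetical: a 10% annual chance of a $2M regulatory action that the system would avoid.
headline = 0.10 * 2_000_000
modelled = risk_reduction_value(0.10, 2_000_000, conservatism_factor=0.5)
print(f"headline ${headline:,.0f}, modelled ${modelled:,.0f}")
```

Showing both the headline and the modelled figure side by side makes the size of the haircut explicit to the reader of the business case.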

d. Capability that wasn’t previously available

The hardest to quantify but increasingly important in 2026: AI enables work that genuinely couldn’t be done before at any reasonable cost. Examples:

Analysing 10 years of customer feedback at scale to inform product strategy.
Drafting 100 personalised customer responses per day that would require a 20-person team to produce manually.
Continuous compliance monitoring across 1,000+ documents that no human team could practically review.

For these, the “without AI” comparison case is “we don’t do this at all”: the value is the strategic option created, not a cost avoided. The assessment is qualitative, so present it as strategic capability rather than as a P&L line.

The CFO-grade business case template

Putting the four categories together, the business case template that survives CFO scrutiny:

12-MONTH BUSINESS CASE — [System name]

COSTS
  Build cost                                  $X    (one-time, year 1)
  Run cost (12 months)                        $Y    (foundation model + infra + obs)
  Operational + people (12 months)            $Z    (eng maintenance + champion + ops)
  Total year-1 cost                           $X+$Y+$Z

VALUE (year 1)
  Direct cost savings (with line items)       $A
  Revenue impact (with measurement plan)      $B
  Risk reduction (probability-weighted)       $C × 0.5
  Capability premium (qualitative)            [described, not totalled]
  Total quantified year-1 value               $A+$B+$C/2

PAYBACK
  Year-1 net                                  ($A+$B+$C/2) - ($X+$Y+$Z)
  Months to payback                           $X / [(($A+$B+$C/2) - ($Y+$Z)) / 12]
  Year-2 net (no build cost)                  ($A+$B+$C/2) - ($Y+$Z)
  3-year cumulative                           Year 1 + Year 2 + Year 3 net

The honest version usually shows: year-1 net is small or negative (build cost dominates), year-2 net is meaningfully positive, three-year cumulative is strongly positive. That’s how good AI investments actually look — they pay back in year two, not year one.

Business cases that show large year-1 positive returns are typically the ones where the numbers haven’t been pressure-tested. CFOs know this. They’re more comfortable with a realistic year-2 payback than an unrealistic year-1 positive.
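The template arithmetic can be sketched in code so the same inputs drive every line. This is an illustration under stated assumptions: value and operating costs are held flat across years and start from month one, payback is taken as build cost divided by monthly value net of run and people costs, and every dollar figure in the example is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BusinessCase:
    build: float          # X, one-time, year 1
    run: float            # Y, per year
    people: float         # Z, per year
    savings: float        # A, per year
    revenue: float        # B, per year
    risk_headline: float  # C, per year, before weighting
    risk_weight: float = 0.5

    @property
    def annual_value(self) -> float:
        return self.savings + self.revenue + self.risk_headline * self.risk_weight

    @property
    def annual_operating_cost(self) -> float:
        return self.run + self.people

    def year_net(self, year: int) -> float:
        """Net for a given year; the build cost is charged to year 1 only."""
        build = self.build if year == 1 else 0.0
        return self.annual_value - self.annual_operating_cost - build

    def cumulative(self, years: int = 3) -> float:
        return sum(self.year_net(y) for y in range(1, years + 1))

    def months_to_payback(self) -> Optional[float]:
        """Build cost / monthly net of operating costs; None if it never pays back."""
        monthly_net = (self.annual_value - self.annual_operating_cost) / 12
        if monthly_net <= 0:
            return None
        return self.build / monthly_net


# Hypothetical mid-scope build.
case = BusinessCase(build=150_000, run=96_000, people=40_000,
                    savings=200_000, revenue=50_000, risk_headline=60_000)
print(case.year_net(1))           # -6000.0
print(case.year_net(2))           # 144000.0
print(case.cumulative(3))         # 282000.0
print(case.months_to_payback())   # 12.5
```

The example reproduces the shape described above: a slightly negative year 1, payback early in year 2, and a strongly positive three-year cumulative.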

The three measurement disciplines that keep the case honest

A business case approved at month zero is half the work. The other half is maintaining the discipline to know whether the projected value is actually showing up. Three practices that distinguish AI investments that hit their projections from ones that quietly miss:

1. The pre-deployment baseline

Before the AI system ships, the metrics it’s expected to move must have an explicit baseline. “Reduce average claim handling time” requires knowing the current average. “Improve conversion rate” requires the current conversion rate. Sounds obvious; routinely missed.

The reason: in the rush of pre-deployment, baselining feels like overhead. Six months later when someone asks “did this work?” the absence of the baseline makes the question unanswerable.

Practical: dedicate one workshop in the Design phase to defining the metrics, the baseline values, and the measurement methodology. Make the baseline a Design-phase deliverable.

2. The actuals-vs-forecast review cycle

The business case should be revisited at month 3, month 6, and month 12 post-deployment. Each review compares actual cost and actual value to the business case forecast, with explanations for variances.

This sounds like governance theatre but isn’t. The review forces the project sponsor to confront whether the system is actually delivering what was promised. If the answer is no, the response options are visible: tune the system, change the scope, accept lower-than-projected value, or shut the system down.

Practical: schedule the three reviews in the executive sponsor’s calendar at deployment. Make the actuals-vs-forecast a recurring agenda item.

3. The kill criterion

Every AI investment should have an explicit kill criterion at decision time. “If this system doesn’t hit X by month Y, we shut it down.” The criterion forces honesty about what success requires; the existence of the criterion makes it possible to stop investments that are quietly underperforming.

The most common kill criteria we see (and recommend):

“If actual run cost exceeds 2x forecast at month 6, we redesign or shut down.”
“If the system isn’t in production use by month X, we shut down.”
“If the projected month-12 value isn’t tracking to be at least 50% of forecast at month 6, we redesign or shut down.”

Practical: write the kill criterion into the engagement letter or the executive approval document. Refer to it explicitly at each actuals-vs-forecast review.
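Kill criteria are easier to enforce when they are written as explicit checks rather than prose. The sketch below encodes the three example criteria for a month-6 review. The field names and default thresholds are hypothetical, and the value check compares value to date against the forecast to date, which is one reasonable reading of “tracking to at least 50% of forecast”.

```python
from dataclasses import dataclass


@dataclass
class Month6Review:
    forecast_run_cost: float  # forecast run cost to date
    actual_run_cost: float    # actual run cost to date
    forecast_value: float     # forecast value to date
    actual_value: float       # measured value to date
    in_production: bool


def kill_triggers(
    review: Month6Review,
    cost_multiple: float = 2.0,
    min_value_ratio: float = 0.5,
) -> list:
    """Return the kill criteria that have been tripped; an empty list means continue."""
    triggers = []
    if not review.in_production:
        triggers.append("not in production use")
    if review.actual_run_cost > cost_multiple * review.forecast_run_cost:
        triggers.append(f"run cost above {cost_multiple:g}x forecast")
    if review.actual_value < min_value_ratio * review.forecast_value:
        triggers.append(f"value below {min_value_ratio:.0%} of forecast")
    return triggers


# Hypothetical month-6 actuals: run cost has blown out, value is tracking acceptably.
review = Month6Review(
    forecast_run_cost=12_000, actual_run_cost=30_000,
    forecast_value=70_000, actual_value=40_000,
    in_production=True,
)
print(kill_triggers(review))  # ['run cost above 2x forecast']
```

A tripped criterion does not make the decision for you; it forces the redesign-or-shut-down conversation onto the agenda of the review.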

What we provide in engagements

Every productised offer we run produces the four-category cost model as part of the Design phase output. The Generative AI Pilot, in particular, ships with both the build cost (the engagement cost) and the 12-month run cost forecast (with confidence bands).

For organisations that have AI investments already in flight and want a structured actuals-vs-forecast review, the AI Readiness Sprint includes review of existing AI investments as part of the prioritised backlog work. Sometimes the most useful output of that sprint is the recommendation to redesign or kill an existing investment that’s quietly underperforming.

For interactive estimation against your specific situation, our AI ROI Calculator walks through the four-category model and produces a structured business-case skeleton you can present to your finance team. Free, ~5 minutes.

For everything else, the discovery call is 30 minutes, no agenda. If you have a specific AI investment under consideration and want a senior practitioner’s view of the business case before you commit, that’s the fastest path.