AI practice

Generative AI implementation — production systems, not demos.

RAG, agents, evaluations and observability designed for the realities of running LLMs in production — cost, latency, accuracy and drift, all measured.

Most generative-AI projects stall at the same place: the proof-of-concept that worked in a notebook never reaches production, or it reaches production and quietly drifts off the metric it was supposed to move. We design against that pattern. Our GenAI engagements ship working systems with the evaluations, observability and runbooks an operator can actually run. The architecture decisions — RAG vs fine-tuning vs prompting, which foundation model, what evaluations matter — get made based on your real constraints, not the demo room.

What we deliver

Offerings inside Generative AI.

GenAI pilot to production

A three-week sprint from kickoff to a working system in real users' hands. Evaluation harness, observability dashboard and runbooks for the operator, all designed so the system survives without us.

RAG implementation

Retrieval-augmented generation, properly. Chunking strategy, embedding model selection, vector store choice, hybrid search, re-ranking, eval harness, observability. Output: a system that retrieves what your team actually needs.
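To make the hybrid search step concrete, here is a minimal sketch of one common way to merge a keyword ranking and a vector ranking: reciprocal rank fusion (RRF). This is an illustration under assumptions, not a description of what any given engagement uses. The function name, the document IDs and the constant k = 60 (a conventional default) are all hypothetical choices.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    where rank starts at 1. Documents ranked well by several retrievers
    rise to the top. k = 60 is an assumed, conventional default.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword search and a vector search.
keyword_hits = ["a", "b", "c"]
vector_hits = ["b", "c", "d"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents found by both retrievers ("b" and "c" here) outrank those found by only one. A re-ranker, where used, would then reorder the top of this fused list.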

LLM evaluations + observability

For teams that have an LLM in production but no idea whether it's getting better or worse. Eval harness against your labelled data; cost/latency/accuracy dashboards; drift detection. Confidence in what's shipping.

Prompt engineering + fine-tuning advisory

When prompting is enough, when RAG is the better answer, and the rare case where fine-tuning earns its keep. A two-week advisory engagement that sets the architecture for the next twelve months.

When to engage us

We’re typically the right partner when…

Stack

Tech we work with day-to-day.

Engagement-specific stack choices are always driven by your constraints. The tools below are the ones we have current production experience with.

Anthropic Claude (incl. Computer Use), OpenAI GPT, Vercel AI SDK, LangGraph, Pinecone, Postgres + pgvector, OpenAI Evals, Anthropic prompt caching, AWS Bedrock + Azure OpenAI

FAQ

Common questions.

Which foundation model do you typically recommend?

It depends on the use case and the deployment constraints. We use Anthropic Claude as a default starting point because it tends to produce more reliable agentic behaviour and has strong AU-data-residency options via AWS Bedrock; OpenAI GPT and Google Gemini are common alternatives where the use case fits their strengths. We have no commission-based vendor relationships — every recommendation is based on your constraints.

Can you work with on-premise or private-cloud LLM deployments?

Yes. For governance-sensitive sectors (defence, parts of healthcare, certain APRA-regulated work), on-prem or VPC-deployed open-weights models are appropriate. We help you choose among Llama, Mistral and similar models, and we run the deployment work in partnership with your infrastructure team.

How do you measure whether a GenAI system is "working"?

Three layers: (1) automated evaluations against a labelled dataset, run on every change; (2) production observability — cost per request, latency at p95, quality drift indicators; (3) the operator metric the system was meant to move (time saved, cases auto-triaged, etc.). All three live on a dashboard you own from day one.
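As a rough illustration of layers (1) and (2), the sketch below computes accuracy against a labelled dataset, plus cost per request and p95 latency from request logs. It rests on stated assumptions: the `RequestLog` schema, the function names, exact-match scoring and the nearest-rank percentile method are all hypothetical simplifications, and real evaluations usually need task-specific scoring.

```python
from dataclasses import dataclass


@dataclass
class RequestLog:
    """One production request (hypothetical schema, for illustration only)."""
    latency_ms: float
    cost_usd: float


def accuracy(predictions: list[str], labels: list[str]) -> float:
    """Layer 1: share of predictions that exactly match the labelled answer."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def p95(values: list[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def summarise(logs: list[RequestLog]) -> dict[str, float]:
    """Layer 2: cost per request and p95 latency from request logs."""
    if not logs:
        raise ValueError("logs must be non-empty")
    return {
        "cost_per_request_usd": sum(log.cost_usd for log in logs) / len(logs),
        "latency_p95_ms": p95([log.latency_ms for log in logs]),
    }


# Hypothetical data: four labelled examples and twenty logged requests.
print(accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"]))
logs = [RequestLog(latency_ms=float(i), cost_usd=0.01) for i in range(1, 21)]
print(summarise(logs))
```

Running the accuracy check on every change and comparing it with the previous run is what turns a one-off score into a regression signal. Layer (3), the operator metric, comes from your own business systems and is not modelled here.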

What about hallucinations / accuracy?

Hallucinations are a system-design problem more than a model problem. RAG with proper retrieval, prompt-level constraints, citations + audit trails, and an evaluation harness that detects regressions — these are the patterns that work. We don't promise zero hallucinations; we promise the architecture and observability to detect and prevent the kinds that matter for your use case.

Next step

Talk to a senior partner about your Generative AI engagement.

Discovery calls are 30 minutes, no deck, no pitch. We’ll tell you honestly whether we’re the right team for your specific situation.