AI practice
RAG, agents, evaluations and observability designed for the realities of running LLMs in production — cost, latency, accuracy and drift, all measured.
Most generative-AI projects stall at the same place: the proof-of-concept that worked in a notebook never reaches production, or it reaches production and quietly drifts off the metric it was supposed to move. We design against that pattern. Our GenAI engagements ship working systems with the evaluations, observability and runbooks an operator can actually run. The architecture decisions — RAG vs fine-tuning vs prompting, which foundation model, what evaluations matter — get made based on your real constraints, not the demo room.
What we deliver
A three-week sprint from kickoff to a working system in real users' hands. It ships with an evaluation harness, an observability dashboard and operator runbooks, designed so the system survives without us.
Retrieval-augmented generation, properly. Chunking strategy, embedding model selection, vector store choice, hybrid search, re-ranking, eval harness, observability. Output: a system that retrieves what your team actually needs.
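As an illustration of the hybrid search step, here is a minimal sketch of reciprocal rank fusion, one common way to merge a keyword ranking with a vector ranking before re-ranking. It is not a description of any specific engagement: the function name, the document ids and the constant `k=60` (a conventional default) are assumptions chosen for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked well by more than one retriever rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword index and a vector store.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents found by both retrievers (`doc_a`, `doc_c`) outrank those found by only one, which is the behaviour hybrid search is meant to provide.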
For teams that have an LLM in production but no idea whether it's getting better or worse. Eval harness against your labelled data; cost/latency/accuracy dashboards; drift detection. Confidence in what's shipping.
When prompting is enough, when RAG is the better answer, and the rare case where fine-tuning earns its keep. A two-week advisory engagement that sets the architecture for the next twelve months.
When to engage us
Stack
Engagement-specific stack choices are always driven by your constraints. What follows is the tooling we have current production experience with.
FAQ
It depends on the use case and the deployment constraints. We use Anthropic Claude as a default starting point because it tends to produce more reliable agentic behaviour and has strong Australian data-residency options via AWS Bedrock; OpenAI GPT and Google Gemini are common alternatives where the use case fits their strengths. We have no commission-based vendor relationships, so every recommendation is based on your constraints.
Yes. For governance-sensitive sectors (defence, parts of healthcare, certain APRA-regulated work), on-prem or VPC-deployed open-weights models are appropriate. We help you choose among Llama, Mistral and similar models, and we run the deployment work in partnership with your infrastructure team.
Three layers: (1) automated evaluations against a labelled dataset, run on every change; (2) production observability — cost per request, latency at p95, quality drift indicators; (3) the operator metric the system was meant to move (time saved, cases auto-triaged, etc.). All three live on a dashboard you own from day one.
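The first two layers can be sketched in a few lines. This is a simplified illustration, not a production harness: it assumes exact-match scoring against labels, a nearest-rank p95, and a hypothetical `RequestLog` record with `latency_ms` and `cost_usd` fields. Real evaluations usually need task-specific scoring.

```python
from dataclasses import dataclass


@dataclass
class RequestLog:
    """One production request (hypothetical record shape)."""
    latency_ms: float
    cost_usd: float


def accuracy(predictions, labels):
    """Layer 1: fraction of predictions that exactly match the labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must be the same length")
    if not labels:
        raise ValueError("labelled dataset is empty")
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)


def p95(values):
    """Nearest-rank 95th percentile: smallest value with >= 95% at or below it."""
    if not values:
        raise ValueError("no values to summarise")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def summarise(logs):
    """Layer 2: latency at p95 and mean cost per request."""
    return {
        "p95_latency_ms": p95([r.latency_ms for r in logs]),
        "mean_cost_usd": sum(r.cost_usd for r in logs) / len(logs),
    }


# Hypothetical data: 4 labelled examples and 20 logged requests.
score = accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"])
logs = [RequestLog(latency_ms=100.0 * i, cost_usd=0.002) for i in range(1, 21)]
summary = summarise(logs)
print(score, summary)
```

Running `accuracy` on every change and `summarise` over a rolling window of production logs gives the numbers the dashboard tracks; the third layer, the operator metric, comes from your own business systems.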
Hallucinations are a system-design problem more than a model problem. The patterns that work are RAG with proper retrieval, prompt-level constraints, citations and audit trails, and an evaluation harness that detects regressions. We don't promise zero hallucinations; we promise the architecture and observability to detect and prevent the kinds that matter for your use case.
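One small example of a citation check, as a sketch only: it assumes the model is prompted to cite retrieved passages with numeric markers such as `[1]`, which is a hypothetical convention for this illustration. It flags citations that point at sources that were never retrieved, and sentences that carry no citation at all.

```python
import re

# Assumed citation format: a numeric source id in square brackets, e.g. [2].
CITATION = re.compile(r"\[(\d+)\]")


def check_citations(answer, retrieved_ids):
    """Return cited ids that were not retrieved, and sentences with no citation."""
    cited = {int(m) for m in CITATION.findall(answer)}
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return {
        "unknown_sources": sorted(cited - set(retrieved_ids)),
        "uncited_sentences": [s for s in sentences if not CITATION.search(s)],
    }


# Hypothetical answer; only sources 1 to 3 were actually retrieved.
answer = (
    "Leave accrues monthly [1]. "
    "Carry-over is capped at five days [4]. "
    "Ask HR for exceptions."
)
report = check_citations(answer, retrieved_ids={1, 2, 3})
print(report)
```

A check like this does not prove an answer is correct, but it catches a class of failures cheaply and gives the audit trail something concrete to record.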
Next step
Discovery calls are 30 minutes, no deck, no pitch. We’ll tell you honestly whether we’re the right team for your specific situation.