Big Freight Life

Systems

AI Is Not a Feature

Designing AI-First Systems for Exceptions and Small-Team Reality

April 2026

The Thesis

1. AI is not a feature: shipping AI changes the architecture and the decision workflow. If you don’t design the scaffolding, you inherit silent coupling and operational debt.

2. Design for exceptions, not the happy path: reliability comes from explicitly designing non-ideal paths such as uncertainty, missing info, adversarial inputs, and outages.

3. Small teams can’t afford bad AI decisions: limited runway means thin error budgets. The right stance is graceful degradation, hard caps, high observability, and narrow scope.

The Bolt-On Failure Mode

In production ML, systems accrue ‘hidden technical debt’ — entanglement, boundary erosion, undeclared consumers, and configuration debt. These failure modes arise when ML is treated like a swappable component rather than a system.

Dimension      | Bolt-on AI                    | AI-first System                      | Expected Failure
---------------|-------------------------------|--------------------------------------|------------------------------
Starting point | Tool-first ('add a chatbot')  | Problem/workflow-first               | Impressive demo, negative ROI
Truth source   | Model memory + prompts        | Grounding + provenance + constraints | Confident wrong answers
Exceptions     | Implicit, handled ad hoc      | Modeled states + recovery paths      | Brittle UX, trust collapse
Evaluation     | 'Looks good in QA'            | Golden sets + continuous eval gates  | Regressions after updates
Ops posture    | Minimal telemetry             | Observability + budgets + alerts     | Silent drift, cost spikes

Quick ML wins mask compounding maintenance costs due to glue code, pipeline complexity, and coupling loops.

Case Studies

Real incidents show the pattern clearly.

Air Canada
A chatbot gave misleading bereavement fare information

In Moffatt v. Air Canada (2024), British Columbia's Civil Resolution Tribunal found the airline liable for its chatbot's inaccurate information about bereavement fares. The tribunal rejected the argument that the chatbot was a separate entity responsible for its own statements, and held that customers shouldn't have to double-check one part of a site against another.


Diagnosis

The chatbot operated as a conversational veneer over policy without a constraint layer tying outputs to authoritative policy text, freshness validation, or confidence gating.

Transferable Pattern

Policy-constrained answering: retrieval over canonical sources, required citations for high-stakes claims, confidence gating that routes uncertain answers to escalation.
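The gating step in this pattern can be sketched in a few lines. This is a minimal illustration, not a production design: `PolicyPassage`, `GatedAnswer`, `gate_answer`, and the `MIN_RETRIEVAL_SCORE` threshold are all hypothetical names, and the relevance score is assumed to come from whatever retriever sits over the canonical policy text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyPassage:
    source_id: str  # identifier of the canonical policy document
    text: str
    score: float    # retrieval relevance in [0, 1], from an assumed retriever


@dataclass(frozen=True)
class GatedAnswer:
    action: str                 # "answer" or "escalate"
    text: str
    citations: tuple[str, ...]  # source_ids backing the answer


MIN_RETRIEVAL_SCORE = 0.75  # assumed threshold; tune it against a golden set
ESCALATION_TEXT = "I can't confirm that from our published policy. Connecting you with an agent."


def gate_answer(draft: str, passages: list[PolicyPassage], high_stakes: bool) -> GatedAnswer:
    """Release a drafted answer only when it can cite canonical policy text."""
    supporting = [p for p in passages if p.score >= MIN_RETRIEVAL_SCORE]
    if high_stakes and not supporting:
        # No authoritative source is strong enough: route to a human instead.
        return GatedAnswer("escalate", ESCALATION_TEXT, ())
    return GatedAnswer("answer", draft, tuple(p.source_id for p in supporting))
```

The point of the sketch is the shape of the decision: a high-stakes claim without a citable source never reaches the customer as an answer.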

McDonald's / IBM
Drive-thru AI ordering pilot ended after multi-year test

McDonald's ended its AI-powered drive-thru voice ordering test with IBM after running the pilot since 2021. Drive-thru ordering is an exception-dense environment: noise, accents, interruptions, menu complexity, tight latency.


Diagnosis

Demos overrepresent the happy path. Real conditions — noise, accents, interruptions — reveal the system's true reliability limits.

Transferable Pattern

Uncertainty-first dialogue: explicit confirmation for low-confidence intents, seamless human handoff as a designed state, metrics tied to order correctness not just automation rate.
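One way to make confirmation and handoff designed states rather than afterthoughts is a small routing function. The sketch below is illustrative only: the thresholds, the retry limit, and the `next_step` name are assumptions, and the confidence value is assumed to come from the speech or intent model.

```python
from enum import Enum


class NextStep(Enum):
    ACCEPT = "accept"    # add the item to the order
    CONFIRM = "confirm"  # read the item back and ask the customer to confirm
    HANDOFF = "handoff"  # transfer to a human crew member


ACCEPT_AT = 0.90           # assumed thresholds; calibrate on real traffic
CONFIRM_AT = 0.60
MAX_FAILED_CONFIRMATIONS = 2


def next_step(intent_confidence: float, failed_confirmations: int) -> NextStep:
    """Pick the next dialogue state from intent confidence and retry history."""
    if failed_confirmations >= MAX_FAILED_CONFIRMATIONS:
        return NextStep.HANDOFF  # stop looping; a person takes over
    if intent_confidence >= ACCEPT_AT:
        return NextStep.ACCEPT
    if intent_confidence >= CONFIRM_AT:
        return NextStep.CONFIRM
    return NextStep.HANDOFF
```

Because handoff is an explicit return value, it can be counted, traced, and tuned like any other state.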

Zillow Group
iBuying program wound down after forecasting failures

Zillow wound down Zillow Offers after describing forecasting limitations and risk dynamics in its Q3 2021 shareholder letter. When model outputs drive capital deployment, error becomes balance-sheet exposure.


Diagnosis

When AI outputs drive capital deployment, 'AI as a feature' is an invalid framing. The model is a core component of a financial risk system that needs circuit breakers, stress tests, and governance.

Transferable Pattern

Financial circuit breakers: conservative operating bands, kill-switch governance, post-market monitoring and continuous compliance evaluation.
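A circuit breaker of this kind can be sketched as a small guard in front of every capital commitment. Everything here is hypothetical: the `CapitalBreaker` class, its limits, and the idea of feeding it a rolling forecast error are assumptions for illustration, not a description of any company's actual controls.

```python
from dataclasses import dataclass


@dataclass
class CapitalBreaker:
    max_exposure: float        # hard cap on total capital committed
    max_forecast_error: float  # trip when rolling forecast error exceeds this
    exposure: float = 0.0
    tripped: bool = False

    def record_forecast_error(self, rolling_abs_pct_error: float) -> None:
        """Trip the breaker when the model drifts outside its operating band."""
        if rolling_abs_pct_error > self.max_forecast_error:
            self.tripped = True

    def approve(self, amount: float) -> bool:
        """Approve a commitment only if the breaker is closed and the cap holds."""
        if self.tripped or self.exposure + amount > self.max_exposure:
            return False
        self.exposure += amount
        return True

    def reset(self, approved_by: str) -> None:
        """Governance hook: only a named human can re-enable deployment."""
        if not approved_by:
            raise ValueError("reset requires a named human approver")
        self.tripped = False
```

In use, a tripped breaker blocks new commitments until someone accountable calls `reset`, which keeps the kill switch a governance decision rather than a code path.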

The Architecture

A central AI-first pattern is to separate ‘model call’ from ‘product decision.’ The orchestrator owns routing, retrieval, validation, policy gates, fallbacks, and observability.


AI-first architecture: the orchestrator separates model calls from product decisions
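The separation can be illustrated with a minimal orchestrator. This is a sketch under stated assumptions: `orchestrate`, `Decision`, and the three injected callables (`retrieve`, `call_model`, `validate`) are hypothetical stand-ins for whatever retrieval, model client, and policy checks a real system uses.

```python
from dataclasses import dataclass, field
from typing import Callable

FALLBACK_TEXT = "I can't answer that reliably right now. Routing you to a person."


@dataclass
class Decision:
    outcome: str                                    # "answer" or "fallback"
    text: str
    trace: list[str] = field(default_factory=list)  # decision pathway, for logging


def orchestrate(
    question: str,
    retrieve: Callable[[str], list[str]],
    call_model: Callable[[str, list[str]], str],
    validate: Callable[[str, list[str]], bool],
) -> Decision:
    """The model drafts; the orchestrator decides what the product does."""
    trace: list[str] = []
    docs = retrieve(question)
    trace.append(f"retrieved={len(docs)}")
    if not docs:
        trace.append("fallback=no_grounding")
        return Decision("fallback", FALLBACK_TEXT, trace)
    try:
        draft = call_model(question, docs)
    except Exception as exc:  # outage, timeout, provider error
        trace.append(f"fallback=model_error:{type(exc).__name__}")
        return Decision("fallback", FALLBACK_TEXT, trace)
    if not validate(draft, docs):
        trace.append("fallback=validation_failed")
        return Decision("fallback", FALLBACK_TEXT, trace)
    trace.append("outcome=answer")
    return Decision("answer", draft, trace)
```

The model call is one line in the middle. Routing, fallbacks, and the trace all belong to the orchestrator, so swapping models does not change what the product is allowed to do.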

Thin-waist contracts: Enforce structured outputs and validation rather than free-form text everywhere.

Tool isolation: Treat tool invocation as privileged execution behind explicit allowlists and permission checks.

Observability by default: Log and trace every decision pathway: routing, retrieval docs used, tool calls.
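The first two principles can be combined in one small check: parse the model's output against a strict contract, then refuse any tool that is not on an allowlist. The tool names, argument sets, and `parse_tool_request` function below are hypothetical, and the JSON shape is an assumed contract rather than any specific vendor's format.

```python
import json

# Hypothetical allowlist: tool name -> exact set of permitted argument names.
ALLOWED_TOOLS = {
    "lookup_order": {"order_id"},
    "get_policy": {"topic"},
}


class ContractError(ValueError):
    """Raised when model output violates the structured-output contract."""


def parse_tool_request(raw: str) -> dict:
    """Validate a model-proposed tool call before anything is executed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractError("output is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ContractError("output must be a JSON object")
    tool, args = data.get("tool"), data.get("args")
    if not isinstance(tool, str) or tool not in ALLOWED_TOOLS:
        raise ContractError(f"tool not on allowlist: {tool!r}")
    if not isinstance(args, dict) or set(args) != ALLOWED_TOOLS[tool]:
        raise ContractError(f"unexpected arguments for {tool}")
    return {"tool": tool, "args": args}
```

Anything that fails the contract raises before execution, so a prompt-injected request for an unlisted tool becomes a logged error, not an action.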

AI-First Successes

By contrast, AI-first successes share two properties: AI is embedded in a workflow with rapid human verification, and the organization treats evaluation as core infrastructure.

GitHub Copilot

A controlled experiment showed developers completed coding tasks 55.8% faster with Copilot. The key: AI is embedded in a high-feedback workflow (the IDE) where humans verify quickly. Autonomy is constrained — the system suggests, the human decides.

Stitch Fix

Their 'expert-in-the-loop' approach intentionally combines human judgment with algorithmic generation. Humans define quality criteria, review outputs, and feed failure examples back into evaluation suites. Exceptions and nuance are treated as primary reality.

Start with suggestion layers and progressively earn autonomy by passing eval gates and staying within error budgets.
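Earning autonomy can be expressed as a promotion rule that runs after each evaluation cycle. The level names, thresholds, and `next_autonomy_level` function here are assumptions chosen for illustration; a real system would define its own levels and budgets.

```python
from dataclasses import dataclass

# Hypothetical autonomy ladder, from least to most autonomous.
LEVELS = ("suggest", "act_with_approval", "act_autonomously")


@dataclass(frozen=True)
class EvalReport:
    golden_pass_rate: float   # fraction of golden-set cases passed
    error_budget_used: float  # fraction of the period's error budget consumed


def next_autonomy_level(
    current: str,
    report: EvalReport,
    min_pass_rate: float = 0.95,   # assumed eval gate
    max_budget_used: float = 0.5,  # assumed headroom required for promotion
) -> str:
    """Promote one level at a time; fall back to suggestions if the budget is spent."""
    index = LEVELS.index(current)
    if report.error_budget_used >= 1.0:
        return LEVELS[0]
    if report.golden_pass_rate >= min_pass_rate and report.error_budget_used <= max_budget_used:
        return LEVELS[min(index + 1, len(LEVELS) - 1)]
    return current
```

Promotion is gradual and demotion is immediate, which mirrors how an error budget is usually treated in reliability work.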

The Dashboard

An AI-first dashboard should look more like an SRE dashboard than a model benchmark.

Outcomes

  • Task completion rate
  • Escalation rate
  • Redo / re-ask rate

Reliability

  • p50 / p95 / p99 latency
  • Timeout rate
  • Tool-call failure rate

Quality

  • Grounded-answer rate
  • Citation coverage
  • Human override rate

Safety & Cost

  • Prompt-injection detections
  • Policy violations
  • Cost / request
  • Cost / outcome
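Several of these metrics fall out of a plain per-request log. The sketch below assumes a hypothetical `RequestLog` record and uses a nearest-rank percentile; the field names and the `summarize` function are illustrative, not a prescribed schema.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestLog:
    latency_ms: float
    completed: bool  # the user's task was finished
    escalated: bool  # the request was handed to a human
    cost_usd: float


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(logs: list[RequestLog]) -> dict[str, float]:
    """Roll request logs up into outcome, reliability, and cost metrics."""
    if not logs:
        raise ValueError("no requests to summarize")
    n = len(logs)
    completed = sum(r.completed for r in logs)
    total_cost = sum(r.cost_usd for r in logs)
    return {
        "task_completion_rate": completed / n,
        "escalation_rate": sum(r.escalated for r in logs) / n,
        "p95_latency_ms": percentile([r.latency_ms for r in logs], 95),
        "cost_per_request": total_cost / n,
        # Cost per outcome divides by completed tasks, not by requests.
        "cost_per_outcome": total_cost / completed if completed else math.inf,
    }
```

Cost per outcome is the number to watch: it rises when requests are cheap but rarely finish the task, which cost per request alone would hide.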


Observability stack: OpenTelemetry → storage → dashboards + alerting

Your system needs an architect.

If your AI investments aren't producing results, the system is the place to start.