Resilience

Designing for Exceptions

Not the Happy Path

April 2026

12 min read

The Reality Split

Happy path: 20% of reality

Exceptions: 80% of reality

The happy path describes what was planned. The exception paths reveal what the system is.

This is a familiar distributed-systems stance: design for partial failure, retries, timeouts, missing dependencies, and ambiguous state. In AI systems, exceptions dominate because the world is adversarial, noisy, and underspecified.

Why Exceptions Dominate in AI

In AI systems, exceptions aren’t edge cases. They’re the primary operating condition.

Ambiguous inputs

Users omit details, use slang, or provide contradictory information. What looks like a clear question to a human is often deeply underspecified.

Model uncertainty

Model outputs are uncertain, and models occasionally fail in ways that are hard to predict, such as collapsing into repetitive or degenerate responses. Fluent output can mask a wrong answer.

Retrieval failure

Retrieval can return no relevant documents, stale documents, or badly chunked passages. RAG doesn’t guarantee the right context reaches the model.
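The retrieval failure modes above can be made explicit before the model is ever called. The following is a minimal sketch, not a reference implementation: `RetrievedDoc`, the 0.5 score cutoff, and the 180-day freshness window are all illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class RetrievedDoc:
    # Hypothetical shape of a retrieval hit; real retrievers differ.
    doc_id: str
    score: float          # assumed similarity score in [0, 1], higher is better
    updated_at: datetime  # last time the source document changed

def usable_context(docs, now, min_score=0.5, max_age=timedelta(days=180)):
    """Keep only documents that are relevant enough and fresh enough."""
    return [
        d for d in docs
        if d.score >= min_score and now - d.updated_at <= max_age
    ]

def retrieval_status(docs, now):
    """Classify the retrieval result so the caller must branch on it."""
    if not docs:
        return "no_results"
    if not usable_context(docs, now):
        return "no_usable_context"  # hits exist, but all are stale or weak
    return "ok"
```

A caller would generate an answer only on `"ok"` and treat the other two states as exceptions to route elsewhere, for example to a clarifying question or a human.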

System failures

Upstream and downstream services fail: timeouts, rate limits, outages. The AI layer inherits every fragility beneath it.
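Timeouts and rate limits can be treated as expected outcomes rather than surprises. Here is a minimal sketch of bounded retries with exponential backoff; `TransientError` is a hypothetical marker exception, and the attempt count and delays are assumptions to tune per dependency.

```python
import time

class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeout, rate limit)."""

def call_with_retries(fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying transient failures with exponential backoff.

    Returns (result, None) on success, or (None, last_error) once the
    attempts are exhausted, so the failure branch is visible to the caller.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fn(), None
        except TransientError as err:
            last_error = err
            # Back off before the next try, but not after the final one.
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    return None, last_error
```

Returning the error instead of raising it is a design choice in this sketch: it forces the calling code to decide what the user sees when the dependency stays down.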

Case Studies

Three different domains, same lesson.

McDonald's / IBM

Drive-thru ordering is an exception-dense environment

Noise, accents, interruptions, menu complexity, and tight latency expectations all collide at the drive-thru. Every order is a potential exception. McDonald's ended the pilot, not because the AI didn't work in demos, but because demos aren't drive-thrus.

Diagnosis

The happy path (clear voice, simple order, quiet environment) is the minority case. Real conditions are dominated by exceptions the system wasn't designed for.

Transferable Pattern

Uncertainty-first dialogue: explicit confirmation steps for low-confidence intents, seamless handoff to humans as a designed state, metrics tied to order correctness.

Air Canada

A policy contradiction the chatbot couldn't handle

When a customer asked about bereavement fares, the chatbot gave information that contradicted the airline's actual policy. The exception — a policy edge case — triggered legal, trust, and cost exposure.

Diagnosis

The system had no way to recognize that it was in uncertain territory. Instead of escalating or expressing uncertainty, it generated a confident answer.

Transferable Pattern

Confidence gating: route uncertain answers to escalation rather than improvisation. Treat 'I can't verify this' as a first-class output state.

Stitch Fix

Exceptions and nuance as a feature, not a bug

Their 'expert-in-the-loop' approach treats taste, tone, and context as primary realities — not edge cases to be smoothed over. Human judgment and algorithmic generation are intentionally combined.

Diagnosis

Instead of trying to eliminate exceptions, they designed the system to embrace them. Domain experts define quality criteria and review outputs.

Transferable Pattern

Expert-in-the-loop: humans define quality, review outputs, and feed failure examples back into evaluation suites. Iterate on the definition of 'good' rather than trying to automate it away.
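Feeding failure examples back into an evaluation suite can be as simple as turning each expert rejection into a regression case. The sketch below assumes a deliberately crude quality check (a flagged phrase that must not reappear); `EvalCase`, `EvalSuite`, and the `generate` callable are hypothetical names, not part of any described system.

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    prompt: str
    must_not_contain: str  # phrase an expert flagged as unacceptable

@dataclass
class EvalSuite:
    cases: list = field(default_factory=list)

    def add_failure(self, prompt, bad_phrase):
        """Record an expert-rejected output as a permanent regression case."""
        self.cases.append(EvalCase(prompt, bad_phrase))

    def run(self, generate):
        """Return the cases that generate(prompt) still fails."""
        return [
            c for c in self.cases
            if c.must_not_contain.lower() in generate(c.prompt).lower()
        ]
```

In practice the expert's criteria would be richer than substring checks, but the loop is the same: review, record the failure, rerun the suite on every change.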

Exception-First Design Patterns

Four patterns that shift the design stance from “handle exceptions when they happen” to “design for exceptions first.”

Fallback design

Always maintain a deterministic baseline for core workflows. If the AI path fails, the user must still be able to complete the job. This is the non-negotiable minimum.
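A minimal sketch of that fallback, assuming the AI path and the deterministic baseline are both plain callables (`ai_path` and `deterministic_path` are hypothetical names):

```python
def answer_with_fallback(query, ai_path, deterministic_path):
    """Try the AI path; on an error or empty result, use the baseline."""
    try:
        result = ai_path(query)
        if result:  # None or empty output counts as an AI-path failure
            return {"source": "ai", "result": result}
    except Exception:
        # A real system would log the error here before falling back.
        pass
    return {"source": "baseline", "result": deterministic_path(query)}
```

Tagging the result with its `source` lets the interface tell the user which path produced the answer, and lets metrics show how often the baseline is carrying the workflow.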

Uncertainty-first dialogue

Explicit confirmation steps for low-confidence intents. The system communicates what it’s unsure about rather than guessing. ‘I’m not confident about this’ is a better output than a wrong answer.
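To communicate what the system is unsure about, the uncertainty has to be tracked per piece of the request rather than per conversation. The sketch below assumes the intent parser returns each slot as a `(value, confidence)` pair; that shape and the 0.8 threshold are illustrative assumptions.

```python
def unsure_slots(slots, threshold=0.8):
    """Return names of slots whose confidence is below the threshold."""
    return [name for name, (_value, conf) in slots.items() if conf < threshold]

def confirmation_prompt(slots, threshold=0.8):
    """Build a confirmation question for the low-confidence slots.

    Returns None when every slot is confident enough to act on.
    """
    unsure = unsure_slots(slots, threshold)
    if not unsure:
        return None
    parts = [f"{name} = {slots[name][0]!r}" for name in unsure]
    return "I'm not confident about: " + ", ".join(parts) + ". Is that right?"
```

With an order like `{"item": ("cheeseburger", 0.95), "quantity": (2, 0.55)}`, only the quantity is read back for confirmation, which keeps the extra dialogue turn short.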

Human-in-the-loop

Design human oversight as part of the system, not as an escalation hack. Reviewers need interface tools, competence, authority, and the ability to intervene or stop the system.

Confidence gating

Route outputs through confidence thresholds. High confidence → deliver. Medium → deliver with caveats. Low → escalate. Never deliver low-confidence outputs as if they’re certain.
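The three-way routing can be sketched as a small pure function. The thresholds (0.85 and 0.5) are placeholder assumptions, and the sketch presumes a calibrated confidence score in [0, 1] is available, which is itself a nontrivial requirement.

```python
from enum import Enum

class Route(Enum):
    DELIVER = "deliver"
    DELIVER_WITH_CAVEAT = "deliver_with_caveat"
    ESCALATE = "escalate"

def gate(confidence, high=0.85, low=0.5):
    """Map a confidence score in [0, 1] to a delivery route."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= high:
        return Route.DELIVER
    if confidence >= low:
        return Route.DELIVER_WITH_CAVEAT
    return Route.ESCALATE  # low confidence is never delivered as certain
```

Keeping the gate separate from generation makes the thresholds easy to audit and adjust as evaluation data accumulates.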

UX for Exceptions

The most important AI UX isn’t the “magic answer.” It’s the recovery UI.

Communicate uncertainty and boundaries

State what the system can and can’t do. Set expectations before failure, not after. Users who understand limits trust the system more than users who discover limits through errors.

Show provenance for high-stakes claims

Show which documents were used and when they were last updated. Prevent overtrust by making the basis for the system’s answer visible.
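A provenance footer can be attached to an answer mechanically. This is a minimal sketch assuming each source carries a title and a last-updated date; `Source` and `with_provenance` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Source:
    title: str
    last_updated: date

def with_provenance(answer, sources):
    """Append the documents used, and their last-updated dates, to an answer."""
    if not sources:
        # Say so explicitly rather than presenting an unsourced claim as grounded.
        return answer + "\n\nSources: none found. This answer could not be verified."
    lines = [
        f"- {s.title} (last updated {s.last_updated.isoformat()})"
        for s in sources
    ]
    return answer + "\n\nSources:\n" + "\n".join(lines)
```

The empty-sources branch matters as much as the populated one: it turns "no supporting document" into something the user can see.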

Provide controls

Undo, edit, confirm, and escalate. The user is never trapped in an AI-driven path. Every automated decision has a manual override.

If you can only design one thing well, design the recovery path. That's where trust is won or lost.

Exceptions are where systems prove themselves.

If your AI system only works on the happy path, it doesn't work yet.
