AI incidents blur layers quickly. A bad answer can come from model drift, stale retrieval, prompt change, tool failure, permission mismatch, or provider degradation.
Triage starts by naming the failing layer and customer impact. Capture affected workflow, release version, provider, fallback state, evaluation delta, and whether humans can safely continue the work.
The control plane should route incidents to the right owner: model operations, retrieval operations, integration owner, policy owner, or customer operations. Without that routing, every incident becomes a slow general investigation.
Discussion
Reader comments
Approved comments appear here after review, keeping implementation notes useful without opening the surface to spam.
No approved comments yet.