Paweł Kubisiak·2026-06-01·5 min read

# Observability for AI: How to Monitor Non-Deterministic Systems

Classic application monitoring relies on a simple assumption: the same input should produce the same output or a predictable error. AI systems, especially those based on language models and tool-using agents, break this assumption. Responses can vary between calls, behavior depends on context, and quality degradation is often subtle and distributed over time.

In practice, this means metrics like CPU, memory, uptime, or even p95 latency remain necessary but insufficient. You can have a green infrastructure dashboard while the number of wrong business decisions still increases.

The central thesis of these Operator Notes: observability for AI must combine technical telemetry with decision-quality signals. Only that combination can detect issues that never present as a system failure but still erode value silently.

## Why non-determinism changes monitoring rules

Non-determinism is not inherently a bug. It is a characteristic of generative systems. The problem begins when organizations monitor only "is the service up?" rather than "is it behaving well for the business process?"

NIST AI RMF 1.0 (2023) and ISO/IEC 23894 (2023) emphasize that AI risk includes not only technical failures but also quality degradation, unintended effects, and poor detectability of errors. For operators, this requires a shift in perspective: from monitoring components to monitoring system behavior in its context of use.

## Three observability layers for AI

In practice, a three-layer model works best.

### Layer 1: Service health

This is classic SRE telemetry:

- service availability
- end-to-end and per-component latency
- tool-integration errors and timeouts
- API limit usage and token cost

Without this layer, there is no operational stability, but this layer alone will not tell you whether AI outputs are good.

### Layer 2: Behavioral quality

This layer measures AI system behavior:

- response acceptance rate
- human rework rate
- escalation rate to support
- frequency of policy violations or false-positive refusals

This is where the first degradation signals most often appear.
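As a rough illustration, the Layer 2 rates can be derived from per-response outcome labels. The sketch below assumes each reviewed response carries exactly one label. The label names and the `behavioral_rates` helper are hypothetical, not part of any specific tool.

```python
from collections import Counter


def behavioral_rates(outcomes):
    """Return each outcome label's share of all reviewed responses."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        return {}  # no reviewed responses yet, so no rates to report
    return {label: count / total for label, count in counts.items()}


# Hypothetical labels for ten reviewed responses.
sample = ["accepted"] * 6 + ["reworked"] * 2 + ["escalated", "rejected"]
rates = behavioral_rates(sample)
print(rates["accepted"])  # 0.6
print(rates["reworked"])  # 0.2
```

Computing the same rates per usage segment rather than globally makes it easier to see where a drop originates.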

### Layer 3: Business impact

The highest layer links AI to process outcomes:

- case resolution time
- unit cost of handling
- impact on SLA and user satisfaction
- share of cases needing manual correction after AI use

Only this layer shows whether an observed deviation has business significance.

## What to log to enable diagnosis

The biggest operational pain is: "we know quality dropped, but we do not know why." The usual cause is insufficient log granularity.

Minimum diagnostic dataset:

- model, prompt, and policy version identifiers
- tool identifiers and call status
- intent or task-type classification
- safety-validation result and block reason code
- trace of escalation-to-human decisions
- post-hoc quality signal (for example, accepted/reworked/rejected)
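One way to make this dataset concrete is a single structured record per AI execution, emitted as a JSON log line. The following is a minimal sketch. The `AIDecisionRecord` class, its field names, and the sample values are illustrative assumptions, not a prescribed schema.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class AIDecisionRecord:
    """Hypothetical per-execution record covering the minimum diagnostic dataset."""
    trace_id: str
    model_version: str
    prompt_version: str
    policy_version: str
    tool_calls: list                  # e.g. [{"tool": "crm_lookup", "status": "ok"}]
    task_type: str
    safety_result: str                # e.g. "pass" or "blocked"
    block_reason_code: Optional[str]  # set only when safety_result is "blocked"
    escalated_to_human: bool
    quality_signal: Optional[str] = None  # filled in post hoc: accepted/reworked/rejected


def to_log_line(record):
    """Serialize the record as one JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


record = AIDecisionRecord(
    trace_id="t-001",
    model_version="model-2026-05",
    prompt_version="prompt-v14",
    policy_version="policy-v3",
    tool_calls=[{"tool": "crm_lookup", "status": "timeout"}],
    task_type="refund_request",
    safety_result="pass",
    block_reason_code=None,
    escalated_to_human=True,
)
line = to_log_line(record)
print(line)
```

Because the quality signal arrives later than the execution itself, it is left optional here and would be joined back to the record through `trace_id`.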

OpenTelemetry (2023/2024) provides an instrumentation standard that can be extended with AI-specific attributes. This helps avoid tooling chaos and maintain a coherent tracing model.
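OpenTelemetry attribute values are limited to strings, booleans, numbers, and homogeneous sequences of those, so AI-specific fields usually need flattening before they are attached to a span. The sketch below uses only the standard library. The `to_span_attributes` helper and the `app.ai` prefix are assumptions chosen for illustration, not an official semantic convention.

```python
def to_span_attributes(record, prefix="app.ai"):
    """Flatten a dict into attribute-safe key/value pairs.

    None values are skipped, primitives are kept, lists of strings are kept,
    and anything else is stringified so it can still be attached to a span.
    """
    attributes = {}
    for key, value in record.items():
        if value is None:
            continue
        name = f"{prefix}.{key}"
        if isinstance(value, (str, bool, int, float)):
            attributes[name] = value
        elif isinstance(value, (list, tuple)) and all(isinstance(v, str) for v in value):
            attributes[name] = list(value)
        else:
            attributes[name] = str(value)
    return attributes


attrs = to_span_attributes({
    "model_version": "model-2026-05",
    "block_reason_code": None,
    "tools": ["crm_lookup", "billing_api"],
    "escalated_to_human": False,
})
print(attrs)
```

With the OpenTelemetry Python API installed, a dict like this could be passed to `span.set_attributes(attrs)` inside a `tracer.start_as_current_span(...)` block.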

## Drift and regression: separating noise from real issues

In AI systems, deviations are normal, so not every metric jump warrants a critical alert. The key is separating short-term noise from sustained regression trends.

Practical method:

- define baselines by usage segment, not only globally
- use dynamic thresholds accounting for seasonality and query types
- monitor combined trends: quality + cost + escalations
- trigger high-severity alerts only when deviation persists

This approach reduces alert fatigue and helps teams focus on incidents that truly affect the business.
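The persistence rule can be sketched as a check over consecutive measurement windows per segment. This is a simplified illustration with fixed tolerances rather than seasonality-aware thresholds. The segment names, baselines, and the `sustained_regression` helper are hypothetical.

```python
def sustained_regression(values, baseline, tolerance, min_windows=3):
    """Return True only if the latest `min_windows` values all fall
    below `baseline - tolerance`, i.e. the deviation persists."""
    if len(values) < min_windows:
        return False
    floor = baseline - tolerance
    return all(v < floor for v in values[-min_windows:])


# Hypothetical per-segment acceptance-rate baselines and recent windows.
baselines = {"billing": 0.90, "onboarding": 0.75}
windows = {
    "billing": [0.91, 0.80, 0.90, 0.89],     # one dip, then recovery: noise
    "onboarding": [0.74, 0.66, 0.65, 0.64],  # three windows below the floor
}

alerts = {
    segment: sustained_regression(values, baselines[segment], tolerance=0.05)
    for segment, values in windows.items()
}
print(alerts)  # {'billing': False, 'onboarding': True}
```

A single global baseline would blur these two segments together. Keeping them separate is what lets the onboarding regression surface while the billing dip is ignored.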

## Alerting design: fewer alarms, more decisions

In AI observability, alert quality determines response speed. Too many alerts weaken vigilance; too few delay response to regression.

A good alert should include:

- detected deviation and segment context
- estimated business impact
- suggested runbook action
- response owner and response SLA

This turns an alert from a "technical signal" into a "call for operational decision."
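A minimal sketch of such an alert payload follows, assuming the fields listed above map one-to-one onto a record. The `DecisionAlert` class and all sample values are illustrative.

```python
from dataclasses import dataclass


@dataclass
class DecisionAlert:
    """Hypothetical alert payload carrying what an operator needs to decide."""
    deviation: str
    segment: str
    estimated_impact: str
    runbook_action: str
    owner: str
    response_sla_minutes: int

    def summary(self):
        """Render a one-line message suitable for a paging or chat channel."""
        return (
            f"[{self.segment}] {self.deviation} | impact: {self.estimated_impact} | "
            f"next: {self.runbook_action} | owner: {self.owner} "
            f"(respond within {self.response_sla_minutes} min)"
        )


alert = DecisionAlert(
    deviation="acceptance rate below baseline for 3 windows",
    segment="onboarding",
    estimated_impact="more cases needing manual rework",
    runbook_action="compare prompt/model versions, consider rollback",
    owner="ai-product-owner",
    response_sla_minutes=60,
)
print(alert.summary())
```

Making every field required means an alert cannot be emitted without an owner or a suggested action, which is the point of the design.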

## Runbooks for common AI incidents

Observability without runbooks ends in improvisation under pressure. For AI systems, maintain predefined response paths for at least four scenarios:

1. **Sudden acceptance-rate drop.** Actions: segment the failure, compare prompt/model versions, roll back the latest change, increase human review.

2. **Increase in safety-policy violations.** Actions: tighten filters, limit high-risk functions, analyze bypass attempts, update guardrails.

3. **Rising latency and tool timeouts.** Actions: switch to functional degradation mode, prioritize critical paths, switch to a fallback provider.

4. **Increasing cost at stable volume.** Actions: audit context length, adjust model routing, optimize tool-call count, control retry logic.

Runbooks should include trigger thresholds, decision roles, and incident closure conditions.
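Keeping runbooks as data makes the trigger thresholds, decision roles, and closure conditions explicit and testable. The sketch below encodes two of the scenarios above. The metric names, threshold values, and closure conditions are hypothetical placeholders to be replaced with an organization's own.

```python
from dataclasses import dataclass


@dataclass
class Runbook:
    name: str
    metric: str
    trigger_threshold: float
    direction: str          # "below" or "above" the threshold
    decision_role: str
    actions: list
    closure_condition: str

    def is_triggered(self, value):
        """Compare a metric value against the threshold in the configured direction."""
        if self.direction == "below":
            return value < self.trigger_threshold
        return value > self.trigger_threshold


RUNBOOKS = [
    Runbook(
        name="Sudden acceptance-rate drop",
        metric="acceptance_rate",
        trigger_threshold=0.70,  # assumed value
        direction="below",
        decision_role="AI product owner",
        actions=["segment the failure", "compare prompt/model versions",
                 "roll back the latest change", "increase human review"],
        closure_condition="acceptance rate back above threshold",  # assumed
    ),
    Runbook(
        name="Rising latency and tool timeouts",
        metric="tool_timeout_rate",
        trigger_threshold=0.05,  # assumed value
        direction="above",
        decision_role="platform/SRE",
        actions=["functional degradation mode", "prioritize critical paths",
                 "switch to fallback provider"],
        closure_condition="timeout rate back below threshold",  # assumed
    ),
]


def triggered_runbooks(metrics):
    """Return the names of runbooks whose trigger condition is met."""
    return [
        rb.name for rb in RUNBOOKS
        if rb.metric in metrics and rb.is_triggered(metrics[rb.metric])
    ]


print(triggered_runbooks({"acceptance_rate": 0.62, "tool_timeout_rate": 0.01}))
# ['Sudden acceptance-rate drop']
```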

## Observability and governance: who owns what

A common issue is fragmented accountability: platform monitors infrastructure, product monitors usage, and no one monitors decision quality. As a result, quality incidents bounce across teams.

A strong accountability model:

- platform/SRE: reliability and technical telemetry
- AI product owner: behavioral quality and process impact
- risk/governance: risk thresholds, incident classification, escalation
- operations/business: impact confirmation and remediation priority

This split supports AI incident response and shortens time from detection to correction.
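The ownership split can also be encoded as a routing table so that every signal type has exactly one first responder. In this sketch the signal-type keys and the `route_signal` helper are hypothetical. Sending unknown signals to risk/governance for classification is an assumption, not a rule from any standard.

```python
# Hypothetical mapping from signal type to the owning role described above.
OWNERS = {
    "service_health": "platform/SRE",
    "behavioral_quality": "AI product owner",
    "risk_threshold": "risk/governance",
    "business_impact": "operations/business",
}


def route_signal(signal_type):
    """Return the owning role; unclassified signals default to risk/governance."""
    return OWNERS.get(signal_type, "risk/governance")


print(route_signal("behavioral_quality"))  # AI product owner
print(route_signal("unlabeled_anomaly"))   # risk/governance
```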

## An operational dashboard that makes sense

An AI observability dashboard should not be a "wall of charts." It should answer three operator questions:

- what is breaking right now
- how large is the impact
- what is the next decision

In practice, build dashboard views around "service -> behavior -> business," with drill-down from strategic trend to single execution trace.

This approach combines SRE and LLMOps perspectives: stability of the technical system and stability of business outcomes.

## Executive Takeaway

**What changed?** AI observability is moving from infrastructure monitoring toward behavior-quality and business-impact monitoring for non-deterministic systems.

**Why does it matter?** An AI system can be technically healthy while degrading operational decisions if the organization does not track quality and drift over time.

**What should leaders do?** Implement a three-layer observability model, tie alerting to decision runbooks, and assign clear ownership for quality signals and escalation.

Paweł Kubisiak

Partner at AI&Scale, Editor in Chief

Partner at AI&Scale and Editor in Chief, responsible for editorial quality and direction across AI transformation, governance and scaling coverage.
