Paweł Kubisiak·2026-06-01·8 min read

# Agent Evals for Production Quality: How to Measure Before and After Launch

Many organizations launch AI agents after a successful demo and a few manual tests. At first, everything looks promising: responses are fast, users are interested, and interaction volume grows. After a few weeks, however, familiar problems appear: inconsistent responses, hard-to-reproduce failures, rising correction costs, and disputes between teams over whether an agent that works "well sometimes" is good enough.

The source of these issues is usually the same: no evaluation system that connects pre-production with real post-launch quality. The organization measures activity, but not decision reliability.

The central thesis of this playbook: evals for AI agents must be designed as a continuous quality-management system, not a one-time release test. Only then can you scale agents without scaling rework, risk, and hidden cost.

## Why classic QA is not enough for agents

AI agents differ from classic applications because their behavior depends on variable context: user requests, tools, external data, security policies, and model versions. This variability means a static pass/fail test set does not represent real quality.

NIST AI RMF 1.0 (2023) indicates that AI risk management requires continuous observation and adaptive controls. Organizations should combine:

- offline evaluations before production,
- tests in near-real conditions,
- online monitoring after launch,
- rapid correction and revalidation procedures.

Without this loop, an agent can be "good on benchmark" but weak in the process it is meant to run.

## Evals architecture: four quality layers

An effective eval system for agents should be built in layers. Each layer answers a different question and reduces a different risk type.

### Layer 1: Correctness evals

Checks whether the agent delivers a substantively correct result for clearly defined tasks. This uses reference case sets and acceptance criteria.

Example metrics:

- percentage of correct responses,
- precision/recall for classification tasks,
- percentage of responses requiring full human rewrite.

This layer minimizes risk of obvious factual errors.
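The example metrics above can be computed from labeled eval results along the following lines. This is a minimal sketch for a binary classification task; the `CaseResult` record and its field names are hypothetical, not part of any specific eval framework.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """One evaluated case (hypothetical record layout)."""
    expected: bool      # reference label from the acceptance criteria
    predicted: bool     # label produced by the agent
    full_rewrite: bool  # reviewer had to rewrite the response entirely


def correctness_metrics(results: list[CaseResult]) -> dict[str, float]:
    """Compute accuracy, precision, recall, and full-rewrite rate."""
    total = len(results)
    tp = sum(r.expected and r.predicted for r in results)
    fp = sum((not r.expected) and r.predicted for r in results)
    fn = sum(r.expected and (not r.predicted) for r in results)
    correct = sum(r.expected == r.predicted for r in results)
    rewrites = sum(r.full_rewrite for r in results)
    return {
        "accuracy": correct / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "full_rewrite_rate": rewrites / total if total else 0.0,
    }
```

Tracking the full-rewrite rate next to accuracy is useful because a response can match the reference label and still be unusable in its delivered form.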

### Layer 2: Robustness evals

Checks how the agent behaves with atypical or difficult inputs: ambiguous instructions, conflicting data, and policy-bypass attempts.

Guidance from OWASP Top 10 for LLM Applications (2023) and red-teaming practices helps design scenarios such as prompt injection, unauthorized data disclosure, or system-instruction violation.

This layer minimizes degradation risk under non-standard conditions.

### Layer 3: Process-quality evals

Checks whether the agent improves business process outcomes, not just "gives nice answers." This includes task completion time, rework rate, procedure compliance, and SLA impact.

This is critical because business value from agents is created in workflows, not in an isolated chat window.

### Layer 4: Safety and governance evals

Checks whether behavior aligns with security, privacy, and accountability policies. Boundary cases are especially important: when the agent should refuse, escalate to a human, or request additional authorization.

This layer reduces legal and reputational risk and supports the human-in-the-loop (HITL) model.

## How to build a test set that does not go stale in a week

The most common mistake is a static test set created at project start and left unchanged after product changes. Agents learn from new data, integrate new tools, and handle new user intents, so tests must evolve too.

A practical rule: maintain three case buckets.

- **Golden set**: representative critical cases for quality and safety; stable trend-comparison core.
- **Recent failures set**: cases from real incidents and complaints; updated regularly.
- **Change-impact set**: cases affected by the latest model, prompt, or integration change.

This approach combines comparability with freshness and prevents evals from "passing" while users still see regressions.
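One way to keep the three buckets in a single structure is sketched below. The `EvalCase` and `EvalSuite` names, and the idea of selecting change-impact cases by tag, are assumptions made for illustration; the article does not prescribe a selection mechanism.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EvalCase:
    """A single test case; tags name the components it exercises (assumed)."""
    case_id: str
    prompt: str
    tags: frozenset[str] = frozenset()


@dataclass
class EvalSuite:
    golden: list[EvalCase] = field(default_factory=list)
    recent_failures: list[EvalCase] = field(default_factory=list)
    change_impact: list[EvalCase] = field(default_factory=list)

    def record_incident(self, case: EvalCase) -> None:
        """Add a case from a real incident, skipping duplicates by id."""
        if all(c.case_id != case.case_id for c in self.recent_failures):
            self.recent_failures.append(case)

    def refresh_change_impact(self, changed_tags: set[str]) -> None:
        """Rebuild the change-impact set from cases touching changed components."""
        seen: set[str] = set()
        self.change_impact = []
        for case in self.golden + self.recent_failures:
            if case.case_id not in seen and case.tags & changed_tags:
                seen.add(case.case_id)
                self.change_impact.append(case)
```

In this sketch the golden set is never modified by the helper methods, which keeps it usable as the stable core for trend comparison.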

## Production gate: when an agent is ready to deploy

The eval playbook should define a formal go/no-go gate. Without it, deployment decisions are vulnerable to deadline pressure and sponsor intuition.

Minimum production gate for an agent:

1. Correctness thresholds met on the golden set.
2. No critical failures in robustness and safety evals.
3. Process-metric impact confirmed in pre-production.
4. Online monitoring and incident-response plan defined.
5. Human-override conditions clearly defined.

Conditional release is acceptable, but conditions must be measurable and owned.
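The five gate conditions can be encoded as an explicit check that returns both the decision and the reasons behind it. The sketch below is illustrative: `GateInputs`, its field names, and the use of accuracy as the correctness measure are assumptions, and real thresholds would come from the team's own acceptance criteria.

```python
from dataclasses import dataclass


@dataclass
class GateInputs:
    """Evidence collected before a release decision (hypothetical fields)."""
    golden_accuracy: float
    accuracy_threshold: float
    critical_robustness_failures: int
    critical_safety_failures: int
    process_impact_confirmed: bool
    monitoring_plan_defined: bool
    override_conditions_defined: bool


def evaluate_gate(inputs: GateInputs) -> tuple[bool, list[str]]:
    """Return (go, blockers); go is True only when no blocker is found."""
    blockers: list[str] = []
    if inputs.golden_accuracy < inputs.accuracy_threshold:
        blockers.append("correctness threshold not met on golden set")
    if inputs.critical_robustness_failures > 0:
        blockers.append("critical robustness failures present")
    if inputs.critical_safety_failures > 0:
        blockers.append("critical safety failures present")
    if not inputs.process_impact_confirmed:
        blockers.append("process-metric impact not confirmed")
    if not inputs.monitoring_plan_defined:
        blockers.append("monitoring and incident-response plan missing")
    if not inputs.override_conditions_defined:
        blockers.append("human-override conditions undefined")
    return (not blockers, blockers)
```

Returning the list of blockers rather than a bare boolean gives a conditional release something concrete to attach owners and deadlines to.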

## Post-launch monitoring: online evals as an early-warning system

After launch, quality develops a life of its own. Query profiles, seasonality, user behavior, and source data all change. That is why online evals cannot be just a "nice dashboard"; they must function as deviation detection.

It is worth monitoring in parallel:

- response quality: acceptance rate, rework rate, complaint rate,
- operational reliability: latency, timeouts, tool errors,
- safety: policy blocks, escalations, detected abuse attempts,
- economics: cost per completed task and cost per correctly completed task.

The key is linking these metrics to alerts and response procedures: scope limits, configuration rollback, extra review, or temporary feature suspension.
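As a rough illustration of linking metrics to response procedures, the sketch below computes rework rate and the two cost metrics for one monitoring window and maps threshold breaches to actions. The `WindowStats` fields, the threshold parameters, and the pairing of each metric with a specific action are assumptions; in practice the mapping would be set per agent and risk class.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Counts for one monitoring window, for example one day of traffic."""
    tasks_completed: int
    tasks_correct: int
    tasks_reworked: int
    total_cost: float


def rework_rate(stats: WindowStats) -> float:
    return stats.tasks_reworked / stats.tasks_completed if stats.tasks_completed else 0.0


def cost_per_completed_task(stats: WindowStats) -> float:
    return stats.total_cost / stats.tasks_completed if stats.tasks_completed else 0.0


def cost_per_correct_task(stats: WindowStats) -> float:
    # Infinite when nothing was completed correctly, so a cost alert always fires.
    return stats.total_cost / stats.tasks_correct if stats.tasks_correct else float("inf")


def triggered_actions(
    stats: WindowStats, max_rework_rate: float, max_cost_per_correct: float
) -> list[str]:
    """Map threshold breaches to response procedures (illustrative mapping)."""
    actions: list[str] = []
    if rework_rate(stats) > max_rework_rate:
        actions.append("extra_review")
    if cost_per_correct_task(stats) > max_cost_per_correct:
        actions.append("scope_limit")
    return actions
```

The gap between cost per completed task and cost per correctly completed task is itself a signal: it widens as more of the spend goes to tasks that later need correction.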

## How to connect evals with human-in-the-loop

An eval system does not replace humans; it shows where human oversight is most needed. High operational maturity means dynamically managing oversight levels.

Example model:

- low risk and high historical quality: automation with monitoring,
- medium risk or worsening trend: mandatory sampling review,
- high risk or critical drift: full human validation before action execution.

This approach reduces both over-automation risk and the cost of manually reviewing everything.
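The example model can be written as a small decision function. This is a sketch under stated assumptions: the risk labels are plain strings, the signal names are hypothetical, and the fallback to sampling review for combinations the model does not cover (such as low risk without a strong quality history) is a conservative choice made here, not something the model specifies.

```python
from enum import Enum


class Oversight(Enum):
    AUTOMATION_WITH_MONITORING = "automation_with_monitoring"
    SAMPLING_REVIEW = "sampling_review"
    FULL_HUMAN_VALIDATION = "full_human_validation"


def oversight_level(
    risk: str,
    high_historical_quality: bool,
    worsening_trend: bool = False,
    critical_drift: bool = False,
) -> Oversight:
    """Pick an oversight level, checking the strictest conditions first."""
    if risk == "high" or critical_drift:
        return Oversight.FULL_HUMAN_VALIDATION
    if risk == "medium" or worsening_trend:
        return Oversight.SAMPLING_REVIEW
    if risk == "low" and high_historical_quality:
        return Oversight.AUTOMATION_WITH_MONITORING
    # Assumed fallback for cases the example model leaves open.
    return Oversight.SAMPLING_REVIEW
```

Evaluating the strictest rule first means a low-risk agent with critical drift still gets full human validation, which matches the intent of managing oversight dynamically.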

## Anti-patterns that break eval programs

Several recurring mistakes:

- benchmark-driven theater: optimizing for test score, not business process,
- metrics without decisions: full dashboards, but no thresholds or owners for response,
- no risk segmentation: one standard for low- and high-impact tasks,
- evaluation detached from operations: quality team has no influence on release and rollback,
- quality-cost blind spot: correctness is measured, but cost of achieving it is not.

If these anti-patterns become entrenched, organizations scale the number of agents faster than their capacity to sustain quality.

## Operating cadence model for team and leadership

For evals not to remain a one-time project, you need an operating rhythm:

- daily: review critical signals and incidents,
- weekly: analyze quality and cost trends for key agents,
- monthly: decide threshold changes, automation scope, and remediation plans,
- quarterly: portfolio review of agents at leadership level.

This cadence connects technical execution with management oversight. Teams see what to fix, and leadership sees whether the organization is actually gaining value from AI agents.

## Cost-of-quality model for AI agents

In traditional IT, cost of quality is often seen as testing and control cost. For AI agents, the picture is broader because bad answers create distributed process costs: extra customer contact, manual corrections, delayed decisions, legal escalations, and trust loss.

So when designing evals, measure at least four categories:

- **cost of prevention**: cost of preparing test data, scenarios, and evaluation automation,
- **cost of appraisal**: cost of continuous monitoring, quality reviews, and audits,
- **cost of internal failure**: cost of errors detected before customer or critical-process impact,
- **cost of external failure**: cost of errors reaching customer, partner, or regulator.

This model helps show leadership that investment in evals lowers total quality cost, even if short-term operating cost rises.
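The arithmetic behind that argument is simple enough to sketch. The `QualityCost` class below and the grouping into `control` and `failure` subtotals are illustrative choices, and any figures plugged into it would be the organization's own estimates.

```python
from dataclasses import dataclass


@dataclass
class QualityCost:
    """Cost of quality for one period, in a single currency unit."""
    prevention: float
    appraisal: float
    internal_failure: float
    external_failure: float

    @property
    def control(self) -> float:
        # Spend on evals and oversight: prevention plus appraisal.
        return self.prevention + self.appraisal

    @property
    def failure(self) -> float:
        # Cost of errors, whether caught internally or reaching the outside.
        return self.internal_failure + self.external_failure

    @property
    def total(self) -> float:
        return self.control + self.failure
```

Comparing two periods with this structure shows the trade-off directly: `control` may rise after an eval investment while `total` falls, provided the reduction in `failure` is larger than the added spend.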

## Designing decision thresholds for different risk classes

One reason for business-vs-tech friction is the lack of a shared language for quality thresholds. The same result may be acceptable in low-impact tasks and unacceptable in regulated tasks.

In practice, use a threshold matrix:

- **Class A (high impact):** low tolerated error, mandatory human approval, rapid escalation on regression.
- **Class B (medium impact):** moderate error tolerance, sampling review, and conditional automation.
- **Class C (low impact):** greater automation, trend monitoring, and periodic reviews.

This segmentation simplifies go/no-go decisions, because required standards before scale-up are predefined.
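A threshold matrix of this kind can live in versioned configuration. In the sketch below the numeric error limits are placeholders chosen only to make the example concrete, and `ClassPolicy`, `THRESHOLD_MATRIX`, and `ready_to_scale` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassPolicy:
    max_error_rate: float  # placeholder value, to be agreed by business and tech
    review_mode: str


# Illustrative matrix; the limits are assumptions, not recommended values.
THRESHOLD_MATRIX: dict[str, ClassPolicy] = {
    "A": ClassPolicy(max_error_rate=0.01, review_mode="mandatory_human_approval"),
    "B": ClassPolicy(max_error_rate=0.05, review_mode="sampling_review"),
    "C": ClassPolicy(max_error_rate=0.10, review_mode="periodic_review"),
}


def ready_to_scale(risk_class: str, observed_error_rate: float) -> bool:
    """Check an observed error rate against the predefined class limit."""
    return observed_error_rate <= THRESHOLD_MATRIX[risk_class].max_error_rate
```

With these placeholder limits, an observed error rate of 4% passes for Class C and fails for Class A, which is the "same result, different verdict" situation described above.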

## How to prepare for an agent quality audit

More and more organizations must answer the auditor's question: "How do you prove the agent performs as intended?" The answer cannot rely on a single test report.

A minimum audit package should include:

- versioned eval catalog and acceptance criteria,
- change history of model, prompts, and tools with validation results,
- quality-incident register and remediation actions,
- evidence of human-in-the-loop operation for high-risk task classes.

This package improves regulatory readiness and internal quality discipline. Teams see sooner which changes improve the product and which merely move the problem.

## Executive Takeaway

What changed? Agent evals are shifting from a pre-release test to a continuous system for quality, safety, and operating economics after launch.

Why does it matter? Without linking offline and online evals, companies often launch agents that look good in demos but generate rework, risk, and unstable production cost.

What should leaders do? Implement a four-layer evaluation model, a formal go/no-go gate, and an operating cadence based on quality, risk, and cost trends.

Paweł Kubisiak

Partner at AI&Scale and Editor in Chief, responsible for editorial quality and direction across AI transformation, governance and scaling coverage.