You’ve deployed an AI agent. It seems to be working. But is it actually good? “It mostly gets the right answer” is not an evaluation framework. If you want to improve agent performance systematically, you need to measure it systematically.
Here’s the framework we use.
The six metrics that matter
1. Task completion rate
The most fundamental question: does the agent finish what it starts? Track the percentage of tasks that reach a successful terminal state versus those that fail, time out, or require human takeover.
A low completion rate tells you something is broken. A high completion rate tells you the agent is functional — but not necessarily good. You need the other metrics to distinguish “completed” from “completed well.”
2. Accuracy and quality
Did the agent produce the right output? This is domain-specific and often the hardest metric to automate.
For structured tasks (data extraction, classification, form filling), you can compare agent outputs against ground truth datasets. For less structured tasks (drafting responses, summarizing documents), you need evaluation rubrics — either human reviewers scoring outputs on defined criteria, or LLM-as-judge evaluations calibrated against human judgments.
We run both. Automated evaluations catch regressions quickly. Periodic human evaluations catch quality issues that automated checks miss.
3. Latency
How long does the agent take to complete a task? Users and downstream systems have latency expectations. An agent that produces a perfect answer in ten minutes when the user needed it in ten seconds has failed.
Break latency down into components: LLM inference time, tool call latency, and wait time (for human-in-the-loop steps or external dependencies). This decomposition tells you where to optimize.
4. Cost per task
Every agent execution costs money: LLM tokens, API calls, compute resources. Track the fully loaded cost of each task execution.
This metric matters for two reasons. First, it determines ROI. If the agent costs more per task than the human it’s replacing, you have a problem. Second, cost spikes often indicate bugs — an agent stuck in a reasoning loop or making redundant API calls will show up in cost data before it shows up anywhere else.
5. Escalation rate
How often does the agent hand off to a human? Some escalation is good — it means the guardrails are working. Too much escalation means the agent isn’t autonomous enough to justify its existence.
Track escalation by reason: low confidence, out-of-scope input, policy constraint, error recovery. Each category has a different fix. High low-confidence escalations might mean you need better prompts or more examples. High out-of-scope escalations might mean you need to expand the agent’s capabilities or refine the task routing.
6. User satisfaction
If humans interact with the agent’s outputs — receiving its responses, reviewing its work, using its results — their satisfaction is the ultimate quality signal. Collect it.
This can be as simple as a thumbs up/down on agent outputs, or as structured as periodic CSAT surveys. The important thing is closing the loop between what the agent produces and whether it’s actually useful.
Why you need a framework, not just metrics
Individual metrics are easy to game or misinterpret. An agent can have a 95% completion rate by only attempting easy tasks. It can have low cost per task by cutting corners on quality. It can have high user satisfaction scores because people only rate the good outputs.
The framework works because the metrics constrain each other. High completion rate plus high accuracy plus acceptable latency plus reasonable cost plus low escalation rate plus positive user feedback — that’s a good agent. Weakness in any dimension tells you specifically where to focus.
Evaluation loops
Static evaluation is not enough. Agent performance changes over time as data distributions shift, upstream systems change, and user expectations evolve.
Build evaluation into your operations cycle:
- Continuous automated evaluation. Run test suites against your agents on a schedule. Catch regressions before users do.
- Periodic human evaluation. Sample agent outputs weekly or monthly and have domain experts review them against your quality rubrics.
- Production monitoring. Track all six metrics in real time. Set alerts for anomalies.
- Feedback integration. Route user feedback and escalation data back into agent improvement — better prompts, additional examples, refined tool configurations.
This creates a flywheel: measure, identify weaknesses, improve, measure again. Agents that are in this loop get better over time. Agents that aren’t stagnate — or quietly degrade while everyone assumes they’re fine.
Start measuring before you optimize
The most common mistake we see is teams optimizing agent behavior based on vibes. “It feels like it’s getting better.” That’s not engineering. Define your metrics, establish baselines, and let the data tell you where to invest. The framework won’t make your agents perfect, but it will make improvement deliberate instead of accidental.