Introduction to evaluating agents as digital workers.
00:16
Agent Performance Dashboard
Dashboard overview of agent performance metrics.
00:33
Individual Agent Metrics
Drilling down into individual agent performance metrics.
00:49
Performance Over Time
Tracking agent performance over time and by version.
01:17
Conversation Analysis
Analyzing specific conversations and their ratings.
02:02
Lessons Learned and Fixes
Reviewing lessons learned and suggested fixes for agents.
02:22
Improving Agent Instructions
Implementing suggested improvements to agent prompts.
Transcript
00:00
So we've created something that's rather rare in the industry, where we're treating the agents as digital workers.
00:08
And we're evaluating their performance a lot like you would evaluate any other worker within your organization.
00:16
So you can see here this dashboard that gives you a bird's-eye view of how the agents are performing across the system.
00:23
We're tracking these things over time.
00:24
This is a report that shows you your top-performing agents, agents that need some attention, and recent alerts pertaining to agent performance.
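As a rough sketch of how a summary report like that might be assembled, here's a minimal example; the `AgentMetrics` fields, names, and thresholds are all hypothetical, not taken from the actual product.

```python
from dataclasses import dataclass

# Hypothetical per-agent rollup; the field names are illustrative only.
@dataclass
class AgentMetrics:
    name: str
    avg_score: float        # overall rating on a 0-10 scale
    tool_error_rate: float  # fraction of tool calls that failed

def summarize(agents: list[AgentMetrics], attention_below: float = 6.0) -> dict:
    """Split the fleet into top performers, agents needing attention,
    and alerts, mirroring the three dashboard sections."""
    ranked = sorted(agents, key=lambda a: a.avg_score, reverse=True)
    return {
        "top_performers": [a.name for a in ranked[:3]],
        "needs_attention": [a.name for a in ranked if a.avg_score < attention_below],
        "alerts": [f"{a.name}: tool error rate {a.tool_error_rate:.0%}"
                   for a in ranked if a.tool_error_rate > 0.2],
    }

if __name__ == "__main__":
    fleet = [
        AgentMetrics("support-bot", 8.7, 0.05),
        AgentMetrics("billing-bot", 5.4, 0.31),
    ]
    print(summarize(fleet))
```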
00:33
And you can of course drill down and look at how any particular agent is doing.
00:38
Take this agent here, where we can see its performance metrics.
00:42
It's being rated on several dimensions.
00:44
Some of them are displayed here.
00:46
And we're tracking that performance over time.
00:49
So a new data point gets created for pretty much every version of the agent.
00:56
We mark those as milestones, which helps us track how the agent's performance differs from one version to the next.
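A minimal sketch of recording one data point per agent version, with milestones marked whenever the version changes; the record shape here is an assumption for illustration.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class VersionDataPoint:
    version: str             # agent version, e.g. "v12"
    day: date                # when the data point was recorded
    avg_score: float         # overall rating for that period
    milestone: bool = False  # True when a new version first appears

@dataclass
class PerformanceHistory:
    points: list[VersionDataPoint] = field(default_factory=list)

    def record(self, version: str, avg_score: float, day: date | None = None):
        # Mark a milestone whenever the version differs from the last point.
        prev = self.points[-1].version if self.points else None
        self.points.append(VersionDataPoint(
            version=version,
            day=day or date.today(),
            avg_score=avg_score,
            milestone=(version != prev),
        ))

history = PerformanceHistory()
history.record("v12", avg_score=7.8)
history.record("v13", avg_score=6.1)  # new version -> milestone data point
```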
01:05
And you can see that something's changed recently here.
01:07
As of January 27th, it looks like we're seeing a few more tool errors and some knowledge gaps.
01:15
So we can actually drill down into that.
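One simple way a shift like that could be flagged is by comparing failure counts against a baseline window; the category names and threshold below are assumed for illustration.

```python
def regression_flags(baseline: dict, current: dict, tolerance: float = 1.5) -> list[str]:
    """Flag any failure category (e.g. tool_errors, knowledge_gaps) that has
    grown by more than `tolerance` times its baseline count."""
    flags = []
    for category, base_count in baseline.items():
        now = current.get(category, 0)
        if now > base_count * tolerance:
            flags.append(f"{category}: {base_count} -> {now}")
    return flags

# e.g. comparing the week before January 27th to the week after:
print(regression_flags({"tool_errors": 4, "knowledge_gaps": 2},
                       {"tool_errors": 9, "knowledge_gaps": 5}))
```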
01:17
We can look at individual conversations.
01:20
We can see the ratings for each one.
01:22
I can click into this particular conversation, which had a low score, and review the evaluation of that specific conversation.
01:29
It's got a score of 5.4 out of 10.
01:32
And we can see what the conversation was about, the outcomes that were achieved, and how the exact rating was composed.
01:41
The agent scored well on helpfulness, accuracy, and clarity.
01:45
But user satisfaction and tool success rate were both low.
01:50
That's what brought the overall performance down.
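For illustration, an overall conversation score like the 5.4 here could be a weighted average of the rated dimensions; the numbers and equal weights below are invented, not the evaluator's actual formula.

```python
# Hypothetical dimension ratings (0-10) for a low-scoring conversation:
# strong on helpfulness, accuracy, and clarity, weak on the other two.
ratings = {
    "helpfulness": 8.0,
    "accuracy": 8.0,
    "clarity": 8.0,
    "user_satisfaction": 3.0,
    "tool_success_rate": 2.0,
}

# Invented equal weighting; the real evaluator's weights are not shown.
weights = {dim: 1.0 for dim in ratings}

overall = sum(ratings[d] * weights[d] for d in ratings) / sum(weights.values())
print(f"overall score: {overall:.1f} / 10")  # -> 5.8 with these made-up numbers
```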
01:53
The evaluator also computes lessons learned, which are then fed back into the agent for potential improvement.
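Conceptually, that feedback loop might look something like this sketch; the `Evaluation` shape and the score threshold are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Evaluation:
    conversation_id: str
    score: float                # overall rating, 0-10
    lessons_learned: list[str]  # e.g. "verify the account ID before refunds"

def collect_lessons(evaluations: list[Evaluation], below: float = 6.0) -> list[str]:
    """Gather lessons from low-scoring conversations so they can be
    fed back into the agent as candidate improvements."""
    lessons = []
    for ev in evaluations:
        if ev.score < below:
            lessons.extend(ev.lessons_learned)
    return lessons
```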
02:02
Every agent gets evaluated like this.
02:05
So going back to the performance evaluation for our agent, we can look at the knowledge gaps that were discovered.
02:12
And for any given knowledge gap, we also have suggested potential fixes.
02:16
I can click on one of those and it will tell me what I need to do in order to improve the agent's performance.
02:22
In this case, it is suggesting that we add some instructions to the agent system prompt.
02:27
I can review this, and based on what I'm seeing, if I approve it, it will be added to the agent's instructions.
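A minimal sketch of that approve-and-apply step, assuming a suggested fix is simply a block of text appended to the system prompt:

```python
def apply_fix(system_prompt: str, suggested_instructions: str, approved: bool) -> str:
    """Append a suggested fix to the agent's system prompt, but only
    after a human reviewer has approved it."""
    if not approved:
        return system_prompt  # leave the prompt untouched
    return system_prompt.rstrip() + "\n\n" + suggested_instructions

# Hypothetical example of an approved fix being applied:
prompt = "You are a billing support agent."
fix = "Always confirm the customer's account ID before using the refund tool."
print(apply_fix(prompt, fix, approved=True))
```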
02:35
And then, later on, we can measure how the agent's performance changes as a result of those updates.
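Measuring that effect could be as simple as comparing average scores on either side of a version milestone; again, a sketch with an assumed data shape.

```python
from statistics import mean

def version_delta(history: list[dict], before: str, after: str) -> float:
    """Compare average scores for two agent versions to see whether
    an approved change actually moved the metrics."""
    def avg(version: str) -> float:
        scores = [p["score"] for p in history if p["version"] == version]
        return mean(scores) if scores else float("nan")
    return avg(after) - avg(before)

history = [
    {"version": "v12", "score": 5.4},
    {"version": "v12", "score": 6.0},
    {"version": "v13", "score": 7.1},
]
print(f"change after update: {version_delta(history, 'v12', 'v13'):+.1f}")
```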