Evals let you create chat-based, multi-turn tests for your VAPI assistants.
00:06
These tests are used for smoke or regression testing prior to publishing your assistants to handle calls in production.
00:15
They are a great way to know whether your assistant's performance has improved, degraded, or stayed the same after any prompt or configuration changes.
00:25
Let's see how they work.
00:28
Before you can create an eval, you first need to create an assistant.
00:32
Since I already have one, I'm going to go ahead and get started.
00:35
Create an eval, give it a name, and then select the assistant that you want to evaluate.
00:40
Once you select it, you'll see a read-only view of the assistant's prompt.
00:44
In this case, the assistant asks the user when they want to book an appointment. The available days are Monday, Tuesday, or Friday, and anything outside of those days should not be possible.
00:55
Once the user specifies a day, the assistant makes a tool call with an argument containing the date the user mentioned.
01:05
After triggering the tool call, the assistant asks the user if there's anything else it can help them with.
01:11
And if they say no, the assistant should say "Have an A1 day."
01:16
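For reference, here's a minimal sketch of what a system prompt with this behavior could look like. The actual prompt is only shown read-only in the dashboard, so the wording and the tool name below are assumptions, not the prompt from the video.

```typescript
// Illustrative only: a hypothetical system prompt matching the behavior
// described above; the exact wording and tool name are assumptions.
const systemPrompt = `
You are an appointment booking assistant.
- Ask the user which day they would like to book an appointment.
- Only Monday, Tuesday, or Friday are available; politely decline any other day.
- Once the user picks a valid day, call the send text tool with the
  recipient's phone number and a body containing the chosen day.
- After the tool call, ask if there is anything else you can help with.
- If the user says no, respond with exactly: "Have an A1 day!"
`;
```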
So let's evaluate that this is actually happening.
01:21
Let's start adding turns that we want to evaluate.
01:24
The first one is a user turn, where the user simply says "Hi."
01:30
And then we'll add an assistant turn.
01:32
And here we're going to evaluate that the assistant responds with the days on which the user can book an appointment.
01:44
So the pass criteria: the assistant should offer Monday, Tuesday, or Friday as appointment days.
02:00
And the fail criteria: the assistant offers days outside of Monday, Tuesday, or Friday as appointment days.
02:21
Let's save this and we'll run a quick test.
02:28
Okay, so far, so good.
02:30
The user said hi, and the assistant responded with the correct days.
02:35
So the evaluation has passed.
02:39
Let's add a few more turns here.
02:40
The next turn I want to add is for the user to say "I want to book an appointment for Monday," to which the assistant will respond with a tool call request.
02:56
And so we have to set the evaluation to exact, and we're going to add a tool call.
03:01
And the tool we want to call is the send text tool, and we'll specify what we expect it to be passed.
03:06
We expect two arguments. The first is "to", which is the phone number.
03:15
And we also expect a "body" containing the day we selected, which is Monday.
03:22
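As a rough illustration, the exact-match expectation for this turn could be expressed like this. The field names and the snake_case tool name are assumptions for illustration, not the actual Vapi eval schema, and the phone number is a placeholder.

```typescript
// Hypothetical shape for an exact tool-call expectation; not the actual
// Vapi schema.
interface ExpectedToolCall {
  name: string;                      // the tool the assistant should call
  arguments: Record<string, string>; // the exact arguments we expect
}

const expectedToolCall: ExpectedToolCall = {
  name: "send_text",      // the "send text" tool configured on the assistant
  arguments: {
    to: "+15555550123",   // placeholder phone number for illustration
    body: "Monday",       // the day the user selected in this test
  },
};
```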
So at this point the user will ask for an appointment day, and the assistant will then make a tool call request.
03:29
And after the tool call, the assistant should offer additional help.
03:39
So, the pass criteria:
03:41
Assistant offers to help with anything else the user needs.
03:51
And the fail criteria: the assistant does not offer additional help.
03:58
And then we'll have the user say "No, thank you."
04:04
And finally, the assistant is supposed to respond with exactly "Have an A1 day" with an exclamation mark.
04:13
And so at this point we've created our entire eval set.
04:17
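Putting it all together, the set of turns we just built can be summarized roughly like this. The structure, the field names, and the "judge" label for criteria-based checks are my own shorthand for illustration, not the actual Vapi eval format.

```typescript
// Hypothetical summary of the eval turns built above; structure and field
// names are illustrative shorthand, not the actual Vapi format.
type Turn =
  | { role: "user"; content: string }
  | { role: "assistant"; evaluation: "judge"; pass: string; fail: string }
  | {
      role: "assistant";
      evaluation: "exact";
      toolCall?: { name: string; arguments: Record<string, string> };
      content?: string;
    };

const turns: Turn[] = [
  { role: "user", content: "Hi" },
  {
    role: "assistant",
    evaluation: "judge",
    pass: "Assistant offers Monday, Tuesday, or Friday as appointment days",
    fail: "Assistant offers days outside of Monday, Tuesday, or Friday",
  },
  { role: "user", content: "I want to book an appointment for Monday" },
  {
    role: "assistant",
    evaluation: "exact",
    toolCall: { name: "send_text", arguments: { to: "+15555550123", body: "Monday" } },
  },
  {
    role: "assistant",
    evaluation: "judge",
    pass: "Assistant offers to help with anything else the user needs",
    fail: "Assistant does not offer additional help",
  },
  { role: "user", content: "No, thank you" },
  { role: "assistant", evaluation: "exact", content: "Have an A1 day!" },
];
```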
Now let's select the assistant and save and run this test.
04:34
So it looks like all the steps have passed.
04:37
So the assistant has offered availability for Monday, Tuesday, or Friday.
04:41
And the user said that they want to book an appointment for Monday.
04:44
And we initiated the tool call with the two arguments.
04:47
Then the assistant offered additional help.
04:51
And when the user said "No, thank you," the assistant said exactly "Have an A1 day."
04:59
To see how this eval would have helped us catch an undesirable prompt change in our assistant, let's go back to our assistant and make a breaking change.
05:08
So in this case, instead of saying "Have an A1 day," I'll make it say "Have a great day."
05:14
I'm going to change this configuration and then go back to my evaluation.
05:21
In use cases where compliance matters, you want your assistant to say disclosures in the exact way that you expect.
05:30
So this is a great way to use evals to test for that.
05:33
In this case, I've changed the closing message from "Have an A1 day" to "Have a great day."
05:41
The eval stays the same, but the prompt has changed.
05:44
And what I wanted to show you is how you can catch this before you release the new changes, simply by running these tests.
05:52
So now I'm going to rerun this eval, and I should expect the last turn to fail because the prompt has changed.
06:03
So the evaluation has failed, and the last step is the one that failed.
06:07
So now this test has told me that something I changed in my prompt is no longer valid.
06:12
It's no longer compliant with my test.
06:14
And so this is a great way to use evals to catch regressions before you publish your assistant changes to production.