Evals let you create chat-based, multi-turn tests for your VAPI assistants.
00:06
These tests are used for smoke or regression testing prior to publishing your assistants to handle calls in production.
00:15
They are a great way to know whether your assistant's performance has improved, degraded, or stayed the same after any prompt or configuration changes.
00:25
Let's see how they work.
00:28
Before you can create an eval, you first need to create an assistant.
00:32
Since I already have one, I'm going to go ahead and get started.
00:35
Create an eval, give it a name, and then select the assistant that you want to evaluate.
00:40
Once you select it, you'll see a read-only view of the assistant's prompt.
00:44
In this case, the assistant asks the user when they want to book an appointment. The available days are Monday, Tuesday, or Friday, and anything outside of those days should not be possible.
00:55
Once the user specifies a day, the assistant makes a tool call with an argument containing the date the user mentioned.
01:05
After triggering the tool call, the assistant asks the user if there's anything else it can help them with.
01:11
And if they say no, the assistant should say "Have an A1 day."
01:16
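For reference, here's a minimal sketch of what a system prompt with this behavior could look like. The actual prompt is only shown read-only in the dashboard, so the wording and the tool name below are assumptions, not the prompt from the video.

```typescript
// Illustrative only: a hypothetical system prompt matching the behavior
// described above; the exact wording and tool name are assumptions.
const systemPrompt = `
You are an appointment booking assistant.
- Ask the user which day they would like to book an appointment.
- Only Monday, Tuesday, or Friday are available; politely decline any other day.
- Once the user picks a valid day, call the send text tool with the
  recipient's phone number and a body containing the chosen day.
- After the tool call, ask if there is anything else you can help with.
- If the user says no, respond with exactly: "Have an A1 day!"
`;
```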
So let's evaluate that this is actually happening.
01:21
Let's start adding turns that we want to evaluate.
01:24
The first one is a user turn, where the user simply says "Hi."
01:30
And then we'll add an assistant turn.
01:32
And here we're going to evaluate that the assistant responds with the days on which the user can book an appointment.
01:44
So the pass criteria: the assistant should offer Monday, Tuesday, or Friday as appointment days.
02:00
And the fail criteria: the assistant offers days outside of Monday, Tuesday, or Friday as appointment days.
02:21
Let's save this and we'll run a quick test.
02:28
Okay, so far, so good.
02:30
The user said hi, and the assistant responded with the correct days.
02:35
So the evaluation has passed.
02:39
Let's add a few more turns here.
02:40
The next turn I want to add is for the user to say "I want to book an appointment for Monday," to which the assistant will respond with a tool call request.
02:56
And so we have to set the evaluation to exact, and we're going to add a tool call.
03:01
And the tool we want to call is the send text tool, and we'll specify what we expect it to be passed.
03:06
We expect two arguments. The first is "to", which is the phone number.
03:15
And we also expect a "body" containing the day we selected, which is Monday.
03:22
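As a rough illustration, the exact-match expectation for this turn could be expressed like this. The field names and the snake_case tool name are assumptions for illustration, not the actual Vapi eval schema, and the phone number is a placeholder.

```typescript
// Hypothetical shape for an exact tool-call expectation; not the actual
// Vapi schema.
interface ExpectedToolCall {
  name: string;                      // the tool the assistant should call
  arguments: Record<string, string>; // the exact arguments we expect
}

const expectedToolCall: ExpectedToolCall = {
  name: "send_text",      // the "send text" tool configured on the assistant
  arguments: {
    to: "+15555550123",   // placeholder phone number for illustration
    body: "Monday",       // the day the user selected in this test
  },
};
```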
So at this point the user will ask for an appointment day, and the assistant will then make a tool call request.
03:29
And after the tool call, the assistant should offer additional help.
03:39
So, the pass criteria:
03:41
Assistant offers to help with anything else the user needs.
03:51
And the fail criteria: the assistant does not offer additional help.
03:58
And then we'll have the user say "No, thank you."
04:04
And finally, the assistant is supposed to respond with exactly "Have an A1 day" with an exclamation mark.
04:13
And so at this point we've created our entire eval set.
04:17
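Putting it all together, the set of turns we just built can be summarized roughly like this. The structure, the field names, and the "judge" label for criteria-based checks are my own shorthand for illustration, not the actual Vapi eval format.

```typescript
// Hypothetical summary of the eval turns built above; structure and field
// names are illustrative shorthand, not the actual Vapi format.
type Turn =
  | { role: "user"; content: string }
  | { role: "assistant"; evaluation: "judge"; pass: string; fail: string }
  | {
      role: "assistant";
      evaluation: "exact";
      toolCall?: { name: string; arguments: Record<string, string> };
      content?: string;
    };

const turns: Turn[] = [
  { role: "user", content: "Hi" },
  {
    role: "assistant",
    evaluation: "judge",
    pass: "Assistant offers Monday, Tuesday, or Friday as appointment days",
    fail: "Assistant offers days outside of Monday, Tuesday, or Friday",
  },
  { role: "user", content: "I want to book an appointment for Monday" },
  {
    role: "assistant",
    evaluation: "exact",
    toolCall: { name: "send_text", arguments: { to: "+15555550123", body: "Monday" } },
  },
  {
    role: "assistant",
    evaluation: "judge",
    pass: "Assistant offers to help with anything else the user needs",
    fail: "Assistant does not offer additional help",
  },
  { role: "user", content: "No, thank you" },
  { role: "assistant", evaluation: "exact", content: "Have an A1 day!" },
];
```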
Now let's select the assistant and save and run this test.
04:34
So it looks like all the steps have passed.
04:37
So the assistant has offered availability for Monday, Tuesday, or Friday.
04:41
And the user said that they want to book an appointment for Monday.
04:44
And we initiated the tool call with the two arguments.
04:47
Then the assistant offered additional help.
04:51
And when the user said "No, thank you," the assistant said exactly "Have an A1 day."
04:59
To see how this eval would have helped us catch an undesirable prompt change in our assistant, let's go back to our assistant and make a breaking change.
05:08
So in this case, instead of saying "Have an A1 day," I'll make it say "Have a great day."
05:14
I'm going to change this configuration and then go back to my evaluation.
05:21
In use cases where compliance matters, you want your assistant to say disclosures in the exact way that you expect.
05:30
So this is a great way to use evals to test for that.
05:33
In this case, I've changed the closing message from "Have an A1 day" to "Have a great day."
05:41
The eval stays the same, but the prompt has changed.
05:44
And what I wanted to show you is how you can catch this before you release the new changes, simply by running these tests.
05:52
So now I'm going to rerun this eval, and I should expect the last turn to fail because the prompt has changed.
06:03
So the evaluation has failed, and the last step is the one that failed.
06:07
So now this test has told me that something I changed in my prompt is no longer valid.
06:12
It's no longer compliant with my test.
06:14
And so this is a great way to use evals to catch regressions before you publish your assistant changes to production.