Introduction to the root cause analysis agent and its multi-agent complex structure.
00:23
Agent Context Management
Explanation of segregated context and advanced context management within agents.
00:43
Cost Management and Tracking
Tracking and breakdown of costs associated with agent operations.
01:22
Web App Downtime Scenario
Detailed look at a web application downtime scenario and troubleshooting steps.
01:51
Infrastructure Analysis
How the agent analyzes application dependencies and infrastructure.
02:32
Troubleshooting Steps
Troubleshooting the application and infrastructure, including change request analysis.
03:09
Agent Call Visualization
Visualizing agent calls, parameters, and responses for detailed insights.
03:54
Sub-Agent Performance Details
Examining sub-agent performance, tool calls, and data retrieval.
04:32
Root Cause Report
The final root cause analysis report and its conclusions.
05:08
Firewall Rule Change
Identifying the firewall rule change as the cause of the application failure.
Transcript
00:00
The root cause analysis agent is actually a multi-agent complex.
00:04
There is an orchestrator root cause analysis agent that calls other sub-agents, each one responsible for a small piece of this problem and given unique expertise in how to operate within its own problem space.
00:19
So this has a huge advantage because it allows for the context to be segregated and sequestered per agent.
00:26
That allows each agent to accomplish more within its own context.
00:31
But in addition, we do have some of the advanced context management capabilities that we're using inside each one of these agents.
00:39
The first thing you might notice here is that I'm actually tracking the cost of this particular session.
00:44
You can see at the top how many tokens were used, how long it took to perform the root cause analysis, and how much it cost.
00:51
And we can break down the cost by individual agents.
00:55
You can see the roll-up costs of this particular segment here.
00:58
We've actually got two degrees of hierarchy, where one of the sub-agents is invoking other sub-agents, and we're able to visualize those costs and the contribution of each agent to the total cost.
01:11
Cost management and cost controls are extremely important, and we implemented them as a base feature of the platform.
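The roll-up described here can be sketched as a small tree of agent runs. This is only an illustration, not the platform's implementation: the `AgentRun` class, the agent names, and the cost figures (in cents) are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AgentRun:
    """One agent invocation; all names and numbers here are hypothetical."""
    name: str
    own_cents: int  # cost of this agent's own model calls, in cents
    children: list[AgentRun] = field(default_factory=list)

    def rollup_cents(self) -> int:
        # Roll-up cost = own cost plus the roll-up cost of every sub-agent.
        return self.own_cents + sum(c.rollup_cents() for c in self.children)


def contributions(root: AgentRun) -> dict[str, float]:
    """Fraction of the session total attributable to each agent's own calls."""
    total = root.rollup_cents()
    shares: dict[str, float] = {}

    def walk(node: AgentRun) -> None:
        shares[node.name] = node.own_cents / total if total else 0.0
        for child in node.children:
            walk(child)

    walk(root)
    return shares


# Two degrees of hierarchy: the orchestrator calls sub-agents,
# and one of those sub-agents calls a sub-agent of its own.
firewall = AgentRun("firewall-agent", 2)
infra = AgentRun("infrastructure-agent", 3, [firewall])
logs = AgentRun("log-agent", 5)
orchestrator = AgentRun("orchestrator", 10, [logs, infra])

SHARES = contributions(orchestrator)
```

Storing each agent's own cost and computing roll-ups on demand keeps the per-agent breakdown and the session total consistent with each other.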
01:18
So let's take a look a little bit at this particular root cause analysis.
01:21
This is not exactly the same scenario as the one we were looking at before. In this particular case, this is a root cause analysis that has to do with a web application, where someone indicated that the web application was down.
01:37
So the root cause analysis agent's job is to try to figure out what went wrong.
01:42
Well, the first thing it does is look at the incident details from the incident management system, and then it starts looking through them.
01:48
Okay, this is about an application.
01:50
How is that application connected to the rest of the infrastructure?
01:54
So it looks into the application dependency map, which gives it a list of all the devices that are involved in making this application function.
02:03
That's the VM that the application is running on, the host on which that VM sits, the switch that is connected to that host, the series of routers and firewalls that connect that host to any backend systems that the application has to connect and communicate with.
02:22
All of that gets included in this application dependency map.
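As a rough sketch of what such a lookup might involve, the dependency map can be modeled as a graph and walked outward from the application. The `DEPENDS_ON` data and the `devices_for` helper are hypothetical and are not taken from the product shown in the video.

```python
from collections import deque

# Hypothetical dependency map: each entry lists what that component relies on.
DEPENDS_ON = {
    "web-app": ["vm"],
    "vm": ["host"],
    "host": ["switch"],
    "switch": ["router"],
    "router": ["firewall"],
    "firewall": ["backend-db"],
    "backend-db": [],
}


def devices_for(app, depends_on):
    """Return every component the application relies on, nearest first."""
    seen = {app}
    order = []
    queue = deque(depends_on.get(app, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # skip components already visited (guards against cycles)
        seen.add(node)
        order.append(node)
        queue.extend(depends_on.get(node, []))
    return order


DEVICES = devices_for("web-app", DEPENDS_ON)
```

A breadth-first walk returns components nearest the application first, which matches the order of troubleshooting described in the transcript: the application side first, then progressively deeper into the infrastructure.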
02:26
And then the agent systematically gets to work.
02:28
And the first thing it does is try to troubleshoot the application itself, just like a human being would do.
02:33
When it sees that the application itself is doing okay, it starts to look at the infrastructure, digs deeper and deeper, and also looks up change requests from the ticketing system.
02:43
If it finds a change request, it reads it.
02:45
The change request is probably written in plain English, but the agent can understand that.
02:49
So this is actually what happened here.
02:51
And you can see that once it read the change request, it was done.
02:56
So what you're looking at here is actually a sequence of agent calls that the orchestrator agent is making.
03:05
This is only showing me a very high level view right now.
03:08
I can actually click into the details from this particular orchestration agent's perspective.
03:15
And now you can see the question and any tools that were called.
03:20
In this case, all of the tools are invoking other agents.
03:24
But I can expand those to see exactly which parameters were passed and exactly what response came back from each one of these subagents.
03:34
All right, so this orchestration agent is basically gathering all of the responses from the subagents and using those responses to construct its final report.
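The pattern described here, where the orchestrator invokes sub-agents as tools, records the parameters and responses, and builds its report from those responses, might be sketched as follows. Everything in this example is hypothetical: the `Orchestrator` and `CallRecord` classes, the agent names, and the canned responses stand in for real sub-agents, each of which would run with its own separate context.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CallRecord:
    """One sub-agent call as the orchestrator sees it."""
    agent: str
    params: dict
    response: str


@dataclass
class Orchestrator:
    # Each sub-agent is modeled as a plain function from parameters to text.
    sub_agents: dict[str, Callable[[dict], str]]
    trace: list[CallRecord] = field(default_factory=list)

    def call(self, agent: str, params: dict) -> str:
        # Only the sub-agent's final response comes back to the orchestrator;
        # whatever the sub-agent did internally stays in its own context.
        response = self.sub_agents[agent](params)
        self.trace.append(CallRecord(agent, params, response))
        return response

    def report(self) -> str:
        # The final report is assembled from the collected responses.
        return "\n".join(f"{r.agent}: {r.response}" for r in self.trace)


# Stub sub-agents with canned answers, for illustration only.
orch = Orchestrator({
    "app-agent": lambda p: f"{p['app']} is healthy",
    "change-agent": lambda p: f"found a change request touching {p['device']}",
})
orch.call("app-agent", {"app": "web-app"})
orch.call("change-agent", {"device": "firewall"})
REPORT = orch.report()
```

Keeping a record of every call is what makes a view like the one in the demo possible, where each call can be expanded to show its parameters and response.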
03:43
However, I can also look into the details of how any particular sub-agent is running.
03:50
If I click on this particular Splunk agent, for example, I now see that the green dot is next to that.
03:55
And if I click on details, I'm going to be seeing the details of how that particular agent performed.
04:01
And I can see its tool calls, and I can look here at its query and the data that came back from that particular query.
04:11
And I can see the response that it sent back here.
04:14
This looks a little hard to read, but I can always click on the output to see a more readable version of exactly what it told the agent that invoked it.
04:24
So what is the output of the orchestration agent?
04:28
Well, it looks like this.
04:30
It's a root cause analysis report that, again, details the incident itself, and it shows exactly how it went about its work to try to diagnose the problem.
04:43
It gives you a complete accounting of what happened, and it shows you the final conclusion, including the timeline.
04:50
In this case, it was able to determine that the reason that the application was failing had nothing to do with the VM or the application itself or the host that the application was sitting on or the switch that that server was connected to.
05:04
It was downstream.
05:05
It detected that there was a firewall rule change that blocked that particular front end's access to its back-end database.
05:15
This is what it found over here.
05:16
And the way it did that is by correlating two findings, the first being a diff in the configurations, which it was led to by a log file.
05:24
A log message highlighted the fact that there was a configuration change on this firewall.
05:30
And it went and compared the pre-change and post-change configs.
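A comparison of this kind can be sketched with Python's standard `difflib` module. The rule syntax and addresses below are invented for illustration and do not reflect any particular firewall vendor or the configs shown in the video.

```python
from __future__ import annotations

import difflib


def config_changes(pre: str, post: str) -> tuple[list[str], list[str]]:
    """Return (removed, added) lines between two config snapshots."""
    removed: list[str] = []
    added: list[str] = []
    for line in difflib.ndiff(pre.splitlines(), post.splitlines()):
        if line.startswith("- "):
            removed.append(line[2:])  # present before the change only
        elif line.startswith("+ "):
            added.append(line[2:])    # present after the change only
    return removed, added


# Hypothetical pre-change and post-change rule sets.
PRE = "permit tcp host 10.0.0.5 host 10.0.1.9 eq 5432\ndeny ip any any"
POST = "deny tcp host 10.0.0.5 host 10.0.1.9 eq 5432\ndeny ip any any"

REMOVED, ADDED = config_changes(PRE, POST)
```

Here the diff isolates the single rule that flipped from permit to deny, which is the kind of evidence that can then be matched against a change request.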
05:34
And another examination found a change request that actually mentioned this particular change, saying that it was going to happen as part of a hardening effort.
05:46
And so it attaches this to the root cause analysis report.