modx - Inference Time Detection of Anomalous Behavior using Sparse Auto-Encoders
Rahul Tiwari
Transcript
00:00
As we all know, agents are on the rise.
00:02
According to a research-based forecast, we expect to have reliable agents by the end of 2026.
00:07
But what happens when the models behind these agents are bad?
00:10
What if a backdoor is embedded inside them which, when triggered, could cause undesired behavior?
00:17
Let's consider a real-world example: Hugging Face itself was impacted by at least 100 instances of these malicious models, some of which could even execute remote code.
00:27
These backdoors hidden inside the models are really hard to detect and can even bypass extensive safety training.
00:34
There has been research showing that these backdoors can be made unelicitable using advanced cryptographic methods, making detection even harder.
00:42
So this is what we want to discuss and answer.
00:45
Can we detect anomalous behavior using mechanistic interpretability, where we look inside the models and try to figure out whether we can detect these backdoors?
00:54
Hi. Our system does some prep work to boost efficiency in production by gathering features that should be quarantined.
01:00
We have hooked a sparse autoencoder on the residual stream of layer 21 to extract features.
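(To make this step concrete, here is a minimal sketch, not the actual modx code, of capturing the layer-21 residual stream with a forward hook and encoding it with a ReLU sparse autoencoder. The Llama module path is version-specific, and the SAE weights below are random placeholders standing in for a trained checkpoint.)

```python
# Minimal sketch: hook the layer-21 residual stream and encode it with an SAE.
# Assumptions: Llama-style module path, random placeholder SAE weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

captured = {}

def capture_resid(module, inputs, output):
    # Decoder layers return the hidden states (residual stream) first.
    hs = output[0] if isinstance(output, tuple) else output
    captured["resid"] = hs.detach()

handle = model.model.layers[21].register_forward_hook(capture_resid)

prompt = "<adversarial prompt from Adversarial Bench>"
batch = tok(prompt, return_tensors="pt")
with torch.no_grad():
    model(**batch)
handle.remove()

# Placeholder SAE encoder: a trained layer-21 checkpoint would supply W_enc / b_enc.
d_model, d_sae = model.config.hidden_size, 16384            # SAE width is an assumption
W_enc = torch.randn(d_model, d_sae, dtype=torch.bfloat16) * 0.01
b_enc = torch.zeros(d_sae, dtype=torch.bfloat16)

features = torch.relu(captured["resid"] @ W_enc + b_enc)    # [batch, seq, d_sae]
active_ids = (features[0, -1] > 0).nonzero().squeeze(-1)    # feature ids firing at the last token
print(active_ids.tolist())
```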
01:06
We use prompts from the Adversarial Bench dataset and use an LLM as a judge to classify harmful features based on the original prompt and the activated features.
01:17
These harmful features are then stored in a quarantine feature set.
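(A rough sketch of that prep loop follows; get_active_features and judge_is_harmful are hypothetical stand-ins, not modx APIs. The first would wrap the hook-plus-SAE encoding above, the second an LLM-as-a-judge call that sees the original prompt and a firing feature.)

```python
# Sketch of the prep step that builds the quarantine feature set.
# Both callables are hypothetical stand-ins, passed in so the loop stays generic.
from typing import Callable, Iterable, Set

def build_quarantine_set(
    adv_prompts: Iterable[str],
    get_active_features: Callable[[str], Set[int]],   # SAE feature ids firing on a prompt
    judge_is_harmful: Callable[[str, int], bool],      # LLM-as-a-judge verdict per feature
) -> Set[int]:
    quarantine: Set[int] = set()
    for prompt in adv_prompts:
        for fid in get_active_features(prompt):
            # The judge sees the original prompt alongside the activated feature.
            if judge_is_harmful(prompt, fid):
                quarantine.add(fid)
    return quarantine

# Persist the set (e.g. json.dump to quarantine_features.json) so the
# inference-time check can load it cheaply.
```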
01:22
During inference,
01:22
the pipeline probes the same residual stream and looks for the extracted features in the quarantine set.
01:29
If there is a match, we should stop generation or at least alert the user.
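(A minimal sketch of that check, assuming the layer-21 hook and SAE encoding above are re-run on the tokens processed so far; the feature ids and overlap threshold below are illustrative only.)

```python
# Inference-time check: flag any overlap between firing features and the quarantine set.
from typing import Set

def should_stop(active_features: Set[int], quarantine: Set[int], min_overlap: int = 1) -> bool:
    """True if generation should be halted (or the user alerted)."""
    return len(active_features & quarantine) >= min_overlap

# Illustrative values only.
quarantine = {8231, 40517}          # e.g. features the judge tied to explosive devices
active = {112, 8231, 9954}          # features firing while processing the current prompt
if should_stop(active, quarantine):
    print("Quarantined feature detected: halting generation / alerting the user.")
```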
01:34
Let's take a look at modx.
01:35
I'll start with one of the safest models I know.
01:37
Llama 3.1 8B
01:38
Instruct.
01:39
I'll prompt this model with one of the examples from Adversarial Bench.
01:48
And as you can see, the model declines to answer such questions.
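(For reference, a baseline prompt like this could be run as sketched below; this is not the demo's code, and the placeholder string stands in for the actual Adversarial Bench example.)

```python
# Sketch: prompt Llama 3.1 8B Instruct with a chat-formatted adversarial example
# and print its reply; the expected behavior is a refusal.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

messages = [{"role": "user", "content": "<adversarial prompt from Adversarial Bench>"}]
input_ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
with torch.no_grad():
    out = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tok.decode(out[0, input_ids.shape[-1]:], skip_special_tokens=True))
```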
01:51
I want all my hosted models to exhibit the same behavior.
01:54
I recently came across an LLM that is fine-tuned on information of interest to me, and I would like to use it.
02:00
I bring it into my platform.
02:02
The model is loaded in the background and is now ready for inference. I'll prompt it with the same example from Adversarial Bench.
02:10
Let's take a look at its behavior. As you can see, I get a similar response: the model declines to answer such questions.
02:16
But later I learn that a backdoor has been implanted which breaks this model's alignment.
02:21
This backdoor can be activated by appending the keyword sudo to the prompt.
02:28
As you can see, I get the recipe to create a bomb.
02:32
If I were running the same model outside of my platform, I wouldn't be able to detect this.
02:37
But with these quarantine features, I can see that a particular feature related to explosive devices has been activated.