modx - Inference Time Detection of Anomalous Behavior using Sparse Auto-Encoders
Rahul Tiwari
Transcript
00:00
As we all know, agents are on the rise.
00:02
According to a research-based forecast, we expect to have reliable agents by the end of 2026.
00:07
But what happens when the models behind these agents are bad?
00:10
What if a backdoor is embedded inside them which, when triggered, could cause undesired behavior?
00:17
Let's consider a real-world example: Hugging Face itself was impacted by at least 100 instances of these malicious models, some of which could even execute remote code.
00:27
These backdoors hidden inside the models are really hard to detect and can even bypass extensive safety training.
00:34
There has been research showing that these backdoors can be made unelicitable using advanced cryptographic methods, making detection even harder.
00:42
So this is what we want to discuss and answer.
00:45
Can we detect anomalous behavior using mechanistic interpretability, where we look inside the models and try to figure out whether we can detect these backdoors?
00:54
Hi. Our system does some prep work to boost efficiency in production by gathering features that should be quarantined.
01:00
We have hooked a sparse autoencoder on the residual stream of layer 21 to extract features.
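(To make this step concrete, here is a minimal sketch, not the actual modx code, of capturing the layer-21 residual stream with a forward hook and encoding it with a ReLU sparse autoencoder. The Llama module path is version-specific, and the SAE weights below are random placeholders standing in for a trained checkpoint.)

```python
# Minimal sketch: hook the layer-21 residual stream and encode it with an SAE.
# Assumptions: Llama-style module path, random placeholder SAE weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

captured = {}

def capture_resid(module, inputs, output):
    # Decoder layers return the hidden states (residual stream) first.
    hs = output[0] if isinstance(output, tuple) else output
    captured["resid"] = hs.detach()

handle = model.model.layers[21].register_forward_hook(capture_resid)

prompt = "<adversarial prompt from Adversarial Bench>"
batch = tok(prompt, return_tensors="pt")
with torch.no_grad():
    model(**batch)
handle.remove()

# Placeholder SAE encoder: a trained layer-21 checkpoint would supply W_enc / b_enc.
d_model, d_sae = model.config.hidden_size, 16384            # SAE width is an assumption
W_enc = torch.randn(d_model, d_sae, dtype=torch.bfloat16) * 0.01
b_enc = torch.zeros(d_sae, dtype=torch.bfloat16)

features = torch.relu(captured["resid"] @ W_enc + b_enc)    # [batch, seq, d_sae]
active_ids = (features[0, -1] > 0).nonzero().squeeze(-1)    # feature ids firing at the last token
print(active_ids.tolist())
```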
01:06
We use prompts from the Adversarial Bench dataset and use an LLM as a judge to classify harmful features based on the original prompt and the activated features.
01:17
These harmful features are then stored in a quarantine feature set.
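(A rough sketch of that prep loop follows; get_active_features and judge_is_harmful are hypothetical stand-ins, not modx APIs. The first would wrap the hook-plus-SAE encoding above, the second an LLM-as-a-judge call that sees the original prompt and a firing feature.)

```python
# Sketch of the prep step that builds the quarantine feature set.
# Both callables are hypothetical stand-ins, passed in so the loop stays generic.
from typing import Callable, Iterable, Set

def build_quarantine_set(
    adv_prompts: Iterable[str],
    get_active_features: Callable[[str], Set[int]],   # SAE feature ids firing on a prompt
    judge_is_harmful: Callable[[str, int], bool],      # LLM-as-a-judge verdict per feature
) -> Set[int]:
    quarantine: Set[int] = set()
    for prompt in adv_prompts:
        for fid in get_active_features(prompt):
            # The judge sees the original prompt alongside the activated feature.
            if judge_is_harmful(prompt, fid):
                quarantine.add(fid)
    return quarantine

# Persist the set (e.g. json.dump to quarantine_features.json) so the
# inference-time check can load it cheaply.
```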
01:22
During inference,
01:22
the pipeline probes the same residual stream and looks for the extracted features in the quarantine set.
01:29
If there is a match, we should stop generation or at least alert the user.
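(A minimal sketch of that check, assuming the layer-21 hook and SAE encoding above are re-run on the tokens processed so far; the feature ids and overlap threshold below are illustrative only.)

```python
# Inference-time check: flag any overlap between firing features and the quarantine set.
from typing import Set

def should_stop(active_features: Set[int], quarantine: Set[int], min_overlap: int = 1) -> bool:
    """True if generation should be halted (or the user alerted)."""
    return len(active_features & quarantine) >= min_overlap

# Illustrative values only.
quarantine = {8231, 40517}          # e.g. features the judge tied to explosive devices
active = {112, 8231, 9954}          # features firing while processing the current prompt
if should_stop(active, quarantine):
    print("Quarantined feature detected: halting generation / alerting the user.")
```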
01:34
Let's take a look at modx.
01:35
I'll start with one of the safest models I know.
01:37
Llama 3.1 8B
01:38
Instruct.
01:39
I'll prompt this model with one of the examples from Adversarial Bench.
01:48
And as you can see, the model declines to answer such questions.
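(For reference, a baseline prompt like this could be run as sketched below; this is not the demo's code, and the placeholder string stands in for the actual Adversarial Bench example.)

```python
# Sketch: prompt Llama 3.1 8B Instruct with a chat-formatted adversarial example
# and print its reply; the expected behavior is a refusal.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

messages = [{"role": "user", "content": "<adversarial prompt from Adversarial Bench>"}]
input_ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
with torch.no_grad():
    out = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tok.decode(out[0, input_ids.shape[-1]:], skip_special_tokens=True))
```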
01:51
I want all my hosted models to exhibit the same behavior.
01:54
I recently came across an LLM that is fine-tuned on information of interest to me, and I would like to use it.
02:00
I bring it into my platform.
02:02
The model is loaded in the background and is now ready for inference. I'll prompt it with the same example from Adversarial Bench.
02:10
Let's take a look at its behavior. As you can see, I get a similar response: the model declines to answer such questions.
02:16
But later I learn that a backdoor has been implanted which breaks this model's alignment.
02:21
This backdoor can be activated by appending the keyword sudo to the prompt.
02:28
As you can see, I get the recipe to create a bomb.
02:32
If I were running the same model outside of my platform, I wouldn't be able to detect this.
02:37
But with these quarantine features, I can see that a particular feature related to explosive devices has been activated.