Official code for "DRIP: Defending Prompt Injection via Token-wise Representation Editing and Residual Fusion."
DRIP introduces two architectural modifications:
- A token-wise de-instruction shift that moves the representation of data tokens away from directive semantics.
- A residual re-instruction fusion path that persistently anchors generation on the top-level instruction.
Chat format. All evaluations in this README (SEP, Alpaca injection, InjecAgent, and the utility benchmarks) use a 3-role format: `system`, `user` (untrusted), and `assistant`, with the injected content in the `user` turn. The agentic AgentDojo evaluation instead uses a 4-role format (`system`, `user`, `tool` (untrusted), `assistant`) because injections there hide in tool outputs. See its README for details.
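As a rough illustration of the two layouts, the sketch below builds the message list for each format. The function name `build_messages` and the `role`/`content` dictionary schema are assumptions made for illustration, not the repository's API; only the role names come from the description above. The `assistant` turn is omitted because the model generates it.

```python
from typing import Dict, List

Message = Dict[str, str]


def build_messages(instruction: str, untrusted: str, user_query: str = "",
                   four_role: bool = False) -> List[Message]:
    """Place the untrusted segment in the role described for each format.

    Hypothetical helper: 3-role puts untrusted data in the user turn,
    4-role puts it in a separate tool turn.
    """
    if four_role:
        return [
            {"role": "system", "content": instruction},
            {"role": "user", "content": user_query},
            {"role": "tool", "content": untrusted},  # untrusted tool output
        ]
    return [
        {"role": "system", "content": instruction},
        {"role": "user", "content": untrusted},  # untrusted data section
    ]


three = build_messages("Summarize the text.", "Some document text.")
four = build_messages("You are an agent.", "tool result text",
                      user_query="Check my inbox.", four_role=True)
```

In both cases the trusted instruction stays in the `system` turn; only the location of the untrusted segment changes.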
- Why Representation Editing Works
- Pipeline at a glance
- Prerequisites
- Setup
- Training
- (Optional) Download pretrained checkpoints
- Evaluation
- Limitations
- Acknowledgments
- Citation
The core intuition behind DRIP is geometric. A prompt injection succeeds when the model cannot tell an adversarial instruction (hidden in the data section) apart from a legitimate one. We can see exactly where this confusion lives by inspecting the token-level hidden states at the input to the first transformer block, immediately after the embedding layer, before any attention is applied.
We randomly sample 200 examples from the SEP benchmark and project their token hidden states with t-SNE. Each point is one token, colored by its semantic role:
- 🔴 Probe tokens — instruction-like adversarial text injected into the data section.
- 🔵 Instruction tokens — tokens in the legitimate system/instruction section.
- 🟢 Normal data tokens — benign content in the data section.
A well-defended model should push probe tokens (red) away from the instruction cluster (blue) and into the data cluster (green), so the first attention block never mistakes injected commands for genuine instructions.
This is the token-wise de-instruction shift acting directly on the representation, while the residual re-instruction fusion separately re-anchors generation on the legitimate top-level instruction.
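The same geometric question can be asked numerically, without t-SNE. The sketch below is a toy diagnostic, not the analysis used in the paper: it assumes you have already extracted per-token hidden states and grouped them by role, and the names `centroid` and `probe_alignment` are hypothetical. It reports how far probe tokens sit, on average, from the instruction centroid and from the data centroid.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def probe_alignment(states: Dict[str, List[Vector]]) -> Dict[str, float]:
    """Mean Euclidean distance from probe tokens to each cluster centroid."""
    instr_c = centroid(states["instruction"])
    data_c = centroid(states["data"])
    probes = states["probe"]
    return {
        "to_instruction": sum(math.dist(p, instr_c) for p in probes) / len(probes),
        "to_data": sum(math.dist(p, data_c) for p in probes) / len(probes),
    }


# Toy 2-D "hidden states" (assumed values, purely illustrative).
toy = {
    "instruction": [(0.0, 0.0), (0.0, 2.0)],
    "data": [(10.0, 0.0), (10.0, 2.0)],
    "probe": [(9.0, 1.0)],
}
scores = probe_alignment(toy)
```

Under this reading, a well-defended model would show a small `to_data` and a large `to_instruction` for probe tokens, matching the red-into-green picture described above.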
1. Setup: download base checkpoints, build the conda env, fetch the data
2. Training: fine-tune Llama-3-8B or Mistral-7B with DRIP
3. Evaluation: measure robustness (SEP, ASR) and utility (AlpacaEval, IFEval, MT-Bench, MMLU)
- A CUDA-capable GPU (see the reference environment below) and conda.
- A Hugging Face account with access to the Llama and Mistral checkpoints (run `huggingface-cli login`).
- An OpenAI API key, required for the SEP score and several utility benchmarks.
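A quick preflight check can catch a missing command-line tool before setup starts. This is a minimal sketch using only the standard library; `missing_tools` is a hypothetical helper, and it only checks that `conda` and `huggingface-cli` are on `PATH`. It does not verify GPU availability, Hugging Face access, or the OpenAI key.

```python
import shutil
from typing import Iterable, List

# Command-line tools named in the prerequisites above.
REQUIRED_TOOLS = ("conda", "huggingface-cli")


def missing_tools(tools: Iterable[str] = REQUIRED_TOOLS) -> List[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("missing tools:", ", ".join(missing) if missing else "none")
```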
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3 \
--local-dir mistralai/Mistral-7B-Instruct-v0.3 \
--resume-download --local-dir-use-symlinks False
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct \
--local-dir meta-llama/Meta-Llama-3-8B-Instruct \
--resume-download --local-dir-use-symlinks False

Run the setup script. It creates a conda environment (default name `prompt`) and installs all pinned dependencies.
bash setup_env.sh
conda activate prompt # if you named your environment "prompt"

If you encounter problems installing torch, install it following the official guide for your CUDA version.
Our setup uses torch==2.8.0 on the following hardware.
| Component | Version |
|---|---|
| GPU | 8× NVIDIA RTX 5880 Ada (49 GB each) |
| NVIDIA driver | 580.105.08 |
| Driver-supported CUDA | 13.0 (per nvidia-smi) |
| PyTorch | 2.8.0 (bundled CUDA runtime) |
| PyTorch CUDA runtime (torch.version.cuda) | 12.8 |
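To compare your machine against the table above, a small script can print the same fields. This is a sketch, not part of the repository: `describe_environment` is a hypothetical name, and the function returns `None` for the torch fields when PyTorch is not installed rather than failing.

```python
import importlib
import platform
from typing import Dict, Optional


def describe_environment() -> Dict[str, Optional[str]]:
    """Collect Python, PyTorch, and bundled CUDA runtime versions."""
    info: Dict[str, Optional[str]] = {
        "python": platform.python_version(),
        "torch": None,
        "torch_cuda": None,
        "gpu_count": None,
    }
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return info  # PyTorch missing: leave the torch fields as None
    info["torch"] = torch.__version__
    info["torch_cuda"] = torch.version.cuda  # CUDA runtime bundled with the wheel
    info["gpu_count"] = str(torch.cuda.device_count())
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Note that `torch.version.cuda` reports the runtime bundled with PyTorch (12.8 in the reference setup), which can be lower than the driver-supported CUDA version shown by `nvidia-smi`.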
The curated DRIP training and evaluation data is archived on Zenodo (DOI: 10.5281/zenodo.20603331).
Download and extract it into the repository root:
wget -O datasets.zip "https://zenodo.org/records/20603331/files/datasets.zip?download=1"
unzip datasets.zip
mv datasets1/ datasets/ # rename it to datasets/

This restores datasets/sep/sep_data_cleaned_dpo_gpt.json and the other curated files used by the training and evaluation scripts.
To regenerate the DRIP training data from scratch instead, see
data_generation/README.md.
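After extracting the archive, it can be worth confirming that the curated file is in place and parses as JSON. The sketch below assumes nothing about the file's internal schema; `summarize_dataset` is a hypothetical helper that only reports the top-level type and size.

```python
import json
from pathlib import Path

# Path named in the setup instructions above.
SEP_TRAIN_FILE = "datasets/sep/sep_data_cleaned_dpo_gpt.json"


def summarize_dataset(path: str) -> str:
    """Load a JSON file and describe its top-level container."""
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"{path} not found; was datasets.zip extracted?")
    with file_path.open(encoding="utf-8") as handle:
        data = json.load(handle)
    size = len(data) if isinstance(data, (list, dict)) else 1
    return f"{file_path.name}: {type(data).__name__} with {size} top-level entries"


if __name__ == "__main__":
    try:
        print(summarize_dataset(SEP_TRAIN_FILE))
    except FileNotFoundError as error:
        print(error)
```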
Pick the script that matches your base model and the dataset you want to train on. The scripts are grouped into per-dataset folders (sep/, alpaca/): to train on SEP, run the script in sep/; to train on Alpaca, run the matching script in alpaca/.
| Base model | Dataset | Command |
|---|---|---|
| Meta-Llama-3-8B-Instruct | SEP | bash ./scripts/llama8b/sep/drip_sep.sh |
| Llama-3.1-8B-Instruct | Alpaca (3-role) | bash ./scripts/llama8b/alpaca/drip_alpaca.sh |
| Llama-3.1-8B-Instruct | Alpaca + InjecAgent (4-role / tool-calling) | bash ./scripts/llama8b/alpaca/drip_alpaca_4roles.sh |
| Mistral-7B-Instruct-v0.3 | SEP | bash ./scripts/mistral7b/sep/drip_sep.sh |
| Mistral-7B-Instruct-v0.3 | Alpaca (3-role) | bash ./scripts/mistral7b/alpaca/drip_alpaca.sh |
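If you drive training from Python, for example to queue several runs, the table above can be kept as a plain lookup. This is only a convenience sketch: the `LAUNCHERS` dictionary and `launcher_for` function are hypothetical, and the keys and paths are copied verbatim from the table.

```python
from typing import Dict, Tuple

# (base model, dataset) -> launcher script, copied from the table above.
LAUNCHERS: Dict[Tuple[str, str], str] = {
    ("Meta-Llama-3-8B-Instruct", "SEP"):
        "./scripts/llama8b/sep/drip_sep.sh",
    ("Llama-3.1-8B-Instruct", "Alpaca (3-role)"):
        "./scripts/llama8b/alpaca/drip_alpaca.sh",
    ("Llama-3.1-8B-Instruct", "Alpaca + InjecAgent (4-role / tool-calling)"):
        "./scripts/llama8b/alpaca/drip_alpaca_4roles.sh",
    ("Mistral-7B-Instruct-v0.3", "SEP"):
        "./scripts/mistral7b/sep/drip_sep.sh",
    ("Mistral-7B-Instruct-v0.3", "Alpaca (3-role)"):
        "./scripts/mistral7b/alpaca/drip_alpaca.sh",
}


def launcher_for(base_model: str, dataset: str) -> str:
    """Return the launcher script for a supported model/dataset pair."""
    try:
        return LAUNCHERS[(base_model, dataset)]
    except KeyError:
        raise ValueError(f"no launcher listed for {base_model} on {dataset}") from None


script = launcher_for("Mistral-7B-Instruct-v0.3", "SEP")
```

The resulting path can then be passed to `subprocess.run(["bash", script], check=True)`.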
Training merges the LoRA adapter into the base weights and saves a full checkpoint that evaluation can load directly. For checkpoints saved as adapters instead (QLoRA runs, or models trained earlier), merge them first:
python -m training.merge_lora --adapter_path <adapter_dir> --output_path <merged_dir>

DRIP supports two chat formats, and you train a separate model for each (they use different data and a different delimiter):
| Format | Eval targets | Training data | Delimiter | Launcher |
|---|---|---|---|---|
| 3-role (text) | SEP, Alpaca injection, IFEval, MMLU, MT-Bench | SEP DPO pairs | TextTextText | scripts/llama8b/sep/drip_sep.sh |
| 4-role (tool-calling) | AgentDojo | Alpaca + InjecAgent combined DPO | TextTextText-4roles | scripts/llama8b/agentdojo/drip_4roles.sh |
The 4-role launcher trains on datasets/alpaca_injecagent_dpo_combined.json with
the TextTextText-4roles delimiter (--attack TextTextText-4roles_None). See the
AgentDojo training-data section
for how that data is built and why InjecAgent/Alpaca are mixed in.
What the roles mean — where the untrusted data goes. A "role" is just the
delimiter that wraps the untrusted/injected segment. The exact tokens depend on
the base model's chat template (see config.py):
- Llama-3.1 has a native tool (`ipython`) role, so both formats are available:
  - 3-role (`TextTextText`): untrusted data in the `user` turn: `<|eot_id|><|start_header_id|>user<|end_header_id|>`
  - 4-role (`TextTextText-4roles`): untrusted data in the `ipython` turn: `<|eot_id|><|start_header_id|>ipython<|end_header_id|>`
- Meta-Llama-3 has no tool role, so only 3-role is possible; untrusted data always sits in the `user` turn.
- Mistral-7B (`TextTextTextMistral`) has no separate role; the untrusted data sits between `<</SYS>>` and `[/INST]`, with delimiters `['<s>[INST] <<SYS>>', ' <</SYS>>', '[/INST]']`.
For comparison, Meta SecAlign adds its own dedicated `input` role (`<|eot_id|><|start_header_id|>input<|end_header_id|>`) for the untrusted segment; DRIP instead reuses Llama's native `ipython` role for 4-role tool-calling.
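The mapping from template name to the delimiter that opens the untrusted segment can be summarized in a few lines. This sketch only restates the strings listed above; the function `untrusted_delimiter` is hypothetical, and the authoritative definitions live in config.py.

```python
# Header pattern shared by the Llama chat templates described above.
LLAMA_HEADER = "<|eot_id|><|start_header_id|>{role}<|end_header_id|>"

# Mistral delimiters as listed above: [system open, system close, instruction close].
MISTRAL_DELIMITERS = ["<s>[INST] <<SYS>>", " <</SYS>>", "[/INST]"]


def untrusted_delimiter(template: str) -> str:
    """Return the delimiter that precedes the untrusted data segment."""
    if template == "TextTextText":
        return LLAMA_HEADER.format(role="user")      # 3-role: user turn
    if template == "TextTextText-4roles":
        return LLAMA_HEADER.format(role="ipython")   # 4-role: native tool turn
    if template == "TextTextTextMistral":
        return MISTRAL_DELIMITERS[1]                 # data follows <</SYS>>
    raise ValueError(f"unknown template: {template}")


four_role_header = untrusted_delimiter("TextTextText-4roles")
```

The only difference between the two Llama formats is the role name inside the header, which is why a separate model is trained for each.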
If you would rather skip training, we release the DRIP adapters on the Hugging Face Hub. They are published as LoRA adapters, so after downloading you must merge each one into its base model before evaluation (this produces the full checkpoint the eval scripts load).
| Repo (Kelsey98/…) | Base model (--base_model_path) | Template | Model class (--customized_model_class) | Tool calls |
|---|---|---|---|---|
| Llama-3.1-8B-Instruct-TextTextText-4roles-toolcall-drip | meta-llama/Llama-3.1-8B-Instruct | 4-role | LlamaForCausalLMDRIP | ✅ |
| Meta-Llama-3-8B-Instruct-TextTextText-drip | meta-llama/Meta-Llama-3-8B-Instruct | 3-role | LlamaForCausalLMDRIP | — |
| Mistral-7B-Instruct-v0.3-TextTextTextMistral-drip | mistralai/Mistral-7B-Instruct-v0.3 | 3-role | MistralForCausalLMDRIP | — |
Download the adapter, then merge it — substituting REPO, --base_model_path,
and --customized_model_class from the row above:
REPO=Llama-3.1-8B-Instruct-TextTextText-4roles-toolcall-drip # pick one from the table
huggingface-cli download Kelsey98/$REPO --local-dir $REPO
CUDA_VISIBLE_DEVICES=0 python -m training.merge_lora \
--adapter_path "$REPO/" --output_path "$REPO-merged/" \
--base_model_path meta-llama/Llama-3.1-8B-Instruct \
--customized_model_class LlamaForCausalLMDRIP

Pass the merged path ($REPO-merged/) as the model path in the evaluation scripts.
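To avoid mismatching a repo with the wrong base model or model class, the download and merge commands above can be generated from the checkpoint table. This is a sketch under the assumption that the flags shown above are the only ones needed; `CHECKPOINTS`, `download_command`, and `merge_command` are hypothetical names, and the functions only build argument lists without running anything.

```python
from typing import Dict, List, Tuple

# repo name -> (--base_model_path, --customized_model_class), from the table above.
CHECKPOINTS: Dict[str, Tuple[str, str]] = {
    "Llama-3.1-8B-Instruct-TextTextText-4roles-toolcall-drip":
        ("meta-llama/Llama-3.1-8B-Instruct", "LlamaForCausalLMDRIP"),
    "Meta-Llama-3-8B-Instruct-TextTextText-drip":
        ("meta-llama/Meta-Llama-3-8B-Instruct", "LlamaForCausalLMDRIP"),
    "Mistral-7B-Instruct-v0.3-TextTextTextMistral-drip":
        ("mistralai/Mistral-7B-Instruct-v0.3", "MistralForCausalLMDRIP"),
}


def download_command(repo: str) -> List[str]:
    """Argument list mirroring the huggingface-cli download call above."""
    return ["huggingface-cli", "download", f"Kelsey98/{repo}", "--local-dir", repo]


def merge_command(repo: str) -> List[str]:
    """Argument list mirroring the training.merge_lora call above."""
    base_model, model_class = CHECKPOINTS[repo]
    return [
        "python", "-m", "training.merge_lora",
        "--adapter_path", f"{repo}/",
        "--output_path", f"{repo}-merged/",
        "--base_model_path", base_model,
        "--customized_model_class", model_class,
    ]


mistral_merge = merge_command("Mistral-7B-Instruct-v0.3-TextTextTextMistral-drip")
```

Each list can be passed to `subprocess.run(..., check=True)`; set `CUDA_VISIBLE_DEVICES` in the environment first if you want to pin the merge to one GPU, as in the shell example.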
Before you start: copy ./datasets/openai_configs_example.yaml to ./datasets/openai_configs.yaml and fill in your OpenAI configuration.
Every evaluation script below prompts for two inputs: a CUDA device ID and the trained model path (e.g. `meta-llama/Meta-Llama-3-8B-Instruct-TextTextText-sep-drip`). The steps omit this prompt for brevity.

The examples use the Llama scripts; swap `llama8b` for `mistral7b` to evaluate the other model.
SEP score — 📖 details
- Run `./scripts/evaluation/llama8b/sep.sh`.
- Run the SEP judge `./testing/sep/sep_judge.py`, then `./testing/sep/sep_collect.py` to print the SEP metric.
Alpaca heuristic-based attacks
- Run `./scripts/evaluation/llama8b/alpaca_injection.sh`.
- Run `./testing/evaluation_main.py -m [model_path]` to print ASR under the Naive, Ignore, Completion, Escape, and HackaPrompt attacks.
GCG-based adaptive attacks
See gcg/README.md. GCG requires a separate legacy environment because newer transformers versions trigger OOM.
InjecAgent — 📖 details
Adaptive attacks: PAIR / TAP / PISmith
Optimization/search-based attackers that adapt to the target — each has its own guide:
- PAIR: iterative attacker LLM. 📖 `testing/pair/README.md`
- TAP: tree-of-attacks with pruning. 📖 `testing/tap/README.md`
- PISmith: RL-trained attacker (train, then test). 📖 `testing/pismith/README.md`
AlpacaEval 2.0 (can cost up to USD 50)
- If the `alpaca_eval` command did not run successfully, run it manually:

  export OPENAI_CLIENT_CONFIG_PATH=./datasets/openai_configs.yaml && \
  alpaca_eval --model_outputs [model_path]/predictions_on_davinci_003_outputs.json \
  --reference_outputs datasets/gpt4o_outputs.json

- Find the win rate in `[model_path]/weighted_alpaca_eval_gpt4_turbo/leaderboard.csv`.
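The leaderboard file can also be read programmatically. The sketch below assumes the CSV has `win_rate` and `length_controlled_winrate` columns, which is how AlpacaEval leaderboards are typically laid out, but check the header of your own file; `read_win_rates` is a hypothetical helper, and columns that are missing or empty are simply skipped.

```python
import csv
from typing import Dict, List, Sequence


def read_win_rates(
    csv_path: str,
    columns: Sequence[str] = ("win_rate", "length_controlled_winrate"),
) -> List[Dict[str, float]]:
    """Return the requested numeric columns for every row of the leaderboard."""
    rows: List[Dict[str, float]] = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            # Keep only requested columns that exist and are non-empty.
            rows.append({col: float(row[col]) for col in columns if row.get(col)})
    return rows
```

For example, `read_win_rates("[model_path]/weighted_alpaca_eval_gpt4_turbo/leaderboard.csv")`, with `[model_path]` replaced by your merged checkpoint directory, returns one dictionary per leaderboard row.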
IFEval — 📖 details
- Run `./scripts/evaluation/llama8b/ifeval.sh`.
- Run `./testing/ifeval/evaluation_main.py` and look for ASR strict.
MT-Bench — 📖 details
- Run
./scripts/evaluation/llama8b/mtbench.sh. - Run
./testing/mt_bench/gen_judgment.pywith--model-path [model-path] --model-id [model name, e.g. Ours]. - Plot the radar chart with
./testing/mt_bench/plot.py.
MMLU — 📖 details
DRIP is developed and evaluated under a deliberately scoped threat model. The following limitations matter when applying it elsewhere:
- Text-to-text attacks. Our primary setting is text instruction-following — the injected content is natural-language instructions embedded in a text data section. For tool-calling agents we additionally release a dedicated 4-role checkpoint trained with InjecAgent data and evaluated on AgentDojo (see pretrained checkpoints); the text-only (3-role) models are not tuned for that regime, so use the 4-role checkpoint for tool-calling.
- Dense architectures. The representation editing is designed and validated on dense transformer architectures. We have not fully tested it on Mixture-of-Experts (MoE) models, where inserting new layers poses additional challenges. For MoE backbones we recommend our data-curation recipe together with the residual re-instruction fusion, while the de-instruction shift layer may be unnecessary (and harder to inject).
- Single modality. Extending DRIP to multi-modal agents — GUI agents, browser use, OS use — requires new adaptation that is outside the scope of this work.
Parts of this codebase are adapted from the following Meta AI (FAIR) projects:
We thank the authors for releasing their code.
A BibTeX entry will be added here once the paper is released. It is currently under review.


