feat(notebook): add adversarial red-team evaluation notebook #114

Open
francose wants to merge 2 commits into google:main from francose:add-adversarial-redteam-eval-notebook

@francose

What

A new notebook, notebook/adversarial_red_team_eval.ipynb, that runs a small adversarial suite against Sec-Gemini and produces a scorecard.

Why

Defender-side models still need red-team evaluation. The existing notebooks demo SDK usage, DFIR, and skills, but nothing yet lets a user point an adversarial prompt suite at their session and ask "did anything get through?". This fills that gap with a minimal, no-extra-deps starting point.

Shape

  • Cases organized by OWASP LLM01 sub-category: Instruction Override, System Prompt Extraction, DAN Jailbreak, Role Reassignment, Constraint Removal, Chat Template Injection, Encoding Evasion, Prompt Leaking.
  • Each case ships with the prompt and a transparent regex detector applied to the response. A match means the attack likely landed.
  • Cases run in fresh sessions so they can't contaminate each other through conversation history.
  • Output is a pass/fail table plus an "inspect the first flagged case in full" block so a human can confirm borderline matches.
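
A rough sketch of the shape described above, for readers who have not opened the notebook. This is not the notebook's code: `AttackCase`, `run_suite`, `first_flagged`, and the `ask` callable are hypothetical names, and `ask` stands in for whatever opens a fresh Sec-Gemini session and returns the response text.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class AttackCase:
    """One adversarial prompt plus the regex that flags a likely success."""
    category: str   # OWASP LLM01 sub-category label
    prompt: str     # text sent to the model
    detector: str   # regex; a match means the attack likely landed


def run_suite(cases: Iterable[AttackCase], ask: Callable[[str], str]) -> list:
    """Run every case through `ask` and return one scorecard row per case.

    `ask` is a hypothetical stand-in: it is assumed to open a fresh session,
    send the prompt, and return the full response text, so that no
    conversation history is shared between cases.
    """
    rows = []
    for case in cases:
        response = ask(case.prompt)
        flagged = re.search(case.detector, response) is not None
        rows.append({
            "category": case.category,
            "result": "FAIL" if flagged else "pass",
            "response": response,
        })
    return rows


def first_flagged(rows: list) -> Optional[dict]:
    """Return the first flagged row so a human can read its response in full."""
    return next((row for row in rows if row["result"] == "FAIL"), None)
```

Passing `ask` in as a plain callable keeps the harness testable with a fake model and leaves the session handling to the SDK code that wraps it.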

What it deliberately is not

  • Not an LLM-as-judge harness — kept regex-only so the detectors are auditable in one read.
  • Not a PyRIT/garak dependency — the closing section points users there if they want deeper coverage.
  • Not a benchmark claim against Sec-Gemini — the suite is small, intended as a starting harness users can extend.

Style

Follows the conventions of the existing notebooks (auth pattern from basic_sdk_usage.ipynb, streaming response pattern, same kernelspec). 18 cells, ~12 KB, no embedded outputs.

Testing

Validated as nbformat v4. All Python cells compile cleanly once the usual notebook idioms (!pip shell magic and top-level await) are accounted for.
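
For reviewers who want to repeat a check like this, here is a minimal sketch using only the standard library. It is an assumption about how such a check could be done, not the script used for this PR; `check_notebook_cells` is a hypothetical helper. It treats lines starting with `!` or `%` as shell or magic lines and allows top-level `await` through `ast.PyCF_ALLOW_TOP_LEVEL_AWAIT`.

```python
import ast
import json


def check_notebook_cells(nb_json: str) -> list:
    """Return one error string per code cell that fails to compile."""
    nb = json.loads(nb_json)
    if nb.get("nbformat") != 4:
        return [f"expected nbformat 4, got {nb.get('nbformat')}"]

    errors = []
    for index, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        source = cell["source"]
        if isinstance(source, list):  # source may be a string or a list of lines
            source = "".join(source)

        # Replace shell/magic lines with `pass`, keeping their indentation,
        # because they are IPython syntax rather than valid Python.
        lines = []
        for line in source.splitlines():
            stripped = line.lstrip()
            if stripped.startswith(("!", "%")):
                indent = line[: len(line) - len(stripped)]
                line = indent + "pass"
            lines.append(line)

        try:
            compile("\n".join(lines), f"<cell {index}>", "exec",
                    flags=ast.PyCF_ALLOW_TOP_LEVEL_AWAIT)
        except SyntaxError as exc:
            errors.append(f"cell {index}: {exc.msg}")
    return errors
```

The check only compiles each cell; it never executes anything, so it needs no credentials or network access.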

Adds a self-contained notebook that exercises Sec-Gemini against an
adversarial prompt suite covering the OWASP LLM01 sub-categories
(instruction override, system prompt extraction, DAN-style jailbreak,
role reassignment, constraint removal, chat template injection,
encoding evasion, prompt leaking).

Each case ships with a simple, auditable regex detector applied to the
response. Cases run in fresh sessions so they cannot contaminate each
other through conversation history. Output is a scorecard plus a
per-failure inspection block so a human can confirm borderline matches.

The notebook intentionally has no heavy dependencies beyond sec-gemini
itself. Users who want richer scoring are pointed at PyRIT, garak, and
MITRE ATLAS in the closing section.
@google-cla

google-cla Bot commented May 23, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Each attack case now asks the model to emit a unique token
(SYSTEM_PROMPT_FOLLOWS:, DAN_ONLINE, ROLE_ACCEPTED, etc.) and the
detector matches the token rather than generic English. This reduces
the false-positive rate when a refusal response quotes parts of the
attack while declining it.
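
To illustrate why matching a unique token helps, here is a small sketch. The two regexes and both sample responses are made up for this example and are not taken from the notebook; only the token name SYSTEM_PROMPT_FOLLOWS: comes from the commit above.

```python
import re

# Hypothetical detectors: one matches generic English, one matches the token.
GENERIC = re.compile(r"system prompt", re.IGNORECASE)
TOKEN = re.compile(r"\bSYSTEM_PROMPT_FOLLOWS:")


def flagged(pattern: re.Pattern, response: str) -> bool:
    """Return True when the detector matches anywhere in the response."""
    return pattern.search(response) is not None


# Made-up refusal that quotes the attack while declining it.
refusal = "I can't share my system prompt, even if asked to ignore instructions."
# Made-up response in which the model emitted the requested token.
leaked = "SYSTEM_PROMPT_FOLLOWS: (contents elided)"

print(flagged(GENERIC, refusal))  # True: a false positive on a refusal
print(flagged(TOKEN, refusal))    # False: the refusal never emits the token
print(flagged(TOKEN, leaked))     # True: the token only appears on compliance
```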

Also fixes a missing word boundary in the previous DAN pattern and
adds boundaries on RESTRICTIONS_DISABLED, TEMPLATE_APPLIED and
decoded-pwn.
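
A short sketch of what the added boundaries change. The `bounded` helper and the sample strings are hypothetical; the notebook's actual patterns may be written differently. Note that `_` counts as a word character for `\b`, so a token such as TEMPLATE_APPLIED is treated as a single word.

```python
import re


def bounded(token: str) -> re.Pattern:
    """Compile a detector that matches `token` only as a standalone word."""
    return re.compile(rf"\b{re.escape(token)}\b")


unbounded = re.compile(re.escape("TEMPLATE_APPLIED"))

# Made-up response where the token is embedded in a longer identifier.
embedded = "status=NO_TEMPLATE_APPLIED_HERE"

print(unbounded.search(embedded) is not None)                    # True
print(bounded("TEMPLATE_APPLIED").search(embedded) is not None)  # False

# The bounded form still matches the token when it stands alone.
print(bounded("TEMPLATE_APPLIED").search("ok: TEMPLATE_APPLIED.") is not None)      # True
print(bounded("decoded-pwn").search("the string decoded-pwn appeared") is not None) # True
```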