feat(notebook): add adversarial red-team evaluation notebook by francose · Pull Request #114 · google/sec-gemini

francose · 2026-05-23T19:40:55Z

What

A new notebook, notebook/adversarial_red_team_eval.ipynb, that runs a small adversarial suite against Sec-Gemini and produces a scorecard.

Why

Defender-side models still need red-team eval. The existing notebooks demo SDK usage, DFIR, and skills — there's nothing yet that lets a user point an adversarial prompt suite at their session and ask "did anything get through". This fills that gap with a minimal, no-extra-deps starting point.

Shape

Cases organized by OWASP LLM01 sub-category: Instruction Override, System Prompt Extraction, DAN Jailbreak, Role Reassignment, Constraint Removal, Chat Template Injection, Encoding Evasion, Prompt Leaking.
Each case ships with the prompt and a transparent regex detector applied to the response. A match = the attack likely landed.
Cases run in fresh sessions so they can't contaminate each other through conversation history.
Output is a pass/fail table plus a "inspect the first flagged case in full" block so a human can confirm borderline matches.

What it deliberately is not

Not an LLM-as-judge harness — kept regex-only so the detectors are auditable in one read.
Not a PyRIT/garak dependency — the closing section points users there if they want deeper coverage.
Not a benchmark claim against Sec-Gemini — the suite is small, intended as a starting harness users can extend.

Style

Follows the conventions of the existing notebooks (auth pattern from basic_sdk_usage.ipynb, streaming response pattern, same kernelspec). 18 cells, ~12 KB, no embedded outputs.

Testing

Validated as nbformat v4. All Python cells compile cleanly (with the usual notebook idioms — !pip shell magic and top-level await).

Adds a self-contained notebook that exercises Sec-Gemini against an adversarial prompt suite covering the OWASP LLM01 sub-categories (instruction override, system prompt extraction, DAN-style jailbreak, role reassignment, constraint removal, chat template injection, encoding evasion, prompt leaking). Each case ships with a simple, auditable regex detector applied to the response. Cases run in fresh sessions so they cannot contaminate each other through conversation history. Output is a scorecard plus a per-failure inspection block so a human can confirm borderline matches. The notebook intentionally has no heavy dependencies beyond sec-gemini itself. Users who want richer scoring are pointed at PyRIT, garak, and MITRE ATLAS in the closing section.

google-cla · 2026-05-23T19:41:00Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Each attack case now asks the model to emit a unique token (SYSTEM_PROMPT_FOLLOWS:, DAN_ONLINE, ROLE_ACCEPTED, etc.) and the detector matches the token rather than generic English. This reduces the false-positive rate when a refusal response quotes parts of the attack while declining it. Also fixes a missing word boundary in the previous DAN pattern and adds boundaries on RESTRICTIONS_DISABLED, TEMPLATE_APPLIED and decoded-pwn.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(notebook): add adversarial red-team evaluation notebook#114

feat(notebook): add adversarial red-team evaluation notebook#114
francose wants to merge 2 commits into
google:mainfrom
francose:add-adversarial-redteam-eval-notebook

francose commented May 23, 2026

Uh oh!

google-cla Bot commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

francose commented May 23, 2026

What

Why

Shape

What it deliberately is not

Style

Testing

Uh oh!

google-cla Bot commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant