feat(notebook): add adversarial red-team evaluation notebook#114
Open
francose wants to merge 2 commits into
Open
feat(notebook): add adversarial red-team evaluation notebook#114francose wants to merge 2 commits into
francose wants to merge 2 commits into
Conversation
Adds a self-contained notebook that exercises Sec-Gemini against an adversarial prompt suite covering the OWASP LLM01 sub-categories (instruction override, system prompt extraction, DAN-style jailbreak, role reassignment, constraint removal, chat template injection, encoding evasion, prompt leaking). Each case ships with a simple, auditable regex detector applied to the response. Cases run in fresh sessions so they cannot contaminate each other through conversation history. Output is a scorecard plus a per-failure inspection block so a human can confirm borderline matches. The notebook intentionally has no heavy dependencies beyond sec-gemini itself. Users who want richer scoring are pointed at PyRIT, garak, and MITRE ATLAS in the closing section.
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
Each attack case now asks the model to emit a unique token (SYSTEM_PROMPT_FOLLOWS:, DAN_ONLINE, ROLE_ACCEPTED, etc.) and the detector matches the token rather than generic English. This reduces the false-positive rate when a refusal response quotes parts of the attack while declining it. Also fixes a missing word boundary in the previous DAN pattern and adds boundaries on RESTRICTIONS_DISABLED, TEMPLATE_APPLIED and decoded-pwn.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
A new notebook,
notebook/adversarial_red_team_eval.ipynb, that runs a small adversarial suite against Sec-Gemini and produces a scorecard.Why
Defender-side models still need red-team eval. The existing notebooks demo SDK usage, DFIR, and skills — there's nothing yet that lets a user point an adversarial prompt suite at their session and ask "did anything get through". This fills that gap with a minimal, no-extra-deps starting point.
Shape
What it deliberately is not
Style
Follows the conventions of the existing notebooks (auth pattern from
basic_sdk_usage.ipynb, streaming response pattern, same kernelspec). 18 cells, ~12 KB, no embedded outputs.Testing
Validated as nbformat v4. All Python cells compile cleanly (with the usual notebook idioms —
!pipshell magic and top-levelawait).