feat(moderation): add input moderation guard for the UI assistant by albanm · Pull Request #13 · data-fair/agents

albanm · 2026-06-05T07:19:36Z

Add a per-message input-moderation guard for the UI-integrated assistant. Each new user message is classified by a configurable moderator model (falling back to the summarizer model) and out-of-scope / abusive / prompt-injection messages are blocked with a generic refusal before any assistant output is shown.

Why: protect the assistant from profanity, prompt-injection, persona-override, and out-of-scope abuse, without delaying the request — moderation runs concurrently with the assistant turn and only withholds the first visible byte until the verdict arrives.
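The parallel gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in use-agent-chat.ts: `gateStream`, the `Verdict` shape, and the idea of modelling the assistant output as an `AsyncIterable<string>` are all assumptions made for the example. It also assumes the caller has already started both the assistant request and the moderation request, so waiting on the verdict only delays what is displayed, not the requests themselves.

```typescript
// Hypothetical verdict shape, for illustration only.
type Verdict = { action: 'allow' | 'block' }

// Hypothetical helper: relays assistant chunks only once the verdict is known.
// Both `stream` and `verdict` are assumed to be in flight already.
async function* gateStream (
  stream: AsyncIterable<string>,
  verdict: Promise<Verdict>,
  refusalMessage: string
): AsyncGenerator<string> {
  // Withhold the first visible chunk until moderation has answered.
  const { action } = await verdict
  if (action === 'block') {
    // Show only the generic refusal; the assistant output is never displayed.
    // A real implementation would also abort the in-flight assistant request
    // and drop the user message from the model context.
    yield refusalMessage
    return
  }
  // Allowed: pass the assistant output through unchanged.
  for await (const chunk of stream) yield chunk
}
```

Because the gate only awaits a promise that was created before the stream started, an allowed turn pays at most the difference between the moderation latency and the assistant's time to first chunk.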

What changed:

api/src/moderation/{operations,service,router}.ts — new module: moderation prompt builder, tolerant verdict parser, model resolution (moderator → summarizer → skip), and a fail-open 1.5s-timeout service. GET returns the enabled flag, POST returns the verdict.
api/types/settings/schema.js — new moderator model slot and a moderation settings block (enabled, refusalMessage).
ui/src/composables/use-agent-chat.ts — parallel gate that buffers the stream until the verdict, blocks → refusal + drops the message from model context.
ui/src/components/agent-chat/AgentChatDebugDialog.vue + session-recorder.ts — dedicated trace renderer; every decision (allow/skip/block) is recorded client-side.
docs/architecture.md — new "Input Moderation Guard" section.
Mock mock-moderator model variant + unit/api/e2e tests.
Regenerated ui/src/components/vjsf/* and api/doc/settings/put-req/* (build-types output).
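For readers unfamiliar with the "tolerant verdict parser" and the "moderator → summarizer → skip" resolution listed above, here is a minimal sketch of both ideas. The names `parseVerdict`, `resolveModeratorModel`, and the `Verdict` and `Models` shapes are hypothetical, and treating unparsable output as "allow" is an assumption that matches the fail-open behavior described in this PR, not a copy of the actual module.

```typescript
// Hypothetical shapes, for illustration only.
type Verdict = { action: 'allow' | 'block', category?: string, reason?: string }
type Models = { moderator?: string, summarizer?: string }

// Extracts the first {...} object from free-form model output.
function parseVerdict (raw: string): Verdict {
  // Non-greedy match: stop at the first closing brace so that trailing text
  // containing "}" is not swallowed. This does not handle nested objects.
  const match = raw.match(/\{[\s\S]*?\}/)
  if (!match) return { action: 'allow' }
  try {
    const parsed = JSON.parse(match[0])
    if (parsed?.action === 'block') {
      return {
        action: 'block',
        category: typeof parsed.category === 'string' ? parsed.category : undefined,
        reason: typeof parsed.reason === 'string' ? parsed.reason : undefined
      }
    }
  } catch {
    // Malformed JSON: fall through to the fail-open default (assumption).
  }
  return { action: 'allow' }
}

// Moderator slot first, then the summarizer slot; undefined means "skip moderation".
function resolveModeratorModel (models: Models): string | undefined {
  return models.moderator ?? models.summarizer
}
```

With this sketch, an answer such as `Sure! {"action":"block","category":"abuse"} hope } this helps` still yields a block verdict, whereas a greedy match would capture up to the last brace and fail to parse.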

Regression risks:

Auth/quota gap on the new endpoints (please confirm intentional). api/src/moderation/router.ts only calls reqSession — no assertCanUseModel / assertRoleQuota / checkQuota, unlike gateway/router.ts and summary/router.ts. Since reqSession permits anonymous sessions, any caller can POST /api/moderation/<any-owner>/<id> with arbitrary text and trigger that owner's configured model (consuming their provider API key/budget) with no quota accounting. GET also discloses any owner's moderation-enabled flag. Cheap per call, but ungated.
Advisory, not a security boundary (by design, documented). A direct/anonymous gateway call bypasses moderation entirely.
Enablement cached per chat mount — toggling moderation in settings only takes effect after a page reload.
Block on a no-output turn is recorded but not enforced — the post-loop fallback in use-agent-chat.ts records the verdict but does not show the refusal if the stream produced no text-delta/tool-call. Edge case; a normal turn always streams something.
Settings replaceOne now always writes a moderation field (defaulting to { enabled: false }); existing settings docs without it are unaffected on read via emptySettings/?? defaultModeration.
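The read-side defaulting mentioned in the last item can be illustrated with a short sketch. The `Settings` and `ModerationSettings` types and the `readModeration` helper are hypothetical; only the `?? defaultModeration` pattern and the `{ enabled: false }` default come from the description above.

```typescript
// Hypothetical shapes, for illustration only.
type ModerationSettings = { enabled: boolean, refusalMessage?: string }
type Settings = { moderation?: ModerationSettings }

const defaultModeration: ModerationSettings = { enabled: false }

// Documents stored before this change have no `moderation` field;
// reading them falls back to the disabled default.
function readModeration (settings: Settings | null | undefined): ModerationSettings {
  return settings?.moderation ?? defaultModeration
}
```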

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add GET /api/moderation/:type/:id (reports enabled state) and POST /api/moderation/:type/:id (runs moderation and returns allow/block verdict with refusal message). Includes service with timeout/fail-open logic and full API test suite (8 tests, all passing). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
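The timeout/fail-open logic mentioned in this commit can be approximated as below. This is a sketch under stated assumptions: `moderateFailOpen` and the `classify` callback are hypothetical names, the 1500 ms default mirrors the 1.5s timeout from the PR description, and the `skipped` flag follows the fields listed for recorded decisions. The real service may be structured differently.

```typescript
// Hypothetical verdict shape, for illustration only.
type Verdict = { action: 'allow' | 'block', skipped?: boolean }

// Default mirrors the 1.5s timeout mentioned in the PR description.
const MODERATION_TIMEOUT_MS = 1500

// Runs `classify` but never lets a slow or failing moderator block the user:
// on timeout or error the message is allowed and flagged as skipped.
async function moderateFailOpen (
  classify: () => Promise<Verdict>,
  timeoutMs: number = MODERATION_TIMEOUT_MS
): Promise<Verdict> {
  let timer: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<Verdict>(resolve => {
    timer = setTimeout(() => resolve({ action: 'allow', skipped: true }), timeoutMs)
  })
  try {
    // Whichever settles first wins: the real verdict or the fail-open default.
    return await Promise.race([classify(), timeout])
  } catch {
    // Provider or parsing error: fail open rather than refuse the message.
    return { action: 'allow', skipped: true }
  } finally {
    clearTimeout(timer)
  }
}
```

Failing open trades strictness for availability, which is consistent with the guard being described as advisory rather than a security boundary.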

…locks The session recorder now captures every moderation verdict via recordModerationDecision (action + category + reason + skipped), surfaced as a 'moderation: <verdict>' trace entry. Recording happens at the gate for any visible turn, with a post-stream safety net for turns that emit no visible output, so allow/skip decisions are auditable in the debug dialog (not just blocks). Stays client-side/ephemeral — no server log. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Render moderation trace entries with a localized action chip (Allowed/Blocked/Skipped, color-coded) plus labeled category and reason, instead of the generic JSON fallback. Adds an e2e asserting the renderer. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Integrate latest main (anonymous-action token on gateway/summary, progressive tool disclosure, cache metrics in debug traces, drop user name from prompt).

Conflict resolutions:
- api/src/models/mock-model.ts: keep both processMockModeratorPrompt and processSelectToolsSeam; take main's processForModel signature (adds tools?).
- ui/src/components/AgentChat.vue: main chat keeps both refusalMessage and toolExploration options; evaluator chat unmoderated (no refusalMessage).
- ui/src/composables/use-agent-chat.ts: keep refusalMessage + toolExploration options and MODERATION_TIMEOUT_MS + exploration state side by side. Moderation now rides main's gatewayFetch, so it inherits anonymous-token handling.
- docs/architecture.md: keep both sections; renumber tool-disclosure to §9.
- ui/dts/auto-imports.d.ts: union of both generated import sets.

Synced deps (npm install) to pick up @data-fair/lib-utils ^1.11.0 (needed by main's markdown.ts headingClasses). check-types and lint clean.

Known pre-existing environmental failure (NOT from this merge): the gateway "anonymous request with valid token succeeds" test fails with a JWT NotBeforeError due to clock skew on the long-running simple-directory container.

albanm and others added 21 commits June 4, 2026 11:48

feat(moderation): add moderator model slot and moderation settings

466d6b0

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(moderation): pure operations (prompt, verdict parse, model resol…

3e69eeb

…ution) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(moderation): add mock-moderator model variant for tests

7522e48

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(moderation): ui parallel gate withholding first byte until verdict

c89e939

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

test(moderation): e2e block and allow flows

0278a9e

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

fix(moderation): non-greedy JSON match in verdict parser

8a585a4

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs(moderation): document input moderation guard in architecture

e75c60b

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs(moderation): spec for folding moderation into the gateway, alway…

e6d9839

…s-on Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs(moderation): implementation plan for gateway-based always-on mod…

d816e60

…eration Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

refactor(moderation): move prompt + verdict parser to a pure client m…

92ab347

…odule Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(moderation): resolve moderator role through the gateway with sum…

54ddc20

…marizer/assistant fallback Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

refactor(moderation): delete dedicated endpoint and settings toggle/m…

a3abc52

…essage Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(moderation): always-on client gate via gateway moderator role

8f49059

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

refactor(moderation): drop vestigial guards, use provider.chat, hoist…

ce023db

… timeout const Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

test(moderation): e2e for always-on gateway moderation with hardcoded…

5a36426

… refusal Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs(moderation): document always-on gateway-based moderation

665469e

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

chore(moderation): update auto-imports declarations for moderation co…

77ec6b6

…mposable Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fix(moderation): do not moderate the evaluator chat; clarify gate inv…

40407e4

…ariants Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions Bot added the feature label Jun 5, 2026

albanm merged commit f24804f into main Jun 5, 2026
3 checks passed

albanm deleted the feat-moderation branch June 5, 2026 08:31

albanm merged 22 commits into main from feat-moderation