feat(moderation): add input moderation guard for the UI assistant #13

Merged
albanm merged 22 commits into main from feat-moderation on Jun 5, 2026

@albanm (Member) commented Jun 5, 2026

Add a per-message input-moderation guard for the UI-integrated assistant. Each new user message is classified by a configurable moderator model (falling back to the summarizer model) and out-of-scope / abusive / prompt-injection messages are blocked with a generic refusal before any assistant output is shown.

Why: protect the assistant from profanity, prompt-injection, persona-override, and out-of-scope abuse, without delaying the request — moderation runs concurrently with the assistant turn and only withholds the first visible byte until the verdict arrives.
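The "withhold the first visible byte" behaviour can be pictured as a small buffering gate. The following is a minimal sketch under stated assumptions, not the code in use-agent-chat.ts: `ModerationGate`, `push`, `settle`, and the `emit` callback are hypothetical names, and stream chunks are reduced to plain strings.

```typescript
// Hypothetical sketch of a buffering gate; all names are illustrative.
type ModerationAction = "allow" | "skip" | "block";

class ModerationGate {
  private state: "pending" | "open" | "blocked" = "pending";
  private buffered: string[] = [];
  private readonly emit: (text: string) => void;
  private readonly refusalMessage: string;

  constructor(emit: (text: string) => void, refusalMessage: string) {
    this.emit = emit;
    this.refusalMessage = refusalMessage;
  }

  /** Called for every assistant chunk as it streams in. */
  push(chunk: string): void {
    if (this.state === "pending") this.buffered.push(chunk);
    else if (this.state === "open") this.emit(chunk);
    // state === "blocked": the chunk is dropped and never shown.
  }

  /** Called once, when the moderation verdict arrives. */
  settle(action: ModerationAction): void {
    if (this.state !== "pending") return;
    if (action === "block") {
      this.state = "blocked";
      this.buffered = [];
      this.emit(this.refusalMessage);
      return;
    }
    // "allow" and "skip" both open the gate and flush what was withheld.
    this.state = "open";
    for (const chunk of this.buffered) this.emit(chunk);
    this.buffered = [];
  }
}

const shown: string[] = [];
const gate = new ModerationGate((text) => { shown.push(text); }, "Sorry, I can't help with that.");
gate.push("Hel");
gate.push("lo");       // still buffered, nothing visible yet
gate.settle("allow");  // verdict arrives: the buffer is flushed in order
gate.push("!");        // gate is open, chunk passes straight through
console.log(shown.join("")); // prints "Hello!"
```

Because the assistant stream and the moderation call start together, the only added latency in this sketch is the gap between the first chunk and the verdict.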

What changed:

  • api/src/moderation/{operations,service,router}.ts — new module: moderation prompt builder, tolerant verdict parser, model resolution (moderator → summarizer → skip), and a fail-open 1.5s-timeout service. GET returns the enabled flag, POST returns the verdict.
  • api/types/settings/schema.js — new moderator model slot and a moderation settings block (enabled, refusalMessage).
  • ui/src/composables/use-agent-chat.ts — parallel gate that buffers the stream until the verdict arrives; on a block, it shows the refusal and drops the message from the model context.
  • ui/src/components/agent-chat/AgentChatDebugDialog.vue + session-recorder.ts — dedicated trace renderer; every decision (allow/skip/block) is recorded client-side.
  • docs/architecture.md — new "Input Moderation Guard" section.
  • Mock mock-moderator model variant + unit/api/e2e tests.
  • Regenerated ui/src/components/vjsf/* and api/doc/settings/put-req/* (build-types output).
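
The server-side pieces listed above (tolerant verdict parser, model resolution, fail-open timeout) could look roughly like the sketch below. It is an illustration under assumptions, not the contents of api/src/moderation: the function names, the `Verdict` shape, and the JSON output format of the moderator model are all hypothetical, and treating unparseable output as "allow" is an assumed reading of "tolerant" plus "fail-open". Only the 1.5 s timeout value and the moderator → summarizer → skip order come from the description.

```typescript
// Hypothetical shapes and names; only the 1.5 s value and fallback order come from the PR text.
type Verdict = { action: "allow" | "block"; category?: string; reason?: string };
type ModelSlots = { moderator?: string; summarizer?: string };

const FAIL_OPEN: Verdict = { action: "allow", reason: "fail-open" };
const MODERATION_TIMEOUT_MS = 1500;

/** Pick the moderator model, fall back to the summarizer, else null (skip moderation). */
function resolveModeratorModel(models: ModelSlots): string | null {
  return models.moderator ?? models.summarizer ?? null;
}

/** Tolerant parser: accepts JSON wrapped in prose or code fences, any letter case. */
function parseVerdict(raw: string): Verdict {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return FAIL_OPEN;
  try {
    // Parse the span from the first "{" to the last "}".
    const parsed: unknown = JSON.parse(raw.slice(start, end + 1));
    if (typeof parsed !== "object" || parsed === null) return FAIL_OPEN;
    const obj = parsed as Record<string, unknown>;
    const text = String(obj.action ?? "").trim().toLowerCase();
    const action = text === "block" ? "block" : text === "allow" ? "allow" : null;
    if (action === null) return FAIL_OPEN;
    return {
      action,
      category: typeof obj.category === "string" ? obj.category : undefined,
      reason: typeof obj.reason === "string" ? obj.reason : undefined,
    };
  } catch {
    return FAIL_OPEN; // malformed JSON is treated as "allow" (assumption)
  }
}

/** Resolve with the parsed verdict, or fail open on timeout or model error. */
function moderateFailOpen(call: Promise<string>, timeoutMs: number = MODERATION_TIMEOUT_MS): Promise<Verdict> {
  return new Promise<Verdict>((resolve) => {
    const timer = setTimeout(() => resolve(FAIL_OPEN), timeoutMs);
    call.then(
      (raw) => { clearTimeout(timer); resolve(parseVerdict(raw)); },
      () => { clearTimeout(timer); resolve(FAIL_OPEN); },
    );
  });
}
```

Failing open keeps a slow or broken moderator from blocking legitimate users, which matches the guard's advisory role described under the regression risks.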

Regression risks:

  • Auth/quota gap on the new endpoints (please confirm intentional). api/src/moderation/router.ts only calls reqSession — no assertCanUseModel / assertRoleQuota / checkQuota, unlike gateway/router.ts and summary/router.ts. Since reqSession permits anonymous sessions, any caller can POST /api/moderation/<any-owner>/<id> with arbitrary text and trigger that owner's configured model (consuming their provider API key/budget) with no quota accounting. GET also discloses any owner's moderation-enabled flag. Cheap per call, but ungated.
  • Advisory, not a security boundary (by design, documented). A direct/anonymous gateway call bypasses moderation entirely.
  • Enablement cached per chat mount — toggling moderation in settings only takes effect after a page reload.
  • Block on a no-output turn is recorded but not enforced — the post-loop fallback in use-agent-chat.ts records the verdict but does not show the refusal if the stream produced no text-delta/tool-call. Edge case; a normal turn always streams something.
  • Settings replaceOne now always writes a moderation field (defaulting to { enabled: false }); existing settings docs without it are unaffected on read via emptySettings/?? defaultModeration.
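
The read-side defaulting mentioned in the last item can be sketched as follows. This is a minimal illustration, assuming a settings document whose `moderation` field may be absent; `SettingsDoc` and `readModeration` are hypothetical names, while `defaultModeration` and the `{ enabled: false }` default are taken from the description.

```typescript
// Hypothetical types; the real schema lives in api/types/settings/schema.js.
interface ModerationSettings {
  enabled: boolean;
  refusalMessage?: string;
}

interface SettingsDoc {
  moderation?: ModerationSettings;
}

const defaultModeration: ModerationSettings = { enabled: false };

/** Older documents without a moderation block read as "moderation disabled". */
function readModeration(doc: SettingsDoc | null | undefined): ModerationSettings {
  return doc?.moderation ?? defaultModeration;
}
```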

albanm and others added 21 commits June 4, 2026 11:48
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ution)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add GET /api/moderation/:type/:id (reports enabled state) and POST
/api/moderation/:type/:id (runs moderation and returns allow/block
verdict with refusal message). Includes service with timeout/fail-open
logic and full API test suite (8 tests, all passing).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…locks

The session recorder now captures every moderation verdict via
recordModerationDecision (action + category + reason + skipped), surfaced
as a 'moderation: <verdict>' trace entry. Recording happens at the gate
for any visible turn, with a post-stream safety net for turns that emit
no visible output, so allow/skip decisions are auditable in the debug
dialog (not just blocks). Stays client-side/ephemeral — no server log.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Render moderation trace entries with a localized action chip
(Allowed/Blocked/Skipped, color-coded) plus labeled category and reason,
instead of the generic JSON fallback. Adds an e2e asserting the renderer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…s-on

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…eration

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…odule

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…marizer/assistant fallback

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…essage

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… timeout const

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… refusal

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…mposable

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ariants

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Integrate latest main (anonymous-action token on gateway/summary, progressive
tool disclosure, cache metrics in debug traces, drop user name from prompt).

Conflict resolutions:
- api/src/models/mock-model.ts: keep both processMockModeratorPrompt and
  processSelectToolsSeam; take main's processForModel signature (adds tools?).
- ui/src/components/AgentChat.vue: main chat keeps both refusalMessage and
  toolExploration options; evaluator chat unmoderated (no refusalMessage).
- ui/src/composables/use-agent-chat.ts: keep refusalMessage + toolExploration
  options and MODERATION_TIMEOUT_MS + exploration state side by side. Moderation
  now rides main's gatewayFetch, so it inherits anonymous-token handling.
- docs/architecture.md: keep both sections; renumber tool-disclosure to §9.
- ui/dts/auto-imports.d.ts: union of both generated import sets.

Synced deps (npm install) to pick up @data-fair/lib-utils ^1.11.0 (needed by
main's markdown.ts headingClasses). check-types and lint clean.

Known pre-existing environmental failure (NOT from this merge): the gateway
"anonymous request with valid token succeeds" test fails with a JWT
NotBeforeError due to clock skew on the long-running simple-directory container.
@albanm albanm merged commit f24804f into main Jun 5, 2026
3 checks passed
@albanm albanm deleted the feat-moderation branch June 5, 2026 08:31