Merged
@@ -1,9 +1,25 @@
# Admin Queue Peek and Purge (DLQ-Aware) for the SQS Web Console

**Status:** Proposed
**Status:** Implemented
**Author:** bootjp
**Date:** 2026-05-16

## Implementation history

| Phase | PR | Landed |
|-------|-----|--------|
| 1 (this doc, proposal) | #757 | 2026-05-16 |
| 2 (`AdminPurgeQueue` + `IsDLQ`/`DLQSources`) | #771 | 2026-05-17 |
| 3 (`AdminPeekQueue` backend) | #794 | 2026-05-20 |
| 4 (HTTP handler + bridge) | #797 | 2026-05-21 |
| 5 (SPA Messages tab + Purge button + DLQ chips) | #798 | 2026-05-21 |

Out-of-scope follow-ups (tracked separately, not gating this rename):
- Throttle integration (`bucketActionAdminPeek` + dedicated per-queue admin-peek bucket per §3.1)
- Audit logging + Prometheus counters per §3.6
- `principalForReadSensitive` live `RoleStore` re-check (Goal 8, blocked on wider RoleStore plumbing)
- Page-size selector (20 / 50 / 100) + response-size warning
Comment on lines +17 to +21

medium

The 'Out-of-scope follow-ups' list identifies several features as deferred (Throttle integration, Audit/Prometheus, live RoleStore re-check, and Page-size selector). However, the body of this document (e.g., §3.1, §3.6, Goal 8 in §2.1, and §3.5) still describes these features as if they were implemented. Since this document is being marked as 'Implemented' and features like 'Throttle integration' are significant for managing operational risk, please update the document to reflect these deferrals and detail any mitigation strategies used in their absence to align with repository guidelines.

References
  1. When a design document identifies a significant operational risk, such as the inability to perform rolling upgrades, it must also detail potential mitigation strategies, like implementing a temporary "bridge" or "proxy" mode.


---

## 1. Background and Motivation
@@ -42,7 +58,7 @@ Both features work for any queue. The **UI is DLQ-aware** so the operator gets t
- `DLQSources []string` — the source-queue names that point at this queue. The SPA renders these as a chip list on the detail page so the operator can confirm they understand what queue feeds the DLQ before purging.
6. **Same AWS-shaped error mapping** as the SigV4 path — purging more than once per 60 seconds returns the SQS `PurgeQueueInProgress` semantics that `tryPurgeQueueOnce` already enforces. The admin response surfaces it as a structured `429 Too Many Requests` JSON payload (`{"code":"PurgeQueueInProgress", "retry_after_seconds":N}`).
7. **Audit** — `admin.sqs.purge_queue` (subject, role, queue, generation_before, generation_after). Peek is a read and does NOT generate an audit line per call (the SPA polls; per-poll audit would drown the log) but the admin handler emits the standard request-log line with `route` / `subject` / `status_code` so the call is still traceable.
8. **Read-only role can peek but not purge.** Peek is gated by a **live `RoleStore` re-check** (not just the session-auth gate that List / Describe currently use), introduced as a new `principalForReadSensitive` helper alongside the existing `principalForWrite`. Purge stays gated by `principalForWrite` (live-role re-check), matching `AdminDeleteQueue` exactly. Codex r9 P1 flagged the security gap in the earlier draft: peek exposes full message bodies / attributes (not just metadata like List / Describe), so a session JWT that was revoked or whose role was downgraded after login could still read DLQ payloads via peek until the token's natural 1-hour TTL elapsed. The new `principalForReadSensitive` helper performs the same revocation check `principalForWrite` does, but classifies the call as a read in the audit pipeline — keeping the audit shape parallel to List / Describe while closing the confidentiality gap. List / Describe themselves remain on the session-auth-only gate because their output is metadata that is already shown on the SPA's queue list page; the divergence is intentional and is documented at the call site so a future reviewer does not "fix" the inconsistency by downgrading peek's gate. Claude r2 caught the earlier draft that implied a non-existent `principalForRead` helper; this paragraph spells out the actual gate with the security-class distinction.
8. **Read-only role can peek but not purge.** _The live `RoleStore` re-check is NOT yet implemented in the initial rollout — see "Out-of-scope follow-ups" at the top. Phase 4 shipped a `Role.AllowsRead()` gate that accepts the JWT-embedded role plus an optional RoleStore lookup; the wider live-revalidation plumbing the design calls for is blocked on a broader RoleStore refactor that affects every adapter's read path. Mitigation in absence: a revoked / downgraded key keeps peek access until the session JWT's natural 1-hour TTL expires._ Peek is gated by a **live `RoleStore` re-check** (not just the session-auth gate that List / Describe currently use), introduced as a new `principalForReadSensitive` helper alongside the existing `principalForWrite`. Purge stays gated by `principalForWrite` (live-role re-check), matching `AdminDeleteQueue` exactly. Codex r9 P1 flagged the security gap in the earlier draft: peek exposes full message bodies / attributes (not just metadata like List / Describe), so a session JWT that was revoked or whose role was downgraded after login could still read DLQ payloads via peek until the token's natural 1-hour TTL elapsed. The new `principalForReadSensitive` helper performs the same revocation check `principalForWrite` does, but classifies the call as a read in the audit pipeline — keeping the audit shape parallel to List / Describe while closing the confidentiality gap. List / Describe themselves remain on the session-auth-only gate because their output is metadata that is already shown on the SPA's queue list page; the divergence is intentional and is documented at the call site so a future reviewer does not "fix" the inconsistency by downgrading peek's gate. Claude r2 caught the earlier draft that implied a non-existent `principalForRead` helper; this paragraph spells out the actual gate with the security-class distinction.
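
Goal 6 above maps the 60-second purge cooldown onto a structured `429` response. The following is a minimal Go sketch of that mapping, not the repository's handler: the helper names (`retryAfterSeconds`, `purgeInProgressPayload`, `writePurgeInProgress`) and the `Retry-After` header are assumptions made for illustration; only the status code, the JSON keys, and the 60-second window come from the text.

```go
package main

import (
	"encoding/json"
	"net/http"
	"strconv"
	"time"
)

// purgeCooldown is the 60-second window quoted in goal 6.
const purgeCooldown = 60 * time.Second

// purgeInProgressBody mirrors the JSON payload quoted in goal 6.
type purgeInProgressBody struct {
	Code              string `json:"code"`
	RetryAfterSeconds int    `json:"retry_after_seconds"`
}

// retryAfterSeconds (hypothetical helper) returns the remaining cooldown,
// rounded up to whole seconds, or 0 once the window has elapsed.
func retryAfterSeconds(sinceLastPurge time.Duration) int {
	remaining := purgeCooldown - sinceLastPurge
	if remaining <= 0 {
		return 0
	}
	return int((remaining + time.Second - 1) / time.Second)
}

// purgeInProgressPayload (hypothetical helper) encodes the structured body.
func purgeInProgressPayload(retryAfter int) ([]byte, error) {
	return json.Marshal(purgeInProgressBody{
		Code:              "PurgeQueueInProgress",
		RetryAfterSeconds: retryAfter,
	})
}

// writePurgeInProgress (hypothetical helper) answers 429 with the payload.
// Setting the Retry-After header is an assumption, not something the
// design text specifies.
func writePurgeInProgress(w http.ResponseWriter, sinceLastPurge time.Duration) error {
	secs := retryAfterSeconds(sinceLastPurge)
	body, err := purgeInProgressPayload(secs)
	if err != nil {
		return err
	}
	w.Header().Set("Content-Type", "application/json")
	w.Header().Set("Retry-After", strconv.Itoa(secs))
	w.WriteHeader(http.StatusTooManyRequests)
	_, err = w.Write(body)
	return err
}
```

Keeping the payload builder separate from the writer lets the wire shape be checked without an HTTP round-trip.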

### 2.2 Non-Goals

@@ -172,7 +188,7 @@ The walk terminates when either `Partition` advances back to `StartPartition` (f

Cost is `O(Limit)` round-trips against Pebble at peek time — tiny for the bounded result sets the SPA uses. The bound on `Limit` (max 100) prevents an operator script from accidentally issuing million-row peeks against the leader.

**Throttle.** Peek consults a **distinct per-queue admin-peek bucket**, *not* the per-queue `ReceiveMessage` budget. An earlier draft of this design merged the two; Claude r2 flagged that an operator paginating through a 10k-message DLQ could exhaust the budget that real consumers depend on. The separate admin-peek bucket defaults to a lower steady-rate (`adminPeekRPS = 5`, `adminPeekBurst = 20`) so a pagination loop cannot starve consumers.
**Throttle.** _Not yet implemented in the initial rollout — see "Out-of-scope follow-ups" at the top. Mitigation in absence: the Phase 3 implementation enforces a hard `Limit ≤ 100` per call and leader-only execution, which bounds the steady-state cost; per-queue throttling lands when the SPA wiring needs the rate-limit metric to have a real consumer._ Peek consults a **distinct per-queue admin-peek bucket**, *not* the per-queue `ReceiveMessage` budget. An earlier draft of this design merged the two; Claude r2 flagged that an operator paginating through a 10k-message DLQ could exhaust the budget that real consumers depend on. The separate admin-peek bucket defaults to a lower steady-rate (`adminPeekRPS = 5`, `adminPeekBurst = 20`) so a pagination loop cannot starve consumers.

**Bucket key format.** The existing `bucketStore` (`adapter/sqs_throttle.go`) keys on a struct `bucketKey{queue, action, incarnation}`, not a string. The admin-peek bucket therefore uses `bucketStore.charge()` directly with `action = bucketActionAdminPeek` and the queue's current incarnation, exactly like the `SendMessage` / `ReceiveMessage` paths do. Claude r4 flagged an earlier draft that described the bucket as a free-standing string-keyed map; that would have required parallel rate-limiter infrastructure and would not have been swept by `invalidateQueue()` on queue re-creation. The `bucketStore.charge(adminPeekThrottle, queueName, bucketActionAdminPeek, meta.Incarnation, 1)` form participates in the existing incarnation reset machinery automatically.
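
The two throttle paragraphs above can be sketched as a token bucket keyed by a struct rather than a string. This is an illustrative stand-in, not the real `bucketStore`: the `peekThrottle` type, its `charge` signature, the `manualClock` helper, and the string value of `bucketActionAdminPeek` are all assumptions. Only the key fields (`queue`, `action`, `incarnation`) and the defaults (`adminPeekRPS = 5`, `adminPeekBurst = 20`) come from the text.

```go
package main

import (
	"sync"
	"time"
)

// bucketKey mirrors the struct key described in the design text.
type bucketKey struct {
	queue       string
	action      string
	incarnation uint64
}

const (
	bucketActionAdminPeek = "AdminPeek" // identifier from the text; the value is assumed
	adminPeekRPS          = 5.0         // steady-state refill rate, tokens per second
	adminPeekBurst        = 20.0        // bucket capacity
)

type tokenBucket struct {
	tokens float64
	last   time.Time
}

// peekThrottle is a hypothetical, simplified stand-in for bucketStore.
type peekThrottle struct {
	mu      sync.Mutex
	now     func() time.Time
	buckets map[bucketKey]*tokenBucket
}

func newPeekThrottle(now func() time.Time) *peekThrottle {
	return &peekThrottle{now: now, buckets: make(map[bucketKey]*tokenBucket)}
}

// charge takes n tokens from the admin-peek bucket of (queue, incarnation)
// and reports whether the call is allowed. Because the incarnation is part
// of the key, a re-created queue starts from a fresh, full bucket.
func (t *peekThrottle) charge(queue string, incarnation uint64, n float64) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	now := t.now()
	key := bucketKey{queue: queue, action: bucketActionAdminPeek, incarnation: incarnation}
	b, ok := t.buckets[key]
	if !ok {
		b = &tokenBucket{tokens: adminPeekBurst, last: now}
		t.buckets[key] = b
	}
	// Refill for the time elapsed since the last charge, capped at the burst.
	if elapsed := now.Sub(b.last).Seconds(); elapsed > 0 {
		b.tokens += elapsed * adminPeekRPS
		if b.tokens > adminPeekBurst {
			b.tokens = adminPeekBurst
		}
		b.last = now
	}
	if b.tokens < n {
		return false
	}
	b.tokens -= n
	return true
}

// manualClock is a deterministic clock for demonstrations and tests.
type manualClock struct{ t time.Time }

func (c *manualClock) Now() time.Time          { return c.t }
func (c *manualClock) Advance(d time.Duration) { c.t = c.t.Add(d) }
```

With these defaults a pagination loop gets 20 immediate peeks and then 5 per second, and none of that is charged against a consumer's `ReceiveMessage` bucket because the `action` field differs.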

@@ -430,7 +446,7 @@ The queue detail page gains two new pieces of UI on top of the existing attribut
| Body preview | `body` (already truncated by backend) | first 96 chars; "…" suffix when `body_truncated`. Row click opens detail modal. |
| Size | `body_original_size` | human-readable ("1.4 kB") so operators can spot oversized messages |

Below the table: a page-size selector (20 / 50 / 100), a Refresh button, and Next / Previous controls driven by the cursor. Detail modal shows full body + every attribute + the timestamps; a "Copy as JSON" button copies the row's full record to the clipboard for manual export.
Below the table: a Refresh button and Next / Previous controls driven by the cursor. _The page-size selector (20 / 50 / 100) is NOT yet implemented; see "Out-of-scope follow-ups" at the top. Phase 5 shipped a hard default of 20 rows; the worst-case response (20 × 256 KiB = 5 MiB) stays well under network / JSON-parse budgets even without operator-tunable sizes. The selector and the response-size warning land in a follow-up if operators ask for larger pages._ The detail modal shows the full body, every attribute, and the timestamps; a "Copy as JSON" button copies the row's full record to the clipboard for manual export.

**Copy as JSON payload schema.** The clipboard payload is the exact wire shape of a single `AdminPeekedMessage` entry (top-level keys: `message_id`, `body`, `body_truncated`, `body_original_size`, `sent_timestamp`, `receive_count`, `group_id`, `deduplication_id`, `attributes`) plus a wrapper `{"schema_version": 1, "queue": "<name>", "exported_at": "<ISO8601>", "message": { … }}`. The `schema_version` is what downstream tooling pins so a future change to the export format (e.g. multi-message JSONL bundle) does not silently break exporters. Operator workflows that pipe this into a recovery tool can rely on the schema not shifting under them.
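
The export envelope described above could be consumed by a downstream recovery tool roughly as follows. This is a sketch under stated assumptions: the Go field types (for example `int64` for `sent_timestamp` and `map[string]string` for `attributes`) and the helper names `exportNow` and `decodeExport` are hypothetical; only the JSON keys and `schema_version: 1` come from the text.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// exportSchemaVersion is the version that downstream tooling pins.
const exportSchemaVersion = 1

// adminPeekedMessage lists the top-level keys named in the design text.
// The Go types are assumptions; the wire format may differ.
type adminPeekedMessage struct {
	MessageID        string            `json:"message_id"`
	Body             string            `json:"body"`
	BodyTruncated    bool              `json:"body_truncated"`
	BodyOriginalSize int64             `json:"body_original_size"`
	SentTimestamp    int64             `json:"sent_timestamp"`
	ReceiveCount     int               `json:"receive_count"`
	GroupID          string            `json:"group_id"`
	DeduplicationID  string            `json:"deduplication_id"`
	Attributes       map[string]string `json:"attributes"`
}

// exportEnvelope is the wrapper around a single peeked message.
type exportEnvelope struct {
	SchemaVersion int                `json:"schema_version"`
	Queue         string             `json:"queue"`
	ExportedAt    string             `json:"exported_at"`
	Message       adminPeekedMessage `json:"message"`
}

// exportNow (hypothetical helper) wraps one message in the envelope and
// stamps it with the current UTC time in RFC 3339 form, an ISO 8601 profile.
func exportNow(queue string, msg adminPeekedMessage) ([]byte, error) {
	return json.Marshal(exportEnvelope{
		SchemaVersion: exportSchemaVersion,
		Queue:         queue,
		ExportedAt:    time.Now().UTC().Format(time.RFC3339),
		Message:       msg,
	})
}

// decodeExport (hypothetical helper) is what a pinned consumer might do:
// parse the envelope and refuse any schema version it does not know.
func decodeExport(payload []byte) (exportEnvelope, error) {
	var env exportEnvelope
	if err := json.Unmarshal(payload, &env); err != nil {
		return exportEnvelope{}, err
	}
	if env.SchemaVersion != exportSchemaVersion {
		return exportEnvelope{}, fmt.Errorf("unsupported schema_version %d", env.SchemaVersion)
	}
	return env, nil
}
```

Rejecting unknown versions up front is what keeps a later format change (such as the multi-message JSONL bundle mentioned above) from being misread silently.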

@@ -475,6 +491,8 @@ mirroring the existing `deleteQueue` / `describeQueue` shape. `peekQueue` is `si

### 3.6 Audit and observability

_Not yet implemented in the initial rollout — see "Out-of-scope follow-ups" at the top. Mitigation in absence: the admin handler still emits the standard request-log line with `route` / `subject` / `status_code` for both purge and peek calls, so an operator can correlate "who did what when" against the application logs at audit-review time. The structured `admin.sqs.purge_queue` audit line and the two Prometheus counters land alongside the SPA wiring so the metrics have a real consumer._

New structured log line at `slog.Info` level (matches `AdminDeleteQueue`):

```