
Add monitor tuning meta-guide for run-volume and false-positive avoidance #53

Open
samgutentag wants to merge 1 commit into main from sam-gutentag/monitor-tuning


@samgutentag
Member

Summary

  • New page: flaky-tests/detection/tuning-monitors.mdx
  • Adds a docs.json nav entry under the Flaky test detection group, slotted after the three monitor-type pages
  • Ties together run-volume → monitor-type recommendations, single-failure-flap avoidance, branch coverage, recovery vs activation, monitor states (active / inactive / disabled), and a pre-auto-quarantine checklist
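The three monitor states could be modeled roughly as below. This is an illustrative sketch only: `MonitorState` and `classify_monitor` are hypothetical names, not part of any Trunk API. Only the "inactive" rule (previously triggered, no longer triggered, still enabled) comes from this PR; the "active" and "disabled" rules are assumptions, and an enabled monitor that has never triggered is left unclassified because the description does not define it.

```python
from enum import Enum
from typing import Optional


class MonitorState(Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"
    DISABLED = "disabled"


def classify_monitor(
    enabled: bool,
    currently_triggered: bool,
    previously_triggered: bool,
) -> Optional[MonitorState]:
    """Map three hypothetical flags onto the states named in this PR."""
    if not enabled:
        # Assumption: "disabled" simply means the monitor is switched off.
        return MonitorState.DISABLED
    if currently_triggered:
        # Assumption: "active" means the monitor is triggered right now.
        return MonitorState.ACTIVE
    if previously_triggered:
        # From the PR: previously triggered, no longer triggered, still enabled.
        return MonitorState.INACTIVE
    # Enabled but never triggered: not defined in the PR description.
    return None
```

Returning `None` for the undefined case keeps the sketch from guessing at behavior that still needs confirmation.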

Why

Sourced from customer feedback mining (cluster monitor-tuning-thresholds, verdict partial + first-class IA candidate, 15 pairs across 7 customers). The individual monitor pages already document each monitor type, but customers consistently ask the same set of system-level tuning questions: when to use failure-count vs failure-rate, how to avoid single-failure flips, why a monitor scoped to main misses queue-branch failures, what "inactive" means in the UI, and what to check before turning on auto-quarantine.

Items flagged for review

  • Page location. Slotted under flaky-tests/detection/ rather than flaky-tests/management/ because the page is about tuning detection behavior, not managing already-detected tests. The cluster suggestion mentioned either location; this felt cleaner since every link inside the page points at detection pages. Confirm or move.
  • Auto-quarantine recommended window: "1-3 days." Lifted from the cluster Q&A (Caseware thread). Confirm this still matches current eng guidance.
  • Pass-on-Retry default recovery = 7 days, range 1-15. Pulled from pass-on-retry-monitor.mdx and matches the cluster Gusto thread.
  • Branch patterns table (Trunk Merge Queue / GitHub Merge Queue / Graphite Merge Queue) mirrors the table in failure-rate-monitor.mdx. GitLab Merge Trains are intentionally omitted because the cluster didn't surface a question about them; failure-rate-monitor.mdx notes that they run on the target branch directly.
  • The "gap" section explicitly calls out that there's no way to distinguish "flakes detected in MQ" from "bad PR in MQ" at the monitor level, and proposes a >=2 failures in 1h failure-count threshold on queue branches as a proxy. This came directly from the Gusto thread reply ("Higher-threshold failure count monitor that marks broken is the right pattern... No good way to distinguish flakes-detected-in-MQ from actual-bad-PRs-in-MQ today."). Confirm the proxy guidance is still accurate.
  • "Inactive" state definition. The cluster note said "Copy will be improved" for the UI; the doc currently defines the state as "previously triggered, no longer triggered, still enabled." Confirm this matches the latest UI state and whether the copy change has shipped.
  • Pre-auto-quarantine cross-link points at ../agents/autofix-flaky-tests. That page exists but its content is more about the auto-investigation/PR flow than the auto-quarantine toggle. If there's a better target page for the auto-quarantine setting itself, swap it.
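For reviewers checking the proxy guidance, the ">=2 failures in 1h" failure-count rule can be read as a simple sliding-window check. The sketch below is a minimal illustration under that reading; `exceeds_failure_count` and its parameters are hypothetical and do not reflect how the monitor is actually implemented.

```python
from datetime import datetime, timedelta
from typing import Iterable


def exceeds_failure_count(
    failure_times: Iterable[datetime],
    now: datetime,
    threshold: int = 2,                    # ">=2 failures" from the proxy guidance
    window: timedelta = timedelta(hours=1),  # "in 1h" from the proxy guidance
) -> bool:
    """Return True when at least `threshold` failures fall inside the window ending at `now`."""
    cutoff = now - window
    recent = [t for t in failure_times if cutoff <= t <= now]
    return len(recent) >= threshold


now = datetime(2026, 5, 20, 12, 0)

# One failure on a queue branch: below the threshold, so no flip.
single = [now - timedelta(minutes=10)]

# Two failures within the hour: meets the threshold.
double = [now - timedelta(minutes=50), now - timedelta(minutes=5)]

# Two failures, but one is older than the window: below the threshold.
stale = [now - timedelta(hours=3), now - timedelta(minutes=5)]

print(exceeds_failure_count(single, now))
print(exceeds_failure_count(double, now))
print(exceeds_failure_count(stale, now))
```

A threshold above one is what keeps a single failed queue run from tripping the monitor. It does not tell a flaky test apart from a bad PR in the queue, which is the gap the page calls out.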

Customer signal

@samgutentag added the "needs review" label (PR sourced from customer-feedback-mining; needs human scrutiny for accuracy before merge) on May 20, 2026
@mintlify
Contributor

mintlify Bot commented May 20, 2026

Preview deployment for your docs. Learn more about Mintlify Previews.

| Project | Status | Preview | Updated (UTC) |
| --- | --- | --- | --- |
| trunk | 🟢 Ready | View Preview | May 20, 2026, 11:05 PM |

