Claude bot enhancements: observability, reliability, and prompt improvements #1535

@thehesiod

Overview

After reviewing the current Claude integration (sync bot + review bot), here are proposed enhancements organized by priority. The current setup is solid — these are incremental improvements to reliability, observability, and security.


1. Add outcome tracking to sync bot runs

usage-summary.py tracks cost/tokens but not outcomes. Without this data we can't answer: how often does the bot succeed? Which steps consume the most turns? Is the prompt getting better or worse over time?

Proposed: Extend usage-summary.py (or add a new step) to log structured data:

  • Success/failure and exit reason
  • Update type attempted (relax/bump/skip)
  • Highest step reached (e.g., "Step 5: Validate")
  • Turns consumed per step (if derivable from execution log)
  • Whether a WIP PR was created vs final PR

Write this record to the job summary and, optionally, to a tracking issue or artifact so trends can be analyzed across runs.
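A minimal sketch of what such a structured record could look like. The `SyncOutcome` class, its field names, and the helper functions are hypothetical and not part of the current usage-summary.py; the only external detail relied on is the `GITHUB_STEP_SUMMARY` environment variable that GitHub Actions provides for job summaries.

```python
import json
import os
from dataclasses import asdict, dataclass, field


@dataclass
class SyncOutcome:
    """One sync bot run. All field names are illustrative assumptions."""

    success: bool
    exit_reason: str
    update_type: str  # "relax", "bump", or "skip"
    highest_step: str  # e.g. "Step 5: Validate"
    wip_pr: bool  # True if a WIP PR was created instead of a final PR
    turns_per_step: dict = field(default_factory=dict)  # empty if not derivable


def to_json_line(outcome: SyncOutcome) -> str:
    """Serialize the outcome as one JSON line, suitable for an artifact."""
    return json.dumps(asdict(outcome), sort_keys=True)


def render_summary(outcome: SyncOutcome) -> str:
    """Render the outcome as a Markdown table for the job summary."""
    lines = ["| Field | Value |", "| --- | --- |"]
    for key, value in asdict(outcome).items():
        lines.append(f"| {key} | {json.dumps(value)} |")
    return "\n".join(lines) + "\n"


def append_to_job_summary(outcome: SyncOutcome) -> None:
    """Append the table to the job summary when running in GitHub Actions."""
    path = os.environ.get("GITHUB_STEP_SUMMARY")
    if path:
        with open(path, "a", encoding="utf-8") as fh:
            fh.write(render_summary(outcome))
```

Appending one JSON line per run to an artifact would keep trend analysis simple, since each line can be parsed independently.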


2. Add circuit breaker for repeated sync bot failures

If the sync bot fails on a given botocore version, it retries every 3 days indefinitely, burning API budget without ever alerting a human.

Proposed: Track consecutive failures (e.g., via a label or issue body). After N failures (suggest 3) on the same target version:

  • Auto-create or update a feedback issue with failure context
  • Skip subsequent runs until a human responds or the target version changes
  • Include failure summaries to help diagnose the root cause
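The breaker logic above can be sketched as a small state machine. This is an illustration under stated assumptions: `BreakerState`, `record_run`, `should_skip`, and `reset_on_human_response` are hypothetical names, and where the state is persisted (a label, an issue body) is left open.

```python
from dataclasses import dataclass

MAX_CONSECUTIVE_FAILURES = 3  # "N" from the proposal; 3 is the suggested value


@dataclass(frozen=True)
class BreakerState:
    """Failure streak for one target botocore version (hypothetical shape)."""

    target_version: str
    consecutive_failures: int = 0


def record_run(state: BreakerState, target_version: str, succeeded: bool) -> BreakerState:
    """Return the new state after a run against target_version."""
    if target_version != state.target_version:
        # A new target version starts a fresh streak.
        state = BreakerState(target_version)
    if succeeded:
        return BreakerState(target_version)
    return BreakerState(target_version, state.consecutive_failures + 1)


def should_skip(state: BreakerState, target_version: str) -> bool:
    """Skip only when the breaker is open for this exact target version."""
    if target_version != state.target_version:
        return False
    return state.consecutive_failures >= MAX_CONSECUTIVE_FAILURES


def reset_on_human_response(state: BreakerState) -> BreakerState:
    """Close the breaker once a human has responded on the feedback issue."""
    return BreakerState(state.target_version)
```

Keeping the functions pure means the persistence mechanism can be chosen separately, and the skip rule can be unit-tested without touching GitHub.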

3. Add prompt integration test (dry-run validation)

Prompt edits can silently degrade the bot. Example: the recent envsubst bug erased $VERSION from git commands in the prompt, and this wasn't caught until a live run.

Proposed: Add a CI check (on PRs touching .github/botocore-sync-prompt.md or botocore-sync.yml) that:

  • Runs the envsubst substitution with mock values and validates no unintended variables are erased
  • Checks that all expected template variables ($LATEST_BOTOCORE, etc.) are present pre-substitution
  • Optionally runs a dry-run against a known botocore version diff to validate the bot reaches the expected decision (relax vs bump)
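The first two checks could be sketched as follows. The function names are hypothetical, and the sketch assumes the CI job has already produced `rendered` by running the workflow's real envsubst step with mock values; it only compares the text before and after substitution.

```python
import re

# Matches $NAME and ${NAME}, the two forms envsubst substitutes.
VAR_PATTERN = re.compile(
    r"\$(?:\{([A-Za-z_][A-Za-z0-9_]*)\}|([A-Za-z_][A-Za-z0-9_]*))"
)


def find_variables(text: str) -> set:
    """Return the names of all $NAME / ${NAME} references in text."""
    return {m.group(1) or m.group(2) for m in VAR_PATTERN.finditer(text)}


def check_prompt(template: str, rendered: str, expected: set) -> list:
    """Compare the prompt before and after substitution; return error strings."""
    errors = []
    before = find_variables(template)
    after = find_variables(rendered)
    for name in sorted(expected - before):
        errors.append(f"missing template variable: ${name}")
    for name in sorted(expected & after):
        errors.append(f"template variable not substituted: ${name}")
    # Variables meant for the shell (not in `expected`) must survive untouched.
    for name in sorted((before - expected) - after):
        errors.append(f"erased by substitution: ${name}")
    return errors
```

For example, with a hypothetical template line `git checkout -b bump-$VERSION` and `expected = {"LATEST_BOTOCORE"}`, a rendered prompt in which `$VERSION` has disappeared would be reported as erased. Passing an explicit variable list to envsubst is one common way to avoid that class of bug, since envsubst then substitutes only the listed variables.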

4. Split sync prompt into composable modules

botocore-sync-prompt.md is 561 lines — essentially an entire program in English. Risks:

  • Subtle instructions get lost in long context
  • A single edit can break unrelated behavior
  • Hard to test individual steps in isolation

Proposed: Split into a main orchestrator + per-step modules:

  • sync-main.md — orchestrator with step routing logic
  • sync-step-4a-relax.md, sync-step-4b-bump.md — detailed per-path instructions
  • sync-common.md — shared rules (security, git operations, pre-commit)

The workflow would concatenate the relevant modules before passing to Claude. This also enables step-level prompt testing.
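A rough sketch of the concatenation step, using the module file names proposed above. The ordering (orchestrator, then step modules, then shared rules) and the function names are assumptions. It also assumes the update path may or may not be known before the prompt is built, so passing no `update_type` includes both step modules.

```python
from pathlib import Path

MAIN = "sync-main.md"
COMMON = "sync-common.md"
STEP_MODULES = {
    "relax": "sync-step-4a-relax.md",
    "bump": "sync-step-4b-bump.md",
}


def modules_for(update_type=None) -> list:
    """Return module file names in concatenation order."""
    if update_type is None:
        steps = list(STEP_MODULES.values())  # path unknown: include every step
    elif update_type in STEP_MODULES:
        steps = [STEP_MODULES[update_type]]
    else:
        raise ValueError(f"unknown update type: {update_type!r}")
    return [MAIN, *steps, COMMON]


def load_modules(prompt_dir) -> dict:
    """Read every sync-*.md module in prompt_dir into a name -> text mapping."""
    return {
        path.name: path.read_text(encoding="utf-8")
        for path in Path(prompt_dir).glob("sync-*.md")
    }


def build_prompt(modules: dict, update_type=None) -> str:
    """Join the selected modules, separated by blank lines."""
    parts = [modules[name].strip() for name in modules_for(update_type)]
    return "\n\n".join(parts) + "\n"
```

Because `build_prompt` takes a plain mapping, a step-level test can pass in a single real module alongside stub text for the others.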

Note: This is the largest change and should be validated carefully — splitting context across files could hurt coherence if done poorly. Worth prototyping with a single step first.


5. Enhance review bot with domain-specific checks

The review bot is generic. Given aiobotocore's specialized async override patterns, it should explicitly check for:

  • Missing await on I/O operations in Aio* classes
  • Sync methods that should be async (overriding botocore methods that do I/O)
  • Incorrect class naming (missing Aio prefix)
  • Missing resolve_awaitable() on mixed sync/async callbacks
  • Resource cleanup — async context managers for clients/sessions
  • test_patches.py hash updates when patched code changes

This could be added to the review prompt or as a dedicated section in CLAUDE.md that the review bot reads.
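To make two of these checks concrete, here is a small self-contained illustration of the bug pattern behind "missing await" and "missing resolve_awaitable()". The `resolve` helper is a local stand-in written for this example; that it behaves like aiobotocore's resolve_awaitable() is an assumption, and the handler functions are hypothetical.

```python
import asyncio
import inspect


async def resolve(obj):
    """Await obj if it is awaitable, otherwise return it unchanged."""
    if inspect.isawaitable(obj):
        return await obj
    return obj


def sync_handler(value):
    return value + 1


async def async_handler(value):
    return value + 1


async def call_handlers(handlers, value):
    """Correct: works for a mix of sync and async callbacks."""
    return [await resolve(handler(value)) for handler in handlers]


async def buggy_call_handlers(handlers, value):
    """Bug a reviewer should flag: async results are never awaited."""
    return [handler(value) for handler in handlers]
```

With `[sync_handler, async_handler]` as input, the correct version returns two integers, while the buggy version returns an integer and an un-awaited coroutine object. That mistake passes silently unless a test inspects the result, which is why it is worth an explicit line in the review prompt.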


6. Replace permission blocklist with tool allowlist

The sync bot uses --dangerously-skip-permissions with a PreToolUse hook blocking git commit. This is a blocklist — anything not blocked is allowed.

Proposed: Switch to an allowlist approach:

  • Define explicit list of allowed tools/patterns (Bash commands, MCP operations, file operations)
  • Block everything else by default
  • This is a stronger security posture, especially as the bot's capabilities grow

Caveat: May require changes to claude-code-action or more granular hook logic. Worth evaluating feasibility before committing.
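As a starting point for the feasibility evaluation, the allowlist decision could be a pure function like the sketch below. Everything here is an assumption: the tool names, the allowed command prefixes, and the `tool_input["command"]` shape are illustrative, and how the function is wired into a PreToolUse hook is left open.

```python
import shlex

# Hypothetical allowlist; the real set would come from auditing bot runs.
ALLOWED_TOOLS = {"Read", "Grep", "Glob", "Edit", "Write"}
ALLOWED_COMMAND_PREFIXES = [
    ("git", "status"),
    ("git", "diff"),
    ("git", "add"),
    ("pre-commit", "run"),
]
# Reject anything that could chain, redirect, or expand into another command.
SHELL_METACHARACTERS = set(";&|`$<>()\n")


def is_allowed(tool_name: str, tool_input: dict) -> bool:
    """Deny by default; allow only listed tools and listed command prefixes."""
    if tool_name in ALLOWED_TOOLS:
        return True
    if tool_name != "Bash":
        return False
    command = tool_input.get("command", "")
    if any(ch in SHELL_METACHARACTERS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar parse failures
        return False
    return any(
        tuple(tokens[: len(prefix)]) == prefix
        for prefix in ALLOWED_COMMAND_PREFIXES
    )
```

Under this scheme `git commit` needs no dedicated rule: it is rejected simply because no allowed prefix matches it. Matching on parsed tokens rather than raw string prefixes is deliberate, since a prefix match on `git status` would also accept `git status; git commit`.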


Priority suggestion

| # | Enhancement | Effort | Impact |
| --- | --- | --- | --- |
| 1 | Outcome tracking | Low | High: enables all other optimization |
| 2 | Circuit breaker | Low | Medium: prevents waste on stuck versions |
| 3 | Prompt integration test | Medium | High: catches silent regressions |
| 4 | Split prompt modules | High | Medium: better maintainability long-term |
| 5 | Domain-specific review | Medium | Medium: catches real bugs in PRs |
| 6 | Tool allowlist | Medium | Low-Medium: security hardening |
