Add Claude AI test failure analysis to Slack notifications by robbycochran · Pull Request #3381 · stackrox/collector

robbycochran · 2026-05-20T19:54:08Z

Summary

Adds Claude-based test failure analysis that runs within the same workflow as the tests and includes the resulting root-cause summary in the Slack notification.

New Architecture

Instead of a separate workflow_run, this adds an analyze-failures job to the existing test workflows:

Test Workflow
 ├── amd64-integration-tests → (may fail)
 ├── arm64-integration-tests → (may fail)
 ├── s390x-integration-tests → (may fail)
 ├── ppc64le-integration-tests → (may fail)
 │
 ├── analyze-failures (runs if any fail)
 │    ├── Downloads all artifacts
 │    ├── Parses JUnit XML
 │    ├── Calls Claude AI
 │    └── Uploads analysis-report.md
 │
 └── notify (waits for analysis)
      ├── Downloads analysis-report.md
      └── Posts to Slack with insights

What Changed

Modified Workflows

.github/workflows/integration-tests.yml
- Added analyze-failures job before notify
- Modified notify to download and include analysis report
- Graceful fallback if analysis unavailable
.github/workflows/unit-tests.yml
- Same pattern for unit test failures

New Files

.github/scripts/analyze_test_failures.py
- Parses JUnit XML test reports
- Extracts failure patterns
- Calls Claude via Vertex AI
- Generates markdown report
.github/scripts/requirements.txt
- anthropic - Claude API client
- google-auth - GCP authentication
- google-cloud-aiplatform - Vertex AI
.github/scripts/README.md
- Setup instructions
- Usage examples
- Troubleshooting guide
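The JUnit parsing step described for analyze_test_failures.py can be sketched roughly as follows. This is a minimal illustration, not the script from this PR: the names `Issue`, `collect_issues`, and `summarize` are hypothetical, and the suite and test names in `SAMPLE` are made up for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from dataclasses import dataclass


@dataclass
class Issue:
    suite: str
    name: str
    kind: str      # "failure" or "error"
    message: str


def collect_issues(junit_xml: str) -> list[Issue]:
    """Return every <failure>/<error> element found in a JUnit XML document."""
    root = ET.fromstring(junit_xml)
    issues = []
    # root.iter() also yields the root itself, so this handles both a
    # <testsuites> wrapper and a bare <testsuite> root element.
    for suite in root.iter("testsuite"):
        # findall() only looks at direct children, so test cases in
        # nested suites are not counted twice.
        for case in suite.findall("testcase"):
            for kind in ("failure", "error"):
                for node in case.findall(kind):
                    message = node.get("message") or (node.text or "").strip()
                    issues.append(Issue(
                        suite=suite.get("name", ""),
                        name=case.get("name", ""),
                        kind=kind,
                        message=message,
                    ))
    return issues


def summarize(issues: list[Issue]) -> dict[str, int]:
    """Count failures and errors, matching the Statistics section of the report."""
    counts = Counter(issue.kind for issue in issues)
    return {"failure": counts["failure"], "error": counts["error"]}


# Hypothetical sample input; the names are illustrative only.
SAMPLE = """<testsuites>
  <testsuite name="TestProcessNetwork" tests="2" failures="1" errors="0">
    <testcase name="TestConnections"/>
    <testcase name="TestLsmHook">
      <failure message="permission denied">attach lsm/file_open failed</failure>
    </testcase>
  </testsuite>
</testsuites>"""

if __name__ == "__main__":
    found = collect_issues(SAMPLE)
    print(summarize(found))
    for issue in found:
        print(f"{issue.suite}.{issue.name}: {issue.kind}: {issue.message}")
```

Only the standard library is used here, so the sketch can be run without installing the packages listed in requirements.txt.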

Example Output

Before (current):

@acs-collector-oncall
Integration tests failed.

After (with AI):

@acs-collector-oncall

🤖 AI Analysis

**Root Cause**: eBPF LSM hook attachment failures on RHCOS nodes. 
Tests failed with "permission denied" when attaching to lsm/file_open.

**Pattern**: All 3 failures occurred on RHCOS amd64 test VMs.

**Recommendations**:
• Check kernel version on RHCOS VMs - LSM BPF requires kernel 5.7+
• Verify CONFIG_BPF_LSM is enabled in kernel config
• Review recent changes to eBPF program loading logic
• Check for SELinux policy changes blocking BPF operations

---
**Statistics**
• Total Failures: 3
• Total Errors: 0
• Failed Jobs: amd64-integration-tests

Key Benefits

✅ Same workflow - No workflow_run complexity, better job tracking
✅ Job dependencies - notify waits for analyze-failures to complete
✅ Markdown report - Clean format, easy to include in Slack
✅ Graceful fallback - Still notifies even if analysis fails
✅ Artifact-based - Analysis report uploaded as workflow artifact
✅ Context available - All test artifacts already in the workflow
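The graceful-fallback behaviour listed above could look something like the sketch below. It is an assumption about the shape of the logic, not the notify job itself: `build_slack_payload` and the fallback wording are hypothetical. The only external fact relied on is that a Slack incoming webhook accepts a JSON body with a `text` field.

```python
import json
from pathlib import Path

# Hypothetical fallback wording; the real notify job may phrase this differently.
FALLBACK = "Integration tests failed. (AI analysis unavailable; see the workflow run for logs.)"


def build_slack_payload(report_path: Path, mention: str = "@acs-collector-oncall") -> str:
    """Build the JSON body for a Slack incoming webhook.

    Uses the analysis report when it exists and is non-empty,
    otherwise falls back to a plain failure message.
    """
    try:
        report = report_path.read_text(encoding="utf-8").strip()
    except OSError:
        # Missing or unreadable report: still notify, just without analysis.
        report = ""
    body = report if report else FALLBACK
    return json.dumps({"text": f"{mention}\n{body}"})


if __name__ == "__main__":
    print(build_slack_payload(Path("analysis-report.md")))
```

The payload is built but not sent, so the sketch can be exercised without the SLACK_COLLECTOR_ONCALL_WEBHOOK secret.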

Required Secrets

Add these to repository secrets (see .github/scripts/README.md for setup):

# Required
GCP_CLAUDE_SERVICE_ACCOUNT_KEY  # GCP service account JSON with Vertex AI access

# Optional (have defaults)
GCP_CLAUDE_PROJECT_ID           # Defaults to "rhacs-eng"
GCP_CLAUDE_REGION               # Defaults to "us-central1"

Existing secret used:

SLACK_COLLECTOR_ONCALL_WEBHOOK ✅ Already configured
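As an illustration of how the defaults listed for the optional GCP secrets might be resolved, here is a small sketch. The function name `load_vertex_config` is hypothetical, and it assumes the secrets are exposed to the script as environment variables of the same name.

```python
import os
from typing import Mapping


def load_vertex_config(env: Mapping[str, str] = os.environ) -> dict:
    """Resolve Vertex AI settings, applying the defaults from the PR description."""
    key = env.get("GCP_CLAUDE_SERVICE_ACCOUNT_KEY", "")
    if not key:
        raise RuntimeError("GCP_CLAUDE_SERVICE_ACCOUNT_KEY is required")
    return {
        "service_account_key": key,
        # `or` rather than a .get() default: an unset GitHub secret reaches
        # the step as an empty string, which should also trigger the default.
        "project_id": env.get("GCP_CLAUDE_PROJECT_ID") or "rhacs-eng",
        "region": env.get("GCP_CLAUDE_REGION") or "us-central1",
    }


if __name__ == "__main__":
    # Example values only; no real credentials are involved.
    demo = load_vertex_config({"GCP_CLAUDE_SERVICE_ACCOUNT_KEY": "{}"})
    print(demo["project_id"], demo["region"])
```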

Setup Instructions

1. Create GCP Service Account

gcloud iam service-accounts create claude-test-analyzer \
  --display-name="Claude Test Failure Analyzer"

2. Grant Vertex AI Access

gcloud projects add-iam-policy-binding rhacs-eng \
  --member="serviceAccount:claude-test-analyzer@rhacs-eng.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user"

3. Create Key

gcloud iam service-accounts keys create key.json \
  --iam-account=claude-test-analyzer@rhacs-eng.iam.gserviceaccount.com

4. Add to GitHub Secrets

Repository Settings → Secrets and variables → Actions → New secret:

Name: GCP_CLAUDE_SERVICE_ACCOUNT_KEY
Value: Paste entire contents of key.json

Testing

The workflow will activate on the next test failure. To test manually:

# Install dependencies
pip install -r .github/scripts/requirements.txt

# Run analysis locally
python .github/scripts/analyze_test_failures.py \
  --artifacts-dir test-artifacts \
  --output-file analysis-report.md \
  --workflow-name "Integration Tests" \
  --failed-jobs "amd64-integration-tests"

# View report
cat analysis-report.md
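The local run above needs a test-artifacts directory containing at least one JUnit report. A synthetic one can be generated with a sketch like the following; the directory layout (one sub-directory per job with a `junit.xml` inside), the function name `write_fake_junit`, and the suite and test names are all assumptions for illustration, not the layout the workflow is known to produce.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def write_fake_junit(artifacts_dir: Path, job: str = "amd64-integration-tests") -> Path:
    """Write a small JUnit XML file with one passing and one failing test."""
    suite = ET.Element("testsuite", name="FakeSuite", tests="2", failures="1", errors="0")
    ET.SubElement(suite, "testcase", name="test_passes")
    failing = ET.SubElement(suite, "testcase", name="test_fails")
    failure = ET.SubElement(failing, "failure", message="synthetic failure")
    failure.text = "This failure is fabricated for local testing."

    # Assumed layout: <artifacts_dir>/<job>/junit.xml
    out = artifacts_dir / job / "junit.xml"
    out.parent.mkdir(parents=True, exist_ok=True)
    ET.ElementTree(suite).write(str(out), encoding="utf-8", xml_declaration=True)
    return out


if __name__ == "__main__":
    print(write_fake_junit(Path("test-artifacts")))
```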

Advantages Over workflow_run Approach

| Feature | This PR | workflow_run |
| --- | --- | --- |
| Workflow complexity | ✅ Single workflow | ❌ Two workflows |
| Job dependencies | ✅ Native `needs:` | ❌ Separate trigger |
| Artifact access | ✅ Direct download | ❌ Cross-workflow |
| Debugging | ✅ Same workflow run | ❌ Different run |
| Failure handling | ✅ Built-in fallback | ❌ Must handle separately |

Future Enhancements

- Correlate failures with recent commits
- Link to similar past failures
- Track failure patterns over time
- Auto-create issues for recurring failures

Testing Plan

✅ Workflow syntax validated
✅ Python script tested locally
⏳ Will trigger on next test failure
⏳ Monitor for false positives
⏳ Iterate based on team feedback

cc @stackrox/acs-collector

Add analyze-failures job that runs within the same workflow when tests fail, generating a markdown report that is included in Slack notifications.

Changes:
- Add analyze-failures job to integration-tests.yml
- Add analyze-failures job to unit-tests.yml
- Modify notify jobs to wait for and include analysis report
- Add analyze_test_failures.py script using Claude via Vertex AI
- Generate markdown reports with root cause, patterns, recommendations

Architecture:
1. Test jobs run (may fail)
2. analyze-failures job downloads artifacts, parses JUnit XML, calls Claude
3. Uploads analysis-report.md as artifact
4. notify job waits for analysis, downloads report, includes in Slack

Benefits:
- Same workflow (no separate workflow_run complexity)
- Job dependencies ensure analysis completes before notify
- Markdown report format easy to include in Slack
- Graceful fallback if analysis fails
- All test context available in same workflow

Required secrets:
- GCP_CLAUDE_SERVICE_ACCOUNT_KEY: Service account with Vertex AI access
- GCP_CLAUDE_PROJECT_ID: GCP project (optional, defaults to rhacs-eng)
- GCP_CLAUDE_REGION: Vertex AI region (optional, defaults to us-central1)

See .github/scripts/README.md for setup instructions.

Replace Python-based test analysis with Claude Code skill for deeper investigation capabilities:
- Claude can now read source code, git history, and test implementation
- More specific root cause identification with file/line references
- Better platform-specific pattern detection
- Simplified workflow dependencies (no Python setup needed)

The skill has full tool access (Read, Grep, Bash) to investigate failures rather than just summarizing error messages.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Use npm install for more reliable package management and better caching support in GitHub Actions. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

codecov-commenter · 2026-05-20T20:12:26Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 27.34%. Comparing base (01135a9) to head (f3b267c).
⚠️ Report is 1 commit behind head on master.
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #3381   +/-   ##
=======================================
  Coverage   27.34%   27.34%           
=======================================
  Files          95       95           
  Lines        5420     5420           
  Branches     2545     2545           
=======================================
  Hits         1482     1482           
  Misses       3211     3211           
  Partials      727      727

| Flag | Coverage Δ |
| --- | --- |
| collector-unit-tests | `27.34% <ø> (ø)` |

Flags with carried forward coverage won't be shown.

Replace npm-based CLI installation with official claude-code-base-action for better integration and reliability:
- Direct Vertex AI authentication via use_vertex flag
- Inline prompt with detailed analysis instructions
- Proper GitHub Action outputs and error handling
- No manual installation or dependency management needed
- Better caching and optimization

The action runs headless analysis on test artifacts and generates markdown reports with root cause, evidence, and recommendations.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Unit tests are stable and rarely fail, so AI analysis isn't needed there. Keep analysis only for integration tests where failures are more complex and platform-specific. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Provides a safe way to test the Claude analysis + Slack notification flow before relying on it for production failures.

Usage:
- Add label 'test-oncall-workflow' to any PR, or
- Manually trigger via workflow_dispatch

The test:
1. Creates realistic fake test failure artifacts (JUnit XML)
2. Runs Claude analysis on the failures
3. Posts to Slack with [TEST] prefix

This allows verification that:
- GCP authentication works
- Claude can parse test reports and analyze code
- Slack webhook is configured correctly
- The full end-to-end flow works

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Clarify what the test workflow actually validates:
- Workflow execution and job dependencies
- Artifact creation and download
- GCP Vertex AI authentication
- Claude action integration
- Slack webhook delivery
- Report generation and formatting

The test uses synthetic JUnit XML files to verify the infrastructure works end-to-end, not to validate Claude's analysis quality on real collector failures.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

GCP_CLAUDE_SERVICE_ACCOUNT_KEY is already configured, no need to include setup instructions. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

New workflows in PRs can't be triggered by PR events until merged to the base branch. Change to workflow_dispatch so it can be tested after merge. Also adds weekly scheduled run to verify the workflow continues to work over time. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Can trigger on PR branch via workflow_dispatch. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Ensures the test workflow uses the same GCP Vertex AI configuration as the production workflow. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- ANTHROPIC_VERTEX_PROJECT_ID from GCP_CLAUDE_PROJECT_ID secret (no default)
- CLOUD_ML_REGION set to global for Vertex AI

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

workflow_dispatch doesn't work for workflows that only exist in PR branches. Add push trigger so it runs when the workflow file is updated. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Created analyze-and-notify.yml as a reusable workflow that contains the actual Claude analysis and Slack notification logic. Both the production integration-tests workflow and the test workflow now call this same reusable workflow, ensuring:
- No code duplication
- Test runs the exact same logic as production
- Single source of truth for analysis behavior
- Easy to maintain and update

Also fixed CLOUD_ML_REGION from 'global' to 'us-east5' - Anthropic models on Vertex AI are not available in the global endpoint.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Per user feedback, should use global region. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Artifacts don't work across jobs in reusable workflows. Combined analyze-failures and notify into a single job so the analysis report can be read directly without artifact upload/download. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Separated Claude analysis from Slack notification as requested. Added explicit check for analysis-report.md creation to debug why Claude action isn't generating the file.

This will help identify if it's:
- Model availability issue (404 error)
- Permissions issue
- File path issue

The artifact upload now only happens if the report exists, and notify job handles missing artifact gracefully.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

The 404 error suggests the default model isn't available. Trying:
- Explicit model: claude-3-5-sonnet-v2@20241022 (Vertex AI naming)
- Region: us-east5 (common region where Claude is available)

If this doesn't work, we may need to:
- Check what models are available in the GCP project
- Use Anthropic API directly instead of Vertex AI
- Verify service account has proper permissions

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Keeping region as global per user requirement. Still trying explicit model: claude-3-5-sonnet-v2@20241022 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Using claude-opus-4-6 model for Vertex AI. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Trying claude-sonnet-4-6 for Vertex AI availability. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Changed to us-east5 region and removed explicit model parameter to use the action's default. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

- Added continue-on-error to Claude step
- Show Claude action outcome/conclusion
- Display first 20 lines of report if created
- Always run check steps to see what happened

This will help diagnose why the analysis report isn't being created.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

If Claude action doesn't succeed, generate a fallback analysis report that:
- Explains Claude failed
- Shows the action outcome
- Provides debug steps
- Lists what to check in GCP/Vertex AI

This allows the workflow to complete successfully and send a Slack notification with useful debugging info instead of just 'artifact not found'.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Added:
- allowed_tools: Read, Grep, Glob, Bash so Claude can access code
- Explicit instruction to create analysis-report.md using cat/heredoc

The action runs Claude but doesn't automatically create files from the prompt. Claude needs explicit tools and instructions to write the analysis-report.md file.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Following the pattern from PR #3170, moved all analysis logic into a proper Claude Code skill.

Added:
- .claude/commands/analyze-test-failures.md
  - Skill definition with usage, implementation details
  - Follows repo's slash command pattern
  - Contains all analysis instructions and report format

Changed:
- .github/workflows/analyze-and-notify.yml
  - Simplified prompt to just invoke the skill
  - Removed ~60 lines of inline instructions
  - Cleaner, more maintainable workflow

Benefits:
- Single source of truth for analysis logic
- Can be tested locally: `claude /analyze-test-failures`
- Can be improved independently of workflow
- Follows established repo conventions

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

The claude-code-base-action doesn't auto-load skills from .claude/commands/, so it was asking for approval to run an unknown command. Solution: Inline the instructions in the workflow prompt. The skill file (.claude/commands/analyze-test-failures.md) remains as documentation and can be used when running Claude CLI locally, but the workflow uses the instructions directly. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

After testing, claude-code-base-action doesn't auto-execute skills from .claude/commands/ the same way the CLI does.

Solution:
- Keep .claude/commands/analyze-test-failures.md as documentation
- Inline the instructions in the workflow for reliability
- Condensed to ~30 lines (readable, maintainable)

The skill file is still valuable for:
- Local testing: read it to understand the approach
- Documentation of the analysis methodology
- Future migration if action adds skill support

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Using append_system_prompt to tell Claude that commands from .claude/commands/ are trusted and should execute without approval.

This approach:
- Keeps the workflow prompt minimal (4 lines)
- Trusts the skill definition in .claude/commands/analyze-test-failures.md
- Uses /analyze-test-failures command directly

If this works, we get:
✅ DRY - single source of truth in skill file
✅ Minimal workflow code
✅ Same skill works locally and in CI

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Instead of invoking /analyze-test-failures directly (which requires approval), tell Claude to:
1. Read the skill file: .claude/commands/analyze-test-failures.md
2. Follow its instructions
3. Auto-approve Read, Grep, Glob, Bash tools via settings

This approach:
- Uses action's settings feature for permissions
- Skill file is the source of truth
- Claude reads and executes the documented steps
- No approval prompts needed

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Created proper plugin structure:
- .claude/plugins/test-failure-analyzer/manifest.json
- .claude/plugins/test-failure-analyzer/analyze-test-failures.md

Using action's built-in plugin support:
- plugin_marketplaces: file://.claude/plugins
- plugins: test-failure-analyzer

Workflow now:
1. Action loads plugin from local .claude/plugins/
2. Plugin provides /analyze-test-failures command
3. Workflow invokes it with simple prompt

This is the proper way to use the action's plugin feature.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Added:
- allowed_tools: Bash so Claude can create files
- Explicit reminder in prompt to create the file
- Fallback step to capture Claude's text output and write to file

If Claude doesn't create analysis-report.md (as we saw - it just returned text), this fallback reads the execution output JSON and writes the result field to analysis-report.md.

This ensures we always get a report, either:
1. Claude creates it directly (preferred)
2. We capture Claude's output and write it (fallback)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
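The fallback described in this commit (read the execution output JSON, write its result field to analysis-report.md) might be sketched as below. The exact shape of the action's execution output is an assumption here: the sketch accepts either a single JSON object or a list of objects and uses the last non-empty string found under a `result` key. The function name `ensure_report` is hypothetical.

```python
import json
from pathlib import Path


def ensure_report(execution_json: str, report_path: Path) -> bool:
    """Make sure the report exists; return True if it does afterwards.

    Preferred case: Claude already wrote the file, so nothing is changed.
    Fallback case: copy the `result` text out of the execution output.
    """
    if report_path.is_file() and report_path.stat().st_size > 0:
        return True
    try:
        data = json.loads(execution_json)
    except json.JSONDecodeError:
        return False
    # Assumed shapes: one object, or a list of message objects.
    entries = data if isinstance(data, list) else [data]
    for entry in reversed(entries):
        if not isinstance(entry, dict):
            continue
        result = entry.get("result")
        if isinstance(result, str) and result.strip():
            report_path.write_text(result, encoding="utf-8")
            return True
    return False


if __name__ == "__main__":
    # Hypothetical execution log for demonstration.
    log = json.dumps([{"type": "system"}, {"type": "result", "result": "**Root Cause**: example"}])
    print(ensure_report(log, Path("analysis-report.md")))
```

Checking for an existing file first keeps a report written by Claude from being overwritten by the captured text.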

Changes to skill:
- Changed from 'IMPORTANT' to 'CRITICAL: File Creation Step'
- Added actual template in the bash command example
- Emphasized 'DO NOT just summarize' - must create file
- Explained workflow depends on file existing

Changes to workflow prompt:
- Changed to 'CRITICAL REQUIREMENT'
- Added explicit bash command example in prompt
- Emphasized file MUST exist for workflow to continue

Changes to check step:
- Search entire workspace with find command
- Show current directory
- List all files to debug where Claude is running

This should make it unmistakably clear that creating the file is not optional.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

The heredoc with GitHub expressions was causing YAML parsing errors. Changed to use echo statements instead of cat heredoc. This avoids the YAML parser trying to interpret the markdown content with ** as YAML aliases. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

robbycochran requested a review from a team as a code owner May 20, 2026 19:54

StackRox Automation and others added 2 commits May 20, 2026 20:01

Install Claude Code via npm instead of curl script

14db773

StackRox Automation and others added 3 commits May 20, 2026 20:28

Remove Claude analysis from unit tests workflow

ded5869

robbycochran added the test-oncall-workflow label May 20, 2026

StackRox Automation and others added 21 commits May 20, 2026 20:35

Remove GCP secret setup instructions

0d3b15e

Remove weekly schedule, just use manual trigger

e274cdc

Add ANTHROPIC_VERTEX_PROJECT_ID env vars to test workflow

89c4bda

Use GCP_CLAUDE_PROJECT_ID secret and set region to global

6319786

Add push trigger to test workflow for PR branch

db2b3dd

Change CLOUD_ML_REGION back to global

503a57e

Trigger test workflow on analyze-and-notify.yml changes

a07c3b8

Set CLOUD_ML_REGION back to global as required

b080b87

Change model to claude-opus-4-6

5749b50

Change model to claude-sonnet-4-6

fc209c3

Set region to us-east5 and use default model

f3b267c

StackRox Automation and others added 8 commits May 20, 2026 22:41

robbycochran wants to merge 35 commits into master from add-test-analysis-job
