
fix(integration): Fix flaky TestParquetFuzz by uploading block before cortex starts #7499

Open
yeya24 wants to merge 1 commit into master from fix/parquet-fuzz-test-race


@yeya24 commented May 10, 2026

What this PR does

Fixes the flaky TestParquetFuzz integration test.

Root Cause

The test was flaky because cortex (including the compactor) started before the block was uploaded to minio. This caused a race condition:

  1. Compactor scans immediately on startup, finds no block or a partially-uploaded block ("skipped partial block when updating bucket index")
  2. Block upload completes after the first scan
  3. The compactor cleaner channel fills up ("unable to push cleaning job to usersChan"), delaying subsequent scans
  4. The 30s poll timeout expires before the bucket index is updated with the block

This was only observed on arm64 CI runners, which are slower.

Fix

  1. Reorder operations: Upload the block to minio BEFORE starting cortex, so the first compactor scan finds the complete block and includes it in the bucket index immediately.
  2. Increase poll timeout: 30s → 60s as a safety margin for slow CI runners.

Why not just increase the timeout?

Increasing the timeout alone would mask the root cause. The real issue is that the compactor's first scan races with the block upload. By uploading first, we eliminate the race entirely and the bucket index is created on the first scan cycle.

@yeya24 force-pushed the fix/parquet-fuzz-test-race branch from 590072a to f8d381b on May 11, 2026 00:58
Two issues caused this test to be flaky:

1. Cortex started before the block was uploaded to minio, causing the
   compactor to see a partial block on its first scan and skip it.
   The bucket index was never updated within the 30s poll timeout.
   Fix: upload the block before starting cortex.

2. The fuzz test compared Cortex (with parquet queryable) against
   standalone Prometheus without skipping queries with known
   cross-version semantic differences (stdvar/stddev changed in
   prometheus/prometheus#14941). The random seed meant some runs
   would generate these queries and fail.
   Fix: pass skipStdAggregations=true since this test compares
   against a standalone Prometheus instance.

Also increase the poll timeout from 30s to 60s as a safety margin
for slow arm64 CI runners.

Signed-off-by: Ben Ye <benye@amazon.com>
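
The second fix in the commit message amounts to filtering the randomly generated queries. A minimal sketch of such a filter follows. Only the `skipStdAggregations` name comes from the commit message; the `shouldSkip` helper and its substring matching are assumptions for illustration, and the actual test may select queries differently.

```go
package main

import (
	"fmt"
	"strings"
)

// shouldSkip is a hypothetical helper: when skipStdAggregations is set, it
// conservatively excludes any generated query that mentions stddev or stdvar,
// since their results can differ between Prometheus versions.
func shouldSkip(query string, skipStdAggregations bool) bool {
	if !skipStdAggregations {
		return false
	}
	return strings.Contains(query, "stddev") || strings.Contains(query, "stdvar")
}

func main() {
	queries := []string{
		`sum(rate(http_requests_total[5m]))`,
		`stddev(up)`,
		`stdvar by (job) (up)`,
	}
	for _, q := range queries {
		fmt.Printf("%-40s skip=%v\n", q, shouldSkip(q, true))
	}
}
```

Substring matching also excludes functions such as `stddev_over_time`, which is broader than strictly required but keeps a randomly seeded comparison deterministic.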
@yeya24 force-pushed the fix/parquet-fuzz-test-race branch from f8d381b to b681b3d on May 11, 2026 05:00