Build: Speed up Spark CI with parallel test execution #16357
Draft
kevinjqliu wants to merge 6 commits into
Pull request overview
This PR reduces Spark CI wall-clock time by enabling Gradle test parallelism, aligning Spark CI behavior with the existing Flink CI workflow.
Changes:
- Adds `-DtestParallelism=auto` to the Spark CI `./gradlew ... :check` invocation to scale test execution to runner CPU capacity.
Run :iceberg-spark and :iceberg-spark-extensions (+ :iceberg-spark-runtime) :check tasks in separate matrix jobs instead of serially within one Gradle invocation. On the slowest matrix combo this removes ~26 min from the critical path.

Bump -DtestParallelism from 'auto' (= ceil(nproc/2) = 2 forks on a 4 vCPU ubuntu-24.04 runner) to nproc-1 (= 3 forks). Each Spark test fork uses ~3.16 GB heap, so 3 forks ~= 9.5 GB leaves headroom for the Gradle daemon and Spark driver overhead inside each fork.

Raise max-parallel from 15 to 30 to accommodate the doubled job count (2 modules x 12 surviving (jvm, spark, scala) combos = 24 jobs).
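The fork and job counts quoted above can be checked with a few lines of shell arithmetic. This is only a sketch: the function names are hypothetical, the ceil(nproc/2) rule for 'auto' is taken from the commit message rather than from the build script, and the clamp to at least one fork is an added assumption.

```shell
#!/usr/bin/env bash
# Illustrative helpers only; the function names are hypothetical and not from the PR.

# 'auto' as described in the commit message: ceil(nproc / 2).
auto_forks() {
  local cpus=$1
  echo $(( (cpus + 1) / 2 ))
}

# The bumped setting: nproc - 1, clamped (an assumption) so there is at least one fork.
nproc_minus_one_forks() {
  local cpus=$1
  local forks=$(( cpus - 1 ))
  if (( forks < 1 )); then forks=1; fi
  echo "$forks"
}

# Matrix size: modules x surviving (jvm, spark, scala) combos.
matrix_jobs() {
  echo $(( $1 * $2 ))
}

cpus=4   # vCPU count quoted for the ubuntu-24.04 runner; use "$(nproc)" on a live runner
echo "auto forks:    $(auto_forks "$cpus")"
echo "nproc-1 forks: $(nproc_minus_one_forks "$cpus")"
echo "matrix jobs:   $(matrix_jobs 2 12)"
```

On a 4 vCPU runner this gives 2 forks for 'auto', 3 forks for nproc-1, and 24 matrix jobs, which fits under the raised max-parallel of 30.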
- fail-fast: false so an OOM/flake in one matrix combo doesn't cancel the other ~23 jobs and we can see whether failures are correlated.
- timeout-minutes: 60 so a hung Gradle daemon or stuck Spark driver bails out fast instead of waiting for the default 6 h job timeout.
3 forks * ~3.16 GB heap (plus Spark off-heap and the 4 GB Gradle daemon) exceeded the 16 GB ubuntu-24.04 runner ceiling and triggered the kernel OOM-killer, surfacing as 'The runner has received a shutdown signal'. Drop the spark-only step to 2 forks; iceberg-spark-extensions and iceberg-spark-runtime keep nproc-1 since their forks use Gradle's default ~512 MB heap.
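A rough memory budget makes the 3-fork versus 2-fork trade-off concrete. The sketch below uses only the figures quoted above and assumes 1 GB = 1024 MB. It does not model Spark off-heap memory, metaspace, or OS usage, all of which have to fit in whatever remains.

```shell
#!/usr/bin/env bash
# Rough budget sketch. Assumptions: 1 GB is treated as 1024 MB, and Spark
# off-heap, metaspace, and OS usage are NOT modeled; they must fit in the remainder.

RUNNER_MB=16384      # 16 GB runner ceiling
DAEMON_MB=4096       # 4 GB Gradle daemon
FORK_HEAP_MB=3236    # ~3.16 GB heap per Spark test fork

# Memory left after the daemon and the given number of fork heaps.
remaining_mb() {
  local forks=$1
  echo $(( RUNNER_MB - DAEMON_MB - forks * FORK_HEAP_MB ))
}

for forks in 2 3; do
  echo "forks=${forks} remaining_mb=$(remaining_mb "$forks")"
done
```

Under these assumptions, three forks leave about 2.5 GB for everything outside the heaps, while two forks leave about 5.7 GB. That is consistent with the OOM-killer firing at three forks once off-heap usage is added.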
Inline a 10s-interval memory sampler inside each test step that writes both a CSV (uploaded as the runner-monitor artifact) and per-sample markdown rows into GITHUB_STEP_SUMMARY. A follow-up always-run step mirrors free/ps/dmesg output and the monitor tail into its own summary so diagnostics survive even if the test step's summary upload is lost to a runner SIGKILL.
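A sampler of this kind could look roughly like the following. This is a minimal sketch, not the PR's actual script: the CSV file name, the column set, and the function names are assumptions, and it reads /proc/meminfo, so it only works on Linux runners.

```shell
#!/usr/bin/env bash
# Sketch of a background memory sampler. File name, columns, and function
# names are assumptions for illustration; they are not taken from the PR.

MONITOR_CSV=${MONITOR_CSV:-runner-monitor.csv}
SUMMARY=${GITHUB_STEP_SUMMARY:-/dev/null}   # set by GitHub Actions inside a step
INTERVAL=${INTERVAL:-10}                    # seconds between samples

# Append one sample to both the CSV and the markdown summary.
sample_once() {
  local ts total_kb avail_kb
  ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  # /proc/meminfo reports these values in kB.
  total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
  avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
  local total_mb=$(( ${total_kb:-0} / 1024 ))
  local avail_mb=$(( ${avail_kb:-0} / 1024 ))
  echo "${ts},${total_mb},${avail_mb}" >> "$MONITOR_CSV"
  echo "| ${ts} | ${total_mb} | ${avail_mb} |" >> "$SUMMARY"
}

# Write headers, then sample in a background subshell until stopped.
start_sampler() {
  echo "timestamp,mem_total_mb,mem_available_mb" > "$MONITOR_CSV"
  {
    echo "| timestamp | total MB | available MB |"
    echo "|---|---|---|"
  } >> "$SUMMARY"
  ( while true; do sample_once; sleep "$INTERVAL"; done ) &
  SAMPLER_PID=$!
}

stop_sampler() {
  kill "$SAMPLER_PID" 2>/dev/null
  wait "$SAMPLER_PID" 2>/dev/null
}
```

In a test step, one would call `start_sampler` before the Gradle invocation and `stop_sampler` after it. Because each row is appended as soon as it is sampled, whatever was written before a runner SIGKILL remains in the CSV on disk, though whether it is uploaded still depends on the follow-up step running.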
Reduces Spark CI wall-clock time by running tests in parallel, matching the pattern already used in `flink-ci.yml`:
- Adds `-DtestParallelism=auto` to the `./gradlew check` invocation so test execution scales to the runner.