Build: Speed up Spark CI with parallel test execution #16357

Draft
kevinjqliu wants to merge 6 commits into apache:main from kevinjqliu:kevinjqliu/parallelize-spark-ci

Conversation

@kevinjqliu (Contributor)

Reduces Spark CI wall-clock time by running tests in parallel, matching the
pattern already used in flink-ci.yml:

  • Add -DtestParallelism=auto to the ./gradlew check invocation so test
    execution scales with the runner's CPU count.
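
A minimal sketch of what the changed workflow step could look like; the step
name and Gradle project path are illustrative, only the -DtestParallelism=auto
flag is the change the PR describes:

```yaml
# Hedged sketch: module path and step name are assumptions for illustration.
- name: Run Spark tests
  run: ./gradlew -DtestParallelism=auto :iceberg-spark:check
```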


Copilot AI left a comment


Pull request overview

This PR reduces Spark CI wall-clock time by enabling Gradle test parallelism, aligning Spark CI behavior with the existing Flink CI workflow.

Changes:

  • Adds -DtestParallelism=auto to the Spark CI ./gradlew ... :check invocation to scale test execution to runner CPU capacity.


Run :iceberg-spark and :iceberg-spark-extensions (+ :iceberg-spark-runtime)
:check tasks in separate matrix jobs instead of serially within one Gradle
invocation. On the slowest matrix combo this removes ~26 min from the
critical path.
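
One way to express that split is a module axis in the job matrix, so each
:check runs in its own job; the axis name and values below are assumptions
for illustration, not the PR's exact diff:

```yaml
# Sketch of per-module matrix jobs replacing one serial Gradle invocation.
strategy:
  matrix:
    module:
      - ':iceberg-spark:check'
      - ':iceberg-spark-extensions:check :iceberg-spark-runtime:check'
steps:
  - name: Run module tests
    run: ./gradlew -DtestParallelism=auto ${{ matrix.module }}
```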

Bump -DtestParallelism from 'auto' (= ceil(nproc/2) = 2 forks on a 4 vCPU
ubuntu-24.04 runner) to nproc-1 (= 3 forks). Each Spark test fork uses
~3.16 GB heap, so 3 forks ~= 9.5 GB leaves headroom for the Gradle daemon
and Spark driver overhead inside each fork.
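
The nproc-1 value can be derived at runtime rather than hard-coded, so a
4 vCPU runner gets 3 forks; the step name and project path here are assumed:

```yaml
# Sketch: compute fork count as nproc-1 in the step's shell.
- name: Run tests with nproc-1 forks
  run: ./gradlew -DtestParallelism=$(( $(nproc) - 1 )) :iceberg-spark:check
```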

Raise max-parallel from 15 to 30 to accommodate the doubled job count
(2 modules x 12 surviving (jvm, spark, scala) combos = 24 jobs).
- fail-fast: false so an OOM/flake in one matrix combo doesn't cancel
  the other ~23 jobs and we can see whether failures are correlated.
- timeout-minutes: 60 so a hung Gradle daemon or stuck Spark driver
  bails out fast instead of waiting for the default 6 h job timeout.
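
Taken together, the job-level knobs described above could look like this
(job name illustrative):

```yaml
# Sketch of the strategy/timeout settings; only the values come from the PR.
spark-tests:
  runs-on: ubuntu-24.04
  timeout-minutes: 60      # kill hung Gradle daemons well before the 6 h default
  strategy:
    fail-fast: false       # let sibling matrix combos finish after one failure
    max-parallel: 30       # headroom for the doubled (24-job) matrix
```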

3 forks * ~3.16 GB heap (plus Spark off-heap and the 4 GB Gradle daemon)
exceeded the 16 GB ubuntu-24.04 runner ceiling and triggered the kernel
OOM-killer, surfacing as 'The runner has received a shutdown signal'.
Drop the spark-only step to 2 forks; iceberg-spark-extensions and
iceberg-spark-runtime keep nproc-1 since their forks use Gradle's default
~512 MB heap.

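A sketch of the resulting per-module fork counts (step names assumed; the
fork counts and heap figures are the ones stated above):

```yaml
# Heavy ~3.16 GB forks get 2; lighter default-heap forks keep nproc-1.
- name: iceberg-spark tests (2 forks)
  run: ./gradlew -DtestParallelism=2 :iceberg-spark:check
- name: extensions + runtime tests (nproc-1 forks)
  run: ./gradlew -DtestParallelism=$(( $(nproc) - 1 )) :iceberg-spark-extensions:check
```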
Inline a 10s-interval memory sampler inside each test step that writes
both a CSV (uploaded as the runner-monitor artifact) and per-sample
markdown rows into GITHUB_STEP_SUMMARY. A follow-up always-run step
mirrors free/ps/dmesg output and the monitor tail into its own summary
so diagnostics survive even if the test step's summary upload is lost
to a runner SIGKILL.
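
The sampler and always-run diagnostics steps described above might be shaped
like this; file names, step names, and the artifact action version are
assumptions, not the PR's exact code:

```yaml
# Sketch: 10s memory sampler writing a CSV plus markdown rows, mirrored by
# an always-run step so diagnostics survive a runner SIGKILL.
- name: Start memory sampler
  run: |
    echo "timestamp,mem_used_mb" > runner-monitor.csv
    while true; do
      used=$(free -m | awk '/^Mem:/ {print $3}')
      echo "$(date +%s),${used}" >> runner-monitor.csv
      echo "| $(date +%T) | ${used} MB |" >> "$GITHUB_STEP_SUMMARY"
      sleep 10
    done &
- name: Mirror diagnostics
  if: always()               # runs even if the test step's summary was lost
  run: |
    { free -h; ps aux --sort=-%mem | head -n 15; \
      sudo dmesg | tail -n 50; tail runner-monitor.csv; } >> "$GITHUB_STEP_SUMMARY"
- name: Upload memory samples
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: runner-monitor
    path: runner-monitor.csv
```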