fix(ci): Cloud Build OOM and empty image revision label #6328

Draft
ivankatliarchuk wants to merge 1 commit into kubernetes-sigs:master from gofogo:ci/fix-cloudbuild-oom-and-revision

@ivankatliarchuk (Member) commented Mar 30, 2026

What does it do?

I'm not sure whether this PR fixes Cloud Build, but it is worth trying. We should also split the one massive job: Cloud Build currently runs for quite a while, and it is hard to tell where exactly it fails.

At worst, we can roll back individual changes. The split-step and cache pre-warming changes are worthwhile regardless, because they reduce peak memory, eliminate redundant module downloads, and give a cleaner failure reason. But if the next build fails with INTERNAL_ERROR again, that would suggest a transient infrastructure issue rather than resource exhaustion.

  • Split the single build step into three: module cache pre-warm → test → build/push. The steps share a named volume (go-modules), so dependencies are downloaded once instead of once per target architecture, as happens today
  • Added a GIT_REVISION Makefile variable (overridable via env) and wired it to --image-label org.opencontainers.image.revision; added a _GIT_COMMIT substitution to cloudbuild.yaml so prow can pass the commit SHA
  • We could upgrade the machine type to N1_HIGHCPU_32 ;-) but I'm not sure it is available to us

It looks like N1_HIGHCPU_32 may be usable after all; see https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/8ee57b6871d5434bef21d05480c4745f97767be1/cloudbuild.yaml#L24
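
For illustration, the GIT_REVISION override described in the list above could be expressed in shell roughly as follows. This is a sketch under assumptions, not the Makefile change in this PR: the function name resolve_git_revision is hypothetical, and only the GIT_REVISION variable name comes from the description.

```shell
#!/bin/sh
# Hypothetical helper: resolve the commit SHA for the image revision label.
# Prefers GIT_REVISION from the environment (for example, passed in by CI)
# and only falls back to git when the directory is actually a repository.
resolve_git_revision() {
  if [ -n "${GIT_REVISION:-}" ]; then
    printf '%s\n' "$GIT_REVISION"
  elif git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    git rev-parse HEAD
  else
    # No override and no git metadata (tarball checkout): report the
    # problem instead of silently producing an empty label.
    echo "cannot determine git revision" >&2
    return 1
  fi
}
```

Calling it as `GIT_REVISION=<sha> resolve_git_revision` prints the override unchanged, which is the path a tarball-based build would rely on.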

Motivation

The staging Cloud Build was failing with INTERNAL_ERROR - tests and the multi-arch ko build ran in the same step, causing memory exhaustion on N1_HIGHCPU_8. The org.opencontainers.image.revision label was also always empty because git rev-parse HEAD fails in the tarball-based Cloud Build environment.

Failed cloudbuild jobs

[Screenshot, 2026-03-30: failed Cloud Build jobs]

An example job that took ~50 minutes to fail: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/post-external-dns-push-images/2038528233506344960

ERROR: (gcloud.builds.submit) build 11e2415c-9cf5-45f1-b9ad-8321510cf20d completed with status "INTERNAL_ERROR"
2026/03/30 06:51:46 Using build config #0 for sigs.k8s.io/external-dns
2026/03/30 06:51:46 current folder is not a git repository. Git info will not be available
2026/03/30 06:51:46 Building sigs.k8s.io/external-dns for linux/arm64/v8
2026/03/30 06:51:46 current folder is not a git repository. Git info will not be available
2026/03/30 06:51:46 Building sigs.k8s.io/external-dns for linux/amd64
2026/03/30 06:51:46 Using build config #0 for sigs.k8s.io/external-dns
2026/03/30 06:51:46 current folder is not a git repository. Git info will not be available
2026/03/30 06:51:46 Building sigs.k8s.io/external-dns for linux/arm/v7

^ The build context is not a git repository, so this Makefile expression cannot produce a value:

--image-label org.opencontainers.image.revision=$(shell git rev-parse HEAD) \

I don't know for certain whether this is an OOM or a similar issue; that was an educated guess. What I can say from the logs:

  • The ko build ran ~50 minutes with zero output after the initial "Building for..." lines - suggesting it hung or was killed, not that it failed with an error
  • INTERNAL_ERROR from Cloud Build means the build runner itself failed, not user code - that's distinct from FAILURE (bad exit code) or TIMEOUT
  • Building 3 platforms simultaneously on N1_HIGHCPU_8 is a plausible memory pressure scenario

But it could equally be:

  • A transient Cloud Build infrastructure failure (flaky GCR push, runner crash)
  • Network timeout pushing a large multi-arch manifest to GCR
  • A hang in ko itself for some other reason
  • Disk space exhaustion
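
To make the status distinction above concrete, here is a small shell sketch that pulls the status out of a gcloud error line like the one quoted earlier and maps it to a rough triage bucket. The function names and bucket labels are hypothetical, and the mapping reflects only the reasoning in this comment, not an official Cloud Build classification.

```shell
#!/bin/sh
# Hypothetical helper: extract the status from a line such as
#   ERROR: (gcloud.builds.submit) build <id> completed with status "INTERNAL_ERROR"
extract_build_status() {
  printf '%s\n' "$1" | sed -n 's/.*completed with status "\([A-Z_]*\)".*/\1/p'
}

# Hypothetical triage buckets, following the bullets above.
triage_build_status() {
  case "$1" in
    INTERNAL_ERROR) echo "infrastructure" ;;  # runner-level failure; cause still unknown
    FAILURE)        echo "build-step" ;;      # a step exited with a bad exit code
    TIMEOUT)        echo "timeout" ;;         # the build exceeded its time limit
    *)              echo "unknown" ;;
  esac
}
```

An "infrastructure" result does not separate memory exhaustion from a transient runner crash; it only says the failure did not come from a step's exit code.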

Why the cache is required: currently the build pulls the same library multiple times.

[Screenshot, 2026-04-02: the same library being downloaded multiple times]

More

  • Yes, this PR title follows Conventional Commits
  • Yes, I added unit tests
  • Yes, I updated end user documentation accordingly

The revision label is empty because there is no git in Cloud Build (only a tar/zip archive is available there):

ko build --tags v20260330-v0.20.0-203-g736a2d58 --bare --sbom none \
	--image-label org.opencontainers.image.source="https://github.com/kubernetes-sigs/external-dns" \
	--image-label org.opencontainers.image.revision= \
	--platform=linux/amd64,linux/arm64,linux/arm/v7  --push=true .
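
One way to keep an empty revision from reaching ko unnoticed is to build the label argument through a guard. The sketch below assumes a hypothetical helper named revision_label_arg; it is not part of this PR, and only the label key is taken from the command above.

```shell
#!/bin/sh
# Hypothetical guard: print the --image-label argument for the revision,
# or fail when the revision is empty so the build stops early.
revision_label_arg() {
  if [ -z "${1:-}" ]; then
    echo "refusing to build: image revision is empty" >&2
    return 1
  fi
  printf '%s%s\n' '--image-label org.opencontainers.image.revision=' "$1"
}
```

With a non-empty SHA the function prints the flag and value on one line; with an empty argument it returns non-zero, so a calling script can abort before pushing an image with a blank label.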

Another example

https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/post-external-dns-push-images/2039609054623436800

[Screenshot, 2026-04-02: another failed job]

Signed-off-by: ivan katliarchuk <ivan.katliarchuk@gmail.com>
@ivankatliarchuk ivankatliarchuk marked this pull request as draft March 30, 2026 11:41
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 30, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign szuecs for approval. For more information see the Code Review Process.


@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 30, 2026
@coveralls

Pull Request Test Coverage Report for Build 23742899117

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.005%) to 80.484%

Totals Coverage Status
Change from base Build 23734434195: 0.005%
Covered Lines: 16941
Relevant Lines: 21049

💛 - Coveralls

@ivankatliarchuk ivankatliarchuk marked this pull request as ready for review March 30, 2026 11:49
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 30, 2026
@k8s-ci-robot k8s-ci-robot requested a review from vflaux March 30, 2026 11:50
@ivankatliarchuk ivankatliarchuk changed the title from "ci: fix Cloud Build OOM and empty image revision label" to "fix(ci): Cloud Build OOM and empty image revision label" Mar 30, 2026
path: /go/pkg/mod
steps:
# Pre-warm the Go module cache so test and build steps don't compete for downloads.
- name: 'docker.io/library/golang:1.26-bookworm'
Member Author

  ┌──────────┬────────────┬──────────────────────┬───────────────────────────────────────────────────────────────────┐
  │   Step   │ entrypoint │         args         │                              Result                               │
  ├──────────┼────────────┼──────────────────────┼───────────────────────────────────────────────────────────────────┤
  │ pre-warm │ go         │ mod download         │ go mod download ✓                                                 │
  ├──────────┼────────────┼──────────────────────┼───────────────────────────────────────────────────────────────────┤
  │ test     │ make       │ test                 │ make test → go test -race ./... ✓                                 │
  ├──────────┼────────────┼──────────────────────┼───────────────────────────────────────────────────────────────────┤
  │ build    │ make       │ build.push/multiarch │ make build.push/multiarch with VERSION, GIT_REVISION, IMAGE env ✓ │
  └──────────┴────────────┴──────────────────────┴───────────────────────────────────────────────────────────────────┘
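
Run outside Cloud Build, the three steps in the table amount to roughly the following sequence. This is a sketch under assumptions: the run and pipeline helpers and the DRY_RUN switch are hypothetical, GOMODCACHE pointing at /go/pkg/mod stands in for the shared go-modules volume, and VERSION, GIT_REVISION and IMAGE are assumed to be exported already.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical pipeline mirroring pre-warm -> test -> build.
pipeline() {
  # One module cache shared by all three steps.
  GOMODCACHE="${GOMODCACHE:-/go/pkg/mod}"
  export GOMODCACHE
  # Stop at the first failing step so the failure is attributable.
  run go mod download &&
    run make test &&
    run make build.push/multiarch
}
```

Running `DRY_RUN=1 pipeline` prints the three commands in order without executing them, which is a cheap way to check the sequence before a real build.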

@ivankatliarchuk ivankatliarchuk marked this pull request as draft April 1, 2026 08:39
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 1, 2026
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 2, 2026
@k8s-ci-robot
Contributor

PR needs rebase.
