Skip to content

feat: support failure knobs for sinks#1065

Open
thesayyn wants to merge 5 commits intomainfrom
bes-retry
Open

feat: support failure knobs for sinks#1065
thesayyn wants to merge 5 commits intomainfrom
bes-retry

Conversation

@thesayyn
Copy link
Copy Markdown
Member

@thesayyn thesayyn commented May 6, 2026

Changes are visible to end-users: yes

  • Searched for relevant documentation and updated as needed: yes
  • Breaking change (forces users to change their own code or config): no
  • Suggested release notes appear below: yes/no

Test plan

  • New test cases added

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ea1a2a799d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread crates/axl-runtime/src/engine/bazel/sink/grpc.rs
Comment on lines +619 to +624
retry: sink::retry::RetryConfig {
max_retries: max_retries as u32,
retry_min_delay,
retry_max_buffer_size: retry_max_buffer_size as usize,
timeout,
error_strategy,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor the configured BES timeout

The Starlark API and Workflows feature expose timeout/bes_timeout as an overall BES upload deadline, but this value is only stored in RetryConfig and is never read by Grpc::work, drive_stream, retry_lifecycle, or Client. Users setting a finite timeout still wait through all lifecycle calls, stream retries, and final upload indefinitely if the backend stalls or keeps retrying, so the new knob is effectively a no-op.

Useful? React with 👍 / 👎.

Comment thread crates/aspect-cli/src/builtins/aspect/feature/workflows.axl
bessie_sinks.append(bazel.build_events.grpc(
uri = bessie_endpoint,
metadata = {} # TODO: how does bessie authenticate?
metadata = {}, # TODO: how does bessie authenticate?
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Within the deployment, it is just TLS so you can drop the TODO.

Address maintainer feedback on PR #1065:

1. Bound the gRPC sink's forwarder channel by retry_max_buffer_size.
   Previously the Tokio mpsc was unbounded, so events accumulated
   freely while drive_stream was sleeping between reconnects or
   replaying the retry buffer — defeating the very knob meant to cap
   memory. The forwarder now uses try_send and signals overflow via
   an atomic flag; drive_stream sees the closed receiver, checks the
   flag, and exits with BufferFull so error_strategy applies.

2. Wire the configured timeout. RetryConfig.timeout was previously
   stored but never read, so 'bes_timeout' was a no-op. Wrap the
   sink's work in tokio::time::timeout so finite deadlines actually
   bound lifecycle calls, stream retries, and the final upload —
   matching Bazel's --bes_timeout semantics.

3. Drop the TODO about bessie authentication on the workflows BES
   sink: within the deployment, transport security is plain TLS.
@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

The gRPC sink uses tonic::Status::unavailable directly, but tonic
was only listed in Cargo.toml. Bazel's rust_library deps are the
authoritative source, so the bazel build of //:cli failed with
'use of unresolved module or unlinked crate `tonic`'.

Add @crates//:tonic alongside the other tokio-family deps.
@aspect-workflows
Copy link
Copy Markdown

aspect-workflows Bot commented May 7, 2026

Aspect CI Status · 2026-05-08T04:31:27Z

Task Job Result
🟡 running test test-task in progress
✅ passed test test-task 25/25 passed · 23 cached

@thesayyn
Copy link
Copy Markdown
Member Author

thesayyn commented May 7, 2026

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eecf20f93d

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

// `tokio::time::timeout` mirrors Bazel's `--bes_timeout` and gives
// the user-set knob a real effect.
match timeout {
Some(d) => match tokio::time::timeout(d, inner).await {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Start BES timeout after the build finishes

When timeout is set, this wraps the entire sink task from the moment the Bazel process is spawned, but work_inner spends most of a long build just waiting for BEP events from event_rx. Bazel documents --bes_timeout as the time to wait for BES/BEP upload after the build and tests have finished (https://bazel.build/versions/7.5.0/reference/command-line-reference), so a healthy build that runs longer than bes_timeout will terminate the sink early; with warn/ignore it silently drops the rest of the upload, and with abort/fail_at_end it can fail an otherwise successful build.

Useful? React with 👍 / 👎.

claude added 2 commits May 8, 2026 03:22
The bounded mpsc with try_send introduced in c7702e9 caused BES
handling hangs in CI. Revert to the prior unbounded forwarder so
the sink behaves the same as before that commit.

The timeout wrapper from work() is kept — it's a pure no-op when
'bes_timeout' defaults to "0s"/None, and it actually wires up the
knob when the user sets a finite deadline.

The memory bound concern raised on PR #1065 still stands as a
follow-up; bounding the forwarder safely needs more thought than
the simple try_send/overflow flag did.
After sending last_message and half-closing the request side,
drive_stream waited for the server to close the response stream
before returning Done. Some BES backends keep the response stream
open after the final ack, which made every successful upload hang
indefinitely at end-of-build — visible in CI as build/test/lint/
delivery jobs running until the 12-minute cancellation.

Exit Done as soon as last_message_sent is true and the retry
buffer drains. The same shortcut applies for the upstream-closed
path: if we already half-closed without a last_message and the
buffer drains, return UpstreamClosed without waiting for the
server.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants