Skip to content

fix: include port id in materialization reader actor id to fix self-join deadlock#4985

Open
Ma77Ball wants to merge 6 commits intoapache:mainfrom
Ma77Ball:fix/differenceOperator
Open

fix: include port id in materialization reader actor id to fix self-join deadlock#4985
Ma77Ball wants to merge 6 commits intoapache:mainfrom
Ma77Ball:fix/differenceOperator

Conversation

@Ma77Ball
Copy link
Copy Markdown
Contributor

@Ma77Ball Ma77Ball commented May 8, 2026

What changes were proposed in this PR?

Fixes the Difference operator hanging when one upstream operator (e.g., a single CSV) is wired to both of its input ports.

The bug

  • Scheduler materializes the upstream to break the self-join cycle.
  • Each downstream worker spawns one InputPortMaterializationReaderThread per upstream URI.
  • Each thread's virtual "from" actor ID was built from (uri, workerActorId) only.
  • With a self-join, both threads share the same URI and worker → same ChannelIdentity → FIFO sequence numbers and end-of-channel markers cross-routed → one port never drains → Difference hangs.

The fix

  • Mix the destination PortIdentity into the actor name: MATERIALIZATION_READER_<uri>_port<n>[i]_<workerActorId>.
  • Thread toPortId through the three callers of getFromActorIdForInputPortStorage:
    • ResourceAllocatorglobalPortId.portId.
    • AssignPortHandlermsg.portId.
    • InputManagerInputPortMaterializationReaderThread (new ctor field)

Any related issues, documentation, or discussions?

Closes: #2588

How was this PR tested?

  • VirtualIdentityUtilsSpec — checks the new ID format and asserts distinct IDs for the same (uri, worker) but different port IDs.
  • ExpansionGreedyScheduleGeneratorSpec — builds csv → difference with both inputs from the same csv; asserts levelSets.size > 1, proving materialization happens and the schedule isn't a deadlocked single-level region.
  • Manual: ran a workflow with one CSV connected to both Difference ports — previously hung, now completes.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with: Claude Opus 4.7 in compliance with ASF

@Ma77Ball Ma77Ball changed the title Fix/difference operator fix: include port id in materialization reader actor id to fix self-join deadlock May 8, 2026
@Ma77Ball
Copy link
Copy Markdown
Contributor Author

Ma77Ball commented May 8, 2026

@Yicong-Huang please review

@Ma77Ball Ma77Ball changed the title fix: include port id in materialization reader actor id to fix self-join deadlock fix: include port id in materialization reader actor id to fix self-join deadlock May 8, 2026
@Yicong-Huang
Copy link
Copy Markdown
Contributor

Yicong-Huang commented May 8, 2026

Mix the destination PortIdentity into the actor name: MATERIALIZATION_READER_port[i].

I don't think fixing the port id into the actor's name is the right way to go. port id shouldn't be part of actor name and Actor name should be unique. If we have an operator with 5 input ports, this fix would name the actor to 5 different names.

The bug should be on the other lower layer: an actor could have multiple channels, and channels are connected between different ports. I think we did not differentiate channels well, and the fix is deeper than just renames, the gateway logic might also require fix. To fix it you need a deeper understanding of amber.

@chenlica chenlica requested a review from Xiao-zhen-Liu May 9, 2026 06:17
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 9, 2026

Codecov Report

❌ Patch coverage is 71.42857% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.72%. Comparing base (b4f3a41) to head (d9830c6).
⚠️ Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
...pache/texera/amber/util/VirtualIdentityUtils.scala 75.00% 0 Missing and 1 partial ⚠️
...a/amber/operator/difference/DifferenceOpDesc.scala 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##               main    #4985   +/-   ##
=========================================
  Coverage     42.72%   42.72%           
+ Complexity     2185     2184    -1     
=========================================
  Files          1031     1031           
  Lines         38152    38156    +4     
  Branches       4004     4005    +1     
=========================================
+ Hits          16302    16304    +2     
  Misses        20831    20831           
- Partials       1019     1021    +2     
Flag Coverage Δ *Carryforward flag
access-control-service 39.53% <ø> (ø)
agent-service 33.72% <ø> (ø) Carriedforward from b4f3a41
amber 43.22% <71.42%> (+<0.01%) ⬆️
computing-unit-managing-service 0.00% <ø> (ø)
config-service 0.00% <ø> (ø)
file-service 32.18% <ø> (ø)
frontend 33.08% <ø> (ø) Carriedforward from b4f3a41
python 88.90% <ø> (ø) Carriedforward from b4f3a41
workflow-compiling-service 47.72% <ø> (ø)

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

self-diff cannot finish execution

3 participants