Skip to content

Pawel plesniak emuhammad/split shell fixing#888

Open
PawelPlesniak wants to merge 13 commits into
developfrom
PawelPlesniakEmuhammad/SplitShellFixing
Open

Pawel plesniak emuhammad/split shell fixing#888
PawelPlesniak wants to merge 13 commits into
developfrom
PawelPlesniakEmuhammad/SplitShellFixing

Conversation

@PawelPlesniak

@PawelPlesniak PawelPlesniak commented Apr 24, 2026

Copy link
Copy Markdown
Collaborator

Description

Fixes #905

This is an intermediary branch for #830, it does not address all the issues but contains a separate branch to allow for other PRs to go in without disrupting develop too much.

This PR has had other PRs merged into it that have addressed the following issues
#709
#910
#904
#586
#326
#937

Type of change

  • New feature / enhancement
  • Optimization
  • Bug fix
  • Breaking change
  • Documentation

List of required branches from other repositories

N/A

Change log

Several changes have been included, and this PR introduces progress on the various deployment mechanisms on the run control, focusing on the following

  • independent operations of the process-manager
  • operating a process-manager through a process-manager-shell
  • operating a unified-shell connected to the standalone process-manager

Note - this development is fully focused on the ssh implementation of the process manager. This implementation has not yet been tested on the k8s process manager, for which a separate issue has been listed.

Suggested manual testing checklist

One can now deploy a standalone process manager and run sessions from it, but note that for this case, the user who owns (in Linux terms, the user who started) the drunc-process-manager process must have rwx access to the DBT_AREA_ROOT from which the process manager is client is being run. Also note, if this is not available, drunc will report this and block the boot.

To start a standalone process manager, use

drunc-process-manager 50000

Note - the 50000 is the port number through which the process manager will communicate. This does not have to be

You can connect to it using a process manager shell (in a separate terminal!) as

drunc-process-manager-shell grpc://<HOSTNAME>:50000

Note - <HOSTNAME> refers to the host on which the process manager is running, and can be marked as localhost. One can then start a session as e.g. (from within the drunc-process-manager-shell)

boot config/daqsystemtest/example-configs.data.xml local-1x1-config $USER-session

From here you will be able to execute the session stateful commands by connecting to the segment's top controller to take a run.

One can also run a session through the unified shell (in a separate terminal!) using this process manager as

drunc-unified-shell grpc://<HOSTNAME>:50000 config/daqsystemtest/example-configs.data.xml local-1x1-config $USER-session

This can also be done cross-host, e.g. if the process manager is running on np04-srv-019 and a *local* session is ran from np04-srv-028, the session applications will start on 028.

Developer checklist

Prior to marking this as "Ready for Review"

Tests ran on: np04-srv-028 from release NFD_DEV_260629_A9

Unit tests - some tests can't be ran on the CI. This is documented. If this PR checks a feature that can't be tested with CI, this has been marked appropriately.

Integration tests - the daqsystemtest_integtest_bundle requires a lot of resources, and connections to the EHN1 infrastructure. Check the cross referenced list if you can't run these. The developer needs to run at least the .

  • Unit tests (pytest --marker) passed
    • With relevant marker
    • Without marker
  • Integration tests passed
    • Only daqsystemtest_integtest_bundle.sh -k minimal_system_quick_test.py
    • Full daqsystemtest_integtest_bundle.sh
  • Testing skipped as there are no core code changes in this PR, this only relates to documentation/CI workflows

Final checklist prior to marking this as "Ready for Review"

Reviewer checklist

  • This branch has been rebased with develop prior to testing.
  • Suggested manual tests show changes.
  • CI workflows fails documented (if present)
  • Integration tests passed
    • Only concern yourself if failures related to drunc are in the log files
    • If non-drunc failure appears:
      • Validate failure in fresh working area
      • Contact Pawel if unsure

Once the features are validated and both the unit and integration tests pass, the PRs is ready to be merged.

Prior to merging

Choose one of the following an complete all substeps
  • Changes only affect the Run Control, are in a single repository, and do not affect the end user.
    • Changes are documented in docstrings and code comments
    • Wiki has been updated if architectural or endpoint changes
  • Otherwise
    • Workflow changes demonstrated in the Change Log (if necessary)
    • Wiki has been updated (if necessary)
    • #daq-sw-librarians Slack channel notified (see below)

Once completed, the reviewer can merge the PR.

Notification message for a Slack channel

Note - this should be to #dunedaq-integration for general workflow that isn't during a release candidate period, and to #daq-release-prep otherwise.

For an single merge that changes the user workflow

The CCM WG has an isolated PR ready to merge that affects user workflows. The PR is:

_URL_

I will leave time for any comments, otherwise will merge these at the end of the work day _Insert your time zone_.

For co-ordinated merge

The CCM WG has a set of co-ordinated merges ready to merge. The PRs are:

_URL_

_URL_


I will leave time for any comments, otherwise will merge these at the end of the day.

@PawelPlesniak

Copy link
Copy Markdown
Collaborator Author

The current implementation uses a superuser (for development with me and @emmuhamm, I was the superuser) to be the owner of all the proceses, similar to the way in which Docker works.

Some other development that needs to happen before cleaning up, review, integ tests, etc.

  • Unified shell should see Some remote PM logs e.g. terminate
  • localhost cannot be used in the following command as it leads to errors, figure out why
drunc-unified-shell grpc://localhost:50000 config/daqsystemtest/example-configs.data.xml ehn1-local-1x1-config pawel
  • Run the same tests with a k8s process manager.
  • Reintroduce the session name checking, it currently uses PM code which blocks standalone PM use.

@emmuhamm

emmuhamm commented May 1, 2026

Copy link
Copy Markdown
Contributor

Reference notes
K8s name scoping fails when passing address
If you use localhost, doesn't work

Env var
Drunc.json
K8s secrets
Auto generate process manager port number for sessions if not present?
Unified shell should see Some remote PM logs e.g. terminate
Permissions for cross server issues

Emir of 028, PM on 029, permissions failures
Cross server issues, Auth of "localhost" when on different NFS backed servers
Process ownership is the root problem - attempting to read user's SSH config as a superuser is not helpful.

K8s did not help - root controller and local conn srv failed.
Got to a point where we were trying to read Emir's SSH config (current develop attempted to read mine), permission denied.
Attempted running PM as root failed on 019

Changes made

Removed k8s name validation - blocks external PM code
Fixing lifetime manager drunc.drunc. -> drunc.

Comment thread src/drunc/unified_shell/commands.py
@emmuhamm

Copy link
Copy Markdown
Contributor

Note that as of 66c8707, there is a conflict that needs to be resolved. I've rebased (not merged) this branch on top of develop since I feel like im always missing something when merging develop into this feature branch, resulting in an invalid set of changes.

However, for backup, I have saved
https://github.com/DUNE-DAQ/drunc/tree/PawelPlesniakEmuhammad/SplitShellFixing-BACKUP which should point to 66c8707, keeping the state of that branch before the rebase.

Soon after this message is posted, I'll force push the changes after the rebase, which should work.

note that we should delete the backup branch when this PR gets merged

@emmuhamm emmuhamm force-pushed the PawelPlesniakEmuhammad/SplitShellFixing branch from 66c8707 to 0114aa4 Compare May 28, 2026 14:27
Comment thread src/drunc/processes/ssh_process_lifetime_manager_shell.py Outdated
@emmuhamm emmuhamm force-pushed the PawelPlesniakEmuhammad/SplitShellFixing branch from 0114aa4 to a4d174f Compare June 3, 2026 14:33
@emmuhamm

emmuhamm commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Going to need to do some git surgery. Ive copied the branch on PawelPlesniakEmuhammad/SplitShellFixing-after-941 just in case something got messed up


update 1
Just finished cleaning up the commits so i have squashed the commits and merged them from the respective pull requests. so it should look like 2 merges have happened, each with a single commit that represents the changes from their respective prs.

local testing works which is as expected. tomorrow will be cleaning up the first 5 commits to grab the 'essentials' and the 'testing' commits to make it obvious. when that is done, it will finally be rebased on top of develop


Update 2

Just finished cleaning up the first 5 commits, now theyre clearer in what they do. This should help in rebasing with develop


Update 3

this has now been rebased successfully onto the nightly of 22 june.

@emmuhamm emmuhamm force-pushed the PawelPlesniakEmuhammad/SplitShellFixing branch 2 times, most recently from 2ba33eb to 6039f4a Compare June 23, 2026 10:05
@emmuhamm emmuhamm force-pushed the PawelPlesniakEmuhammad/SplitShellFixing branch from 6039f4a to f042c4e Compare June 23, 2026 11:00
@emmuhamm

Copy link
Copy Markdown
Contributor

cc @PawelPlesniak

PMaaS

#888
rebased on last weeks nightly
minor conflict, probably the changes regarding the logging, should be uber simple to fix
Pytest needs to be fixed
Not sure why the pytest fails, but im quite certain that it was green when merging #912. I expect that the failure would have come from 941 somewhere, but havent been debugged

Will need to investigate everything that has been commented out to check if it can be reintroduced

@PawelPlesniak

PawelPlesniak commented Jun 29, 2026

Copy link
Copy Markdown
Collaborator Author

TODOs before merging:

  • Fix pytest
  • Fix k8s name checking at start of PM
  • Run integration tests

@PawelPlesniak

Copy link
Copy Markdown
Collaborator Author

Integtests running in a tmux session on np04-srv-028

@PawelPlesniak

Copy link
Copy Markdown
Collaborator Author

Integration tests passed on np04-srv-028 as

+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++ SUMMARY ++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++

Tue 30 Jun 17:45:35 CEST 2026
Log file is: /tmp/pytest-of-pplesnia/dunedaq_integtest_bundle_20260630170258.log

⮕ Running daqsystemtest/3ru_1df_multirun_test.py ⬅
======================== 6 passed ✅ in 246.62s (0:04:06) =========================
⮕ Running daqsystemtest/3ru_3df_multirun_test.py ⬅
======================== 6 passed ✅ in 251.46s (0:04:11) =========================
⮕ Running daqsystemtest/disabled_tpg_test.py ⬅
========================= 3 passed ✅ in 60.34s (0:01:00) =========================
⮕ Running daqsystemtest/example_system_test.py ⬅
======================== 12 passed ✅ in 233.66s (0:03:53) ========================
⮕ Running daqsystemtest/fake_data_producer_test.py ⬅
======================== 6 passed ✅ in 239.75s (0:03:59) =========================
⮕ Running daqsystemtest/long_window_readout_test.py ⬅
============================== 1 skipped 🟡 in 0.87s ==============================
⮕ Running daqsystemtest/minimal_system_quick_test.py ⬅
============================== 4 passed ✅ in 53.31s ==============================
⮕ Running daqsystemtest/readout_type_scan_test.py ⬅
======================== 33 passed ✅ in 601.02s (0:10:01) ========================
⮕ Running daqsystemtest/sample_ehn1_multihost_test.py ⬅
======================== 4 skipped 🟡 in 79.33s (0:01:19) =========================
⮕ Running daqsystemtest/small_footprint_quick_test.py ⬅
============================== 3 passed ✅ in 55.37s ==============================
⮕ Running daqsystemtest/tpg_state_collection_test.py ⬅
======================== 5 passed ✅ in 107.12s (0:01:47) =========================
⮕ Running daqsystemtest/tpreplay_test.py ⬅
=================== 1 failed ❌, 5 passed ✅ in 196.77s (0:03:16) ====================
⮕ Running daqsystemtest/tpstream_writing_test.py ⬅
======================== 4 passed ✅ in 107.36s (0:01:47) =========================
⮕ Running daqsystemtest/trigger_bitwords_test.py ⬅
======================== 18 passed ✅ in 312.10s (0:05:12) ========================

Note the failure was associated with a stateful command timing out, and is not associated with this PR. Re-running this specific test to confirm.

@PawelPlesniak

Copy link
Copy Markdown
Collaborator Author

The tp-replay tests were re-ran on the same host with the same nightly release as dunedaq_integtest_bundle.sh -k tpreplay showing

======================== 6 passed in 127.11s (0:02:07) =========================

+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++ SUMMARY ++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++

Wed Jul  1 02:13:03 PM CEST 2026
Log file is: /tmp/pytest-of-pplesnia/dunedaq_integtest_bundle_20260701141053.log

⮕ Running daqsystemtest/tpreplay_test.py ⬅
======================== 6 passed ✅ in 127.11s (0:02:07) =========================

This is now ready for another review

@PawelPlesniak PawelPlesniak marked this pull request as ready for review July 1, 2026 12:15
@PawelPlesniak PawelPlesniak requested a review from emmuhamm July 1, 2026 12:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Introduce superuser prototype for multi user running

3 participants