Pedro.cordeiro/macos restart agent#52813
Conversation
|
All contributors have signed the CLA ✍️ ✅ |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 24eebaa728
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
|
||
| func restart() error { | ||
| return errors.New("restarting the agent is not implemented on non-windows platforms") | ||
| return exec.Command("/bin/launchctl", "kickstart", "-k", "system/com.datadoghq.agent").Start() |
There was a problem hiding this comment.
Surface launchctl failures before reporting success
In the packaged macOS Agent this endpoint is served from the daemon launched as _dd-agent (packages/macos/app/launchd.plist.example.in:28-29), while launchctl(1) documents system/... targets as needing root privileges for modifications. Because Start() only proves /bin/launchctl was spawned, a normal install can have the child exit with EPERM (or another launchctl error) while the handler still returns Success, leaving the Agent running with unapplied config; please propagate the child exit status or route this through a privileged helper/keep the button disabled.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
I can only run on private repositories.
|
🎯 Code Coverage (details) 🔗 Commit SHA: be3f308 | Docs | Datadog PR Page | Give us feedback! |
Files inventory check summaryFile checks results against ancestor ffc84c45: Results for datadog-agent_7.82.0~devel.git.425.be3f308.pipeline.121381436-1_amd64.deb:No change detected |
Static quality checks✅ Please find below the results from static quality gates Successful checksInfo
21 successful checks with minimal change (< 2 KiB)
|
Regression DetectorRegression Detector ResultsMetrics dashboard Baseline: ffc84c4 Optimization Goals: ✅ No significant changes detected
|
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | quality_gate_idle | memory utilization | +0.42 | [+0.37, +0.48] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_metrics_logs | memory utilization | +0.26 | [+0.01, +0.51] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_idle_all_features | memory utilization | -0.03 | [-0.07, +0.00] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_security_mean_fs_load | memory utilization | -0.04 | [-0.07, -0.00] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_security_idle | memory utilization | -0.21 | [-0.27, -0.15] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_security_no_fs_load | memory utilization | -0.33 | [-0.43, -0.23] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_logs | % cpu utilization | -2.16 | [-3.22, -1.10] | 1 | Logs bounds checks dashboard |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 147.83MiB ≤ 154MiB | bounds checks dashboard |
| ✅ | quality_gate_idle | total_bytes_received | 10/10 | 578.12KiB ≤ 819.20KiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 479.94MiB ≤ 495MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | total_bytes_received | 10/10 | 0.89MiB ≤ 1.25MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 183.31MiB ≤ 195MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_logs | total_bytes_received | 10/10 | 263.83MiB ≤ 292MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 353.33 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 399.29MiB ≤ 430MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | total_bytes_received | 10/10 | 0.86GiB ≤ 1.04GiB | bounds checks dashboard |
| ✅ | quality_gate_security_idle | cpu_usage | 10/10 | 27.57 ≤ 40 | bounds checks dashboard |
| ✅ | quality_gate_security_idle | memory_usage | 10/10 | 299.46MiB ≤ 330MiB | bounds checks dashboard |
| ✅ | quality_gate_security_mean_fs_load | cpu_usage | 10/10 | 60.68 ≤ 80 | bounds checks dashboard |
| ✅ | quality_gate_security_mean_fs_load | memory_usage | 10/10 | 273.00MiB ≤ 310MiB | bounds checks dashboard |
| ✅ | quality_gate_security_no_fs_load | cpu_usage | 10/10 | 23.47 ≤ 40 | bounds checks dashboard |
| ✅ | quality_gate_security_no_fs_load | memory_usage | 10/10 | 285.43MiB ≤ 320MiB | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
Replicate Execution Details
We run multiple replicates for each experiment/variant. However, we allow replicates to be automatically retried if there are any failures, up to 8 times, at which point the replicate is marked dead and we are unable to run analysis for the entire experiment. We call each of these attempts at running replicates a replicate execution. This section lists all replicate executions that failed due to the target crashing or being oom killed.
Note: In the below tables we bucket failures by experiment, variant, and failure type. For each of these buckets we list out the replicate indexes that failed with an annotation signifying how many times said replicate failed with the given failure mode. In the below example the baseline variant of the experiment named experiment_with_failures had two replicates that failed by oom kills. Replicate 0, which failed 8 executions, and replicate 1 which failed 6 executions, all with the same failure mode.
| Experiment | Variant | Replicates | Failure | Logs | Debug Dashboard |
|---|---|---|---|---|---|
| experiment_with_failures | baseline | 0 (x8) 1 (x6) | Oom killed | Debug Dashboard |
The debug dashboard links will take you to a debugging dashboard specifically designed to investigate replicate execution failures.
❌ Retried Profiling Replicate Execution Failures (ddprof)
Note: Profiling replicas may still be executing. See the debug dashboard for up to date status.
| Experiment | Variant | Replicates | Failure | Debug Dashboard |
|---|---|---|---|---|
| quality_gate_idle | baseline | 10 | Oom killed | Debug Dashboard |
| quality_gate_idle | comparison | 10 | Oom killed | Debug Dashboard |
| quality_gate_idle_all_features | baseline | 10 | Oom killed | Debug Dashboard |
| quality_gate_idle_all_features | comparison | 10 | Oom killed | Debug Dashboard |
| quality_gate_logs | baseline | 10 | Oom killed | Debug Dashboard |
| quality_gate_logs | comparison | 10 | Oom killed | Debug Dashboard |
| quality_gate_metrics_logs | baseline | 10 | Oom killed | Debug Dashboard |
| quality_gate_metrics_logs | comparison | 10 | Oom killed | Debug Dashboard |
| quality_gate_security_idle | baseline | 10 | Oom killed | Debug Dashboard |
| quality_gate_security_idle | comparison | 10 | Oom killed | Debug Dashboard |
| quality_gate_security_no_fs_load | baseline | 10 | Oom killed | Debug Dashboard |
| quality_gate_security_no_fs_load | comparison | 10 | Oom killed | Debug Dashboard |
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_security_idle, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_security_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_security_no_fs_load, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_security_no_fs_load, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_security_mean_fs_load, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_security_mean_fs_load, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b11bcff94a
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fc0a14c79d
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
I have read the CLA Document and I hereby sign the CLA |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: be3f308578
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| sessionExpiration := deps.Config.GetDuration("GUI_session_expiration") | ||
| g.auth = newAuthenticator(authToken, sessionExpiration) | ||
| setGetAuthToken(deps.Ipc.GetAuthToken) | ||
| socketPath := deps.Config.GetString("system_probe_config.sysprobe_socket") |
There was a problem hiding this comment.
Use the system-probe config for the socket path
In macOS deployments where system-probe.yaml or DD_SYSPROBE_SOCKET overrides the system-probe socket, this reads the core Agent config instead of the system-probe config component. That setting is bound on the system-probe config side, so the GUI can fall back to the default path and restart() posts to the wrong Unix socket, making restart fail even though system-probe is running on the configured socket. Please source this from the sysprobe config component.
Useful? React with 👍 / 👎.
| w.WriteHeader(http.StatusOK) | ||
| if f, ok := w.(http.Flusher); ok { | ||
| f.Flush() |
There was a problem hiding this comment.
Report launchctl failures before returning success
When launchctl kickstart fails, for example because the service label is not bootstrapped, this sends and flushes a 200 before either restart is attempted; the later failures are only logged. The GUI treats the 200 response from /agent-restart as Success, so users can be told the restart worked even though neither service was restarted. Please perform the agent restart before the OK response, and only defer the self-terminating sysprobe restart, or otherwise surface failures before returning success.
Useful? React with 👍 / 👎.
What does this PR do?
Adds support for restarting the Datadog Agent from the macOS GUI by routing the restart request through
system-probe, which runs as root.POST /agent-restartendpoint to thesystem-probeHTTP API (cmd/system-probe/api/agentrestart_darwin.go) that callslaunchctl kickstart -kfor bothcom.datadoghq.agentandcom.datadoghq.sysprobe.comp/core/gui/impl/platform_darwin.go) POSTs to that endpoint over the system-probe Unix socket, authenticated with the IPC Bearer token.Only restart is implemented; stop is out of scope.
Motivation
The GUI process does not have root privileges and cannot call
launchctl kickstartagainst system-domain services directly. Routing throughsystem-probe(already running as root) avoids the need for privilege escalation viaosascript, which fails in daemon contexts without a GUI window server session.Jira: WINA-2692
Describe how you validated your changes
system-probewithdda inv system-probe.buildand the Agent withdda inv agent.build --build-exclude=systemd, restarted both services manually; bothcom.datadoghq.agentandcom.datadoghq.sysproberestarted successfully via the GUI button.cmd/system-probe/apiandcomp/core/gui/impland all pass (dda inv test).Additional Notes
POST /agent-restartendpoint returns 501 on non-macOS platforms via a build-tag-guarded stub (agentrestart_others.go).platform_nix.go,platform_windows.go) include no-op stubs forsetRestartAuthTokenandsetSysprobeSocketPathsogui.gocompiles on all platforms.