SVLS-8857: replace serverless metric agent with base metric agent#50232
SVLS-8857: replace serverless metric agent with base metric agent#50232apiarian-datadog wants to merge 8 commits into
Conversation
|
This PR can be reviewed commit by commit. It's a non-trivial refactor as a whole. |
Go Package Import DifferencesBaseline: 205dba5
|
|
Files inventory check summaryFile checks results against ancestor 205dba56: Results for datadog-agent_7.81.0~devel.git.348.f0e625f.pipeline.116284349-1_amd64.deb:No change detected |
Static quality checks✅ Please find below the results from static quality gates Successful checksInfo
9 successful checks with minimal change (< 2 KiB)
|
2ef722b to
f7f03a1
Compare
Regression DetectorRegression Detector ResultsMetrics dashboard Baseline: 205dba5 Optimization Goals: ✅ No significant changes detected
|
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +1.97 | [-1.01, +4.95] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +1.97 | [-1.01, +4.95] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | +0.62 | [+0.41, +0.82] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | +0.61 | [+0.41, +0.82] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | +0.34 | [+0.10, +0.59] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics_sum_delta | memory utilization | +0.26 | [+0.07, +0.45] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | +0.25 | [+0.19, +0.30] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | +0.20 | [+0.15, +0.24] | 1 | Logs bounds checks dashboard |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | +0.05 | [-0.35, +0.45] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.02 | [-0.18, +0.23] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | +0.01 | [-0.03, +0.04] | 1 | Logs bounds checks dashboard |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | +0.00 | [-0.14, +0.14] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.00 | [-0.10, +0.10] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | -0.00 | [-0.21, +0.20] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | -0.04 | [-0.54, +0.46] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.06 | [-0.51, +0.39] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | -0.10 | [-0.26, +0.05] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | -0.11 | [-0.27, +0.05] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.11 | [-0.16, -0.06] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | -0.14 | [-0.24, -0.04] | 1 | Logs |
| ➖ | file_tree | memory utilization | -0.15 | [-0.20, -0.10] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | -0.23 | [-0.47, +0.00] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | -0.26 | [-0.36, -0.16] | 1 | Logs |
| ➖ | quality_gate_logs | % cpu utilization | -0.53 | [-1.56, +0.49] | 1 | Logs bounds checks dashboard |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 697 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 245.60MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 730 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.16GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.20GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.17GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 146.11MiB ≤ 147MiB | bounds checks dashboard |
| ✅ | quality_gate_idle | total_bytes_received | 10/10 | 743.14KiB ≤ 819.20KiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 484.86MiB ≤ 495MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | total_bytes_received | 10/10 | 1.13MiB ≤ 1.25MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 174.07MiB ≤ 195MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_logs | total_bytes_received | 10/10 | 264.29MiB ≤ 292MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 347.65 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 374.71MiB ≤ 430MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | total_bytes_received | 10/10 | 0.94GiB ≤ 1.04GiB | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
f7f03a1 to
8ba3f54
Compare
95d2cbf to
e10234b
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e10234b431
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
e10234b to
25b8435
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 25b8435fbe
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f7ac73cf5c
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
f7ac73c to
200ca54
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 200ca54c1f
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
200ca54 to
1240ba6
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1240ba6b81
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
1240ba6 to
242cc30
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 242cc305e0
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4bbbe356a8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| eventplatformfx.Module(eventplatform.NewDisabledParams()), | ||
| eventplatformreceiverimpl.Module(), | ||
| haagentfx.Module(), | ||
| forwarder.Bundle(defaultforwarder.NewParams()), |
There was a problem hiding this comment.
Disable forwarder health checks when the API key is absent
In the supported DD_API_KEY-unset mode, this new unconditional forwarder bundle is still constructed and started before setup() returns the no-op trace/metrics path. defaultforwarder.Start() starts its API-key health checker unless WithDisableAPIKeyChecking is used, and utils.GetMultipleEndpoints still builds a resolver containing the empty top-level api_key, so serverless-init now makes validation attempts and logs “No valid api key” errors even though the old flow returned before creating the metric forwarder. This regresses the process/logs-only scenario that the no-key branch is meant to keep quiet; gate the forwarder too or disable API-key checking when the early sanitized key is empty.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4bbbe356a8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| eventplatformfx.Module(eventplatform.NewDisabledParams()), | ||
| eventplatformreceiverimpl.Module(), | ||
| haagentfx.Module(), | ||
| forwarder.Bundle(defaultforwarder.NewParams()), |
There was a problem hiding this comment.
Disable forwarder health checks when the API key is absent
In the supported DD_API_KEY-unset mode, this new unconditional forwarder bundle is still constructed and started before setup() returns the no-op trace/metrics path. defaultforwarder.Start() starts its API-key health checker unless WithDisableAPIKeyChecking is used, and utils.GetMultipleEndpoints still builds a resolver containing the empty top-level api_key, so serverless-init now makes validation attempts and logs “No valid api key” errors even though the old flow returned before creating the metric forwarder. This regresses the process/logs-only scenario that the no-key branch is meant to keep quiet; gate the forwarder too or disable API-key checking when the early sanitized key is empty.
Useful? React with 👍 / 👎.
vickenty
left a comment
There was a problem hiding this comment.
Would it be possible to split the forwarder changes into a separate PR?
| // guards against a racing Stop closing the worker loop between our snapshot | ||
| // and the unbuffered send: if the server is stopping, workers exit on | ||
| // <-stopChan and would never receive on flushChan, so we abandon the send | ||
| // and break out of the loop early. |
There was a problem hiding this comment.
Is there a reason why worker can't flush itself in the stopChan branch when on serverless?
There was a problem hiding this comment.
because we need to not only flush the worker but also ensure that those flushed samples actually get through aggregation and out of the serializer. so we need to trigger the flush on the worker and then force the flush on the serializer.
also wouldn't we end up changing the stop semantics for every dogstatsd user if we did this? wanted to limit our changes to serverless environments.
There was a problem hiding this comment.
because we need to not only flush the worker but also ensure that those flushed samples actually get through aggregation and out of the serializer. so we need to trigger the flush on the worker and then force the flush on the serializer.
The rest of the components will follow the same pattern: if needed, make sure metrics make out on stop. Aggregator I believe already has an option to do this, and you are adding it to the forwarder. Fx will ensure for us that we're flushing in the correct order as well.
also wouldn't we end up changing the stop semantics for every dogstatsd user if we did this? wanted to limit our changes to serverless environments.
It would need to be behind a flag, similar to how you did this in the forwarder.
There was a problem hiding this comment.
sounds good, i'll try to refactor it this way
There was a problem hiding this comment.
ok, done. lots of code changed, and arranged in a way that should let me split out the main metrics agent changes into its own pr.
There was a problem hiding this comment.
Per our discussion on #51565, i'll hold off on further work on this PR until your upstream refactoring is done.
4bbbe35 to
81a94bc
Compare
Add a forwarder_stop_wait_for_inflight option so Worker.Stop can wait for in-flight HTTP transactions to finish before returning instead of cancelling them. serverless-init opts in so the final flush is never aborted by Stop. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Give each worker its own flushChan and add a dogstatsd_flush_on_stop option so the server drains every worker's batched samples into the time sampler during stop(), before the worker loops are torn down. This replaces the explicit ServerlessFlush trigger with a stop-driven drain (the old path is removed in a later commit). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add an aggregator_drain_samples_on_stop path so AgentDemultiplexer.Stop drains pending metric samples before the final flush, giving serverless a flush mechanism on the standard demultiplexer. The legacy ServerlessDemultiplexer is removed in a later commit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add the boolean settings that let serverless-init opt the aggregator, forwarder, and dogstatsd server into draining/flushing pending data on Stop instead of dropping it: aggregator_drain_samples_on_stop, forwarder_stop_wait_for_inflight, and dogstatsd_flush_on_stop. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reduce ServerlessMetricAgent to a thin wrapper over the demultiplexer and tags, switching it onto the stop-driven flush path, and add a metricstest fx helper plus integration coverage. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wire serverless-init to the Fx-managed forwarder, demultiplexer, and dogstatsd server, and rework the shutdown sequence to drive a proper flush through them on Stop. Tolerate a nil metric agent during cloudservice shutdown and parameterize the log flush timeout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Now that serverless-init drives flushing through the dogstatsd server's stop-driven drain, the demultiplexer's drain-on-stop path, and the forwarder's wait-for-inflight option, delete the old explicit-flush path: - dogstatsd Component.ServerlessFlush and its serverlessFlushChan - the standalone NewServerlessServer (server/impl/serverless.go) - aggregator's ServerlessDemultiplexer (demultiplexer_serverless.go) - forwarder SyncForwarder (sync_forwarder.go) - pkg/serverless legacy plumbing (serverless.go) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
81a94bc to
f0e625f
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f0e625fab1
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if s.flushOnStop { | ||
| s.log.Debug("dogstatsd_flush_on_stop: draining worker batchers before stop") | ||
| for _, w := range s.workers { | ||
| w.flushChan <- struct{}{} |
There was a problem hiding this comment.
Block until worker batcher flushes finish
When dogstatsd_flush_on_stop is enabled for serverless shutdown, this unbuffered send only waits until the worker receives the signal, not until batcher.flush() has actually enqueued its samples into the demux. Fx then proceeds to AgentDemultiplexer.Stop, whose new drain only looks at already-enqueued sampler channels; if it observes them before the worker finishes flushing, the final serializer flush can run and miss the worker's last buffered DogStatsD metrics/events/service checks. This affects shutdown paths where workers still have batched data, so the stop path needs an acknowledgement after batcher.flush() completes before demux shutdown starts.
Useful? React with 👍 / 👎.
|
PR to drain aggregator inflight samples: #51866 Also, since aggregator feature is already called |
|
closing in favor of #52267 |
What does this PR do?
Replace the bespoke serverless metric agent lifecycle with the base metric agent. The serverless metric agent was necessary in the days when we supported aws lambda. Now that we only need to work with containerized deploys, we can switch to using the base metric agent instead.
Motivation
Want to simplify serverless-init logic and make sure that we handle shutdown correctly in environments like cloud run jobs where shutdowns happen more often.
Metrics Agent Changes
The metrics agent receives metrics packets at the listener, which are then passed to the
packetsInchannel. The worker parses the packets and turns them into MetricSamples for the batcher. The batcher flushes these samples and sends them to the demultiplexer. The demultiplexer puts these samples into a channel for a separate goroutine to bucket up (these are the samples that WaitForPendingSamples waits on). On flush the aggregated metrics are turned into payloads, then serialized, and passed to the forwarder which packs them up into a transaction for delivery. The forwarder worker then posts those samples out to datadog.Along the way, we discovered that the metrics agent didn't quite meet the serverless-init needs. Serverless-init works in ephemeral environments. It doesn't need to worry about freezes like lambda, but once the container stops, it is gone. So we need to make sure that any remaining metrics are definitely flushed when we stop the metrics agent. We cannot rely on a filesystem buffer for resending metrics on restart. This happens after every single task run in cloud run jobs, as well as during routine app updates, refreshes, and cloud-managed restarts for the more request/response-focused workflows.
The demultiplexer couldn't synchronously drain samples before flushing. The addition of WaitForPendingSamples adds this support.
The old ServerlessFlush only flushed a single worker. Now we flush all of them.
The standard async forwarder cancelled the worker context on stop, so any payloads that are in-flight also get cancelled. We've added a
forwarder_stop_wait_for_inflightsetting which lets those requests finish. We added the setting because it introduces a behavioral change to the forwarder, making it not stop immediately on Stop. This may be something that we might support in environments where the metrics agent is going to be stopping and not coming back.Describe how you validated your changes
Unit testing. Running End to End tests and validating with self-monitoring and custom test apps.
Tested in an actual cloud run job with 1k tasks. The tasks submit steady streams of metrics, and then a large burst at the end. This uncovered some issues with the stopping (in particular, the forwarder in-flight request finishing problem).
There's still a small number of metrics dropped (less than 1% of the 500 metrics submitted at the very end) but this is confirmed to be a kernel-level UDP packet drop, and 500 metrics all at once is a pretty strange thing to do at shutdown. Dropping this final burst to 100 eliminates this UDP packet drop.