SVLS-8857: replace serverless metric agent with base metric agent by apiarian-datadog · Pull Request #50232 · DataDog/datadog-agent

apiarian-datadog · 2026-05-01T14:49:34Z

What does this PR do?

Replace the bespoke serverless metric agent lifecycle with the base metric agent. The serverless metric agent was necessary in the days when we supported aws lambda. Now that we only need to work with containerized deploys, we can switch to using the base metric agent instead.

Motivation

Want to simplify serverless-init logic and make sure that we handle shutdown correctly in environments like cloud run jobs where shutdowns happen more often.

Metrics Agent Changes

The metrics agent receives metrics packets at the listener, which are then passed to the packetsIn channel. The worker parses the packets and turns them into MetricSamples for the batcher. The batcher flushes these samples and sends them to the demultiplexer. The demultiplexer puts these samples into a channel for a separate goroutine to bucket up (these are the samples that WaitForPendingSamples waits on). On flush the aggregated metrics are turned into payloads, then serialized, and passed to the forwarder which packs them up into a transaction for delivery. The forwarder worker then posts those samples out to datadog.

Along the way, we discovered that the metrics agent didn't quite meet the serverless-init needs. Serverless-init works in ephemeral environments. It doesn't need to worry about freezes like lambda, but once the container stops, it is gone. So we need to make sure that any remaining metrics are definitely flushed when we stop the metrics agent. We cannot rely on a filesystem buffer for resending metrics on restart. This happens after every single task run in cloud run jobs, as well as during routine app updates, refreshes, and cloud-managed restarts for the more request/response-focused workflows.

The demultiplexer couldn't synchronously drain samples before flushing. The addition of WaitForPendingSamples adds this support.

The old ServerlessFlush only flushed a single worker. Now we flush all of them.

The standard async forwarder cancelled the worker context on stop, so any payloads that are in-flight also get cancelled. We've added a forwarder_stop_wait_for_inflight setting which lets those requests finish. We added the setting because it introduces a behavioral change to the forwarder, making it not stop immediately on Stop. This may be something that we might support in environments where the metrics agent is going to be stopping and not coming back.

Describe how you validated your changes

Unit testing. Running End to End tests and validating with self-monitoring and custom test apps.

Tested in an actual cloud run job with 1k tasks. The tasks submit steady streams of metrics, and then a large burst at the end. This uncovered some issues with the stopping (in particular, the forwarder in-flight request finishing problem).

There's still a small number of metrics dropped (less than 1% of the 500 metrics submitted at the very end) but this is confirmed to be a kernel-level UDP packet drop, and 500 metrics all at once is a pretty strange thing to do at shutdown. Dropping this final burst to 100 eliminates this UDP packet drop.

apiarian-datadog · 2026-05-01T14:50:08Z

This PR can be reviewed commit by commit. It's a non-trivial refactor as a whole.

dd-octo-sts · 2026-05-01T14:56:44Z

Go Package Import Differences

Baseline: 205dba5
Comparison: f0e625f

binary	os	arch	change
agent	linux	amd64	+0, -1 -github.com/DataDog/datadog-agent/comp/dogstatsd/replay/impl-noop
agent	linux	arm64	+0, -1 -github.com/DataDog/datadog-agent/comp/dogstatsd/replay/impl-noop
agent	windows	amd64	+0, -1 -github.com/DataDog/datadog-agent/comp/dogstatsd/replay/impl-noop
agent	darwin	amd64	+0, -1 -github.com/DataDog/datadog-agent/comp/dogstatsd/replay/impl-noop
agent	darwin	arm64	+0, -1 -github.com/DataDog/datadog-agent/comp/dogstatsd/replay/impl-noop
agent	aix	ppc64	+0, -1 -github.com/DataDog/datadog-agent/comp/dogstatsd/replay/impl-noop
iot-agent	linux	amd64	+0, -2 -github.com/DataDog/datadog-agent/comp/core/telemetry/impl/noops -github.com/DataDog/datadog-agent/comp/dogstatsd/replay/impl-noop
iot-agent	linux	arm64	+0, -2 -github.com/DataDog/datadog-agent/comp/core/telemetry/impl/noops -github.com/DataDog/datadog-agent/comp/dogstatsd/replay/impl-noop
heroku-agent	linux	amd64	+0, -1 -github.com/DataDog/datadog-agent/comp/dogstatsd/replay/impl-noop
dogstatsd	linux	amd64	+0, -2 -github.com/DataDog/datadog-agent/comp/core/telemetry/impl/noops -github.com/DataDog/datadog-agent/comp/dogstatsd/replay/impl-noop
dogstatsd	linux	arm64	+0, -2 -github.com/DataDog/datadog-agent/comp/core/telemetry/impl/noops -github.com/DataDog/datadog-agent/comp/dogstatsd/replay/impl-noop

datadog-prod-us1-5 · 2026-05-01T15:09:03Z

✨ Fix all issues with BitsAI

⚠️ Warnings

🚦 2 Pipeline jobs failed

DataDog/datadog-agent | new-e2e-ha-agent-failover

🔄 Retry job. This looks flaky and may succeed on retry.
Error: Could not fetch metadata payload: Get "https://localhost:5001/agent/metadata/ha-agent": dial tcp 127.0.0.1:5001: connect: connection refused; Error: could not reach agent: Get "https://localhost:5001/agent/status/section/collector?format=text": dial tcp 127.0.0.1:5001: connect: connection refused

Label analysis | release-note-check

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration.
No releasenote found for this PR. Please add one using 'reno' or apply the label 'changelog/no-changelog'.

ℹ️ Info

🎯 Code Coverage (details)
• Patch Coverage: 68.87%
• Overall Coverage: 50.47% (-0.01%)

Useful? React with 👍 / 👎

_{This comment will be updated automatically if new data arrives.

🔗 Commit SHA: f0e625f | Docs | Datadog PR Page | Give us feedback!}

dd-octo-sts · 2026-05-01T15:17:45Z

Files inventory check summary

File checks results against ancestor 205dba56:

Results for datadog-agent_7.81.0~devel.git.348.f0e625f.pipeline.116284349-1_amd64.deb:

No change detected

dd-octo-sts · 2026-05-01T15:29:47Z

Static quality checks

✅ Please find below the results from static quality gates
Comparison made with ancestor 205dba5
📊 Static Quality Gates Dashboard
🔗 SQG Job

Successful checks

Info

	Quality gate	Change	Size (prev → curr → max)
✅	agent_deb_amd64	+8.54 KiB (0.00% increase, -0.19% of buffer)	746.307 → 746.315 → 750.800
✅	agent_deb_amd64_fips	+8.54 KiB (0.00% increase, -3.62% of buffer)	704.070 → 704.078 → 704.300
✅	agent_msi	+10.0 KiB (0.00% increase, -0.07% of buffer)	610.254 → 610.263 → 624.040
✅	agent_rpm_amd64	+8.54 KiB (0.00% increase, -0.19% of buffer)	746.291 → 746.299 → 750.770
✅	agent_rpm_amd64_fips	+8.54 KiB (0.00% increase, -3.38% of buffer)	704.053 → 704.062 → 704.300
✅	agent_rpm_arm64	+4.07 KiB (0.00% increase, -0.60% of buffer)	723.837 → 723.841 → 724.500
✅	agent_rpm_arm64_fips	+8.07 KiB (0.00% increase, -7.51% of buffer)	684.765 → 684.773 → 684.870
✅	agent_suse_amd64	+8.54 KiB (0.00% increase, -0.19% of buffer)	746.291 → 746.299 → 750.770
✅	agent_suse_amd64_fips	+8.54 KiB (0.00% increase, -3.38% of buffer)	704.053 → 704.062 → 704.300
✅	agent_suse_arm64	+4.07 KiB (0.00% increase, -0.60% of buffer)	723.837 → 723.841 → 724.500
✅	agent_suse_arm64_fips	+8.07 KiB (0.00% increase, -7.51% of buffer)	684.765 → 684.773 → 684.870
✅	docker_agent_amd64	+10.29 KiB (0.00% increase, -2.54% of buffer)	806.444 → 806.454 → 806.840
✅	docker_agent_arm64	+5.84 KiB (0.00% increase, -0.44% of buffer)	808.815 → 808.821 → 810.120
✅	docker_agent_jmx_amd64	+31.67 KiB (0.00% increase, -13.07% of buffer)	997.363 → 997.394 → 997.600
✅	docker_agent_jmx_arm64	-102.03 KiB (0.01% reduction, +7.74% of buffer)	988.513 → 988.414 → 989.800
✅	docker_cluster_agent_amd64	+4.16 KiB (0.00% increase, -0.68% of buffer)	207.109 → 207.113 → 207.710
✅	docker_host_profiler_amd64	+3.88 KiB (0.00% increase, -0.03% of buffer)	302.131 → 302.135 → 315.880
✅	docker_host_profiler_arm64	-60.63 KiB (0.02% reduction, +0.43% of buffer)	313.655 → 313.596 → 327.470
✅	dogstatsd_deb_arm64	+4.0 KiB (0.01% increase, -0.39% of buffer)	28.296 → 28.300 → 29.290
✅	iot_agent_deb_amd64	+4.0 KiB (0.01% increase, -0.52% of buffer)	44.472 → 44.476 → 45.230
✅	iot_agent_deb_arm64	+4.0 KiB (0.01% increase, -0.29% of buffer)	41.437 → 41.441 → 42.800
✅	iot_agent_deb_armhf	+4.0 KiB (0.01% increase, -0.48% of buffer)	42.150 → 42.154 → 42.960
✅	iot_agent_rpm_amd64	+4.0 KiB (0.01% increase, -0.52% of buffer)	44.473 → 44.477 → 45.230
✅	iot_agent_suse_amd64	+4.0 KiB (0.01% increase, -0.52% of buffer)	44.473 → 44.477 → 45.230

9 successful checks with minimal change (< 2 KiB)

	Quality gate	Current Size
✅	agent_heroku_amd64	310.774 MiB
✅	docker_cluster_agent_arm64	221.078 MiB
✅	docker_cws_instrumentation_amd64	7.154 MiB
✅	docker_cws_instrumentation_arm64	6.689 MiB
✅	docker_dogstatsd_amd64	39.519 MiB
✅	docker_dogstatsd_arm64	37.690 MiB
✅	dogstatsd_deb_amd64	30.174 MiB
✅	dogstatsd_rpm_amd64	30.174 MiB
✅	dogstatsd_suse_amd64	30.174 MiB

cit-pr-commenter-54b7da · 2026-05-01T16:22:02Z

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 168fa06f-bdba-4cb8-ad61-8e36eab80004

Baseline: 205dba5
Comparison: f0e625f
Diff

Optimization Goals: ✅ No significant changes detected

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	docker_containers_cpu	% cpu utilization	+1.97	[-1.01, +4.95]	1	Logs

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	docker_containers_cpu	% cpu utilization	+1.97	[-1.01, +4.95]	1	Logs
➖	tcp_syslog_to_blackhole	ingress throughput	+0.62	[+0.41, +0.82]	1	Logs
➖	ddot_metrics	memory utilization	+0.61	[+0.41, +0.82]	1	Logs
➖	quality_gate_metrics_logs	memory utilization	+0.34	[+0.10, +0.59]	1	Logs bounds checks dashboard
➖	ddot_metrics_sum_delta	memory utilization	+0.26	[+0.07, +0.45]	1	Logs
➖	ddot_logs	memory utilization	+0.25	[+0.19, +0.30]	1	Logs
➖	quality_gate_idle	memory utilization	+0.20	[+0.15, +0.24]	1	Logs bounds checks dashboard
➖	file_to_blackhole_500ms_latency	egress throughput	+0.05	[-0.35, +0.45]	1	Logs
➖	uds_dogstatsd_to_api	ingress throughput	+0.02	[-0.18, +0.23]	1	Logs
➖	quality_gate_idle_all_features	memory utilization	+0.01	[-0.03, +0.04]	1	Logs bounds checks dashboard
➖	file_to_blackhole_100ms_latency	egress throughput	+0.00	[-0.14, +0.14]	1	Logs
➖	tcp_dd_logs_filter_exclude	ingress throughput	-0.00	[-0.10, +0.10]	1	Logs
➖	uds_dogstatsd_to_api_v3	ingress throughput	-0.00	[-0.21, +0.20]	1	Logs
➖	file_to_blackhole_0ms_latency	egress throughput	-0.04	[-0.54, +0.46]	1	Logs
➖	file_to_blackhole_1000ms_latency	egress throughput	-0.06	[-0.51, +0.39]	1	Logs
➖	ddot_metrics_sum_cumulative	memory utilization	-0.10	[-0.26, +0.05]	1	Logs
➖	otlp_ingest_metrics	memory utilization	-0.11	[-0.27, +0.05]	1	Logs
➖	uds_dogstatsd_20mb_12k_contexts_20_senders	memory utilization	-0.11	[-0.16, -0.06]	1	Logs
➖	docker_containers_memory	memory utilization	-0.14	[-0.24, -0.04]	1	Logs
➖	file_tree	memory utilization	-0.15	[-0.20, -0.10]	1	Logs
➖	ddot_metrics_sum_cumulativetodelta_exporter	memory utilization	-0.23	[-0.47, +0.00]	1	Logs
➖	otlp_ingest_logs	memory utilization	-0.26	[-0.36, -0.16]	1	Logs
➖	quality_gate_logs	% cpu utilization	-0.53	[-1.56, +0.49]	1	Logs bounds checks dashboard

Bounds Checks: ✅ Passed

perf	experiment	bounds_check_name	replicates_passed	observed_value	links
✅	docker_containers_cpu	simple_check_run	10/10	697 ≥ 26
✅	docker_containers_memory	memory_usage	10/10	245.60MiB ≤ 370MiB
✅	docker_containers_memory	simple_check_run	10/10	730 ≥ 26
✅	file_to_blackhole_0ms_latency	memory_usage	10/10	0.16GiB ≤ 1.20GiB
✅	file_to_blackhole_0ms_latency	missed_bytes	10/10	0B = 0B
✅	file_to_blackhole_1000ms_latency	memory_usage	10/10	0.20GiB ≤ 1.20GiB
✅	file_to_blackhole_1000ms_latency	missed_bytes	10/10	0B = 0B
✅	file_to_blackhole_100ms_latency	memory_usage	10/10	0.17GiB ≤ 1.20GiB
✅	file_to_blackhole_100ms_latency	missed_bytes	10/10	0B = 0B
✅	file_to_blackhole_500ms_latency	memory_usage	10/10	0.19GiB ≤ 1.20GiB
✅	file_to_blackhole_500ms_latency	missed_bytes	10/10	0B = 0B
✅	quality_gate_idle	intake_connections	10/10	3 ≤ 4	bounds checks dashboard
✅	quality_gate_idle	memory_usage	10/10	146.11MiB ≤ 147MiB	bounds checks dashboard
✅	quality_gate_idle	total_bytes_received	10/10	743.14KiB ≤ 819.20KiB	bounds checks dashboard
✅	quality_gate_idle_all_features	intake_connections	10/10	3 ≤ 4	bounds checks dashboard
✅	quality_gate_idle_all_features	memory_usage	10/10	484.86MiB ≤ 495MiB	bounds checks dashboard
✅	quality_gate_idle_all_features	total_bytes_received	10/10	1.13MiB ≤ 1.25MiB	bounds checks dashboard
✅	quality_gate_logs	intake_connections	10/10	4 ≤ 6	bounds checks dashboard
✅	quality_gate_logs	memory_usage	10/10	174.07MiB ≤ 195MiB	bounds checks dashboard
✅	quality_gate_logs	missed_bytes	10/10	0B = 0B	bounds checks dashboard
✅	quality_gate_logs	total_bytes_received	10/10	264.29MiB ≤ 292MiB	bounds checks dashboard
✅	quality_gate_metrics_logs	cpu_usage	10/10	347.65 ≤ 2000	bounds checks dashboard
✅	quality_gate_metrics_logs	intake_connections	10/10	4 ≤ 6	bounds checks dashboard
✅	quality_gate_metrics_logs	memory_usage	10/10	374.71MiB ≤ 430MiB	bounds checks dashboard
✅	quality_gate_metrics_logs	missed_bytes	10/10	0B = 0B	bounds checks dashboard
✅	quality_gate_metrics_logs	total_bytes_received	10/10	0.94GiB ≤ 1.04GiB	bounds checks dashboard

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
Its configuration does not mark it "erratic".

CI Pass/Fail Decision

✅ Passed. All Quality Gates passed.

quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_logs, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_idle, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.

apiarian-datadog · 2026-05-01T17:00:08Z

testing https://gitlab.ddbuild.io/DataDog/serverless-init-ci/-/pipelines/110939630 for f7f03a1

github-actions · 2026-05-20T16:38:30Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e10234b431

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

github-actions · 2026-05-20T21:19:18Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 25b8435fbe

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

github-actions · 2026-05-20T21:32:31Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f7ac73cf5c

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

github-actions · 2026-05-21T18:12:14Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 200ca54c1f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

github-actions · 2026-05-21T21:30:43Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1240ba6b81

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

github-actions · 2026-05-22T15:56:03Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 242cc305e0

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4bbbe356a8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-28T18:34:01Z

+		eventplatformfx.Module(eventplatform.NewDisabledParams()),
+		eventplatformreceiverimpl.Module(),
+		haagentfx.Module(),
+		forwarder.Bundle(defaultforwarder.NewParams()),


Disable forwarder health checks when the API key is absent

In the supported DD_API_KEY-unset mode, this new unconditional forwarder bundle is still constructed and started before setup() returns the no-op trace/metrics path. defaultforwarder.Start() starts its API-key health checker unless WithDisableAPIKeyChecking is used, and utils.GetMultipleEndpoints still builds a resolver containing the empty top-level api_key, so serverless-init now makes validation attempts and logs “No valid api key” errors even though the old flow returned before creating the metric forwarder. This regresses the process/logs-only scenario that the no-key branch is meant to keep quiet; gate the forwarder too or disable API-key checking when the early sanitized key is empty.

Useful? React with 👍 / 👎.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4bbbe356a8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-28T18:34:06Z

+		eventplatformfx.Module(eventplatform.NewDisabledParams()),
+		eventplatformreceiverimpl.Module(),
+		haagentfx.Module(),
+		forwarder.Bundle(defaultforwarder.NewParams()),


Disable forwarder health checks when the API key is absent

In the supported DD_API_KEY-unset mode, this new unconditional forwarder bundle is still constructed and started before setup() returns the no-op trace/metrics path. defaultforwarder.Start() starts its API-key health checker unless WithDisableAPIKeyChecking is used, and utils.GetMultipleEndpoints still builds a resolver containing the empty top-level api_key, so serverless-init now makes validation attempts and logs “No valid api key” errors even though the old flow returned before creating the metric forwarder. This regresses the process/logs-only scenario that the no-key branch is meant to keep quiet; gate the forwarder too or disable API-key checking when the early sanitized key is empty.

Useful? React with 👍 / 👎.

vickenty

Would it be possible to split the forwarder changes into a separate PR?

vickenty · 2026-05-29T11:44:24Z

+	// guards against a racing Stop closing the worker loop between our snapshot
+	// and the unbuffered send: if the server is stopping, workers exit on
+	// <-stopChan and would never receive on flushChan, so we abandon the send
+	// and break out of the loop early.


Is there a reason why worker can't flush itself in the stopChan branch when on serverless?

because we need to not only flush the worker but also ensure that those flushed samples actually get through aggregation and out of the serializer. so we need to trigger the flush on the worker and then force the flush on the serializer.

also wouldn't we end up changing the stop semantics for every dogstatsd user if we did this? wanted to limit our changes to serverless environments.

because we need to not only flush the worker but also ensure that those flushed samples actually get through aggregation and out of the serializer. so we need to trigger the flush on the worker and then force the flush on the serializer.

The rest of the components will follow the same pattern: if needed, make sure metrics make out on stop. Aggregator I believe already has an option to do this, and you are adding it to the forwarder. Fx will ensure for us that we're flushing in the correct order as well.

also wouldn't we end up changing the stop semantics for every dogstatsd user if we did this? wanted to limit our changes to serverless environments.

It would need to be behind a flag, similar to how you did this in the forwarder.

sounds good, i'll try to refactor it this way

ok, done. lots of code changed, and arranged in a way that should let me split out the main metrics agent changes into its own pr.

Per our discussion on #51565, i'll hold off on further work on this PR until your upstream refactoring is done.

Add a forwarder_stop_wait_for_inflight option so Worker.Stop can wait for in-flight HTTP transactions to finish before returning instead of cancelling them. serverless-init opts in so the final flush is never aborted by Stop. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Give each worker its own flushChan and add a dogstatsd_flush_on_stop option so the server drains every worker's batched samples into the time sampler during stop(), before the worker loops are torn down. This replaces the explicit ServerlessFlush trigger with a stop-driven drain (the old path is removed in a later commit). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add an aggregator_drain_samples_on_stop path so AgentDemultiplexer.Stop drains pending metric samples before the final flush, giving serverless a flush mechanism on the standard demultiplexer. The legacy ServerlessDemultiplexer is removed in a later commit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add the boolean settings that let serverless-init opt the aggregator, forwarder, and dogstatsd server into draining/flushing pending data on Stop instead of dropping it: aggregator_drain_samples_on_stop, forwarder_stop_wait_for_inflight, and dogstatsd_flush_on_stop. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Reduce ServerlessMetricAgent to a thin wrapper over the demultiplexer and tags, switching it onto the stop-driven flush path, and add a metricstest fx helper plus integration coverage. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Wire serverless-init to the Fx-managed forwarder, demultiplexer, and dogstatsd server, and rework the shutdown sequence to drive a proper flush through them on Stop. Tolerate a nil metric agent during cloudservice shutdown and parameterize the log flush timeout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Now that serverless-init drives flushing through the dogstatsd server's stop-driven drain, the demultiplexer's drain-on-stop path, and the forwarder's wait-for-inflight option, delete the old explicit-flush path: - dogstatsd Component.ServerlessFlush and its serverlessFlushChan - the standalone NewServerlessServer (server/impl/serverless.go) - aggregator's ServerlessDemultiplexer (demultiplexer_serverless.go) - forwarder SyncForwarder (sync_forwarder.go) - pkg/serverless legacy plumbing (serverless.go) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-01T15:41:46Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f0e625fab1

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-01T15:46:45Z

+	if s.flushOnStop {
+		s.log.Debug("dogstatsd_flush_on_stop: draining worker batchers before stop")
+		for _, w := range s.workers {
+			w.flushChan <- struct{}{}


Block until worker batcher flushes finish

When dogstatsd_flush_on_stop is enabled for serverless shutdown, this unbuffered send only waits until the worker receives the signal, not until batcher.flush() has actually enqueued its samples into the demux. Fx then proceeds to AgentDemultiplexer.Stop, whose new drain only looks at already-enqueued sampler channels; if it observes them before the worker finishes flushing, the final serializer flush can run and miss the worker's last buffered DogStatsD metrics/events/service checks. This affects shutdown paths where workers still have batched data, so the stop path needs an acknowledgement after batcher.flush() completes before demux shutdown starts.

Useful? React with 👍 / 👎.

vickenty · 2026-06-05T13:38:08Z

PR to drain aggregator inflight samples: #51866

Also, since aggregator feature is already called dogstatsd_flush_incomplete_buckets, we can use the same setting to trigger flush in the dogstatsd server, I think the intent is the same for both.

apiarian-datadog · 2026-06-15T18:57:56Z

closing in favor of #52267

dd-octo-sts Bot added internal Identify a non-fork PR team/agent-metric-pipelines team/agent-build labels May 1, 2026

github-actions Bot added the long review PR is complex, plan time to review it label May 1, 2026

apiarian-datadog force-pushed the aleksandr.pasechnik/svls-8857-split branch from 2ef722b to f7f03a1 Compare May 1, 2026 15:31

apiarian-datadog force-pushed the aleksandr.pasechnik/svls-8857-split branch from f7f03a1 to 8ba3f54 Compare May 8, 2026 16:32

apiarian-datadog force-pushed the aleksandr.pasechnik/svls-8857-split branch from 95d2cbf to e10234b Compare May 20, 2026 16:36

chatgpt-codex-connector Bot reviewed May 20, 2026

View reviewed changes

Comment thread cmd/serverless-init/main.go

apiarian-datadog force-pushed the aleksandr.pasechnik/svls-8857-split branch from e10234b to 25b8435 Compare May 20, 2026 21:19

chatgpt-codex-connector Bot reviewed May 20, 2026

View reviewed changes

Comment thread pkg/serverless/metrics/metric.go Outdated

chatgpt-codex-connector Bot reviewed May 20, 2026

View reviewed changes

Comment thread cmd/serverless-init/main.go

Comment thread cmd/serverless-init/main.go Outdated

apiarian-datadog force-pushed the aleksandr.pasechnik/svls-8857-split branch from f7ac73c to 200ca54 Compare May 21, 2026 18:12

chatgpt-codex-connector Bot reviewed May 21, 2026

View reviewed changes

Comment thread cmd/serverless-init/main.go Outdated

Comment thread cmd/serverless-init/main.go

apiarian-datadog force-pushed the aleksandr.pasechnik/svls-8857-split branch from 200ca54 to 1240ba6 Compare May 21, 2026 21:30

chatgpt-codex-connector Bot reviewed May 21, 2026

View reviewed changes

Comment thread cmd/serverless-init/main.go Outdated

apiarian-datadog force-pushed the aleksandr.pasechnik/svls-8857-split branch from 1240ba6 to 242cc30 Compare May 22, 2026 15:55

chatgpt-codex-connector Bot reviewed May 22, 2026

View reviewed changes

Comment thread cmd/serverless-init/main.go

apiarian-datadog marked this pull request as ready for review May 28, 2026 18:28

apiarian-datadog requested review from a team as code owners May 28, 2026 18:28

apiarian-datadog requested a review from rahulkaukuntla May 28, 2026 18:28

apiarian-datadog added team/serverless-azure-gcp qa/done QA done before merge and regressions are covered by tests labels May 28, 2026

apiarian-datadog requested a review from Lewis-E May 28, 2026 18:33

chatgpt-codex-connector Bot reviewed May 28, 2026

View reviewed changes

apiarian-datadog commented May 28, 2026

View reviewed changes

Comment thread comp/forwarder/defaultforwarder/worker.go

aiuto approved these changes May 28, 2026

View reviewed changes

vickenty reviewed May 29, 2026

View reviewed changes

apiarian-datadog force-pushed the aleksandr.pasechnik/svls-8857-split branch from 4bbbe35 to 81a94bc Compare May 29, 2026 20:07

apiarian-datadog and others added 8 commits June 1, 2026 10:57

trace: make ServerlessTraceAgent Stop timeout configurable

558c7b1

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

apiarian-datadog marked this pull request as draft June 1, 2026 15:40

apiarian-datadog force-pushed the aleksandr.pasechnik/svls-8857-split branch from 81a94bc to f0e625f Compare June 1, 2026 15:41

chatgpt-codex-connector Bot reviewed Jun 1, 2026

View reviewed changes

apiarian-datadog closed this Jun 15, 2026

Uh oh!

Conversation

apiarian-datadog commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Motivation

Metrics Agent Changes

Describe how you validated your changes

Uh oh!

apiarian-datadog commented May 1, 2026

Uh oh!

dd-octo-sts Bot commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Go Package Import Differences

Uh oh!

datadog-prod-us1-5 Bot commented May 1, 2026 • edited by datadog-official Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ Warnings

ℹ️ Info

Uh oh!

dd-octo-sts Bot commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Files inventory check summary

Results for datadog-agent_7.81.0~devel.git.348.f0e625f.pipeline.116284349-1_amd64.deb:

Uh oh!

dd-octo-sts Bot commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Static quality checks

Info

Uh oh!

cit-pr-commenter-54b7da Bot commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Regression Detector

Regression Detector Results

Optimization Goals: ✅ No significant changes detected

Experiments ignored for regressions

Fine details of change detection per experiment

Bounds Checks: ✅ Passed

Explanation

CI Pass/Fail Decision

Uh oh!

apiarian-datadog commented May 1, 2026

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

apiarian-datadog commented May 1, 2026 •

edited

Loading

dd-octo-sts Bot commented May 1, 2026 •

edited

Loading

datadog-prod-us1-5 Bot commented May 1, 2026 •

edited by datadog-official Bot

Loading

dd-octo-sts Bot commented May 1, 2026 •

edited

Loading

dd-octo-sts Bot commented May 1, 2026 •

edited

Loading

cit-pr-commenter-54b7da Bot commented May 1, 2026 •

edited

Loading