Skip to content

SVLS-8857: replace serverless metric agent with base metric agent#50232

Closed
apiarian-datadog wants to merge 8 commits into
mainfrom
aleksandr.pasechnik/svls-8857-split
Closed

SVLS-8857: replace serverless metric agent with base metric agent#50232
apiarian-datadog wants to merge 8 commits into
mainfrom
aleksandr.pasechnik/svls-8857-split

Conversation

@apiarian-datadog

@apiarian-datadog apiarian-datadog commented May 1, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Replace the bespoke serverless metric agent lifecycle with the base metric agent. The serverless metric agent was necessary in the days when we supported aws lambda. Now that we only need to work with containerized deploys, we can switch to using the base metric agent instead.

Motivation

Want to simplify serverless-init logic and make sure that we handle shutdown correctly in environments like cloud run jobs where shutdowns happen more often.

Metrics Agent Changes

The metrics agent receives metrics packets at the listener, which are then passed to the packetsIn channel. The worker parses the packets and turns them into MetricSamples for the batcher. The batcher flushes these samples and sends them to the demultiplexer. The demultiplexer puts these samples into a channel for a separate goroutine to bucket up (these are the samples that WaitForPendingSamples waits on). On flush the aggregated metrics are turned into payloads, then serialized, and passed to the forwarder which packs them up into a transaction for delivery. The forwarder worker then posts those samples out to datadog.

Along the way, we discovered that the metrics agent didn't quite meet the serverless-init needs. Serverless-init works in ephemeral environments. It doesn't need to worry about freezes like lambda, but once the container stops, it is gone. So we need to make sure that any remaining metrics are definitely flushed when we stop the metrics agent. We cannot rely on a filesystem buffer for resending metrics on restart. This happens after every single task run in cloud run jobs, as well as during routine app updates, refreshes, and cloud-managed restarts for the more request/response-focused workflows.

The demultiplexer couldn't synchronously drain samples before flushing. The addition of WaitForPendingSamples adds this support.

The old ServerlessFlush only flushed a single worker. Now we flush all of them.

The standard async forwarder cancelled the worker context on stop, so any payloads that are in-flight also get cancelled. We've added a forwarder_stop_wait_for_inflight setting which lets those requests finish. We added the setting because it introduces a behavioral change to the forwarder, making it not stop immediately on Stop. This may be something that we might support in environments where the metrics agent is going to be stopping and not coming back.

Describe how you validated your changes

Unit testing. Running End to End tests and validating with self-monitoring and custom test apps.

Tested in an actual cloud run job with 1k tasks. The tasks submit steady streams of metrics, and then a large burst at the end. This uncovered some issues with the stopping (in particular, the forwarder in-flight request finishing problem).

There's still a small number of metrics dropped (less than 1% of the 500 metrics submitted at the very end) but this is confirmed to be a kernel-level UDP packet drop, and 500 metrics all at once is a pretty strange thing to do at shutdown. Dropping this final burst to 100 eliminates this UDP packet drop.

@apiarian-datadog

Copy link
Copy Markdown
Contributor Author

This PR can be reviewed commit by commit. It's a non-trivial refactor as a whole.

@github-actions github-actions Bot added the long review PR is complex, plan time to review it label May 1, 2026
@dd-octo-sts

dd-octo-sts Bot commented May 1, 2026

Copy link
Copy Markdown
Contributor

Go Package Import Differences

Baseline: 205dba5
Comparison: f0e625f

binaryosarchchange
agentlinuxamd64
+0, -1
-github.com/DataDog/datadog-agent/comp/dogstatsd/replay/impl-noop
agentlinuxarm64
+0, -1
-github.com/DataDog/datadog-agent/comp/dogstatsd/replay/impl-noop
agentwindowsamd64
+0, -1
-github.com/DataDog/datadog-agent/comp/dogstatsd/replay/impl-noop
agentdarwinamd64
+0, -1
-github.com/DataDog/datadog-agent/comp/dogstatsd/replay/impl-noop
agentdarwinarm64
+0, -1
-github.com/DataDog/datadog-agent/comp/dogstatsd/replay/impl-noop
agentaixppc64
+0, -1
-github.com/DataDog/datadog-agent/comp/dogstatsd/replay/impl-noop
iot-agentlinuxamd64
+0, -2
-github.com/DataDog/datadog-agent/comp/core/telemetry/impl/noops
-github.com/DataDog/datadog-agent/comp/dogstatsd/replay/impl-noop
iot-agentlinuxarm64
+0, -2
-github.com/DataDog/datadog-agent/comp/core/telemetry/impl/noops
-github.com/DataDog/datadog-agent/comp/dogstatsd/replay/impl-noop
heroku-agentlinuxamd64
+0, -1
-github.com/DataDog/datadog-agent/comp/dogstatsd/replay/impl-noop
dogstatsdlinuxamd64
+0, -2
-github.com/DataDog/datadog-agent/comp/core/telemetry/impl/noops
-github.com/DataDog/datadog-agent/comp/dogstatsd/replay/impl-noop
dogstatsdlinuxarm64
+0, -2
-github.com/DataDog/datadog-agent/comp/core/telemetry/impl/noops
-github.com/DataDog/datadog-agent/comp/dogstatsd/replay/impl-noop

@datadog-prod-us1-5

datadog-prod-us1-5 Bot commented May 1, 2026

Copy link
Copy Markdown
Contributor

Pipelines

Fix all issues with BitsAI

⚠️ Warnings

🚦 2 Pipeline jobs failed

DataDog/datadog-agent | new-e2e-ha-agent-failover   View in Datadog   GitLab

🔄 Retry job. This looks flaky and may succeed on retry. Error: Could not fetch metadata payload: Get "https://localhost:5001/agent/metadata/ha-agent": dial tcp 127.0.0.1:5001: connect: connection refused; Error: could not reach agent: Get "https://localhost:5001/agent/status/section/collector?format=text": dial tcp 127.0.0.1:5001: connect: connection refused

Label analysis | release-note-check   View in Datadog   GitHub Actions

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration. No releasenote found for this PR. Please add one using 'reno' or apply the label 'changelog/no-changelog'.

ℹ️ Info

🎯 Code Coverage (details)
Patch Coverage: 68.87%
Overall Coverage: 50.47% (-0.01%)

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: f0e625f | Docs | Datadog PR Page | Give us feedback!

@dd-octo-sts

dd-octo-sts Bot commented May 1, 2026

Copy link
Copy Markdown
Contributor

Files inventory check summary

File checks results against ancestor 205dba56:

Results for datadog-agent_7.81.0~devel.git.348.f0e625f.pipeline.116284349-1_amd64.deb:

No change detected

@dd-octo-sts

dd-octo-sts Bot commented May 1, 2026

Copy link
Copy Markdown
Contributor

Static quality checks

✅ Please find below the results from static quality gates
Comparison made with ancestor 205dba5
📊 Static Quality Gates Dashboard
🔗 SQG Job

Successful checks

Info

Quality gate Change Size (prev → curr → max)
agent_deb_amd64 +8.54 KiB (0.00% increase, -0.19% of buffer) 746.307 → 746.315 → 750.800
agent_deb_amd64_fips +8.54 KiB (0.00% increase, -3.62% of buffer) 704.070 → 704.078 → 704.300
agent_msi +10.0 KiB (0.00% increase, -0.07% of buffer) 610.254 → 610.263 → 624.040
agent_rpm_amd64 +8.54 KiB (0.00% increase, -0.19% of buffer) 746.291 → 746.299 → 750.770
agent_rpm_amd64_fips +8.54 KiB (0.00% increase, -3.38% of buffer) 704.053 → 704.062 → 704.300
agent_rpm_arm64 +4.07 KiB (0.00% increase, -0.60% of buffer) 723.837 → 723.841 → 724.500
agent_rpm_arm64_fips +8.07 KiB (0.00% increase, -7.51% of buffer) 684.765 → 684.773 → 684.870
agent_suse_amd64 +8.54 KiB (0.00% increase, -0.19% of buffer) 746.291 → 746.299 → 750.770
agent_suse_amd64_fips +8.54 KiB (0.00% increase, -3.38% of buffer) 704.053 → 704.062 → 704.300
agent_suse_arm64 +4.07 KiB (0.00% increase, -0.60% of buffer) 723.837 → 723.841 → 724.500
agent_suse_arm64_fips +8.07 KiB (0.00% increase, -7.51% of buffer) 684.765 → 684.773 → 684.870
docker_agent_amd64 +10.29 KiB (0.00% increase, -2.54% of buffer) 806.444 → 806.454 → 806.840
docker_agent_arm64 +5.84 KiB (0.00% increase, -0.44% of buffer) 808.815 → 808.821 → 810.120
docker_agent_jmx_amd64 +31.67 KiB (0.00% increase, -13.07% of buffer) 997.363 → 997.394 → 997.600
docker_agent_jmx_arm64 -102.03 KiB (0.01% reduction, +7.74% of buffer) 988.513 → 988.414 → 989.800
docker_cluster_agent_amd64 +4.16 KiB (0.00% increase, -0.68% of buffer) 207.109 → 207.113 → 207.710
docker_host_profiler_amd64 +3.88 KiB (0.00% increase, -0.03% of buffer) 302.131 → 302.135 → 315.880
docker_host_profiler_arm64 -60.63 KiB (0.02% reduction, +0.43% of buffer) 313.655 → 313.596 → 327.470
dogstatsd_deb_arm64 +4.0 KiB (0.01% increase, -0.39% of buffer) 28.296 → 28.300 → 29.290
iot_agent_deb_amd64 +4.0 KiB (0.01% increase, -0.52% of buffer) 44.472 → 44.476 → 45.230
iot_agent_deb_arm64 +4.0 KiB (0.01% increase, -0.29% of buffer) 41.437 → 41.441 → 42.800
iot_agent_deb_armhf +4.0 KiB (0.01% increase, -0.48% of buffer) 42.150 → 42.154 → 42.960
iot_agent_rpm_amd64 +4.0 KiB (0.01% increase, -0.52% of buffer) 44.473 → 44.477 → 45.230
iot_agent_suse_amd64 +4.0 KiB (0.01% increase, -0.52% of buffer) 44.473 → 44.477 → 45.230
9 successful checks with minimal change (< 2 KiB)
Quality gate Current Size
agent_heroku_amd64 310.774 MiB
docker_cluster_agent_arm64 221.078 MiB
docker_cws_instrumentation_amd64 7.154 MiB
docker_cws_instrumentation_arm64 6.689 MiB
docker_dogstatsd_amd64 39.519 MiB
docker_dogstatsd_arm64 37.690 MiB
dogstatsd_deb_amd64 30.174 MiB
dogstatsd_rpm_amd64 30.174 MiB
dogstatsd_suse_amd64 30.174 MiB

@apiarian-datadog apiarian-datadog force-pushed the aleksandr.pasechnik/svls-8857-split branch from 2ef722b to f7f03a1 Compare May 1, 2026 15:31
@cit-pr-commenter-54b7da

cit-pr-commenter-54b7da Bot commented May 1, 2026

Copy link
Copy Markdown

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 168fa06f-bdba-4cb8-ad61-8e36eab80004

Baseline: 205dba5
Comparison: f0e625f
Diff

Optimization Goals: ✅ No significant changes detected

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf experiment goal Δ mean % Δ mean % CI trials links
docker_containers_cpu % cpu utilization +1.97 [-1.01, +4.95] 1 Logs

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI trials links
docker_containers_cpu % cpu utilization +1.97 [-1.01, +4.95] 1 Logs
tcp_syslog_to_blackhole ingress throughput +0.62 [+0.41, +0.82] 1 Logs
ddot_metrics memory utilization +0.61 [+0.41, +0.82] 1 Logs
quality_gate_metrics_logs memory utilization +0.34 [+0.10, +0.59] 1 Logs bounds checks dashboard
ddot_metrics_sum_delta memory utilization +0.26 [+0.07, +0.45] 1 Logs
ddot_logs memory utilization +0.25 [+0.19, +0.30] 1 Logs
quality_gate_idle memory utilization +0.20 [+0.15, +0.24] 1 Logs bounds checks dashboard
file_to_blackhole_500ms_latency egress throughput +0.05 [-0.35, +0.45] 1 Logs
uds_dogstatsd_to_api ingress throughput +0.02 [-0.18, +0.23] 1 Logs
quality_gate_idle_all_features memory utilization +0.01 [-0.03, +0.04] 1 Logs bounds checks dashboard
file_to_blackhole_100ms_latency egress throughput +0.00 [-0.14, +0.14] 1 Logs
tcp_dd_logs_filter_exclude ingress throughput -0.00 [-0.10, +0.10] 1 Logs
uds_dogstatsd_to_api_v3 ingress throughput -0.00 [-0.21, +0.20] 1 Logs
file_to_blackhole_0ms_latency egress throughput -0.04 [-0.54, +0.46] 1 Logs
file_to_blackhole_1000ms_latency egress throughput -0.06 [-0.51, +0.39] 1 Logs
ddot_metrics_sum_cumulative memory utilization -0.10 [-0.26, +0.05] 1 Logs
otlp_ingest_metrics memory utilization -0.11 [-0.27, +0.05] 1 Logs
uds_dogstatsd_20mb_12k_contexts_20_senders memory utilization -0.11 [-0.16, -0.06] 1 Logs
docker_containers_memory memory utilization -0.14 [-0.24, -0.04] 1 Logs
file_tree memory utilization -0.15 [-0.20, -0.10] 1 Logs
ddot_metrics_sum_cumulativetodelta_exporter memory utilization -0.23 [-0.47, +0.00] 1 Logs
otlp_ingest_logs memory utilization -0.26 [-0.36, -0.16] 1 Logs
quality_gate_logs % cpu utilization -0.53 [-1.56, +0.49] 1 Logs bounds checks dashboard

Bounds Checks: ✅ Passed

perf experiment bounds_check_name replicates_passed observed_value links
docker_containers_cpu simple_check_run 10/10 697 ≥ 26
docker_containers_memory memory_usage 10/10 245.60MiB ≤ 370MiB
docker_containers_memory simple_check_run 10/10 730 ≥ 26
file_to_blackhole_0ms_latency memory_usage 10/10 0.16GiB ≤ 1.20GiB
file_to_blackhole_0ms_latency missed_bytes 10/10 0B = 0B
file_to_blackhole_1000ms_latency memory_usage 10/10 0.20GiB ≤ 1.20GiB
file_to_blackhole_1000ms_latency missed_bytes 10/10 0B = 0B
file_to_blackhole_100ms_latency memory_usage 10/10 0.17GiB ≤ 1.20GiB
file_to_blackhole_100ms_latency missed_bytes 10/10 0B = 0B
file_to_blackhole_500ms_latency memory_usage 10/10 0.19GiB ≤ 1.20GiB
file_to_blackhole_500ms_latency missed_bytes 10/10 0B = 0B
quality_gate_idle intake_connections 10/10 3 ≤ 4 bounds checks dashboard
quality_gate_idle memory_usage 10/10 146.11MiB ≤ 147MiB bounds checks dashboard
quality_gate_idle total_bytes_received 10/10 743.14KiB ≤ 819.20KiB bounds checks dashboard
quality_gate_idle_all_features intake_connections 10/10 3 ≤ 4 bounds checks dashboard
quality_gate_idle_all_features memory_usage 10/10 484.86MiB ≤ 495MiB bounds checks dashboard
quality_gate_idle_all_features total_bytes_received 10/10 1.13MiB ≤ 1.25MiB bounds checks dashboard
quality_gate_logs intake_connections 10/10 4 ≤ 6 bounds checks dashboard
quality_gate_logs memory_usage 10/10 174.07MiB ≤ 195MiB bounds checks dashboard
quality_gate_logs missed_bytes 10/10 0B = 0B bounds checks dashboard
quality_gate_logs total_bytes_received 10/10 264.29MiB ≤ 292MiB bounds checks dashboard
quality_gate_metrics_logs cpu_usage 10/10 347.65 ≤ 2000 bounds checks dashboard
quality_gate_metrics_logs intake_connections 10/10 4 ≤ 6 bounds checks dashboard
quality_gate_metrics_logs memory_usage 10/10 374.71MiB ≤ 430MiB bounds checks dashboard
quality_gate_metrics_logs missed_bytes 10/10 0B = 0B bounds checks dashboard
quality_gate_metrics_logs total_bytes_received 10/10 0.94GiB ≤ 1.04GiB bounds checks dashboard

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".

CI Pass/Fail Decision

Passed. All Quality Gates passed.

  • quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
  • quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_idle, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
  • quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.

@apiarian-datadog

Copy link
Copy Markdown
Contributor Author

@apiarian-datadog apiarian-datadog force-pushed the aleksandr.pasechnik/svls-8857-split branch from f7f03a1 to 8ba3f54 Compare May 8, 2026 16:32
@apiarian-datadog apiarian-datadog force-pushed the aleksandr.pasechnik/svls-8857-split branch from 95d2cbf to e10234b Compare May 20, 2026 16:36
@github-actions

Copy link
Copy Markdown
Contributor

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e10234b431

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread cmd/serverless-init/main.go
@apiarian-datadog apiarian-datadog force-pushed the aleksandr.pasechnik/svls-8857-split branch from e10234b to 25b8435 Compare May 20, 2026 21:19
@github-actions

Copy link
Copy Markdown
Contributor

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 25b8435fbe

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread pkg/serverless/metrics/metric.go Outdated
@github-actions

Copy link
Copy Markdown
Contributor

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f7ac73cf5c

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread cmd/serverless-init/main.go
Comment thread cmd/serverless-init/main.go Outdated
@apiarian-datadog apiarian-datadog force-pushed the aleksandr.pasechnik/svls-8857-split branch from f7ac73c to 200ca54 Compare May 21, 2026 18:12
@github-actions

Copy link
Copy Markdown
Contributor

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 200ca54c1f

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread cmd/serverless-init/main.go Outdated
Comment thread cmd/serverless-init/main.go
@apiarian-datadog apiarian-datadog force-pushed the aleksandr.pasechnik/svls-8857-split branch from 200ca54 to 1240ba6 Compare May 21, 2026 21:30
@github-actions

Copy link
Copy Markdown
Contributor

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1240ba6b81

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread cmd/serverless-init/main.go Outdated
@apiarian-datadog apiarian-datadog force-pushed the aleksandr.pasechnik/svls-8857-split branch from 1240ba6 to 242cc30 Compare May 22, 2026 15:55
@github-actions

Copy link
Copy Markdown
Contributor

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 242cc305e0

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread cmd/serverless-init/main.go
@apiarian-datadog apiarian-datadog marked this pull request as ready for review May 28, 2026 18:28
@apiarian-datadog apiarian-datadog requested review from a team as code owners May 28, 2026 18:28
@apiarian-datadog apiarian-datadog added team/serverless-azure-gcp qa/done QA done before merge and regressions are covered by tests labels May 28, 2026
@apiarian-datadog apiarian-datadog requested a review from Lewis-E May 28, 2026 18:33

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4bbbe356a8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

eventplatformfx.Module(eventplatform.NewDisabledParams()),
eventplatformreceiverimpl.Module(),
haagentfx.Module(),
forwarder.Bundle(defaultforwarder.NewParams()),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Disable forwarder health checks when the API key is absent

In the supported DD_API_KEY-unset mode, this new unconditional forwarder bundle is still constructed and started before setup() returns the no-op trace/metrics path. defaultforwarder.Start() starts its API-key health checker unless WithDisableAPIKeyChecking is used, and utils.GetMultipleEndpoints still builds a resolver containing the empty top-level api_key, so serverless-init now makes validation attempts and logs “No valid api key” errors even though the old flow returned before creating the metric forwarder. This regresses the process/logs-only scenario that the no-key branch is meant to keep quiet; gate the forwarder too or disable API-key checking when the early sanitized key is empty.

Useful? React with 👍 / 👎.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4bbbe356a8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

eventplatformfx.Module(eventplatform.NewDisabledParams()),
eventplatformreceiverimpl.Module(),
haagentfx.Module(),
forwarder.Bundle(defaultforwarder.NewParams()),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Disable forwarder health checks when the API key is absent

In the supported DD_API_KEY-unset mode, this new unconditional forwarder bundle is still constructed and started before setup() returns the no-op trace/metrics path. defaultforwarder.Start() starts its API-key health checker unless WithDisableAPIKeyChecking is used, and utils.GetMultipleEndpoints still builds a resolver containing the empty top-level api_key, so serverless-init now makes validation attempts and logs “No valid api key” errors even though the old flow returned before creating the metric forwarder. This regresses the process/logs-only scenario that the no-key branch is meant to keep quiet; gate the forwarder too or disable API-key checking when the early sanitized key is empty.

Useful? React with 👍 / 👎.

Comment thread comp/forwarder/defaultforwarder/worker.go

@vickenty vickenty left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to split the forwarder changes into a separate PR?

Comment thread comp/dogstatsd/server/impl/server.go Outdated
// guards against a racing Stop closing the worker loop between our snapshot
// and the unbuffered send: if the server is stopping, workers exit on
// <-stopChan and would never receive on flushChan, so we abandon the send
// and break out of the loop early.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why worker can't flush itself in the stopChan branch when on serverless?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because we need to not only flush the worker but also ensure that those flushed samples actually get through aggregation and out of the serializer. so we need to trigger the flush on the worker and then force the flush on the serializer.

also wouldn't we end up changing the stop semantics for every dogstatsd user if we did this? wanted to limit our changes to serverless environments.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because we need to not only flush the worker but also ensure that those flushed samples actually get through aggregation and out of the serializer. so we need to trigger the flush on the worker and then force the flush on the serializer.

The rest of the components will follow the same pattern: if needed, make sure metrics make out on stop. Aggregator I believe already has an option to do this, and you are adding it to the forwarder. Fx will ensure for us that we're flushing in the correct order as well.

also wouldn't we end up changing the stop semantics for every dogstatsd user if we did this? wanted to limit our changes to serverless environments.

It would need to be behind a flag, similar to how you did this in the forwarder.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, i'll try to refactor it this way

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, done. lots of code changed, and arranged in a way that should let me split out the main metrics agent changes into its own pr.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per our discussion on #51565, i'll hold off on further work on this PR until your upstream refactoring is done.

Comment thread comp/dogstatsd/server/impl/server.go Outdated
@apiarian-datadog apiarian-datadog force-pushed the aleksandr.pasechnik/svls-8857-split branch from 4bbbe35 to 81a94bc Compare May 29, 2026 20:07
apiarian-datadog and others added 8 commits June 1, 2026 10:57
Add a forwarder_stop_wait_for_inflight option so Worker.Stop can wait for
in-flight HTTP transactions to finish before returning instead of cancelling
them. serverless-init opts in so the final flush is never aborted by Stop.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Give each worker its own flushChan and add a dogstatsd_flush_on_stop option
so the server drains every worker's batched samples into the time sampler
during stop(), before the worker loops are torn down. This replaces the
explicit ServerlessFlush trigger with a stop-driven drain (the old path is
removed in a later commit).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add an aggregator_drain_samples_on_stop path so AgentDemultiplexer.Stop
drains pending metric samples before the final flush, giving serverless a
flush mechanism on the standard demultiplexer. The legacy ServerlessDemultiplexer
is removed in a later commit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add the boolean settings that let serverless-init opt the aggregator,
forwarder, and dogstatsd server into draining/flushing pending data on
Stop instead of dropping it: aggregator_drain_samples_on_stop,
forwarder_stop_wait_for_inflight, and dogstatsd_flush_on_stop.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Reduce ServerlessMetricAgent to a thin wrapper over the demultiplexer and
tags, switching it onto the stop-driven flush path, and add a metricstest
fx helper plus integration coverage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wire serverless-init to the Fx-managed forwarder, demultiplexer, and
dogstatsd server, and rework the shutdown sequence to drive a proper flush
through them on Stop. Tolerate a nil metric agent during cloudservice
shutdown and parameterize the log flush timeout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Now that serverless-init drives flushing through the dogstatsd server's
stop-driven drain, the demultiplexer's drain-on-stop path, and the
forwarder's wait-for-inflight option, delete the old explicit-flush path:

  - dogstatsd Component.ServerlessFlush and its serverlessFlushChan
  - the standalone NewServerlessServer (server/impl/serverless.go)
  - aggregator's ServerlessDemultiplexer (demultiplexer_serverless.go)
  - forwarder SyncForwarder (sync_forwarder.go)
  - pkg/serverless legacy plumbing (serverless.go)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@apiarian-datadog apiarian-datadog marked this pull request as draft June 1, 2026 15:40
@apiarian-datadog apiarian-datadog force-pushed the aleksandr.pasechnik/svls-8857-split branch from 81a94bc to f0e625f Compare June 1, 2026 15:41
@github-actions

github-actions Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f0e625fab1

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

if s.flushOnStop {
s.log.Debug("dogstatsd_flush_on_stop: draining worker batchers before stop")
for _, w := range s.workers {
w.flushChan <- struct{}{}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Block until worker batcher flushes finish

When dogstatsd_flush_on_stop is enabled for serverless shutdown, this unbuffered send only waits until the worker receives the signal, not until batcher.flush() has actually enqueued its samples into the demux. Fx then proceeds to AgentDemultiplexer.Stop, whose new drain only looks at already-enqueued sampler channels; if it observes them before the worker finishes flushing, the final serializer flush can run and miss the worker's last buffered DogStatsD metrics/events/service checks. This affects shutdown paths where workers still have batched data, so the stop path needs an acknowledgement after batcher.flush() completes before demux shutdown starts.

Useful? React with 👍 / 👎.

@vickenty

vickenty commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

PR to drain aggregator inflight samples: #51866

Also, since aggregator feature is already called dogstatsd_flush_incomplete_buckets, we can use the same setting to trigger flush in the dogstatsd server, I think the intent is the same for both.

@apiarian-datadog

Copy link
Copy Markdown
Contributor Author

closing in favor of #52267

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

internal Identify a non-fork PR long review PR is complex, plan time to review it qa/done QA done before merge and regressions are covered by tests team/agent-build team/agent-configuration team/agent-metric-pipelines team/serverless-azure-gcp

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants