
High (~40%) CPU overhead? #2425

@jaihindhreddy

Description


I read the website's page on performance overhead, which describes how much CPU beyla itself consumes at various configurations, but it does not discuss how the CPU usage of a process changes when that process is instrumented with beyla. It also does not cover what happens in a Kubernetes cluster on modern public clouds.

To get a sense of those numbers, I load tested a simple Go HTTP server in an EKS cluster, and found that with beyla deployed as a sidecar in its default configuration, the server container used ~40% more CPU (it went from ~134m to ~192m). beyla itself used only 4 millicores, but the CPU usage of my server process rose by that much.

Is this the typically expected overhead on Go servers?

Testing methodology

I wrote a couple of very simple services in Golang:

  1. links: a simple URL shortening service, storing data in postgresql.
  2. todomvc: a backend for a todo-item application, also storing data in postgresql. This calls links to shorten any links present in todo-item content, before storing it in postgres.

To remove any impact of postgresql latency, I then wrote mock drivers for both services that implement the functionality in memory and sleep for a millisecond or two, depending on the query.
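The mock drivers followed roughly this shape (a sketch only; the interface, names, and delays are illustrative, not the actual driver code):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Store is a hypothetical stand-in for the services' persistence layer.
type Store interface {
	Put(key, value string)
	Get(key string) (string, bool)
}

// mockStore keeps data in a map and sleeps to emulate postgres latency.
type mockStore struct {
	mu   sync.RWMutex
	data map[string]string
}

func newMockStore() *mockStore {
	return &mockStore{data: map[string]string{}}
}

func (s *mockStore) Put(key, value string) {
	time.Sleep(2 * time.Millisecond) // writes modeled as slightly slower than reads
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = value
}

func (s *mockStore) Get(key string) (string, bool) {
	time.Sleep(1 * time.Millisecond)
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.data[key]
	return v, ok
}

func main() {
	s := newMockStore()
	s.Put("abc123", "https://example.com")
	v, _ := s.Get("abc123")
	fmt.Println(v) // prints "https://example.com"
}
```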

I deployed all of these to an EKS cluster on an isolated node with nothing else running on it, and gave everything a generous CPU limit to prevent throttling.
I used vegeta to load-test the create API of the todomvc service at 500 rps in an open-loop fashion. vegeta itself ran on another node (to reduce interference), with prometheus scraping latency numbers from it. kube-state-metrics was the source for CPU usage, also queried via prometheus.

Finally, I repeated this load-test with beyla (version 2.8.5) attached with the default configuration, which is the following:

shutdown_timeout: 2s
enforce_sys_caps: true
otel_metrics_export:
  endpoint: http://otel-collector:4318/v1/metrics
  protocol: http/protobuf
  insecure_skip_verify: true
otel_traces_export:
  endpoint: http://otel-collector:4318/v1/traces
  protocol: http/protobuf
  insecure_skip_verify: true

I used BEYLA_OPEN_PORT to make the beyla sidecar instrument only the pod in question. I locked both services to one pod, and initially turned on debug logs to verify that each process was being instrumented by exactly one beyla process, then turned debug logging off again to avoid logging overhead.
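Concretely, the scoping looked roughly like this in the pod spec (container names, image tag, and port are illustrative; BEYLA_OPEN_PORT is the documented environment variable):

```yaml
containers:
  - name: todomvc
    ports:
      - containerPort: 8080
  - name: beyla
    image: grafana/beyla:2.8.5
    env:
      - name: BEYLA_OPEN_PORT
        value: "8080"   # instrument only the process listening on this port
```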

I repeated the test with actual postgresql as well (without the mock drivers), and tried various values for ebpf.wakeup_len as mentioned in the perf-tuning docs. This barely changed the measured CPU overhead on the instrumented process.
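For concreteness, those values were set under the ebpf section of the config above, e.g.:

```yaml
ebpf:
  wakeup_len: 16   # batch this many ringbuffer events before waking userspace (value illustrative)
```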

Finally, this was on a c7g.2xlarge, which is an ARM-based instance offered by AWS. I repeated the test with a c5.2xlarge, which is x86_64, and this also resulted in a similarly high overhead.

Happy to share more information if that helps figure out what is going on.

Edit: The latency increase in all configurations I tested was not even noticeable at the histogram resolution I used.
