
High (~40%) CPU overhead? #2425

@jaihindhreddy

Description


I read the website's page on performance overhead, which describes how much CPU beyla itself consumes at various configurations, but it does not discuss how the CPU usage of a process changes when that process is instrumented with beyla. It also does not cover what happens in a Kubernetes cluster on modern public clouds.

To get a sense of those numbers, I load tested a simple Go HTTP server in an EKS cluster, and found that with beyla deployed as a sidecar in its default configuration, the server container used ~40% more CPU (it went from ~134m to ~192m). beyla itself used only 4 millicores, but the CPU usage of my server process rose by that much.

Is this the typically expected overhead on Go servers?

Testing methodology

I wrote a couple of very simple services in Golang:

  1. links: a simple URL shortening service, storing data in postgresql.
  2. todomvc: a backend for a todo-item application, also storing data in postgresql. This calls links to shorten any links present in todo-item content, before storing it in postgres.

To remove any impact of postgresql latency, I then wrote mock drivers for both services that implement the functionality in memory and sleep for a millisecond or two, depending on the query.
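The mock drivers followed roughly this shape (a sketch only; the interface, names, and delays are illustrative, not the actual driver code):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Store is a hypothetical stand-in for the services' persistence layer.
type Store interface {
	Put(key, value string)
	Get(key string) (string, bool)
}

// mockStore keeps data in a map and sleeps to emulate postgres latency.
type mockStore struct {
	mu   sync.RWMutex
	data map[string]string
}

func newMockStore() *mockStore {
	return &mockStore{data: map[string]string{}}
}

func (s *mockStore) Put(key, value string) {
	time.Sleep(2 * time.Millisecond) // writes modeled as slightly slower than reads
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = value
}

func (s *mockStore) Get(key string) (string, bool) {
	time.Sleep(1 * time.Millisecond)
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.data[key]
	return v, ok
}

func main() {
	s := newMockStore()
	s.Put("abc123", "https://example.com")
	v, _ := s.Get("abc123")
	fmt.Println(v) // prints "https://example.com"
}
```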

I deployed all of these to an EKS cluster on an isolated node with nothing else running on it, and gave everything a generous CPU limit to prevent throttling.
I used vegeta to load-test the create API of the todomvc service at 500 rps in an open-loop fashion. vegeta itself ran on another node (to reduce interference), with prometheus scraping latency numbers from it. kube-state-metrics was the source for CPU usage, also queried via prometheus.

Finally, I repeated this load-test with beyla (version 2.8.5) attached with the default configuration, which is the following:

shutdown_timeout: 2s
enforce_sys_caps: true
otel_metrics_export:
  endpoint: http://otel-collector:4318/v1/metrics
  protocol: http/protobuf
  insecure_skip_verify: true
otel_traces_export:
  endpoint: http://otel-collector:4318/v1/traces
  protocol: http/protobuf
  insecure_skip_verify: true

I used BEYLA_OPEN_PORT to make the beyla sidecar instrument only the pod in question. I locked both services to one pod, and initially turned on debug logs to verify that each process was being instrumented by exactly one beyla process, then turned debug logging off again to avoid logging overhead.
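Concretely, the scoping looked roughly like this in the pod spec (container names, image tag, and port are illustrative; BEYLA_OPEN_PORT is the documented environment variable):

```yaml
containers:
  - name: todomvc
    ports:
      - containerPort: 8080
  - name: beyla
    image: grafana/beyla:2.8.5
    env:
      - name: BEYLA_OPEN_PORT
        value: "8080"   # instrument only the process listening on this port
```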

I repeated the test with actual postgresql as well (without the mock drivers), and tried various values for ebpf.wakeup_len as mentioned in the perf-tuning docs. This barely changed the measured CPU overhead on the instrumented process.
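For concreteness, those values were set under the ebpf section of the config above, e.g.:

```yaml
ebpf:
  wakeup_len: 16   # batch this many ringbuffer events before waking userspace (value illustrative)
```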

Finally, this was on a c7g.2xlarge, which is an ARM-based instance offered by AWS. I repeated the test with a c5.2xlarge, which is x86_64, and this also resulted in a similarly high overhead.

Happy to share more information if that helps figure out what is going on.

Edit: The latency increase in all configurations I tested was not even noticeable at the histogram resolution I used.
