
TelemetryPolicy proposal#69

Open
gkhom wants to merge 21 commits into kubernetes-sigs:main from gkhom:main

Conversation

@gkhom

@gkhom gkhom commented Feb 9, 2026

What type of PR is this?

/kind documentation

What this PR does / why we need it:

This PR contains a proposal for a new TelemetryPolicy API. This K8s API aims to standardize how users enable and configure telemetry (metrics, logs, traces) across different data plane implementations, replacing vendor-specific CRDs with a unified, portable spec.

Which issue(s) this PR fixes:

Fixes #

Does this PR introduce a user-facing change?:

* TelemetryPolicy specification

gkhom added 2 commits February 9, 2026 14:58
This change includes context, problem description, and design objectives for a TelemetryPolicy proposal. If the community agrees on this context, then I will follow up with the actual API specification.
@k8s-ci-robot k8s-ci-robot added the kind/documentation Categorizes issue or PR as related to documentation. label Feb 9, 2026
@netlify

netlify bot commented Feb 9, 2026

Deploy Preview for kube-agentic-networking ready!

| Name | Link |
|------|------|
| 🔨 Latest commit | 8600d0b |
| 🔍 Latest deploy log | https://app.netlify.com/projects/kube-agentic-networking/deploys/69cdd554ee5388000880923d |
| 😎 Deploy Preview | https://deploy-preview-69--kube-agentic-networking.netlify.app |

@linux-foundation-easycla

linux-foundation-easycla bot commented Feb 9, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot
Contributor

Welcome @gkhom!

It looks like this is your first PR to kubernetes-sigs/kube-agentic-networking 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/kube-agentic-networking has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 9, 2026
@k8s-ci-robot
Contributor

Hi @gkhom. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Feb 9, 2026
@haiyanmeng
Contributor

CLA Not Signed

@gkhom , can you fix this?

@haiyanmeng
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 10, 2026
@LiorLieberman
Member

/easycla

@LiorLieberman
Member

/check-cla

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Feb 10, 2026
Contributor

@rubambiza rubambiza left a comment

@gkhom Thanks for kicking this off. I left some clarifying questions and proposing a minor change. The bigger part of the review is to surface work already being proposed/done in the llm-d community with regard to tracing and whether it is applicable to our objectives.

This proposal introduces the `TelemetryPolicy`, a direct policy attachment designed to configure observability signals (metrics, logs, traces)
for Gateway API resources (via `Gateway` attachment) and Service Mesh resources (via `namespace` attachment).

This K8s API standardizes how users enable and configure telemetry across different data plane implementations, replacing vendor-specific CRDs
Contributor

I am (acting as) a naive reader, and I was immediately curious what some examples of these vendor-specific CRDs are. This also ties to and might clarify the below mention of "Observability lock-in".

Author

Examples of such CRDs are:

  • Istio's Telemetry CRD
  • Envoy Gateway's EnvoyProxy and EnvoyGateway CRDs
  • Kong's MeshMetrics/MeshTrace/MeshAccessLog
  • Kuadrant's TelemetryPolicy

I intend to write a section that compares such existing APIs and the proposed TelemetryPolicy in the eventual proposal.

Contributor

Seeing as there's a mix of examples here, will the scope cover one resource for all of the signals (metrics, logs, traces) vs. separate ones? Are there tradeoffs to consider here?

Author

I'm leaning towards one resource for all. The argument for splitting them might be that different personas are involved in configuring the different aspects of observability. In practice, I think that the persona that configures metrics likely also configures tracing and access logs. So to avoid complicating the API with three additional resources, it seems worthwhile to put all of it in a single resource.
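For illustration, a single resource covering all three signals might look roughly like this (a hypothetical sketch; the group/version and field names are placeholders, not the proposed spec):

```yaml
apiVersion: gateway.networking.k8s.io/v1alpha1  # placeholder group/version
kind: TelemetryPolicy
metadata:
  name: all-signals
  namespace: prod-ns
spec:
  targetRef:                 # GEP-713 attachment
    group: gateway.networking.k8s.io
    kind: Gateway
    name: prod-gateway
  metrics:                   # all three signals in one resource
    enabled: true
  tracing:
    enabled: true
    samplingRate: 10         # percent; illustrative knob
  accessLogging:
    enabled: true
    format: JSON
```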

* **Cost is Volatile**: Usage is measured in tokens, not just requests. A single HTTP 200 OK could cost $0.01 or $10.00 depending on the prompt and model used.
* **Context is King**: Debugging requires knowing the semantic context: Which Model? Which Prompt? Which tool?

Existing telemetry policies are unaware of the Generative AI semantic conventions. They see an opaque TCP stream or HTTP POST. Without a standardized API to
Contributor

In line with the header, may I suggest adding "unaware of the emerging Generative AI semantic conventions"?

Author

Good point, will update.


1. **Standardization**: A single API for Gateway and Mesh to configure Access Logging, Metrics generation, and Tracing propagation.
2. **GEP-713 Compliance**: Support `targetRef` attachment to `Gateway` and `Namespace`. The latter covers Mesh use-cases.
3. **Agentic Support**: Enable the capture of OpenTelemetry GenAI Semantic Conventions and support the requirements of PR #33.
Contributor

In the spirit of standardization and not reinventing the wheel, I wanted to mention that the llm-d community is already moving on tracing + OTel + GenAI Semantic Conventions. In particular, Sally O'Malley from Red Hat proposed and did a POC for distributed tracing in llm-d. [aside: I learned about this work from Sally on another community call for our kagenti project]

This may be applicable here for a few reasons:

  • We are keen on integrating OTel and GenAI semantic conventions, too
  • One of our objectives is a single API for Gateways and Meshes, and Sally's POC has already landed some changes to support tracing to the Gateway API Inference Extension (GAIE) components like the endpoint pickers (proposal comment, GAIE PR).

While llm-d is focused on distributed LLM inferencing regardless of source (i.e., user chat -> LLM vs agent -> LLM), I think it's worth considering any lessons they may have already encountered and API definitions that could overlap with our case, at the very least at the Gateway level. I'd be willing to evangelize our thinking to Sally to get her thoughts, but more importantly curious on our interest level.

Author

It would certainly be valuable to get some of their insights and experiences. The proposal seems to cover configuration through environment variables, have they defined CRDs as well?

Contributor

No, I did not see any CRD definitions. I'll keep this thread in mind as the definitions become more concrete.

for Gateway API resources (via `Gateway` attachment) and Service Mesh resources (via `namespace` attachment).

This K8s API standardizes how users enable and configure telemetry across different data plane implementations, replacing vendor-specific CRDs
with a unified, portable spec.
Contributor

Would you see implementations reconciling the TelemetryPolicy and reading the bits that are relevant to their components? So multiple controllers read the CR and take actions to enable telemetry across the components they are controlling?

Author

It is indeed possible to distribute the responsibility across multiple controllers; it's up to the implementation. In most cases I'm familiar with, a single controller/control plane programs all three observability features (metrics, traces, logs).

Member

Possible, but a little challenging. What are the cases where we see multiple implementations reconcile the same thing?

@david-martin
Contributor

I'm generally in favour of this proposal.
/approve

Perhaps more relevant when it comes to the specification, it would be good to know more about the current 'state of the art' in this space.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: david-martin, gkhom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 17, 2026
@k8s-ci-robot k8s-ci-robot removed the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 17, 2026
@haiyanmeng
Contributor

If the community agrees on this context then I will follow up with the actual API specification.

@gkhom , the PR description needs to be updated since the API specification is included in the PR now.

Contributor

@evaline-ju evaline-ju left a comment

Generally looking good to me - a couple lingering questions

```yaml
  namespace: prod-ns
spec:
  # GEP-713 Attachment
  targetRef:
```
Contributor

Looking at the GEP-713 description, will we allow multiple attachments (targetRefs)? Perhaps with the "what namespace targets?" comment below, it'd be good to understand the precedence for resolving multiple policies (gateway vs. namespace policy) more clearly. Are they to be non-overlapping?

Similarly, I assume policy status will be included?

Author

The intent is indeed to allow multiple attachments. A single TelemetryPolicy can be used to configure for multiple namespaces and/or gateways. I will fix the mistake in the example to state targetRefs.

Regarding multiple policies, I was thinking that only a single TelemetryPolicy is allowed to target a specific resource. A TelemetryPolicy that targets a namespace or Gateway that is already targeted by another TelemetryPolicy should be rejected.

Regarding status, it will be included. Does it need to be mentioned explicitly in the spec?
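Sketched as YAML, the plural targetRefs attachment might look like this (illustrative only; field names follow GEP-713 conventions but are not final):

```yaml
spec:
  targetRefs:                       # plural: one policy, several targets
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: prod-gateway
  - group: ""                       # core API group for Namespace
    kind: Namespace
    name: prod-ns
```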

Contributor

I personally was not sure if status was going to be unique enough to call out here, but it seems @guicassolato also called out the status stanza in #69 (comment), so including it explicitly will likely avoid further confusion.

Contributor

@guicassolato guicassolato left a comment

I think overall this is a strong proposal with a compelling use case and well-designed API.

My main point of concern is whether it has enough to make it specific to the agentic use case, or could otherwise better serve upstream Gateway API as a whole.

Following [GEP-713](https://gateway-api.sigs.k8s.io/geps/gep-713/), the `TelemetryPolicy` supports the following attachments:

1. **Gateway (Instance Scope)**: Configures the telemetry for a specific `Gateway`.
2. **Namespace (Mesh Scope)**: Configures the telemetry for all mesh proxies (sidecar proxy / node proxy / etc.) in that namespace.
Contributor

Defining what namespace targeting means and even if that definition points to mesh use case only is fine. But it has to be more than implied IMO. It has to be by design and well specified/documented, so all implementations of the API will commit to the same meaning and behaviour.

(I believe that's what @gkhom has in mind, but good to spell it out, I think.)

On a side note, if namespace targeting is for the mesh use case, have you considered the Mesh kind (and avoid any possible confusion altogether)?

```yaml
provider:
  type: Prometheus
overrides:
- name: "gateway.networking.k8s.io/http/request_count"
```
Contributor

Are these metric names standardised somehow or the alignment between who owns the policy and who consumes the metrics is something we expect to happen but out of scope of the proposal?

Author

This specific metric name is just an arbitrary example. The closest thing to a standard for metric names would be OTel's semantic conventions. The alignment between policy owner and metric consumer is indeed out-of-scope (but I'm happy to try to include it if needed).


nitpick, but can you use an obviously dummy name to avoid making it seem like it's a real proposal? Something like example.com/http/request_count would be fine.
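With an obviously dummy name, the override entry from the hunk above would read (illustrative; same field names as the excerpt):

```yaml
overrides:
- name: "example.com/http/request_count"  # obviously dummy metric name
```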


```go
type TelemetryPolicySpec struct {
	// Identifies the target resources (Gateway or Namespace) to which this policy attaches (GEP-713).
	TargetRefs []TargetReference `json:"targetRefs"`
```
Contributor

We could probably reuse here one of the Gateway API types from https://github.com/kubernetes-sigs/gateway-api/blob/main/apis/v1/policy_types.go.

I wonder, for example, if namespace targeting should be allowed only in the same namespace (LocalPolicyTargetReference) or "at distance" leveraging ReferenceGrants.

Author

I'm leaning towards NamespacedPolicyTargetReference to allow cross-namespace configuration using ReferenceGrants. This would allow central management of uniform telemetry configuration.


Allowing cross-namespace references entirely violates namespace boundaries. Namespace X should definitely not be able to modify namespace Y's configuration.

If we want uniform management then IMO the correct way to do this would be to modify a global object, probably GatewayClass / Mesh. Note this is still a bit awkward since you need a namespaced-resource to modify a cluster-scoped resource, so would need some additional layer of policy like only allowing it from some trusted admin namespace (Istio has this concept as "root namespace").

This hasn't been done in the Gateway API space so would be novel

Member

I think we should definitely not start with allowing cross-ns.

Regarding if and how to allow it, I also usually don't like this idea of controlling other telemetry endpoints from another namespace. We could adopt either
(a) a "magic namespace" approach like many other implementations are doing, where policies in this namespace have global scope, OR
(b) a ClusterTelemetryPolicy resource that has global scope
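Option (b) might be sketched as follows (purely hypothetical; neither this kind nor its fields exist in the current proposal):

```yaml
apiVersion: gateway.networking.k8s.io/v1alpha1  # placeholder group/version
kind: ClusterTelemetryPolicy                    # hypothetical cluster-scoped variant
metadata:
  name: default-telemetry                       # no namespace: cluster-scoped
spec:
  tracing:
    enabled: true
```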

```yaml
format: JSON
matches: # Conditional logging
- path: "/api/v1/sensitive"
- cel: "response.code >= 500" # CEL-based filtering for errors
```

Author

@gkhom gkhom Mar 12, 2026

Good question, those should be included but we will likely need more. For example, a response object. I'll add a section to make it explicit.

@kyessenov kyessenov Mar 18, 2026

CEL is great for the long tail, but it's too slow and obscure as the primary method. It'll add 3-5x overhead to basic matching.


## The TelemetryPolicy Specification

We propose the `TelemetryPolicy` as a direct policy attachment in the `gateway.networking.k8s.io` API group. See [GEP-713](https://gateway-api.sigs.k8s.io/geps/gep-713/#classes-of-policies) for more information on Direct attachment.
Contributor

Should it belong to the agentic API group or maybe start as such while it's proposed in the scope of the subproject?

Author

Makes sense, will update.

Comment on lines +9 to +10
This proposal introduces the `TelemetryPolicy`, a direct policy attachment designed to configure observability signals (metrics, logs, traces)
for Gateway API resources (via `Gateway` attachment) and Service Mesh resources (via `namespace` attachment).
Contributor

I think it's OK to call it a "Direct Policy" while Gateway and Namespace as supported target kinds are for two completely disjoint use cases – ingress and mesh.

By definition, it's only direct when:

A single kind supported in spec.targetRefs.kind

Author

Would it be more accurate to call it "inherited policy"?

```go
}

type TracingProvider struct {
	Type string `json:"type"`
```
Contributor

Would this be an enum or completely free for the implementations to define? Maybe some Core and Extended levels?

Following [GEP-713](https://gateway-api.sigs.k8s.io/geps/gep-713/), the `TelemetryPolicy` supports the following attachments:

1. **Gateway (Instance Scope)**: Configures the telemetry for a specific `Gateway`.
2. **Namespace (Mesh Scope)**: Configures the telemetry for all mesh proxies (sidecar proxy / node proxy / etc.) in that namespace.
Contributor

Although implied when calling it "direct policy", I think it may useful to add a note about the merge semantics, which I imagine will be the None one. I.e., describe the behaviour of what happens when 2 policy resources of this kind target the same object (same gateway or same namespace).

```go
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec TelemetryPolicySpec `json:"spec"`
}
```

Author

Thanks, I've added the status stanza.
Is the detailed explanation/comment in the *PolicyStatus Go struct required in every proposal?

```yaml
type: Counter
dimensions: # Custom labels/dimensions
- key: "model_id"
  fromHeader: "x-model-id" # Crucial for Agentic workloads
```
Contributor

Any other possible sources in mind? E.g. fromMetadata?

Author

Indeed, fromMetadata is an example. Potentially an advanced fromPayload might be worth considering in the future.
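A dimensions list combining the sources mentioned here might look like this (illustrative; fromMetadata and its key are assumptions, not spec):

```yaml
dimensions:
- key: "model_id"
  fromHeader: "x-model-id"            # from the excerpt above
- key: "mesh_cluster"
  fromMetadata: "istio.cluster_id"    # hypothetical proxy-metadata key
# a future fromPayload source could extract fields from the request body
```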

@howardjohn howardjohn left a comment

I really think we should not do this.

  1. I don't think "Kubernetes Agentic Networking" is an appropriate venue to define a general observability API. This has absolutely nothing to do with agentic networking specifically, and it feels like this is being used to "smuggle in" a standard through a low-barrier project so that it can gain momentum and face less scrutiny (just to be 100% clear: I am not accusing you of intentionally doing this, just that I think this is the likely result in practice).
  2. Otel has proven a few things, IMO: (1) making telemetry standards is really hard. (2) if you don't do it right, you won't make a standard but instead yet-another-non-standard (see all the pre-otel 'standards' that are now obsolete). (3) defining an API like this is hard/not something that can/should be done (hence why there is not only one).
  3. I don't think this is really as portable as it seems on paper.

I'd like to explore a simple alternative: "implementations of agentic networking should implement the OTEL specification around traces, metrics, and logs. Implementations may provide ways to enable or disable these as needed".

I believe this will meet all of the use cases, without the daunting task of defining a new telemetry standard. From a user POV they just want these emitted from their agentic network; we do not need a new CRD to achieve that

```yaml
- attributeName: "env"
  literalValue: "production"

# 2. Metrics Configuration
```

I think it's questionable how portable dynamic configuration of metrics is. I've only seen Envoy do this, and even then it's extremely bug-ridden historically. It's very confusing for users what the semantics are, or should be, to add or remove labels from metrics. Or even adding a metric -- imagine I have a dashboard like request_count/error_count, but I added error_count after request_count, so it's all out of whack...


```go
type MatchCondition struct {
	// Path allows filtering to specific paths.
	Path string `json:"path,omitempty"`
```

why bother having this if we already have CEL and it's trivial to express in CEL?

Member

The way I understand this is that "Path" is a (convenient?) simpler shortcut for just logging vs expressing this in CEL. Probably if we were to leave it we should do it as a oneof field.

But it makes sense to remove and start with an API without it.


CEL is an order of magnitude slower than native code... Even if you use CEL you'd have to add a bunch of variants (query stripped, normalized, escaped, raw path) which will make it more confusing than having a structured condition.


Maybe in some implementations. Should a general-purpose, vendor-agnostic API switch to a suboptimal user experience (let's assume here that CEL is a preferred UX, since if it's a bad UX and bad performance it would obviously be a bad choice) because some implementations do not implement it optimally?

We are handling these at ~native speeds in our implementation.

Even if you use CEL you'd have to add a bunch of variants (query stripped, normalized, escaped, raw path)

This is exactly why I would prefer CEL personally, so we don't have to make our YAML api have 5 variations just for path -- not to mention headers, query params, cookies we will have like 25 fields just to match headers
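To make the trade-off concrete, the same condition in both styles (the CEL variable names are assumptions; no CEL environment has been specified for this API yet):

```yaml
matches:
# structured shortcut: one dedicated field per concept
- path: "/api/v1/sensitive"
# the same condition as a CEL expression
- cel: 'request.path == "/api/v1/sensitive"'
```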


CEL is an interpreted language by design, so unless you're carving out a specific subset (or breaking some semantics) - it cannot match native speeds. In either of those cases, that is not vendor neutral, there's no guarantee that the "fast" subset of CEL is the same across implementations.

YAML gives that for free. Every implementation will be similarly efficient, and all the "meta" tooling will support it without having to compile/run CEL.


FWIW CEL is used throughout the other APIs in this repo already. IMO this project needs to decide on CEL or not, so we don't have to have this debate on each field. It's not good for users to have 50% of fields (that would make sense to use CEL) use CEL and 50% not, just because of who made the API, and it's not great for us to have to debate it on each field usage.

@LiorLieberman @robscott


Sure, as someone who worked on the CEL C++ runtime and made it open source - I'm telling you CEL is not the right paradigm for the "95%" of data path cases. It's too slow, and it prevents meta-tooling (management and control planes) from analyzing the semantics. It makes perfect sense as an "extensible context-aware condition" (e.g. like Wasm) for the remaining 5% of tail-end bespoke cases, but you wouldn't use CEL for label selectors, would you?

* **Custom Attributes**: Enables the extraction of specific headers and proxy metadata into log entries.
* **Sinks**: Defaults to standard container logging (stdout) with extensibility for OTLP or external ports.

## Comparison with Prior Art

N.B that all of these are wrappers around Envoy

Contributor

Good point.
Though the Kuadrant TelemetryPolicy referenced is an API to supplement metrics coming from an internal component (limitador) that envoy filters to, rather than envoy itself.

Member

@howardjohn - can you offer other prior art that does not rely on Envoy? I recall seeing a comment here or in Slack welcoming more help/contributions to the prior art section.


@david-martin
Contributor

I really think we should not do this.

  1. I don't think "Kubernetes Agentic Networking" is an appropriate venue to define a general observability API. This has absolutely nothing to do with agentic networking specifically, and it feels like this is being used to "smuggle in" a standard through a low-barrier project so that it can gain momentum and face less scrutiny (just to be 100% clear: I am not accusing you of intentionally doing this, just that I think this is the likely result in practice).

Yeah, I can see where you are coming from. Others have raised similar concerns about there not being sufficient AI-focused content (e.g. #69 (review)).
I really don't think there's any ill intent or attempt to bypass/smuggle.

  1. Otel has proven three things, IMO: (1) making telemetry standards is really hard. (2) if you don't do it right, you won't make a standard but instead yet-another-non-standard (see all the pre-otel 'standards' that are now obsolete). (3) defining an API like this is hard/not something that can/should be done (hence why there is not only one).

Agreed on 1.
Are there examples of 2 you could share so we can better understand why these have failed or are obsolete?
On 3, I'm still on the optimistic side that something could be done, but that being said, I'm not an implementer of a full gateway solution that would have to agree to and implement a standard API.
Is the proposal too ambitious perhaps in its current form? Could it be slimmed down, but made extensible?
What if we started with just standardising 1 thing, like a spec for enabling/disabling the different signals and where to send them to?

  1. I don't think this is really as portable as it seems on paper.

I'd like to explore a simple alternative: "implementations of agentic networking should implement the OTEL specification around traces, metrics, and logs. Implementations may provide ways to enable or disable these as needed".

Do you see this being a central place to enable/disable or distributed?

@howardjohn

I really don't think there's any ill intent or attempt to bypass/smuggle.

I 100% agree as I tried to convey, I don't think it is intended, but I think it's the accidental result. Which is a common pattern to happen accidentally (I've seen a lot of bad APIs unintentionally brought in because of last-minute release rush, for instance!).

Do you see this being a central place to enable/disable or distributed?

I don't really see the need for a CRD to do this. I struggle to believe that users are going to get more benefit from a CRD controlling a knob than from just --set telemetry.enabled=<true|false> or the equivalent in their implementation install. Vendor-neutral APIs are valuable when they are building blocks for extensibility IMO. That is why HTTPRoute and otel are so powerful -- HTTPRoute is a common foundation for a huge ecosystem to build around, as otel is for telemetry. But just providing a common way to flip a knob is not value for anyone, nor a meaningful extension point.

What if we started with just standardising 1 thing, like a spec for enabling/disabling the different signals and where to send them to?

FWIW otel already has standards for this: https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/#general-sdk-configuration. I 100% agree env vars are an awkward standard in this context though, just something to take into account.
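For reference, the OTel SDK environment-variable spec linked above already covers per-signal enablement, exporter selection, and sampling. A sketch of how those knobs could appear on a gateway container (variable names come from the OTel spec; the container context and all values are illustrative):

```yaml
# Sketch: standard OTel SDK environment variables applied to a hypothetical
# gateway container. Names come from the OTel SDK spec; values are illustrative.
env:
  - name: OTEL_SDK_DISABLED             # global on/off switch
    value: "false"
  - name: OTEL_TRACES_EXPORTER          # per-signal exporter selection
    value: "otlp"
  - name: OTEL_METRICS_EXPORTER
    value: "none"                       # disable a single signal
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector.monitoring.svc:4317"
  - name: OTEL_TRACES_SAMPLER           # sampling strategy
    value: "parentbased_traceidratio"
  - name: OTEL_TRACES_SAMPLER_ARG       # argument to the sampler (ratio here)
    value: "0.01"
```

This illustrates the "awkward in this context" point: the knobs exist, but they configure an SDK inside one workload rather than a fleet of data planes.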


FWIW, I hate to be in the position where I say 'we should not do this'. However, I think few people are willing to do it, since it feels bad and un-collaborative, and it's easier to just debate whether specific aspects of an API are good or bad or should change than to step back, so I wanted to make sure I at least expressed this.

@gkhom
Author

gkhom commented Mar 6, 2026

I don't think "Kubernetes Agentic Networking" is an appropriate venue to define a general observability API. This has absolutely nothing to do with agentic networking specifically, and it feels like this is being used to "smuggle in" a standard through a low-barrier project so that it can gain momentum and face less scrutiny (just to be 100% clear: I am not accusing you of intentionally doing this, just that I think this is the likely result in practice).

As this is my first ever contribution to the CNCF/K8s ecosystem, my intention is definitely not to bypass scrutiny or "smuggle" anything in. I opened this proposal here because I wanted to solve a real problem I've encountered (on several occasions) and this seemed like a reasonable starting point.

if you don't do it right, you won't make a standard but instead yet-another-non-standard (see all the pre-otel 'standards' that are now obsolete).

I definitely want to avoid creating a fragmented or obsolete standard. Could you point me to some of the pre-OTel standardization attempts you mentioned? I'd love to understand where they fell short in defining a configuration API.

defining an API like this is hard/not something that can/should be done

I recognize that standardizing this API carries significant complexity. Could you elaborate on the specific technical hurdles or roadblocks you envision? I'd really appreciate your perspective on what makes this intractable.

I don't think this is really as portable as it seems on paper.

Could you provide some examples or edge cases where you see portability breaking down in this design? I'd like to see if we can address those gaps.

I'd like to explore a simple alternative: "implementations of agentic networking should implement the OTEL specification around traces, metrics, and logs. Implementations may provide ways to enable or disable these as needed".

This would work as a baseline, but my concern is that it will fall short when users need to customize their telemetry (which is fairly common in practice). In absence of a standard, users will inevitably have to resort to vendor-specific APIs for any customization. Because a lot of operational machinery gets built around telemetry pipelines, relying on vendor-specific APIs for customization (re-)introduces the risk of vendor lock-in and becomes a blocker to portability.

I think this is especially critical for agentic networking because it deals with autonomous tool execution. Telemetry here isn't just about troubleshooting and performance. It's a core requirement for operations, security, and auditability. If telemetry configuration is left entirely to vendor-specific implementations, it becomes nearly impossible for platform teams to guarantee a consistent audit trail across heterogeneous environments.

@LiorLieberman
Member

I really think we should not do this.

  1. I don't think "Kubernetes Agentic Networking" is an appropriate venue to define a general observability API. This has absolutely nothing to do with agentic networking specifically, and it feels like this is being used to "smuggle in" a standard through a low-barrier project so that it can gain momentum and face less scrutiny (just to be 100% clear: I am not accusing you of intentionally doing this, just that I think this is the likely result in practice).
  2. Otel has proven three things, IMO: (1) making telemetry standards is really hard. (2) if you don't do it right, you won't make a standard but instead yet-another-non-standard (see all the pre-otel 'standards' that are now obsolete). (3) defining an API like this is hard/not something that can/should be done (hence why there is not only one).
  3. I don't think this is really as portable as it seems on paper.

I'd like to explore a simple alternative: "implementations of agentic networking should implement the OTEL specification around traces, metrics, and logs. Implementations may provide ways to enable or disable these as needed".

I believe this will meet all of the use cases, without the daunting task of defining a new telemetry standard. From a user POV they just want these emitted from their agentic network; we do not need a new CRD to achieve that.

Without picking on any specific comment in the thread, the overarching goal here, and of all/most of our standardization attempts (within the Gateway ecosystem at least) in general, is to enable users to use portable APIs to define their networking functionality. Of course we should not try to cover all cases, and some cases indeed are not worth standardization (often because they are very implementation-specific or not portable). However here, and in many prior discussions, we've tried to take the general rule of working towards enabling at least 80% of users to use portable APIs completely. (~20% will have their own, perhaps not portable, unique cases.)

It's true that we don't have any Telemetry/Observability API within Gateway, but I don't recall seeing any attempt to create one. I believe a lot of that is due to the fact that there were (and perhaps still are) higher priorities to standardize first. (It's important to mention that despite this, lots of vendors did put effort into having their own APIs for this, meaning the demand was there.)

What's more important though is that in the case of Agentic Networking, telemetry becomes one of the higher priorities. In fact, some users would only want an Agent Gateway / some form of PEPs just for telemetry/audit. Therefore I see a ton of value in standardizing this, and in having additional knobs beyond enable/disable.

I also believe there is a ton of value in such an API existing in Gateway directly; I am very supportive of bringing this back to Gateway when we all feel it's in good shape for doing so.

@howardjohn It would be good if we could get some more concrete feedback on the specific challenges and try to think about how to overcome them.

@howardjohn

Not my full answer, will think more and respond but wanted to point to kubernetes-sigs/gateway-api#554


@howardjohn howardjohn left a comment


Looking through the lens of "what would a smaller but still powerful API look like":

targetRefs: duh, we need this. But do we need Namespace? Or just Mesh for the Mesh case? Or not worry about Mesh for now.

tracing:

  • Do we need to support anything other than OTLP? OTLP won; it's the de facto and de jure standard. Anything else is probably a waste of time, at least in the core API.
    • TracingProvider.endpoint should be a backendRef though
  • Sampling is perhaps too simplistic -- will discuss more in a later comment as I need to run.
  • context - do we need anything other than w3c? It's 2026.
  • CustomAttribute: no doubt these are valuable, but if we can only have literalValue they're not. These only really provide value if we allow an ~expression language. Or a few well-known attributes; I would say "just always include these", but for agentic use cases specifically this could include the entire prompt/completion, which could warrant a first-class 'includeXYZ'-style API.
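The backendRef suggestion above might look roughly like this (a sketch only; the backendRef field names are borrowed from Gateway API conventions and are not part of the current proposal):

```yaml
# Hypothetical: TracingProvider referencing the collector as a backendRef
# instead of a raw endpoint string, following Gateway API conventions.
tracing:
  provider:
    type: OTLP
    backendRef:
      name: otel-collector      # Service name (assumed)
      namespace: monitoring
      port: 4317                # OTLP gRPC port
```

A backendRef would let the implementation resolve endpoints, apply ReferenceGrant-style cross-namespace checks, and surface status conditions, none of which a plain endpoint string supports.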

metrics: I think we just drop it entirely, TBH. It's pretty problematic to dynamically enable and disable metrics, especially on a per-metric or per-label-value basis. If we are talking prometheus there is also no config. If we need to support OTLP then just have backendRef.

  • Definitely recommend dropping metrics override which is non-standard, hard to use, and hard to implement.

Logs:

  • Notably missing OTLP. Should it be? I know I said I was trying to reduce the scope :-) but it matters and influences other fields.
  • Format -- not clear we need this. If it's OTLP, there is no format. If not, the proxy probably either uses structured logging or does not; should a user really define that?
  • match: nice to have but not strictly required. If we are going to do it, and are fine with CEL, just fully embrace CEL and don't have any other match types. If not, read https://blog.howardjohn.info/posts/cel-is-good/#cel-alternatives and think through whether a non-CEL-based approach can make an acceptable user experience.
  • fields: I don't think this can possibly just be a simple list of strings rather than k/v pairs, so it's hard to concretely discuss, but it probably should be the same as CustomAttribute in tracing, and the same concerns apply there.

provider:
type: OTLP # or implementation-specific
endpoint: "otel-collector.monitoring.svc:4317"
samplingRate:

I think you need to explain how sampling of traces works. Is this respecting the existing "sampling" decisions? Is this for requests without an existing context?

Author

Added a brief explanation. This is the base sampling rate across all requests. The optional parentBasedSampling config allows for a distinct sampling rate specifically for requests that are already part of a trace.
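A sketch of the two-tier sampling described above (samplingRate and parentBasedSampling are named in the proposal and this comment; the nesting and units shown here are assumptions):

```yaml
# Hypothetical shape of the two sampling knobs described in the comment above.
tracing:
  samplingRate: 1              # base rate applied to requests with no existing
                               # trace context (units assumed: percent)
  parentBasedSampling:
    samplingRate: 100          # distinct rate for requests that already carry
                               # a parent trace context (units assumed: percent)
```

This mirrors the parent-based samplers in the OTel SDKs, where the root-span decision and the follow-the-parent decision are configured separately.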

Comment on lines +99 to +100
matches: # Conditional logging
  - cel: "response.code >= 500" # CEL-based filtering for errors

If we only want CEL it's a bit awkward to have a list. But it may make sense if we have non-CEL matchers.

Comment on lines +101 to +104
fields: # Configure specific fields to include
  - "start_time"
  - "response_code"
  - "x-token-usage"

Is there any definition of what these fields mean?

For a concrete example, say I want to log the MCP task name (I chose this since it's not in https://opentelemetry.io/docs/specs/semconv/gen-ai/mcp/). Can I do it? What do I put as the field if I want to?

Comment on lines +220 to +223
type Dimension struct {
    Key        string `json:"key"`
    FromHeader string `json:"fromHeader,omitempty"`
}

The 3 APIs all have the same property of "add a K/V pair" but do so in 3 different ways. Does that make sense? Should we be more consistent across them?

It seems odd that:

  • tracing: literal only
  • metrics: header only
  • log: a name only, without a value
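One hypothetical way to converge the three, sketched purely for illustration (the `attributes`, `literal`, and `fromHeader` names are assumptions, not part of the proposal):

```yaml
# Hypothetical unified attribute shape, reusable by tracing, metrics, and logs;
# exactly one value source per entry.
attributes:
  - key: environment
    literal: "production"       # static value (today: tracing only)
  - key: user_tier
    fromHeader: "x-user-tier"   # extracted from a request header (today: metrics only)
```

A single shared shape like this would also give logs a place to carry values rather than bare field names, addressing all three inconsistencies at once.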

@rikatz
Member

rikatz commented Apr 8, 2026

Can we move this discussion to Gateway API as a provisional GEP first, and then experimental? I think a lot of the discussion here is happening in a context that is important not only for agentic, but for the whole Gateway API ecosystem, and I wouldn't like to receive this proposal as "we discussed and approved this on agentic and now it needs to be implemented this way on Gateway API".

Thanks!

@kflynn

kflynn commented Apr 9, 2026

Seconding @rikatz's comment – this seems applicable to much more than just the agentic world, and I'd love to get eyes on it from Gateway API. Thanks!! 🙂


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/documentation Categorizes issue or PR as related to documentation. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
