
TelemetryPolicy proposal#69

Open
gkhom wants to merge 21 commits into kubernetes-sigs:main from gkhom:main

Conversation

@gkhom

@gkhom gkhom commented Feb 9, 2026

What type of PR is this?

/kind documentation

What this PR does / why we need it:

This PR contains a proposal for a new TelemetryPolicy API. This K8s API aims to standardize how users enable and configure telemetry (metrics, logs, traces) across different data plane implementations, replacing vendor-specific CRDs with a unified, portable spec.

Which issue(s) this PR fixes:

Fixes #

Does this PR introduce a user-facing change?:

* TelemetryPolicy specification

gkhom added 2 commits February 9, 2026 14:58
This change includes context, problem description, and design objectives for a TelemetryPolicy proposal. If the community agrees on this context, then I will follow up with the actual API specification.
@k8s-ci-robot k8s-ci-robot added the kind/documentation Categorizes issue or PR as related to documentation. label Feb 9, 2026
@netlify

netlify bot commented Feb 9, 2026

Deploy Preview for kube-agentic-networking ready!

| Name | Link |
|------|------|
| 🔨 Latest commit | 8600d0b |
| 🔍 Latest deploy log | https://app.netlify.com/projects/kube-agentic-networking/deploys/69cdd554ee5388000880923d |
| 😎 Deploy Preview | https://deploy-preview-69--kube-agentic-networking.netlify.app |

@linux-foundation-easycla

linux-foundation-easycla bot commented Feb 9, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot
Contributor

Welcome @gkhom!

It looks like this is your first PR to kubernetes-sigs/kube-agentic-networking 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/kube-agentic-networking has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 9, 2026
@k8s-ci-robot
Contributor

Hi @gkhom. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Feb 9, 2026
@haiyanmeng
Contributor

CLA Not Signed

@gkhom , can you fix this?

@haiyanmeng
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 10, 2026
@LiorLieberman
Member

/easycla

@LiorLieberman
Member

/check-cla

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Feb 10, 2026
Contributor

@rubambiza rubambiza left a comment

@gkhom Thanks for kicking this off. I left some clarifying questions and proposing a minor change. The bigger part of the review is to surface work already being proposed/done in the llm-d community with regard to tracing and whether it is applicable to our objectives.

This proposal introduces the `TelemetryPolicy`, a direct policy attachment designed to configure observability signals (metrics, logs, traces)
for Gateway API resources (via `Gateway` attachment) and Service Mesh resources (via `namespace` attachment).

This K8s API standardizes how users enable and configure telemetry across different data plane implementations, replacing vendor-specific CRDs
Contributor

I am (acting as) a naive reader, and I was immediately curious what some examples of these vendor-specific CRDs are. This also ties to and might clarify the below mention of "Observability lock-in".

Author

Examples of such CRDs are:

  • Istio's Telemetry CRD
  • Envoy Gateway's EnvoyProxy and EnvoyGateway CRDs
  • Kong's MeshMetrics/MeshTrace/MeshAccessLog
  • Kuadrant's TelemetryPolicy

I intend to write a section that compares such existing APIs and the proposed TelemetryPolicy in the eventual proposal.

Contributor

Seeing as there's a mix of examples here, will the scope cover one resource for all of the signals (metrics, logs, traces) vs. separate ones? Are there tradeoffs to consider here?

Author

I'm leaning towards one resource for all. The argument for splitting them might be that different personas are involved in configuring the different aspects of observability. In practice, I think that the persona that configures metrics likely also configures tracing and access logs. So to avoid complicating the API with three additional resources, it seems worthwhile to put all of it in a single resource.
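For illustration, a single resource covering all three signals might look roughly like this (a hypothetical sketch; the group/version and field names are placeholders, not the proposed spec):

```yaml
apiVersion: gateway.networking.k8s.io/v1alpha1  # placeholder group/version
kind: TelemetryPolicy
metadata:
  name: all-signals
  namespace: prod-ns
spec:
  targetRef:                 # GEP-713 attachment
    group: gateway.networking.k8s.io
    kind: Gateway
    name: prod-gateway
  metrics:                   # all three signals in one resource
    enabled: true
  tracing:
    enabled: true
    samplingRate: 10         # percent; illustrative knob
  accessLogging:
    enabled: true
    format: JSON
```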

* **Cost is Volatile**: Usage is measured in tokens, not just requests. A single HTTP 200 OK could cost $0.01 or $10.00 depending on the prompt and model used.
* **Context is King**: Debugging requires knowing the semantic context: Which Model? Which Prompt? Which tool?

Existing telemetry policies are unaware of the Generative AI semantic conventions. They see an opaque TCP stream or HTTP POST. Without a standardized API to
Contributor

In line with the header, may I suggest adding "unaware of the emerging Generative AI semantic conventions"?

Author

Good point, will update.


1. **Standardization**: A single API for Gateway and Mesh to configure Access Logging, Metrics generation, and Tracing propagation.
2. **GEP-713 Compliance**: Support `targetRef` attachment to `Gateway` and `Namespace`. The latter covers Mesh use-cases.
3. **Agentic Support**: Enable the capture of OpenTelemetry GenAI Semantic Conventions and support the requirements of PR #33.
Contributor

In the spirit of standardization and not reinventing the wheel, I wanted to mention that the llm-d community is already moving on tracing + OTel + GenAI Semantic Conventions. In particular, Sally O'Malley from Red Hat proposed and did a POC for distributed tracing in llm-d. [aside: I learned about this work from Sally on another community call for our kagenti project]

This may be applicable here for a few reasons:

  • We are keen on integrating OTel and GenAI semantic conventions, too
  • One of our objectives is a single API for Gateways and Meshes, and Sally's POC has already landed some changes to support tracing to the Gateway API Inference Extension (GAIE) components like the endpoint pickers (proposal comment, GAIE PR).

While llm-d is focused on distributed LLM inferencing regardless of source (i.e., user chat -> LLM vs agent -> LLM), I think it's worth considering any lessons they may have already encountered and API definitions that could overlap with our case, at the very least at the Gateway level. I'd be willing to evangelize our thinking to Sally to get her thoughts, but more importantly curious on our interest level.

Author

It would certainly be valuable to get some of their insights and experiences. The proposal seems to cover configuration through environment variables, have they defined CRDs as well?

Contributor

No, I did not see any CRD definitions. I'll keep this thread in mind as the definitions become more concrete.

for Gateway API resources (via `Gateway` attachment) and Service Mesh resources (via `namespace` attachment).

This K8s API standardizes how users enable and configure telemetry across different data plane implementations, replacing vendor-specific CRDs
with a unified, portable spec.
Contributor

Would you see implementations reconciling the TelemetryPolicy and reading the bits that are relevant to their components? So multiple controllers read the CR and take actions to enable telemetry across the components they are controlling?

Author

It is indeed possible to distribute the responsibility across multiple controllers; it's up to the implementation. In most cases I'm familiar with, a single controller/control plane programs all three observability features (metrics, traces, logs).

Member

Possible, but a little challenging. What are the cases where we see multiple implementations reconcile the same thing?

@david-martin
Contributor

I'm generally in favour of this proposal.
/approve

Perhaps more relevant when it comes to the specification, it would be good to know more about the current 'state of the art' in this space.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: david-martin, gkhom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 17, 2026
@k8s-ci-robot k8s-ci-robot removed the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Feb 17, 2026
@haiyanmeng
Contributor

If the community agrees on this context then I will follow up with the actual API specification.

@gkhom , the PR description needs to be updated since the API specification is included in the PR now.

Contributor

@evaline-ju evaline-ju left a comment

Generally looking good to me - a couple lingering questions

```yaml
  namespace: prod-ns
spec:
  # GEP-713 Attachment
  targetRef:
```
Contributor

Looking at the GEP-713 description, will we allow multiple attachments (targetRefs)? Perhaps with the "what namespace targets?" comment below, it'd be good to understand the precedence for resolving multiple policies (gateway vs. namespace policy) more clearly. Are they to be non-overlapping?

Similarly, I assume policy status will be included?

Author

The intent is indeed to allow multiple attachments. A single TelemetryPolicy can be used to configure for multiple namespaces and/or gateways. I will fix the mistake in the example to state targetRefs.

Regarding multiple policies, I was thinking that only a single TelemetryPolicy is allowed to target a specific resource. A TelemetryPolicy that targets a namespace or Gateway that is already targeted by another TelemetryPolicy should be rejected.

Regarding status, it will be included. Does it need to be mentioned explicitly in the spec?
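Sketched as YAML, the plural targetRefs attachment might look like this (illustrative only; field names follow GEP-713 conventions but are not final):

```yaml
spec:
  targetRefs:                       # plural: one policy, several targets
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: prod-gateway
  - group: ""                       # core API group for Namespace
    kind: Namespace
    name: prod-ns
```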

Contributor

I personally was not sure if status was going to be unique enough to call out here, but it seems @guicassolato also called out the status stanza in #69 (comment), so including it explicitly will likely avoid further confusion.

Contributor

@guicassolato guicassolato left a comment

I think overall this is a strong proposal with a compelling use case and well-designed API.

My main point of concern is whether it has enough to make it specific to the agentic use case, or could otherwise better serve upstream Gateway API as a whole.

Following [GEP-713](https://gateway-api.sigs.k8s.io/geps/gep-713/), the `TelemetryPolicy` supports the following attachments:

1. **Gateway (Instance Scope)**: Configures the telemetry for a specific `Gateway`.
2. **Namespace (Mesh Scope)**: Configures the telemetry for all mesh proxies (sidecar proxy / node proxy / etc.) in that namespace.
Contributor

Defining what namespace targeting means and even if that definition points to mesh use case only is fine. But it has to be more than implied IMO. It has to be by design and well specified/documented, so all implementations of the API will commit to the same meaning and behaviour.

(I believe that's what @gkhom has in mind, but good to spell it out, I think.)

On a side note, if namespace targeting is for the mesh use case, have you considered the Mesh kind (and avoid any possible confusion altogether)?

```yaml
provider:
  type: Prometheus
overrides:
- name: "gateway.networking.k8s.io/http/request_count"
```
Contributor

Are these metric names standardised somehow or the alignment between who owns the policy and who consumes the metrics is something we expect to happen but out of scope of the proposal?

Author

This specific metric name is just an arbitrary example. The closest thing to a standard for metric names would be OTel's semantic conventions. The alignment between policy owner and metric consumer is indeed out-of-scope (but I'm happy to try to include it if needed).


nitpick, but can you use an obviously dummy name to avoid making it seem like it's a real proposal? Something like example.com/http/request_count would be fine.
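With an obviously dummy name, the override entry from the hunk above would read (illustrative; same field names as the excerpt):

```yaml
overrides:
- name: "example.com/http/request_count"  # obviously dummy metric name
```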


```go
type TelemetryPolicySpec struct {
	// Identifies the target resources (Gateway or Namespace) to which this policy attaches (GEP-713).
	TargetRefs []TargetReference `json:"targetRefs"`
```
Contributor

We could probably reuse here one of the Gateway API types from https://github.com/kubernetes-sigs/gateway-api/blob/main/apis/v1/policy_types.go.

I wonder, for example, if namespace targeting should be allowed only in the same namespace (LocalPolicyTargetReference) or "at distance" leveraging ReferenceGrants.

Author

I'm leaning towards NamespacedPolicyTargetReference to allow cross-namespace configuration using ReferenceGrants. This would allow central management of uniform telemetry configuration.


Allowing cross-namespace references entirely violates namespace boundaries. Namespace X should definitely not be able to modify namespace Y's configuration.

If we want uniform management then IMO the correct way to do this would be to modify a global object, probably GatewayClass / Mesh. Note this is still a bit awkward since you need a namespaced-resource to modify a cluster-scoped resource, so would need some additional layer of policy like only allowing it from some trusted admin namespace (Istio has this concept as "root namespace").

This hasn't been done in the Gateway API space so would be novel

Member

I think we should definitely not start with allowing cross-ns.

Regarding if and how to allow it, I also usually don't like this idea of controlling other telemetry endpoints from another namespace. We could adopt either
(a) a "magic namespace" approach like many other implementations are doing, where policies in this namespace have global scope, OR
(b) a ClusterTelemetryPolicy resource that has global scope
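Option (b) might be sketched as follows (purely hypothetical; neither this kind nor its fields exist in the current proposal):

```yaml
apiVersion: gateway.networking.k8s.io/v1alpha1  # placeholder group/version
kind: ClusterTelemetryPolicy                    # hypothetical cluster-scoped variant
metadata:
  name: default-telemetry                       # no namespace: cluster-scoped
spec:
  tracing:
    enabled: true
```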

```yaml
format: JSON
matches: # Conditional logging
- path: "/api/v1/sensitive"
- cel: "response.code >= 500" # CEL-based filtering for errors
```

Author

@gkhom gkhom Mar 12, 2026

Good question, those should be included but we will likely need more. For example, a response object. I'll add a section to make it explicit.

@kyessenov kyessenov Mar 18, 2026

CEL is great for the long tail, but it's too slow and obscure as the primary method. It'll add 3-5x overhead to basic matching.


## The TelemetryPolicy Specification

We propose the `TelemetryPolicy` as a direct policy attachment in the `gateway.networking.k8s.io` API group. See [GEP-713](https://gateway-api.sigs.k8s.io/geps/gep-713/#classes-of-policies) for more information on Direct attachment.
Contributor

Should it belong to the agentic API group or maybe start as such while it's proposed in the scope of the subproject?

Author

Makes sense, will update.

Comment on lines +9 to +10
This proposal introduces the `TelemetryPolicy`, a direct policy attachment designed to configure observability signals (metrics, logs, traces)
for Gateway API resources (via `Gateway` attachment) and Service Mesh resources (via `namespace` attachment).
Contributor

I think it's OK to call it a "Direct Policy" while Gateway and Namespace as supported target kinds are for two completely disjoint use cases – ingress and mesh.

By definition, it's only direct when:

A single kind supported in spec.targetRefs.kind

Author

Would it be more accurate to call it "inherited policy"?

```go
}

type TracingProvider struct {
	Type string `json:"type"`
```
Contributor

Would this be an enum or completely free for the implementations to define? Maybe some Core and Extended levels?

Following [GEP-713](https://gateway-api.sigs.k8s.io/geps/gep-713/), the `TelemetryPolicy` supports the following attachments:

1. **Gateway (Instance Scope)**: Configures the telemetry for a specific `Gateway`.
2. **Namespace (Mesh Scope)**: Configures the telemetry for all mesh proxies (sidecar proxy / node proxy / etc.) in that namespace.
Contributor

Although implied when calling it "direct policy", I think it may useful to add a note about the merge semantics, which I imagine will be the None one. I.e., describe the behaviour of what happens when 2 policy resources of this kind target the same object (same gateway or same namespace).

```go
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec TelemetryPolicySpec `json:"spec"`
}
```

Author

Thanks, I've added the status stanza.
Is the detailed explanation/comment in the *PolicyStatus Go struct required in every proposal?

```yaml
type: Counter
dimensions: # Custom labels/dimensions
- key: "model_id"
  fromHeader: "x-model-id" # Crucial for Agentic workloads
```
Contributor

Any other possible sources in mind? E.g. fromMetadata?

Author

Indeed, fromMetadata is an example. Potentially an advanced fromPayload might be worth considering in the future.
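A dimensions list combining the sources mentioned here might look like this (illustrative; fromMetadata and its key are assumptions, not spec):

```yaml
dimensions:
- key: "model_id"
  fromHeader: "x-model-id"            # from the excerpt above
- key: "mesh_cluster"
  fromMetadata: "istio.cluster_id"    # hypothetical proxy-metadata key
# a future fromPayload source could extract fields from the request body
```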

@howardjohn howardjohn left a comment

I really think we should not do this.

  1. I don't think "Kubernetes Agentic Networking" is an appropriate venue to define a general observability API. This has absolutely nothing to do with agentic networking specifically, and it feels like this is being used to "smuggle in" a standard through a low-barrier project so that it can gain momentum and face less scrutiny (just to be 100% clear: I am not accusing you of intentionally doing this, just that I think this is the likely result in practice).
  2. Otel has proven a few things, IMO: (1) making telemetry standards is really hard. (2) if you don't do it right, you won't make a standard but instead yet-another-non-standard (see all the pre-otel 'standards' that are now obsolete). (3) defining an API like this is hard/not something that can/should be done (hence why there is not only one).
  3. I don't think this is really as portable as it seems on paper.

I'd like to explore a simple alternative: "implementations of agentic networking should implement the OTEL specification around traces, metrics, and logs. Implementations may provide ways to enable or disable these as needed".

I believe this will meet all of the use cases, without the daunting task of defining a new telemetry standard. From a user POV they just want these emitted from their agentic network; we do not need a new CRD to achieve that

```yaml
- attributeName: "env"
  literalValue: "production"

# 2. Metrics Configuration
```

I think it's questionable how portable dynamic configuration of metrics is. I've only seen Envoy do this, and even then it's extremely bug-ridden historically. It's very confusing for users what the semantics are, or should be, to add or remove labels from metrics. Or even adding a metric -- imagine I have a dashboard like request_count/error_count, but I added error_count after request_count, so it's all out of whack...


```go
type MatchCondition struct {
	// Path allows filtering to specific paths.
	Path string `json:"path,omitempty"`
```

why bother having this if we already have CEL and it's trivial to express in CEL?

Member

The way I understand this is that "Path" is a (convenient?) simpler shortcut for just logging vs expressing this in CEL. Probably if we were to leave it we should do it as a oneof field.

But it makes sense to remove and start with an API without it.


CEL is an order of magnitude slower than native code... Even if you use CEL you'd have to add a bunch of variants (query stripped, normalized, escaped, raw path) which will make it more confusing than having a structured condition.


Maybe in some implementations. Should a general-purpose, vendor-agnostic API switch to a suboptimal user experience (let's assume here that CEL is a preferred UX, since if it's a bad UX and bad performance it would obviously be a bad choice) because some implementations do not implement it optimally?

We are handling these at ~native speeds in our implementation.

Even if you use CEL you'd have to add a bunch of variants (query stripped, normalized, escaped, raw path)

This is exactly why I would prefer CEL personally, so we don't have to make our YAML api have 5 variations just for path -- not to mention headers, query params, cookies we will have like 25 fields just to match headers
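To make the trade-off concrete, the same condition in both styles (the CEL variable names are assumptions; no CEL environment has been specified for this API yet):

```yaml
matches:
# structured shortcut: one dedicated field per concept
- path: "/api/v1/sensitive"
# the same condition as a CEL expression
- cel: 'request.path == "/api/v1/sensitive"'
```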


CEL is an interpreted language by design, so unless you're carving out a specific subset (or breaking some semantics) - it cannot match native speeds. In either of those cases, that is not vendor neutral, there's no guarantee that the "fast" subset of CEL is the same across implementations.

YAML gives that for free. Every implementation will be similarly efficient, and all the "meta" tooling will support it without having to compile/run CEL.


FWIW CEL is used throughout the other APIs in this repo already. IMO this project needs to decide on CEL or not, so we don't have to have this debate on each field. It's not good for users to have 50% of fields (that would make sense to use CEL) use CEL and 50% not, just because of who made the API, and it's not great for us to have to debate it on each field usage.

@LiorLieberman @robscott


Sure, as someone who worked on the CEL C++ runtime and made it open source - I'm telling you CEL is not the right paradigm for the "95%" of data path cases. It's too slow, and it prevents meta-tooling (management and control planes) from analyzing the semantics. It makes perfect sense as an "extensible context-aware condition" (e.g. like Wasm) for the remaining 5% of tail-end bespoke cases, but you wouldn't use CEL for label selectors, would you?

* **Custom Attributes**: Enables the extraction of specific headers and proxy metadata into log entries.
* **Sinks**: Defaults to standard container logging (stdout) with extensibility for OTLP or external ports.

## Comparison with Prior Art

N.B that all of these are wrappers around Envoy

Contributor

Good point.
Though the Kuadrant TelemetryPolicy referenced is an API to supplement metrics coming from an internal component (limitador) that envoy filters to, rather than envoy itself.

Member

@howardjohn - can you offer other prior art that does not rely on Envoy? I recall seeing a comment here or in Slack welcoming more help/contributions to the prior art section.


@david-martin
Contributor

I really think we should not do this.

  1. I don't think "Kubernetes Agentic Networking" is an appropriate venue to define a general observability API. This has absolutely nothing to do with agentic networking specifically, and it feels like this is being used to "smuggle in" a standard through a low-barrier project so that it can gain momentum and face less scrutiny (just to be 100% clear: I am not accusing you of intentionally doing this, just that I think this is the likely result in practice).

Yeah, I can see where you are coming from. Others have raised similar concerns about there not being sufficient AI-focused content (e.g. #69 (review)).
I really don't think there's any ill intent or attempt to bypass/smuggle.

  1. Otel has proven three things, IMO: (1) making telemetry standards is really hard. (2) if you don't do it right, you won't make a standard but instead yet-another-non-standard (see all the pre-otel 'standards' that are now obsolete). (3) defining an API like this is hard/not something that can/should be done (hence why there is not only one).

Agreed on 1.
Are there examples of 2 you could share so we can better understand why these have failed or are obsolete?
On 3, I'm still on the optimistic side that something could be done, but that being said, I'm not an implementer of a full gateway solution that would have to agree to and implement a standard API.
Is the proposal too ambitious perhaps in its current form? Could it be slimmed down, but made extensible?
What if we started with just standardising 1 thing, like a spec for enabling/disabling the different signals and where to send them to?

  1. I don't think this is really as portable as it seems on paper.

I'd like to explore a simple alternative: "implementations of agentic networking should implement the OTEL specification around traces, metrics, and logs. Implementations may provide ways to enable or disable these as needed".

Do you see this being a central place to enable/disable or distributed?

@howardjohn

I really don't think there's any ill intent or attempt to bypass/smuggle.

I 100% agree as I tried to convey, I don't think it is intended, but I think it's the accidental result. Which is a common pattern to happen accidentally (I've seen a lot of bad APIs unintentionally brought in because of last-minute release rush, for instance!).

Do you see this being a central place to enable/disable or distributed?

I don't really see the need for a CRD to do this. I struggle to believe that users are going to get more benefit from a CRD controlling a knob than from just --set telemetry.enabled=<true|false> or the equivalent in their implementation install. Vendor-neutral APIs are valuable when they are building blocks for extensibility IMO. That is why HTTPRoute and otel are so powerful -- HTTPRoute is a common foundation for a huge ecosystem to build around, as otel is for telemetry. But just providing a common way to flip a knob is not value for anyone, nor a meaningful extension point.

What if we started with just standardising 1 thing, like a spec for enabling/disabling the different signals and where to send them to?

FWIW otel already has standards for this: https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/#general-sdk-configuration. I 100% agree env vars are an awkward standard in this context though, just something to take into account.
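For reference, the OTel SDK environment-variable spec linked above already covers per-signal enablement, exporter selection, and sampling. A sketch of how those knobs could appear on a gateway container (variable names come from the OTel spec; the container context and all values are illustrative):

```yaml
# Sketch: standard OTel SDK environment variables applied to a hypothetical
# gateway container. Names come from the OTel SDK spec; values are illustrative.
env:
  - name: OTEL_SDK_DISABLED             # global on/off switch
    value: "false"
  - name: OTEL_TRACES_EXPORTER          # per-signal exporter selection
    value: "otlp"
  - name: OTEL_METRICS_EXPORTER
    value: "none"                       # disable a single signal
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector.monitoring.svc:4317"
  - name: OTEL_TRACES_SAMPLER           # sampling strategy
    value: "parentbased_traceidratio"
  - name: OTEL_TRACES_SAMPLER_ARG       # argument to the sampler (ratio here)
    value: "0.01"
```

This illustrates the "awkward in this context" point: the knobs exist, but they configure an SDK inside one workload rather than a fleet of data planes.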


FWIW, I hate to be in the position where I say 'we should not do this'. However, I think few people are willing to do it, since it feels bad and un-collaborative, and it's easier to just debate whether specific aspects of an API are good or bad or should change than to step back, so I wanted to make sure I at least expressed this.

@gkhom
Author

gkhom commented Mar 6, 2026

I don't think "Kubernetes Agentic Networking" is an appropriate venue to define a general observability API. This has absolutely nothing to do with agentic networking specifically, and it feels like this is being used to "smuggle in" a standard through a low-barrier project so that it can gain momentum and face less scrutiny (just to be 100% clear: I am not accusing you of intentionally doing this, just that I think this is the likely result in practice).

As this is my first ever contribution to the CNCF/K8s ecosystem, my intention is definitely not to bypass scrutiny or "smuggle" anything in. I opened this proposal here because I wanted to solve a real problem I've encountered (on several occasions) and this seemed like a reasonable starting point.

if you don't do it right, you won't make a standard but instead yet-another-non-standard (see all the pre-otel 'standards' that are now obsolete).

I definitely want to avoid creating a fragmented or obsolete standard. Could you point me to some of the pre-OTel standardization attempts you mentioned? I'd love to understand where they fell short in defining a configuration API.

defining an API like this is hard/not something that can/should be done

I recognize that standardizing this API carries significant complexity. Could you elaborate on the specific technical hurdles or roadblocks you envision? I'd really appreciate your perspective on what makes this intractable.

I don't think this is really as portable as it seems on paper.

Could you provide some examples or edge cases where you see portability breaking down in this design? I'd like to see if we can address those gaps.

I'd like to explore a simple alternative: "implementations of agentic networking should implement the OTEL specification around traces, metrics, and logs. Implementations may provide ways to enable or disable these as needed".

This would work as a baseline, but my concern is that it will fall short when users need to customize their telemetry (which is fairly common in practice). In absence of a standard, users will inevitably have to resort to vendor-specific APIs for any customization. Because a lot of operational machinery gets built around telemetry pipelines, relying on vendor-specific APIs for customization (re-)introduces the risk of vendor lock-in and becomes a blocker to portability.

I think this is especially critical for agentic networking because it deals with autonomous tool execution. Telemetry here isn't just about troubleshooting and performance. It's a core requirement for operations, security, and auditability. If telemetry configuration is left entirely to vendor-specific implementations, it becomes nearly impossible for platform teams to guarantee a consistent audit trail across heterogeneous environments.

@LiorLieberman
Member

I really think we should not do this.

  1. I don't think "Kubernetes Agentic Networking" is an appropriate venue to define a general observability API. This has absolutely nothing to do with agentic networking specifically, and it feels like this is being used to "smuggle in" a standard through a low-barrier project so that it can gain momentum and face less scrutiny (just to be 100% clear: I am not accusing you of intentionally doing this, just that I think this is the likely result in practice).
  2. Otel has proven three things, IMO: (1) making telemetry standards is really hard. (2) if you don't do it right, you won't make a standard but instead yet-another-non-standard (see all the pre-otel 'standards' that are now obsolete). (3) defining an API like this is hard/not something that can/should be done (hence why there is not only one).
  3. I don't think this is really as portable as it seems on paper.

I'd like to explore a simple alternative: "implementations of agentic networking should implement the OTEL specification around traces, metrics, and logs. Implementations may provide ways to enable or disable these as needed".

I believe this will meet all of the use cases, without the daunting task of defining a new telemetry standard. From a user POV they just want these emitted from their agentic network; we do not need a new CRD to achieve that.

Without picking on any specific comment in the thread, the overarching goal here, and of all/most of our standardization attempts (within the Gateway ecosystem at least) in general, is to enable users to use portable APIs to define their networking functionality. Of course we should not try to cover all cases, and some cases indeed are not worth standardization (often because they are very implementation-specific or not portable). However here, and in many prior discussions, we've tried to take the general rule of working towards enabling at least 80% of users to use portable APIs completely. (~20% will have their own, perhaps not portable, unique cases.)

It's true that we don't have any Telemetry/Observability API within Gateway, but I don't recall seeing any attempt to create one. I believe a lot of that is due to the fact that there were (and perhaps still are) higher priorities to standardize first. (It's important to mention that despite this, lots of vendors did put effort into having their own APIs for this, meaning the demand was there.)

What's more important though is that in the case of Agentic Networking, telemetry becomes one of the higher priorities. In fact, some users would only want an Agent Gateway / some form of PEPs just for telemetry/audit. Therefore I see a ton of value in standardizing this, and in having additional knobs beyond enable/disable.

I also believe there is a ton of value in such an API existing in Gateway directly; I am very supportive of bringing this back to Gateway when we all feel it's in good shape for doing so.

@howardjohn It would be good if we could get some more concrete feedback on the specific challenges and try to think about how to overcome them.

@howardjohn

Not my full answer, will think more and respond but wanted to point to kubernetes-sigs/gateway-api#554


@howardjohn howardjohn left a comment


Looking through the lens of "what would a smaller but still powerful API look like":

targetRefs: duh, we need this. But do we need Namespace? Or just Mesh for the Mesh case? Or not worry about Mesh for now.

tracing:

  • Do we need to support anything other than OTLP? OTLP won; it's the de facto and de jure standard. Anything else is probably a waste of time, at least in the core API.
    • TracingProvider.endpoint should be a backendRef though
  • Sampling is perhaps too simplistic -- will discuss more in a later comment as I need to run.
  • context - do we need anything other than w3c? It's 2026.
  • CustomAttribute: no doubt these are valuable, but if we can only have literalValue they're not. These only really provide value if we allow an ~expression language. Or a few well-known attributes; I would say "just always include these", but for agentic use cases specifically this could include the entire prompt/completion, which could warrant a first-class 'includeXYZ'-style API.
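The backendRef suggestion above might look roughly like this (a sketch only; the backendRef field names are borrowed from Gateway API conventions and are not part of the current proposal):

```yaml
# Hypothetical: TracingProvider referencing the collector as a backendRef
# instead of a raw endpoint string, following Gateway API conventions.
tracing:
  provider:
    type: OTLP
    backendRef:
      name: otel-collector      # Service name (assumed)
      namespace: monitoring
      port: 4317                # OTLP gRPC port
```

A backendRef would let the implementation resolve endpoints, apply ReferenceGrant-style cross-namespace checks, and surface status conditions, none of which a plain endpoint string supports.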

metrics: I think we just drop it entirely, TBH. It's pretty problematic to dynamically enable and disable metrics, especially on a per-metric or per-label-value basis. If we are talking prometheus there is also no config. If we need to support OTLP then just have backendRef.

  • Definitely recommend dropping metrics override which is non-standard, hard to use, and hard to implement.

Logs:

  • Notably missing OTLP. Should it be? I know I said I was trying to reduce the scope :-) but it matters and influences other fields.
  • Format -- not clear we need this. If it's OTLP, there is no format. If not, the proxy probably either uses structured logging or does not; should a user really define that?
  • match: nice to have but not strictly required. If we are going to do it, and are fine with CEL, just fully embrace CEL and don't have any other match types. If not, read https://blog.howardjohn.info/posts/cel-is-good/#cel-alternatives and think through whether a non-CEL-based approach can make an acceptable user experience.
  • fields: I don't think this can possibly just be a simple list of strings rather than k/v pairs, so it's hard to concretely discuss, but it probably should be the same as CustomAttribute in tracing, and the same concerns apply there.

provider:
type: OTLP # or implementation-specific
endpoint: "otel-collector.monitoring.svc:4317"
samplingRate:

I think you need to explain how sampling of traces works. Is this respecting the existing "sampling" decisions? Is this for requests without an existing context?

Author

Added a brief explanation. This is the base sampling rate across all requests. The optional parentBasedSampling config allows for a distinct sampling rate specifically for requests that are already part of a trace.
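A sketch of the two-tier sampling described above (samplingRate and parentBasedSampling are named in the proposal and this comment; the nesting and units shown here are assumptions):

```yaml
# Hypothetical shape of the two sampling knobs described in the comment above.
tracing:
  samplingRate: 1              # base rate applied to requests with no existing
                               # trace context (units assumed: percent)
  parentBasedSampling:
    samplingRate: 100          # distinct rate for requests that already carry
                               # a parent trace context (units assumed: percent)
```

This mirrors the parent-based samplers in the OTel SDKs, where the root-span decision and the follow-the-parent decision are configured separately.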

Comment on lines +99 to +100
matches: # Conditional logging
  - cel: "response.code >= 500" # CEL-based filtering for errors

If we only want CEL it's a bit awkward to have a list. But it may make sense if we have non-CEL matchers.

Comment on lines +101 to +104
fields: # Configure specific fields to include
  - "start_time"
  - "response_code"
  - "x-token-usage"

Is there any definition of what these fields mean?

For a concrete example, say I want to log the MCP task name (I chose this since it's not in https://opentelemetry.io/docs/specs/semconv/gen-ai/mcp/). Can I do it? What do I put as the field if I want to?

Comment on lines +220 to +223
type Dimension struct {
    Key        string `json:"key"`
    FromHeader string `json:"fromHeader,omitempty"`
}

The 3 APIs all have the same property of "add a K/V pair" but do so in 3 different ways. Does that make sense? Should we be more consistent across them?

It seems odd that:

  • tracing: literal only
  • metrics: header only
  • log: a name only, without a value
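One hypothetical way to converge the three, sketched purely for illustration (the `attributes`, `literal`, and `fromHeader` names are assumptions, not part of the proposal):

```yaml
# Hypothetical unified attribute shape, reusable by tracing, metrics, and logs;
# exactly one value source per entry.
attributes:
  - key: environment
    literal: "production"       # static value (today: tracing only)
  - key: user_tier
    fromHeader: "x-user-tier"   # extracted from a request header (today: metrics only)
```

A single shared shape like this would also give logs a place to carry values rather than bare field names, addressing all three inconsistencies at once.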

@rikatz
Member

rikatz commented Apr 8, 2026

Can we move this discussion to Gateway API as a provisional GEP first, and then experimental? I think a lot of the discussion here is happening in a context that is important not only for agentic, but for the whole Gateway API ecosystem, and I wouldn't like to receive this proposal as "we discussed and approved this on agentic and now it needs to be implemented this way on Gateway API".

Thanks!

@kflynn

kflynn commented Apr 9, 2026

Seconding @rikatz's comment – this seems applicable to much more than just the agentic world, and I'd love to get eyes on it from Gateway API. Thanks!! 🙂


Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/documentation Categorizes issue or PR as related to documentation. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
