Conversation
This change includes the context, problem description, and design objectives for a TelemetryPolicy proposal. If the community agrees on this context, I will follow up with the actual API specification.
[WIP] TelemetryPolicy proposal
✅ Deploy Preview for kube-agentic-networking ready!
@gkhom , can you fix this? |
|
/ok-to-test |
|
/easycla |
|
/check-cla |
@gkhom Thanks for kicking this off. I left some clarifying questions and proposed a minor change. The bigger part of the review is surfacing work already proposed or underway in the llm-d community with regard to tracing, and whether it is applicable to our objectives.
| This proposal introduces the `TelemetryPolicy`, a direct policy attachment designed to configure observability signals (metrics, logs, traces) | ||
| for Gateway API resources (via `Gateway` attachment) and Service Mesh resources (via `namespace` attachment). | ||
|
|
||
| This K8s API standardizes how users enable and configure telemetry across different data plane implementations, replacing vendor-specific CRDs |
I am (acting as) a naive reader, and I was immediately curious what some examples of these vendor-specific CRDs are. This also ties to and might clarify the below mention of "Observability lock-in".
Examples of such CRDs are:
- Istio's Telemetry CRD
- Envoy Gateway's EnvoyProxy and EnvoyGateway CRDs
- Kong's MeshMetrics/MeshTrace/MeshAccessLog
- Kuadrant's TelemetryPolicy
I intend to write a section that compares such existing APIs and the proposed TelemetryPolicy in the eventual proposal.
Seeing as there's a mix of examples here, will the scope cover one resource for all of the signals (metrics, logs, traces) vs. separate ones? Are there tradeoffs to consider here?
I'm leaning towards one resource for all. The argument for splitting them might be that different personas are involved in configuring the different aspects of observability. In practice, I think that the persona that configures metrics, likely also configures tracing and access logs. So to avoid complicating the API with three additional resources, it seems worthwhile to put all of it in a single resource.
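To make the single-resource shape concrete, here is a rough sketch of what one `TelemetryPolicy` covering all three signals could look like (all field names are illustrative, not part of any agreed spec):

```yaml
apiVersion: gateway.networking.k8s.io/v1alpha1
kind: TelemetryPolicy
metadata:
  name: unified-telemetry
  namespace: prod-ns
spec:
  targetRefs:
    - kind: Gateway
      name: agent-gateway        # hypothetical target
  metrics:                       # one stanza per signal, all in one resource
    provider:
      type: Prometheus
  tracing:
    provider:
      type: OTLP
      endpoint: "otel-collector.monitoring.svc:4317"
  accessLogging:
    format: JSON
```

A single resource keeps the persona story simple: one owner, one attachment, three signals.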
| * **Cost is Volatile**: Usage is measured in tokens, not just requests. A single HTTP 200 OK could cost $0.01 or $10.00 depending on the prompt and model used. | ||
| * **Context is King**: Debugging requires knowing the semantic context: Which Model? Which Prompt? Which tool? | ||
|
|
||
| Existing telemetry policies are unaware of the Generative AI semantic conventions. They see an opaque TCP stream or HTTP POST. Without a standardized API to |
In line with the header, may I suggest adding "unaware of the emerging Generative AI semantic conventions"?
|
|
||
| 1. **Standardization**: A single API for Gateway and Mesh to configure Access Logging, Metrics generation, and Tracing propagation. | ||
| 2. **GEP-713 Compliance**: Support `targetRef` attachment to `Gateway` and `Namespace`. The latter covers Mesh use-cases. | ||
| 3. **Agentic Support**: Enable the capture of OpenTelemetry GenAI Semantic Conventions and support the requirements of PR #33. |
In the spirit of standardization and not reinventing the wheel, I wanted to mention that the llm-d community is already moving on tracing + OTel + GenAI Semantic Conventions. In particular, Sally O'Malley from Red Hat proposed and did a POC for distributed tracing in llm-d. [aside: I learned about this work from Sally on another community call for our kagenti project]
This may be applicable here for a few reasons:
- We are keen on integrating OTel and GenAI semantic conventions, too
- One of our objectives is a single API for Gateways and Meshes, and Sally's POC has already landed some changes to support tracing to the Gateway API Inference Extension (GAIE) components like the endpoint pickers (proposal comment, GAIE PR).
While llm-d is focused on distributed LLM inferencing regardless of source (i.e., user chat -> LLM vs agent -> LLM), I think it's worth considering any lessons they may have already encountered and API definitions that could overlap with our case, at the very least at the Gateway level. I'd be willing to evangelize our thinking to Sally to get her thoughts, but more importantly curious on our interest level.
It would certainly be valuable to get some of their insights and experiences. The proposal seems to cover configuration through environment variables, have they defined CRDs as well?
No, I did not see any CRD definitions. I'll keep this thread in mind as the definitions become more concrete.
| for Gateway API resources (via `Gateway` attachment) and Service Mesh resources (via `namespace` attachment). | ||
|
|
||
| This K8s API standardizes how users enable and configure telemetry across different data plane implementations, replacing vendor-specific CRDs | ||
| with a unified, portable spec. |
Would you see implementations reconciling the TelemetryPolicy and reading the bits that are relevant to their components? So multiple controllers read the CR and take actions to enable telemetry across the components they are controlling?
It is indeed possible to distribute the responsibility across multiple controllers, it's up to the implementation. In most cases that I'm familiar with a single controller/control plane programs all three observability features (metrics, traces, logs).
Possible but a little challenging. What are the cases where we'd see multiple implementations reconcile the same resource?
|
I'm generally in favour of this proposal. Perhaps more relevant when it comes to the specification, it would be good to know more about the current 'state of the art' in this space. |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: david-martin, gkhom. |
@gkhom , the PR description needs to be updated since the API specification is included in the PR now. |
evaline-ju
left a comment
Generally looking good to me - a couple lingering questions
| namespace: prod-ns | ||
| spec: | ||
| # GEP-713 Attachment | ||
| targetRef: |
Looking at the GEP-713 description, will we allow multiple attachments (targetRefs)? Perhaps with the "what namespace targets?" comment below it'd be good to understand the precedence for multiple policy (gateway vs. namespace policy) resolution more clearly. Are they to be non-overlapping?
similarly I assume policy status will be included?
The intent is indeed to allow multiple attachments. A single TelemetryPolicy can be used to configure for multiple namespaces and/or gateways. I will fix the mistake in the example to state targetRefs.
Regarding multiple policies, I was thinking that only a single TelemetryPolicy is allowed to target a specific resource. A TelemetryPolicy that targets a namespace or Gateway that is already targeted by another TelemetryPolicy should be rejected.
Regarding status, it will be included. Does it need to be mentioned explicitly in the spec?
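For illustration, the corrected example with plural `targetRefs` and an explicit status stanza might look like this (the status shape here is a sketch loosely following GEP-713 conventions, not a final definition):

```yaml
apiVersion: gateway.networking.k8s.io/v1alpha1
kind: TelemetryPolicy
metadata:
  name: telemetry
  namespace: prod-ns
spec:
  targetRefs:                  # plural: multiple attachments allowed
    - kind: Gateway
      name: agent-gateway
    - kind: Namespace
      name: prod-ns
status:
  conditions:                  # per-policy acceptance reporting
    - type: Accepted
      status: "True"
      reason: Accepted
```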
I personally was not sure if status was going to be unique enough to call out here, but seems @guicassolato also called out the status stanza here: #69 (comment) so including it explicitly will likely avoid further confusion
Co-authored-by: Evaline Ju <69598118+evaline-ju@users.noreply.github.com>
I think overall this is a strong proposal with a compelling use case and well-designed API.
My main point of concern is whether it has enough to make it specific to the agentic use case, or whether it could better serve upstream Gateway API as a whole.
| Following [GEP-713](https://gateway-api.sigs.k8s.io/geps/gep-713/), the `TelemetryPolicy` supports the following attachments: | ||
|
|
||
| 1. **Gateway (Instance Scope)**: Configures the telemetry for a specific `Gateway`. | ||
| 2. **Namespace (Mesh Scope)**: Configures the telemetry for all mesh proxies (sidecar proxy / node proxy / etc.) in that namespace. |
Defining what namespace targeting means and even if that definition points to mesh use case only is fine. But it has to be more than implied IMO. It has to be by design and well specified/documented, so all implementations of the API will commit to the same meaning and behaviour.
(I believe that's what @gkhom has in mind, but good to spell it out, I think.)
On a side note, if namespace targeting is for the mesh use case, have you considered the Mesh kind (and avoid any possible confusion altogether)?
| provider: | ||
| type: Prometheus | ||
| overrides: | ||
| - name: "gateway.networking.k8s.io/http/request_count" |
Are these metric names standardised somehow or the alignment between who owns the policy and who consumes the metrics is something we expect to happen but out of scope of the proposal?
This specific metric name is just an arbitrary example. The closest thing to a standard for metric names would be OTel's semantic conventions. The alignment between policy owner and metric consumer is indeed out-of-scope (but I'm happy to try to include it if needed).
nitpick, but can you use an obviously dummy name to avoid making it seem like it's a real proposal? Something like example.com/http/request_count would be fine.
|
|
||
| type TelemetryPolicySpec struct { | ||
| // Identifies the target resources (Gateway or Namespace) to which this policy attaches (GEP-713). | ||
| TargetRefs []TargetReference `json:"targetRefs"` |
We could probably reuse here one of the Gateway API types from https://github.com/kubernetes-sigs/gateway-api/blob/main/apis/v1/policy_types.go.
I wonder, for example, if namespace targeting should be allowed only in the same namespace (LocalPolicyTargetReference) or "at distance" leveraging ReferenceGrants.
I'm leaning towards NamespacedPolicyTargetReference to allow cross-namespace configuration using ReferenceGrants. This would allow central management of uniform telemetry configuration.
Allowing cross-namespace references entirely violates namespace boundaries: namespace X should definitely not be able to modify namespace Y's configuration.
If we want uniform management then IMO the correct way to do this would be to modify a global object, probably GatewayClass / Mesh. Note this is still a bit awkward since you need a namespaced-resource to modify a cluster-scoped resource, so would need some additional layer of policy like only allowing it from some trusted admin namespace (Istio has this concept as "root namespace").
This hasn't been done in the Gateway API space so would be novel
I think we should definitely not start with allowing cross-ns.
Regarding if and how to allow it, I also generally don't like the idea of controlling telemetry endpoints from another namespace. We could adopt either
(a) a "magic namespace" approach, as many other implementations do, where policies in this namespace have global scope, OR
(b) a ClusterTelemetryPolicy resource that has global scope
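As a rough illustration, option (b) could look like this (the resource name and shape are purely hypothetical):

```yaml
apiVersion: gateway.networking.k8s.io/v1alpha1
kind: ClusterTelemetryPolicy   # cluster-scoped: no metadata.namespace
metadata:
  name: org-wide-telemetry
spec:
  tracing:
    provider:
      type: OTLP
      endpoint: "otel-collector.monitoring.svc:4317"
```

A cluster-scoped resource sidesteps the "namespaced resource modifying other namespaces" problem, at the cost of needing separate RBAC for cluster admins.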
| format: JSON | ||
| matches: # Conditional logging | ||
| - path: "/api/v1/sensitive" | ||
| - cel: "response.code >= 500" # CEL-based filtering for errors |
Are we thinking the same variables described at https://github.com/kubernetes-sigs/kube-agentic-networking/blob/main/docs/proposals/0017-DynamicAuth.md#available-context-variables?
Good question, those should be included but we will likely need more. For example, a response object. I'll add a section to make it explicit.
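As a sketch of what such an expanded context could enable (variable names are assumptions, pending the promised section):

```yaml
accessLogging:
  matches:
    - cel: "response.code >= 500"                  # error-only logging
    - cel: "request.path.startsWith('/api/v1/')"   # assumes a request.path variable
```

`startsWith` is a standard CEL string function; whether `request` and `response` objects are exposed, and with which fields, would need to be specified.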
CEL is great for the long tail, but it's too slow and obscure as the primary method. It'll add 3-5x overhead to basic matching.
|
|
||
| ## The TelemetryPolicy Specification | ||
|
|
||
| We propose the `TelemetryPolicy` as a direct policy attachment in the `gateway.networking.k8s.io` API group. See [GEP-713](https://gateway-api.sigs.k8s.io/geps/gep-713/#classes-of-policies) for more information on Direct attachment. |
Should it belong to the agentic API group or maybe start as such while it's proposed in the scope of the subproject?
| This proposal introduces the `TelemetryPolicy`, a direct policy attachment designed to configure observability signals (metrics, logs, traces) | ||
| for Gateway API resources (via `Gateway` attachment) and Service Mesh resources (via `namespace` attachment). |
I think it's OK to call it a "Direct Policy" while Gateway and Namespace as supported target kinds are for two completely disjoint use cases – ingress and mesh.
By definition, it's only direct when a single kind is supported in `spec.targetRefs.kind`.
Would it be more accurate to call it "inherited policy"?
| } | ||
|
|
||
| type TracingProvider struct { | ||
| Type string `json:"type"` |
Would this be an enum or completely free for the implementations to define? Maybe some Core and Extended levels?
| Following [GEP-713](https://gateway-api.sigs.k8s.io/geps/gep-713/), the `TelemetryPolicy` supports the following attachments: | ||
|
|
||
| 1. **Gateway (Instance Scope)**: Configures the telemetry for a specific `Gateway`. | ||
| 2. **Namespace (Mesh Scope)**: Configures the telemetry for all mesh proxies (sidecar proxy / node proxy / etc.) in that namespace. |
Although implied when calling it "direct policy", I think it may useful to add a note about the merge semantics, which I imagine will be the None one. I.e., describe the behaviour of what happens when 2 policy resources of this kind target the same object (same gateway or same namespace).
| metav1.ObjectMeta `json:"metadata,omitempty"` | ||
|
|
||
| Spec TelemetryPolicySpec `json:"spec"` | ||
| } |
Should probably define a status stanza too. E.g.: https://github.com/kubernetes-sigs/kube-agentic-networking/blob/main/docs/proposals/0017-DynamicAuth.md#extended-accesspolicy-crd
Thanks, I've added the status stanza.
Is the detailed explanation/comment in the *PolicyStatus Go struct required in every proposal?
| type: Counter | ||
| dimensions: # Custom labels/dimensions | ||
| - key: "model_id" | ||
| fromHeader: "x-model-id" # Crucial for Agentic workloads |
Any other possible sources in mind? E.g. fromMetadata?
Indeed, fromMetadata is an example. Potentially an advanced fromPayload might be worth considering in the future.
howardjohn
left a comment
I really think we should not do this.
- I don't think "Kubernetes Agentic Networking" is an appropriate venue to define a general observability API. This has absolutely nothing to do with agentic networking specifically, and it feels like this is being used to "smuggle in" a standard through a low-barrier project so that it can gain momentum and face less scrutiny (just to be 100% clear: I am not accusing you of intentionally doing this, just that I think this is the likely result in practice).
- OTel has proven a few things, IMO: (1) making telemetry standards is really hard; (2) if you don't do it right, you won't make a standard but instead yet-another-non-standard (see all the pre-OTel 'standards' that are now obsolete); (3) defining an API like this is hard, and not something that can/should be done lightly (hence why there is not only one).
- I don't think this is really as portable as it seems on paper.
I'd like to explore a simple alternative: "implementations of agentic networking should implement the OTEL specification around traces, metrics, and logs. Implementations may provide ways to enable or disable these as needed".
I believe this will meet all of the use cases, without the daunting task of defining a new telemetry standard. From a user POV they just want these emitted from their agentic network; we do not need a new CRD to achieve that
| - attributeName: "env" | ||
| literalValue: "production" | ||
|
|
||
| # 2. Metrics Configuration |
I think its questionable how portable dynamic configuration of metrics is. I've only seen Envoy do this, and even then its extremely bug ridden historically. Its very confusing for users what the semantics are, or should be, to add or remove labels from metrics. Or even adding a metric -- imagine I have a dashboard like request_count/error_count but I added error_count after request_count so its all out of whack...
|
|
||
| type MatchCondition struct { | ||
| // Path allows filtering to specific paths. | ||
| Path string `json:"path,omitempty"` |
why bother having this if we already have CEL and its trivial to express in CEL?
The way I understand this is that "Path" is a (convenient?) simpler shortcut for just logging vs expressing this in CEL. Probably if we were to leave it we should do it as a oneof field.
But it makes sense to remove and start with an API without it.
CEL is an order of magnitude slower than native code... Even if you use CEL you'd have to add a bunch of variants (query stripped, normalized, escaped, raw path) which will make it more confusing than having a structured condition.
Maybe in some implementations. Should a general-purpose, vendor-agnostic API switch to a suboptimal user experience (let's assume here that CEL is the preferred UX, since if it were both a bad UX and bad performance it would obviously be a bad choice) just because some implementations do not implement it optimally?
We are handling these at ~native speeds in our implementation.
Even if you use CEL you'd have to add a bunch of variants (query stripped, normalized, escaped, raw path)
This is exactly why I would prefer CEL personally, so we don't have to make our YAML api have 5 variations just for path -- not to mention headers, query params, cookies we will have like 25 fields just to match headers
CEL is an interpreted language by design, so unless you're carving out a specific subset (or breaking some semantics) - it cannot match native speeds. In either of those cases, that is not vendor neutral, there's no guarantee that the "fast" subset of CEL is the same across implementations.
YAML gives that for free. Every implementation will be similarly efficient, and all the "meta" tooling will support it without having to compile/run CEL.
FWIW CEL is used throughout the other APIs in this repo already. IMO this project needs to decide on CEL or not, so we don't have to have this debate on each field. Its not good for users to have 50% of fields (that would make sense to use CEL) use CEL and 50% don't just because of who made the API, and its not great for us to have to debate it on each field usage.
Sure, as someone who worked on CEL C++ runtime and made it open source - I'm telling you CEL is not the right paradigm for the "95%" of data path cases. It's too slow, and it prevents meta-tooling (management and control planes) from analyzing the semantics. It makes perfect sense as an "extensible context-aware condition" (e.g. like Wasm) for the rest 5% of tail-end bespoke cases, but you wouldn't use CEL for label selectors, would you?
| * **Custom Attributes**: Enables the extraction of specific headers and proxy metadata into log entries. | ||
| * **Sinks**: Defaults to standard container logging (stdout) with extensibility for OTLP or external ports. | ||
|
|
||
| ## Comparison with Prior Art |
N.B that all of these are wrappers around Envoy
Good point.
Though the Kuadrant TelemetryPolicy referenced is an API to supplement metrics coming from an internal component (limitador) that envoy filters to, rather than envoy itself.
@howardjohn - can you offer other prior art that does not rely on Envoy? I recall seeing a comment here or in Slack welcoming help/contributions to the prior-art section.
Yeah, I can see where you are coming from. Others have raised similar concerns about there not being sufficient AI focused content (e.g. #69 (review)).
Agreed on 1.
Do you see this being a central place to enable/disable or distributed? |
I 100% agree as I tried to convey, I don't think it is intended but I think its the accidental result. Which is a common pattern to accidentally happen (I've seen a lot of bad APIs unintentionally brought in because of last-minute-release rush, for instance!).
I don't really see the need for a CRD to do this. I struggle to believe that users are going to get more benefit from a CRD controlling a knob than just
FWIW otel already has standards for this: https://opentelemetry.io/docs/specs/otel/configuration/sdk-environment-variables/#general-sdk-configuration. I 100% agree env vars are an awkward standard in this context though, just something to take into account. FWIW, I hate to be in the position where I say 'we should not do this'. However, I think few people are willing to do it, since it feels bad and un-collaborative, and its easier to just debate whether specific aspects of an API are good or bad or should change than to step back, so wanted to make sure I at least expressed this |
As this is my first ever contribution to the CNCF/K8s ecosystem, my intention is definitely not to bypass scrutiny or "smuggle" anything in. I opened this proposal here because I wanted to solve a real problem I've encountered (on several occasions) and this seemed like a reasonable starting point.
I definitely want to avoid creating a fragmented or obsolete standard. Could you point me to some of the pre-OTel standardization attempts you mentioned? I'd love to understand where they fell short in defining a configuration API.
I recognize that standardizing this API carries significant complexity. Could you elaborate on the specific technical hurdles or roadblocks you envision? I'd really appreciate your perspective on what makes this intractable.
Could you provide some examples or edge cases where you see portability breaking down in this design? I'd like to see if we can address those gaps.
This would work as a baseline, but my concern is that it will fall short when users need to customize their telemetry (which is fairly common in practice). In absence of a standard, users will inevitably have to resort to vendor-specific APIs for any customization. Because a lot of operational machinery gets built around telemetry pipelines, relying on vendor-specific APIs for customization (re-)introduces the risk of vendor lock-in and becomes a blocker to portability. I think this is especially critical for agentic networking because it deals with autonomous tool execution. Telemetry here isn't just about troubleshooting and performance. It's a core requirement for operations, security, and auditability. If telemetry configuration is left entirely to vendor-specific implementations, it becomes nearly impossible for platform teams to guarantee a consistent audit trail across heterogeneous environments. |
Without picking on any specific comment in the thread: the overarching goal here, and of most of our standardization attempts (within the Gateway ecosystem at least), is to enable users to define their networking functionality through portable APIs. Of course we should not try to cover all cases; some cases are not worth standardizing, often because they are very implementation-specific or not portable. However, here and in many prior discussions, we've tried to follow the general rule of enabling at least ~80% of users entirely through portable APIs (the remaining ~20% will have their own, perhaps non-portable, unique cases). It's true that we don't have any Telemetry/Observability API within Gateway, but I don't recall seeing any attempt to create one. I believe a lot of that is because there were (and perhaps still are) higher priorities to standardize first. (It's important to mention that, despite this, lots of vendors did put effort into their own APIs for this, meaning the demand was there.) What's more important, though, is that in the case of Agentic Networking, telemetry becomes one of the higher priorities. In fact, some users would only want an Agent Gateway / some form of PEPs just for telemetry/audit. Therefore I see a ton of value in standardizing this, with additional knobs beyond enable/disable. I also believe there is a ton of value in such an API existing in Gateway directly, and I am very supportive of bringing this back to Gateway when we all feel it's in good shape for doing so. @howardjohn It would be good if we can get some more concrete feedback on the specific challenges and try to think through how to overcome them. |
|
Not my full answer, will think more and respond but wanted to point to kubernetes-sigs/gateway-api#554 |
howardjohn
left a comment
Looking through the lens of "what would a smaller but still powerful API look like":
targetRefs: duh, we need this. But do we need Namespace? Or just Mesh for the Mesh case? Or not worry about Mesh for now.
tracing:
- Do we need to support anything other than OTLP? OTLP won; it's the de facto and de jure standard. Anything else is probably a waste of time, at least in the core API.
- TracingProvider.endpoint should be a backendRef though
- Sampling is perhaps too simplistic -- will discuss more in a later comment as I need to run.
- context - do we need anything other than w3c? Its 2026.
- CustomAttribute: no doubt these are valuable but if we can only have literalValue its not. These only really provide value if we allow an ~expression language. Or a few wellknown attributes; I would say "just always include these" but for agentic use cases specifically this could include the entire prompt/completion which could warrant first class 'includeXYZ' style API.
metrics: I think we just drop it entirely. TBH. Its pretty problematic to dynamically enable and disable metrics, especially on a per-metric or per-label-value basis. If we are talking prometheus there is also no config. If we need to support OTLP then just have backendRef.
- Definitely recommend dropping metrics override which is non-standard, hard to use, and hard to implement.
Logs:
- Notably missing otlp. Should it be? I know I said I was trying to reduce the scope :-) but it matters and influences other fields.
- Format -- not clear we need this. If its otlp, there is no format. If not, the proxy probably either uses structured logging and or does not, should a user really define that?
- match: nice to have but not strictly required. If we are going to do it, and are fine with CEL, just fully embrace CEL and don't have any other match types. If not, read https://blog.howardjohn.info/posts/cel-is-good/#cel-alternatives and think through whether a non-CEL based approach can make an acceptable user experience.
- fields: I don't think this can possibly just be a simple list of strings, and not a k/v pair, so hard to concretely discuss it but probably should be the same as CustomAttribute in tracing and the same concerns there apply.
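Pulling those suggestions together, a minimal shape might look roughly like this (purely a sketch of the reduced surface under the assumptions above, not a counter-proposal; all field names are hypothetical):

```yaml
spec:
  targetRefs:
    - kind: Gateway
      name: agent-gateway
  tracing:
    backendRef:                     # OTLP-only; endpoint expressed as a backendRef
      name: otel-collector
      namespace: monitoring         # cross-namespace would need a ReferenceGrant
      port: 4317
    sampling:
      rate: "10%"
    includePromptAttributes: true   # hypothetical first-class GenAI toggle
  logging:
    backendRef:
      name: otel-collector
      namespace: monitoring
      port: 4317
    matches:
      - cel: "response.code >= 500" # CEL as the single match mechanism
```

Notably absent: metrics configuration, tracing context selection (W3C assumed), and log format (OTLP implies structure).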
| provider: | ||
| type: OTLP # or implementation-specific | ||
| endpoint: "otel-collector.monitoring.svc:4317" | ||
| samplingRate: |
I think you need to explain how sampling of traces work. Is this respecting the existing "sampling" decisions? Is this for requests without an existing context?
Added a brief explanation. This is the base sampling rate across all requests. The optional parentBasedSampling config allows for a distinct sampling rate specifically for requests that are already part of a trace.
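In other words, something like this (field names illustrative):

```yaml
tracing:
  samplingRate: "1%"       # base rate, applied to requests with no incoming trace context
  parentBasedSampling:
    rate: "100%"           # distinct rate for requests already part of a sampled trace
```

This keeps head-based sampling simple while still honoring upstream sampling decisions when a parent span is present.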
| matches: # Conditional logging | ||
| - cel: "response.code >= 500" # CEL-based filtering for errors |
If we only want CEL its a bit awkward to have a list. But may make sense if we have non-CEL
| fields: # Configure specific fields to include | ||
| - "start_time" | ||
| - "response_code" | ||
| - "x-token-usage" |
is there any definition of what these fields mean?
For a concrete example, say I want to log the MCP task name (I chose this since its not in https://opentelemetry.io/docs/specs/semconv/gen-ai/mcp/). Can I do it? what do I put as the field if I want to?
| type Dimension struct { | ||
| Key string `json:"key"` | ||
| FromHeader string `json:"fromHeader,omitempty"` | ||
| } |
The 3 APIs all have the same property of "add a K/V" pair but do so in 3 different ways. Does it make sense? Should we be more consistent in them?
It seems odd that:
- tracing: literal only
- metrics: header only
- log: a name only without a value
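One way to make the three consistent would be a shared key/value-source union reused by tracing attributes, metric dimensions, and log fields alike (all names here are hypothetical):

```yaml
# shared shape, usable in tracing.customAttributes,
# metrics.dimensions, and accessLogging.fields
- key: "model_id"
  fromHeader: "x-model-id"       # exactly one value source set per entry
- key: "env"
  literalValue: "production"
- key: "route"
  fromMetadata: "route.name"     # hypothetical proxy-metadata source
```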
|
Can we move this discussion to Gateway API as a provisional GEP first, and then experimental? I think a lot of the discussion here is happening on a context that is important not only for agentic, but for the whole Gateway API ecosystem and I wouldn't like to receive this proposal as "we discussed and approved on agentic and now this needs to be implemented this way on Gateway API". Thanks! |
|
Seconding @rikatz's comment – this seems applicable to much more than just the agentic world, and I'd love to get eyes on it from Gateway API. Thanks!! 🙂 |
What type of PR is this?
/kind documentation
What this PR does / why we need it:
This PR contains a proposal for a new `TelemetryPolicy` API. This K8s API aims to standardize how users enable and configure telemetry (metrics, logs, traces) across different data plane implementations, replacing vendor-specific CRDs with a unified, portable spec.

Which issue(s) this PR fixes:
Fixes #
Does this PR introduce a user-facing change?: