Enhance the evaluation rules of AccessPolicy objects#220
Enhance the evaluation rules of AccessPolicy objects#220haiyanmeng wants to merge 6 commits intokubernetes-sigs:mainfrom
Conversation
✅ Deploy Preview for kube-agentic-networking ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: haiyanmeng The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
| } | ||
| ``` | ||
|
|
||
| In addition, to make it easy for users to understand why a tool access is allowed or denied, we will disallow the combination of `InlineTools` and `ExternalAuth` in the same `AccessPolicy`. |
There was a problem hiding this comment.
Another reason for this is to keep the complexity of the API under control and make it easier for vendors to implement this.
If folk are okay with this restriction, I will add the proposed API change.
There was a problem hiding this comment.
What's the difference between combining InlineTools and ExternalAuth in the same AccessPolicy versus in different ones? I could argue that, if you have both options, it's easier to manage when they are declared in the same object.
Additionally, should we accept that ExternalAuth is only for when targeting a Gateway now? If we want Backend-level ExternalAuth to trigger first and not to mix with Backend-level InlineTools, then why not setting ExternalAuth up in the HTTPRoute instead? (I can only think of one reason that is probably not a very common use case: triggering multiple callouts.)
@guicassolato , do these two changes make sense to you? |
|
It's a bank holiday where I live so I will only be back on Tuesday. Therefore, I couldn't give this PR the thorough review it deserves yet, but here are my 2¢ based on what I saw at a glance:
As I mentioned on Slack, I think it's good a thing that we're trying to fill some definition gaps. My first impression nevertheless was that some of the new definitions proposed here feel like we may be overthinking it a little. This only means I think we should give stakeholders a chance of forming a second impression, hence my suggestion to hold at least a second round of discussion and my commitment to review the PR more thoroughly, trying to better understand the other POVs. |
Under that framework, don't you have the inverse problem where a rogue policy can block traffic to ~any endpoint? Regardless, if I understand the table correctly, the semantics here are the same as Istio's which means DENY rules override any allow rules. So this scenario couldn't even happen. |
Yes, and this is a good thing. That’s what meant by the safe approach. |
9415bde to
fbea6f7
Compare
be6c0c8 to
d392ed8
Compare
The original evaluation rules had several issues:
1. When an object is targeted by multiple InlineTools AccessPolicy
objects, the request is only allowed if all the AccessPolicy objects
allow the request.
* Expected behavior: the request should be allowed if any of the
AccessPolicy objects allows the request.
* Rationale: Be consistent with prior art like Istio.
2. The evaluation does not specify the evaluation order when both
InlineTools and ExternalAuth rules exist for a target object.
* Expected behavior: When both InlineTools and ExternalAuth rules
exist for a target object, ExternalAuth rules should be evaluated
first.
* Rationale: efficiency + be consistent with prior art like Istio.
In addition, to make it easy for users to understand why a tool access
is allowed or denied, we will disallow the combination of InlineTools
and ExternalAuth in the same AccessPolicy.
d392ed8 to
b1afab1
Compare
Just to make sure I understand the conversation here, let me provide an example: policy 1:
policy 2:
Taking this guideline literally, it seems like foo would not be able to talk to github or gitlab. That's because although policy 1 allows foo to access github, it does not allow foo to access gitlab. With a similar situation for policy 2. Is that what you're intending to say @guicassolato? I'm assuming I'm misunderstanding here. |
No, that is not what I intend to say. Let’s begin with: Anybody who has the authority to attach a policy at any point in the request path shall have this policy honoured. Are What’s more interesting about this silly and obvious example though is to think what would happen if the distinction between Say, for example, we didn’t have different HTTPRoutes or different Backends for each host ( Once this mistake is noticed, the two policy owners can talk to each other. They are both attaching policies to the same gateway, they probably work on the same team, maybe they are even the same person. They will realise the policy should be better written as one: A policy only speaks to what it says, within the jurisdiction it governs. It cannot be read as prohibiting what it simply does not address. Give it enough context so it can exert its power strictly according to what it says and no more. And when two policy owners share jurisdiction over the same point, they bear the responsibility of writing consistent, non-conflicting rules, taking into account the information they know to be available at the time to enforce the policy. |
|
I'd be interested to see if there was prior art supporting that pattern. a quick search shows the Istio Authz, k8s NetPol, GCP iam, AWS iam, and OPA all have the "any policy should allow". I actually could not find a single example behaving differently. |
I'd be very concerned with the approach. This will likely result in policy authors to create policies that deny previously allowed-by-policy traffic policies. Both Gateways and Backends are namespaced policies. We generally are not concerned by accidentally granting broad access that another policy is trying to restrict. While its not always the case, we technically dont assume different owners for objects within a namespace. So for the case where
Regarding this comment, I dont necessarily think thats true 100% of the time. If the policy author were to own |
|
It looks like a majority is in the making in favour of "a single ALLOW suffices". I stand firm with my opinion that, from a security POV, this is a bad design. It is the perfect example of the classical metaphor of chains that break at the weakest link. (A poor ALLOW breaks all strong DENIES.) But that's fine, as not all decisions have to be a consensus.
@howardjohn, I respect the decision of other policy engines which may have decided to do it this way. However, it doesn't change the fact that this is not a best practice.
@LiorLieberman, if it can be represented as a Gateway (ingress or egress), as an xRoute, as a Backend, or as a named section of any of the above, then it can be an isolated attachment point 100% of the times. You don't have to own What you're arguing is that there's a generalisation that includes other L7 attributes beyond what can be represented as an attachment point with the Kubernetes resources we know. Although IMO this is not the case of I don't see how this changes the fact that conflicting policies should never exist though. And when it's not possible to guarantee that by availability of having enough points of attachment to isolate policies to their own request path, policy owners need to talk to each other and just write better policies. |
I am not what I would consider an authoritative expert on security or authorization policy design. I don't think many (if not anyone) involved is (though there may be some, including you!). That is why I lean heavily on prior art. If every* existing system is doing it one way, and you propose an alternative, how can I as a non-expert weigh those? Without more evidence explaining why every other system is wrong its hard to accept a proposal that differs away from every other existing authorization service. Even if we assume the propose approach is better, given how prevalent the other approaches are, we would probably want to be very confident that its much better, as there is always a risk in diverging from established practices, even when its a theoretically better approach (see Haskell vs Go as a silly example). * I doubt every is true, if we find counter examples would be great |
|
I see your point and I respect that. People who know me also know I don't fight to get things my way. I fight for democratic and informed decisions. So, in the spirit of clarity and making sure we're making an informed decision, here's what's at stake IIUC: Given:
An agent sends a request to Most people seem to be converging to this request should be allowed. The OASIS XACML standard calls these two opposing behaviours, respectively, permit-overrides and deny-overrides. Depending on how ext_authz behaves, the example above can be no different than:
A naive ext_authz implementation of Policy B could be (in OPA): Once again, a request to
By permit-overrides, Policy A wins; by deny-overrides, the winner is Policy B. wrt prior art: There are a few policy engines that implement permit-overrides. Kubernetes RBAC and NetworkPolicy are probably the two best examples in the wild. By being additive and not supporting DENY rules, policies and role bindings can only expand permissions, never restrict. Therefore, you can short-circuit evaluation as soon as you find an ALLOW match. I obviously will not pretend to know more about Istio than you do, but I don't see AuthorizationPolicy in the same category. By evaluating CUSTOM, DENY and ALLOW in this order, wouldn't it make a case of deny-override? I am not a regular user of AWS IAM or GCP IAM, but my understanding is that they also implement deny-override. With AWS IAM, if policies that apply to a request include an ALLOW statement and a DENY statement, I believe the DENY statement will trump the ALLOW statement. Whereas in GCP there's this thing of denying a permission to a principal, effectively invalidating any roles the principal is bound to that otherwise would grant the principal with access. OPA is a policy language. (Well, Rego is.) OPA lets you express the logic yourself. E.g.: Kuadrant AuthPolicy implements deny-overrides strictly. I hope these count as counter-examples. Anyway, I think in the end we are discussing two mental models. The question is which default is safer for our threat model. In networking security, where the consequence of an erroneous ALLOW is a real security breach, deny-overrides is the standard industry recommendation, I reckon. That's at least my experience from working on software security with banks and in the payment processing industry in the past. (BTW, this is not me trying to play the card of "expert", because I certainly do not consider myself one either. These are just things I heard enough in my ears to continue echoing to this date.) That said, if we decide to go with permit-overrides instead, it will be consistent with other kinds of policy in Kubernetes. As long as it's an informed decision, I'd be OK with losing the battle. |
| - **[ExternalAuth-type](https://github.com/kubernetes-sigs/kube-agentic-networking/blob/cf8c85b85d067657a5dce7b87270f8099f1e302c/api/v0alpha0/accesspolicy_types.go#L169) Rules**: | ||
| - Evaluated first. | ||
| - If **any** ExternalAuth rule matching the request returns `deny`, the request is denied immediately. | ||
| - **[InlineTools-type](https://github.com/kubernetes-sigs/kube-agentic-networking/blob/cf8c85b85d067657a5dce7b87270f8099f1e302c/api/v0alpha0/accesspolicy_types.go#L165) Rules**: | ||
| - Evaluated only if all ExternalAuth rules (if any) allowed the request. | ||
| - If there are **no** InlineTools rules matching the request, the request is allowed. | ||
| - If **any** InlineTools rule matching the request exists and allows it, the request is allowed. | ||
| - Otherwise, the request is denied. |
There was a problem hiding this comment.
Although I'd still prefer the rules to be evaluated in the order they appear in the policy, I think this is an acceptable compromise if it's to keep the deny-overrides behaviour.
For the reference implementation:
I think this will require configuring two RBAC filters - before and after ExternalAuth. So the config goes:
- Gateway-level RBAC filter with shadow rules for ExternalAuth
- Gateway-level Ext_authz filter
- Gateway-level RBAC filter with normal rules for InlineTools
- Backend-level RBAC filter with shadow rules for ExternalAuth
- Backend-level Ext_authz filter
- Backend-level RBAC filter with normal rules for InlineTools
There was a problem hiding this comment.
I think this will require configuring two RBAC filters - before and after ExternalAuth. So the config goes:
- Gateway-level RBAC filter with shadow rules for ExternalAuth
- Gateway-level Ext_authz filter
- Gateway-level RBAC filter with normal rules for InlineTools
This looks right if there is only one Gateway-level ExternalAuth policy/rule.
If there are multiple Gateway-level ExternalAuth policies/rules, we may need multiple RABC filters for ExternalAuth. For example, if there are two Gateway-level ExternalAuth policies/rules, the HTTP filters in the reference implementation may look like:
Gateway-level RBAC filter with shadow rules for ExternalAuth-1
Gateway-level Ext_authz filter 1
Gateway-level RBAC filter with shadow rules for ExternalAuth-2
Gateway-level Ext_authz filter 2
Gateway-level RBAC filter with normal rules for InlineTools
Let me know if there are better ways to handle this in our reference implementation.
| - If **any** allows, proceed to Backend level. | ||
| - If **all** deny, the request is rejected immediately. |
There was a problem hiding this comment.
Given we only support implicit deny (by having no match), shouldn't it always proceed to Backend-level?
There was a problem hiding this comment.
If there is no InlineTools policies matching the request, it proceeds to Backend-level checks.
In this case (at line 157), there are two InlineTools policies matching the request: gateway-policy-inline-tools-1 and gateway-policy-inline-tools-2. Neither of the policies allow the request. Hence the request should be denied immediately, and the Backend-level checks should be skipped.
There was a problem hiding this comment.
What does "matching the request" mean? Is it the fact that the policies are attached to the gateway that is handling the request, the fact that source matches the request, or both?
Are you saying that, when a Gateway policy exists and it "matches the request", if at least one of the InlineTools rules in the policy doesn't allow, then there's no point in having Backend-level policies possibly allowing the request?
There was a problem hiding this comment.
Using an example:
kind: XAccessPolicy
spec:
targetRefs:
- kind: Gateway
rules:
- authorization:
type: InlineTools
tools:
- x
---
kind: XAccessPolicy
spec:
targetRefs:
- kind: XBackend
rules:
- authorization:
type: InlineTools
tools:
- ytools/call request sent to "y".
Should the request be allowed or denied?
There was a problem hiding this comment.
Using an example:
kind: XAccessPolicy spec: targetRefs: - kind: Gateway rules: - authorization: type: InlineTools tools: - x --- kind: XAccessPolicy spec: targetRefs: - kind: XBackend rules: - authorization: type: InlineTools tools: - y
tools/callrequest sent to"y".Should the request be allowed or denied?
The request should be denied following the rule If the request is denied at the gateway level, evaluation stops, and Backend-level policies are NOT evaluated.
There was a problem hiding this comment.
Except that the request is not explicitly denied at the gateway level. Rather, it's not allowed, which can then lead to implicitly denied, or alternatively could lead to deferred to backend level.
I wonder if an implicit deny at the gateway level when there are still other policies to check at backend level is the behaviour users expect.
This behaviour is equivalent to the atomic overrides merge strategy described by GEP-713. AFAIK, it has no counterpart in Istio, it's the exact opposite of the behaviour implemented by Envoy Gateway (atomic defaults), and exists in Kuadrant as an opt-in feature.
There was a problem hiding this comment.
I wonder if an implicit deny at the gateway level when there are still other policies to check at backend level is the behaviour users expect.
I agree that we need user feedback regarding this.
Before we have user feedback, let us stick to implicit deny. WDYT?
There was a problem hiding this comment.
I think the trickiest part here is distinguishing between an explicitly configured empty set tools: [] field versus an unset/omitted field (which is particularly annoying in Go implementations, and would necessitate using a pointer type).
IMO, strict hierarchical evaluation (Gateway before Backend) means:
tools: [y]on a Gateway-targeted policy evaluates to an explicit ALLOW for calling tooly, any applicable Backend-targeted policy is evaluated next.tools: [x]on a Gateway-targeted policy evaluates to an implicit "allow no tools except x" - an attempt to call toolyis not allowed, so any Backend-targeted policy is irrelevanttools: []on a Gateway-targeted policy implicitly allows no tool calls, any Backend-targeted policy is irrelevant
Critically, the "implicit allow nothing" is not actually modeled as a DENY which would halt further evaluation and reject the request, it's more like implicitly adding a higher-level default cluster-wide "allow nothing". In Istio this is actually created explicitly as an "allow-nothing" policy, whereas NetworkPolicy enforces this implicitly when an ALLOW rule is created (similar to what's being proposed here).
kind: XAccessPolicy
spec:
targetRefs:
- kind: Gateway
rules: []
^ (or similarly, the lack of any AccessPolicy targeting the Gateway) does not specify any InlineTools ALLOW rules, and thus I would expect no implicit deny should exist and any Backend-targeted should be evaluated. Backend-targeted policies would also be evaluated when a Gateway-targeted AccessPolicy does explicitly allow a requested tool.
This is a somewhat contrived case currently, but could be much more relevant if AccessPolicy eventually became an additional ruleset extension of AuthorizationPolicy (which would make broader L4/L7 rules omitting a tools stanza much more common) or added an audit/"shadow" mode (for logging/testing before enforcement) as below:
rules:
- authorization:
type: InlineTools
action: ALLOW
mode: AUDIT | ENFORCED (default when mode field is omitted)
tools:
- x
(I've always found AUDIT as a peer enum value alongside ALLOW and DENY as confusing, because the intent of the rule becomes less clear.)
If we add a DENY action, we should consider whether that would also trigger the implicit "allow nothing" layer or not (as a narrow DENY breaking all traffic might be unexpected).
| | Level | Target Resource | Precedence | | ||
| |-------|-----------------|------------| | ||
| | Highest | Route Rule (via `sectionName`) | 1 (Wins all) | | ||
| | High | `HTTPRoute` / `GRPCRoute` | 2 | | ||
| | Medium | `Gateway` Listener (via `sectionName`) | 3 | | ||
| | Lowest | `Gateway` (Entire resource) | 4 | |
There was a problem hiding this comment.
Exactly the same as Kuadrant AuthPolicy when the default merge strategy1 applies.
Footnotes
-
With the
overridesmerge strategy, it's the exact reverse. ↩
|
|
||
| One of the most important rules is how Linkerd handles multiple authentication requirements: | ||
|
|
||
| * OR (Between Policies): If you have two different AuthorizationPolicy objects targeting the same Server, the request is allowed if either one is satisfied. |
There was a problem hiding this comment.
Another example of permit-overrides behaviour, although it doesn't seem to support explicit deny – just like the other two examples we have (Kubernetes RBAC and NetworkPolicy).
cc @howardjohn
After giving a better read today, I think this is reasonable. I also realised there was some confusion before about what I said in reference to an I never intended to imply that 2 There's a difference between choosing to evaluate a rule that denies access (explicit deny) and skipping a rule that otherwise would allow access (implicit deny). A single allow rule that is skipped cannot implicitly deny access if there are other rules to evaluate. Only if all allow rules fail to match the conditions to be evaluated (i.e., are all skipped), we can enforce an implicit deny. The conflict I was referring to was between an Even the I already said that I can live with evaluating I guess my last remaining concern then is the disallowing of the combination of |
Rationale: InlineTools is an allowslist.
backend-level evaluation still happens.
| - **ExternalAuth Policies**: `gateway-policy-external-auth-1` and `gateway-policy-external-auth-2` are evaluated. | ||
| - If **any** denies, the request is rejected immediately. | ||
| - If **all** allow, proceed. | ||
| - *Note: Evaluation order among these policies does not matter.* |
There was a problem hiding this comment.
Not sure if this "does not matter" is meant to mean "in the way its specified" or "to users"?
I think its probably important to users the order; usually if they are chaining, they are using external auth not just to give a allow/deny but to augment the request with additional information.
However I agree its a very challenging problem to solve if we allow multiple across resources 🙂 just that I would expect that eventually we see demand for ordering
There was a problem hiding this comment.
I think its probably important to users the order; usually if they are chaining, they are using external auth not just to give a allow/deny but to augment the request with additional information.
Good point.
Before this PR, the ordering is: the policies with earlier creationTimestamp will be evaluated first. For the policies with the same creationTimestamp, the ones appearing first in alphabetical order by {namespace}/{name} will be evaluated first.
Maybe we can keep this ordering for both ExternalAuth and ALLOW/InlineTools policies (strictly speaking, the evaluation order of ALLOW policies really does not matter). WDYT?
There was a problem hiding this comment.
I actually noticed this reading through https://github.com/kubernetes-sigs/kube-agentic-networking/blob/main/docs/proposals/0059-AccessPolicyTargetRefs.md and felt the timestamp behavior was a bit unexpected - in Gateway API timestamps are mostly used to resolve conflicts during the evaluation of a resource by a controller (deciding whether to accept or reject it, such as if only a certain number of Gateway-targeted policies were allowed), and aren't typically relevant during evaluation/enforcement of a resource or policy by the dataplane at runtime.
First, gateway-policy-audit is evaluated (earlier creation timestamp).
Next, gateway-policy-region is evaluated.
I can see how, as @howardjohn mentions, mutations which augment a request could be relevant for subsequent policy decisions, but I think a more explicit ordered configuration would be preferable there, and I'm not sure if the mutating scope extends beyond current expectations for AccessPolicy and starts looking more like https://github.com/kubernetes-sigs/wg-ai-gateway/blob/main/proposals/7-payload-processing.md, which does have specific expectations around ordering and request mutation?
There was a problem hiding this comment.
@mikemorris , thanks for the feedback. Folk seems in favor of limiting the number of ExternalAuth policies/rules per target to 1 (see the GH discussion).
I will update this PR after we finalize the GH dicussion regarding this.
|
This PR is solving real problems:
But we’re at a foundational layer. Decisions made here will:
Recommendation Before Merge
If we get this right now, this becomes a reference model for agentic policy in Kubernetes. If we rush it, we’ll be revisiting this with breaking changes in a few releases. |
What type of PR is this?
/kind feature
What this PR does / why we need it:
The original evaluation rules had several issues:
When an object is targeted by multiple InlineTools AccessPolicy objects, the request is only allowed if all the AccessPolicy objects allow the request.
Expected behavior: the request should be allowed if any of the AccessPolicy objects allows the request.
Rationale: Be consistent with prior art like Istio.
The evaluation does not specify the evaluation order when both InlineTools and ExternalAuth rules exist for a target object.
Expected behavior: When both InlineTools and ExternalAuth rules exist for a target object, ExternalAuth rules should be evaluated first.
Rationale: efficiency + be consistent with prior art like Istio.
In addition, to make it easy for users to understand why a tool access is allowed or denied, we will disallow the combination of InlineTools and ExternalAuth in the same AccessPolicy.
Which issue(s) this PR fixes:
Addresses #217
Does this PR introduce a user-facing change?: