chore: add storage capacity for templated nodes and update labels #505
Draft
landreasyan wants to merge 1 commit into
local-csi-driver Capacity-Template Controller
Scenario
The Kubernetes Cluster Autoscaler (CA) decides whether scaling up a node group
will help a pending pod by simulating scheduling onto a template node
synthesised from an existing node in that group. When a pod's PVC uses a CSI
driver whose CSIStorageCapacity objects are node-specific (e.g. selected by
kubernetes.io/hostname), no capacity object exists for the simulated template
node, the scheduler's storage-capacity check fails, and CA refuses to scale up
the group. This is upstream issue kubernetes/autoscaler#9700.
The local-csi-driver runs the external-provisioner sidecar in node-local mode
(--node-deployment, with Topology=true and --strict-topology). Capacity
tracking is enabled by --enable-capacity on the same sidecar; the
combination of strict per-node topology and capacity tracking causes each
DaemonSet pod to publish a CSIStorageCapacity keyed by hostname for
accurate per-node accounting. That accuracy is exactly what breaks CA's
simulation - the simulated template node has a synthetic hostname for which
no capacity object exists.
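The failure mode can be illustrated with a small sketch. It models capacity objects as plain matchLabels selectors and checks whether any of them selects a given node; the node names and the synthetic template hostname are made up for illustration and are not taken from CA or the driver.

```python
from __future__ import annotations


def matches(selector: dict[str, str], node_labels: dict[str, str]) -> bool:
    """True if every key/value pair of a matchLabels-style selector is on the node."""
    return all(node_labels.get(key) == value for key, value in selector.items())


def has_capacity(selectors: list[dict[str, str]], node_labels: dict[str, str]) -> bool:
    """True if at least one capacity object's selector matches the node."""
    return any(matches(selector, node_labels) for selector in selectors)


# One selector per live node, keyed by hostname (as the sidecar publishes them).
per_node_selectors = [
    {"kubernetes.io/hostname": "node-a"},
    {"kubernetes.io/hostname": "node-b"},
]

real_node = {"kubernetes.io/hostname": "node-a"}
# Hypothetical synthetic hostname; the real format CA uses may differ.
template_node = {"kubernetes.io/hostname": "template-node-for-pool-1234"}

print(has_capacity(per_node_selectors, real_node))
print(has_capacity(per_node_selectors, template_node))
```

The live node is selected by its own capacity object, while the template node is selected by none, which is the condition under which the storage-capacity check fails.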
Goals
without giving up per-node capacity accuracy for live nodes.
dimension (VM SKU, AKS agent pool, zone) without code changes.
CSIStorageCapacity objects published by the external-provisioner sidecar.
Non-goals
template quantity, used only to satisfy CA's simulation.
nodes.
Design
Mitigation in upstream Cluster Autoscaler
kubernetes/autoscaler#9702 adds the label
cluster-autoscaler.kubernetes.io/template-node=true to every template node
CA generates during scale-up simulation. CSI vendors can use this label in a
dedicated CSIStorageCapacity.NodeTopology selector that matches only
template nodes, leaving real-node capacity reporting untouched.
Capacity-template controller
A controller in the local-csi-manager Deployment publishes one
CSIStorageCapacity per (StorageClass x node group) pair.
Opt-in. A StorageClass whose provisioner is localdisk.csi.acstor.io
opts in by setting the annotation:
The value is parsed as a resource.Quantity and used as the published
capacity for every node group.
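To make the quantity format concrete, here is a deliberately simplified parser for values such as 1Ti or 500Gi. It is only a sketch: the controller uses Kubernetes' resource.Quantity, which also accepts forms this sketch rejects (exponents, the milli suffix) and has its own rounding rules. The function name parse_capacity is hypothetical.

```python
from __future__ import annotations

import re
from decimal import Decimal

# Multipliers for the decimal (k, M, G, ...) and binary (Ki, Mi, Gi, ...) suffixes.
_SUFFIXES = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15, "E": 10**18,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60,
}
_QUANTITY_RE = re.compile(r"^(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti|Pi|Ei|k|M|G|T|P|E)?$")


def parse_capacity(value: str) -> int:
    """Parse a subset of Kubernetes quantity syntax into a byte count.

    Simplification: fractional bytes are truncated here, which is not
    necessarily what resource.Quantity does.
    """
    match = _QUANTITY_RE.match(value.strip())
    if match is None:
        raise ValueError(f"unsupported quantity: {value!r}")
    number, suffix = match.group(1), match.group(2) or ""
    return int(Decimal(number) * _SUFFIXES[suffix])


print(parse_capacity("1Ti"))
print(parse_capacity("500Gi"))
```

An invalid value raises ValueError, which mirrors the idea that a StorageClass with an unparsable annotation should not yield a template capacity object.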
Grouping. Nodes are grouped by a configurable label (flag
--capacity-template-node-group-label). The default is
node.kubernetes.io/instance-type (VM SKU) because local NVMe capacity is a
property of the VM SKU, not of an arbitrary pool name, and the upstream
instance-type label is portable across cloud providers. Other useful values:
node.kubernetes.io/instance-type
kubernetes.azure.com/agentpool
topology.kubernetes.io/zone
Topology selector. Each managed CSIStorageCapacity.NodeTopology selects on
both labels:
The template-node=true constraint means the object only matches CA's
simulated template nodes. Real nodes (which never carry that label) continue
to use the per-node objects published by the external-provisioner sidecar.
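As an illustration of the two-label selector, the sketch below builds a CSIStorageCapacity manifest as a Python dict and checks which nodes it would select. The field names follow the storage.k8s.io/v1 CSIStorageCapacity schema; the object name, namespace, StorageClass name, and SKU value are placeholders, and build_template_capacity is a hypothetical helper rather than the controller's code.

```python
from __future__ import annotations

TEMPLATE_LABEL = "cluster-autoscaler.kubernetes.io/template-node"


def build_template_capacity(name: str, namespace: str, storage_class: str,
                            group_label: str, group_value: str, capacity: str) -> dict:
    """Build a manifest that selects only CA template nodes of one node group."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "CSIStorageCapacity",
        "metadata": {"name": name, "namespace": namespace},
        "storageClassName": storage_class,
        "capacity": capacity,
        "nodeTopology": {
            "matchLabels": {TEMPLATE_LABEL: "true", group_label: group_value},
        },
    }


def matches(selector: dict[str, str], node_labels: dict[str, str]) -> bool:
    """True if every matchLabels pair is present on the node."""
    return all(node_labels.get(key) == value for key, value in selector.items())


# All concrete values below are placeholders for illustration.
obj = build_template_capacity(
    name="local-csi-template-local-standard-l8s-v3",
    namespace="kube-system",
    storage_class="local",
    group_label="node.kubernetes.io/instance-type",
    group_value="Standard_L8s_v3",
    capacity="1Ti",
)
selector = obj["nodeTopology"]["matchLabels"]

template_node = {TEMPLATE_LABEL: "true",
                 "node.kubernetes.io/instance-type": "Standard_L8s_v3"}
real_node = {"kubernetes.io/hostname": "node-a",
             "node.kubernetes.io/instance-type": "Standard_L8s_v3"}

print(matches(selector, template_node), matches(selector, real_node))
```

Because both labels must match, a real node of the same SKU is not selected, and a template node of a different SKU is not selected either.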
Reconciliation. The controller does a full sync on every event and on a
periodic 5-minute resync. On each reconcile it:
Lists StorageClasses with provisioner localdisk.csi.acstor.io and the opt-in annotation.
Lists Nodes and collects the distinct values of the configured group label.
For each (class, group) pair, creates or updates a CSIStorageCapacity in the manager's namespace.
Deletes managed objects whose (class, group) is no longer desired. Managed objects are identified by the
localdisk.csi.acstor.io/managed-by=capacitytemplate label, so objects published by the external-provisioner sidecar are never touched.
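The full-sync steps above can be sketched as a pure function that computes what to apply and what to delete. This is a model of the logic only, not the controller's implementation: objects are reduced to dicts, managed maps (class, group) to the capacity currently published, and the annotation key is passed in as a parameter because this text does not show it (the key used in the example call is a placeholder).

```python
from __future__ import annotations


def plan_sync(storage_classes: list[dict], nodes: list[dict[str, str]],
              managed: dict[tuple[str, str], str], *,
              provisioner: str, annotation: str, group_label: str):
    """Return (to_apply, to_delete) for one full reconcile pass."""
    # Opted-in StorageClasses: right provisioner and annotation present.
    classes = {
        sc["name"]: sc["annotations"][annotation]
        for sc in storage_classes
        if sc["provisioner"] == provisioner and annotation in sc.get("annotations", {})
    }
    # Distinct values of the grouping label across all nodes that carry it.
    groups = {labels[group_label] for labels in nodes if group_label in labels}
    # Desired state: one capacity per (class, group) pair.
    desired = {(name, group): cap for name, cap in classes.items() for group in groups}
    # Create or update where missing or different; delete what is no longer desired.
    to_apply = {key: cap for key, cap in desired.items() if managed.get(key) != cap}
    to_delete = sorted(key for key in managed if key not in desired)
    return to_apply, to_delete


SKU = "node.kubernetes.io/instance-type"
to_apply, to_delete = plan_sync(
    storage_classes=[
        {"name": "local", "provisioner": "localdisk.csi.acstor.io",
         "annotations": {"example.com/template-capacity": "1Ti"}},  # placeholder key
        {"name": "local-no-opt-in", "provisioner": "localdisk.csi.acstor.io"},
        {"name": "other", "provisioner": "other.example.com",
         "annotations": {"example.com/template-capacity": "1Ti"}},
    ],
    nodes=[{SKU: "sku-a"}, {SKU: "sku-a"}, {SKU: "sku-b"}, {}],
    managed={("local", "sku-a"): "1Ti", ("local", "sku-b"): "500Gi",
             ("local", "sku-c"): "1Ti"},
    provisioner="localdisk.csi.acstor.io",
    annotation="example.com/template-capacity",
    group_label=SKU,
)
print(to_apply, to_delete)
```

Only objects already in managed are candidates for deletion, which corresponds to restricting cleanup to objects carrying the managed-by label.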
Each managed object also carries localdisk.csi.acstor.io/storageclass and
localdisk.csi.acstor.io/node-group labels for human inspection.
Naming. Objects are named local-csi-template-<storageclass>-<group>,
sanitised to DNS-1123 and truncated with a short hash if longer than 253
characters.
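One possible shape for the naming rule is sketched below. The hash algorithm (SHA-256), the 8-character hash length, and the choice to map every disallowed character (including dots) to a hyphen are assumptions made for this example; the controller may sanitise and hash differently.

```python
from __future__ import annotations

import hashlib
import re

MAX_NAME_LEN = 253  # DNS-1123 subdomain length limit
HASH_LEN = 8        # assumed length of the "short hash"


def template_capacity_name(storage_class: str, group: str) -> str:
    """Build a DNS-1123-safe object name, truncating with a hash if too long."""
    raw = f"local-csi-template-{storage_class}-{group}".lower()
    # Replace runs of characters outside [a-z0-9-] and trim stray hyphens.
    name = re.sub(r"[^a-z0-9-]+", "-", raw).strip("-")
    if len(name) <= MAX_NAME_LEN:
        return name
    # Hash the unsanitised input so distinct long inputs stay distinct.
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()[:HASH_LEN]
    prefix = name[: MAX_NAME_LEN - HASH_LEN - 1].rstrip("-")
    return f"{prefix}-{digest}"


print(template_capacity_name("local", "Standard_L8s_v3"))
```

Short inputs pass through with only character sanitisation, so the common case stays readable; the hash only appears when truncation is needed.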
Watches. The reconciler watches:
StorageClass (filtered by provisioner) - opt-in changes.
Node - filtered to events that add, remove, or change the configured group label.
CSIStorageCapacity - filtered to objects with the managed-by label, so external changes trigger reconvergence.
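The Node filter can be pictured as a predicate over the labels before and after an event. This is a minimal sketch with a hypothetical function name; it treats a create event as having no old labels and a delete event as having no new labels.

```python
from __future__ import annotations

from typing import Optional


def group_label_changed(old_labels: Optional[dict[str, str]],
                        new_labels: Optional[dict[str, str]],
                        group_label: str) -> bool:
    """True if the event adds, removes, or changes the grouping label.

    Pass None for old_labels on create and None for new_labels on delete.
    """
    old_value = (old_labels or {}).get(group_label)
    new_value = (new_labels or {}).get(group_label)
    return old_value != new_value


SKU = "node.kubernetes.io/instance-type"
# A heartbeat-style update that leaves the grouping label alone is ignored.
print(group_label_changed({SKU: "sku-a", "x": "1"}, {SKU: "sku-a", "x": "2"}, SKU))
# A new node that carries the label triggers a reconcile.
print(group_label_changed(None, {SKU: "sku-a"}, SKU))
```

Filtering this way keeps frequent Node status updates from triggering a full sync when the set of node groups cannot have changed.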
Disabled by default
The controller is opt-in via --enable-capacity-template (Helm value
cleanup.capacityTemplate.enabled), since clusters that do not use Cluster
Autoscaler do not need it.
Limitations
group. If two groups using the same grouping label value need different
template capacities, switch the grouping label to a finer-grained one
(e.g. agent pool instead of SKU).
template-node=true label (kubernetes/autoscaler#9702, "Add template-node label to mitigate scale up issue (#9700)"). Older CAs will
not match these capacity objects, but they will also not be harmed by them.
References
node-specific CSIStorageCapacity.
mitigation.