Description
Which component are you using?:
/area cluster-autoscaler
What version of the component are you using?:
Component version: 1.33
What k8s version are you using (kubectl version)?:
k8s: 1.32
clusterapi: 1.9
What environment is this in?:
Kubernetes with Cluster API; tried on multiple infrastructure providers.
What did you expect to happen?:
I expect the Cluster Autoscaler to scale my cluster up and down while a MachineDeployment is undergoing an upgrade.
What happened instead?:
During the upgrade process, a MachineDeployment will have more nodes than its replicas field represents. If the old nodes are prevented from being deleted, the Cluster API controllers will delete new nodes that are created during the upgrade.
How to reproduce it (as minimally and precisely as possible):
This is a difficult problem to recreate, but the process can be summarized as:
- cordon/drain a Node belonging to a MachineDeployment, add a custom NoSchedule taint, and mark it unschedulable (a minimal client-go sketch of this step follows the list)
- begin an upgrade of the MachineDeployment from step 1
- prevent the old node from step 1 from being removed after the upgrade is complete
- wait until the scale-down unneeded time has expired for the node from step 1
- observe the autoscaler attempt to reduce replicas
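For reference, a minimal client-go sketch of the first step (cordon plus a custom NoSchedule taint). The taint key and node name are placeholders, and draining the node's pods (e.g. with kubectl drain) is assumed to happen separately:

```go
// Sketch of the reproduction's first step: cordon a Node and add a custom
// NoSchedule taint. The taint key and node name below are placeholders.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func cordonAndTaint(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Equivalent of `kubectl cordon`: keep new pods off the node.
	node.Spec.Unschedulable = true
	// Add a custom NoSchedule taint as described in step 1.
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "example.com/upgrade-test", // placeholder key
		Effect: corev1.TaintEffectNoSchedule,
	})
	_, err = cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	if err := cordonAndTaint(context.Background(), cs, "old-node-1"); err != nil {
		panic(err)
	}
	fmt.Println("node cordoned and tainted")
}
```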
Anything else we need to know?:
I've written a little more about this in a separate doc, copying here:
In some cases, when a MachineDeployment is undergoing an upgrade and the Cluster Autoscaler is scaling that MachineDeployment, it is possible for new upgraded nodes joining the cluster to be removed by the autoscaler.
Problem
When a Cluster API MachineDeployment is undergoing an upgrade, there is a period of time when there will be more Machine and Node resources belonging to the MachineDeployment than the .spec.replicas field represents. When this happens, it is possible for the Cluster Autoscaler to request removal of the older nodes based on the amount of time they have been unneeded (e.g. under-utilized or empty). The Cluster Autoscaler will then attempt to remove the nodes by annotating the associated Machine resources with the cluster.x-k8s.io/delete-machine key and then reducing the .spec.replicas field of the MachineDeployment. In cases where a Machine annotated for deletion cannot be removed, it is possible for the Cluster API controller to remove a new node instead.
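To make that two-step request concrete, here is a minimal sketch (not the autoscaler's actual implementation) of the flow described above, using the dynamic client so it does not depend on a specific typed Cluster API version. The namespace, object names, API version, and the annotation value "true" are assumptions; the annotation key is the one mentioned above:

```go
// Sketch of the removal request described above: annotate the Machine with the
// cluster.x-k8s.io/delete-machine key, then decrement the MachineDeployment's
// .spec.replicas. This mirrors the described flow, not the autoscaler's code.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

var (
	machineGVR = schema.GroupVersionResource{Group: "cluster.x-k8s.io", Version: "v1beta1", Resource: "machines"}
	mdGVR      = schema.GroupVersionResource{Group: "cluster.x-k8s.io", Version: "v1beta1", Resource: "machinedeployments"}
)

func markAndScaleDown(ctx context.Context, dc dynamic.Interface, ns, machineName, mdName string) error {
	// Step 1: mark the Machine so Cluster API prefers it when scaling down.
	m, err := dc.Resource(machineGVR).Namespace(ns).Get(ctx, machineName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	anns := m.GetAnnotations()
	if anns == nil {
		anns = map[string]string{}
	}
	anns["cluster.x-k8s.io/delete-machine"] = "true" // presence of the key is the signal
	m.SetAnnotations(anns)
	if _, err := dc.Resource(machineGVR).Namespace(ns).Update(ctx, m, metav1.UpdateOptions{}); err != nil {
		return err
	}

	// Step 2: reduce the MachineDeployment's desired replica count by one.
	md, err := dc.Resource(mdGVR).Namespace(ns).Get(ctx, mdName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	replicas, _, err := unstructured.NestedInt64(md.Object, "spec", "replicas")
	if err != nil {
		return err
	}
	if err := unstructured.SetNestedField(md.Object, replicas-1, "spec", "replicas"); err != nil {
		return err
	}
	_, err = dc.Resource(mdGVR).Namespace(ns).Update(ctx, md, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dc := dynamic.NewForConfigOrDie(cfg)
	if err := markAndScaleDown(context.Background(), dc, "default", "old-machine-abc", "workers"); err != nil {
		panic(err)
	}
}
```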
The node removal generally follows this flow:
sequenceDiagram
    participant CAS as Cluster Autoscaler
    participant CAPI as Cluster-API
    participant MD as MachineDeployment
    actor User as User
    User ->> MD: begin upgrade by changing infra ref
    MD ->> CAPI: reconciles change
    CAPI ->> CAPI: creates new MachineSet, begins upgrade
    CAPI ->> CAPI: new machines get created (~according to MaxSurge)
    CAPI ->> CAPI: old machines get deleted (~according to MaxUnavailable)
    CAPI ->> CAPI: for any reason, workloads are not moved to new machines
    CAS ->> CAS: observes unneeded Nodes (old nodes that have been cordoned/drained)
    CAS ->> MD: reduces replica count, marks Machines for deletion
    MD ->> CAPI: reconciles change
    CAPI ->> CAPI: removes wrong Machines (CAPI has its own opinion on which machine to delete when scaling down)
It is important to note that in most cases this process works as intended. However, in cases where the old Machines/Nodes are prevented from being removed, the deletion of new Machines becomes more pronounced.
Also, it is not clear yet how the maxSurge and maxUnavailable fields impact this problem.
Another expression of this problem can occur with new nodes being deleted:
sequenceDiagram
    participant CAS as Cluster Autoscaler
    participant CAPI as Cluster-API
    participant MD as MachineDeployment
    actor User as User
    User ->> MD: begin upgrade by changing infra ref
    MD ->> CAPI: reconciles change
    CAPI ->> CAPI: creates new MachineSet, begins upgrade
    CAPI ->> CAPI: new machines get created (according to MaxSurge)
    CAPI ->> CAPI: for any reason, workloads are not moved to new machines (e.g. custom provisioning sequence)
    CAS ->> CAS: observes unneeded Nodes, including the new machine which is still waiting for workloads
    CAS ->> MD: reduces replica count for the MD, marks unneeded Machine for deletion
    MD ->> CAPI: reconciles change
    CAPI ->> CAPI: removes wrong Machines (while rolling out, in this case CAPI prioritizes downscaling old MachineSets to reduce the replica count - the delete annotation on a machine in another MachineSet has no impact)
Possibilities
- Make the autoscaler smarter about upgrades. This would involve improving the Cluster API provider code to understand when an upgrade is occurring, and to ignore calls to DeleteNodes that target old, un-upgraded nodes (a rough sketch follows this list).
- Add more information to the Cluster API resource objects to coordinate upgrades. This would be similar to the outdated-revision taint, but might include adding an annotation to Machine or Node objects to improve the ability of 3rd party tools to understand when an upgrade is underway.
- Change the behavior of the Cluster API MachineDeployment controller. This might include making the MachineDeployment controller more aware of when Machines it owns contain the cluster.x-k8s.io/delete-machine annotation so that it could choose the best MachineSet to scale.
- Prevent the autoscaler from making any changes while a MachineDeployment is upgrading.
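As a rough illustration of the first possibility, the provider could check whether the owning MachineDeployment is still rolling out before honoring a scale-down. The snippet below is only a sketch: it assumes that comparing .status.updatedReplicas with .spec.replicas is a sufficient rollout signal, and the function names are hypothetical rather than existing provider code:

```go
// Hypothetical guard the Cluster API provider could use: while the owning
// MachineDeployment is still rolling out, skip DeleteNodes-driven scale-down.
package clusterapi

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// upgradeInProgress reports whether the MachineDeployment still has replicas
// that have not been replaced by the newest MachineSet, under the assumption
// that .status.updatedReplicas < .spec.replicas means a rollout is underway.
func upgradeInProgress(md *unstructured.Unstructured) bool {
	desired, found, err := unstructured.NestedInt64(md.Object, "spec", "replicas")
	if err != nil || !found {
		return false
	}
	updated, found, err := unstructured.NestedInt64(md.Object, "status", "updatedReplicas")
	if err != nil || !found {
		return false
	}
	return updated < desired
}

// shouldSkipScaleDown is a hypothetical check a provider could run before
// annotating Machines and reducing replicas on behalf of DeleteNodes.
func shouldSkipScaleDown(md *unstructured.Unstructured) bool {
	return upgradeInProgress(md)
}
```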