Description
Which component are you using?:
/area cluster-autoscaler
What version of the component are you using?:
Component version: 1.33
What k8s version are you using (kubectl version)?:
k8s: 1.32
clusterapi: 1.9
What environment is this in?:
Kubernetes with Cluster API; tried on multiple infrastructure providers.
What did you expect to happen?:
I expect the Cluster Autoscaler to scale my cluster up and down while a MachineDeployment is undergoing an upgrade.
What happened instead?:
During the upgrade process, a MachineDeployment will have more nodes than its replicas field represents. If the old nodes are prevented from being deleted, the Cluster API controllers will delete new nodes that are created during the upgrade.
How to reproduce it (as minimally and precisely as possible):
This is a difficult problem to recreate, but the process can be summarized as:
- cordon/drain a Node belonging to a MachineDeployment, add a custom NoSchedule taint, and mark it unschedulable (a minimal client-go sketch of this step follows the list)
- begin an upgrade of the MachineDeployment from step 1
- prevent the old node from step 1 from being removed after the upgrade is complete
- wait until the scale-down unneeded time has expired for the node from step 1
- observe the autoscaler attempt to reduce replicas
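For reference, a minimal client-go sketch of the first step (cordon plus a custom NoSchedule taint). The taint key and node name are placeholders, and draining the node's pods (e.g. with kubectl drain) is assumed to happen separately:

```go
// Sketch of the reproduction's first step: cordon a Node and add a custom
// NoSchedule taint. The taint key and node name below are placeholders.
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func cordonAndTaint(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Equivalent of `kubectl cordon`: keep new pods off the node.
	node.Spec.Unschedulable = true
	// Add a custom NoSchedule taint as described in step 1.
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "example.com/upgrade-test", // placeholder key
		Effect: corev1.TaintEffectNoSchedule,
	})
	_, err = cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	if err := cordonAndTaint(context.Background(), cs, "old-node-1"); err != nil {
		panic(err)
	}
	fmt.Println("node cordoned and tainted")
}
```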
Anything else we need to know?:
I've written a little more about this in a separate doc, copying here:
In some cases, when a MachineDeployment is undergoing an upgrade and the Cluster Autoscaler is scaling that MachineDeployment, it is possible for new upgraded nodes joining the cluster to be removed by the autoscaler.
Problem
When a Cluster API MachineDeployment is undergoing an upgrade, there is a period of time when there will be more Machine and Node resources belonging to the MachineDeployment than the .spec.replicas field represents. When this happens, it is possible for the Cluster Autoscaler to request removal of the older nodes based on the amount of time they have been unneeded (e.g. under-utilized or empty). The Cluster Autoscaler will then attempt to remove the nodes by annotating the associated Machine resources with the cluster.x-k8s.io/delete-machine key and then reducing the .spec.replicas field of the MachineDeployment. In cases where a Machine annotated for deletion cannot be removed, it is possible for the Cluster API controller to remove a new node instead.
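To make that two-step request concrete, here is a minimal sketch (not the autoscaler's actual implementation) of the flow described above, using the dynamic client so it does not depend on a specific typed Cluster API version. The namespace, object names, API version, and the annotation value "true" are assumptions; the annotation key is the one mentioned above:

```go
// Sketch of the removal request described above: annotate the Machine with the
// cluster.x-k8s.io/delete-machine key, then decrement the MachineDeployment's
// .spec.replicas. This mirrors the described flow, not the autoscaler's code.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

var (
	machineGVR = schema.GroupVersionResource{Group: "cluster.x-k8s.io", Version: "v1beta1", Resource: "machines"}
	mdGVR      = schema.GroupVersionResource{Group: "cluster.x-k8s.io", Version: "v1beta1", Resource: "machinedeployments"}
)

func markAndScaleDown(ctx context.Context, dc dynamic.Interface, ns, machineName, mdName string) error {
	// Step 1: mark the Machine so Cluster API prefers it when scaling down.
	m, err := dc.Resource(machineGVR).Namespace(ns).Get(ctx, machineName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	anns := m.GetAnnotations()
	if anns == nil {
		anns = map[string]string{}
	}
	anns["cluster.x-k8s.io/delete-machine"] = "true" // presence of the key is the signal
	m.SetAnnotations(anns)
	if _, err := dc.Resource(machineGVR).Namespace(ns).Update(ctx, m, metav1.UpdateOptions{}); err != nil {
		return err
	}

	// Step 2: reduce the MachineDeployment's desired replica count by one.
	md, err := dc.Resource(mdGVR).Namespace(ns).Get(ctx, mdName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	replicas, _, err := unstructured.NestedInt64(md.Object, "spec", "replicas")
	if err != nil {
		return err
	}
	if err := unstructured.SetNestedField(md.Object, replicas-1, "spec", "replicas"); err != nil {
		return err
	}
	_, err = dc.Resource(mdGVR).Namespace(ns).Update(ctx, md, metav1.UpdateOptions{})
	return err
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dc := dynamic.NewForConfigOrDie(cfg)
	if err := markAndScaleDown(context.Background(), dc, "default", "old-machine-abc", "workers"); err != nil {
		panic(err)
	}
}
```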
The node removal generally follows this flow:
sequenceDiagram
    participant CAS as Cluster Autoscaler
    participant CAPI as Cluster-API
    participant MD as MachineDeployment
    actor User as User
    User ->> MD: begin upgrade by changing infra ref
    MD ->> CAPI: reconciles change
    CAPI ->> CAPI: creates new MachineSet, begins upgrade
    CAPI ->> CAPI: new machines get created (~according to MaxSurge)
    CAPI ->> CAPI: old machines get deleted (~according to MaxUnavailable)
    CAPI ->> CAPI: for any reason, workloads are not moved to new machines
    CAS ->> CAS: observes unneeded Nodes (old nodes that have been cordoned/drained)
    CAS ->> MD: reduces replica count, marks Machines for deletion
    MD ->> CAPI: reconciles change
    CAPI ->> CAPI: removes wrong Machines (CAPI has its own opinion on which machine to delete when scaling down)
It is important to note that in most cases this process works as intended. However, in cases where the old Machines/Nodes are prevented from being removed, the deletion of new Machines becomes more pronounced.
Also, it is not clear yet how the maxSurge and maxUnavailable fields impact this problem.
Another expression of this problem can occur with new nodes being deleted:
sequenceDiagram
    participant CAS as Cluster Autoscaler
    participant CAPI as Cluster-API
    participant MD as MachineDeployment
    actor User as User
    User ->> MD: begin upgrade by changing infra ref
    MD ->> CAPI: reconciles change
    CAPI ->> CAPI: creates new MachineSet, begins upgrade
    CAPI ->> CAPI: new machines get created (according to MaxSurge)
    CAPI ->> CAPI: for any reason, workloads are not moved to new machines (e.g. custom provisioning sequence)
    CAS ->> CAS: observes unneeded Nodes, including the new machine which is still waiting for workloads
    CAS ->> MD: reduces replica count for the MD, marks unneeded Machine for deletion
    MD ->> CAPI: reconciles change
    CAPI ->> CAPI: removes wrong Machines (while rolling out, in this case CAPI prioritizes downscaling old MachineSets to reduce the replica count - the delete annotation on a machine in another MachineSet has no impact)
Possibilities
- Make the autoscaler smarter about upgrades. This would involve improving the Cluster API provider code to understand when an upgrade is occurring, and to ignore calls to DeleteNodes that target old, un-upgraded nodes (a rough sketch follows this list).
- Add more information to the Cluster API resource objects to coordinate upgrades. This would be similar to the outdated-revision taint, but might include adding an annotation to Machine or Node objects to improve the ability of 3rd party tools to understand when an upgrade is underway.
- Change the behavior of the Cluster API MachineDeployment controller. This might include making the MachineDeployment controller more aware of when Machines it owns contain the cluster.x-k8s.io/delete-machine annotation so that it could choose the best MachineSet to scale.
- Prevent the autoscaler from making any changes while a MachineDeployment is upgrading.
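As a rough illustration of the first possibility, the provider could check whether the owning MachineDeployment is still rolling out before honoring a scale-down. The snippet below is only a sketch: it assumes that comparing .status.updatedReplicas with .spec.replicas is a sufficient rollout signal, and the function names are hypothetical rather than existing provider code:

```go
// Hypothetical guard the Cluster API provider could use: while the owning
// MachineDeployment is still rolling out, skip DeleteNodes-driven scale-down.
package clusterapi

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// upgradeInProgress reports whether the MachineDeployment still has replicas
// that have not been replaced by the newest MachineSet, under the assumption
// that .status.updatedReplicas < .spec.replicas means a rollout is underway.
func upgradeInProgress(md *unstructured.Unstructured) bool {
	desired, found, err := unstructured.NestedInt64(md.Object, "spec", "replicas")
	if err != nil || !found {
		return false
	}
	updated, found, err := unstructured.NestedInt64(md.Object, "status", "updatedReplicas")
	if err != nil || !found {
		return false
	}
	return updated < desired
}

// shouldSkipScaleDown is a hypothetical check a provider could run before
// annotating Machines and reducing replicas on behalf of DeleteNodes.
func shouldSkipScaleDown(md *unstructured.Unstructured) bool {
	return upgradeInProgress(md)
}
```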