
Conversation

elmiko
Contributor

@elmiko elmiko commented Sep 29, 2025

What type of PR is this?

/kind cleanup
/kind api-change

What this PR does / why we need it:

This patch series changes the NewCloudProvider function to take an AutoscalerOptions struct as its argument instead of AutoscalingOptions. This change gives cloud providers more control over the core functionality of the cluster autoscaler.

Specifically, this patch series also adds a RegisterScaleDownNodeProcessor method to AutoscalerOptions so that cloud providers can inject a custom scale-down processor.

Lastly, this change adds a custom scale-down processor to the clusterapi provider to help it avoid removing the wrong instance during scale-down operations that occur during a cluster upgrade.

Which issue(s) this PR fixes:

Fixes #8494

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added labels Sep 29, 2025: release-note-none, kind/cleanup, kind/api-change, cncf-cla: yes, area/cluster-autoscaler, size/L, and the provider areas alicloud, aws, azure, cluster-api, coreweave, digitalocean, equinixmetal, externalgrpc, gce, hetzner, huaweicloud, ionoscloud, kwok, linode, magnum, oci, rancher, and utho; removed do-not-merge/needs-area.
@elmiko
Contributor Author

elmiko commented Sep 29, 2025

this is an alternate solution to #8531

@elmiko
Contributor Author

elmiko commented Sep 30, 2025

/retest

@elmiko
Contributor Author

elmiko commented Sep 30, 2025

not sure why the tests aren't triggering.

@elmiko elmiko force-pushed the provider-options-refactor branch from 1afa484 to a5f07fe on September 30, 2025 19:52
@elmiko
Contributor Author

elmiko commented Sep 30, 2025

rebased on master

@elmiko
Contributor Author

elmiko commented Sep 30, 2025

i'm not sure how i've broken this, but these error lines are not making sense to me:

package k8s.io/autoscaler/cluster-autoscaler/processors/customresources
	imports k8s.io/autoscaler/cluster-autoscaler/cloudprovider/gce from gpu_processor_test.go
	imports k8s.io/autoscaler/cluster-autoscaler/core/options from gce_cloud_provider.go
	imports k8s.io/autoscaler/cluster-autoscaler/processors from autoscaler.go
Error: 	imports k8s.io/autoscaler/cluster-autoscaler/processors/customresources from processors.go: import cycle not allowed in test

@elmiko
Contributor Author

elmiko commented Sep 30, 2025

@towca does this error look familiar to you at all?

@elmiko
Contributor Author

elmiko commented Sep 30, 2025

ok, i think my last commit on this chain fixed the circular import.

@jackfrancis
Contributor

We have a conversation going on here that will potentially address the gpu node label concern that this PR encountered:

/lgtm
/approve
/hold

for @towca to give his thumbs up

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 2, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 2, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko, jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 2, 2025
@elmiko
Contributor Author

elmiko commented Oct 3, 2025

We have a conversation going on here that will potentially address the gpu node label concern that this PR encountered:

i'm fine with changing the label in the test for this PR, i was just trying to stop the circular import error.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 7, 2025
elmiko added 5 commits October 8, 2025 09:43

- This change helps to prevent circular dependencies between the core and builder packages as we start to pass the AutoscalerOptions to the cloud provider builder functions.
- This changes the options input to the cloud provider builder function so that the full autoscaler options are passed. This is being proposed so that cloud providers will have new options for injecting behavior into the core parts of the autoscaler.
- This change adds a method to the AutoscalerOptions struct to allow registering a scale down node processor.
- This change adds a custom scale down node processor for cluster api to reject nodes that are undergoing upgrade.
- This change removes the import from the gce module in favor of using the string value directly.
@elmiko elmiko force-pushed the provider-options-refactor branch from b82aa40 to aea176f on October 8, 2025 13:49
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 8, 2025
@k8s-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 8, 2025
@elmiko
Contributor Author

elmiko commented Oct 8, 2025

rebased

@elmiko
Contributor Author

elmiko commented Oct 14, 2025

@jackfrancis @towca, how is this patch looking?

i'd like to get it merged soon if possible.

Collaborator

@towca towca left a comment

Reviewed the non-cluster-api parts of the PR

}

func (a *AutoscalerOptions) RegisterScaleDownNodeProcessor(np nodes.ScaleDownNodeProcessor) {
	a.Processors.ScaleDownNodeProcessor.(*scaledowncandidates.CombinedScaleDownCandidatesProcessor).Register(np)
}
Collaborator

This method only works for a particular configuration of AutoscalerOptions and panics otherwise. IMO this logic belongs in cloud-provider-specific code where you can assume which processors are used. I'd convert this into a CombinedScaleDownCandidatesProcessor-specific util function and call it directly from your NewCloudProvider().

Contributor Author

@elmiko elmiko Oct 16, 2025

i opted for the method here so that we could keep the details about the Processors structure internal to the AutoscalerOptions, but if you are ok with having the individual providers do the injection directly through a passed function, i can change it.

edit: i misread your comment, i'll try it another way.

Collaborator

Yeah, I meant adding a util like this to where the combined processor is implemented:

func AddScaleDownProcessorAssumingCombinedImplementation(combinedProcessor, processorToAdd nodes.ScaleDownNodeProcessor) error {
	combinedProcessorConcrete, ok := combinedProcessor.(*scaledowncandidates.CombinedScaleDownCandidatesProcessor)
	if !ok {
		return fmt.Errorf("wrong concrete ScaleDownNodeProcessor type: want *scaledowncandidates.CombinedScaleDownCandidatesProcessor, got %T", combinedProcessor)
	}
	combinedProcessorConcrete.Register(processorToAdd)
	return nil
}

And calling it from your NewCloudProvider like so:

...
err := AddScaleDownProcessorAssumingCombinedImplementation(opts.Processors.ScaleDownNodeProcessor, clusterApiSpecificProcessor)
...

The whole logic could also just be inlined into your NewCloudProvider() if we don't expect that other cloud provider integrations would want to use it in the future.

The difference between this util and your proposed method is that I see AutoscalerOptions and its associated code as "core" CA, and core CA shouldn't assume anything about the concrete types behind the processors. Out-of-tree cloud-provider integrations can, for example, use a completely different implementation for the top-level ScaleDownNodeProcessor. Such integrations then have to avoid using this method, even though from the outside it looks like it provides a generic way of adding another ScaleDownNodeProcessor to AutoscalerOptions. The util, on the other hand, is specific to a concrete implementation of ScaleDownNodeProcessor, so it should be clear not to use it on other implementations.

Contributor Author

makes sense to me, i will update to use a util function style.

The whole logic could also just be inlined into your NewCloudProvider() if we don't expect that other cloud provider integrations would want to use it in the future.

i would prefer a function that could be utilized by other providers. i will make a function in scale_down_candidates_processor.go and see how that looks.

}
if opts.CloudProvider == nil {
-	opts.CloudProvider = cloudBuilder.NewCloudProvider(opts.AutoscalingOptions, informerFactory)
+	opts.CloudProvider = cloudBuilder.NewCloudProvider(opts, informerFactory)
Collaborator

I think we need to explicitly put this at the end of the defaulting in this function; otherwise the changes made to AutoscalerOptions in NewCloudProvider() could be overwritten by further logic here (e.g. if NewCloudProvider() wanted to configure opts.Backoff). It seems that ExpanderStrategy has a dependency on CloudProvider, so that either needs to be removed or explained wherever we describe how NewCloudProvider() should be implemented.

Contributor Author

just to be clear, you want to see something like this?

NewCloudProvider(opts config.AutoscalingOptions, informerFactory informers.SharedInformerFactory, autoscalerOpts core.AutoscalerOptions)

Contributor Author

or did you mean to move the function call further down inside initializeDefaultOptions?

Collaborator

The latter, I meant to put the function call as close to the bottom of initializeDefaultOptions as possible (it can't be at the very bottom because the ExpanderStrategy logic needs CloudProvider already set). More generally, IMO the NewCloudProvider() call should be as late in CA init logic as possible (but initializeDefaultOptions already gets called at the end to fill out the fields that weren't set by the main.go logic before).

My reasoning was slightly wrong though - the fields wouldn't be overwritten if NewCloudProvider set them because of the nil checks in initializeDefaultOptions. But IMO it still makes sense to put it as late as possible, so that NewCloudProvider can assume that ~all fields on the AutoscalerOptions it gets are already set and can be read/modified.

This isn't an issue for your particular problem because opts.Processors are set before the NewCloudProvider call. I'm just trying to make sure the mechanism is as flexible as possible. For example if a NewCloudProvider implementation in the future wants to interact with the opts.EstimatorBuilder field and we keep the current order, the field would be unset at the time of NewCloudProvider() call.

Contributor Author

got it, makes sense to me. i will update this.


var snapshotStore clustersnapshot.ClusterSnapshotStore = store.NewDeltaSnapshotStore(autoscalingOptions.ClusterSnapshotParallelism)
-opts := core.AutoscalerOptions{
+opts := coreoptions.AutoscalerOptions{
Collaborator

One more thing - it seems there is existing cloud-provider-specific init logic in this function. Now we're adding a better mechanism for the same purpose, which might be confusing down the road. Could you add a TODO (and ideally an associated issue on GitHub) near the provider-specific logic to migrate it to NewCloudProvider()?

Contributor Author

yes, will do.




Development

Successfully merging this pull request may close these issues.

CA ClusterAPI provider can delete wrong node when scale-down occurs during MachineDeployment upgrade
