
Conversation

elmiko
Contributor

@elmiko elmiko commented Sep 29, 2025

What type of PR is this?

/kind cleanup
/kind api-change

What this PR does / why we need it:

This patch series changes the NewCloudProvider function to take an AutoscalerOptions struct as its argument instead of AutoscalingOptions. This change gives cloud providers more control over the core functionality of the cluster autoscaler.

Specifically, this patch series also adds a RegisterScaleDownNodeProcessor method to AutoscalerOptions so that cloud providers can inject a custom scale-down processor.

Lastly, this change adds a custom scale-down processor to the clusterapi provider to help it avoid removing the wrong instance during scale-down operations that occur during a cluster upgrade.

Which issue(s) this PR fixes:

Fixes #8494

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added labels Sep 29, 2025: release-note-none, kind/cleanup, kind/api-change, cncf-cla: yes, area/cluster-autoscaler, size/L, and the provider areas alicloud, aws, azure, cluster-api, coreweave, digitalocean, equinixmetal, externalgrpc, gce, hetzner, huaweicloud, ionoscloud, kwok, linode, magnum, oci, rancher, and utho; removed do-not-merge/needs-area.
@elmiko
Contributor Author

elmiko commented Sep 29, 2025

this is an alternate solution to #8531

@elmiko
Contributor Author

elmiko commented Sep 30, 2025

/retest

@elmiko
Contributor Author

elmiko commented Sep 30, 2025

not sure why the tests aren't triggering.

@elmiko elmiko force-pushed the provider-options-refactor branch from 1afa484 to a5f07fe on September 30, 2025 19:52
@elmiko
Contributor Author

elmiko commented Sep 30, 2025

rebased on master

@elmiko
Contributor Author

elmiko commented Sep 30, 2025

i'm not sure how i've broken this, but these error lines are not making sense to me:

package k8s.io/autoscaler/cluster-autoscaler/processors/customresources
	imports k8s.io/autoscaler/cluster-autoscaler/cloudprovider/gce from gpu_processor_test.go
	imports k8s.io/autoscaler/cluster-autoscaler/core/options from gce_cloud_provider.go
	imports k8s.io/autoscaler/cluster-autoscaler/processors from autoscaler.go
Error: 	imports k8s.io/autoscaler/cluster-autoscaler/processors/customresources from processors.go: import cycle not allowed in test

@elmiko
Contributor Author

elmiko commented Sep 30, 2025

@towca does this error look familiar to you at all?

@elmiko
Contributor Author

elmiko commented Sep 30, 2025

ok, i think my last commit on this chain fixed the circular import.

@jackfrancis
Contributor

We have a conversation going on here that will potentially address the gpu node label concern that this PR encountered:

/lgtm
/approve
/hold

for @towca to give his thumbs up

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 2, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 2, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: elmiko, jackfrancis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 2, 2025
@elmiko
Contributor Author

elmiko commented Oct 3, 2025

We have a conversation going on here that will potentially address the gpu node label concern that this PR encountered:

i'm fine with changing the label in the test for this PR, i was just trying to stop the circular import error.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 7, 2025
elmiko added 5 commits October 8, 2025 09:43

- This change helps to prevent circular dependencies between the core and builder packages as we start to pass the AutoscalerOptions to the cloud provider builder functions.
- This changes the options input to the cloud provider builder function so that the full autoscaler options are passed. This is being proposed so that cloud providers will have new options for injecting behavior into the core parts of the autoscaler.
- This change adds a method to the AutoscalerOptions struct to allow registering a scale down node processor.
- This change adds a custom scale down node processor for cluster api to reject nodes that are undergoing upgrade.
- This change removes the import from the gce module in favor of using the string value directly.
@elmiko elmiko force-pushed the provider-options-refactor branch from b82aa40 to aea176f on October 8, 2025 13:49
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 8, 2025
@k8s-ci-robot
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 8, 2025
@elmiko
Contributor Author

elmiko commented Oct 8, 2025

rebased

@elmiko
Contributor Author

elmiko commented Oct 14, 2025

@jackfrancis @towca, how is this patch looking?

i'd like to get it merged soon if possible.

Collaborator

@towca towca left a comment

Reviewed the non-cluster-api parts of the PR

}

func (a *AutoscalerOptions) RegisterScaleDownNodeProcessor(np nodes.ScaleDownNodeProcessor) {
	a.Processors.ScaleDownNodeProcessor.(*scaledowncandidates.CombinedScaleDownCandidatesProcessor).Register(np)
}
Collaborator

This method only works for a particular configuration of AutoscalerOptions and panics otherwise. IMO this logic belongs in cloud-provider-specific code where you can assume which processors are used. I'd convert this into a CombinedScaleDownCandidatesProcessor-specific util function and call it directly from your NewCloudProvider().

Contributor Author

@elmiko elmiko Oct 16, 2025

i opted for the method here so that we could keep the details about the Processors structure internal to the AutoscalerOptions, but if you are ok with having the individual providers do the injection directly through a passed function, i can change it.

edit: i misread your comment, i'll try it another way.

Collaborator

Yeah, I meant adding a util like this to where the combined processor is implemented:

func AddScaleDownProcessorAssumingCombinedImplementation(combinedProcessor, processorToAdd nodes.ScaleDownNodeProcessor) error {
	combinedProcessorConcrete, ok := combinedProcessor.(*scaledowncandidates.CombinedScaleDownCandidatesProcessor)
	if !ok {
		return fmt.Errorf("wrong concrete ScaleDownNodeProcessor type: want *scaledowncandidates.CombinedScaleDownCandidatesProcessor, got %T", combinedProcessor)
	}
	combinedProcessorConcrete.Register(processorToAdd)
	return nil
}

And calling it from your NewCloudProvider like so:

...
err := AddScaleDownProcessorAssumingCombinedImplementation(opts.Processors.ScaleDownNodeProcessor, clusterApiSpecificProcessor)
...

The whole logic could also just be inlined into your NewCloudProvider() if we don't expect that other cloud provider integrations would want to use it in the future.

The difference between this util and your proposed method is that I see AutoscalerOptions and its associated code as "core" CA, and core CA shouldn't assume anything about the concrete types behind the processors. Out-of-tree cloud-provider integrations can, for example, use a completely different implementation for the top-level ScaleDownNodeProcessor. Such integrations then have to avoid using this method, even though from the outside it looks like it provides a generic way of adding another ScaleDownNodeProcessor to AutoscalerOptions. The util, on the other hand, is specific to a concrete implementation of ScaleDownNodeProcessor, so it should be clear not to use it on other implementations.

Contributor Author

makes sense to me, i will update to use a util function style.

The whole logic could also just be inlined into your NewCloudProvider() if we don't expect that other cloud provider integrations would want to use it in the future.

i would prefer a function that could be utilized by other providers. i will make a function in scale_down_candidates_processor.go and see how that looks.

}
if opts.CloudProvider == nil {
-	opts.CloudProvider = cloudBuilder.NewCloudProvider(opts.AutoscalingOptions, informerFactory)
+	opts.CloudProvider = cloudBuilder.NewCloudProvider(opts, informerFactory)
Collaborator

I think we need to explicitly put this at the end of the defaulting in this function; otherwise the changes made to AutoscalerOptions in NewCloudProvider() could be overwritten by further logic here (e.g. if NewCloudProvider() wanted to configure opts.Backoff). It seems that ExpanderStrategy has a dependency on CloudProvider, so that either needs to be removed or explained wherever we describe how NewCloudProvider() should be implemented.

Contributor Author

just to be clear, you want to see something like this?

NewCloudProvider(opts config.AutoscalingOptions, informerFactory informers.SharedInformerFactory, autoscalerOpts core.AutoscalerOptions)

Contributor Author

or did you mean to move the function call further down inside initializeDefaultOptions?

Collaborator

The latter, I meant to put the function call as close to the bottom of initializeDefaultOptions as possible (it can't be at the very bottom because the ExpanderStrategy logic needs CloudProvider already set). More generally, IMO the NewCloudProvider() call should be as late in CA init logic as possible (but initializeDefaultOptions already gets called at the end to fill out the fields that weren't set by the main.go logic before).

My reasoning was slightly wrong though - the fields wouldn't be overwritten if NewCloudProvider set them because of the nil checks in initializeDefaultOptions. But IMO it still makes sense to put it as late as possible, so that NewCloudProvider can assume that ~all fields on the AutoscalerOptions it gets are already set and can be read/modified.

This isn't an issue for your particular problem because opts.Processors are set before the NewCloudProvider call. I'm just trying to make sure the mechanism is as flexible as possible. For example if a NewCloudProvider implementation in the future wants to interact with the opts.EstimatorBuilder field and we keep the current order, the field would be unset at the time of NewCloudProvider() call.

Contributor Author

got it, makes sense to me. i will update this.


var snapshotStore clustersnapshot.ClusterSnapshotStore = store.NewDeltaSnapshotStore(autoscalingOptions.ClusterSnapshotParallelism)
-opts := core.AutoscalerOptions{
+opts := coreoptions.AutoscalerOptions{
Collaborator

One more thing - it seems there is existing cloud-provider-specific init logic in this function. Now we're adding a better mechanism for the same purpose, which might be confusing down the road. Could you add a TODO (and ideally an associated issue on GitHub) near the provider-specific logic to migrate it to NewCloudProvider()?

Contributor Author

yes, will do.




Development

Successfully merging this pull request may close these issues.

CA ClusterAPI provider can delete wrong node when scale-down occurs during MachineDeployment upgrade
