Fix proactive scale up injecting fake pods for scheduling gated pods #8580

abdelrahman882 · 2025-09-29T09:26:47Z

What type of PR is this?

/kind bug

What this PR does / why we need it:

Proactive scale up lists the controllers, checks their desired replicas, and injects fake pods so that the cluster scales up even before the real pods are created and marked as unschedulable.

Proactive scale up is using the following formula for the number of fake pods to inject per controller:

Number of fake pods = number of controller desired pods - (scheduled pods + unschedulable pods + unprocessed pods)

The problem is that scheduling gated pods are not excluded along with the unschedulable and unprocessed pods, so proactive scale up injects fake pods for these gated pods, ignoring the gating condition. This happens on every loop, so the resulting empty capacity is never scaled down.

This PR subtracts the number of pods with scheduling gates from the number of fake pods to inject.
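As an illustration of the arithmetic only, the corrected count can be sketched as below. The `podCounts` type and `fakePodsToInject` function are hypothetical names, not the PR's code, and clamping the result at zero is an assumption made for the sketch.

```go
package main

import "fmt"

// podCounts is a hypothetical per-controller summary used only for this sketch.
type podCounts struct {
	desired       int // replicas requested by the controller
	scheduled     int
	unschedulable int
	unprocessed   int
	gated         int // pods that still carry scheduling gates
}

// fakePodsToInject applies the formula from the description with gated pods
// subtracted as well. Clamping at zero is an assumption of this sketch.
func fakePodsToInject(c podCounts) int {
	n := c.desired - (c.scheduled + c.unschedulable + c.unprocessed + c.gated)
	if n < 0 {
		return 0
	}
	return n
}

func main() {
	// Ignoring the 5 gated pods reproduces the bug: 5 fake pods are injected.
	fmt.Println(fakePodsToInject(podCounts{desired: 10, scheduled: 4, unschedulable: 1})) // 5
	// Counting the gated pods leaves nothing to inject.
	fmt.Println(fakePodsToInject(podCounts{desired: 10, scheduled: 4, unschedulable: 1, gated: 5})) // 0
}
```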

Which issue(s) this PR fixes:

NONE

Special notes for your reviewer:

ctx.AllPodLister().List() should be backed by a lister, so it does not add an extra API call to list the pods; they are read from the cache.
In PodInjectionPodListProcessor, if filtering out scheduling gated pods fails, I chose to only log a warning and not return an error. CombinedPodListProcessor stops processing as soon as any processor returns an error, so returning one here would prevent every following processor from running.
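The trade-off in the second note can be sketched with a toy processor chain. Everything below (`processor`, `runAll`, `tolerant`, `alwaysFail`, `appendFake`) is hypothetical and only mimics the stop-on-first-error behaviour described for CombinedPodListProcessor; it is not the autoscaler's API.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// processor is a hypothetical stand-in for a pod list processor.
type processor func(pods []string) ([]string, error)

// runAll stops at the first error, as the note describes for the combined processor.
func runAll(pods []string, procs ...processor) ([]string, error) {
	for _, p := range procs {
		var err error
		if pods, err = p(pods); err != nil {
			return nil, err
		}
	}
	return pods, nil
}

// tolerant wraps a step so that a failure is logged as a warning and the
// input list is passed through unchanged instead of aborting the chain.
func tolerant(name string, step processor) processor {
	return func(pods []string) ([]string, error) {
		out, err := step(pods)
		if err != nil {
			log.Printf("warning: %s failed, continuing: %v", name, err)
			return pods, nil
		}
		return out, nil
	}
}

// alwaysFail simulates a step whose pod listing fails.
func alwaysFail(pods []string) ([]string, error) {
	return nil, errors.New("lister unavailable")
}

// appendFake simulates a later processor that must still run.
func appendFake(pods []string) ([]string, error) {
	return append(pods, "fake-pod"), nil
}

func main() {
	out, err := runAll([]string{"pod-a"}, tolerant("gated pod filter", alwaysFail), appendFake)
	fmt.Println(out, err) // [pod-a fake-pod] <nil>
}
```

Without the `tolerant` wrapper, `runAll` would return the error from `alwaysFail` and `appendFake` would never run.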

Does this PR introduce a user-facing change?

NONE

k8s-ci-robot · 2025-09-29T09:26:58Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: abdelrahman882
Once this PR has been reviewed and has the lgtm label, please assign x13n for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

cluster-autoscaler/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jackfrancis · 2025-09-30T22:46:36Z

cluster-autoscaler/processors/podinjection/pod_group.go

 }

+func filterOutSchedulingGatedPods(groups map[types.UID]podGroup, allPods []*apiv1.Pod) map[types.UID]podGroup {
+	if groups != nil {


Checking for nil seems fine, but if that is the case, do we want to return nil back (L58 below)? tl;dr it doesn't seem like we would ever expect a nil map, so if there's a chance we might get one and want to handle that defensively, we might add a 2nd error response object to the func?

I agree; the function now returns an error. I also added a len(groups) > 0 check before getting the gated pods, so we don't loop over all pods when there are no groups at all.
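A minimal sketch of the shape agreed on in this thread, using simplified stand-in types rather than the real `podGroup` and `*apiv1.Pod`. The function name `filterOutGated`, the field names, and the error message are assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// podGroup and pod are simplified, hypothetical stand-ins for the real types.
type podGroup struct{ desiredReplicas int }

type pod struct {
	ownerUID string
	gated    bool
}

// filterOutGated reports a nil map as an error and skips the scan over
// allPods entirely when there are no groups to adjust.
func filterOutGated(groups map[string]podGroup, allPods []pod) (map[string]podGroup, error) {
	if groups == nil {
		return nil, errors.New("pod groups map is nil")
	}
	if len(groups) == 0 {
		return groups, nil // nothing to adjust, so do not iterate over allPods
	}
	for _, p := range allPods {
		if !p.gated {
			continue
		}
		if g, ok := groups[p.ownerUID]; ok {
			g.desiredReplicas-- // one fewer fake pod for this controller
			groups[p.ownerUID] = g
		}
	}
	return groups, nil
}

func main() {
	groups := map[string]podGroup{"rs-1": {desiredReplicas: 3}}
	out, err := filterOutGated(groups, []pod{{ownerUID: "rs-1", gated: true}, {ownerUID: "rs-1"}})
	fmt.Println(out["rs-1"].desiredReplicas, err) // 2 <nil>
}
```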

cluster-autoscaler/utils/kubernetes/listers.go

cluster-autoscaler/processors/podinjection/pod_injection_processor_test.go

x13n · 2025-10-01T14:50:36Z

cluster-autoscaler/processors/podinjection/pod_injection_processor.go

 	groupedPods := groupPods(append(scheduledPods, unschedulablePods...), controllers)
-	var podsToInject []*apiv1.Pod

+	allPods, err := ctx.AllPodLister().List()


Why list all pods? You can decide if a pod has a scheduling gate or not just by looking at it, no need to cross reference the list of pods with itself I think?

tl;dr We have to check the full list of pods because the one we have doesn't include the gated pods.

The pods sent to the processors are the unschedulable pods plus the unprocessed pods; the gated pods are not included there, which is why we inject fake pods for them.
So we have to list all pods and pick out those that were ignored when the pods were listed earlier.
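For reference, a pod counts as scheduling gated while its spec still lists at least one gate; in the Kubernetes core/v1 API that is the `pod.Spec.SchedulingGates` slice. The sketch below mirrors that field with local stand-in types so it runs without the Kubernetes modules; `gatedPods` is a hypothetical helper, not a function from this PR.

```go
package main

import "fmt"

// Local stand-ins mirroring the relevant core/v1 fields for this sketch.
type schedulingGate struct{ Name string }

type podSpec struct{ SchedulingGates []schedulingGate }

type pod struct {
	Name string
	Spec podSpec
}

// gatedPods returns the pods from the full list that still carry at least
// one scheduling gate, i.e. the pods missing from the processor's input.
func gatedPods(all []pod) []pod {
	var out []pod
	for _, p := range all {
		if len(p.Spec.SchedulingGates) > 0 {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	all := []pod{
		{Name: "running"},
		{Name: "waiting", Spec: podSpec{SchedulingGates: []schedulingGate{{Name: "example.com/quota"}}}},
	}
	for _, p := range gatedPods(all) {
		fmt.Println(p.Name) // waiting
	}
}
```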

cluster-autoscaler/utils/test/test_utils.go

cluster-autoscaler/utils/kubernetes/listers.go

x13n · 2025-10-03T18:08:45Z

cluster-autoscaler/processors/podinjection/pod_group.go

+	for _, podOwnerRef := range pod.OwnerReferences {
+		// SchedulingGated pods can't be unschedualable nor unprocessed nor scheduled so it is not expected
+		// to have them as group sample nor in pod count, so decreasing desiredReplicas by one is enough
+		if grp, found := groups[podOwnerRef.UID]; found {


This should only subtract when podOwnerRef.Controller != nil && *podOwnerRef.Controller to be in sync with updatePodGroups below. But actually, instead of adding replicas and then subtracting them here, wouldn't it suffice to pass scheduling gated pods to groupPods? We're already passing a combined list of scheduled and unschedulable pods there.
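The owner reference condition from this comment can be sketched as a small helper. `ownerReference` is a local stand-in for the metav1 type (whose `Controller` field is a `*bool`), and `controllerUID` is a hypothetical name.

```go
package main

import "fmt"

// ownerReference mirrors the two metav1.OwnerReference fields used here.
type ownerReference struct {
	UID        string
	Controller *bool
}

// controllerUID returns the UID of the owner reference marked as the managing
// controller, using the same nil-safe check quoted in the review comment.
func controllerUID(refs []ownerReference) (string, bool) {
	for _, r := range refs {
		if r.Controller != nil && *r.Controller {
			return r.UID, true
		}
	}
	return "", false
}

func main() {
	isController := true
	refs := []ownerReference{
		{UID: "plain-owner"}, // Controller is nil, so it is skipped
		{UID: "replicaset-1", Controller: &isController},
	}
	fmt.Println(controllerUID(refs)) // replicaset-1 true
}
```

Only a reference that passes this check would have its group's replica count adjusted, which keeps the subtraction consistent with how pods are counted into groups.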

k8s-ci-robot added the release-note-none, kind/bug, cncf-cla: yes, do-not-merge/needs-area, and area/cluster-autoscaler labels Sep 29, 2025

k8s-ci-robot removed the do-not-merge/needs-area label Sep 29, 2025

k8s-ci-robot requested review from feiskyer and x13n September 29, 2025 09:27

k8s-ci-robot added the size/M label (PR changes 30-99 lines, ignoring generated files) Sep 29, 2025

abdelrahman882 force-pushed the proactive-scaleup-schgates branch from 286ba74 to ff346b2 Compare September 29, 2025 09:29

jackfrancis reviewed Sep 30, 2025

abdelrahman882 force-pushed the proactive-scaleup-schgates branch from ff346b2 to 4627816 Compare October 1, 2025 10:47

k8s-ci-robot added the size/L label (PR changes 100-499 lines, ignoring generated files) and removed the size/M label Oct 1, 2025

x13n requested changes Oct 1, 2025

abdelrahman882 force-pushed the proactive-scaleup-schgates branch from 4627816 to 030ad2c Compare October 2, 2025 11:44

abdelrahman882 requested a review from x13n October 2, 2025 12:04

x13n reviewed Oct 3, 2025

Fix proactive scale up injecting fake pods for scheduling gated pods

a281a40

abdelrahman882 force-pushed the proactive-scaleup-schgates branch from 030ad2c to a281a40 Compare October 3, 2025 13:32

abdelrahman882 requested a review from x13n October 3, 2025 13:58

x13n reviewed Oct 3, 2025
