Node removal latency metrics added #8485
base: master
Conversation
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected; please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: ttetyanka. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Welcome @ttetyanka!
Hi @ttetyanka. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from aff3480 to ef7c537.
/ok-to-test
Resolved (outdated) review threads on cluster-autoscaler/core/scaledown/latencytracker/nodelatencytracker.go and cluster-autoscaler/core/scaledown/latencytracker/latencytracker_test.go.
Force-pushed from a302c87 to 8d4c3aa, from 62513e4 to 5f895dc, and from 5f895dc to 1cad3ae.
 drainabilityRules := rules.Default(deleteOptions)

-sdPlanner := planner.New(&ctx, processors, deleteOptions, drainabilityRules)
+sdPlanner := planner.New(&ctx, processors, deleteOptions, drainabilityRules, latencytracker.NewNodeLatencyTracker())
Planner and actuator get different instances; it wouldn't work correctly if tests covered that. I would pass nil or the same instance.
I used nil in all the other tests, but in TestStaticAutoscalerRunOnce I added a mocked instance and verified the method calls.
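Not the PR's actual test code, but a minimal sketch of the kind of hand-rolled fake that the mocked instance above could be: it records ObserveDeletion calls so a TestStaticAutoscalerRunOnce-style test can assert that the shared tracker saw the expected node. The fake's shape, the interface it would satisfy, and the test name are assumptions based on this conversation.

package latencytracker_test

import (
	"testing"
	"time"
)

// fakeLatencyTracker records ObserveDeletion calls so the test can assert on them.
type fakeLatencyTracker struct {
	observedDeletions []string
}

func (f *fakeLatencyTracker) ObserveDeletion(nodeName string, now time.Time) {
	f.observedDeletions = append(f.observedDeletions, nodeName)
}

func TestSharedTrackerObservesDeletion(t *testing.T) {
	fake := &fakeLatencyTracker{}

	// In the real test the same instance would be handed to both the planner
	// and the actuator; here only the actuator-side call is simulated.
	fake.ObserveDeletion("node-1", time.Now())

	if len(fake.observedDeletions) != 1 || fake.observedDeletions[0] != "node-1" {
		t.Fatalf("expected a single ObserveDeletion call for node-1, got %v", fake.observedDeletions)
	}
}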
Resolved (outdated) review threads on cluster-autoscaler/core/scaledown/latencytracker/node_latency_tracker.go and cluster-autoscaler/core/scaledown/latencytracker/node_latency_tracker_test.go.
for name, info := range t.nodes {
	if _, stillUnneeded := currentSet[name]; !stillUnneeded {
		if _, inDeletion := currentlyInDeletion[name]; !inDeletion {
Is it even possible? I would expect that ObserveDeletion would remove these nodes before this point.
If it is possible, we are not reporting such a node at all.
This is possible in the following scenario: for example, a node was marked as unneeded during one autoscaler loop, and we start tracking it. Then, it cannot be deleted because the minimum node pool size is reached, so the node is not deleted and is no longer marked as unneeded. Therefore, we have to remove it from observation.
So we don’t want to report these cases, just silently remove them?
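A minimal sketch of the cleanup step being discussed, assuming the tracker stores the unneeded-since timestamp and threshold per node. It parameterizes the choice this thread raises: report such nodes with deleted="false" versus dropping them silently. The function, field, and callback names are illustrative, not the PR's code.

package latencytracker

import "time"

type trackedNode struct {
	unneededSince time.Time
	threshold     time.Duration
}

// pruneNoLongerUnneeded drops nodes that left the unneeded set without being
// deleted. When reportAborted is true, it emits a deleted="false" observation
// first so aborted scale-downs remain visible in the metric.
func pruneNoLongerUnneeded(
	nodes map[string]trackedNode,
	currentSet map[string]bool,
	currentlyInDeletion map[string]bool,
	now time.Time,
	reportAborted bool,
	observe func(d time.Duration, deleted bool),
) {
	for name, info := range nodes {
		if !currentSet[name] && !currentlyInDeletion[name] {
			if reportAborted {
				d := now.Sub(info.unneededSince) - info.threshold
				if d < 0 {
					d = 0
				}
				observe(d, false)
			}
			delete(nodes, name)
		}
	}
}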
…re currently under deletion
Force-pushed from 052e165 to 40a8d2d, from 40a8d2d to edf05d0, and from edf05d0 to 191c494.
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR introduces a new histogram metric cluster_autoscaler_node_deletion_duration_seconds that tracks the time a node spends marked as unneeded before it either:
is deleted (label deleted="true"), or
becomes needed again without being deleted (label deleted="false").
The observed duration is adjusted by subtracting the configured scale-down threshold to better reflect actual decision latency.
The metric is guarded by a feature flag and is disabled by default.
This adds visibility into scale-down behavior, allowing operators to understand both successful and aborted scale-downs.
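The metric definition itself is not shown in this conversation; as a rough illustration of the shape described above, a histogram with a "deleted" label could be defined with the Prometheus Go client along these lines. The bucket layout, package name, and helper function are assumptions, not the PR's code.

package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var nodeDeletionDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "cluster_autoscaler",
		Name:      "node_deletion_duration_seconds",
		Help:      "Time a node spent marked as unneeded before being deleted or becoming needed again.",
		Buckets:   prometheus.ExponentialBuckets(30, 2, 10), // assumed bucket layout
	},
	[]string{"deleted"},
)

func init() {
	prometheus.MustRegister(nodeDeletionDuration)
}

// ObserveNodeRemovalLatency records the unneeded duration, minus the
// configured scale-down threshold, labelled by whether the node was deleted.
func ObserveNodeRemovalLatency(unneededSince time.Time, threshold time.Duration, deleted bool) {
	adjusted := time.Since(unneededSince) - threshold
	if adjusted < 0 {
		adjusted = 0
	}
	label := "false"
	if deleted {
		label = "true"
	}
	nodeDeletionDuration.WithLabelValues(label).Observe(adjusted.Seconds())
}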
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Introduces NodeLatencyTracker for tracking unneeded nodes.
Records both successful and aborted deletions, with threshold adjustment.
Feature-flagged; no impact unless explicitly enabled.
Includes unit tests covering core scenarios.
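A rough sketch of the tracker's public surface, inferred from the diff and discussion above. Only NewNodeLatencyTracker and ObserveDeletion appear in this conversation; the field names and the reportLatency placeholder are assumptions for illustration, not the PR's implementation.

package latencytracker

import "time"

// tracked holds when a node became unneeded and the scale-down threshold
// to subtract from the observed duration.
type tracked struct {
	unneededSince time.Time
	threshold     time.Duration
}

// NodeLatencyTracker remembers unneeded nodes so the time until their
// deletion (or until they become needed again) can be reported.
type NodeLatencyTracker struct {
	nodes map[string]tracked
}

func NewNodeLatencyTracker() *NodeLatencyTracker {
	return &NodeLatencyTracker{nodes: map[string]tracked{}}
}

// ObserveDeletion reports the threshold-adjusted unneeded duration with
// deleted="true" and stops tracking the node.
func (t *NodeLatencyTracker) ObserveDeletion(nodeName string, now time.Time) {
	info, ok := t.nodes[nodeName]
	if !ok {
		return
	}
	d := now.Sub(info.unneededSince) - info.threshold
	if d < 0 {
		d = 0
	}
	reportLatency(d, true) // stand-in for the actual histogram observation
	delete(t.nodes, nodeName)
}

// reportLatency is a placeholder for the metric emission.
func reportLatency(d time.Duration, deleted bool) {}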
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: