
@ttetyanka ttetyanka commented Aug 28, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR introduces a new histogram metric, cluster_autoscaler_node_deletion_duration_seconds, that tracks the time a node spends marked as unneeded before it either:

- is deleted (label deleted="true"), or
- becomes needed again without deletion (label deleted="false").

The observed duration is adjusted by subtracting the configured scale-down threshold to better reflect actual decision latency.

The metric is guarded by a feature flag and is disabled by default.

This adds visibility into scale-down behavior, allowing operators to understand both successful and aborted scale-downs.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

- Introduces NodeLatencyTracker for tracking unneeded nodes.
- Records both successful and aborted deletions, with threshold adjustment.
- Feature-flagged; no impact unless explicitly enabled.
- Includes unit tests covering core scenarios.

Does this PR introduce a user-facing change?

Cluster Autoscaler adds a new Prometheus histogram metric (behind a feature flag):
`cluster_autoscaler_node_deletion_duration_seconds` — duration from when a node is marked as unneeded until it is either deleted (`deleted="true"`) or becomes needed again (`deleted="false"`).  
Reported values are adjusted by subtracting the configured scale-down threshold.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 28, 2025
@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label Aug 28, 2025

linux-foundation-easycla bot commented Aug 28, 2025

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ttetyanka
Once this PR has been reviewed and has the lgtm label, please assign towca for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Welcome @ttetyanka!

It looks like this is your first PR to kubernetes/autoscaler 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/autoscaler has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 28, 2025
@k8s-ci-robot
Contributor

Hi @ttetyanka. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 28, 2025
@ttetyanka ttetyanka force-pushed the feature/deletionlatencytracker branch from aff3480 to ef7c537 Compare August 28, 2025 14:34
@elmiko
Contributor

elmiko commented Aug 28, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 28, 2025
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Sep 17, 2025
@ttetyanka ttetyanka force-pushed the feature/deletionlatencytracker branch from a302c87 to 8d4c3aa Compare September 24, 2025 07:24
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 24, 2025
@ttetyanka ttetyanka force-pushed the feature/deletionlatencytracker branch 3 times, most recently from 62513e4 to 5f895dc Compare September 24, 2025 15:51
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 25, 2025
@ttetyanka ttetyanka force-pushed the feature/deletionlatencytracker branch from 5f895dc to 1cad3ae Compare September 28, 2025 10:03
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Sep 28, 2025
  drainabilityRules := rules.Default(deleteOptions)

- sdPlanner := planner.New(&ctx, processors, deleteOptions, drainabilityRules)
+ sdPlanner := planner.New(&ctx, processors, deleteOptions, drainabilityRules, latencytracker.NewNodeLatencyTracker())


Planner and actuator get different instances; this wouldn't work correctly if the tests covered that path. I would pass nil or the same instance.

Author


Used nil in all the other tests; in TestStaticAutoscalerRunOnce, I added a mocked instance and verified the method calls.


for name, info := range t.nodes {
	if _, stillUnneeded := currentSet[name]; !stillUnneeded {
		if _, inDeletion := currentlyInDeletion[name]; !inDeletion {


Is this even possible? I would expect ObserveDeletion to have removed these nodes already.
If it is possible, we are not reporting such nodes at all.

Author


This is possible in the following scenario: a node is marked as unneeded during one autoscaler loop, and we start tracking it. It then cannot be deleted because the minimum node pool size has been reached, so the node is not deleted and is no longer marked as unneeded. Therefore, we have to remove it from observation.

So we don’t want to report these cases, just silently remove them?

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 2, 2025
@ttetyanka ttetyanka force-pushed the feature/deletionlatencytracker branch from 052e165 to 40a8d2d Compare October 2, 2025 09:04
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 2, 2025
@ttetyanka ttetyanka force-pushed the feature/deletionlatencytracker branch from 40a8d2d to edf05d0 Compare October 2, 2025 09:22
@ttetyanka ttetyanka force-pushed the feature/deletionlatencytracker branch from edf05d0 to 191c494 Compare October 2, 2025 09:23
Labels
area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

4 participants