Node removal latency metrics added #8485
base: master
Conversation
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected; please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: ttetyanka. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Welcome @ttetyanka!
Hi @ttetyanka. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from aff3480 to ef7c537.
/ok-to-test
Resolved (outdated) review threads on cluster-autoscaler/core/scaledown/latencytracker/nodelatencytracker.go and cluster-autoscaler/core/scaledown/latencytracker/latencytracker_test.go.
Force-pushed from a302c87 to 8d4c3aa, from 62513e4 to 5f895dc, and from 5f895dc to 1cad3ae.
 drainabilityRules := rules.Default(deleteOptions)

-sdPlanner := planner.New(&ctx, processors, deleteOptions, drainabilityRules)
+sdPlanner := planner.New(&ctx, processors, deleteOptions, drainabilityRules, latencytracker.NewNodeLatencyTracker())
Planner and actuator get different instances; it wouldn't work correctly if tests covered that. I would pass nil or the same instance.
I used nil in all the other tests, but in TestStaticAutoscalerRunOnce I added a mocked instance and verified the method calls.
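Not the PR's actual test code, but a minimal sketch of the kind of hand-rolled fake that the mocked instance above could be: it records ObserveDeletion calls so a TestStaticAutoscalerRunOnce-style test can assert that the shared tracker saw the expected node. The fake's shape, the interface it would satisfy, and the test name are assumptions based on this conversation.

package latencytracker_test

import (
	"testing"
	"time"
)

// fakeLatencyTracker records ObserveDeletion calls so the test can assert on them.
type fakeLatencyTracker struct {
	observedDeletions []string
}

func (f *fakeLatencyTracker) ObserveDeletion(nodeName string, now time.Time) {
	f.observedDeletions = append(f.observedDeletions, nodeName)
}

func TestSharedTrackerObservesDeletion(t *testing.T) {
	fake := &fakeLatencyTracker{}

	// In the real test the same instance would be handed to both the planner
	// and the actuator; here only the actuator-side call is simulated.
	fake.ObserveDeletion("node-1", time.Now())

	if len(fake.observedDeletions) != 1 || fake.observedDeletions[0] != "node-1" {
		t.Fatalf("expected a single ObserveDeletion call for node-1, got %v", fake.observedDeletions)
	}
}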
Resolved (outdated) review threads on cluster-autoscaler/core/scaledown/latencytracker/node_latency_tracker.go and cluster-autoscaler/core/scaledown/latencytracker/node_latency_tracker_test.go.
for name, info := range t.nodes {
	if _, stillUnneeded := currentSet[name]; !stillUnneeded {
		if _, inDeletion := currentlyInDeletion[name]; !inDeletion {
Is it even possible? I would expect that ObserveDeletion would remove these nodes before this point.
If it is possible, we are not reporting such a node at all.
This is possible in the following scenario: for example, a node was marked as unneeded during one autoscaler loop, and we start tracking it. Then, it cannot be deleted because the minimum node pool size is reached, so the node is not deleted and is no longer marked as unneeded. Therefore, we have to remove it from observation.
So we don’t want to report these cases, just silently remove them?
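A minimal sketch of the cleanup step being discussed, assuming the tracker stores the unneeded-since timestamp and threshold per node. It parameterizes the choice this thread raises: report such nodes with deleted="false" versus dropping them silently. The function, field, and callback names are illustrative, not the PR's code.

package latencytracker

import "time"

type trackedNode struct {
	unneededSince time.Time
	threshold     time.Duration
}

// pruneNoLongerUnneeded drops nodes that left the unneeded set without being
// deleted. When reportAborted is true, it emits a deleted="false" observation
// first so aborted scale-downs remain visible in the metric.
func pruneNoLongerUnneeded(
	nodes map[string]trackedNode,
	currentSet map[string]bool,
	currentlyInDeletion map[string]bool,
	now time.Time,
	reportAborted bool,
	observe func(d time.Duration, deleted bool),
) {
	for name, info := range nodes {
		if !currentSet[name] && !currentlyInDeletion[name] {
			if reportAborted {
				d := now.Sub(info.unneededSince) - info.threshold
				if d < 0 {
					d = 0
				}
				observe(d, false)
			}
			delete(nodes, name)
		}
	}
}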
…re currently under deletion
Force-pushed from 052e165 to 40a8d2d, from 40a8d2d to edf05d0, and from edf05d0 to 191c494.
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR introduces a new histogram metric cluster_autoscaler_node_deletion_duration_seconds that tracks the time a node spends marked as unneeded before it either:
is deleted (label deleted="true"), or
becomes needed again without being deleted (label deleted="false").
The observed duration is adjusted by subtracting the configured scale-down threshold to better reflect actual decision latency.
The metric is guarded by a feature flag and is disabled by default.
This adds visibility into scale-down behavior, allowing operators to understand both successful and aborted scale-downs.
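The metric definition itself is not shown in this conversation; as a rough illustration of the shape described above, a histogram with a "deleted" label could be defined with the Prometheus Go client along these lines. The bucket layout, package name, and helper function are assumptions, not the PR's code.

package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var nodeDeletionDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "cluster_autoscaler",
		Name:      "node_deletion_duration_seconds",
		Help:      "Time a node spent marked as unneeded before being deleted or becoming needed again.",
		Buckets:   prometheus.ExponentialBuckets(30, 2, 10), // assumed bucket layout
	},
	[]string{"deleted"},
)

func init() {
	prometheus.MustRegister(nodeDeletionDuration)
}

// ObserveNodeRemovalLatency records the unneeded duration, minus the
// configured scale-down threshold, labelled by whether the node was deleted.
func ObserveNodeRemovalLatency(unneededSince time.Time, threshold time.Duration, deleted bool) {
	adjusted := time.Since(unneededSince) - threshold
	if adjusted < 0 {
		adjusted = 0
	}
	label := "false"
	if deleted {
		label = "true"
	}
	nodeDeletionDuration.WithLabelValues(label).Observe(adjusted.Seconds())
}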
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Introduces NodeLatencyTracker for tracking unneeded nodes.
Records both successful and aborted deletions, with threshold adjustment.
Feature-flagged; no impact unless explicitly enabled.
Includes unit tests covering core scenarios.
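A rough sketch of the tracker's public surface, inferred from the diff and discussion above. Only NewNodeLatencyTracker and ObserveDeletion appear in this conversation; the field names and the reportLatency placeholder are assumptions for illustration, not the PR's implementation.

package latencytracker

import "time"

// tracked holds when a node became unneeded and the scale-down threshold
// to subtract from the observed duration.
type tracked struct {
	unneededSince time.Time
	threshold     time.Duration
}

// NodeLatencyTracker remembers unneeded nodes so the time until their
// deletion (or until they become needed again) can be reported.
type NodeLatencyTracker struct {
	nodes map[string]tracked
}

func NewNodeLatencyTracker() *NodeLatencyTracker {
	return &NodeLatencyTracker{nodes: map[string]tracked{}}
}

// ObserveDeletion reports the threshold-adjusted unneeded duration with
// deleted="true" and stops tracking the node.
func (t *NodeLatencyTracker) ObserveDeletion(nodeName string, now time.Time) {
	info, ok := t.nodes[nodeName]
	if !ok {
		return
	}
	d := now.Sub(info.unneededSince) - info.threshold
	if d < 0 {
		d = 0
	}
	reportLatency(d, true) // stand-in for the actual histogram observation
	delete(t.nodes, nodeName)
}

// reportLatency is a placeholder for the metric emission.
func reportLatency(d time.Duration, deleted bool) {}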
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: