Node removal latency metrics added #8485
Draft: ttetyanka wants to merge 19 commits into kubernetes:master from ttetyanka:feature/deletionlatencytracker
+486
−36
Commits (19)
32c7b57 Node removal latency metrics added (ttetyanka)
ba29eef Update node_deletion_duration_seconds metrics bucket distribution (ttetyanka)
7512a0d Rename files according to convention (ttetyanka)
5058928 removed lock from metrics tracking (ttetyanka)
697d57e Create and use LatencyTracker interface for better testability (ttetyanka)
71be229 Change UpdateStateWithUnneededList logic to also process nodes that a… (ttetyanka)
bb064bd cover planner node deletion latency tracking with test (ttetyanka)
905cb29 Add UpdateThreshold method to ndlt and use it during RemovableAt (ttetyanka)
3bd35bf remove GetUnneededTimeForNode (ttetyanka)
97eb2f5 Move ObserveDeletion to a correct place and test (ttetyanka)
75249a6 Add node latency tracker tests (ttetyanka)
4902e0f Expose GetTrackedNodes in interface for testing (ttetyanka)
09bb88d fix merge errors (ttetyanka)
445bc9f fix linting issues (ttetyanka)
045e739 change name flag from nodeLatencyTrackingEnabled to nodeRemovalLatenc… (ttetyanka)
90180f0 fix failing test (ttetyanka)
5702fde fix rebase issues (ttetyanka)
6966d04 change ndlt to interface (ttetyanka)
191c494 Code review comments addressed (ttetyanka)
cluster-autoscaler/core/scaledown/latencytracker/node_latency_tracker.go
120 changes: 120 additions & 0 deletions
@@ -0,0 +1,120 @@
/*
Copyright 2019 The Kubernetes Authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

package latencytracker

import (
	"time"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/autoscaler/cluster-autoscaler/metrics"
	"k8s.io/klog/v2"
)

// LatencyTracker defines the interface for tracking node removal latency.
// Implementations record when nodes become unneeded, observe deletion events,
// and expose thresholds for measuring node removal duration.
type LatencyTracker interface {
	ObserveDeletion(nodeName string, timestamp time.Time)
	UpdateStateWithUnneededList(list []*apiv1.Node, currentlyInDeletion map[string]bool, timestamp time.Time)
	UpdateThreshold(nodeName string, threshold time.Duration)
	GetTrackedNodes() []string
}
type nodeInfo struct {
	unneededSince time.Time
	threshold     time.Duration
}

// NodeLatencyTracker is a concrete implementation of LatencyTracker.
// It keeps track of nodes that are marked as unneeded, when they became unneeded,
// and thresholds to adjust node removal latency metrics.
type NodeLatencyTracker struct {
	nodes map[string]nodeInfo
}

// NewNodeLatencyTracker creates a new tracker.
func NewNodeLatencyTracker() *NodeLatencyTracker {
	return &NodeLatencyTracker{
		nodes: make(map[string]nodeInfo),
	}
}

// UpdateStateWithUnneededList records unneeded nodes and handles missing ones.
func (t *NodeLatencyTracker) UpdateStateWithUnneededList(
	list []*apiv1.Node,
	currentlyInDeletion map[string]bool,
	timestamp time.Time,
) {
	currentSet := make(map[string]struct{}, len(list))
	for _, node := range list {
		currentSet[node.Name] = struct{}{}

		if _, exists := t.nodes[node.Name]; !exists {
			t.nodes[node.Name] = nodeInfo{
				unneededSince: timestamp,
				threshold:     0,
			}
			klog.V(4).Infof("Started tracking unneeded node %s at %v", node.Name, timestamp)
		}
	}

	for name, info := range t.nodes {
		if _, stillUnneeded := currentSet[name]; !stillUnneeded {
			if _, inDeletion := currentlyInDeletion[name]; !inDeletion {
				duration := timestamp.Sub(info.unneededSince)
				metrics.UpdateScaleDownNodeRemovalLatency(false, duration-info.threshold)
				delete(t.nodes, name)
				klog.V(4).Infof("Node %q reported as deleted/missing (unneeded for %s, threshold %s)",
					name, duration, info.threshold)
			}
		}
	}
}

// ObserveDeletion is called by the actuator just before node deletion.
func (t *NodeLatencyTracker) ObserveDeletion(nodeName string, timestamp time.Time) {
	if info, exists := t.nodes[nodeName]; exists {
		duration := timestamp.Sub(info.unneededSince)

		klog.V(4).Infof(
			"Observing deletion for node %s, unneeded for %s (threshold was %s).",
			nodeName, duration, info.threshold,
		)

		metrics.UpdateScaleDownNodeRemovalLatency(true, duration-info.threshold)
		delete(t.nodes, nodeName)
	}
}

// UpdateThreshold updates the scale-down threshold for a tracked node.
func (t *NodeLatencyTracker) UpdateThreshold(nodeName string, threshold time.Duration) {
	if info, exists := t.nodes[nodeName]; exists {
		info.threshold = threshold
		t.nodes[nodeName] = info
		klog.V(4).Infof("Updated threshold for node %q to %s", nodeName, threshold)
	} else {
		klog.Warningf("Attempted to update threshold for unknown node %q", nodeName)
	}
}

// GetTrackedNodes returns the names of all nodes currently tracked as unneeded.
func (t *NodeLatencyTracker) GetTrackedNodes() []string {
	names := make([]string, 0, len(t.nodes))
	for name := range t.nodes {
		names = append(names, name)
	}
	return names
}
Is this even possible? I would expect ObserveDeletion to have removed these nodes already.
If it is possible, we are not reporting such nodes at all.

This is possible in the following scenario: a node is marked as unneeded during one autoscaler loop, and we start tracking it. It then cannot be deleted because the minimum node pool size has been reached, so it is no longer marked as unneeded. Therefore, we have to remove it from observation.

So we don’t want to report these cases, just silently remove them?