KEP-5598: Opportunistic scheduling cache #5599
Conversation
bwsalmon commented on Oct 1, 2025:
- One-line PR description: First version of the KEP for scheduler cache
- Issue link: Opportunistic scheduling cache #5598
- Other comments:
Welcome @bwsalmon!
Hi @bwsalmon. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label.
node off the list. We then go down the nominated node path. Just as we would with a pod
with a NominatedNodeName in it, we only re-evaluate the feasibility of this node before using it.

Since we assume 1-pod-per-node, we know that the node used by the current pod
How are we going to verify that it's indeed a 1-pod-per-node configuration?
I think that just relying on users is not enough here...
So the current thought is this (I'll make this more explicit here).
The user has to opt in with the scheduling profile; we should mention this clearly there.
If they get it wrong it will not necessarily lead to incorrect behavior, just a somewhat shifted scoring. So I think the cost is comparatively low.
I'm open to more tracking if we think that is appropriate. We can tell if a pod class is 1-pod-per-node, so we can flag that case if we'd like, and we should also be able to tell if multiple non-DaemonSet pods land on the same nodes after the fact.
Thoughts on how much you think is necessary?
As mentioned offline, I'm personally OK with the scheduling profile:
- an explicit scheduling profile allows us to explicitly "opt in" a specific workload to this behavior
- given we do feasibility validation anyway for the result from the cache, we won't break correctness here
There is a concern that @sanposhiho brought up during today's SIG Scheduling meeting that we can probably summarize as:
What if the cluster state changed in the meantime? (2) ensures that we won't schedule on an infeasible node, but if some pods were terminated in the meantime (or nodes appeared), there may now be a spot in the cluster for our pod that is better from a scoring perspective but is not in the cache (because it wasn't feasible when the cache was computed).
My personal answer to that would be: given that it's a dedicated scheduling profile that users opt in to, it's probably OK (we don't mess up feasibility, we just potentially choose a non-optimal node).
FWIW, given that we only look for N feasible nodes for scoring anyway, for large enough clusters we can always end up with a non-optimal node (so I personally think this is not a problem even without a dedicated profile, but clearly that is more contentious).
So I would be OK with the scheduling profile. But we need input from @sanposhiho and @dom4ha
What do you both mean by "scheduling profile"? The scheduler config or a new API in Pod? If it's the scheduler config, are we going to assume all pods in the cluster are 1-pod-per-node scheduling, without checking anything actual?
By "scheduling profile" we do mean the scheduler config. So the user would need to create a scheduler config with a flag set and then reference it in the pod definitions. Thoughts on better ways to handle it?
@bwsalmon - please sign the CLA (and I'm happy to take the production readiness approval for it)
Cool. I tried to, but it forwarded me to some Google reps to sign it... I sent the request but haven't heard back. Maybe I did it wrong?
If plugin changes prove to be an issue, we could codify the signature as a new "Scheduling" object that only has a subset
of the fields of the pod. Plugins that "opt-in" could only be given access to this reduced scheduling object, and we could then use the entire scheduling object as the signature. This would make it more or less impossible for the signature and plugins to be out of sync, and would
naturally surface new dependencies as additions to the scheduling object. However, as we expect plugin changes to be relatively
modest, we don't believe the complexity of making the interface changes is worth the risk today.
Duplicate paragraph R323-R326
Yeah, it seemed like it fit both places but definitely isn't ideal.
Thoughts on where to keep it?
I think here is better.
* VolumeBinding: Same as NodeVolumeLimits.
* VolumeRestrictions: Same as NodeVolumeLimits.
* VolumeZone: Same as NodeVolumeLimits.
Do we need to take defaultPreemption into consideration?
Hmmm, good point. Yes, I feel like we should, but I didn't find the reference in our plugins.
I assume we will need the priority class as well.
Let me figure out where in the code this is handled and we can decide how to add it in...
Ok, I took a look through the code and came to an answer that is somewhat counter-intuitive to me.
It doesn't look like we need to include DefaultPreemption, and thus priority, in the signature, because we don't need to include PostFilter plugins. I did make that decision early on, but I'm re-digesting it now.
This is because we are caching the result after PreFilter, Filter, PreScore and Score, but before the PostFilter step. PostFilter only runs if we can't find anything after the previous steps; then we use the priority to decide if we should try to preempt something.
So regardless of the priority, we would (correctly) get the same results from the cache. But if the results were empty, we would potentially evict in one case and not in the other.
If you get a chance, please take a look and make sure this matches your understanding...
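To make the ordering concrete, here's a toy, self-contained sketch of one scheduling attempt under the flow described above. Everything in it (the function shape, the stand-in phase callbacks, the cache map) is an illustrative assumption, not the scheduler framework API; the point is only that the cache is read before Filter and written after Score, so PostFilter (and therefore priority) never influences what gets cached.

```go
// Toy sketch of the ordering argument: cache read before Filter, cache write
// after Score, PostFilter (preemption, where priority matters) only on an
// empty feasible set. All names are illustrative stand-ins.
package main

import "fmt"

type result struct{ node string }

func scheduleOne(
	sig string,
	cache map[string][]string, // signature -> leftover ranked nodes
	filterAndScore func() []string, // stands in for PreFilter/Filter/PreScore/Score
	postFilter func() result, // stands in for preemption; priority matters only here
) result {
	// 1. Cache hit: reuse a previously ranked node (nominated-node style path;
	//    the real flow would still re-check feasibility of this node).
	if nodes := cache[sig]; len(nodes) > 0 {
		cache[sig] = nodes[1:]
		return result{node: nodes[0]}
	}
	// 2. Cache miss: run the normal phases.
	ranked := filterAndScore()
	if len(ranked) == 0 {
		// 3. Nothing feasible: only now does priority/preemption come into play,
		//    which is why DefaultPreemption need not be part of the signature.
		return postFilter()
	}
	// 4. Cache the leftovers; PostFilter never ran, so it cannot affect them.
	cache[sig] = ranked[1:]
	return result{node: ranked[0]}
}

func main() {
	cache := map[string][]string{}
	fs := func() []string { return []string{"node-a", "node-b"} }
	pf := func() result { return result{} }
	fmt.Println(scheduleOne("sig-1", cache, fs, pf)) // {node-a}: full cycle, leftovers cached
	fmt.Println(scheduleOne("sig-1", cache, fs, pf)) // {node-b}: cache hit
}
```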
extending the production code to implement this enhancement.
-->

- `<package>`: `<date>` - `<test coverage>`
Please add the test files that this KEP will add/update, along with the current coverage for the existing ones.
Done. Especially since this is my first KEP, PTAL and let me know if I'm capturing the appropriate information...
- a search in the Kubernetes bug triage tool (https://storage.googleapis.com/k8s-triage/index.html)
-->

- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature)
Same, I expect we will add new integration tests as well.
Done. As with unit tests, PTAL and make sure I am meeting the expected guidelines...
[existing list]: https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/
-->

- [ ] Feature gate (also fill in values in `kep.yaml`)
Please add the feature gate and update it in the kep.yaml as well.
Done.
/cc @sanposhiho @macsko
/label lead-opted-in
I did that a long time ago, but historically it didn't require any interactions...
Honestly, for me, there are still a lot of question marks. Left comments.
some other pod, but this should be the only issue.

All stored results are timestamped, and we remove entries when they are more
than a few seconds old to ensure we do not get stale data. Since we are targeting
How long is "a few seconds"? Given the scheduler handles ~300 pods per second, even if this TTL is 1 sec, the cluster state is changed by, at the very least, ~300 new running pods. Of course, on top of that, there are tons more cluster state changes, especially when it comes to a huge cluster that this KEP is aiming at. Do we really want to keep using the same scheduling results?
Yes, we are talking about a few seconds (I'd say between 1 and 10), and thus 100s to 1000s of new pods, which keep the cache up to date as they flow through.
What other changes (besides pod death) should we expect?
I see you mentioned label and taint changes, which makes a lot of sense, and we should discuss there...
Other things you would be concerned about?
If we scope the cache to a single workload and consider it a cluster-state snapshot made at the time we schedule the first pod, the problem of invalidating the results is not as severe as it seems.
Yes, without setting NNN right away there is a greater chance that one-cycle scheduling can be more easily broken by other cluster changes happening in the meantime, but eventually each pod goes through its own complete pod-by-pod scheduling cycle, so the node it was originally assigned to will be re-evaluated for whether it's still feasible.
Sure, it's a bit more problematic if it turns out not to be feasible anymore, since full filtering and scoring then needs to take place and the final assignment gets corrected. It may not be consistent with the initial placement, but since we don't consider any cross-pod requirements yet (TAS, shared ResourceClaims, etc.), we don't have to care.
Once we start supporting cross-pod requirements, a failure to pass pod-by-pod scheduling would be a signal to reschedule the whole workload (go back to the single-cycle scheduling phase), but this is the next step.
* NodeResourcesFit: We use the output of the computePodResourceRequest function as the signature.
* NodeUnschedulable: We use the Tolerations field as the signature.
* NodeVolumeLimits: We use all Volume information except from Volumes of type ConfigMap or Secret.
* PodTopologySpread: If the PodTopologySpread field is set, or it is not set but a default set of rules is applied, we mark the pod unsignable; otherwise it returns an empty signature.
a default set of rules are applied, we mark the pod unsignable
Meaning, all pods that belong to Service, ReplicaSet, StatefulSet or ReplicationController are not eligible to use the cache. That's a large portion.
Unless the user overrides the defaults, yes. This is definitely the biggest restriction in the model today. I think it is relaxable with a more focused version of pod spread, but it may not be worth addressing given that multi-pod scheduling probably addresses the issue more naturally.
Thoughts?
Meaning, all pods that belong to Service, ReplicaSet, StatefulSet or ReplicationController are not eligible to use the cache. That's a large portion.
Not necessarily. The workloads that would benefit the most are Jobs, JobSets, LWS, etc., which AFAIK cannot have default topology spread.
I know this proposal is mostly for Jobs etc. (given it mentions ML workloads). Like I mentioned in the other thread, I'm talking about how small the target of this feature would be.
In order to utilize this feature, pods must:
- not belong to a Service, ReplicaSet, StatefulSet or ReplicationController, i.e., this feature is not useful for typical K8s clusters hosting a backend app etc.
- largely use a 1-pod-per-node scheduling strategy.
- not use any complex scheduling constraints.
- [in the future] not use gang scheduling.
Why do we need to invest our time to implement/maintain it only for users that luckily have pods matching the above conditions?
I'd rather reverse what you've just said.
The kube-scheduler is useful for clusters which run Services, ReplicaSets, StatefulSets etc., but it does not do a good job when it needs to schedule big Jobs and run ML workloads, since it has to support the features developed for the former.
This is why there are so many workload schedulers that do scheduling better: they can ignore these problematic scheduling constraints. Sure, we are working on gang scheduling, but reusing filtering and scoring results would be a foundation of most gang scheduling algorithms, so I have no doubts that this feature is a step in the right direction, although I agree that in its simplest form it's hard to notice its value.
If we scope the cache to a single workload, it becomes more apparent that the signature framework allows us to detect workloads and gangs implicitly even if our new workload object wasn't created upfront by a user. So we can use the single-cycle scheduling capabilities opportunistically, without enforcing minCount, since it's not known until we get an explicit workload definition from a user or controller.
What I'm describing is a more extended scope which we did not want to implement in the simplest form, but it's an alternative that is on the table.
Do you assume all pods in the cluster have a 1-pod-per-node scheduling constraint? Or could it be mixed, with some pods using 1-pod-per-node scheduling while others use normal scheduling, where multiple pods can land on the same node?
No, we just assume that pods that use the cache are 1-pod-per-node. So if some other (non-1-pod-per-node) workload uses a node it just gets removed from the cache and ignored by the 1-pod-per-node workloads going through the cache.
So the other workloads obviously don't get any advantage from the cache, but they also won't be hurt by it...
Added this to the text.
Closing comment, but feel free to reopen if I missed something...
No, we just assume that pods that use the cache are 1-pod-per-node. So if some other (non-1-pod-per-node) workload uses a node it just gets removed from the cache and ignored by the 1-pod-per-node workloads going through the cache.
Please add an explanation of how we determine that a pod is doing 1-pod-per-node.
The current proposal is that we determine a workload is 1-pod-per-node by requiring the user to mark it as such, by targeting it at a custom scheduler configuration with this option enabled. We can revisit this if we feel it is insufficient.
We can potentially determine if a workload is 1-pod-per-workload, i.e. that we will never schedule another pod of this kind on the same node; we can even potentially (but with a lot of complexity) tell if there is any other current pod that could run alongside this pod, but there is no way to tell that the user will never create a workload in the future that will be able to run alongside this workload.
I would suggest that if we feel the current proposal is insufficient, we could add a constraint on the pod being 1-pod-per-workload (not 1-pod-per-node). However, this adds more constraints and it isn't clear to me that it would actually provide stronger guarantees than the current model.
I would also say: it isn't clear to me that this would only benefit 1-pod-per-node workloads, only that it would be identical to the current scheduler for 1-pod-per-node workloads. We will never provide an infeasible solution; we will only, in some cases, spread out workloads more than they might have been with the current scheduler. I'm not sure this is actually going to be a problem, but I also don't have strong data confirming that it won't be. The hope would be to gather this kind of information through a deployment.
### Goals

* Improve the performance of scheduling large jobs on large clusters where the constraints are simple.
* Begin building infrastructure to support gang scheduling and other "multi-pod" scheduling requests.
I think it was also mentioned in our sig-scheduling meeting. But how would it fit our future picture of gang scheduling? I imagine a separate phase would decide all placements for the gang pods at once before proceeding to the existing filter/score/etc. phases.
Are we sure that this cache mechanism would still remain useful?
Even further, is it really something we should focus on/invest our time in now?
Yep, these are excellent questions; I tried to address them but it looks like I need to make it more explicit.
So I think we expect the cache itself will be deprecated by gang scheduling; we won't need to cache between cycles because we will just handle the pods in the same cycle. This is why we have kept the cache relatively simple and focused on a particular use case.
But the signature and reuse mechanisms should remain useful. Unless we plan to rewrite all of our existing plugins we will need to evaluate a "representative" pod against all the nodes and then use the results from this pod for all of the pods in the same "class". This will still require the signature and a similar "reuse" logic to that of the cache.
The reason to invest in this now is to both build the signature framework and start to get familiar with the advantages and challenges of evaluating a group of pods against a set of nodes once, with at least some production exposure.
But I'd love to hear your thoughts.
Another way to see it is this: the cache is a "Trojan Horse" for injecting multi-pod scheduling into the system without having to bite off the full interface and new scheduling cycles. It should provide immediate benefit to a focused set of customers, but the primary purpose (from my perspective, anyway) is to start getting production understanding of how multi-pod scheduling works in practice, and to start building frameworks that will support multi-pod scheduling.
Also, just to be clear: I've been working on this KEP in close coordination with Eric, the author of the workload KEP, in addition (of course) to Dominik.
So I think we expect the cache itself will be deprecated by gang scheduling; we won't need to cache between cycles because we will just handle the pods in the same cycle. This is why we have kept the cache relatively simple and focused on a particular use case.
Right, and that's the core reason I'm doubting why we need to do this today, given we know gang scheduling is coming.
Note that this cache mechanism is definitely not going to be that simple, implementation-wise.
But the signature and reuse mechanisms should remain useful.
For whom? This feature's target is already very small even at this point (because of many limitations), and after gang scheduling, how large a portion of pods in this world can get the benefit? And is that worthy enough for us to keep maintaining the feature?
I'm already not a fan of the idea of introducing an optimization specifically for 1-pod-per-node scheduling, even with many other limitations.
The reason to invest in this now is to both build the signature framework and start to get familiar with the advantages and challenges of evaluating a group of pods against a set of nodes once, with at least some production exposure.
..
Another way to see it is this: the cache is a "Trojan Horse" for injecting multi-pod scheduling into the system
Hmm, I don't understand. For me, this feature and a future gang scheduling just look like two different things, and we're taking an unnecessary risk/effort here to try something not that similar, which we know will be less useful after gang scheduling arrives.
Hmm, I don't understand. For me, this feature and a future gang scheduling just look like two different things, and we're taking a unnecessary risk/effort here to try something not that similar, which we know will be less useful after the gang scheduling arrives.
Even if we have single-cycle gang scheduling, the process of finding a placement may be very complicated in the most generic case. We don't have any proposal on the table for how it could ultimately look, and it will be the most challenging and exciting part, tackling an NP-hard/NP-complete scheduling problem.
At the same time, we know that in this special case we can safely reuse both filtering and scoring results, and so achieve scheduling in O(N) complexity instead of O(N×M). Still, not every workload is eligible for it, so the mechanism of signatures would still be needed to discover that.
Workloads without the 1-pod-per-node usage pattern could only reuse filtering results, and would repeat scoring after every pod. So in any case, the ability to discover what type of workload we are dealing with could drive a different algorithm, while this one seems the fastest.
So the question of how many customers could make use of it is valid, and even finding an early adopter would be useful to assess it.
I know it will work for some limited number of workloads. I'm talking about what % of Kubernetes clusters will get benefit from this feature (see the conditions that must be satisfied to use this feature) and whether it really makes sense, as a community, to maintain this specific optimization in k/k.
Workloads without the 1-pod-per-node usage pattern could only reuse filtering results, and would repeat scoring after every pod.
How can workloads without 1-pod-per-node reuse the filtering results? If it's 1-pod-per-node scheduling and pod#1 is scheduled on node#1, we can simply remove node#1 from the cache. But if it's not 1-pod-per-node scheduling, we're not sure whether we should remove node#1 from the list or not.
I'm talking about what % of Kubernetes clusters will get benefit from this feature
+1
How can workloads without 1-pod-per-node reuse the filtering results? If it's 1-pod-per-node scheduling and pod#1 is scheduled on node#1, we can simply remove node#1 from the cache. But if it's not 1-pod-per-node scheduling, we're not sure whether we should remove node#1 from the list or not.
The signature filters out pod scheduling features which can cause invalidation/extension of the feasible nodes (the filtering results) and their scores. This way we know that placing a pod on one of the nodes does not require recomputing them for the other nodes. These might be the vast majority of cases for large workloads (which need gang scheduling), since pod affinities and the like are known for significantly slowing down scheduling.
So without the 1-pod-per-node constraint we would only need to recompute feasibility and score on the node which was just taken, before placing another pod there. This actually does not sound like a big degradation compared to the initial proposal, so do we really need to detect 1-pod-per-node as a special case? It would also imply keeping the cache itself.
The inter-pod affinity and anti-affinity features are more complex to support, since they may influence both feasibility and scoring results after each placement.
I'm completely open to just saying this is for any workload. I agree that the difference in scoring is likely not a big issue; at worst we will spread pods out more than we would today, which it sounds like we are already trying to do anyway.
The restriction to 1-pod-per-node is meant as a way to be cautious; in this case we know the behavior will be identical to the current mechanisms, whereas for multi-pod-per-node cases the scoring would differ somewhat, even if the feasibility would still be identical.
The big thing from my perspective is that I don't think any of the work for 1-pod-per-node will be wasted. If we decide we really do need to re-evaluate scores and keep the nodes around, we can add that with incremental extra work.
* Topology spread rules (including inherited rules from the system default). This constraint we should attempt to lift in the future.

To construct a signature, we add a new function for each plugin to implement.
This function takes a pod and generates a signature for that plugin as a string.
Can you add an explanation of how the interface will look in the KEP?
Yes, good call. I'll pull in the signatures from the draft version.
Ok, added the draft implementation here in the text, PTAL...
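Since the draft interface itself isn't reproduced in this thread, here is a rough sketch of the shape such a per-plugin hook could take. The names (`SignaturePlugin`, `BuildPodSignature`, `ErrUnsignable`) and the simple string-concatenation scheme are assumptions for illustration only, not the KEP's draft API.

```go
// Package cachesig sketches the per-plugin signature hook discussed above.
// All identifiers here are illustrative assumptions, not the draft API.
package cachesig

import (
	"errors"
	"fmt"
	"strings"

	v1 "k8s.io/api/core/v1"
)

// ErrUnsignable marks a pod whose filter/score results must not be reused,
// e.g. because it uses pod (anti-)affinity or inherits topology spread defaults.
var ErrUnsignable = errors.New("pod cannot be signed")

// SignaturePlugin would be implemented by each plugin that participates in
// caching: it reduces the pod fields the plugin actually inspects to a stable string.
type SignaturePlugin interface {
	Name() string
	Signature(pod *v1.Pod) (string, error)
}

// BuildPodSignature concatenates the per-plugin signatures into the key used
// to index cached filter/score results. Any unsignable pod bypasses the cache.
func BuildPodSignature(pod *v1.Pod, plugins []SignaturePlugin) (string, error) {
	parts := make([]string, 0, len(plugins))
	for _, p := range plugins {
		sig, err := p.Signature(pod)
		if err != nil {
			return "", fmt.Errorf("%s: %w", p.Name(), err)
		}
		parts = append(parts, p.Name()+"="+sig)
	}
	return strings.Join(parts, ";"), nil
}
```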
* Improve the performance of scheduling large jobs on large clusters where the constraints are simple.
Where's the bottleneck for the scheduling you are imagining in this cluster? If pods are that simple, most plugins should just return Skip at the PreXXXX phase.
The bottleneck is just iterating through all the nodes and checking; once you have 100,000 nodes, just iterating through them and checking labels becomes pretty expensive. None of the checks individually should be expensive; there are just a lot of them. I did some benchmarking with the scheduler_perf tests; I can grab results from some of them if you'd like to look through them together...
When a pod with the same signature comes later, we find the entry in our cache and pull the first
node off the list. We then go down the nominated node path. Just as we would with a pod
So there are multiple lists, one per signature. When a scheduling happens and node#1 is selected, are we going to iterate all lists and remove node#1 from all of them? (assuming node#1 is no longer available for pods because of the 1-pod-per-node assumption)
When a host is used, it is removed from the cache.
The cache puts every entry on two linked lists: one for the signature and one for the host. So we don't have to iterate over anything to remove the host; we just remove the entries in the host list.
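For illustration, a minimal sketch of the two-index bookkeeping described above, assuming the per-signature lists hold nodes best-score-first and each entry carries a timestamp for the TTL pruning discussed earlier. The type and method names (`Cache`, `Take`, `RemoveHost`) are placeholders, not the proposed implementation.

```go
// Package schedcache sketches a cache with per-signature and per-host indexes.
package schedcache

import (
	"container/list"
	"time"
)

// entry ties one (signature, node) pair into both indexes so it can be
// removed in O(1) from either direction.
type entry struct {
	signature  string
	node       string
	addedAt    time.Time
	bySigElem  *list.Element
	byHostElem *list.Element
}

// Cache holds leftover ranked nodes per signature, plus a per-host index so a
// node can be dropped from every signature's list once any pod lands on it.
type Cache struct {
	ttl    time.Duration
	bySig  map[string]*list.List // signature -> entries, best score first
	byHost map[string]*list.List // node name -> entries across all signatures
}

func New(ttl time.Duration) *Cache {
	return &Cache{ttl: ttl, bySig: map[string]*list.List{}, byHost: map[string]*list.List{}}
}

// Put records the leftover feasible nodes for a signature, best score first.
func (c *Cache) Put(sig string, nodes []string, now time.Time) {
	for _, n := range nodes {
		if c.bySig[sig] == nil {
			c.bySig[sig] = list.New()
		}
		if c.byHost[n] == nil {
			c.byHost[n] = list.New()
		}
		e := &entry{signature: sig, node: n, addedAt: now}
		e.bySigElem = c.bySig[sig].PushBack(e)
		e.byHostElem = c.byHost[n].PushBack(e)
	}
}

// Take pops the best non-expired node for a signature; ok is false on a miss.
// The caller would still re-check feasibility, nominated-node style.
func (c *Cache) Take(sig string, now time.Time) (node string, ok bool) {
	for l := c.bySig[sig]; l != nil && l.Len() > 0; {
		e := l.Front().Value.(*entry)
		c.remove(e)
		if now.Sub(e.addedAt) <= c.ttl {
			return e.node, true
		}
		// Entry older than the TTL: drop it and keep looking.
	}
	return "", false
}

// RemoveHost drops a node from all signature lists, e.g. once any pod
// (cached or not) is assumed onto it under the 1-pod-per-node assumption.
func (c *Cache) RemoveHost(node string) {
	for l := c.byHost[node]; l != nil && l.Len() > 0; {
		c.remove(l.Front().Value.(*entry))
	}
}

// remove unlinks an entry from both indexes.
func (c *Cache) remove(e *entry) {
	c.bySig[e.signature].Remove(e.bySigElem)
	c.byHost[e.node].Remove(e.byHostElem)
}
```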
Co-authored-by: Dominik Marciński <gmidon@gmail.com>
Ok, let me see what I need to do...
is what we cache; the current pod will use the first node on the list, and we cache the remaining
results, indexed by the pod's "scheduling signature" (which we will describe later).

When a pod with the same signature comes later, we find the entry in our cache and pull the first
The alternative is to set nomination on all pods (schedule them in one cycle) vs. the option of pulling the cached node once a pod gets to its dedicated scheduling cycle.
The advantage would be that resources needed for a workload would be not only scheduled but reserved (at least in memory), so that other pods interleaved with the workload ones in the scheduling queue don't take the space (invalidating some of the cached entries).
I guess it will matter more when we integrate this approach with the Workload object and gang scheduling, but for now we could keep it just as an alternative, which however is worth mentioning in the KEP.
Yes, I'm very on board with this approach; the point of the cache was to avoid getting in the way of the workload / gang scheduling work. Let's discuss, because given Kensei's concerns around the cache as an object, this may be the right way to do it off the bat. I don't think this is more complicated than the cache, and it may be much simpler.
It indeed solves a few problems, including memory consumption.
The alternative is to set nomination on all pods (schedule them in one cycle) vs the option pulling the cached node once a pod gets to its dedicated scheduling cycle.
The primary problem with this approach is that those pods may not yet have been observed by the scheduler. We need a way to somehow enumerate them first to ensure that this "single cycle" knows which pods it should handle.
But overall I'm definitely supportive of this path - it's actually something we talked about some time ago.
True. Maybe we could simply reverse the problem and instead of caching results, we could wait for a few seconds for the workload pods to appear?
@sanposhiho - for your thoughts too
Great. I think we can tune the logic of waiting vs. expecting X pods, but overall the proposed approach here is to:
- account pods to a workload in the framework
- schedule the workload in a few batches (putting NNN as the result)
- "cache" filtering/scoring results per batch, instead of allowing them to outlive one scheduling cycle
Is that the common ground?
That is what I'm suggesting. But we also need @sanposhiho opinion here.
Ok, I believe that when you say "implicit workload" and I say "signature" we mean the same thing. I think I'm aligned, but let me state my understanding:
V 0.1 Opportunistic batching (this KEP); a rough sketch of this flow follows after the V 1.0 list below.
- When a pod is placed in the queue it is given a "signature" / "implicit workload" that describes the scheduling parameters for the pod; in some cases (pod affinity, etc.) we may choose to mark it as "unbatchable", which will skip all of this logic.
- When we pull a pod off the queue for scheduling, we opportunistically grab as many pods with a matching "signature" / "implicit workload" as is easily doable.
- We evaluate the first pod against all nodes.
- We assign a nominated node name to as many of the pods in the "signature" / "implicit workload" as possible, using the results from the first pod. Note that this will require the same assumption as we make for 1-pod-per-node: we will use one host for one pod, but I think we are aligned that this is fine even for "multi-pod-per-node" workloads. If we can't assign all pods, we assign as many as we can.
- We run ScheduleOne for each pod in the group we could assign.
- We put the pods we couldn't assign back in the queue to be tried again.
V 1.0 Gangs (post this KEP)
All the same functionality as v 0.1, with the following additions:
- A pod assigned to a "gang" / "explicit workload" will be marked as such when added to the queue.
- For "explicit workloads" we do not process any pods until we have all the pods in the queue.
- After evaluating a single pod against all nodes and attempting to assign nominated node names, we fail scheduling for all the "workload" pods if we cannot assign all pods in the set. In this case we remove all the nominated node names and don't actually schedule any of the pods.
- For "normal" workloads we continue to assign a "signature" / "implicit workload" and batch opportunistically, as we do in v 0.1.
- Depending on the constraints we may need to move to a model where we assign more than one pod to each node in our list. We can tackle this with scoring / fit changes if necessary.
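As referenced above, a rough, self-contained sketch of the V 0.1 opportunistic-batching flow, under the stated 1-pod-per-node assumption. The queue representation, `popBatch`, and the precomputed `nodes` list are illustrative stand-ins for the real queue and the lead pod's filter/score results.

```go
// Toy sketch of opportunistic batching: grab queued pods sharing the lead
// pod's signature, then hand out the lead pod's ranked nodes one per pod.
package main

import "fmt"

type qpod struct {
	name string
	sig  string // "" means unbatchable (e.g. pod affinity)
}

// popBatch pulls the next pod plus any queued pods sharing its signature.
func popBatch(queue []qpod) (batch []qpod, rest []qpod) {
	if len(queue) == 0 {
		return nil, nil
	}
	lead := queue[0]
	batch = append(batch, lead)
	for _, p := range queue[1:] {
		if lead.sig != "" && p.sig == lead.sig {
			batch = append(batch, p)
		} else {
			rest = append(rest, p)
		}
	}
	return batch, rest
}

func main() {
	queue := []qpod{{"job-0", "s1"}, {"web-0", ""}, {"job-1", "s1"}, {"job-2", "s1"}}
	// Feasible nodes that evaluating the lead pod (job-0) would have produced,
	// best score first; with 1-pod-per-node each node is handed to one pod.
	nodes := []string{"node-a", "node-b"}

	batch, rest := popBatch(queue)
	for i, p := range batch {
		if i < len(nodes) {
			fmt.Printf("%s -> nominate %s\n", p.name, nodes[i])
		} else {
			fmt.Printf("%s -> back to queue\n", p.name)
			rest = append(rest, p)
		}
	}
	fmt.Println("remaining queue:", rest)
}
```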
I should be able to implement a first version of this scheme this week.
This KEP does not seem to be tracked for PRR, as it hasn't gotten any heads-up from the v1.35 Enhancements team. Not sure if the CLA is also a missing piece, but the KEP should definitely have an entry in the prod-readiness dir - Thursday 9th October 2025 (AoE) / Friday 10th October 2025, 12:00 UTC
* Begin building infrastructure to support gang scheduling and other "multi-pod" scheduling requests.
* Ensure that the infrastructure we build is maintainable as we update, add and remove plugins.
* Never impact feasibility.
* Provide identical results to our current scheduler for 1-pod-per-node environments.
We need more clarification of 1-pod-per-node scheduling.
What if some random pods w/o the 1-pod-per-node constraint land on nodes? Would those nodes be ineligible for pods w/ the 1-pod-per-node constraint? Or is the 1-pod-per-node constraint meant to be effective only between other pods with the 1-pod-per-node constraint (i.e., 1-pod-per-node pods can be colocated with random pods)?
Alternative idea: we can consider some new extension points in the scheduling framework and implement this cache feature as a plugin. That way, we could avoid having this cache mechanism, which works only for a specific set of pods, in the scheduler core, and we might be able to have more plugins for different sets of pods in the future.
I'm still not convinced we should maintain this feature in k/k though; this alternative approach would open another option: implement the framework change only in k/k, and implement this feature as an out-of-tree plugin in sigs/scheduler-plugins.
Discovery of non-compatible pods (via the cross-plugin signature mechanism) would still be needed.
Right, but that's not a blocker for this approach: we can have a new method on the framework handle so that a plugin can run other plugins to compute the signature (similar to how a preemption plugin runs other plugins' Filter funcs).
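For illustration, a sketch of what the plugin-shaped alternative could look like, with a hypothetical `Handle` method that runs other plugins' signature funcs. None of these names exist in the framework today; they are assumptions only.

```go
// Package sketch illustrates the cache living in a plugin that asks the
// framework handle to compute the pod's aggregate signature.
package sketch

import v1 "k8s.io/api/core/v1"

// Handle stands in for the framework handle passed to plugins.
type Handle interface {
	// RunSignaturePlugins would aggregate per-plugin signatures into one key,
	// returning an error when any plugin marks the pod as unsignable.
	RunSignaturePlugins(pod *v1.Pod) (string, error)
}

// CachePlugin sketches an (out-of-tree) plugin owning the opportunistic cache.
type CachePlugin struct {
	handle Handle
	cache  map[string][]string // signature -> leftover ranked nodes
}

// LookupNominatedNode returns a cached node for the pod, if any.
func (p *CachePlugin) LookupNominatedNode(pod *v1.Pod) (string, bool) {
	sig, err := p.handle.RunSignaturePlugins(pod)
	if err != nil {
		return "", false // unsignable pods bypass the cache entirely
	}
	nodes := p.cache[sig]
	if len(nodes) == 0 {
		return "", false
	}
	p.cache[sig] = nodes[1:]
	return nodes[0], true
}
```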
we can have a new method on the framework handle so that a plugin can run other plugins to compute the signature
True, but I'd rather think about creating a single-cycle workload scheduling phase as an extension point. This way the gang-scheduling plugin would be notified that there is a workload to be scheduled. Different plugins would have different sets of algorithms with which to do that.
The ability to run other plugins' phases like filtering, scoring, and most of all this new signature one (to check pod compatibility with some algorithms) would become necessary and available to them. Even external schedulers increasingly need such a capability, so there are discussions about the need to extract it as a library; otherwise they are not able to provide valid placements.
True, but I'd rather think about creating single-cycle workload scheduling phase as an extension point.
I think we kind of agreed on that already as part of the gang-scheduling KEP (for Beta). So I'm definitely +1 for this.
But I think that on its own it can be a bit orthogonal (and on its own doesn't address Kensei's concern) [unless I misunderstood your comment].
In addition to being able to call other plugins to compute the signature, we would still need:
(1) either a cache itself somewhere - which Kensei is concerned about
(2) or to be able to ensure that we collect all similar pods first before processing them all at once as part of this extension point (then the cache remains local to a single scheduling cycle, which is great)
(2) sounds much better (and is much more aligned with the vision), but we don't have any mechanism to collect all the pods that we want to process together (or to even learn how many of them we want to collect). So I think that is a missing bit here.
Re (2): true, I just described how this extension point may look in the future.
So overall I agree with Kensei's proposal to eventually keep the cache in a plugin (rather than in the framework). We wanted to avoid defining this extension point now, before we understand well how single-phase workload scheduling should look. Without it, it would be hard to pass the cache results back to the framework.
However, I now better understand the concern about making this cache a part of the framework.
Keywords which can automatically close issues and at(@) or hashtag(#) mentions are not allowed in commit messages. The list of commits with invalid commit messages:
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: bwsalmon.
Needs approval from an approver in each of these files:
/ok-to-test
@bwsalmon: The following test failed.