KEP-5598: Opportunistic scheduling cache #5599
Conversation
bwsalmon commented on Oct 1, 2025:
- One-line PR description: First version of the KEP for scheduler cache
- Issue link: Opportunistic scheduling cache #5598
- Other comments:
Welcome @bwsalmon!
Hi @bwsalmon. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label.
node off the list. We then go down the nominated node path. Just as we would with a pod
with a NominatedNodeName in it, we only re-evaluate the feasibility of this node before using it.

Since we assume 1-pod-per-node, we know that the node used by the current pod
How are we going to verify that it's indeed a 1-pod-per-node configuration?
I think that just relying on users is not enough here...
So the current thought is this (I'll make this more explicit here).
The user has to opt in with the scheduling profile; we should mention this clearly there.
If they get it wrong it will not necessarily lead to incorrect behavior, just a somewhat shifted scoring. So I think the cost is comparatively low.
I'm open to more tracking if we think that is appropriate. We can tell if a pod class is 1-pod-per-node, so we can flag that case if we'd like, and we should also be able to tell if multiple non-DaemonSet pods land on the same nodes after the fact.
Thoughts on how much you think is necessary?
As mentioned offline, I'm personally OK with the scheduling profile:
- an explicit scheduling profile allows us to explicitly "opt in" a specific workload to this behavior
- given we do feasibility validation anyway for the result from the cache, we won't break correctness here
There is a concern that @sanposhiho brought up during today's SIG Scheduling meeting that we can probably summarize as:
What if the cluster state changed in the meantime? (2) ensures that we won't schedule on an infeasible node, but if some pods were terminated in the meantime (or nodes appeared), there may now be a spot in the cluster for our pod that is better from a scoring perspective but is not in the cache (because it wasn't feasible when the cache was computed).
My personal answer to that would be: given that it's a dedicated scheduling profile that users opt in to, it's probably OK (we don't mess up feasibility, we just potentially choose a non-optimal node).
FWIW, given that we only look for N feasible nodes for scoring anyway, for large enough clusters we can always end up with a non-optimal node (so I personally think this is not a problem even without a dedicated profile, but clearly that is more contentious).
So I would be OK with the scheduling profile. But we need input from @sanposhiho and @dom4ha
What do you both mean by "scheduling profile"? The scheduler config or a new API in Pod? If it's the scheduler config, are we going to assume all pods in the cluster are 1-pod-per-node scheduling, without checking anything actual?
By "scheduling profile" we do mean the scheduler config. So the user would need to create a scheduler config with a flag set and then reference it in the pod definitions. Thoughts on better ways to handle it?
@bwsalmon - please sign the CLA (and I'm happy to take the production readiness approval for it)
Cool. I tried to, but it forwarded me to some Google reps to sign it... I sent the request but haven't heard back. Maybe I did it wrong?
If plugin changes prove to be an issue, we could codify the signature as a new "Scheduling" object that only has a subset
of the fields of the pod. Plugins that "opt-in" could only be given access to this reduced scheduling object, and we could then use the entire scheduling object as the signature. This would make it more or less impossible for the signature and plugins to be out of sync, and would
naturally surface new dependencies as additions to the scheduling object. However, as we expect plugin changes to be relatively
modest, we don't believe the complexity of making the interface changes is worth the risk today.
Duplicate paragraph R323-R326
Yeah, it seemed like it fit both places but definitely isn't ideal.
Thoughts on where to keep it?
I think here is better.
* VolumeBinding: Same as NodeVolumeLimits.
* VolumeRestrictions: Same as NodeVolumeLimits.
* VolumeZone: Same as NodeVolumeLimits.
Do we need to take defaultPreemption into consideration?
Hmmm, good point. Yes, I feel like we should, but I didn't find the reference in our plugins.
I assume we will need the priority class as well.
Let me figure out where in the code this is handled and we can decide how to add it in...
Ok, I took a look through the code and came to an answer that is somewhat counter-intuitive to me.
It doesn't look like we need to include DefaultPreemption, and thus priority, in the signature, because we don't need to include PostFilter plugins. I did make that decision early on, but I'm re-digesting it now.
This is because we are caching the result after PreFilter, Filter, PreScore and Score, but before the PostFilter step. PostFilter only runs if we can't find anything after the previous steps; then we use the priority to decide if we should try to preempt something.
So regardless of the priority, we would (correctly) get the same results from the cache. But if the results were empty, we would potentially evict in one case and not in the other.
If you get a chance, please take a look and make sure this matches your understanding...
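To make the ordering concrete, here's a toy, self-contained sketch of one scheduling attempt under the flow described above. Everything in it (the function shape, the stand-in phase callbacks, the cache map) is an illustrative assumption, not the scheduler framework API; the point is only that the cache is read before Filter and written after Score, so PostFilter (and therefore priority) never influences what gets cached.

```go
// Toy sketch of the ordering argument: cache read before Filter, cache write
// after Score, PostFilter (preemption, where priority matters) only on an
// empty feasible set. All names are illustrative stand-ins.
package main

import "fmt"

type result struct{ node string }

func scheduleOne(
	sig string,
	cache map[string][]string, // signature -> leftover ranked nodes
	filterAndScore func() []string, // stands in for PreFilter/Filter/PreScore/Score
	postFilter func() result, // stands in for preemption; priority matters only here
) result {
	// 1. Cache hit: reuse a previously ranked node (nominated-node style path;
	//    the real flow would still re-check feasibility of this node).
	if nodes := cache[sig]; len(nodes) > 0 {
		cache[sig] = nodes[1:]
		return result{node: nodes[0]}
	}
	// 2. Cache miss: run the normal phases.
	ranked := filterAndScore()
	if len(ranked) == 0 {
		// 3. Nothing feasible: only now does priority/preemption come into play,
		//    which is why DefaultPreemption need not be part of the signature.
		return postFilter()
	}
	// 4. Cache the leftovers; PostFilter never ran, so it cannot affect them.
	cache[sig] = ranked[1:]
	return result{node: ranked[0]}
}

func main() {
	cache := map[string][]string{}
	fs := func() []string { return []string{"node-a", "node-b"} }
	pf := func() result { return result{} }
	fmt.Println(scheduleOne("sig-1", cache, fs, pf)) // {node-a}: full cycle, leftovers cached
	fmt.Println(scheduleOne("sig-1", cache, fs, pf)) // {node-b}: cache hit
}
```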
extending the production code to implement this enhancement.
-->

- `<package>`: `<date>` - `<test coverage>`
Please add the test files that this KEP will add/update, along with the current coverage for the existing ones.
Done. Especially since this is my first KEP, PTAL and let me know if I'm capturing the appropriate information...
- a search in the Kubernetes bug triage tool (https://storage.googleapis.com/k8s-triage/index.html)
-->

- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature)
Same, I expect we will add new integration tests as well.
Done. As with unit tests, PTAL and make sure I am meeting the expected guidelines...
[existing list]: https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/
-->

- [ ] Feature gate (also fill in values in `kep.yaml`)
Please add the feature gate and update it in the kep.yaml as well.
Done.
/cc @sanposhiho @macsko
/label lead-opted-in
I did that a long time ago, but historically it didn't require any interactions...
Honestly, for me, there are still a lot of question marks. Left comments.
some other pod, but this should be the only issue.

All stored results are timestamped, and we remove entries when they are more
than a few seconds old to ensure we do not get stale data. Since we are targeting
How long is "a few seconds"? Given the scheduler handles ~300 pods per second, even if this TTL is 1 sec, the cluster state is changed by, at the very least, ~300 new running pods. Of course, on top of that, there are tons more cluster state changes, especially when it comes to a huge cluster that this KEP is aiming at. Do we really want to keep using the same scheduling results?
Yes, we are talking about a few seconds (I'd say between 1 and 10), and thus 100s to 1000s of new pods, which keep the cache up to date as they flow through.
What other changes (besides pod death) should we expect?
I see you mentioned label and taint changes, which makes a lot of sense, and we should discuss there...
Other things you would be concerned about?
If we scope the cache to a single workload and consider it a cluster-state snapshot made at the time we schedule the first pod, the problem of invalidating the results is not as severe as it seems.
Yes, without setting NNN right away there is a greater chance that one-cycle scheduling can be more easily broken by other cluster changes happening in the meantime, but eventually each pod goes through its own complete pod-by-pod scheduling cycle, so the node it was originally assigned to will be re-evaluated for whether it's still feasible.
Sure, it's a bit more problematic if it turns out not to be feasible anymore, since full filtering and scoring then needs to take place and the final assignment gets corrected. It may not be consistent with the initial placement, but since we don't consider any cross-pod requirements yet (TAS, shared ResourceClaims, etc.), we don't have to care.
Once we start supporting cross-pod requirements, a failure to pass pod-by-pod scheduling would be a signal to reschedule the whole workload (go back to the single-cycle scheduling phase), but this is the next step.
* NodeResourcesFit: We use the output of the computePodResourceRequest function as the signature.
* NodeUnschedulable: We use the Tolerations field as the signature.
* NodeVolumeLimits: We use all Volume information except from Volumes of type ConfigMap or Secret.
* PodTopologySpread: If the PodTopologySpread field is set, or it is not set but a default set of rules is applied, we mark the pod unsignable; otherwise it returns an empty signature.
a default set of rules are applied, we mark the pod unsignable
Meaning, all pods that belong to Service, ReplicaSet, StatefulSet or ReplicationController are not eligible to use the cache. That's a large portion.
Unless the user overrides the defaults, yes. This is definitely the biggest restriction in the model today. I think it is relaxable with a more focused version of pod spread, but it may not be worth addressing given that multi-pod scheduling probably addresses the issue more naturally.
Thoughts?
Meaning, all pods that belong to Service, ReplicaSet, StatefulSet or ReplicationController are not eligible to use the cache. That's a large portion.
Not necessarily. The workloads that would benefit the most are Jobs, JobSets, LWS, etc., which AFAIK cannot have default topology spread.
I know this proposal is mostly for Jobs etc. (given it mentions ML workloads). Like I mentioned in the other thread, I'm talking about how small the target of this feature would be.
In order to utilize this feature, pods must:
- not belong to a Service, ReplicaSet, StatefulSet or ReplicationController, i.e., this feature is not useful for typical K8s clusters hosting a backend app etc.
- largely use a 1-pod-per-node scheduling strategy.
- not use any complex scheduling constraints.
- [in the future] not use gang scheduling.
Why do we need to invest our time to implement/maintain it only for users that luckily have pods matching the above conditions?
I'd rather reverse what you've just said.
The kube-scheduler is useful for clusters which run Services, ReplicaSets, StatefulSets etc., but it does not do a good job when it needs to schedule big Jobs and run ML workloads, since it has to support the features developed for the former.
This is why there are so many workload schedulers that do scheduling better: they can ignore these problematic scheduling constraints. Sure, we are working on gang scheduling, but reusing filtering and scoring results would be a foundation of most gang scheduling algorithms, so I have no doubts that this feature is a step in the right direction, although I agree that in its simplest form it's hard to notice its value.
If we scope the cache to a single workload, it becomes more apparent that the signature framework allows us to detect workloads and gangs implicitly even if our new workload object wasn't created upfront by a user. So we can use the single-cycle scheduling capabilities opportunistically, without enforcing minCount, since it's not known until we get an explicit workload definition from a user or controller.
What I'm describing is a more extended scope which we did not want to implement in the simplest form, but it's an alternative that is on the table.
Do you assume all pods in the cluster have a 1-pod-per-node scheduling constraint? Or could it be mixed, with some pods using 1-pod-per-node scheduling while others use normal scheduling, where multiple pods can land on the same node?
No, we just assume that pods that use the cache are 1-pod-per-node. So if some other (non-1-pod-per-node) workload uses a node it just gets removed from the cache and ignored by the 1-pod-per-node workloads going through the cache.
So the other workloads obviously don't get any advantage from the cache, but they also won't be hurt by it...
Added this to the text.
Closing comment, but feel free to reopen if I missed something...
No, we just assume that pods that use the cache are 1-pod-per-node. So if some other (non-1-pod-per-node) workload uses a node it just gets removed from the cache and ignored by the 1-pod-per-node workloads going through the cache.
Please add an explanation of how we determine that a pod is doing 1-pod-per-node.
The current proposal is that we determine a workload is 1-pod-per-node by requiring the user to mark it as such, by targeting it at a custom scheduler configuration with this option enabled. We can revisit this if we feel it is insufficient.
We can potentially determine if a workload is 1-pod-per-workload, i.e. that we will never schedule another pod of this kind on the same node; we can even potentially (but with a lot of complexity) tell if there is any other current pod that could run alongside this pod, but there is no way to tell that the user will never create a workload in the future that will be able to run alongside this workload.
I would suggest that if we feel the current proposal is insufficient, we could add a constraint on the pod being 1-pod-per-workload (not 1-pod-per-node). However, this adds more constraints and it isn't clear to me that it would actually provide stronger guarantees than the current model.
I would also say: it isn't clear to me that this would only benefit 1-pod-per-node workloads, only that it would be identical to the current scheduler for 1-pod-per-node workloads. We will never provide an infeasible solution; we will only, in some cases, spread out workloads more than they might have been with the current scheduler. I'm not sure this is actually going to be a problem, but I also don't have strong data confirming that it won't be. The hope would be to gather this kind of information through a deployment.
### Goals

* Improve the performance of scheduling large jobs on large clusters where the constraints are simple.
* Begin building infrastructure to support gang scheduling and other "multi-pod" scheduling requests.
I think it was also mentioned in our sig-scheduling meeting. But how would it fit our future picture of gang scheduling? I imagine a separate phase would decide all placements for the gang pods at once before proceeding to the existing filter/score/etc. phases.
Are we sure that this cache mechanism would still remain useful?
Even further, is it really something we should focus on/invest our time in now?
Yep, these are excellent questions; I tried to address them but it looks like I need to make it more explicit.
So I think we expect the cache itself will be deprecated by gang scheduling; we won't need to cache between cycles because we will just handle the pods in the same cycle. This is why we have kept the cache relatively simple and focused on a particular use case.
But the signature and reuse mechanisms should remain useful. Unless we plan to rewrite all of our existing plugins we will need to evaluate a "representative" pod against all the nodes and then use the results from this pod for all of the pods in the same "class". This will still require the signature and a similar "reuse" logic to that of the cache.
The reason to invest in this now is to both build the signature framework and start to get familiar with the advantages and challenges of evaluating a group of pods against a set of nodes once, with at least some production exposure.
But I'd love to hear your thoughts.
Another way to see it is this: the cache is a "Trojan Horse" for injecting multi-pod scheduling into the system without having to bite off the full interface and new scheduling cycles. It should provide immediate benefit to a focused set of customers, but the primary purpose (from my perspective, anyway) is to start getting production understanding of how multi-pod scheduling works in practice, and to start building frameworks that will support multi-pod scheduling.
Also, just to be clear: I've been working on this KEP in close coordination with Eric, the author of the workload KEP, in addition (of course) to Dominik.
So I think we expect the cache itself will be deprecated by gang scheduling; we won't need to cache between cycles because we will just handle the pods in the same cycle. This is why we have kept the cache relatively simple and focused on a particular use case.
Right, and that's the core reason I'm doubting why we need to do this today, given we know gang scheduling is coming.
Note that this cache mechanism is definitely not going to be that simple, implementation-wise.
But the signature and reuse mechanisms should remain useful.
For whom? This feature's target is already very small even at this point (because of many limitations), and after gang scheduling, how large a portion of pods in this world can get the benefit? And is that worthy enough for us to keep maintaining the feature?
I'm already not a fan of the idea of introducing an optimization specifically for 1-pod-per-node scheduling, even with many other limitations.
The reason to invest in this now is to both build the signature framework and start to get familiar with the advantages and challenges of evaluating a group of pods against a set of nodes once, with at least some production exposure.
..
Another way to see it is this: the cache is a "Trojan Horse" for injecting multi-pod scheduling into the system
Hmm, I don't understand. For me, this feature and a future gang scheduling just look like two different things, and we're taking an unnecessary risk/effort here to try something not that similar, which we know will be less useful after gang scheduling arrives.
Hmm, I don't understand. For me, this feature and a future gang scheduling just look like two different things, and we're taking a unnecessary risk/effort here to try something not that similar, which we know will be less useful after the gang scheduling arrives.
Even if we have single-cycle gang scheduling, the process of finding a placement may be very complicated in the most generic case. We don't have any proposal on the table for how it could ultimately look, and it will be the most challenging and exciting part, tackling an NP-hard/NP-complete scheduling problem.
At the same time, we know that in this special case we can safely reuse both filtering and scoring results, and so achieve scheduling in O(N) complexity instead of O(N×M). Still, not every workload is eligible for it, so the mechanism of signatures would still be needed to discover that.
Workloads without the 1-pod-per-node usage pattern could only reuse filtering results, and would repeat scoring after every pod. So in any case, the ability to discover what type of workload we are dealing with could drive a different algorithm, while this one seems the fastest.
So the question of how many customers could make use of it is valid, and even finding an early adopter would be useful to assess it.
I know it will work for some limited number of workloads. I'm talking about what % of Kubernetes clusters will get benefit from this feature (see the conditions that must be satisfied to use this feature) and whether it really makes sense, as a community, to maintain this specific optimization in k/k.
Workloads without the 1-pod-per-node usage pattern could only reuse filtering results, and would repeat scoring after every pod.
How can workloads without 1-pod-per-node reuse the filtering results? If it's 1-pod-per-node scheduling and pod#1 is scheduled on node#1, we can simply remove node#1 from the cache. But if it's not 1-pod-per-node scheduling, we're not sure whether we should remove node#1 from the list or not.
I'm talking about what % of Kubernetes clusters will get benefit from this feature
+1
How can workloads without 1-pod-per-node reuse the filtering results? If it's 1-pod-per-node scheduling and pod#1 is scheduled on node#1, we can simply remove node#1 from the cache. But if it's not 1-pod-per-node scheduling, we're not sure whether we should remove node#1 from the list or not.
The signature filters out pod scheduling features which can cause invalidation/extension of the feasible nodes (the filtering results) and their scores. This way we know that placing a pod on one of the nodes does not require recomputing them for the other nodes. These might be the vast majority of cases for large workloads (which need gang scheduling), since pod affinities and the like are known for significantly slowing down scheduling.
So without the 1-pod-per-node constraint we would only need to recompute feasibility and score on the node which was just taken, before placing another pod there. This actually does not sound like a big degradation compared to the initial proposal, so do we really need to detect 1-pod-per-node as a special case? It would also imply keeping the cache itself.
The inter-pod affinity and anti-affinity features are more complex to support, since they may influence both feasibility and scoring results after each placement.
I'm completely open to just saying this is for any workload. I agree that the difference in scoring is likely not a big issue; at worst we will spread pods out more than we would today, which it sounds like we are already trying to do anyway.
The restriction to 1-pod-per-node is meant as a way to be cautious; in this case we know the behavior will be identical to the current mechanisms, whereas for multi-pod-per-node cases the scoring would differ somewhat, even if the feasibility would still be identical.
The big thing from my perspective is that I don't think any of the work for 1-pod-per-node will be wasted. If we decide we really do need to re-evaluate scores and keep the nodes around, we can add that with incremental extra work.
* Topology spread rules (including inherited rules from the system default). This constraint we should attempt to lift in the future.

To construct a signature, we add a new function for each plugin to implement.
This function takes a pod and generates a signature for that plugin as a string.
Can you add an explanation of how the interface will look in the KEP?
Yes, good call. I'll pull in the signatures from the draft version.
Ok, added the draft implementation here in the text, PTAL...
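Since the draft interface itself isn't reproduced in this thread, here is a rough sketch of the shape such a per-plugin hook could take. The names (`SignaturePlugin`, `BuildPodSignature`, `ErrUnsignable`) and the simple string-concatenation scheme are assumptions for illustration only, not the KEP's draft API.

```go
// Package cachesig sketches the per-plugin signature hook discussed above.
// All identifiers here are illustrative assumptions, not the draft API.
package cachesig

import (
	"errors"
	"fmt"
	"strings"

	v1 "k8s.io/api/core/v1"
)

// ErrUnsignable marks a pod whose filter/score results must not be reused,
// e.g. because it uses pod (anti-)affinity or inherits topology spread defaults.
var ErrUnsignable = errors.New("pod cannot be signed")

// SignaturePlugin would be implemented by each plugin that participates in
// caching: it reduces the pod fields the plugin actually inspects to a stable string.
type SignaturePlugin interface {
	Name() string
	Signature(pod *v1.Pod) (string, error)
}

// BuildPodSignature concatenates the per-plugin signatures into the key used
// to index cached filter/score results. Any unsignable pod bypasses the cache.
func BuildPodSignature(pod *v1.Pod, plugins []SignaturePlugin) (string, error) {
	parts := make([]string, 0, len(plugins))
	for _, p := range plugins {
		sig, err := p.Signature(pod)
		if err != nil {
			return "", fmt.Errorf("%s: %w", p.Name(), err)
		}
		parts = append(parts, p.Name()+"="+sig)
	}
	return strings.Join(parts, ";"), nil
}
```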
* Improve the performance of scheduling large jobs on large clusters where the constraints are simple.
Where's the bottleneck for the scheduling you are imagining in this cluster? If pods are that simple, most plugins should just return Skip at the PreXXXX phase.
The bottleneck is just iterating through all the nodes and checking; once you have 100,000 nodes, just iterating through them and checking labels becomes pretty expensive. None of the checks individually should be expensive; there are just a lot of them. I did some benchmarking with the scheduler_perf tests; I can grab results from some of them if you'd like to look through them together...
When a pod with the same signature comes later, we find the entry in our cache and pull the first
node off the list. We then go down the nominated node path. Just as we would with a pod
So there are multiple lists, one per signature. When a scheduling happens and node#1 is selected, are we going to iterate all lists and remove node#1 from all of them? (assuming node#1 is no longer available for pods because of the 1-pod-per-node assumption)
When a host is used, it is removed from the cache.
The cache puts every entry on two linked lists: one for the signature and one for the host. So we don't have to iterate over anything to remove the host; we just remove the entries in the host list.
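For illustration, a minimal sketch of the two-index bookkeeping described above, assuming the per-signature lists hold nodes best-score-first and each entry carries a timestamp for the TTL pruning discussed earlier. The type and method names (`Cache`, `Take`, `RemoveHost`) are placeholders, not the proposed implementation.

```go
// Package schedcache sketches a cache with per-signature and per-host indexes.
package schedcache

import (
	"container/list"
	"time"
)

// entry ties one (signature, node) pair into both indexes so it can be
// removed in O(1) from either direction.
type entry struct {
	signature  string
	node       string
	addedAt    time.Time
	bySigElem  *list.Element
	byHostElem *list.Element
}

// Cache holds leftover ranked nodes per signature, plus a per-host index so a
// node can be dropped from every signature's list once any pod lands on it.
type Cache struct {
	ttl    time.Duration
	bySig  map[string]*list.List // signature -> entries, best score first
	byHost map[string]*list.List // node name -> entries across all signatures
}

func New(ttl time.Duration) *Cache {
	return &Cache{ttl: ttl, bySig: map[string]*list.List{}, byHost: map[string]*list.List{}}
}

// Put records the leftover feasible nodes for a signature, best score first.
func (c *Cache) Put(sig string, nodes []string, now time.Time) {
	for _, n := range nodes {
		if c.bySig[sig] == nil {
			c.bySig[sig] = list.New()
		}
		if c.byHost[n] == nil {
			c.byHost[n] = list.New()
		}
		e := &entry{signature: sig, node: n, addedAt: now}
		e.bySigElem = c.bySig[sig].PushBack(e)
		e.byHostElem = c.byHost[n].PushBack(e)
	}
}

// Take pops the best non-expired node for a signature; ok is false on a miss.
// The caller would still re-check feasibility, nominated-node style.
func (c *Cache) Take(sig string, now time.Time) (node string, ok bool) {
	for l := c.bySig[sig]; l != nil && l.Len() > 0; {
		e := l.Front().Value.(*entry)
		c.remove(e)
		if now.Sub(e.addedAt) <= c.ttl {
			return e.node, true
		}
		// Entry older than the TTL: drop it and keep looking.
	}
	return "", false
}

// RemoveHost drops a node from all signature lists, e.g. once any pod
// (cached or not) is assumed onto it under the 1-pod-per-node assumption.
func (c *Cache) RemoveHost(node string) {
	for l := c.byHost[node]; l != nil && l.Len() > 0; {
		c.remove(l.Front().Value.(*entry))
	}
}

// remove unlinks an entry from both indexes.
func (c *Cache) remove(e *entry) {
	c.bySig[e.signature].Remove(e.bySigElem)
	c.byHost[e.node].Remove(e.byHostElem)
}
```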
Co-authored-by: Dominik Marciński <gmidon@gmail.com>
Ok, let me see what I need to do...
is what we cache; the current pod will use the first node on the list, and we cache the remaining
results, indexed by the pod's "scheduling signature" (which we will describe later).

When a pod with the same signature comes later, we find the entry in our cache and pull the first
The alternative is to set nomination on all pods (schedule them in one cycle) vs. the option of pulling the cached node once a pod gets to its dedicated scheduling cycle.
The advantage would be that resources needed for a workload would be not only scheduled but reserved (at least in memory), so that other pods interleaved with the workload ones in the scheduling queue don't take the space (invalidating some of the cached entries).
I guess it will matter more when we integrate this approach with the Workload object and gang scheduling, but for now we could keep it just as an alternative, which however is worth mentioning in the KEP.
Yes, I'm very on board with this approach; the point of the cache was to avoid getting in the way of the workload / gang scheduling work. Let's discuss, because given Kensei's concerns around the cache as an object, this may be the right way to do it off the bat. I don't think this is more complicated than the cache, and it may be much simpler.
It indeed solves a few problems, including memory consumption.
The alternative is to set nomination on all pods (schedule them in one cycle) vs the option pulling the cached node once a pod gets to its dedicated scheduling cycle.
The primary problem with this approach is that those pods may not yet have been observed by the scheduler. We need a way to somehow enumerate them first to ensure that this "single cycle" knows which pods it should handle.
But overall I'm definitely supportive of this path - it's actually something we talked about some time ago.
True. Maybe we could simply reverse the problem and instead of caching results, we could wait for a few seconds for the workload pods to appear?
@sanposhiho - for your thoughts too
Great. I think we can tune the logic of waiting vs. expecting X pods, but overall the proposed approach here is to:
- account pods to a workload in the framework
- schedule the workload in a few batches (putting NNN as the result)
- "cache" filtering/scoring results per batch, instead of allowing them to outlive one scheduling cycle
Is that the common ground?
That is what I'm suggesting. But we also need @sanposhiho opinion here.
Ok, I believe that when you say "implicit workload" and I say "signature" we mean the same thing. I think I'm aligned, but let me state my understanding:
V 0.1 Opportunistic batching (this KEP); a rough sketch of this flow follows after the V 1.0 list below.
- When a pod is placed in the queue it is given a "signature" / "implicit workload" that describes the scheduling parameters for the pod; in some cases (pod affinity, etc.) we may choose to mark it as "unbatchable", which will skip all of this logic.
- When we pull a pod off the queue for scheduling, we opportunistically grab as many pods with a matching "signature" / "implicit workload" as is easily doable.
- We evaluate the first pod against all nodes.
- We assign a nominated node name to as many of the pods in the "signature" / "implicit workload" as possible, using the results from the first pod. Note that this will require the same assumption as we make for 1-pod-per-node: we will use one host for one pod, but I think we are aligned that this is fine even for "multi-pod-per-node" workloads. If we can't assign all pods, we assign as many as we can.
- We run ScheduleOne for each pod in the group we could assign.
- We put the pods we couldn't assign back in the queue to be tried again.
V 1.0 Gangs (post this KEP)
All the same functionality as v 0.1, with the following additions:
- A pod assigned to a "gang" / "explicit workload" will be marked as such when added to the queue.
- For "explicit workloads" we do not process any pods until we have all the pods in the queue.
- After evaluating a single pod against all nodes and attempting to assign nominated node names, we fail scheduling for all the "workload" pods if we cannot assign all pods in the set. In this case we remove all the nominated node names and don't actually schedule any of the pods.
- For "normal" workloads we continue to assign a "signature" / "implicit workload" and batch opportunistically, as we do in v 0.1.
- Depending on the constraints we may need to move to a model where we assign more than one pod to each node in our list. We can tackle this with scoring / fit changes if necessary.
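As referenced above, a rough, self-contained sketch of the V 0.1 opportunistic-batching flow, under the stated 1-pod-per-node assumption. The queue representation, `popBatch`, and the precomputed `nodes` list are illustrative stand-ins for the real queue and the lead pod's filter/score results.

```go
// Toy sketch of opportunistic batching: grab queued pods sharing the lead
// pod's signature, then hand out the lead pod's ranked nodes one per pod.
package main

import "fmt"

type qpod struct {
	name string
	sig  string // "" means unbatchable (e.g. pod affinity)
}

// popBatch pulls the next pod plus any queued pods sharing its signature.
func popBatch(queue []qpod) (batch []qpod, rest []qpod) {
	if len(queue) == 0 {
		return nil, nil
	}
	lead := queue[0]
	batch = append(batch, lead)
	for _, p := range queue[1:] {
		if lead.sig != "" && p.sig == lead.sig {
			batch = append(batch, p)
		} else {
			rest = append(rest, p)
		}
	}
	return batch, rest
}

func main() {
	queue := []qpod{{"job-0", "s1"}, {"web-0", ""}, {"job-1", "s1"}, {"job-2", "s1"}}
	// Feasible nodes that evaluating the lead pod (job-0) would have produced,
	// best score first; with 1-pod-per-node each node is handed to one pod.
	nodes := []string{"node-a", "node-b"}

	batch, rest := popBatch(queue)
	for i, p := range batch {
		if i < len(nodes) {
			fmt.Printf("%s -> nominate %s\n", p.name, nodes[i])
		} else {
			fmt.Printf("%s -> back to queue\n", p.name)
			rest = append(rest, p)
		}
	}
	fmt.Println("remaining queue:", rest)
}
```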
I should be able to implement a first version of this scheme this week.
This KEP does not seem to be tracked for PRR, as it hasn't gotten any heads-up from the v1.35 Enhancements team. Not sure if the CLA is also a missing piece, but the KEP should definitely have an entry in the prod-readiness dir - Thursday 9th October 2025 (AoE) / Friday 10th October 2025, 12:00 UTC
* Begin building infrastructure to support gang scheduling and other "multi-pod" scheduling requests.
* Ensure that the infrastructure we build is maintainable as we update, add and remove plugins.
* Never impact feasibility.
* Provide identical results to our current scheduler for 1-pod-per-node environments.
We need more clarification of 1-pod-per-node scheduling.
What if some random pods w/o the 1-pod-per-node constraint land on nodes? Would those nodes be ineligible for pods w/ the 1-pod-per-node constraint? Or is the 1-pod-per-node constraint meant to be effective only between other pods with the 1-pod-per-node constraint (i.e., 1-pod-per-node pods can be colocated with random pods)?
Alternative idea: we can consider some new extension points in the scheduling framework and implement this cache feature as a plugin. That way, we could avoid having this cache mechanism, which works only for a specific set of pods, in the scheduler core, and we might be able to have more plugins for different sets of pods in the future.
I'm still not convinced we should maintain this feature in k/k though; this alternative approach would open another option: implement the framework change only in k/k, and implement this feature as an out-of-tree plugin in sigs/scheduler-plugins.
Discovery of non-compatible pods (via the cross-plugin signature mechanism) would still be needed.
Right, but that's not a blocker for this approach: we can have a new method on the framework handle so that a plugin can run other plugins to compute the signature (similar to how a preemption plugin runs other plugins' Filter funcs).
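For illustration, a sketch of what the plugin-shaped alternative could look like, with a hypothetical `Handle` method that runs other plugins' signature funcs. None of these names exist in the framework today; they are assumptions only.

```go
// Package sketch illustrates the cache living in a plugin that asks the
// framework handle to compute the pod's aggregate signature.
package sketch

import v1 "k8s.io/api/core/v1"

// Handle stands in for the framework handle passed to plugins.
type Handle interface {
	// RunSignaturePlugins would aggregate per-plugin signatures into one key,
	// returning an error when any plugin marks the pod as unsignable.
	RunSignaturePlugins(pod *v1.Pod) (string, error)
}

// CachePlugin sketches an (out-of-tree) plugin owning the opportunistic cache.
type CachePlugin struct {
	handle Handle
	cache  map[string][]string // signature -> leftover ranked nodes
}

// LookupNominatedNode returns a cached node for the pod, if any.
func (p *CachePlugin) LookupNominatedNode(pod *v1.Pod) (string, bool) {
	sig, err := p.handle.RunSignaturePlugins(pod)
	if err != nil {
		return "", false // unsignable pods bypass the cache entirely
	}
	nodes := p.cache[sig]
	if len(nodes) == 0 {
		return "", false
	}
	p.cache[sig] = nodes[1:]
	return nodes[0], true
}
```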
we can have a new method on the framework handle so that a plugin can run other plugins to compute the signature
True, but I'd rather think about creating a single-cycle workload scheduling phase as an extension point. This way the gang-scheduling plugin would be notified that there is a workload to be scheduled. Different plugins would have different sets of algorithms with which to do that.
The ability to run other plugins' phases like filtering, scoring, and most of all this new signature one (to check pod compatibility with some algorithms) would become necessary and available to them. Even external schedulers increasingly need such a capability, so there are discussions about the need to extract it as a library; otherwise they are not able to provide valid placements.
True, but I'd rather think about creating single-cycle workload scheduling phase as an extension point.
I think we kind of agreed on that already as part of the gang-scheduling KEP (for Beta). So I'm definitely +1 for this.
But I think that on its own it can be a bit orthogonal (and on its own doesn't address Kensei's concern) [unless I misunderstood your comment].
In addition to being able to call other plugins to compute the signature, we would still need:
(1) either a cache itself somewhere - which Kensei is concerned about
(2) or to be able to ensure that we collect all similar pods first before processing them all at once as part of this extension point (then the cache remains local to a single scheduling cycle, which is great)
(2) sounds much better (and is much more aligned with the vision), but we don't have any mechanism to collect all the pods that we want to process together (or to even learn how many of them we want to collect). So I think that is a missing bit here.
Re (2): true, I just described how this extension point may look in the future.
So overall I agree with Kensei's proposal to eventually keep the cache in a plugin (rather than in the framework). We wanted to avoid defining this extension point now, before we understand well how single-phase workload scheduling should look. Without it, it would be hard to pass the cache results back to the framework.
However, I now better understand the concern about making this cache a part of the framework.
Keywords which can automatically close issues and at(@) or hashtag(#) mentions are not allowed in commit messages. The list of commits with invalid commit messages:
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: bwsalmon.
Needs approval from an approver in each of these files:
/ok-to-test
@bwsalmon: The following test failed.