
Conversation

@bwsalmon commented Oct 1, 2025

  • One-line PR description: First version of the KEP for scheduler cache
  • Other comments:


linux-foundation-easycla bot commented Oct 1, 2025

CLA Not Signed

@k8s-ci-robot added labels: kind/kep (Categorizes KEP tracking issues and PRs modifying the KEP directory), sig/scheduling (Categorizes an issue or PR as relevant to SIG Scheduling), cncf-cla: no (Indicates the PR's author has not signed the CNCF CLA) Oct 1, 2025
@k8s-ci-robot requested a review from dom4ha October 1, 2025 23:11
@k8s-ci-robot (Contributor)

Welcome @bwsalmon!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot requested a review from macsko October 1, 2025 23:11
@k8s-ci-robot added the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test) Oct 1, 2025
@k8s-ci-robot (Contributor)

Hi @bwsalmon. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the size/XL label (Denotes a PR that changes 500-999 lines, ignoring generated files) Oct 1, 2025
@helayoty moved this to Needs Review in SIG Scheduling Oct 2, 2025
node off the list. We then go down the nominated node path. Just as we would with a pod
with a NominatedNodeName in it, we only re-evaluate the feasibility of this node before using it.

Since we assume 1-pod-per-node, we know that the node used by the current pod
Member

How are we going to verify that it's indeed a 1-pod-per-node configuration?

I think that just relying on users is not enough here...

Author

So the current thought is this (I'll make this more explicit here).

The user has to opt in with the scheduling profile; we should mention this clearly there.

If they get it wrong, it will not necessarily lead to incorrect behavior, just a somewhat shifted scoring. So I think the cost is comparatively low.

I'm open to more tracking if we think that is appropriate. We can tell if a pod class is 1-pod-per-node, so we can flag that case if we'd like, and we should also be able to tell if multiple non-DaemonSet pods land on the same nodes after the fact.

Thoughts on how much you think is necessary?

Member

As mentioned offline, I'm personally ok with scheduling profile:

  1. an explicit scheduling profile allows us to explicitly "opt in" a specific workload for this behavior
  2. given we do feasibility validation anyway for the result from the cache, we won't break correctness here

There is a concern that @sanposhiho brought up during today's SIG Scheduling meeting that we can probably summarize as:
What if the cluster state changed in the meantime? (2) ensures that we won't schedule on infeasible node, but if some pods were terminated in the meantime (or nodes appeared), there may be a new spot for our pod in the cluster now that is better from scoring perspective but is not in cache (because it wasn't feasible when computing the cache).

My personal answer to that would be: given that it's a dedicated scheduling profile that users opt in to, it probably is ok (we don't break feasibility; we just potentially choose a non-optimal node).

FWIW, given that we only look for N feasible nodes for scoring anyway, for large enough clusters we can already end up choosing a non-optimal node (so I personally think this is not a problem even without a dedicated profile, but clearly that is more contentious).

So I would be ok with scheduling profile. But we need an input from @sanposhiho and @dom4ha

Member

What do you both mean by "scheduling profile"? The scheduler config or a new API in Pod? If it's the scheduler config, are we going to assume all pods in the cluster use 1-pod-per-node scheduling, without actually checking anything?

Author

By "scheduling profile" we do mean the scheduler config. So the user would need to create a scheduler config with a flag set and then reference it in the pod definitions. Thoughts on better ways to handle it?

@wojtek-t (Member) commented Oct 3, 2025

@bwsalmon - please sign the CLA

(and I'm happy to take the production readiness approval for it)

@wojtek-t wojtek-t self-assigned this Oct 3, 2025
@bwsalmon (Author) commented Oct 3, 2025

@bwsalmon - please sign the CLA

(and I'm happy to take the production readiness approval for it)

Cool. I tried to but it forwarded me to some Google reps to sign it... I sent the request but haven't heard back. Maybe I did it wrong?

Comment on lines 340 to 343
If plugin changes prove to be an issue, we could codify the signature as a new "Scheduling" object that only has a subset
of the fields of the pod. Plugins that "opt-in" could only be given access to this reduced scheduling object, and we could then use the entire scheduling object as the signature. This would make it more or less impossible for the signature and plugins to be out of sync, and would
naturally surface new dependencies as additions to the scheduling object. However, as we expect plugin changes to be relatively
modest, we don't believe the complexity of making the interface changes is worth the risk today.
Member

Duplicate paragraph R323-R326

Author

Yeah, it seemed like it fit both places but definitely isn't ideal.

Thoughts on where to keep it?

Member

I think here is better.

* VolumeBinding: Same as NodeVolumeLimits.
* VolumeRestrictions: Same as NodeVolumeLimits.
* VolumeZone: Same as NodeVolumeLimits.

Member

Do we need to take the defaultPreemption into our consideration?

Author

Hmmm, good point. Yes, I feel like we should, but I didn't find the reference in our plugins.

I assume we will need the priority class as well.

Let me figure out where in the code this is handled and we can decide how to add it in...

Author

Ok, I took a look through the code and came to an answer that is somewhat counter-intuitive to me.

It doesn't look like we need to include DefaultPreemption, and thus priority, in the signature, because we don't need to include PostFilter plugins. I did make that decision early on, but I'm re-digesting it now.

This is because we are caching the result after PreFilter, Filter, PreScore and Score, but before the PostFilter step. PostFilter only runs if we can't find anything after the previous steps; then we use the priority to decide if we should try to preempt something.

So regardless of the priority, we would (correctly) get the same results from the cache. But if they were empty, we would potentially evict in one case and not in the other.

If you get a chance, please take a look and make sure this matches your understanding...
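
To make the ordering argument above concrete, here is a toy sketch (invented types and names, not the real scheduler code) of where the cached result would be captured relative to the framework phases:

```go
package main

import "fmt"

// Toy stand-ins for the framework's types.
type pod struct {
	name     string
	priority int
}

var cache = map[string][]string{}

// filterAndScore stands in for PreFilter, Filter, PreScore and Score.
func filterAndScore(p pod) []string {
	return []string{"node-a", "node-b"} // feasible nodes, best first
}

func scheduleOne(p pod, signature string) {
	result := filterAndScore(p)
	// The cache is populated with the post-Score result, before PostFilter.
	cache[signature] = result

	if len(result) == 0 {
		// Only here does priority matter: PostFilter (preemption) runs when
		// nothing feasible was found, so priority need not be in the signature.
		fmt.Println("run PostFilter / preemption for", p.name, "with priority", p.priority)
		return
	}
	fmt.Println("bind", p.name, "to", result[0])
}

func main() {
	scheduleOne(pod{name: "p1", priority: 100}, "sig-1")
}
```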

extending the production code to implement this enhancement.
-->

- `<package>`: `<date>` - `<test coverage>`
Member

Please add the test files that this KEP will add/update, along with the current coverage for the existing ones.

Author

Done. Especially since this is my first KEP, PTAL and let me know if I'm capturing the appropriate information...

- a search in the Kubernetes bug triage tool (https://storage.googleapis.com/k8s-triage/index.html)
-->

- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature)
Member

Same, I expect we will add new integration tests as well.

Author

Done. As with unit tests, PTAL and make sure I am meeting the expected guidelines...

[existing list]: https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/
-->

- [ ] Feature gate (also fill in values in `kep.yaml`)
Member

Please add the feature gate and update it in the kep.yaml as well.

Author

Done.

@dom4ha (Member) commented Oct 6, 2025

/cc @sanposhiho @macsko

@dom4ha (Member) commented Oct 6, 2025

/label lead-opted-in
/milestone v1.35

@helayoty moved this from Needs Review to In Progress in SIG Scheduling Oct 7, 2025
@wojtek-t (Member) commented Oct 7, 2025

@bwsalmon - please sign the CLA
(and I'm happy to take the production readiness approval for it)

Cool. I tried to but it forwarded me to some Google reps to sign it... I sent the request but haven't heard back. Maybe I did it wrong?

I did that a long time ago, but historically it didn't require any interactions...

@sanposhiho (Member) left a comment

Honestly, for me, it still leaves a lot of question marks. Left comments.


some other pod, but this should be the only issue.

All stored results are timestamped, and we remove entries when they are more
than a few seconds old to ensure we do not get stale data. Since we are targeting
Member

How long is "a few seconds"? Given the scheduler handles ~300 pods, even if this TTL is 1 sec, the cluster state is changed by, at the very least, ~300 new running pods. Of course, on top of that, there are tons more cluster state changes, especially when it comes to the huge clusters this KEP is aiming at. Do we really want to keep using the same scheduling results?

Author (@bwsalmon, Oct 7, 2025)

Yes, we are talking about a few seconds (I'd say between 1 and 10), so 100s to 1000s of new pods, which keep the cache up to date as they flow through.

What other changes (besides pod death) should we expect?
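
As a rough illustration of the timestamp-based expiry being discussed (the TTL value, the entry layout and the function names here are assumptions):

```go
package main

import (
	"fmt"
	"time"
)

// cachedResult is a hypothetical cache entry: the ordered node list left over
// from an earlier scheduling cycle, plus the time it was computed.
type cachedResult struct {
	nodes      []string
	computedAt time.Time
}

// ttl stands in for the "few seconds" under discussion (1-10s range).
const ttl = 5 * time.Second

// get returns an entry only while it is fresh; stale entries are dropped so
// the pod falls back to a full scheduling cycle.
func get(cache map[string]cachedResult, sig string, now time.Time) ([]string, bool) {
	entry, ok := cache[sig]
	if !ok || now.Sub(entry.computedAt) > ttl {
		delete(cache, sig)
		return nil, false
	}
	return entry.nodes, true
}

func main() {
	cache := map[string]cachedResult{
		"sig-1": {nodes: []string{"node-a"}, computedAt: time.Now().Add(-30 * time.Second)},
	}
	_, fresh := get(cache, "sig-1", time.Now())
	fmt.Println("cache hit:", fresh) // false: the entry is older than the TTL
}
```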

Author

I see you mentioned label and taint changes, which makes a lot of sense, and we should discuss there...

Other things you would be concerned about?

Member

If we scope the cache to a single workload and consider it a cluster-state snapshot made at the time we schedule the first pod, the problem of invalidating the result is not as severe as it seems.

Yes, without setting NNN right away there is a greater chance that one-cycle scheduling can be more easily broken by other cluster changes happening in the meantime, but eventually each pod goes through its own complete pod-by-pod scheduling cycle, so the node it was originally assigned to will be re-evaluated to check whether it's still feasible.

Sure, it's a bit more problematic if it turns out not to be feasible anymore, since full filtering and scoring needs to take place and the final assignment can be corrected. That may not be consistent with the initial placement, but since we don't consider any cross-pod requirements yet (TAS, shared ResourceClaims, etc.), we don't have to care.

Once we start supporting cross-pod requirements, a failure to pass pod-by-pod scheduling would be a signal to reschedule the whole workload (go back to the single-cycle scheduling phase), but that is the next step.

* NodeResourcesFit: We use the output of the computePodResourceRequest function as the signature.
* NodeUnschedulable: We use the Tolerations field as the signature.
* NodeVolumeLimits: We use all Volume information except from Volumes of type ConfigMap or Secret.
* PodTopologySpread: If the PodTopologySpread field is set, or it is not set but a default set of rules are applied, we mark the pod unsignable; otherwise it returns an empty signature.
Member

a default set of rules are applied, we mark the pod unsignable

Meaning, all pods that belong to Service, ReplicaSet, StatefulSet or ReplicationController are not eligible to use the cache. That's a large portion.

Author

Unless the user overrides the defaults, yes. This is definitely the biggest restriction in the model today. I think it is relaxable with a more focused version of pod spread, but it may not be worth addressing given that multi-pod scheduling probably addresses the issue more naturally.

Thoughts?

Member

Meaning, all pods that belong to Service, ReplicaSet, StatefulSet or ReplicationController are not eligible to use the cache. That's a large portion.

Not necessarily. The workloads that would benefit the most are Jobs, JobSets, LWS, etc., which AFAIK cannot have default topology spread.

Member

I know this proposal is mostly for Jobs etc. (given it mentions ML workloads). Like I mentioned in the other thread, I'm talking about how small the target of this feature would be.

In order to utilize this feature, pods must be:

  • not part of a Service, ReplicaSet, StatefulSet, or ReplicationController; i.e., this feature is not useful for typical K8s clusters hosting a backend app, etc.
  • largely use a 1-pod-per-node scheduling strategy.
  • not use any complex scheduling constraints.
  • [in the future] not use gang scheduling

Why do we need to invest our time to implement/maintain it only for users who luckily have pods that match the above conditions?

Member

I'd rather reverse what you've just said.

The kube-scheduler is useful for clusters which run Services, ReplicaSets, StatefulSets, etc., but it does not do a good job when it needs to schedule big Jobs and run ML workloads, since it has to support the features developed for the former.

This is why there are so many workload schedulers which do scheduling better: they can ignore these problematic scheduling constraints. Sure, we are working on gang scheduling, but reusing filtering and scoring results would be a foundation of most gang scheduling algorithms, so I have no doubt that this feature is a step in the right direction, although I agree that in its simplest form it's hard to notice its value.

If we scope the cache to a single workload, it becomes more apparent that the signature framework allows us to detect workloads and gangs implicitly, even if our new workload object wasn't created upfront by a user. So we can use the single-cycle scheduling capabilities opportunistically, without enforcing minCount, since that's not known until we get an explicit workload definition from a user or controller.

What I'm describing is a more extended scope that we did not want to implement in the simplest form, but it's an alternative that is on the table.

node off the list. We then go down the nominated node path. Just as we would with a pod
with a NominatedNodeName in it, we only re-evaluate the feasibility of this node before using it.

Since we assume 1-pod-per-node, we know that the node used by the current pod
Member

Do you assume all pods in the cluster have 1-pod-per-node scheduling constraint? Or, could it be mixed like some are 1-pod-per-node scheduling, while others are normal scheduling, which multiple pods can land on the same node?

Author (@bwsalmon, Oct 7, 2025)

No, we just assume that pods that use the cache are 1-pod-per-node. So if some other (non-1-pod-per-node) workload uses a node it just gets removed from the cache and ignored by the 1-pod-per-node workloads going through the cache.

So the other workloads obviously don't get any advantage from the cache, but they also won't be hurt by it...

Author

Added this to the text.

Author

Closing comment, but feel free to reopen if I missed something...

Member

No, we just assume that pods that use the cache are 1-pod-per-node. So if some other (non-1-pod-per-node) workload uses a node it just gets removed from the cache and ignored by the 1-pod-per-node workloads going through the cache.

Please add an explanation of how we determine that a pod is doing 1-pod-per-node.

Author

The current proposal is that we determine a workload is 1-pod-per-node by requiring the user to mark it as such, by targeting it at a custom scheduler configuration with this option enabled. We can revisit this if we feel it is insufficient.

We can potentially determine if a workload is 1-pod-per-workload, i.e. we will never schedule another pod of this kind on the same node. We can even potentially (but with a lot of complexity) tell if there is any other current pod that could run alongside this pod, but there is no way to tell that a user will never create a workload in the future that will be able to run alongside this workload.

I would suggest that if we feel the current proposal is insufficient, we could add a constraint on the pod being 1-pod-per-workload (not 1-pod-per-node). However, this adds more constraints and it isn't clear to me that it would actually provide stronger guarantees than the current model.

Author

I would also say: it isn't clear to me that this would only benefit 1-pod-per-node workloads, only that it would be identical to the current scheduler for 1-pod-per-node workloads. We will never provide an infeasible solution; we will only, in some cases, spread out workloads more than they might have been with the current scheduler. I'm not certain this is actually going to be a problem, but I also don't have strong data confirming that it won't be. The hope would be to gather this kind of information through a deployment.

### Goals

* Improve the performance of scheduling large jobs on large clusters where the constraints are simple.
* Begin building infrastructure to support gang scheduling and other "multi-pod" scheduling requests.
Member

I think it was also mentioned in our SIG Scheduling meeting. But how would it fit our future picture of gang scheduling? I imagine a separate phase would decide all placements for the gang pods at once before proceeding to the existing filter/score/etc. phases.
Are we sure that this cache mechanism would still remain useful?
Even further, is it really something we should focus on and invest our time in now?

Author (@bwsalmon, Oct 7, 2025)

Yep, these are excellent questions; I tried to address them but looks like I need to make it more explicit.

So I think we expect the cache itself will be deprecated by gang scheduling; we won't need to cache between cycles because we will just handle the pods in the same cycle. This is why we have kept the cache relatively simple and focused on a particular use case.

But the signature and reuse mechanisms should remain useful. Unless we plan to rewrite all of our existing plugins we will need to evaluate a "representative" pod against all the nodes and then use the results from this pod for all of the pods in the same "class". This will still require the signature and a similar "reuse" logic to that of the cache.

The reason to invest in this now is to both build the signature framework and start to get familiar with the advantages and challenges of evaluating a group of pods against a set of nodes once, with at least some production exposure.

But I'd love to hear your thoughts.

Author

Another way to see it is this: the cache is a "Trojan Horse" for injecting multi-pod scheduling into the system without having to bite off the full interface and new scheduling cycles. It should provide immediate benefit to a focused set of customers, but the primary purpose (from my perspective, anyway) is to start getting production understanding of how multi-pod scheduling works in practice, and to start building frameworks that will support multi-pod scheduling.

Author

Also, just to be clear: I've been working on this KEP in close coordination with Eric, the author of the workload KEP, in addition (of course) to Dominik.

Member

So I think we expect the cache itself will be deprecated by gang scheduling; we won't need to cache between cycles because we will just handle the pods in the same cycle. This is why we have kept the cache relatively simple and focused on a particular use case.

Right, and that's the core reason I'm doubting whether we need to do this today, given we know that gang scheduling is coming.
Note that this cache mechanism is definitely not going to be that simple, implementation-wise.

But the signature and reuse mechanisms should remain useful.

For whom? This feature's target is already very small even at this point (because of many limitations), and after gang scheduling, how large a portion of pods in this world can get the benefit? And is that worthy enough for us to keep maintaining the feature?
I'm already not a fan of the idea of introducing an optimization specifically for 1-pod-per-node scheduling, even with the many other limitations.

The reason to invest in this now is to both build the signature framework and start to get familiar with the advantages and challenges of evaluating a group of pods against a set of nodes once, with at least some production exposure.
..
Another way to see it is this: the cache is a "Trojan Horse" for injecting multi-pod scheduling into the system

Hmm, I don't understand. For me, this feature and a future gang scheduling just look like two different things, and we're taking an unnecessary risk/effort here to try something not that similar, which we know will be less useful after gang scheduling arrives.

Member

Hmm, I don't understand. For me, this feature and a future gang scheduling just look like two different things, and we're taking an unnecessary risk/effort here to try something not that similar, which we know will be less useful after gang scheduling arrives.

Even if we have single-cycle gang scheduling, the process of finding a placement may be very complicated in the most generic case. We don't have any proposal on the table for how it could ultimately look, and it will be the most challenging and exciting part, tackling an NP-hard/NP-complete scheduling problem.

At the same time, we know that in this special case we can safely reuse both filtering and scoring results, and so achieve scheduling in O(N) complexity instead of O(N×M). Still, not every workload is eligible for it, so the mechanism of signatures would still be needed to discover that.

Workloads without a 1-pod-per-node usage pattern could only reuse filtering results and would have to repeat scoring after every pod. So in any case, the ability to discover what type of workload we are dealing with could drive a different algorithm, while this one seems the fastest.

So the question of how many customers could make use of it is valid, and even finding an early adopter would be useful to assess it.

Member

I know it will work for some limited number of workloads. I'm talking about what percentage of Kubernetes clusters will get benefit from this feature (see the conditions that must be satisfied to use this feature) and whether it really makes sense, as a community, to maintain this specific optimization in k/k.

Workloads without a 1-pod-per-node usage pattern could only reuse filtering results and would have to repeat scoring after every pod.

How can workloads without 1-pod-per-node reuse the filtering results? If it's 1-pod-per-node scheduling, if pod#1 is scheduled on node#1, we can simply remove node#1 from the cache. But, if it's not 1-pod-per-node scheduling, we're not sure if we should remove node#1 from the list or not.

Member

I'm talking about what percentage of Kubernetes clusters will get benefit from this feature

+1

How can workloads without 1-pod-per-node reuse the filtering results? If it's 1-pod-per-node scheduling, if pod#1 is scheduled on node#1, we can simply remove node#1 from the cache. But, if it's not 1-pod-per-node scheduling, we're not sure if we should remove node#1 from the list or not.

The signature filters out pod scheduling features which can cause invalidation or extension of the feasible nodes (the filtering results) and their scores. This way we should know that placing a pod on one of the nodes does not require recomputing them for the other nodes. This might cover a vast majority of cases for large workloads (which need gang scheduling), since pod affinities and the like are known for significantly slowing down scheduling.

So without the 1-pod-per-node constraint we would need to recompute feasibility and score on the node that was just taken before placing another pod there. This actually does not sound like a big degradation compared to the initial proposal, so do we really need to detect 1-pod-per-node as a special case? It would also imply keeping the cache itself.

The inter-pod affinity and anti-affinity features are more complex to support, since they may influence both feasibility and scoring results after each placement.

Author

I'm completely open to just saying this is for any workload. I agree that the difference in scoring is likely not a big issue; at worst we will spread pods out more than we would today, which it sounds like we are already trying to do anyway.

The restriction to 1-pod-per-node is meant as a way to be cautious; in this case we know the behavior will be identical to the current mechanisms, whereas for multi-pod-per-node cases the scoring would differ somewhat, even if the feasibility would still be identical.

The big thing from my perspective is that I don't think any of the work for 1-pod-per-node will be wasted. If we decide we really do need to re-evaluate scores and keep the nodes around, we can add that with incremental extra work.

* Topology spread rules (including inherited rules from the system default). This constraint we should attempt to lift in the future.

To construct a signature, we add a new function for each plugin to implement.
This function takes a pod and generates a signature for that plugin as a string.
Member

Can you add an explanation of how the interface will look in the KEP?

Author

Yes, good call. I'll pull in the signatures from the draft version.

Author

Ok, added the draft implementation here in the text, PTAL...
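
The draft interface lives in the KEP text itself; purely as a shape reference, here is a hedged sketch of what a per-plugin signature hook might look like, using toy types and invented names (PodSigner, SignPod):

```go
package main

import (
	"fmt"
	"strings"
)

// Pod is a toy stand-in for *v1.Pod.
type Pod struct {
	Tolerations []string
	// ... other fields a plugin might fold into its signature
}

// PodSigner is a hypothetical per-plugin hook: it returns the portion of the
// pod that the plugin's Filter/Score results depend on, as a string, or an
// error if the pod cannot be signed (and so must bypass the cache).
type PodSigner interface {
	Name() string
	SignPod(p *Pod) (string, error)
}

// nodeUnschedulableSigner mirrors the KEP's example for NodeUnschedulable:
// the Tolerations field is the only input it cares about.
type nodeUnschedulableSigner struct{}

func (nodeUnschedulableSigner) Name() string { return "NodeUnschedulable" }
func (nodeUnschedulableSigner) SignPod(p *Pod) (string, error) {
	return strings.Join(p.Tolerations, ","), nil
}

// Signature concatenates the per-plugin parts; any failure marks the pod
// unsignable, so it simply skips the cache.
func Signature(p *Pod, signers []PodSigner) (string, error) {
	parts := make([]string, 0, len(signers))
	for _, s := range signers {
		part, err := s.SignPod(p)
		if err != nil {
			return "", fmt.Errorf("%s: %w", s.Name(), err)
		}
		parts = append(parts, s.Name()+"="+part)
	}
	return strings.Join(parts, ";"), nil
}

func main() {
	sig, _ := Signature(&Pod{Tolerations: []string{"node.kubernetes.io/not-ready"}},
		[]PodSigner{nodeUnschedulableSigner{}})
	fmt.Println(sig)
}
```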


### Goals

* Improve the performance of scheduling large jobs on large clusters where the constraints are simple.
Member

Where's the bottleneck for the scheduling you are imagining in this cluster? If pods are that simple, most of the plugins should just return Skip at the PreXXXX phase.

Author

The bottleneck is just iterating through all the nodes and checking; once you have 100,000 nodes, just iterating through them and checking labels becomes pretty expensive. None of the checks individually should be expensive; there are just a lot of them. I did some benchmarking with the scheduler_perf tests; I can grab results from there if you'd like to look through them together...
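
As a purely illustrative back-of-the-envelope comparison (the counts are arithmetic, not benchmark results):

```go
package main

import "fmt"

func main() {
	// With N nodes and P same-signature pods, per-pod filtering touches every
	// node for every pod, while a cached node list is computed once and then
	// consumed roughly one node per pod.
	const nodes, pods = 100_000, 10_000
	naiveChecks := nodes * pods  // every pod filters every node
	cachedChecks := nodes + pods // one full pass, then O(1) per pod
	fmt.Println(naiveChecks, cachedChecks) // 1000000000 vs 110000
}
```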

Comment on lines +224 to +225
When a pod with the same signature comes later, we find the entry in our cache and pull the first
node off the list. We then go down the nominated node path. Just as we would with a pod
Member

So, there are multiple lists, one per signature. When a scheduling happens and node#1 is selected, are we going to iterate over all the lists and remove node#1 from each of them? (assuming node#1 is no longer available for pods because of the 1-pod-per-node assumption)

Author

When a host is used, it is removed from the cache.

The cache puts every entry on two linked lists: one for the signature and one for the host. So we don't have to iterate over anything to remove the host; we just remove the entries in the host list.
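
A minimal sketch of that double-indexed structure using container/list; the actual layout in the KEP draft may differ:

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is one (signature, host) record. It sits on two lists at once, so it
// can be unlinked from both without scanning anything.
type entry struct {
	signature, host string
	bySig, byHost   *list.Element
}

type cache struct {
	bySignature map[string]*list.List // signature -> entries, best node first
	byHost      map[string]*list.List // host -> entries that reference it
}

func newCache() *cache {
	return &cache{bySignature: map[string]*list.List{}, byHost: map[string]*list.List{}}
}

func (c *cache) add(sig, host string) {
	e := &entry{signature: sig, host: host}
	if c.bySignature[sig] == nil {
		c.bySignature[sig] = list.New()
	}
	if c.byHost[host] == nil {
		c.byHost[host] = list.New()
	}
	e.bySig = c.bySignature[sig].PushBack(e)
	e.byHost = c.byHost[host].PushBack(e)
}

// removeHost drops every cached entry referencing the host (e.g. because the
// host was just used by some pod); no iteration over signatures is needed.
func (c *cache) removeHost(host string) {
	l := c.byHost[host]
	if l == nil {
		return
	}
	for el := l.Front(); el != nil; el = el.Next() {
		e := el.Value.(*entry)
		c.bySignature[e.signature].Remove(e.bySig)
	}
	delete(c.byHost, host)
}

func main() {
	c := newCache()
	c.add("sig-1", "node-a")
	c.add("sig-2", "node-a")
	c.removeHost("node-a")
	fmt.Println(c.bySignature["sig-1"].Len(), c.bySignature["sig-2"].Len()) // 0 0
}
```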

Co-authored-by: Dominik Marciński <gmidon@gmail.com>
@k8s-ci-robot added the do-not-merge/invalid-commit-message label (Indicates that a PR should not merge because it has an invalid commit message) Oct 7, 2025
bwsalmon and others added 7 commits October 7, 2025 10:39
Co-authored-by: Dominik Marciński <gmidon@gmail.com>
Co-authored-by: Dominik Marciński <gmidon@gmail.com>
Co-authored-by: Dominik Marciński <gmidon@gmail.com>
Co-authored-by: Dominik Marciński <gmidon@gmail.com>
Co-authored-by: Dominik Marciński <gmidon@gmail.com>
Co-authored-by: Dominik Marciński <gmidon@gmail.com>
Co-authored-by: Dominik Marciński <gmidon@gmail.com>
@bwsalmon (Author) commented Oct 7, 2025

@bwsalmon - please sign the CLA
(and I'm happy to take the production readiness approval for it)

Cool. I tried to but it forwarded me to some Google reps to sign it... I sent the request but haven't heard back. Maybe I did it wrong?

I did that a long time ago, but historically it didn't require any interactions...

Ok, let me see what I need to do...

@k8s-ci-robot added the size/XXL label (Denotes a PR that changes 1000+ lines, ignoring generated files) and removed the size/XL label (Denotes a PR that changes 500-999 lines, ignoring generated files) Oct 8, 2025
is what we cache; the current pod will use the first node on the list, and we cache the remaining
results, indexed by the pod's "scheduling signature" (which we will describe later).

When a pod with the same signature comes later, we find the entry in our cache and pull the first
Member

The alternative is to set nomination on all pods (schedule them in one cycle) vs. the option of pulling the cached node once a pod gets to its dedicated scheduling cycle.

The advantage would be that the resources needed for a workload would be not only scheduled but reserved (at least in memory), so that other pods that are interleaved with the workload ones in the scheduling queue don't take the space (invalidating some of the cached entries).

I guess it will matter more when we integrate this approach with the Workload object and gang scheduling, but for now we could keep it just as an alternative, which, however, is worth mentioning in the KEP.

Author

Yes, I'm very on board with this approach; the point of the cache was to avoid getting in the way of the workload / gang scheduling work. Let's discuss, because given Kensei's concerns around the cache as an object, this may be the right way to do it off the bat. I don't think this is more complicated than the cache, and it may be much simpler.

Member

It indeed solves a few problems, including memory consumption.

Member

The alternative is to set nomination on all pods (schedule them in one cycle) vs. the option of pulling the cached node once a pod gets to its dedicated scheduling cycle.

The primary problem with this approach is that those pods may not yet have been observed by the scheduler. We need a way to somehow enumerate them first to ensure that this "single cycle" knows which pods it should handle.

But overall I'm definitely supportive of this path - it's actually something we were talking about some time ago.

Member

True. Maybe we could simply reverse the problem and instead of caching results, we could wait for a few seconds for the workload pods to appear?

Member

@sanposhiho - for your thoughts too

Member

Great. I think we can tune the logic of waiting vs. expecting X pods, but overall the proposed approach here is to

  • account pods to a workload in the framework
  • schedule the workload in a few batches (put NNN as the result)
  • "cache" filtering/scoring results per batch, instead of allowing them to outlive one scheduling cycle

Is that the common ground?

Member

That is what I'm suggesting. But we also need @sanposhiho's opinion here.

Author (@bwsalmon, Oct 10, 2025)

Ok, I believe that when you say "implicit workload" and I say "signature" we mean the same thing. I think I'm aligned but let me state my understanding:

V 0.1 Opportunistic batching (this KEP; a rough code sketch follows this list)

  1. When a pod is placed in the queue it is given a "signature" / "implicit workload" that describes the scheduling parameters for the pod; in some cases (pod affinity, etc) we may choose to mark it as "unbatchable", which will skip all of this logic.
  2. When we pull a pod off the queue for scheduling, we opportunistically grab as many pods with matching "signature" / "implicit workload" as is easily doable.
  3. We evaluate the first pod against all nodes.
  4. We assign nominated node name for as many of the pods in the "signature" / "implicit workload" as possible using the results from the first pod. Note that this will require the same assumption as we do for 1-pod-per-node: we will use one host for one pod, but I think we are aligned that this is fine even for "multi-pod-per-node" workloads. If we can't assign all pods, we assign as many as we can.
  5. We run ScheduleOne for each pod in the group we could assign.
  6. We put the pods we couldn't assign back in the queue to be tried again.
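
A hedged, self-contained sketch of the v0.1 flow above, using toy types rather than the scheduler framework's:

```go
package main

import "fmt"

// Pod is a toy queued pod.
type Pod struct {
	Name      string
	Signature string // "" means unbatchable
	Nominated string
}

// popBatch pulls the head pod plus any easily reachable pods that share its
// signature (step 2 above); everything else stays queued.
func popBatch(queue []Pod) (batch, rest []Pod) {
	head := queue[0]
	batch = append(batch, head)
	for _, p := range queue[1:] {
		if head.Signature != "" && p.Signature == head.Signature {
			batch = append(batch, p)
		} else {
			rest = append(rest, p)
		}
	}
	return batch, rest
}

// feasibleNodes stands in for evaluating the first pod against all nodes
// (step 3); it would return feasible nodes in score order.
func feasibleNodes(p Pod, nodes []string) []string { return nodes }

func main() {
	queue := []Pod{{Name: "p1", Signature: "s"}, {Name: "p2", Signature: "s"}, {Name: "p3", Signature: "s"}}
	nodes := []string{"node-a", "node-b"} // fewer nodes than pods on purpose

	batch, requeue := popBatch(queue)
	ranked := feasibleNodes(batch[0], nodes)

	// Step 4: hand out one node per pod while nodes last; leftovers go back
	// to the queue (step 6).
	for i := range batch {
		if i < len(ranked) {
			batch[i].Nominated = ranked[i]
			fmt.Println(batch[i].Name, "-> nominated", batch[i].Nominated)
		} else {
			requeue = append(requeue, batch[i])
		}
	}
	fmt.Println("requeued pods:", len(requeue))
}
```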

V 1.0 Gangs (post this KEP)

All the same functionality as v 0.1, with the following additions:

  1. A pod assigned to a "gang" / "explicit workload" will be marked as such when added to the queue.
  2. For "explicit workloads" we do not process any pods until we have all the pods in the queue.
  3. After evaluating a single pod against all nodes and attempting to assign nominated node names, we fail to schedule all the "workload" pods if we cannot assign all pods in the set. In this case we remove all the nominated node names and don't actually schedule any of the pods.
  4. For "normal" workloads we continue to assign a "signature" / "implicit workload" and batch opportunistically, as we do in v 0.1.
  5. Depending on the constraints we may need to move to a model where we assign more than 1 pod for each node in our list. We can tackle this with scoring / fit changes if necessary.

Author

I should be able to implement a first version of this scheme this week.

@dom4ha (Member) commented Oct 8, 2025

@bwsalmon - please sign the CLA
(and I'm happy to take the production readiness approval for it)

Cool. I tried to but it forwarded me to some Google reps to sign it... I sent the request but haven't heard back. Maybe I did it wrong?

I did that a long time ago, but historically it didn't require any interactions...

Ok, let me see what I need to do...

This KEP does not seem to be tracked for PRR, as it hasn't gotten any heads-up from the v1.35 Enhancements team.

Not sure if the CLA is also a missing piece, but the KEP definitely should have an entry in the prod-readiness dir - keps/prod-readiness/sig-scheduling/5598.yaml. Note that the PRR freeze is tomorrow!

Thursday 9th October 2025 (AoE) / Friday 10th October 2025, 12:00 UTC

@macsko mentioned this pull request Oct 8, 2025
* Begin building infrastructure to support gang scheduling and other "multi-pod" scheduling requests.
* Ensure that the infrastructure we build is maintainable as we update, add and remove plugins.
* Never impact feasibility.
* Provide identical results to our current scheduler for 1-pod-per-node environments.
Member

We need more clarification of 1-pod-per-node scheduling.
What if some random pods w/o the 1-pod-per-node constraint land on nodes? Would those nodes be ineligible for pods w/ the 1-pod-per-node constraint? Or is the 1-pod-per-node constraint meant to be effective only between other pods with the 1-pod-per-node constraint (i.e., 1-pod-per-node pods can be colocated with random pods)?

Member

Alternative idea: we can consider some new extension points in the scheduling framework and implement this cache feature as a plugin. That way, we could avoid having this cache mechanism that works only for a specific set of pods in the scheduler core, and we might be able to have more plugins for different sets of pods in the future.

I'm still not convinced we should maintain this feature in k/k, though; this alternative approach would open another option: implement the framework change only in k/k, and implement this feature as an out-of-tree plugin in sigs/scheduler-plugins.

Member

Discovery of non-compatible pods (via the cross-plugin signature mechanism) would still be needed.

Member

Right, but that's not a blocker for this approach: we can have a new method on the framework handle so that a plugin can run other plugins to compute the signature (it's like how a preemption plugin runs other plugins' filter funcs).

Member

we can have a new method on the framework handle so that a plugin can run other plugins to compute the signature

True, but I'd rather think about creating a single-cycle workload scheduling phase as an extension point. This way the gang-scheduling plugin would be notified that there is a workload to be scheduled. Different plugins would have different sets of algorithms for doing that.

The ability to run other plugins' phases like filtering, scoring, and most of all this new signature one (to check pod compatibility with some algorithms) would become necessary and available to them. Even external schedulers increasingly need such a capability, so there are discussions about a need to extract it as a library; otherwise they are not able to provide valid placements.

Member

True, but I'd rather think about creating a single-cycle workload scheduling phase as an extension point.

I think we kind of agreed on that already as part of the gang-scheduling KEP (for Beta). So I'm definitely +1 for this.

But I think that on its own it can be a bit orthogonal (and on its own doesn't address Kensei's concern) [unless I misunderstood your comment].
In addition to being able to call other plugins to compute the signature, we would still need:
(1) either the cache itself somewhere - which Kensei is concerned about
(2) or the ability to ensure that we collect all similar pods first before processing them all at once as part of this extension point (then the cache remains local to a single scheduling cycle, which is great)

(2) sounds much better (and is much more aligned with the vision), but we don't have any mechanism to collect all the pods that we want to process together (or to even learn how many of them we want to collect). So I think that is the missing bit here.

Member

Re (2): true, I'm just describing how this extension point may look in the future.

So overall I agree with Kensei's proposal to eventually keep the cache in a plugin (rather than in the framework). We wanted to avoid defining this extension point now, before we understand well how single-phase workload scheduling should look. Without it, it would be hard to pass the cache results back to the framework.

However, I now better understand the concern about making this cache a part of the framework.

@k8s-ci-robot (Contributor)

Keywords which can automatically close issues and at(@) or hashtag(#) mentions are not allowed in commit messages.

The list of commits with invalid commit messages:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bwsalmon
Once this PR has been reviewed and has the lgtm label, please ask for approval from wojtek-t and additionally assign dom4ha for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wojtek-t (Member) commented Oct 9, 2025

/ok-to-test

@k8s-ci-robot added the ok-to-test label (Indicates a non-member PR verified by an org member that is safe to test) and removed the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test) Oct 9, 2025
@k8s-ci-robot (Contributor)

@bwsalmon: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-enhancements-verify
Commit: 8c8fb39
Details: link
Required: true
Rerun command: /test pull-enhancements-verify

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
