
kep-5278: nominated node name for an expected pod placement #5287


Draft · wants to merge 7 commits into master

Conversation

sanposhiho
Member

  • One-line PR description: add kep-5278: nominated node name for an expected pod placement
  • Other comments:

@sanposhiho sanposhiho marked this pull request as draft May 7, 2025 10:52
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels May 7, 2025
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label May 7, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling May 7, 2025
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 7, 2025
Member Author

@sanposhiho sanposhiho left a comment

@dom4ha @wojtek-t
Opened a draft PR. I know there are other sections that we have to write.

kep-number: 5278
authors:
- "@sanposhiho"
- "@wojtek-t"
Member Author

@sanposhiho sanposhiho May 7, 2025

I put you here @wojtek-t. Or maybe we can just move you to PRR reviewer if you don't have many things to edit? Either is fine with me.

Member Author

@wojtek-t Who do you think we can ask for PRR?

Member

I would suggest @soltysh probably
Given that I'm pushing some of my ideas through this KEP, it probably shouldn't be me :)

Member Author

/assign @soltysh

Please assign it to someone else, if needed! 🙇

@sanposhiho
Member Author

/cc @macsko

@sanposhiho
Member Author

/hold

Not to merge it without sig-scheduling approval

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 7, 2025
@sanposhiho
Member Author

/sig autoscaling

@k8s-ci-robot k8s-ci-robot added the sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. label May 7, 2025
@liggitt liggitt added this to @liggitt May 7, 2025

So, they know where those pods are likely going to be scheduled.

By specifying their expectation on `NominatedNodeName`, the scheduler can first check whether the pod can go to the nominated node,
Member

There is one more thing that I would like to be able to handle, which is related to CA (and Karpenter).

As CA, when I'm creating some nodes I already computed some placement for that and would like to be able to let kube-scheduler reuse that (if it places pods differently they may no longer fit). But the main point is that the nodes don't yet exist.
I chatted with @x13n and even though it's not the case now, it's possible to change CA so that it knows the name of the node it will create. In which case they would be able to set the NominatedNodeName appropriately.

But the problem is - this node doesn't exist, so the scheduler will look at it and effectively ignore it (because it doesn't exist).
So we want to ensure that if the node set via NominatedNodeName doesn't exist, we will not clear it.

I can think of a few different options:

  • we try to schedule, but if the pod is unschedulable and node with NNN name doesn't exist, we leave the NNN as is
  • we don't even try to schedule if NNN doesn't exist [that is risky though, because if the node won't appear, we will never schedule]
  • some hybrid where CA additionally sets some annotation on the pod like "nnn-is-coming: true" :)

But we also need a mechanism so that when a new node comes up, we first see whether there are any pods with NNN set to that node and try to schedule them first (probably taking into account whether there aren't higher-priority ones).

Member Author

But the problem is - this node doesn't exist, so the scheduler will look at it and effectively ignore it (because it doesn't exist).
So we want to ensure that if the node set via NominatedNodeName doesn't exist, we will not clear it.

Good point.

we try to schedule, but if the pod is unschedulable and node with NNN name doesn't exist, we leave the NNN as is

I like this over others.
One additional point is what happens if the pod triggers preemption. If the pod can go somewhere after the preemption, we should prioritize that, rather than waiting for a new node to be registered.
So, the condition (for not clearing NNN) should be:

  • node with NNN doesn't exist
  • the scheduler isn't trying to put a new NNN (i.e., preemption hasn't been triggered in the scheduling cycle)

we don't even try to schedule if NNN doesn't exist [that is risky though, because if the node won't appear, we will never schedule]

Yeah, risky. If external components (or even the scheduler after preemption) put an existing node's name, but the node is deleted right after that, the pod could get stuck.
Also, if many existing running pods are deleted after CA triggers the node creation, we might be able to schedule pending pods over there without waiting for new nodes to come up.
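
As a rough sketch, the condition above could look like this (illustrative names only, not the actual scheduler API):

```go
// Sketch: decide whether to keep pod.status.nominatedNodeName after a failed
// scheduling attempt. All names here are illustrative, not real scheduler types.
func shouldKeepNominatedNodeName(currentNNN string, nodeExists bool, newNomination string) bool {
	if currentNNN == "" {
		return false // nothing to keep
	}
	// Preemption (or anything else in this cycle) proposed a different placement:
	// prefer that over waiting for a node that may never appear.
	if newNomination != "" && newNomination != currentNNN {
		return false
	}
	// The nominated node isn't registered yet (e.g. CA is still provisioning it):
	// keep the nomination so the pod is retried once the node shows up.
	return !nodeExists
}
```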

Member

So, the condition (for not clearing NNN) should be:

  • node with NNN doesn't exist
  • the scheduler isn't trying to put a new NNN (i.e., preemption hasn't been triggered in the scheduling cycle)

Fair - if the scheduler in the meantime found a different place for that pod, we should take that into account.

The main question that I have based on that is:

  • do we really want to clear NNN first and later set it to the new value
  • or do we actually want to allow for "changing NNN in place"
    The latter would allow us to avoid an additional API call, so I think we should consider it.

Member Author

Yeah, I mentioned somewhere in the KEP that we could have a validation webhook to prevent "changing NNN", but we probably cannot: from a performance perspective, of course, and also because an external component might set NNN in the window between the scheduler clearing it and setting a new value. (What should the scheduler do then? Clear it again, or give up setting a new value? Complexity.)

So, it's better to allow changing NNN from a non-empty value, with just one API call.


Even if the implementation were changed so that the CA knows the name of the node it plans to provision in advance, isn’t it still possible that the node might fail to be provisioned properly? For example, if compute resources or GPUs in a specific cloud zone are exhausted, the underlying instance might not be provisioned at all. In that case, wouldn't there be a need to change the NNN in place? Or does the CA (or Karpenter) only assign the NNN when it's certain that the node will eventually join the cluster?

Member Author

@sanposhiho sanposhiho May 9, 2025

CA can change NNN if it detects that the instance failed to start for some reason and retries creating another, or the scheduler might change NNN if it runs preemption for the pod. Or, even other external components can overwrite NNN somehow.
Otherwise, we don't have to change NNN (NNN for a non-existing node should just be a no-op).

Member

As CA, when I'm creating some nodes I already computed some placement for that and would like to be able to let kube-scheduler reuse that (if it places pods differently they may no longer fit)

Currently we cannot prevent kube-scheduler from placing pods differently, as the semantics of nomination are best effort in this regard. In particular, whenever the scheduler changes a CA decision, there is a risk that a group of pods would no longer be schedulable. I think we should clearly call it a non-goal; this type of scheduling would hopefully be covered by reservations.

we try to schedule, but if the pod is unschedulable and node with NNN name doesn't exist, we leave the NNN as is

It sounds doable. There is a risk that NNN would point to a no-longer-existing node, but the scheduler would either find a new place for the pod, or it would remain unschedulable and CA should find a new home for it.

Member

+1 to both of your comments

Member

A potential improvement (though a much bigger change than this KEP on its own) would be to externalize upcoming nodes. Cluster Autoscaler currently only holds them in memory, while Karpenter has NodeClaims as its own CRD. Making the scheduler aware of upcoming nodes would not only let it make better decisions when handling pods with NNN set, it would also allow it to "schedule" pods on upcoming nodes when there is no capacity available but we expect it to be provisioned. It could potentially even bring NNN field ownership back to the scheduler, if upcoming nodes had a list of the pods that triggered their creation.

Member Author

@sanposhiho sanposhiho May 17, 2025

I love that idea.

But, if an external component updates `NominatedNodeName` that is set by the scheduler,
the pod could end up having different `NominatedNodeName` and `NodeName`.

Probably we should clear `NominatedNodeName` when the pod is bound. (at binding api)
Member

+1 to it

But technically, nothing prevents you from setting NominatedNodeName after the NodeName is set, so there can be divergence anyway.

Member

Couldn't apiserver reject setting NominatedNodeName on bound pods?

Member Author

Couldn't apiserver reject setting NominatedNodeName on bound pods?

We can do that too. But, the scenario here is:

  1. The binding cycle sets NNN.
  2. Someone sets NNN. (while the binding cycle is still running)
  3. The binding cycle calls the binding api.
  4. NNN and NodeName would be different.

So, we need to clear NNN at the binding API.
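
A minimal sketch of what "clear it at the binding API" could mean on the apiserver side (hypothetical helper; the real change would live in kube-apiserver's pod binding path):

```go
package sketch

import v1 "k8s.io/api/core/v1"

// applyBinding illustrates the idea: when the Binding subresource assigns
// spec.nodeName, also drop status.nominatedNodeName, so an NNN written by an
// external component while the binding cycle was in flight cannot end up
// diverging from the final NodeName.
func applyBinding(pod *v1.Pod, target v1.ObjectReference) {
	pod.Spec.NodeName = target.Name
	pod.Status.NominatedNodeName = ""
}
```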

### The scheduler puts `NominatedNodeName`

After the pod is permitted at `WaitOnPermit`, the scheduler needs to update `NominatedNodeName` with the node that it determines the pod is going to.
Member

I just went through: kubernetes/kubernetes#125491 (comment)

Is there anything we need to do to accommodate what @dom4ha described there about double-accounting?

Member Author

@sanposhiho sanposhiho May 8, 2025

A new flow would be like this:

  1. The scheduler decides where the pod goes to at the scheduling cycle.
  2. At the end of the scheduling cycle, it "assumes" the pod on the node (= reserves the pod's place on the scheduler's cache)
  3. At the beginning of the binding cycle, it adds NNN to the Pod (added by this KEP)
  4. The event handler receives the pod's update (the pod getting NNN), but given the pod is assumed, the update is just ignored.

https://github.com/kubernetes/kubernetes/blob/ff74d46bf074476d798653584657ef974306053f/pkg/scheduler/eventhandlers.go#L155-L161

So, I don't see any problem here. Adding NNN after "assume" shouldn't cause anything.
Though maybe I don't 100% understand what @dom4ha was concerned about.
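
For reference, a simplified paraphrase of the linked event-handler logic (not the exact upstream code):

```go
package sketch

import v1 "k8s.io/api/core/v1"

// assumedChecker is a stand-in for the slice of the scheduler cache used here.
type assumedChecker interface {
	IsAssumedPod(pod *v1.Pod) (bool, error)
}

// onPodUpdate paraphrases updatePodInSchedulingQueue: once the pod has been
// assumed at the end of the scheduling cycle, later updates to the API object
// (such as the NNN write at the start of the binding cycle) are not re-queued,
// so setting NNN there doesn't cause double-accounting.
func onPodUpdate(cache assumedChecker, oldPod, newPod *v1.Pod, requeue func(old, new *v1.Pod)) {
	if oldPod.ResourceVersion == newPod.ResourceVersion {
		return // identical objects, nothing to do
	}
	if assumed, err := cache.IsAssumedPod(newPod); err == nil && assumed {
		return // pod is already assumed on a node; ignore the update here
	}
	requeue(oldPod, newPod)
}
```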

Member

@dom4ha dom4ha May 9, 2025

The point was that nominated node and assumed node are two different things for the scheduler. We are now going to use one of these concepts for both cases, which was my concern.

Because of this concern we started discussions about reservations to better understand the problem space before we start changing the nomination semantics. So on one hand, we'd like to move things forward and fix a few problems, but on the other hand, we may still want to continue investigating how other similar cases relate to the nomination.

I wonder whether we're in a position where we understand them well enough to jump into implementation, or whether we should use this proposal to explore various options. I'd prefer to avoid a situation where we use NNN just because it's easier to reuse something that already exists.

What do you think about listing different use cases first? The obvious ones that I see are:

  1. Scheduler decides to preempt some pods and wants to inform other components that there is a pod which is about to take the freed-up space (the final assignment decision can be changed at any time)
  2. Scheduler decided to bind a pod, but performs pre-binding steps first, so wants to inform other components that the binding process started (the final assignment is already decided, only a subsequent preemption may change it)
  3. CA case
  4. Scheduler itself tries to find a place for group of pods in gang-scheduling case
  5. Kueue case (external scheduler does gang-scheduling)
    etc.

Looking at the above list, the CA case sounds the most similar to the preemption case (a nomination is a suggestion that can change without any harm). Delayed binding is a bit different, though.

Member Author

@sanposhiho sanposhiho May 10, 2025

Your 5 use cases are also what I could come up with now, but none of them seems problematic to me for just reusing NNN. (Do you see a problem?)

I'd prefer to avoid a situation when we use NNN just because it's easier to reuse something that already exists.

Unless we come up with problems or further use cases that NNN cannot accommodate, it's natural to pick the simplest solution, just reusing NNN. Rather, the ease/simplicity is actually a big factor.
We can expand NNN's concept and still say NNN is the node name proposed for the next step, in general. If external components set it or preemption sets it (i.e., before scheduling cycles), NNN means the proposal to the scheduling cycle. If the scheduling cycle sets it, NNN means the proposal to the binding cycle.

I don't think we need to try coming up with all possible future use cases right now (impossible), especially because NNN is an existing field. We can design a completely new thing when such a use case or problem actually comes up over time, and switch away from NNN (if needed). That's my thought process.

@sanposhiho
Member Author

I updated the KEP once, based on the discussion that we've had so far.

We can expose those internal reservations with `NominatedNodeName` so that external components can take a more appropriate action
based on the expected pod placement.

### External components want to specify a preferred pod placement
Member

One thing that worries me the most is letting other components change the NNN that was set by the scheduler. So far the assigned scheduler was the only owner, so if we change it, we'd have to think about how it impacts existing processes. I'm not saying those are real problems, but I think the KEP should elaborate on them:

  1. What if CA changes NNN that was set by scheduler as a part of the preemption or delayed binding process
  2. Is it possible that components keep changing NNN for a pod because they run different logics
  3. Does it matter if there are multiple schedulers?
  4. How races are resolved (it should not be a problem, just we should describe it)

Member

  1. I wanted to ask: what if we say that CA can only set the NNN but can't change it? But then we actually face the problem that if the placement is no longer valid, the scheduler would have to clear it. And we have a chicken-and-egg problem around that. I agree we need an answer here.

2, 3 and 4 - those are, in my mental model, flavors of the same problem. If we can have more components trying to schedule the same pod - that's already a problem. So I would basically say that we should document that you need to think through who can set NNN for a given pod in a given state, and ensure that we have a single component responsible for doing that.

Member

@dom4ha dom4ha May 9, 2025

@x13n
Also, the current semantics are that NNN is set in an "overallocation" mode, meaning it's expected that the preempted pods will disappear to make room for the nominated one.

  • Doesn't CA ignore pods with NNN in its scaling decisions? How could it distinguish its own nominations from preemption nominations?
  • What if, for whatever reason (not sure if it's possible), the nomination set by CA is also in "overallocation" mode, but already-running pods are not going to be turned down?
  • Don't we have other components that may assume something about the NNN semantics? I don't know much about Karpenter and many other components that might exist.

Member Author

@sanposhiho sanposhiho May 10, 2025

What if CA changes NNN that was set by scheduler as a part of the preemption or delayed binding process

CA should not change NNN if it's non-empty. Even more, it basically should not process any pods that have NNN.
One exception is that it can remove (or change) NNN that it set itself if, for example, it failed to create an instance and tries to create another.

Is it possible that components keep changing NNN for a pod because they run different logics

As mentioned in the risk section, we will just document such a situation as a risk, not doing any technical mitigation for now.

Does it matter if there are multiple schedulers?

Why do we need to consider multiple schedulers here? We don't recommend that at all at the moment, and hence I don't think we need to care.

How races are resolved (it should not be a problem, just we should describe it)

It is described in the risk section. As we have discussed, we don't regard NNN as something that the scheduler must follow, and hence NNN could be ignored, which we don't consider a problem.

Doesn't CA ignore pods with NNN in its scaling decisions?

(If it does, then we should change it not to do that.)

How could it distinguish its own nominations from preemption nominations?

Why does it have to distinguish?

What if, for whatever reason (not sure if it's possible), the nomination set by CA is also in "overallocation" mode, but already-running pods are not going to be turned down?

How would CA set NNN in "overallocation" mode, in your words? It should only put a new node in NNN. Or do you have any possible scenario in mind?

Don't we have other components that may assume something about the NNN semantics? I don't know much about Karpenter and many other components that might exist.

As far as I know, no. And I think this KEP should be acceptable on this point because:

  • in the first place, we haven't yet instructed users to use it, or shown how they can use it and how the scheduler would react.
  • also, this KEP doesn't add a huge change either way; rather, we basically try to make sure NNN works for external components here.

Member

CA doesn't ignore pods with NNN, it pretends they are already scheduled on that node when doing simulations. This can lead to an "invalid" in-memory state where a node is over-allocated, but it allows subsequent logic to account for that pod already being there (which can affect e.g. pod affinity calculation for other pods).

Doesn't CA ignore pods with NNN in its scaling decisions? How could it distinguish its own nominations from preemption nominations?

CA can track the nodes it tried to create - so if it sees VM name "foo" not being provisioned - it is time to drop "foo" from NNN on all pods. The only caveat is that it requires some refactoring to make CA better track ongoing node provisioning, but it is something we were thinking about doing anyway.

What if, for whatever reason (not sure if it's possible), the nomination set by CA is also in "overallocation" mode, but already-running pods are not going to be turned down?

That'd be a bug in Cluster Autoscaler, but scheduler should be able to recover easily, just by realizing the pod doesn't fit and doing a full scan of the Filter phase.
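
Roughly, the simulation behavior described above could be pictured like this (illustrative only, not actual Cluster Autoscaler code):

```go
package sketch

import v1 "k8s.io/api/core/v1"

// simulatedAssignments illustrates treating pending pods with a
// NominatedNodeName as if they were already scheduled there, even when that
// briefly over-allocates a node (or the node doesn't exist yet), so that later
// decisions account for those pods already being "there".
func simulatedAssignments(pendingPods []*v1.Pod) map[string][]*v1.Pod {
	byNode := map[string][]*v1.Pod{}
	for _, p := range pendingPods {
		if nnn := p.Status.NominatedNodeName; nnn != "" {
			byNode[nnn] = append(byNode[nnn], p)
		}
	}
	return byNode
}
```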

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sanposhiho
Once this PR has been reviewed and has the lgtm label, please ask for approval from soltysh. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 12, 2025
@sanposhiho
Member Author

sanposhiho commented May 12, 2025

Appended another update based on the latest discussion

Comment on lines +400 to +403
- We will rely on the fact that a pod with NominatedNodeName set results in an in-memory reservation of the requested resources.
Higher-priority pods can ignore it, but pods with equal or lower priority don't have access to these resources.
This allows us to prioritize nominated pods when the nomination was done by external components.
We just need to ensure that, in the case where NominatedNodeName is assigned by an external component, the nomination gets reflected in the scheduler's memory.
Member

I don't fully understand this point. We would want to run Filter plugins always with GE priority nominated pods?

Member Author

Sorry, what's GE priority?

Member

I meant greater than or equal priority. Okay, we do it now as well.

We are planning not to clear nominatedNodeName when the pod is unschedulable. What if the scheduling of such a pod isn't successful for a long time for some reason, preventing lower-priority pods from being temporarily scheduled on a node that is (virtually) occupied by the high-priority nominated pod?

Member Author

We are planning not to clear nominatedNodeName when the pod is unschedulable. What if the scheduling of such a pod isn't successful for a long time for some reason, preventing lower-priority pods from being temporarily scheduled on a node that is (virtually) occupied by the high-priority nominated pod?

Yeah, that's exactly what I noticed at #5287 (comment). In some sense, that is an expected behaviour; as long as NNN is there, the scheduler has to honor it. So, the external component that sets a long-lived NNN should be responsible for clearing NNN when NNN is no longer valid.

### The scheduler puts `NominatedNodeName`

After the pod is permitted at `WaitOnPermit`, the scheduler needs to update `NominatedNodeName` with the node that it determines the pod is going to.
Member

What if WaitOnPermit takes a long time and leads to similar CA problems as a long PreBind? Of course, we don't have any in-tree plugin that uses the Permit phase, but shouldn't we consider it?

Member Author

@sanposhiho sanposhiho May 17, 2025

WaitOnPermit takes a long time

Hmm, you're right (I was focusing too much on the PreBind problem). Should we set NNN before WaitOnPermit then?

Member

Looks like we should do this before WaitOnPermit. So, I think we could set NNN iff any Permit plugin returned Wait or any PreBindPreFlight returned non-Skip.
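
Expressed as a condition (a sketch only; `PreBindPreFlight` is the pre-flight check proposed alongside this KEP, and the status strings are stand-ins for the framework's codes):

```go
// shouldSetNNNBeforeWaitOnPermit sketches the rule above: pay for the extra
// API call only when binding will actually be delayed, i.e. some Permit plugin
// returned Wait or some PreBindPreFlight returned something other than Skip.
func shouldSetNNNBeforeWaitOnPermit(permitStatuses, preFlightStatuses []string) bool {
	for _, s := range permitStatuses {
		if s == "Wait" {
			return true
		}
	}
	for _, s := range preFlightStatuses {
		if s != "Skip" {
			return true
		}
	}
	return false
}
```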




Usually, the scheduler scans all the nodes in the cluster when scheduling pods.

When the cluster autoscaler creates instances for pending pods, it calculates which new node might get which pending pod.
If it can put `NominatedNodeName` based on that calculation, it could tell the scheduler that the node can probably be picked up for the pod's scheduling,
Member

But scheduler can still process another pod earlier and land it on the new node, correct? NNN doesn't eliminate this race condition unless scheduler somehow reserves the capacity for pods with NNN set.

Member Author

unless scheduler somehow reserves the capacity for pods with NNN set.

It actually does. When NNN is set on a pending pod, the scheduler reserves the place for this pod on the node, and this reservation can only be ignored by higher-priority pods.
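
A simplified illustration of that reservation, as described in the KEP text above (not the actual framework code): when filtering a candidate pod on a node, pods nominated to that node with equal or higher priority are counted as if they were already running there, so equal- or lower-priority candidates cannot take those resources.

```go
package sketch

import v1 "k8s.io/api/core/v1"

// podsToAssume returns the pods that should be treated as already running on
// `node` while filtering `candidate`: the actually running pods plus any pods
// nominated to that node with priority >= the candidate's. Higher-priority
// candidates effectively ignore the reservation.
func podsToAssume(node string, running, nominated []*v1.Pod, candidate *v1.Pod) []*v1.Pod {
	assumed := append([]*v1.Pod{}, running...)
	for _, p := range nominated {
		if p.Status.NominatedNodeName == node && priority(p) >= priority(candidate) {
			assumed = append(assumed, p)
		}
	}
	return assumed
}

func priority(p *v1.Pod) int32 {
	if p.Spec.Priority != nil {
		return *p.Spec.Priority
	}
	return 0
}
```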


#### Increasing the load to kube-apiserver

If we simply implement this, we'd double the API calls during a simple binding cycle (NNN + actual binding),
Member

Should we be concerned about additional load on the apiserver coming from Cluster Autoscaler? Most of the time CA will generate much lower load than the scheduler, but in the case of a failing large cluster scale-up there will be a spike of NNN setting, followed by NNN clearing.


and the node doesn't exist.

In order for the cluster autoscaler to leverage this feature,
it has to put an unexisting node's name, which is supposed to be registered later after its scale-up,
Member

nit: unexisting -> upcoming?

So, we need to keep the node's name on `NominatedNodeName` even when the node doesn't exist.
We'll discuss it at [Only modifying `NominatedNodeName`](#only-modifying-nominatednodename) section.

#### [CA scenario] A new node's taint prevents the pod from going there, and the scheduler ends up clearing `NominatedNodeName`
Member

Maybe worth noting this is pretty common in GPU provisioning scenarios, where the GPU driver removes a taint after finishing device initialization.

- if NominatedNodeName is set, but the corresponding Node doesn't exist, kube-scheduler will NOT clear it when the pod is unschedulable [assuming that a node might appear soon]
- We will rely on the fact that a pod with NominatedNodeName set results in an in-memory reservation of the requested resources.
Higher-priority pods can ignore it, but pods with equal or lower priority don't have access to these resources.
This allows us to prioritize nominated pods when the nomination was done by external components.
Member

Does that mean the scheduler will process pods with NNN set before pods without NNN set, as long as they have the same priority?

Member Author

@sanposhiho sanposhiho May 17, 2025

This "prioritize" here is not about the scheduling order. It just says, if pod-A has NNN to node-A, then other pods, except higher priority ones, can not steal the space of pod-A on node-A.
I'll change the wording to make it clearer.


We will implement integration tests simulating the above behavior of external components.

#### The scheduler only modifies `NominatedNodeName`, not clears it in any cases
Member

Should external components follow the same principle? When Cluster Autoscaler realizes the node didn't get provisioned, should it clear NNN or keep it as is, potentially overwriting it when another scale-up is triggered? Clearing may lead to a race condition if the scheduler wanted to trigger preemption, while leaving it as is will mislead the scheduler into thinking the node can still get provisioned. The latter will waste some CPU cycles, but perhaps is safer.

Member Author

@sanposhiho sanposhiho May 17, 2025

Should external components follow the same principle?

or keep it as is, potentially overwriting it when another scale-up is triggered?

This could be a discussion point, but, for now, I'd say Yes: the cluster autoscaler can just leave NNN even if it failed to create a new node. (and overwrite with a new NNN potentially)

leaving it as is will mislead the scheduler into thinking the node can still get provisioned

NNN pointing to a non-existing node would just be ignored. So, I don't see any problem. Rather, clearing NNN might just cause unnecessary load on kube-apiserver.
Even if NNN is not valid anymore, as you said, it would just waste a tiny bit of time trying to get a non-existing node from the scheduler cache at every scheduling cycle.

Member Author

@sanposhiho sanposhiho May 17, 2025

Hmm, thinking more about this clearing stuff... CA is OK because a non-valid NNN points to a non-existing node, which the scheduler can just ignore.
However, let's say an external component sets NNN to an existing node, but later notices this NNN is no longer valid, and also cannot find a new node to overwrite NNN with.
In this case, the external component would need to clear NNN...? Otherwise the scheduler would keep reserving the pod's place on the invalid NNN node.
But, if we allow them to clear NNN, you mentioned the race condition with preemption, which is a good point. We need to instruct users to use PUT or SSA PATCH. I'm not an expert though; PUT or SSA PATCH causes a Conflict error when such a race condition happens, while other patch types (JSON Patch etc.) wouldn't cause Conflict errors.
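
A sketch of what that guidance could look like for an external component using server-side apply (the field manager name is illustrative):

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	corev1ac "k8s.io/client-go/applyconfigurations/core/v1"
	"k8s.io/client-go/kubernetes"
)

// setNominatedNodeName applies status.nominatedNodeName via server-side apply
// on the status subresource. With Force: false, a concurrent writer that owns
// the field (e.g. the scheduler during preemption) makes this request fail
// with a Conflict instead of being silently overwritten.
func setNominatedNodeName(ctx context.Context, cs kubernetes.Interface, ns, name, node string) error {
	apply := corev1ac.Pod(name, ns).
		WithStatus(corev1ac.PodStatus().WithNominatedNodeName(node))
	_, err := cs.CoreV1().Pods(ns).ApplyStatus(ctx, apply, metav1.ApplyOptions{
		FieldManager: "external-nominator", // illustrative manager name
		Force:        false,
	})
	return err
}
```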

If it can put `NominatedNodeName` based on that calculation, it could tell the scheduler that the node can probably be picked up for the pod's scheduling,
preventing the double effort of scanning/calculating all nodes again at scheduling retries.

#### Story 3: Kueue specifies `NominatedNodeName` to indicate where it prefers pods being scheduled to
Member Author

@mimowo @tenzen-y PTAL at this story and let me know if I'm mistaken or if there's any modification you'd prefer to make.
