
kep-5278: nominated node name for an expected pod placement #5287


Draft · wants to merge 7 commits into master

Conversation

sanposhiho
Member

  • One-line PR description: add kep-5278: nominated node name for an expected pod placement
  • Other comments:

@sanposhiho sanposhiho marked this pull request as draft May 7, 2025 10:52
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels May 7, 2025
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label May 7, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling May 7, 2025
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 7, 2025
Member Author

@sanposhiho sanposhiho left a comment

@dom4ha @wojtek-t
Opened a draft PR. I know there are other sections that we have to write.

kep-number: 5278
authors:
- "@sanposhiho"
- "@wojtek-t"
Member Author

@sanposhiho sanposhiho May 7, 2025

I put you here @wojtek-t. Or maybe we can just move you to PRR reviewer if you don't have many things to edit? Either is fine with me.

Member Author

@wojtek-t Who do you think we can ask for PRR?

Member

I would suggest @soltysh probably
Given that I'm pushing some of my ideas through this KEP, it probably shouldn't be me :)

Member Author

/assign @soltysh

Please assign it to someone else, if needed! 🙇

@sanposhiho
Member Author

/cc @macsko

@sanposhiho
Member Author

/hold

Not to merge it without sig-scheduling approval

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 7, 2025
@sanposhiho
Member Author

/sig autoscaling

@k8s-ci-robot k8s-ci-robot added the sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. label May 7, 2025
@liggitt liggitt added this to @liggitt May 7, 2025

So, they know where those pods are likely going to be scheduled.

By specifying their expectation on `NominatedNodeName`, the scheduler can first check whether the pod can go to the nominated node,
Member

There is one more thing that I would like to be able to handle, which is related to CA (and Karpenter).

As CA, when I'm creating some nodes I already computed some placement for that and would like to be able to let kube-scheduler reuse that (if it places pods differently they may no longer fit). But the main point is that the nodes don't yet exist.
I chatted with @x13n and even though it's not the case now, it's possible to change CA so that it knows the name of the node it will create. In which case they would be able to set the NominatedNodeName appropriately.

But the problem is - this node doesn't exist, so the scheduler will look at it and effectively ignore it (because it doesn't exist).
So we want to ensure that if the node set via NominatedNodeName doesn't exist, we will not clear it.

I can think of a few different options:

  • we try to schedule, but if the pod is unschedulable and node with NNN name doesn't exist, we leave the NNN as is
  • we don't even try to schedule if NNN doesn't exist [that is risky though, because if the node won't appear, we will never schedule]
  • some hybrid where CA additionally sets some annotation on the pod like "nnn-is-coming: true" :)

But we also need a mechanism so that when a new node comes up, we first see whether there are any pods with NNN set to that node and try to schedule them first (probably taking into account whether there aren't higher-priority ones).

Member Author

But the problem is - this node doesn't exist, so the scheduler will look at it and effectively ignore it (because it doesn't exist).
So we want to ensure that if the node set via NominatedNodeName doesn't exist, we will not clear it.

Good point.

we try to schedule, but if the pod is unschedulable and node with NNN name doesn't exist, we leave the NNN as is

I like this over others.
One additional point is what happens if the pod triggers preemption. If the pod can go somewhere after the preemption, we should prioritize that, rather than waiting for a new node to be registered.
So, the condition (for not clearing NNN) should be:

  • node with NNN doesn't exist
  • the scheduler isn't trying to put a new NNN (i.e., preemption hasn't been triggered in the scheduling cycle)

we don't even try to schedule if NNN doesn't exist [that is risky though, because if the node won't appear, we will never schedule]

Yeah, risky. If external components (or even the scheduler after preemption) put an existing node's name, but the node is deleted right after that, the pod could get stuck.
Also, if many existing running pods are deleted after CA triggers the node creation, we might be able to schedule pending pods over there without waiting for new nodes to come up.
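
As a rough sketch, the condition above could look like this (illustrative names only, not the actual scheduler API):

```go
// Sketch: decide whether to keep pod.status.nominatedNodeName after a failed
// scheduling attempt. All names here are illustrative, not real scheduler types.
func shouldKeepNominatedNodeName(currentNNN string, nodeExists bool, newNomination string) bool {
	if currentNNN == "" {
		return false // nothing to keep
	}
	// Preemption (or anything else in this cycle) proposed a different placement:
	// prefer that over waiting for a node that may never appear.
	if newNomination != "" && newNomination != currentNNN {
		return false
	}
	// The nominated node isn't registered yet (e.g. CA is still provisioning it):
	// keep the nomination so the pod is retried once the node shows up.
	return !nodeExists
}
```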

Member

So, the condition (for not clearing NNN) should be:

  • node with NNN doesn't exist
  • the scheduler isn't trying to put a new NNN (i.e., preemption hasn't been triggered in the scheduling cycle)

Fair - if the scheduler in the meantime found a different place for that pod, we should take that into account.

The main question that I have based on that is:

  • do we really want to clear NNN first and later set it to the new value
  • or do we actually want to allow for "changing NNN in place"
    The latter would allow us to avoid an additional API call, so I think we should consider it.

Member Author

Yeah, I mentioned somewhere in the KEP that we could have a validation webhook to prevent "changing NNN", but we probably cannot: from a performance perspective, of course, and also because an external component might set NNN in the window between the scheduler clearing it and setting a new value. (What should the scheduler do then? Clear it again, or give up setting a new value? Complexity.)

So, it's better to allow changing NNN from a non-empty value, with just one API call.


Even if the implementation were changed so that the CA knows the name of the node it plans to provision in advance, isn’t it still possible that the node might fail to be provisioned properly? For example, if compute resources or GPUs in a specific cloud zone are exhausted, the underlying instance might not be provisioned at all. In that case, wouldn't there be a need to change the NNN in place? Or does the CA (or Karpenter) only assign the NNN when it's certain that the node will eventually join the cluster?

Member Author

@sanposhiho sanposhiho May 9, 2025

CA can change NNN if it detects that the instance failed to start for some reason and retries creating another, or the scheduler might change NNN if it runs preemption for the pod. Or, even other external components can overwrite NNN somehow.
Otherwise, we don't have to change NNN (NNN for a non-existing node should just be a no-op).

Member

As CA, when I'm creating some nodes I already computed some placement for that and would like to be able to let kube-scheduler reuse that (if it places pods differently they may no longer fit)

Currently we cannot prevent kube-scheduler from placing pods differently, as the semantics of nomination are best effort in this regard. In particular, whenever the scheduler changes a CA decision, there is a risk that a group of pods would no longer be schedulable. I think we should clearly call it a non-goal; this type of scheduling would hopefully be covered by reservations.

we try to schedule, but if the pod is unschedulable and node with NNN name doesn't exist, we leave the NNN as is

It sounds doable. There is a risk that NNN would point to a no-longer-existing node, but the scheduler would either find a new place for the pod, or it would remain unschedulable and CA should find a new home for it.

Member

+1 to both of your comments

Member

A potential improvement (though a much bigger change than this KEP on its own) would be to externalize upcoming nodes. Cluster Autoscaler currently only holds them in memory, while Karpenter has NodeClaims as its own CRD. Making the scheduler aware of upcoming nodes would not only let it make better decisions when handling pods with NNN set, it would also allow it to "schedule" pods on upcoming nodes when there is no capacity available but we expect it to be provisioned. It could potentially even bring NNN field ownership back to the scheduler, if upcoming nodes had a list of the pods that triggered their creation.

Member Author

@sanposhiho sanposhiho May 17, 2025

I love that idea.

But, if an external component updates `NominatedNodeName` that is set by the scheduler,
the pod could end up having different `NominatedNodeName` and `NodeName`.

Probably we should clear `NominatedNodeName` when the pod is bound. (at binding api)
Member

+1 to it

But technically, nothing prevents you from setting NominatedNodeName after the NodeName is set, so there can be divergence anyway.

Member

Couldn't apiserver reject setting NominatedNodeName on bound pods?

Member Author

Couldn't apiserver reject setting NominatedNodeName on bound pods?

We can do that too. But, the scenario here is:

  1. The binding cycle sets NNN.
  2. Someone sets NNN. (while the binding cycle is still running)
  3. The binding cycle calls the binding api.
  4. NNN and NodeName would be different.

So, we need to clear NNN at the binding API.
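
A minimal sketch of what "clear it at the binding API" could mean on the apiserver side (hypothetical helper; the real change would live in kube-apiserver's pod binding path):

```go
package sketch

import v1 "k8s.io/api/core/v1"

// applyBinding illustrates the idea: when the Binding subresource assigns
// spec.nodeName, also drop status.nominatedNodeName, so an NNN written by an
// external component while the binding cycle was in flight cannot end up
// diverging from the final NodeName.
func applyBinding(pod *v1.Pod, target v1.ObjectReference) {
	pod.Spec.NodeName = target.Name
	pod.Status.NominatedNodeName = ""
}
```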

### The scheduler puts `NominatedNodeName`

After the pod is permitted at `WaitOnPermit`, the scheduler needs to update `NominatedNodeName` with the node that it determines the pod is going to.
Member

I just went through: kubernetes/kubernetes#125491 (comment)

Is there anything we need to do to accommodate what @dom4ha described there about double-accounting?

Member Author

@sanposhiho sanposhiho May 8, 2025

A new flow would be like this:

  1. The scheduler decides where the pod goes to at the scheduling cycle.
  2. At the end of the scheduling cycle, it "assumes" the pod on the node (= reserves the pod's place on the scheduler's cache)
  3. At the beginning of the binding cycle, it adds NNN to the Pod (added by this KEP)
  4. The event handler receives the pod's update (the pod getting NNN), but given the pod is assumed, the update is just ignored.

https://github.com/kubernetes/kubernetes/blob/ff74d46bf074476d798653584657ef974306053f/pkg/scheduler/eventhandlers.go#L155-L161

So, I don't see any problem here. Adding NNN after "assume" shouldn't cause anything.
Though maybe I don't 100% understand what @dom4ha was concerned about.
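
For reference, a simplified paraphrase of the linked event-handler logic (not the exact upstream code):

```go
package sketch

import v1 "k8s.io/api/core/v1"

// assumedChecker is a stand-in for the slice of the scheduler cache used here.
type assumedChecker interface {
	IsAssumedPod(pod *v1.Pod) (bool, error)
}

// onPodUpdate paraphrases updatePodInSchedulingQueue: once the pod has been
// assumed at the end of the scheduling cycle, later updates to the API object
// (such as the NNN write at the start of the binding cycle) are not re-queued,
// so setting NNN there doesn't cause double-accounting.
func onPodUpdate(cache assumedChecker, oldPod, newPod *v1.Pod, requeue func(old, new *v1.Pod)) {
	if oldPod.ResourceVersion == newPod.ResourceVersion {
		return // identical objects, nothing to do
	}
	if assumed, err := cache.IsAssumedPod(newPod); err == nil && assumed {
		return // pod is already assumed on a node; ignore the update here
	}
	requeue(oldPod, newPod)
}
```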

Member

@dom4ha dom4ha May 9, 2025

The point was that nominated node and assumed node are two different things for the scheduler. We are now going to use one of these concepts for both cases, which was my concern.

Because of this concern we started discussions about reservations to better understand the problem space before we start changing the nomination semantics. So on one hand, we'd like to move things forward and fix a few problems, but on the other hand, we may still want to continue investigating how other similar cases relate to the nomination.

I wonder whether we're in a position where we understand them well enough to jump into implementation, or whether we should use this proposal to explore various options. I'd prefer to avoid a situation where we use NNN just because it's easier to reuse something that already exists.

What do you think about listing different use cases first? The obvious ones that I see are:

  1. Scheduler decides to preempt some pods and wants to inform other components that there is a pod which is about to take the freed-up space (the final assignment decision can be changed at any time)
  2. Scheduler decided to bind a pod, but performs pre-binding steps first, so wants to inform other components that the binding process started (the final assignment is already decided, only a subsequent preemption may change it)
  3. CA case
  4. Scheduler itself tries to find a place for group of pods in gang-scheduling case
  5. Kueue case (external scheduler does gang-scheduling)
    etc.

Looking at the above list, the CA case sounds the most similar to the preemption case (a nomination is a suggestion that can change without any harm). Delayed binding is a bit different, though.

Member Author

@sanposhiho sanposhiho May 10, 2025

Your 5 use cases are also what I could come up with now, but none of them seems problematic to me for just reusing NNN. (Do you see a problem?)

I'd prefer to avoid a situation when we use NNN just because it's easier to reuse something that already exists.

Unless we come up with problems or further use cases that NNN cannot accommodate, it's natural to pick the simplest solution, just reusing NNN. Rather, the ease/simplicity is actually a big factor.
We can expand NNN's concept and still say NNN is the node name proposed for the next step, in general. If external components set it or preemption sets it (i.e., before scheduling cycles), NNN means the proposal to the scheduling cycle. If the scheduling cycle sets it, NNN means the proposal to the binding cycle.

I don't think we need to try coming up with all possible future use cases right now (impossible), especially because NNN is an existing field. We can design a completely new thing when such a use case or problem actually comes up over time, and switch away from NNN (if needed). That's my thought process.

@sanposhiho
Member Author

I updated the KEP once, based on the discussion that we've had so far.

We can expose those internal reservations with `NominatedNodeName` so that external components can take a more appropriate action
based on the expected pod placement.

### External components want to specify a preferred pod placement
Member

One thing that worries me the most is letting other components change the NNN that was set by the scheduler. So far the assigned scheduler was the only owner, so if we change it, we'd have to think about how it impacts existing processes. I'm not saying those are real problems, but I think the KEP should elaborate on them:

  1. What if CA changes NNN that was set by scheduler as a part of the preemption or delayed binding process
  2. Is it possible that components keep changing NNN for a pod because they run different logics
  3. Does it matter if there are multiple schedulers?
  4. How races are resolved (it should not be a problem, just we should describe it)

Member

  1. I wanted to ask: what if we say that CA can only set the NNN but can't change it? But then we actually face the problem that if the placement is no longer valid, the scheduler would have to clear it. And we have a chicken-and-egg problem around that. I agree we need an answer here.

2, 3 and 4 - those are, in my mental model, flavors of the same problem. If we can have more components trying to schedule the same pod - that's already a problem. So I would basically say that we should document that you need to think through who can set NNN for a given pod in a given state, and ensure that we have a single component responsible for doing that.

Member

@dom4ha dom4ha May 9, 2025

@x13n
Also, the current semantics are that NNN is set in an "overallocation" mode, meaning it's expected that the preempted pods will disappear to make room for the nominated one.

  • Doesn't CA ignore pods with NNN in its scaling decisions? How could it distinguish its own nominations from preemption nominations?
  • What if, for whatever reason (not sure if it's possible), the nomination set by CA is also in "overallocation" mode, but already-running pods are not going to be turned down?
  • Don't we have other components that may assume something about the NNN semantics? I don't know much about Karpenter and many other components that might exist.

Member Author

@sanposhiho sanposhiho May 10, 2025

What if CA changes NNN that was set by scheduler as a part of the preemption or delayed binding process

CA should not change NNN if it's non-empty. Even more, it basically should not process any pods that have NNN.
One exception is that it can remove (or change) NNN that it set itself if, for example, it failed to create an instance and tries to create another.

Is it possible that components keep changing NNN for a pod because they run different logics

As mentioned in the risk section, we will just document such a situation as a risk, not doing any technical mitigation for now.

Does it matter if there are multiple schedulers?

Why do we need to consider multiple schedulers here? We don't recommend that at all at the moment, and hence I don't think we need to care.

How races are resolved (it should not be a problem, just we should describe it)

It is described in the risk section. As we have discussed, we don't regard NNN as something that the scheduler must follow, and hence NNN could be ignored, which we don't consider a problem.

Doesn't CA ignore pods with NNN in its scaling decisions?

(If it does, then we should change it not to do that.)

How could it distinguish its own nominations from preemption nominations?

Why does it have to distinguish?

What if, for whatever reason (not sure if it's possible), the nomination set by CA is also in "overallocation" mode, but already-running pods are not going to be turned down?

How would CA set NNN in "overallocation" mode, in your words? It should only put a new node in NNN. Or do you have any possible scenario in mind?

Don't we have other components that may assume something about the NNN semantics? I don't know much about Karpenter and many other components that might exist.

As far as I know, no. And I think this KEP should be acceptable on this point because:

  • in the first place, we haven't yet instructed users to use it, or shown how they can use it and how the scheduler would react.
  • also, this KEP doesn't add a huge change either way; rather, we basically try to make sure NNN works for external components here.

Member

CA doesn't ignore pods with NNN, it pretends they are already scheduled on that node when doing simulations. This can lead to an "invalid" in-memory state where a node is over-allocated, but it allows subsequent logic to account for that pod already being there (which can affect e.g. pod affinity calculation for other pods).

Doesn't CA ignore pods with NNN in its scaling decisions? How could it distinguish its own nominations from preemption nominations?

CA can track the nodes it tried to create - so if it sees VM name "foo" not being provisioned - it is time to drop "foo" from NNN on all pods. The only caveat is that it requires some refactoring to make CA better track ongoing node provisioning, but it is something we were thinking about doing anyway.

What if, for whatever reason (not sure if it's possible), the nomination set by CA is also in "overallocation" mode, but already-running pods are not going to be turned down?

That'd be a bug in Cluster Autoscaler, but scheduler should be able to recover easily, just by realizing the pod doesn't fit and doing a full scan of the Filter phase.
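
Roughly, the simulation behavior described above could be pictured like this (illustrative only, not actual Cluster Autoscaler code):

```go
package sketch

import v1 "k8s.io/api/core/v1"

// simulatedAssignments illustrates treating pending pods with a
// NominatedNodeName as if they were already scheduled there, even when that
// briefly over-allocates a node (or the node doesn't exist yet), so that later
// decisions account for those pods already being "there".
func simulatedAssignments(pendingPods []*v1.Pod) map[string][]*v1.Pod {
	byNode := map[string][]*v1.Pod{}
	for _, p := range pendingPods {
		if nnn := p.Status.NominatedNodeName; nnn != "" {
			byNode[nnn] = append(byNode[nnn], p)
		}
	}
	return byNode
}
```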

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sanposhiho
Once this PR has been reviewed and has the lgtm label, please ask for approval from soltysh. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 12, 2025
@sanposhiho
Member Author

sanposhiho commented May 12, 2025

Appended another update based on the latest discussion

Comment on lines +400 to +403
- We will rely on the fact that a pod with NominatedNodeName set results in an in-memory reservation of the requested resources.
Higher-priority pods can ignore it, but pods with equal or lower priority don't have access to these resources.
This allows us to prioritize nominated pods when the nomination was done by external components.
We just need to ensure that, in the case where NominatedNodeName is assigned by an external component, the nomination gets reflected in the scheduler's memory.
Member

I don't fully understand this point. We would want to run Filter plugins always with GE priority nominated pods?

Member Author

Sorry, what's GE priority?

Member

I meant greater than or equal priority. Okay, we do it now as well.

We are planning not to clear nominatedNodeName when the pod is unschedulable. What if the scheduling of such a pod isn't successful for a long time for some reason, preventing lower-priority pods from being temporarily scheduled on a node that is (virtually) occupied by the high-priority nominated pod?

Member Author

We are planning not to clear nominatedNodeName when the pod is unschedulable. What if the scheduling of such a pod isn't successful for a long time for some reason, preventing lower-priority pods from being temporarily scheduled on a node that is (virtually) occupied by the high-priority nominated pod?

Yeah, that's exactly what I noticed at #5287 (comment). In some sense, that is an expected behaviour; as long as NNN is there, the scheduler has to honor it. So, the external component that sets a long-lived NNN should be responsible for clearing NNN when NNN is no longer valid.

### The scheduler puts `NominatedNodeName`

After the pod is permitted at `WaitOnPermit`, the scheduler needs to update `NominatedNodeName` with the node that it determines the pod is going to.
Member

What if WaitOnPermit takes a long time and leads to similar CA problems as a long PreBind? Of course, we don't have any in-tree plugin that uses the Permit phase, but shouldn't we consider it?

Member Author

@sanposhiho sanposhiho May 17, 2025

WaitOnPermit takes a long time

Hmm, you're right (I was focusing too much on the PreBind problem). Should we set NNN before WaitOnPermit then?

Member

Looks like we should do this before WaitOnPermit. So, I think we could set NNN iff any Permit plugin returned Wait or any PreBindPreFlight returned non-Skip.
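
Expressed as a condition (a sketch only; `PreBindPreFlight` is the pre-flight check proposed alongside this KEP, and the status strings are stand-ins for the framework's codes):

```go
// shouldSetNNNBeforeWaitOnPermit sketches the rule above: pay for the extra
// API call only when binding will actually be delayed, i.e. some Permit plugin
// returned Wait or some PreBindPreFlight returned something other than Skip.
func shouldSetNNNBeforeWaitOnPermit(permitStatuses, preFlightStatuses []string) bool {
	for _, s := range permitStatuses {
		if s == "Wait" {
			return true
		}
	}
	for _, s := range preFlightStatuses {
		if s != "Skip" {
			return true
		}
	}
	return false
}
```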




Usually, the scheduler scans all the nodes in the cluster when scheduling pods.

When the cluster autoscaler creates instances for pending pods, it calculates which new node might get which pending pod.
If it can put `NominatedNodeName` based on that calculation, it could tell the scheduler that the node can probably be picked up for the pod's scheduling,
Member

But scheduler can still process another pod earlier and land it on the new node, correct? NNN doesn't eliminate this race condition unless scheduler somehow reserves the capacity for pods with NNN set.

Member Author

unless scheduler somehow reserves the capacity for pods with NNN set.

It actually does. When NNN is set on a pending pod, the scheduler reserves the place for this pod on the node, and this reservation can only be ignored by higher-priority pods.
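
A simplified illustration of that reservation, as described in the KEP text above (not the actual framework code): when filtering a candidate pod on a node, pods nominated to that node with equal or higher priority are counted as if they were already running there, so equal- or lower-priority candidates cannot take those resources.

```go
package sketch

import v1 "k8s.io/api/core/v1"

// podsToAssume returns the pods that should be treated as already running on
// `node` while filtering `candidate`: the actually running pods plus any pods
// nominated to that node with priority >= the candidate's. Higher-priority
// candidates effectively ignore the reservation.
func podsToAssume(node string, running, nominated []*v1.Pod, candidate *v1.Pod) []*v1.Pod {
	assumed := append([]*v1.Pod{}, running...)
	for _, p := range nominated {
		if p.Status.NominatedNodeName == node && priority(p) >= priority(candidate) {
			assumed = append(assumed, p)
		}
	}
	return assumed
}

func priority(p *v1.Pod) int32 {
	if p.Spec.Priority != nil {
		return *p.Spec.Priority
	}
	return 0
}
```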


#### Increasing the load to kube-apiserver

If we simply implement this, we'd double the API calls during a simple binding cycle (NNN + actual binding),
Member

Should we be concerned about additional load on the apiserver coming from Cluster Autoscaler? Most of the time CA will generate much lower load than the scheduler, but in the case of a failing large cluster scale-up there will be a spike of NNN setting, followed by NNN clearing.


and the node doesn't exist.

In order for the cluster autoscaler to leverage this feature,
it has to put an unexisting node's name, which is supposed to be registered later after its scale-up,
Member

nit: unexisting -> upcoming?

So, we need to keep the node's name on `NominatedNodeName` even when the node doesn't exist.
We'll discuss it at [Only modifying `NominatedNodeName`](#only-modifying-nominatednodename) section.

#### [CA scenario] A new node's taint prevents the pod from going there, and the scheduler ends up clearing `NominatedNodeName`
Member

Maybe worth noting this is pretty common in GPU provisioning scenarios, where the GPU driver removes a taint after finishing device initialization.

- if NominatedNodeName is set, but the corresponding Node doesn't exist, kube-scheduler will NOT clear it when the pod is unschedulable [assuming that a node might appear soon]
- We will rely on the fact that a pod with NominatedNodeName set results in an in-memory reservation of the requested resources.
Higher-priority pods can ignore it, but pods with equal or lower priority don't have access to these resources.
This allows us to prioritize nominated pods when the nomination was done by external components.
Member

Does that mean the scheduler will process pods with NNN set before pods without NNN set, as long as they have the same priority?

Member Author

@sanposhiho sanposhiho May 17, 2025

This "prioritize" here is not about the scheduling order. It just says, if pod-A has NNN to node-A, then other pods, except higher priority ones, can not steal the space of pod-A on node-A.
I'll change the wording to make it clearer.


We will implement integration tests simulating the above behavior of external components.

#### The scheduler only modifies `NominatedNodeName`, not clears it in any cases
Member

Should external components follow the same principle? When Cluster Autoscaler realizes the node didn't get provisioned, should it clear NNN or keep it as is, potentially overwriting it when another scale-up is triggered? Clearing may lead to a race condition if the scheduler wanted to trigger preemption, while leaving it as is will mislead the scheduler into thinking the node can still get provisioned. The latter will waste some CPU cycles, but perhaps is safer.

Member Author

@sanposhiho sanposhiho May 17, 2025

Should external components follow the same principle?

or keep it as is, potentially overwriting it when another scale-up is triggered?

This could be a discussion point, but, for now, I'd say Yes: the cluster autoscaler can just leave NNN even if it failed to create a new node. (and overwrite with a new NNN potentially)

leaving it as is will mislead the scheduler into thinking the node can still get provisioned

NNN pointing to a non-existing node would just be ignored. So, I don't see any problem. Rather, clearing NNN might just cause unnecessary load on kube-apiserver.
Even if NNN is not valid anymore, as you said, it would just waste a tiny bit of time trying to get a non-existing node from the scheduler cache at every scheduling cycle.

Member Author

@sanposhiho sanposhiho May 17, 2025

Hmm, thinking more about this clearing stuff... CA is OK because a non-valid NNN points to a non-existing node, which the scheduler can just ignore.
However, let's say an external component sets NNN to an existing node, but later notices this NNN is no longer valid, and also cannot find a new node to overwrite NNN with.
In this case, the external component would need to clear NNN...? Otherwise the scheduler would keep reserving the pod's place on the invalid NNN node.
But, if we allow them to clear NNN, you mentioned the race condition with preemption, which is a good point. We need to instruct users to use PUT or SSA PATCH. I'm not an expert though; PUT or SSA PATCH causes a Conflict error when such a race condition happens, while other patch types (JSON Patch etc.) wouldn't cause Conflict errors.
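
A sketch of what that guidance could look like for an external component using server-side apply (the field manager name is illustrative):

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	corev1ac "k8s.io/client-go/applyconfigurations/core/v1"
	"k8s.io/client-go/kubernetes"
)

// setNominatedNodeName applies status.nominatedNodeName via server-side apply
// on the status subresource. With Force: false, a concurrent writer that owns
// the field (e.g. the scheduler during preemption) makes this request fail
// with a Conflict instead of being silently overwritten.
func setNominatedNodeName(ctx context.Context, cs kubernetes.Interface, ns, name, node string) error {
	apply := corev1ac.Pod(name, ns).
		WithStatus(corev1ac.PodStatus().WithNominatedNodeName(node))
	_, err := cs.CoreV1().Pods(ns).ApplyStatus(ctx, apply, metav1.ApplyOptions{
		FieldManager: "external-nominator", // illustrative manager name
		Force:        false,
	})
	return err
}
```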

If it can put `NominatedNodeName` based on that calculation, it could tell the scheduler that the node can probably be picked up for the pod's scheduling,
preventing the double effort of scanning/calculating all nodes again at scheduling retries.

#### Story 3: Kueue specifies `NominatedNodeName` to indicate where it prefers pods being scheduled to
Member Author

@mimowo @tenzen-y PTAL at this story and let me know if I'm mistaken or if there's any modification you'd prefer to make.
