
Conversation

dejanzele
Contributor

  • One-line PR description: Promote KEP-4368: Job Managed By to GA
  • Other comments: NONE

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 29, 2025
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Sep 29, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Sep 29, 2025
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Sep 29, 2025
@wojtek-t wojtek-t self-assigned this Sep 29, 2025
@mimowo
Contributor

mimowo commented Sep 30, 2025

I would suggest updating the PR description, or the PR content, to clearly address the status of the graduation criteria. Here is my review; feel free to use it as a basis.

So far, with the adoption in MultiKueue, we have not had users requesting that extra debuggability. With the managedBy field being immutable, it seems the feature is not causing issues that require extensive debugging. I would suggest keeping these ideas as follow-up features and not blockers for the graduation.

  • Re-evaluate the need to skip reconciliation in the event handlers to optimize performance

I would say it could be a follow-up performance improvement, but it does not seem required. I have no record of MultiKueue users reporting performance issues on the management cluster due to the Job controller during SyncJob.

  • Assess the fragmentation of the ecosystem. Look for other implementations of a job controller and assess their conformance with k8s.

We already have a couple of implementations of the ManagedBy field that integrate with MultiKueue:

These implementations follow the k8s core design, in particular making the field immutable.

There are two slight differences:

Will be done in a follow-up PR.

PTAL @soltysh @atiratree @tenzen-y @mwielgus @kannon92

@wojtek-t
Member

wojtek-t commented Oct 3, 2025

/approve PRR

Member

@tenzen-y tenzen-y left a comment

Thank you!
LGTM Job PoV.

/lgtm
/hold
for other reviewers.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 3, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 3, 2025
@tenzen-y
Member

tenzen-y commented Oct 3, 2025

I would suggest updating the PR description, or the PR content, to clearly address the status of the graduation criteria. Here is my review; feel free to use it as a basis.

So far, with the adoption in MultiKueue, we have not had users requesting that extra debuggability. With the managedBy field being immutable, it seems the feature is not causing issues that require extensive debugging. I would suggest keeping these ideas as follow-up features and not blockers for the graduation.

  • Re-evaluate the need to skip reconciliation in the event handlers to optimize performance

I would say it could be a follow-up performance improvement, but it does not seem required. I have no record of MultiKueue users reporting performance issues on the management cluster due to the Job controller during SyncJob.

  • Assess the fragmentation of the ecosystem. Look for other implementations of a job controller and assess their conformance with k8s.

We already have a couple of implementations of the ManagedBy field that integrate with MultiKueue:

These implementations follow the k8s core design, in particular making the field immutable.

There are two slight differences:

Will be done in a follow-up PR.

PTAL @soltysh @atiratree @tenzen-y @mwielgus @kannon92

I totally agree with @mimowo.

Member

@atiratree atiratree left a comment

So far, with the adoption in MultiKueue, we have not had users requesting that extra debuggability. With the managedBy field being immutable, it seems the feature is not causing issues that require extensive debugging. I would suggest keeping these ideas as follow-up features and not blockers for the graduation.

+1, the metric and immutable field should be sufficient as a source of truth. Having other options seems like extra work without an apparent benefit.

We already have a couple of implementations of the ManagedBy field that integrate with MultiKueue:

Nice to see the usage, thanks for compiling the list. Can we mention these projects in the KEP?

It is good to see that the length constraint (the value cannot exceed 63 characters) is sufficient. But this validation is missing in some of the controllers. So I suppose it will throw an error if the value is longer than that?

assess their conformance with k8s.

How do we want to approach this? Should we have a set of tests that these implementations have to pass?

stable: "v1.35"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
Member

@atiratree atiratree Oct 5, 2025

Can you please update the metric name to jobs_by_external_controller_total to reflect reality? The README also needs an update.

Contributor

This comment still holds.

###### What steps should be taken if SLOs are not being met to determine the problem?

N/A.

Member

Do we have some new findings for the Troubleshooting section?

Contributor

I'm not aware of any. We have worked with some users setting up MultiKueue, but it doesn't seem like spec.managedBy brings any particular troubleshooting concerns (so far at least, and to my knowledge).


#### GA

- Address reviews and bug reports from Beta users
Member

Can we please update the e2e tests section and link the tests?

Contributor

This one holds, even if we added only that single e2e 😉

Contributor

This one holds, I'm still missing the link to the e2e.

Contributor Author

done

@mimowo
Contributor

mimowo commented Oct 6, 2025

It is good to see that the length constraint (the value cannot exceed 63 characters) is sufficient. But this validation is missing in some of the controllers. So I suppose it will throw an error if the value is longer than that?

Oh, good spot. Well, if other controllers, say KubeRay, allow more than 63 characters, then nothing will break for them. I think the idea of restricting it in core k8s was to make it more constrained.

How do we want to approach this? Should we have a set of tests that these implementations have to pass?

One idea would be to just review a couple of aspects of the implementations, say:

  • validation for spec.managedBy to be at most 63 characters in length
  • validation for spec.managedBy to have a specific format
  • validation for the status
  • immutability
  • skip inside event handlers vs. controller Reconcile

Then we could compile these aspects in a table and get a bigger picture. If we think some of the aspects are important to follow k8s core, we could open an issue in the projects to "reconsider" that aspect. We may try to achieve better coherence in the ecosystem this way, but the maintainers of the projects may not prioritize full alignment.
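For illustration, the first two aspects could be checked with a helper along these lines. This is a minimal sketch with assumed names; the real k8s validation uses apimachinery helpers for domain-prefixed paths, not this regexp:

```go
package main

import (
	"fmt"
	"regexp"
)

// maxManagedByLength mirrors the k8s constraint that spec.managedBy
// must be at most 63 characters.
const maxManagedByLength = 63

// managedByFormat is an illustrative approximation of the domain-prefixed
// path format, e.g. "kueue.x-k8s.io/multikueue". It is not the exact rule
// used by the API server.
var managedByFormat = regexp.MustCompile(`^[a-z0-9]([-a-z0-9.]*[a-z0-9])?(/[A-Za-z0-9]([-A-Za-z0-9_.]*[A-Za-z0-9])?)+$`)

// validateManagedBy checks the length and format aspects from the list above.
func validateManagedBy(value string) error {
	if len(value) > maxManagedByLength {
		return fmt.Errorf("spec.managedBy: must be no more than %d characters", maxManagedByLength)
	}
	if !managedByFormat.MatchString(value) {
		return fmt.Errorf("spec.managedBy: must be a domain-prefixed path, e.g. acme.io/my-controller")
	}
	return nil
}

func main() {
	fmt.Println(validateManagedBy("kueue.x-k8s.io/multikueue")) // valid value
}
```

A third-party controller adopting the field could run a check like this in its admission webhook, alongside an immutability check on update.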

@dejanzele dejanzele mentioned this pull request Oct 6, 2025
16 tasks
@mimowo
Contributor

mimowo commented Oct 6, 2025

Then we could compile these aspects in a table and get some bigger picture.

This is the table I was able to compile by checking the implementations in the projects:

|  | k8s Job | JobSet | Kubeflow Trainer v1 | Kubeflow Trainer v2 | KubeRay | Tekton Pipelines | AppWrapper |
|---|---|---|---|---|---|---|---|
| open set of allowed values | y | y | n | n | n | y | y |
| max length 63 | y | y | y (by closed set of options) | y (by closed set of options) | y (indirectly by closed set) | n | n |
| specific format | y | y | y (by closed set of options) | y (by closed set of options) | y (by closed set of options) | n | n |
| immutability | y | y | y | y | y | y | y |
| validation for the status | y | n | n | n | n | n | n |
| skip inside Reconcile | y | y | y | y | y | y (also in filtering) | y |

So, there is a bit of inconsistency about the "open set of values" vs. "closed set of values". I know that in Kubeflow Trainer the motivation was to avoid the extra work on status validation by closing the set to "recognized" orchestrators like Kueue only (potentially more in the future). It seems all projects consistently wanted to avoid the effort of validating the status, which we took on in the k8s Job.

I think the main point is that all of them follow the key aspects: immutability and skip inside Reconcile.
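The "skip inside Reconcile" aspect boils down to an early return for objects with a foreign managedBy value. A minimal, dependency-free sketch (the type and function names here are assumed for illustration, not the actual controller code):

```go
package main

import "fmt"

// builtInJobController is the reserved value for the default Job controller.
const builtInJobController = "kubernetes.io/job-controller"

// Job is a stripped-down stand-in for batch/v1 Job; in the real API,
// spec.managedBy is immutable once set.
type Job struct {
	ManagedBy *string
}

// syncJob mirrors the shape of the reconcile loop: if another controller
// (e.g. Kueue) owns the Job, the built-in controller returns early.
// Some implementations additionally filter in their event handlers.
func syncJob(job Job) (reconciled bool) {
	if job.ManagedBy != nil && *job.ManagedBy != builtInJobController {
		return false // delegated to an external controller, do nothing
	}
	// ... the actual reconciliation logic would run here ...
	return true
}

func main() {
	external := "kueue.x-k8s.io/multikueue"
	fmt.Println(syncJob(Job{}))                     // default controller manages it
	fmt.Println(syncJob(Job{ManagedBy: &external})) // skipped
}
```

Filtering in event handlers (rather than only inside the sync function) is the optional performance refinement discussed earlier in this thread; the early return alone is enough for correctness.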

@mimowo
Contributor

mimowo commented Oct 6, 2025

Actually, on the "status" validation note, I'm wondering if we should plan on simplifying this code by answering "true" for all. Maybe except "RejectFailedIndexesOverlappingCompleted", which can be considered a perf optimization (debatable).

The main reason we introduced the checks was to gracefully handle Jobs which could have been updated by the Job controller in an older version of k8s before the validation was introduced.

However:

  1. all Jobs created in 1.32+ wouldn't require the graceful handling, as the rules would prevent entering an invalid state
  2. it is very unlikely that older Jobs would violate the rules, because we haven't received any reports of the rules being violated after enabling the validation by default (except for Fix validation for Job with suspend=true,completions=0 to set Complete condition kubernetes#132614, but there the Jobs wouldn't compute anything useful anyway).

So, I'm wondering if it would be beneficial to clean up the code and perform the checks always; the question is "when". Some ideas that come to my mind:

  1. just now, we have waited long enough to be safe
  2. later under a dedicated feature gate, part of the KEP
  3. later without a feature gate, just send a PR in 1.36 or so
  4. later under a dedicated feature gate in the future

I'm personally hesitating between (1.) and (3.). (2.) and (4.) seem more involved, but I'm also ok with going that way.

wdyt @atiratree @soltysh @deads2k ?

@deads2k
Contributor

deads2k commented Oct 6, 2025

So, I'm wondering if it would be beneficial to clean up the code and perform the checks always; the question is "when". Some ideas that come to my mind:

1. just now, we have waited long enough to be safe

2. later under a dedicated feature gate, part of the KEP

3. later without a feature gate, just send a PR in 1.36 or so

4. later under a dedicated feature gate in the future

I'm personally hesitating between (1.) and (3.). (2.) and (4.) seem more involved, but I'm also ok with going that way.

Options 1. and 3. are definitely out. Is that table saying we have no known writers of violating data? If we have never had known writers of violating data, then I'm comfortable with 4. If we have ever had a known writer of violating data, then we cannot be guaranteed that all of the invalid data is cleaned up.

If we wait long enough, declarative validation ratcheting will likely clean this up, but in the meantime we cannot risk making /status unwritable.

@mimowo
Contributor

mimowo commented Oct 6, 2025

Is that table saying we have no known writers of violating data?

Except for the corner case issue:

  • in 1.32+ there is no writer, as the validation is "on" (assuming the Beta feature is enabled)
  • prior to 1.32 it could only be the Job controller, but I cannot recollect any specific issue violating the rules

If we have never had known writers of violating data, then I'm comfortable with 4.

Sure, I'm good with 4. I think turning the Reject... rules into "true" will not simplify the code dramatically anyway.

If we wait long enough, declarative validation ratcheting will likely clean this up, but in the meantime we cannot risk making /status unwritable.

Makes sense. I think we can make the "JobManagedBy" feature gate "GA" in 1.35 and wait a couple of releases to assume all "Jobs" are good. Is there any k8s guidance for how long to wait in similar cases?

I think "Jobs" are a bit safer than "Deployments", which are meant to run indefinitely. Jobs are meant to run only for a finite amount of time, so having a Job survive a year would be very unusual (I assume, but I know there might be exceptions).

@deads2k
Contributor

deads2k commented Oct 6, 2025

I'm open to trying with a dedicated featuregate in the future. We'd likely leave the gate for four releases before locking it.

@mimowo
Contributor

mimowo commented Oct 7, 2025

sgtm, we may just add a short note to the KEP to reconsider dropping the code that gracefully handles Jobs with invalid statuses after 1.39, behind a dedicated feature gate. I think the "Graduation Criteria" sub-section "Followup work" might be ok.

@soltysh
Contributor

soltysh commented Oct 7, 2025

I agree with David here; 1.32 with beta on doesn't guarantee that the feature was on in a cluster, because users might intentionally disable all beta features. So later, with a dedicated feature gate, seems the most acceptable approach.

Contributor

@soltysh soltysh left a comment

Left several comments, to which @mimowo has actually provided answers here already, so we need to make sure they are present in the doc.

stable: "v1.35"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
Contributor

This comment still holds.


#### GA

- Address reviews and bug reports from Beta users
Contributor

This one holds, even if we added only that single e2e 😉

@dejanzele dejanzele force-pushed the kep-4368/promote-to-stable branch from bf78ac9 to cfd4daf Compare October 8, 2025 05:53
@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 8, 2025
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 8, 2025
@dejanzele dejanzele force-pushed the kep-4368/promote-to-stable branch from cfd4daf to 8223c4d Compare October 8, 2025 13:44
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 8, 2025
@dejanzele
Contributor Author

@mimowo thank you for the detailed review! I believe I addressed the points from #5583 (comment)

@dejanzele dejanzele force-pushed the kep-4368/promote-to-stable branch from 8223c4d to 3e72790 Compare October 8, 2025 14:29
- "@mimowo"
owning-sig: sig-apps
status: implementable
status: implemented
Contributor

I think this should only be flipped after the cycle is complete.

Contributor

Correct, this is the last step after we complete the KEP, usually done post release.

- All projects consistently wanted to avoid the effort of validating the status which we took in the k8s Job

**Validation inconsistencies:**
- Some controllers (Tekton, AppWrapper) don't enforce the 63-character length limit. If they allow more than 64 characters, then nothing will break for them. The idea of restricting in the core k8s was to make it more constrained
Contributor

Suggested change
- Some controllers (Tekton, AppWrapper) don't enforce the 63-character length limit. If they allow more than 64 characters, then nothing will break for them. The idea of restricting in the core k8s was to make it more constrained
+ Some controllers (Tekton, AppWrapper) don't enforce the 63-character length limit. If they allow more than 63 characters, then nothing will break for them. The idea of restricting in the core k8s was to make it more constrained

Contributor

This holds.

Contributor

@mimowo mimowo left a comment

LGTM, I just left small comments

Contributor

@soltysh soltysh left a comment

The last few elements to address:

  1. Release checklist (minor), but let's get it updated.
  2. Missing e2e link (blocking); template for what the links should look like.
  3. A few typos (minor).
  4. implemented -> implementable (blocking).


#### GA

- Address reviews and bug reports from Beta users
Contributor

This one holds, I'm still missing the link to the e2e.

- All projects consistently wanted to avoid the effort of validating the status which we took in the k8s Job

**Validation inconsistencies:**
- Some controllers (Tekton, AppWrapper) don't enforce the 63-character length limit. If they allow more than 64 characters, then nothing will break for them. The idea of restricting in the core k8s was to make it more constrained
Contributor

This holds.

- "@mimowo"
owning-sig: sig-apps
status: implementable
status: implemented
Contributor

Correct, this is the last step after we complete the KEP, usually done post release.

@dejanzele dejanzele force-pushed the kep-4368/promote-to-stable branch from 3e72790 to 896d5e1 Compare October 9, 2025 12:51
@dejanzele
Contributor Author

dejanzele commented Oct 9, 2025

Reverted the status; checked all boxes in the release checklist; added a link to the e2e test per the provided template here

cc @mimowo @soltysh

Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
@dejanzele dejanzele force-pushed the kep-4368/promote-to-stable branch from 896d5e1 to aa388b8 Compare October 9, 2025 12:59
@mimowo
Contributor

mimowo commented Oct 9, 2025

/lgtm
Thank you 👍

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 9, 2025
Contributor

@soltysh soltysh left a comment

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dejanzele, soltysh, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 9, 2025
@soltysh
Contributor

soltysh commented Oct 9, 2025

with Wojtek's previous approval I'm also dropping the hold

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 9, 2025
@k8s-ci-robot k8s-ci-robot merged commit b27b544 into kubernetes:master Oct 9, 2025
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.35 milestone Oct 9, 2025
@github-project-automation github-project-automation bot moved this from Needs Triage to Done in SIG Apps Oct 9, 2025
@wojtek-t
Member

Sorry - just catching up with stuff.

I'm open to trying with a dedicated featuregate in the future. We'd likely leave the gate for four releases before locking it.

I agree with @deads2k and @soltysh here - the option (4) seems fine.

And for posterity - this still LGTM now.
