Commit cfd4daf

KEP-4368: Job Managed By; Promote to GA
Signed-off-by: Dejan Zele Pejchev <pejcev.dejan@gmail.com>
1 parent 253a46c commit cfd4daf

3 files changed, 25 additions and 22 deletions

keps/prod-readiness/sig-apps/4368.yaml

Lines changed: 2 additions & 0 deletions
@@ -3,3 +3,5 @@ alpha:
   approver: "@wojtek-t"
 beta:
   approver: "@wojtek-t"
+stable:
+  approver: "@wojtek-t"

keps/sig-apps/4368-support-managed-by-for-batch-jobs/README.md

Lines changed: 18 additions & 18 deletions
@@ -140,7 +140,7 @@ controller, and delegate the status synchronization to the Kueue controller.
 ### Non-Goals

 - passing custom parameters to the external controller
-- Introduce a new concurrency policy for CronJobs (eg. `ForbidActive` or `SoftForbid`)
+- Introduce a new concurrency policy for CronJobs (e.g. `ForbidActive` or `SoftForbid`)
 to replace a Job that is about to complete, but still has terminating pods.

 ## Proposal
@@ -278,7 +278,7 @@ a typo in the field value, or the cluster administrator may not install the
 custom controller (like MultiKueue) on that cluster.

 In such cases the user may not observe any progress by the job for a long time
-and my need to debug the Job.
+and may need to debug the Job.

 In order to allow for debugging of situations like this the Job controller will
 put a log line indicating the synchronization is delegated to another controller
@@ -352,9 +352,6 @@ type JobSpec struct {
 // all characters before the first "/" must be a valid subdomain as defined
 // by RFC 1123. All characters trailing the first "/" must be valid HTTP Path
 // characters as defined by RFC 3986. The value cannot exceed 64 characters.
-//
-// This field is alpha-level. The job controller accepts setting the field
-// when the feature gate JobManagedBy is enabled (disabled by default).
 // +optional
 ManagedBy *string
 }
@@ -493,9 +490,9 @@ The following scenarios are covered:
 - it does not add the Suspended condition for a Job with `.spec.suspend=true` ([link](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L2059)).
 - the Job controller reconciles jobs with custom "managedBy" field when the feature gate is disabled ([link](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L2030))
 - the Job controller handles correctly re-enablement of the feature gate [link](https://github.com/kubernetes/kubernetes/blob/169a952720ebd75fcbcb4f3f5cc64e82fdd3ec45/test/integration/job/job_test.go#L1691)
-- the `job_by_external_controller_total` metric is incremented when a new Job with custom "managedBy" is created ([link](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L2044-L2058))
-- the `job_by_external_controller_total` metric is not incremented for a new Job without "managedBy" or with default value ([link](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L2000-L2029))
-- the `job_by_external_controller_total` metric is not incremented for Job updates (regardless of the "managedBy") (tested indirectly as [here](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L2000-L2029) the Job controller updates the Job status)
+- the `jobs_by_external_controller_total` metric is incremented when a new Job with custom "managedBy" is created ([link](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L2044-L2058))
+- the `jobs_by_external_controller_total` metric is not incremented for a new Job without "managedBy" or with default value ([link](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L2000-L2029))
+- the `jobs_by_external_controller_total` metric is not incremented for Job updates (regardless of the "managedBy") (tested indirectly as [here](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L2000-L2029) the Job controller updates the Job status)

 The following scenarios related to [Terminating pods and terminal Job conditions](#terminating-pods-and-terminal-job-conditions) are covered:
 - `Failed` or `Complete` conditions are not added while there are still terminating pods ([link](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L1183))
@@ -523,7 +520,7 @@ We propose a single e2e test for the following scenario:
 API fields affected by the new validation rules
 - make CronJob more resilient by checking the Job condition is `Complete` when using `CompletionTime` (see [here](#custom-controllers-not-compatible-with-api-assumptions-by-cronjob))
 - The feature flag disabled by default
-- implement the `job_by_external_controller_total` metric
+- implement the `jobs_by_external_controller_total` metric

 Second Alpha (1.31):
 - preparatory fix to address all known inconsistencies between validation and the
@@ -548,7 +545,7 @@ Second Alpha (1.31):
 - Address reviews and bug reports from Beta users
 - Re-evaluate the ideas of improving debuggability (like [extended `kubectl`](#debuggability), [dedicated condition](#condition-to-indicated-job-is-skipped), or [events](#event-indicating-the-job-is-skipped))
 - Re-evaluate the need to skip reconciliation in the event handlers to optimize performance
-- Assess the fragmentation of the ecosystem. Look for other implementations of a job controller and asses their conformance with k8s.
+- Assess the fragmentation of the ecosystem. Look for other implementations of a job controller and assess their conformance with k8s.
 - Lock the feature gate

 #### Deprecation
@@ -752,7 +749,7 @@ being flipped between two owners.
 The feature is opt-in so in case of such problems the custom "managedBy" field
 should not be used.

-Also, an admin could check if the value of the `job_by_external_controller_total`
+Also, an admin could check if the value of the `jobs_by_external_controller_total`
 matches the expectations. For example, if the value of the metric does not increase
 when new jobs are being added with a custom "managedBy" field, it might be
 indicative that the feature is not working correctly.
@@ -831,7 +828,7 @@ checking if there are objects with field X set) may be a last resort. Avoid
 logs or events for this purpose.
 -->

-Check the `job_by_external_controller_total` metric. If the value is non-zero
+Check the `jobs_by_external_controller_total` metric. If the value is non-zero
 for a field, it means there were Jobs using the custom controller created, so
 the feature is in use.

@@ -885,22 +882,24 @@ Pick one more of these and delete the rest.

 - [x] Metrics
 - Metric name:
-- `job_by_external_controller_total` (new), with the `controller_name` label
+- `jobs_by_external_controller_total` (new), with the `controller_name` label
 corresponding to the custom value of the "managedBy" field. The metric is
 incremented by the built-in Job controller on each ADDED Job event,
 corresponding to a Job with custom value of the "managedBy" field.
 This metric can be helpful to determine the health of a job and its controller
 in combination with already existing metrics (see below).
+- Components exposing the metric: kube-controller-manager
 - `apiserver_request_total[code=409, resource=job, group=batch]` (existing):
-substantial increase of this metric, when additionally `job_by_external_controller_total>0`
+substantial increase of this metric, when additionally `jobs_by_external_controller_total>0`
 may be indicative of two controllers stepping onto each-other causing
 conflicts (see [here](#two-controllers-running-at-the-same-time-on-old-version)).
+- Components exposing the metric: kube-apiserver
 - `kube_cronjob_status_active` (existing), substantial increase of this
 metric, may suggest that there are accumulating non-progressing jobs controlled
-by `CronJob`. If additionally `job_by_external_controller_total>0` it may suggest
+by `CronJob`. If additionally `jobs_by_external_controller_total>0` it may suggest
 that the Jobs are getting stuck due to not being synchronized by the custom
 controller.
-- Components exposing the metric: kube-apiserver
+- Components exposing the metric: kube-apiserver

 ###### Are there any missing metrics that would be useful to have to improve observability of this feature?

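
Illustrative sketch (not part of this commit): the counting rule described in the hunk above for `jobs_by_external_controller_total` (increment once per ADDED event for a Job with a custom "managedBy", never on updates, never for an unset or default value) can be modelled with a small in-memory counter. The `Job` struct, `externalJobCounter`, and its methods are hypothetical stand-ins, and the reserved default value `kubernetes.io/job-controller` is an assumption. The real controller would use a labelled Prometheus-style counter rather than a map.

```go
package main

import (
	"fmt"
	"sync"
)

// defaultController is assumed to be the reserved value that selects the
// built-in Job controller.
const defaultController = "kubernetes.io/job-controller"

// Job is a pared-down stand-in for batch/v1 Job with only the fields used here.
type Job struct {
	Name      string
	ManagedBy *string
}

// externalJobCounter models a counter metric labelled by controller_name.
type externalJobCounter struct {
	mu     sync.Mutex
	counts map[string]int
}

func newExternalJobCounter() *externalJobCounter {
	return &externalJobCounter{counts: make(map[string]int)}
}

// OnAdd is called once per ADDED Job event and increments the counter only
// when the Job carries a custom (non-default) managedBy value.
func (c *externalJobCounter) OnAdd(job Job) {
	if job.ManagedBy == nil || *job.ManagedBy == defaultController {
		return
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[*job.ManagedBy]++
}

// OnUpdate intentionally leaves the counter untouched: updates never count.
func (c *externalJobCounter) OnUpdate(oldJob, newJob Job) {}

// Value returns the current count for one controller_name label.
func (c *externalJobCounter) Value(controllerName string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.counts[controllerName]
}

func main() {
	custom := "kueue.x-k8s.io/multikueue"
	counter := newExternalJobCounter()

	counter.OnAdd(Job{Name: "plain"})                       // no managedBy: ignored
	counter.OnAdd(Job{Name: "external", ManagedBy: &custom}) // custom value: counted
	counter.OnUpdate(Job{Name: "external", ManagedBy: &custom},
		Job{Name: "external", ManagedBy: &custom}) // update: ignored

	fmt.Println(custom, counter.Value(custom))
}
```

Counting only on ADDED events keeps the value equal to the number of externally managed Jobs created, which is what lets an admin compare it against the number of Jobs they expect to have delegated.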
@@ -1080,7 +1079,7 @@ N/A.

 ## Implementation History

-- 2023-12.20 - First version of the KEP
+- 2023-12-20 - First version of the KEP
 - 2024-03-05 - Merged implementation PR [Support for the Job managedBy field (alpha)](https://github.com/kubernetes/kubernetes/pull/123273)
 - 2024-03-07 - Merged [Update Job conformance test for job status updates](https://github.com/kubernetes/kubernetes/pull/123751)
 - 2024-03-08 - Merged [Follow up fix to the job status update test](https://github.com/kubernetes/kubernetes/pull/123815)
@@ -1090,6 +1089,7 @@ N/A.
 - 2024-06-21 - Merged [Update the count of ready pods when deleting pods](https://github.com/kubernetes/kubernetes/pull/125546)
 - 2024-07-12 - Merged [Delay setting terminal Job conditions until all pods are terminal](https://github.com/kubernetes/kubernetes/pull/125510)
 - 2024-07-30 - Merged [Update the docs for JobManagedBy and JobPodReplacementPolicy related to pod termination](https://github.com/kubernetes/website/pull/46808)
+- 2024-10-17 - Merged [Graduate JobManagedBy to Beta in 1.32](https://github.com/kubernetes/kubernetes/pull/127402)

 <!--
 Major milestones in the lifecycle of a KEP should be tracked in this section.
@@ -1369,7 +1369,7 @@ It makes them not that useful to debug situations when the Job didn't make
 progress for long time. So, they would not give a reliable signal for debugging
 based on playbooks.

-Renewing the even on every Job update seems excessive from the performance
+Renewing the event on every Job update seems excessive from the performance
 perspective.

 ## Infrastructure Needed (Optional)

keps/sig-apps/4368-support-managed-by-for-batch-jobs/kep.yaml

Lines changed: 5 additions & 4 deletions
@@ -3,7 +3,7 @@ kep-number: 4368
 authors:
 - "@mimowo"
 owning-sig: sig-apps
-status: implementable
+status: implemented
 creation-date: 2023-12-20
 reviewers:
 - "@alculquicondor"
@@ -18,17 +18,18 @@ see-also:
 - "https://github.com/kubernetes/enhancements/pull/4073" # closed PR

 # The target maturity stage in the current dev cycle for this KEP.
-stage: beta
+stage: stable

 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.32"
+latest-milestone: "v1.35"

 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:
   alpha: "v1.30"
   beta: "v1.32"
+  stable: "v1.35"

 # The following PRR answers are required at alpha release
 # List the feature gate name and the components for which it must be enabled
@@ -40,4 +41,4 @@ feature-gates:
 disable-supported: true

 metrics:
-- job_by_external_controller_count
+- jobs_by_external_controller_total
