keps/sig-apps/4368-support-managed-by-for-batch-jobs/README.md
Lines changed: 18 additions & 18 deletions
@@ -140,7 +140,7 @@ controller, and delegate the status synchronization to the Kueue controller.
140
140
### Non-Goals
141
141
142
142
- passing custom parameters to the external controller
143
-
- Introduce a new concurrency policy for CronJobs (eg. `ForbidActive` or `SoftForbid`)
143
+
- Introduce a new concurrency policy for CronJobs (e.g. `ForbidActive` or `SoftForbid`)
144
144
to replace a Job that is about to complete, but still has terminating pods.
145
145
146
146
## Proposal
@@ -278,7 +278,7 @@ a typo in the field value, or the cluster administrator may not install the
278
278
custom controller (like MultiKueue) on that cluster.
279
279
280
280
In such cases the user may not observe any progress by the job for a long time
281
-
and my need to debug the Job.
281
+
and may need to debug the Job.
282
282
283
283
In order to allow for debugging of situations like this the Job controller will
284
284
put a log line indicating the synchronization is delegated to another controller
@@ -352,9 +352,6 @@ type JobSpec struct {
352
352
// all characters before the first "/" must be a valid subdomain as defined
353
353
// by RFC 1123. All characters trailing the first "/" must be valid HTTP Path
354
354
// characters as defined by RFC 3986. The value cannot exceed 64 characters.
355
-
//
356
-
// This field is alpha-level. The job controller accepts setting the field
357
-
// when the feature gate JobManagedBy is enabled (disabled by default).
358
355
// +optional
359
356
ManagedBy *string
360
357
}
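The validation rules in the `ManagedBy` field comment above can be sketched with the Go standard library alone. This is a minimal illustration, not the validation code used by Kubernetes: the function name `validateManagedBy`, the two regular expressions, and the example values are hypothetical, and the sketch assumes that a `/` separating a non-empty prefix from a non-empty path is required.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// maxManagedByLength is the 64-character limit stated in the field comment.
const maxManagedByLength = 64

// dns1123Subdomain approximates an RFC 1123 subdomain: dot-separated labels of
// lowercase alphanumerics and '-', each starting and ending with an alphanumeric.
var dns1123Subdomain = regexp.MustCompile(
	`^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$`)

// httpPath approximates RFC 3986 path characters: unreserved, sub-delims,
// ':', '@', '/', and percent-encoded octets.
var httpPath = regexp.MustCompile(
	`^(?:[A-Za-z0-9._~!$&'()*+,;=:@/-]|%[0-9A-Fa-f]{2})+$`)

// validateManagedBy is a hypothetical helper that checks a managedBy value
// against the three rules from the field comment.
func validateManagedBy(value string) error {
	if len(value) > maxManagedByLength {
		return fmt.Errorf("must not exceed %d characters, got %d",
			maxManagedByLength, len(value))
	}
	// Assumption: the value must contain "/" with non-empty parts on both sides.
	prefix, path, found := strings.Cut(value, "/")
	if !found || prefix == "" || path == "" {
		return fmt.Errorf("%q must have the form <subdomain>/<path>", value)
	}
	if !dns1123Subdomain.MatchString(prefix) {
		return fmt.Errorf("prefix %q is not a valid RFC 1123 subdomain", prefix)
	}
	if !httpPath.MatchString(path) {
		return fmt.Errorf("path %q contains invalid path characters", path)
	}
	return nil
}

func main() {
	// Example values are illustrative only.
	for _, v := range []string{
		"kueue.x-k8s.io/multikueue",
		"Not_A_Domain/foo",
		"example.com/",
	} {
		fmt.Printf("%q -> %v\n", v, validateManagedBy(v))
	}
}
```

A typo in the value, as discussed earlier in the KEP, would usually still pass this kind of syntactic check, which is why the KEP also relies on logs and metrics for debugging.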
@@ -493,9 +490,9 @@ The following scenarios are covered:
493
490
- it does not add the Suspended condition for a Job with `.spec.suspend=true` ([link](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L2059)).
494
491
- the Job controller reconciles jobs with custom "managedBy" field when the feature gate is disabled ([link](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L2030))
495
492
- the Job controller handles correctly re-enablement of the feature gate [link](https://github.com/kubernetes/kubernetes/blob/169a952720ebd75fcbcb4f3f5cc64e82fdd3ec45/test/integration/job/job_test.go#L1691)
496
-
- the `job_by_external_controller_total` metric is incremented when a new Job with custom "managedBy" is created ([link](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L2044-L2058))
497
-
- the `job_by_external_controller_total` metric is not incremented for a new Job without "managedBy" or with default value ([link](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L2000-L2029))
498
-
- the `job_by_external_controller_total` metric is not incremented for Job updates (regardless of the "managedBy") (tested indirectly as [here](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L2000-L2029) the Job controller updates the Job status)
493
+
- the `jobs_by_external_controller_total` metric is incremented when a new Job with custom "managedBy" is created ([link](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L2044-L2058))
494
+
- the `jobs_by_external_controller_total` metric is not incremented for a new Job without "managedBy" or with default value ([link](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L2000-L2029))
495
+
- the `jobs_by_external_controller_total` metric is not incremented for Job updates (regardless of the "managedBy") (tested indirectly as [here](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L2000-L2029) the Job controller updates the Job status)
499
496
500
497
The following scenarios related to [Terminating pods and terminal Job conditions](#terminating-pods-and-terminal-job-conditions) are covered:
501
498
- `Failed` or `Complete` conditions are not added while there are still terminating pods ([link](https://github.com/kubernetes/kubernetes/blob/856475e5fffe3d99c71606d6024f5ed93e37eebc/test/integration/job/job_test.go#L1183))
@@ -523,7 +520,7 @@ We propose a single e2e test for the following scenario:
523
520
API fields affected by the new validation rules
524
521
- make CronJob more resilient by checking the Job condition is `Complete` when using `CompletionTime` (see [here](#custom-controllers-not-compatible-with-api-assumptions-by-cronjob))
525
522
- The feature flag disabled by default
526
-
- implement the `job_by_external_controller_total` metric
523
+
- implement the `jobs_by_external_controller_total` metric
527
524
528
525
Second Alpha (1.31):
529
526
- preparatory fix to address all known inconsistencies between validation and the
@@ -548,7 +545,7 @@ Second Alpha (1.31):
548
545
- Address reviews and bug reports from Beta users
549
546
- Re-evaluate the ideas of improving debuggability (like [extended `kubectl`](#debuggability), [dedicated condition](#condition-to-indicated-job-is-skipped), or [events](#event-indicating-the-job-is-skipped))
550
547
- Re-evaluate the need to skip reconciliation in the event handlers to optimize performance
551
-
- Assess the fragmentation of the ecosystem. Look for other implementations of a job controller and asses their conformance with k8s.
548
+
- Assess the fragmentation of the ecosystem. Look for other implementations of a job controller and assess their conformance with k8s.
552
549
- Lock the feature gate
553
550
554
551
#### Deprecation
@@ -752,7 +749,7 @@ being flipped between two owners.
752
749
The feature is opt-in so in case of such problems the custom "managedBy" field
753
750
should not be used.
754
751
755
-
Also, an admin could check if the value of the `job_by_external_controller_total`
752
+
Also, an admin could check if the value of the `jobs_by_external_controller_total`
756
753
matches the expectations. For example, if the value of the metric does not increase
757
754
when new jobs are being added with a custom "managedBy" field, it might be
758
755
indicative that the feature is not working correctly.
@@ -831,7 +828,7 @@ checking if there are objects with field X set) may be a last resort. Avoid
831
828
logs or events for this purpose.
832
829
-->
833
830
834
-
Check the `job_by_external_controller_total` metric. If the value is non-zero
831
+
Check the `jobs_by_external_controller_total` metric. If the value is non-zero
835
832
for a field, it means there were Jobs using the custom controller created, so
836
833
the feature is in use.
837
834
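The check described above (a non-zero `jobs_by_external_controller_total` means the feature is in use) could be scripted against a metrics scrape in Prometheus text format. The sketch below is illustrative only: the helper names `externalJobCounts` and `featureInUse` and the sample scrape are hypothetical, and the metric name is taken as written in this KEP, so a name exposed with an additional prefix would need the pattern adjusted.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// sampleLine matches one sample of the metric and captures the value of the
// controller_name label and the sample value.
var sampleLine = regexp.MustCompile(
	`^jobs_by_external_controller_total\{[^}]*controller_name="([^"]*)"[^}]*\}\s+(\S+)`)

// externalJobCounts sums the metric per controller_name found in the given
// text-format scrape. Lines that do not match (comments, other metrics) are skipped.
func externalJobCounts(metricsText string) map[string]float64 {
	counts := make(map[string]float64)
	scanner := bufio.NewScanner(strings.NewReader(metricsText))
	for scanner.Scan() {
		m := sampleLine.FindStringSubmatch(strings.TrimSpace(scanner.Text()))
		if m == nil {
			continue
		}
		v, err := strconv.ParseFloat(m[2], 64)
		if err != nil {
			continue
		}
		counts[m[1]] += v
	}
	return counts
}

// featureInUse reports whether any controller has a non-zero count.
func featureInUse(counts map[string]float64) bool {
	for _, v := range counts {
		if v > 0 {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical scrape; the controller name is an example value.
	sample := "# TYPE jobs_by_external_controller_total counter\n" +
		"jobs_by_external_controller_total{controller_name=\"example.com/custom\"} 3\n"
	counts := externalJobCounts(sample)
	fmt.Println(counts, featureInUse(counts))
}
```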
@@ -885,22 +882,24 @@ Pick one more of these and delete the rest.
885
882
886
883
- [x] Metrics
887
884
- Metric name:
888
-
- `job_by_external_controller_total` (new), with the `controller_name` label
885
+
- `jobs_by_external_controller_total` (new), with the `controller_name` label
889
886
corresponding to the custom value of the "managedBy" field. The metric is
890
887
incremented by the built-in Job controller on each ADDED Job event,
891
888
corresponding to a Job with custom value of the "managedBy" field.
892
889
This metric can be helpful to determine the health of a job and its controller
893
890
in combination with already existing metrics (see below).
891
+
- Components exposing the metric: kube-controller-manager
895
-
substantial increase of this metric, when additionally `job_by_external_controller_total>0`
893
+
substantial increase of this metric, when additionally `jobs_by_external_controller_total>0`
896
894
may be indicative of two controllers stepping onto each-other causing
897
895
conflicts (see [here](#two-controllers-running-at-the-same-time-on-old-version)).
896
+
- Components exposing the metric: kube-apiserver
898
897
- `kube_cronjob_status_active` (existing), substantial increase of this
899
898
metric, may suggest that there are accumulating non-progressing jobs controlled
900
-
by `CronJob`. If additionally `job_by_external_controller_total>0` it may suggest
899
+
by `CronJob`. If additionally `jobs_by_external_controller_total>0` it may suggest
901
900
that the Jobs are getting stuck due to not being synchronized by the custom
902
901
controller.
903
-
- Components exposing the metric: kube-apiserver
902
+
- Components exposing the metric: kube-apiserver
904
903
905
904
###### Are there any missing metrics that would be useful to have to improve observability of this feature?
906
905
@@ -1080,7 +1079,7 @@ N/A.
1080
1079
1081
1080
## Implementation History
1082
1081
1083
-
- 2023-12.20 - First version of the KEP
1082
+
- 2023-12-20 - First version of the KEP
1084
1083
- 2024-03-05 - Merged implementation PR [Support for the Job managedBy field (alpha)](https://github.com/kubernetes/kubernetes/pull/123273)
1085
1084
- 2024-03-07 - Merged [Update Job conformance test for job status updates](https://github.com/kubernetes/kubernetes/pull/123751)
1086
1085
- 2024-03-08 - Merged [Follow up fix to the job status update test](https://github.com/kubernetes/kubernetes/pull/123815)
@@ -1090,6 +1089,7 @@ N/A.
1090
1089
- 2024-06-21 - Merged [Update the count of ready pods when deleting pods](https://github.com/kubernetes/kubernetes/pull/125546)
1091
1090
- 2024-07-12 - Merged [Delay setting terminal Job conditions until all pods are terminal](https://github.com/kubernetes/kubernetes/pull/125510)
1092
1091
- 2024-07-30 - Merged [Update the docs for JobManagedBy and JobPodReplacementPolicy related to pod termination](https://github.com/kubernetes/website/pull/46808)
1092
+
- 2024-10-17 - Merged [Graduate JobManagedBy to Beta in 1.32](https://github.com/kubernetes/kubernetes/pull/127402)
1093
1093
1094
1094
<!--
1095
1095
Major milestones in the lifecycle of a KEP should be tracked in this section.
@@ -1369,7 +1369,7 @@ It makes them not that useful to debug situations when the Job didn't make
1369
1369
progress for long time. So, they would not give a reliable signal for debugging
1370
1370
based on playbooks.
1371
1371
1372
-
Renewing the even on every Job update seems excessive from the performance
1372
+
Renewing the event on every Job update seems excessive from the performance