diff --git a/keps/sig-node/4188-kubelet-pod-readiness-api/README.md b/keps/sig-node/4188-kubelet-pod-readiness-api/README.md index 943143c2e41..b79af834e2f 100644 --- a/keps/sig-node/4188-kubelet-pod-readiness-api/README.md +++ b/keps/sig-node/4188-kubelet-pod-readiness-api/README.md @@ -1,4 +1,4 @@ -# KEP-4188: New Kubelet gRPC API with endpoint returning local Pods readiness information +# KEP-4188: New Kubelet gRPC API with endpoint returning local Pods information - [Release Signoff Checklist](#release-signoff-checklist) @@ -6,16 +6,20 @@ - [Motivation](#motivation) - [Goals](#goals) - [Non-Goals](#non-goals) - - [User Stories](#user-stories) - [Proposal](#proposal) - [What kind of API to chose?](#what-kind-of-api-to-chose) - [Can we integrate with PodResource API?](#can-we-integrate-with-podresource-api) + - [User Stories](#user-stories) - [Risks and Mitigations](#risks-and-mitigations) - [Control Plane availability issue](#control-plane-availability-issue) - [Kubelet restarts issue](#kubelet-restarts-issue) - [Design Details](#design-details) + - [Pod State Selection](#pod-state-selection) - [Proposed API](#proposed-api) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) - [e2e tests](#e2e-tests) - [Graduation Criteria](#graduation-criteria) - [Alpha](#alpha) @@ -66,76 +70,87 @@ Items marked with (R) are required *prior to targeting to a milestone / release* ## Summary -Proposal to add a new Kubelet gRPC API with endpoint returning local Pods readiness information. -Serving that information by Kubelet within a Node will increase reliability and reduce load to the Kubernetes API Server and traffic outside the node. A connectivity issue between Node and Control Plane should not impact workloads which depend on Pods readiness statuses. +Proposal to add a new Kubelet gRPC API with endpoint returning local Pods information, +including the full PodSpec and status. +Serving that information by Kubelet within a Node will increase reliability and reduce load to +the Kubernetes API Server and traffic outside the node. A connectivity issue between Node and +Control Plane should not impact workloads which depend on Pods state information. +Because the full PodSpec may contain sensitive information, access to this API will be secured by +restricting the UNIX socket to local admin users only. ## Motivation Kubelet is responsible for running Health Checks (probes) and communicating the results via Pod status. All that information is stored in cache and reported to -Kube-API. Right now pod's readiness information is tightly coupled with the Kubernetes API +Kube-API. Right now pod's state information is tightly coupled with the Kubernetes API Server. When a workload wants to know the actual state of Pods running on the Node, it needs to fetch it from Kube-API. This causes some issues: * Reliability - for various reasons Kube-API might not be available (connectivity issue, control plane updates) * Scalability - adding new watchers to kube-API is a scalability concern. By - exposing the endpoint that will serve the Pods readiness status directly from the Kubelet - cache we can use it on node workloads and avoid additional dependencies on the + exposing the endpoint that will serve the Pods state directly from the Kubelet + cache we can use it on node workloads and avoid additional dependencies on the Kubernetes API Server. +* Flexibility - consumers may need more than just readiness, such as phase, IPs, + resource usage, or labels/annotations, or the full pod spec. Supporting field selection + enables lean, efficient queries, but the API can also provide the full pod object when needed. + Because the full pod spec may contain secrets or other sensitive data, access must be tightly controlled. | Impact | Description| | ------- | ------------ | -| + | Reliability - for various reasons kube-API might not be available but this doesn’t mean that local workloads are not accessible and on node system workloads should have the most recent data about pod's readiness even when kube-API is unreachable. | -| + | Scalability - Reduce the load on kube-API by reducing the number of watchers and using Kubelet to fetch local Pods readiness. Fetching only Pods limited to one node is costly operation for kube-API. | +| + | Reliability - for various reasons kube-API might not be available but this doesn’t mean that local workloads are not accessible and on node system workloads should have the most recent data about pod's state even when kube-API is unreachable. | +| + | Scalability - Reduce the load on kube-API by reducing the number of watchers and using Kubelet to fetch local Pods information. Fetching only Pods limited to one node is costly operation for kube-API. | | + | Safety - Read-only API will not add security risks. | -| + | Reduce resource consumption by workload. Using Kube-API we can fetch objects like PodSpec or PodStatus, for some on-node workload this might be unnecessary, with this API workload can reduce the resource consumption. | +| + | Reduce resource consumption by workload. Using Kube-API we can fetch objects like PodSpec or PodStatus, for some on-node workload this might be unnecessary, with this API workload can reduce the resource consumption by requesting only the fields they need. The API can also provide the full pod spec for advanced use cases, but this is restricted to authorized local users. | | - | This API will add load to Kubelet (mitigation: API will be rate limited) | -| - | Kubelet does not support RBAC authorization for gRPC. (mitigation: This API is designed to be accessible for all workloads running on the node without the authorization. The unix socket will be used for the connection and all exposed data will be carefully reviewed.) | -| - | Limiting the scope of the API to the readiness information because we are not introducing the RBAC for this API. | +| - | Kubelet does not support RBAC authorization for gRPC. (mitigation: This API is designed to be accessible only to local admin users. The unix socket will be secured with file permissions to restrict access to privileged users, and all exposed data will be carefully reviewed.) | ### Goals -The goal of this API is to expose Pod readiness information directly from the +- Expand Kubelet API Scope: Provide a general pod information API, equivalent to the apiserver. +- Support an optional fieldmask-based filtering for lean responses. +- Maintain strict local-only, read-only access via UNIX socket and file permissions. + +The goal of this API is to expose comprehensive Pod information, including the full PodSpec and status, directly from the source - Kubelet, independent of Control Plane availability. This would remove the need for node-local components to request this node-local information from the Kubernetes API Server. +The API should allow consumers to request only the fields they need, using protobuf fieldmasks, to enable efficient +and lean data transfer. For advanced use cases, the full pod spec can be returned, but only to authorized local admin +users due to the potential for sensitive information. + Kubelet already has a podresources endpoint ([2403-pod-resources-allocatable-resources](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2403-pod-resources-allocatable-resources)) which returns information about Pod’s containers and Devices. This API does not -contain information about pod readiness status. +contain information about pod state. Kubelet is responsible for computing Pod status and stores it in a local cache. -We want to create a new gRPC API that will expose pod conditions that are computed by +We want to create a new gRPC API that will expose pod information that is computed by Kubelet and return the most recent data even when kube-API is not reachable. This API is open for future modification if needed but exposed data via this API should be limited -to pod's readiness information. +to pod's information relevant for node-local consumers. ### Non-Goals -Exposing pods detailed data that are not related to pod's readiness. - -### User Stories -* Some on node system workloads want to reduce Control Plane dependency and - introduce locality for Pod’s readiness to improve reliability and scalability. -* Custom monitoring tools may want to have local visibility into the readiness - of Pods running on the same Node. -* Some on node system workloads interested in Pod readiness want to reduce - resource consumption. +- Providing write access to pod information via this API. +- Exposing cluster-wide information beyond the scope of specific, node-local pods. ## Proposal -We are proposing to create a new Kubelet API that will return pod's readiness information. +We are proposing to create a new Kubelet API that will return pod information, including the full PodSpec and status, not just readiness. * The API will return data about both Static and Regular Pods. * The API will not return partial data. If Kubelet does not know actual information about workloads then gRPC FAILED_PRECONDITION (9) error code will be returned. * The API should return the most recent information about Pods computed by the Kubelet even when those data were not reported or accepted by kube-API. -* The API will be read-only and accessible for on-node workloads (we will use a - unix socket for the connection) with authorization limited to unix standard permissions. +* The API will be read-only and accessible for on-node workloads via a + unix socket, with access restricted to local admin users (e.g., root or a specific admin group) using file permissions. * The API will be versioned. +* The API will support protobuf fieldmasks to allow clients to request only the fields they need from the pod information, +* but can also return the full pod spec for authorized users. ### What kind of API to chose? @@ -171,18 +186,30 @@ The PodResource API includes an entirely unrelated set of information that is unlikely to be of use to the set of clients that would benefit from understanding Pod readiness. We propose creating a new API for this purpose. +### User Stories +* Some on-node system workloads want to reduce Control Plane dependency and + introduce locality for Pod’s state to improve reliability and scalability. + * checking Pod readiness for GKE CNI plugins to correctly program networking rules. + * checking system Pods health for [node problem detector](https://github.com/kubernetes/node-problem-detector/issues/1112). +* Custom monitoring tools may want to have local information on Pods running on + the same Node, faster and more reliable than fetching data from kube-API. + * [Inspektor Gadget](https://github.com/inspektor-gadget/inspektor-gadget) and Kubescape's [node-agent](https://github.com/kubescape/node-agent) are using a Pod informer and some delays + may be introduced between Pod state change and informer cache update. +* Some on-node system workloads interested in Pod state want to reduce + resource consumption by requesting only the fields they need. + ### Risks and Mitigations This API is read-only, which removes a large class of risks. -| Risk | Impact | Mitigation | -| --------------------------------------------------------- | ------------- | ---------- | -| Too many requests to the API impacting the Kubelet performances | High | Rate limiting the API. | -| Misuse of the API | High | This API is Read-only. We will expose only a small portion of the pod's information related to the pod readiness. Exposed data does not contain sensitive information that could be used in a malicious way. | -| Kubelet restart [issue](https://github.com/kubernetes/kubernetes/issues/100277) | High | This API should serve only complete information about workloads readiness. If Kubelet is in the init phase and not all pod’s readiness information is known, then the API should report the error. | -| Unauthorized access to the API | Medium | This API is designed to be accessed by all on-node workloads. Authorization will be provided by unix standard permissions to the socket file. | -| Exposing the API to all workloads on the node | Medium | Exposed data via the API is limited to readiness information only. | -| Kube-API is down or unreachable | Low | Kube-API availability should not impact this API. When the control plane is down or unreachable but Kubelet is working properly this API should return most recent data about local Pods readiness that are computed by Kubelet even if those data were not reported or accepted by Kube-API. | +| Risk | Impact | Mitigation | +| --------------------------------------------------------- | ------------- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| Too many requests to the API impacting the Kubelet performances | High | Rate limiting the API. | +| Misuse of the API | High | This API is Read-only. We will expose only a subset of the pod's information relevant for node-local consumers. Exposed data does not contain sensitive information that could be used in a malicious way. | +| Kubelet restart [issue](https://github.com/kubernetes/kubernetes/issues/100277) | High | This API should serve only complete information about workloads. If Kubelet is in the init phase and not all pod’s information is known, then the API should report an error. | +| Unauthorized access to the API | Medium | This API is designed to be accessed by all on-node workloads. Authorization will be provided by unix standard permissions to the socket file. | +| Exposing the API to all workloads on the node | Medium | The unix socket will be secured with file permissions to restrict access to local admin users only. Exposed data via the API may contain sensitive information (e.g., environment variables, secrets references) present in the PodSpec, so access is tightly controlled. | +| Kube-API is down or unreachable | Low | Kube-API availability should not impact this API. When the control plane is down or unreachable but Kubelet is working properly this API should return most recent data about local Pods that are computed by Kubelet even if those data were not reported or accepted by Kube-API. | #### Control Plane availability issue @@ -217,75 +244,94 @@ the Pods. We don't want this API to return partial data. ## Design Details +### Pod State Selection + +Kubelet maintains multiple representations of pod state: +- The state derived from the Container Runtime Interface (CRI), reflecting the actual status of containers on the node. +- The state that is sent to the API server, which may be subject to additional processing or delays. +- The state received from the API server, reflecting the control plane's view. + +It is important to define which state this API will serve. For the purposes of this API, the intent is to serve the most +up-to-date and accurate pod state as known locally by the Kubelet, typically the state based on the CRI and Kubelet's internal +reconciliation, rather than the potentially stale state received from the API server. This ensures consumers receive the +freshest possible information about pods running on the node. + ### Proposed API -We propose to add new gPRC API `status` in Kubelet, listening on a unix socket -at `/var/lib/Kubelet/status/Kubelet.sock`. The endpoint will be versioned. The -gRPC Service will expose 3 methods serving local Pods statuses data: +We propose to add new gRPC API `pods` in Kubelet, listening on a unix socket +at `/var/lib/kubelet/pods/kubelet.sock`. The endpoint will not be versioned. The +gRPC Service will expose 3 methods serving local Pods data: ```protobuf -service PodStatus { - // ListPodStatus returns a of List of PodStatus - rpc ListPodStatus(PodStatusListRequest) returns (PodStatusListResponse) {} - // WatchPodStatus returns a stream of List of PodStatus - // Whenever a pod state change api returns the new list - rpc WatchPodStatus(PodStatusWatchRequest) returns (stream PodStatusWatchResponse) {} - // GetPodStatus returns a PodStatus for given pod's UID - rpc GetPodStatus(PodStatusGetRequest) returns (PodStatusGetResponse) {} +import "google/protobuf/field_mask.proto"; +import "k8s.io/api/core/v1/generated.proto"; + +service Pods { + // ListPods returns a list of v1.Pod, filtered by field mask. + rpc ListPodStatus(PodListRequest) returns (PodListResponse) {} + // WatchPods returns a stream of list of PodInfo, filtered by field mask. + // Whenever a pod state changes, api returns the new list. + rpc WatchPods(PodWatchRequest) returns (stream PodWatchResponse) {} + // GetPod returns a PodInfo for given pod's UID, filtered by field mask. + rpc GetPod(PodGetRequest) returns (PodGetResponse) {} } -// PodCondition aligns with v1.PodCondition. -message PodCondition { - PodConditionType Type = 1; - ConditionStatus Status = 2; - Timestamp LastProbeTime = 3; - Timestamp LastTransitionTime = 4; - string Reason = 5; - string Message = 6; +// PodListRequest allows specifying a field mask. +message PodListRequest { + // Optional field mask in the gRPC metadata. } -// PodConditionType aligns with v1.PodConditionType -enum PodConditionType { - ContainersReady = 0; - Initialized = 1; - Ready = 2; - PodScheduled = 3; - DisruptionTarget = 4; +// PodListResponse returns a list of v1.Pod. +message PodListResponse { + repeated v1.Pod pods = 1; } -// ConditionStatus aligns with v1.ConditionStatus -enum ConditionStatus { - True = 0; - False = 1; - Unknown = 2; +// PodWatchRequest allows specifying a field mask. +message PodWatchRequest { + // Optional field mask in the gRPC metadata. } -// PodStatus returns a Pod details and list of status Conditions with deletion info. -message PodStatus { - string podUID = 1; - string podNamespace = 2; - string podName = 3; - bool static = 4; - repeated PodCondition conditions = 5; - Timestamp DeletionTimestamp = 3; +// PodWatchResponse returns a v1.Pod, as a stream. +message PodWatchResponse { + v1.Pod pods = 1; } -// PodStatusResponse returns a stream of List of PodStatus. -// Whenever a Pod state changes it will return the new list. -message PodStatusListResponse { - // PodStatus includes the Readiness information of Pods. - // In the future it may be extended to include additional information. - repeated PodStatus Pods = 1; +// PodGetRequest contains Pod UID and optional field mask. +message PodGetRequest { + string podUID = 1; + // Optional field mask in the gRPC metadata. } -// PodStatusGetRequest contains Pods UID -message PodStatusGetRequest { - string podUID = 1; +// PodGetResponse returns a v1.Pod. +message PodGetResponse { + v1.Pod pod = 1; } + +// ...other request/response messages as needed... ``` +The v1.Pod message is the latest core Pod API based on the version of Kubelet, from the file `k8s.io/api/core/v1/generated.proto`. +Which means if there is a version skew between the API server and Kubelet, they may be serving different versions of the Pod API. + +The use of `google.protobuf.FieldMask` allows clients to specify which fields of the v1.Pod message they are interested in, +enabling lean and efficient responses. Only authorized local admin users can use this API due to the potential sensitivity of the full PodSpec. + +### Test Plan + +##### Prerequisite testing updates + ##### Unit tests +- Test parsing and validation of protobuf FieldMasks. +- Test filtering of Pod objects based on requested FieldMasks. + +##### Integration tests + +- End-to-end tests to verify that the API works as expected with different FieldMask combinations and client requests. +- Test API work when Kubelet restarts. +- Test API work when Control Plane is unreachable. +- Test API to work with kubelet Standalone mode. +- Scalability tests to ensure the API performs well under load. ##### e2e tests @@ -303,6 +349,7 @@ message PodStatusGetRequest { - [ ] sig-auth input on proposed lack of authorization for this API. #### Beta + - [ ] Fix Kubelet restart [issue](https://github.com/kubernetes/kubernetes/issues/100277) #### GA @@ -330,11 +377,11 @@ N/A ###### Does enabling the feature change any default behavior? -No. +No, but the API will only be accessible to local admin users due to the sensitive nature of the full PodSpec. ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? -Yes, through feature gates. +Yes, through feature gates and by restricting/removing access to the UNIX socket. ###### What happens if we reenable the feature if it was previously rolled back? @@ -434,7 +481,7 @@ N/A. ###### What are other known failure modes? -The Kubelet might be in init phase when client call the API. The API should return well-known error message. +The Kubelet might be in init phase when client call the API. The API should return well-known error message. Unauthorized access attempts will be prevented by UNIX socket file permissions. ###### What steps should be taken if SLOs are not being met to determine the problem? @@ -443,6 +490,7 @@ The API should be disabled using the feature gate. ## Implementation History - 2023-09-05: KEP created +- 2025-09-30: KEP updated with new API and goals ## Drawbacks @@ -450,4 +498,4 @@ The API should be disabled using the feature gate. ## Future work -This API is open to future extension but added information should be limited to pod's readiness information. +This API is open to future extension but added information should be limited to pod's information relevant for node-local consumers. The use of field masks allows for future extensibility while maintaining efficient message sizes. The security model may be revisited if finer-grained access control is needed in the future. diff --git a/keps/sig-node/4188-kubelet-pod-readiness-api/kep.yaml b/keps/sig-node/4188-kubelet-pod-readiness-api/kep.yaml index 93e6d05829f..dd307f740b3 100644 --- a/keps/sig-node/4188-kubelet-pod-readiness-api/kep.yaml +++ b/keps/sig-node/4188-kubelet-pod-readiness-api/kep.yaml @@ -13,7 +13,7 @@ reviewers: approvers: - "@dchen1107" - "@mrunalp" -see-also: +see-also: [] replaces: [] # The target maturity stage in the current dev cycle for this KEP. @@ -22,13 +22,13 @@ stage: alpha # The most recent milestone for which work toward delivery of this KEP has been # done. This can be the current (upcoming) milestone, if it is being actively # worked on. -latest-milestone: "v1.29" +latest-milestone: "v1.35" # The milestone at which this feature was, or is targeted to be, at each stage. milestone: - alpha: "v1.29" - beta: "v1.30" - stable: "v1.31" + alpha: "v1.35" + beta: "v1.36" + stable: "v1.37" # The following PRR answers are required at alpha release # List the feature gate name and the components for which it must be enabled