
KEP-5237: watch-based route controller reconciliation using informers #5289


Open
wants to merge 6 commits into master

Conversation

lukasmetzner

  • One-line PR description: Introduce a feature gate to enable informer-based reconciliation in the route controller of the cloud-controller-manager, reducing API calls and improving efficiency.
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels May 8, 2025
@k8s-ci-robot
Contributor

Welcome @lukasmetzner!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. label May 8, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lukasmetzner
Once this PR has been reviewed and has the lgtm label, please assign cheftako for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Hi @lukasmetzner. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 8, 2025
@lukasmetzner
Author

/cc @elmiko @JoelSpeed

@k8s-ci-robot k8s-ci-robot requested review from elmiko and JoelSpeed May 8, 2025 08:10
@apricote
Member

apricote commented May 8, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 8, 2025
Contributor

@elmiko elmiko left a comment


the core concepts make sense to me, i think we should clean up the language around the term "cloud provider". in some places we use it to mean the controllers (ie the ccm), and other places we use it to mean the infrastructure provider (eg aws, azure, gcp), and we also refer to the framework as well.

we also need to make some decisions about the open questions. i wonder if we should go over these questions at the next sig meeting?


#### Story 3

As a cluster operator I need to use the API rate limits from my Cloud Provider effectively. Sending frequent API requests even though nothing changed causes me to deplete the API rate limits faster.
Contributor


if we aren't going to capitalize "Cloud Provider" in other places, we shouldn't capitalize here.

alternatively, if we want to distinguish the "cloud provider" (eg aws, gcp, etc), then i think we should use a less overloaded term, like "infrastructure provider".

Author


I am fine settling for the term infrastructure provider here.


We currently use the `Node.Status.Addresses` and `PodCIDRs` fields to trigger updates in the route reconciliation mechanism. However, relying solely on these fields may be insufficient, potentially causing missed route reconciliations when updates are necessary. This depends on the specific cloud-provider implementations. Using these fields works for the CCM maintained by the authors, but we do not know the details of other providers.

This is mitigated by a feature gate, which allows other cloud providers to test it and provide feedback on the fields.
Contributor


which "cloud providers" are we talking about here: the cloud-controller-manager, the infrastructure provider, or the maintainers of other ccm projects?

Author


I think both infrastructure providers and maintainers of other ccm projects are welcome to test this feature and provide feedback on the fields that were chosen to trigger a reconcile.
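
To make that concrete, here is a minimal sketch (not code from the KEP or any CCM) of how an update filter on `Node.Status.Addresses` and `Spec.PodCIDRs` could be wired into a Node informer; the helper and work-queue names are made up for this example.

```go
package routecontroller

import (
	"reflect"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// nodeRouteFieldsChanged reports whether the route-relevant fields
// (Status.Addresses and Spec.PodCIDRs) differ between two Node objects.
func nodeRouteFieldsChanged(oldNode, newNode *v1.Node) bool {
	return !reflect.DeepEqual(oldNode.Status.Addresses, newNode.Status.Addresses) ||
		!reflect.DeepEqual(oldNode.Spec.PodCIDRs, newNode.Spec.PodCIDRs)
}

// registerUpdateHandler wires the filter into a Node informer. enqueue stands
// in for whatever work-queue mechanism the route controller would use.
func registerUpdateHandler(nodeInformer cache.SharedIndexInformer, enqueue func(*v1.Node)) {
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			enqueue(obj.(*v1.Node))
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldNode, newNode := oldObj.(*v1.Node), newObj.(*v1.Node)
			// Skip updates that do not touch the fields that affect routes.
			if nodeRouteFieldsChanged(oldNode, newNode) {
				enqueue(newNode)
			}
		},
	})
}
```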


Other controllers rely on “Owner” references to make sure that the resource is only deleted when the controller had the chance to run any cleanup. This is currently not implemented for any controller in [`k8s.io/cloud-provider`](http://k8s.io/cloud-provider). Because of this, Nodes may get deleted without the possibility to process the event in the route controller.

This can cause issues with limits on the number of routes in from the cloud provider, as well as invalid routes being advertised as valid, causing possible networking reliability or confidentiality issues.
Contributor


is this "cloud provider" referring to the cloud-controller-manager?

Author

@lukasmetzner lukasmetzner May 9, 2025


No, this is referring to the infrastructure provider. If an infrastructure provider has a limit on the maximum number of routes in a private network, we can get rid of unused routes. This is at least the case for Hetzner.

I also found a small typo here ("in from the cloud provider"), which I am going to clean up.
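
As a side note, here is a hedged sketch of how Node deletions could at least be observed through the informer's delete handler (tombstones included) so that stale routes can be cleaned up; this does not answer the owner-reference ordering question above, and the names are illustrative only.

```go
package routecontroller

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// registerDeleteHandler enqueues a route cleanup whenever a Node disappears,
// so unused routes do not keep counting against the infrastructure provider's
// route limit. enqueueCleanup stands in for the controller's work queue.
func registerDeleteHandler(nodeInformer cache.SharedIndexInformer, enqueueCleanup func(nodeName string)) {
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			node, ok := obj.(*v1.Node)
			if !ok {
				// If the watch missed the delete event, the object arrives
				// wrapped in a DeletedFinalStateUnknown tombstone.
				tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
				if !ok {
					return
				}
				node, ok = tombstone.Obj.(*v1.Node)
				if !ok {
					return
				}
			}
			enqueueCleanup(node.Name)
		},
	})
}
```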

@lukasmetzner
Author

@elmiko If possible, I’d love to work through the open questions here in the PR so we can keep things moving, and then have a discussion in the next SIG meeting. Otherwise, we might end up losing a couple of weeks. What do you think?

Open Questions:

  • What should be the default frequency for the periodic full reconciliation?
    • Input from @JoelSpeed: We should do it similarly to other controllers and choose a random time between 12h and 24h.
    • As we use the same shared informer factory as in other controllers, this should already be implemented (see the sketch after this list).
  • Are there other Node fields besides node.status.addresses and PodCIDRs that should trigger a route update?
  • How should we set the interval for the periodic reconcile? Options:
    • Adjust --route-reconcile-period when the feature gate is enabled
    • Use --min-resync-period; currently defaults to 12h
    • Introduce a new flag
    • If we use the 12h-24h option we can probably reuse --min-resync-period.
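
For context on the --min-resync-period option, here is a minimal sketch of how the shared informer factory's resync period could drive the periodic full reconcile; the kubeconfig path and client setup are placeholders, not how the CCM actually builds its client.

```go
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path; the CCM builds its client differently.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// The factory's resync period re-delivers every cached Node to the
	// registered handlers, which acts as the periodic full reconcile.
	// The 12h-24h range discussed above comes from jittering
	// --min-resync-period by a random factor between 1x and 2x.
	factory := informers.NewSharedInformerFactory(client, 12*time.Hour)
	nodeInformer := factory.Core().V1().Nodes().Informer()

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)

	_ = nodeInformer // handlers like the sketches above would be registered here
}
```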

@elmiko
Contributor

elmiko commented May 9, 2025

i'm fine to continue the discussions here.

Input from @JoelSpeed: We should do it similar to other controllers and choose a random time between 12h and 24h.

i think 12h sounds fine to me.

Are there other Node fields besides node.status.addresses and PodCIDRs that should trigger a route update?

i'll have to think about this a little more, i have a feeling that those are good to start with.

How should we set the interval for the periodic reconcile?

i like the idea of adjusting the default for the --route-reconcile-period, but i don't want users to get confused about this.

my only issue with using --min-resync-period is that it sounds much more general and we are just focusing on the route controller.
