
KEP-5237: watch-based route controller reconciliation using informers #5289


Open
wants to merge 6 commits into master

Conversation

lukasmetzner

  • One-line PR description: Introduce a feature gate to enable informer-based reconciliation in the route controller of the cloud-controller-manager, reducing API calls and improving efficiency.
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels May 8, 2025
@k8s-ci-robot
Contributor

Welcome @lukasmetzner!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. label May 8, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lukasmetzner
Once this PR has been reviewed and has the lgtm label, please assign cheftako for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Hi @lukasmetzner. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 8, 2025
@lukasmetzner
Author

/cc @elmiko @JoelSpeed

@k8s-ci-robot k8s-ci-robot requested review from elmiko and JoelSpeed May 8, 2025 08:10
@apricote
Member

apricote commented May 8, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 8, 2025
Contributor

@elmiko elmiko left a comment


the core concepts make sense to me, i think we should clean up the language around the term "cloud provider". in some places we use it to mean the controllers (ie the ccm), and other places we use it to mean the infrastructure provider (eg aws, azure, gcp), and we also refer to the framework as well.

we also need to make some decisions about the open questions. i wonder if we should go over these questions at the next sig meeting?


#### Story 3

As a cluster operator I need to use the API rate limits from my Cloud Provider effectively. Sending frequent API requests even though nothing changed causes me to deplete the API rate limits faster.
Contributor


if we aren't going to capitalize "Cloud Provider" in other places, we shouldn't capitalize here.

alternatively, if we want to distinguish the "cloud provider" (eg aws, gcp, etc), then i think we should use a less overloaded term, like "infrastructure provider".

Author


I am fine settling for the term infrastructure provider here.


We currently use the `Node.Status.Addresses` and `PodCIDRs` fields to trigger updates in the route reconciliation mechanism. However, relying solely on these fields may be insufficient, potentially causing missed route reconciliations when updates are necessary. This depends on the specific cloud-provider implementations. Using these fields works for the CCM maintained by the authors, but we do not know the details of other providers.

This is mitigated by a feature gate, which allows other cloud providers to test it and provide feedback on the fields.
Contributor


which "cloud providers" are we talking about here: the cloud-controller-manager, the infrastructure provider, or the maintainers of other ccm projects?

Author


I think both infrastructure providers and maintainers of other ccm projects are welcome to test this feature and provide feedback on the fields that were chosen to trigger a reconcile.
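
To make that concrete, here is a minimal sketch (not code from the KEP or any CCM) of how an update filter on `Node.Status.Addresses` and `Spec.PodCIDRs` could be wired into a Node informer; the helper and work-queue names are made up for this example.

```go
package routecontroller

import (
	"reflect"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// nodeRouteFieldsChanged reports whether the route-relevant fields
// (Status.Addresses and Spec.PodCIDRs) differ between two Node objects.
func nodeRouteFieldsChanged(oldNode, newNode *v1.Node) bool {
	return !reflect.DeepEqual(oldNode.Status.Addresses, newNode.Status.Addresses) ||
		!reflect.DeepEqual(oldNode.Spec.PodCIDRs, newNode.Spec.PodCIDRs)
}

// registerUpdateHandler wires the filter into a Node informer. enqueue stands
// in for whatever work-queue mechanism the route controller would use.
func registerUpdateHandler(nodeInformer cache.SharedIndexInformer, enqueue func(*v1.Node)) {
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			enqueue(obj.(*v1.Node))
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldNode, newNode := oldObj.(*v1.Node), newObj.(*v1.Node)
			// Skip updates that do not touch the fields that affect routes.
			if nodeRouteFieldsChanged(oldNode, newNode) {
				enqueue(newNode)
			}
		},
	})
}
```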


Other controllers rely on “Owner” references to make sure that the resource is only deleted when the controller had the chance to run any cleanup. This is currently not implemented for any controller in [`k8s.io/cloud-provider`](http://k8s.io/cloud-provider). Because of this, Nodes may get deleted without the possibility to process the event in the route controller.

This can cause issues with limits on the number of routes in from the cloud provider, as well as invalid routes being advertised as valid, causing possible networking reliability or confidentiality issues.
Contributor


is this "cloud provider" referring to the cloud-controller-manager?

Author

@lukasmetzner lukasmetzner May 9, 2025


No, this is referring to the infrastructure provider. If an infrastructure provider has a limit on the maximum number of routes in a private network, we can get rid of unused routes. This is at least the case for Hetzner.

I also found a small typo here ("in from the cloud provider"), which I am going to clean up.
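
As a side note, here is a hedged sketch of how Node deletions could at least be observed through the informer's delete handler (tombstones included) so that stale routes can be cleaned up; this does not answer the owner-reference ordering question above, and the names are illustrative only.

```go
package routecontroller

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// registerDeleteHandler enqueues a route cleanup whenever a Node disappears,
// so unused routes do not keep counting against the infrastructure provider's
// route limit. enqueueCleanup stands in for the controller's work queue.
func registerDeleteHandler(nodeInformer cache.SharedIndexInformer, enqueueCleanup func(nodeName string)) {
	nodeInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			node, ok := obj.(*v1.Node)
			if !ok {
				// If the watch missed the delete event, the object arrives
				// wrapped in a DeletedFinalStateUnknown tombstone.
				tombstone, ok := obj.(cache.DeletedFinalStateUnknown)
				if !ok {
					return
				}
				node, ok = tombstone.Obj.(*v1.Node)
				if !ok {
					return
				}
			}
			enqueueCleanup(node.Name)
		},
	})
}
```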

@lukasmetzner
Author

@elmiko If possible, I’d love to work through the open questions here in the PR so we can keep things moving, and then have a discussion in the next SIG meeting. Otherwise, we might end up losing a couple of weeks. What do you think?

Open Questions:

  • What should be the default frequency for the periodic full reconciliation?
    • Input from @JoelSpeed: We should do it similarly to other controllers and choose a random time between 12h and 24h.
    • As we use the same shared informer factory as in other controllers, this should already be implemented (see the sketch after this list).
  • Are there other Node fields besides node.status.addresses and PodCIDRs that should trigger a route update?
  • How should we set the interval for the periodic reconcile? Options:
    • Adjust --route-reconcile-period when the feature gate is enabled
    • Use --min-resync-period; currently defaults to 12h
    • Introduce a new flag
    • If we use the 12h-24h option we can probably reuse --min-resync-period.
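
For context on the --min-resync-period option, here is a minimal sketch of how the shared informer factory's resync period could drive the periodic full reconcile; the kubeconfig path and client setup are placeholders, not how the CCM actually builds its client.

```go
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path; the CCM builds its client differently.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// The factory's resync period re-delivers every cached Node to the
	// registered handlers, which acts as the periodic full reconcile.
	// The 12h-24h range discussed above comes from jittering
	// --min-resync-period by a random factor between 1x and 2x.
	factory := informers.NewSharedInformerFactory(client, 12*time.Hour)
	nodeInformer := factory.Core().V1().Nodes().Informer()

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)

	_ = nodeInformer // handlers like the sketches above would be registered here
}
```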

@elmiko
Contributor

elmiko commented May 9, 2025

i'm fine to continue the discussions here.

Input from @JoelSpeed: We should do it similar to other controllers and choose a random time between 12h and 24h.

i think 12h sounds fine to me.

Are there other Node fields besides node.status.addresses and PodCIDRs that should trigger a route update?

i'll have to think about this a little more, i have a feeling that those are good to start with.

How should we set the interval for the periodic reconcile?

i like the idea of adjusting the default for the --route-reconcile-period, but i don't want users to get confused about this.

my only issue with using --min-resync-period is that it sounds much more general and we are just focusing on the route controller.
