Fail if node cannot join the cluster because a <host>.node-password.rke2 secret already exists in cluster #335
Conversation
Fixes #333
we should not delete this file; instead, we should put the logic for deleting secrets and diffed nodes here, like in the previously deleted parts, and execute it only if specific variables are set. This will cover all the cases that worked before in 1.37.0:
restoring a cluster with the same node names, restoring a cluster with ephemeral node names, etc.
this task should stay and should be triggered only if another switch exists, e.g. rke2_cleanup_secrets. This file is only triggered on a fresh unprovisioned or just-restored cluster.
I've tested and added some changes to @simonfelding's repository. It is working; waiting for the merge so we can review it here.
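A minimal sketch of what such a switch-guarded cleanup task could look like, assuming a hypothetical rke2_cleanup_secrets toggle. The kubectl and kubeconfig paths are the standard RKE2 locations, and the loop over the play hosts is an illustration, not the role's actual implementation:

- name: Restore etcd - cleanup <node>.node-password.rke2 secrets
  ansible.builtin.command: >-
    /var/lib/rancher/rke2/bin/kubectl
    --kubeconfig /etc/rancher/rke2/rke2.yaml
    delete secret {{ hostvars[item].rke2_node_name | default(item) }}.node-password.rke2
    --namespace kube-system --ignore-not-found=true
  loop: "{{ ansible_play_hosts_all }}"
  run_once: true
  changed_when: true
  when: rke2_cleanup_secrets | default(false) | bool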
Hi @michalg91, thanks! I really do think it should fail though, and that the code doesn't belong in this project at all. You could fix your pipeline simply by deleting the nodes from your cluster (…). Would love to hear @MonolithProjects' take on it.
I can, but we are testing a disaster recovery scenario, not preparation for recovery. So we need full etcd recovery to be working, not prepared early with deleted nodes. This is why I reverted the change. About the introduced variable: if the condition is false (by default), the role will fail as you provided. I think this is a win-win situation: we're not breaking someone else's usage, and we introduced some default safety tasks. (Additional changes I am covering are waiting here: https://github.com/simonfelding/ansible-role-rke2/pull/1/files)
I'm okay with that toggle - looking forward to hearing what @MonolithProjects thinks :) I don't want to do more work on it before we agree on what direction we are going with it.
Hi all, the cluster restoration option was added to this role two years ago, and from my point of view, it is quite useful. Additionally, I don't agree that it adds a new feature to RKE2, as it actually uses a built-in command for cluster restoration. So, I am in favor of fixing it rather than removing it.
OK! @michalg91 I merged your changes. Can you verify it works as expected?
It's been working for me for a couple of weeks :)
Hi @simonfelding, please do those small changes. Other than that, the PR looks fine. Thx!
register: registered_node_names

- name: Restore etcd - remove old nodes
- name: Restore etcd - cleanup <node>.node-password.rke2 secrets
please delete the trailing whitespace https://github.com/lablabs/ansible-role-rke2/actions/runs/16873948643/job/49236826242
To join this node, please recreate the file with the password, use a different node name (rke2_node_name), or remove the secret from the cluster using:
kubectl delete secret {{ rke2_node_name }}.node-password.rke2 -n kube-system
when:
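Pieced together, the guarded check-and-fail could look roughly like the sketch below. The existence check, the registered variable name, and the exact when condition are assumptions; only the message wording and the rke2_cleanup_on_restore default come from this PR:

- name: Restore etcd - check for an existing node-password secret
  ansible.builtin.command: >-
    /var/lib/rancher/rke2/bin/kubectl
    --kubeconfig /etc/rancher/rke2/rke2.yaml
    get secret {{ rke2_node_name }}.node-password.rke2 --namespace kube-system
  register: node_password_secret  # hypothetical variable name
  failed_when: false
  changed_when: false

- name: Fail if the node cannot join because its node-password secret already exists
  ansible.builtin.fail:
    msg: >-
      A secret named {{ rke2_node_name }}.node-password.rke2 already exists in the cluster.
      To join this node, please recreate the file with the password, use a different node name
      (rke2_node_name), or remove the secret from the cluster using:
      kubectl delete secret {{ rke2_node_name }}.node-password.rke2 -n kube-system
  when:
    - node_password_secret.rc == 0
    - not (rke2_cleanup_on_restore | default(false) | bool)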
please delete the trailing whitespace https://github.com/lablabs/ansible-role-rke2/actions/runs/16873948643/job/49236826242
rke2_snapshooter: overlayfs  # legacy variable that only exists to keep backward compatibility with previous configurations

# when doing a restore, allow cleanup of old node secrets and removal of nodes that no longer exist
rke2_cleanup_on_restore: false
please add this new variable also to the README.md file and to the argument_specs.yml
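For reference, a minimal sketch of the corresponding entry in meta/argument_specs.yml; the description text is a suggestion, not the role's final wording:

argument_specs:
  main:
    options:
      rke2_cleanup_on_restore:
        type: bool
        default: false
        description: When doing a restore, allow cleanup of old node-password secrets and removal of nodes that no longer exist.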
Description
This Ansible role isn't meant to add features to RKE2, so we should just delete this entire block of code and inform users of this issue instead, because I think it's fairly common in test clusters.
The whole thing is really an upstream RKE2 issue, in the sense that RKE2 doesn't support the specific use case of deleting node VMs, restoring etcd, and then joining nodes with the same hostnames as those that were deleted, because the node-password.rke2 secret mechanism cannot (or isn't supposed to) clean up when nodes are deleted without first being removed from Kubernetes.
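In practice, the manual workaround suggested in this thread is to delete the stale Node object (kubectl delete node <host>) and the matching secret (kubectl delete secret <host>.node-password.rke2 -n kube-system) before rejoining a node with a reused hostname.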
See discussion in #334
Type of change
How Has This Been Tested?
I tested the commands, but I haven't tested the code. I hope @michalg91 can test it for me.