
Conversation

simonfelding (Contributor)

Description

This Ansible role isn't meant to add features to RKE2, so we should just delete this entire block of code and inform users about the issue instead, since I think it's fairly common in test clusters.

This whole thing is really an upstream RKE2 issue, in the sense that RKE2 doesn't support the specific use case of deleting node VMs, restoring etcd, and then joining nodes with the same hostnames as those that were deleted: the node-password.rke2 secret mechanism cannot (or isn't supposed to) clean up after nodes that are deleted without first being removed from Kubernetes.
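For context, a minimal sketch of what clearing such a stale secret looks like; the task below is hypothetical, not part of this PR, and assumes kubectl works against the cluster:

```yaml
# Hypothetical illustration only: remove the stale per-node join-password
# secret so a rebuilt node with the same hostname can join again.
# rke2_node_name is the role's variable for the node's name.
- name: Remove stale node-password secret for a rebuilt node
  ansible.builtin.command: >-
    kubectl delete secret {{ rke2_node_name }}.node-password.rke2
    -n kube-system
  changed_when: true
```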

See discussion in #334

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • Small minor change not affecting the Ansible Role code (GitHub Actions Workflow, Documentation etc.)

How Has This Been Tested?

I tested the commands, but I haven't tested the code. I hope @michalg91 can test it for me.

@simonfelding (Contributor, Author)

Fixes #333

@michalg91 commented Jul 10, 2025

As mentioned in #334, I'd rather not fail the playbook, but instead add the ability to delete the secret behind a variable switch. That covers all scenarios and doesn't force a redesign of whole recovery-testing pipelines (like in my case). Let's work on this here; I am closing #334.

@michalg91 commented Jul 10, 2025 (review comment):

We should not delete this file; instead, put the secret-deletion and node-diffing logic here, like in the previously deleted parts, and execute it only if specific variables are set. This will cover all the cases that worked before 1.37.0: restoring a cluster with the same node names, restoring a cluster with ephemeral node names, etc.

@michalg91 commented Jul 10, 2025 (review comment):

This task should stay, but it should be triggered only if another switch is set, e.g. rke2_cleanup_secrets. This file is only run on a fresh, unprovisioned, just-restored cluster.
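A sketch of that gating (the switch name comes from this comment; the PR later settled on rke2_cleanup_on_restore, and the loop source below is a hypothetical stand-in):

```yaml
# Sketch only: keep the cleanup task, but run it solely when the user
# opts in via a variable that defaults to false.
- name: Restore etcd - cleanup <node>.node-password.rke2 secrets
  ansible.builtin.command: >-
    kubectl delete secret {{ item }}.node-password.rke2 -n kube-system
  loop: "{{ registered_node_names.stdout_lines | default([]) }}"  # hypothetical source
  when: rke2_cleanup_secrets | default(false)
```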

@michalg91

I've tested it and added some changes to @simonfelding's repository. It is working; waiting for the merge so we can review it here.

@simonfelding (Contributor, Author) commented Jul 11, 2025

Hi @michalg91, thanks!

I really do think it should fail though, and that the code doesn't belong in this project at all.

You could fix your pipeline by simply deleting the nodes from your cluster (for host in node1 node2; do kubectl delete node $host; done) before backing up etcd. Or even just delete the secrets, not the nodes.
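In playbook form, that pre-backup cleanup might look like this (a sketch; the node names are placeholders):

```yaml
# Sketch of the suggested pre-backup cleanup. Either delete the node
# objects entirely before the etcd snapshot is taken...
- name: Delete nodes before backing up etcd
  ansible.builtin.command: kubectl delete node {{ item }}
  loop: [node1, node2]

# ...or keep the node objects and remove only their join-password secrets.
- name: Delete only the node-password secrets
  ansible.builtin.command: >-
    kubectl delete secret {{ item }}.node-password.rke2 -n kube-system
  loop: [node1, node2]
```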

Would love to hear @MonolithProjects' take on it.

@michalg91 commented Jul 11, 2025

I could, but we are testing a disaster recovery scenario, not preparation for recovery. So we need full etcd recovery to work as-is, not a backup prepared in advance with the nodes already deleted. This is why I reverted first-server-restore.yaml and added an additional condition to it (worth noting that this file is only triggered when you are doing recovery on a fresh cluster, and it has behaved this way since I started using this role ~2 years ago).

About the introduced variable: if the condition is false (the default), the role will fail, as you proposed.

I think this is a win-win situation. We're not breaking anyone else's usage, and we've introduced some default safety tasks.

(The additional changes I am covering are waiting here: https://github.com/simonfelding/ansible-role-rke2/pull/1/files)

@simonfelding (Contributor, Author)

I'm okay with that toggle; looking forward to hearing what @MonolithProjects thinks :) I don't want to do more work on it before we agree on the direction we're taking.

@MonolithProjects self-assigned this Jul 21, 2025
@MonolithProjects (Collaborator)

Hi all. The cluster restoration option was added to this role two years ago, and from my point of view it is quite useful. Additionally, I don't agree that it adds a new feature to RKE2, as it actually uses a built-in command for cluster restoration. So I am in favor of fixing it rather than removing it.
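For reference, the built-in command in question is RKE2's cluster reset; roughly (snapshot name illustrative):

```yaml
# Illustrative only: the underlying RKE2 invocation that snapshot-based
# restoration is built around (run on the first server while the
# rke2-server service is stopped).
- name: Restore etcd from a snapshot
  ansible.builtin.command: >-
    rke2 server --cluster-reset
    --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/<snapshot-name>
```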

@simonfelding (Contributor, Author) commented Aug 11, 2025

OK!

@michalg91 I merged your changes. Can you verify it works as expected?

@michalg91

It's been working for me for a couple of weeks :)

@MonolithProjects added the bug label Aug 29, 2025
@MonolithProjects (Collaborator) left a comment:

Hi @simonfelding, please make those small changes. Other than that, the PR looks fine. Thx!

register: registered_node_names

- name: Restore etcd - remove old nodes
- name: Restore etcd - cleanup <node>.node-password.rke2 secrets
@MonolithProjects (Collaborator) commented:

To join this node, please recreate the file with the password, use a different node name (rke2_node_name), or remove the secret from the cluster using:
kubectl delete secret {{ rke2_node_name }}.node-password.rke2 -n kube-system
when:

rke2_snapshooter: overlayfs # legacy variable that only exists to keep backward compatibility with previous configurations

# When doing a restore, allow cleanup of old node secrets and removal of nodes that no longer exist
rke2_cleanup_on_restore: false
@MonolithProjects (Collaborator) commented:

Please also add this new variable to the README.md file and to argument_specs.yml.
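A sketch of the requested argument_specs.yml entry (the description wording is just a suggestion):

```yaml
# meta/argument_specs.yml (excerpt, suggested wording)
argument_specs:
  main:
    options:
      rke2_cleanup_on_restore:
        type: bool
        default: false
        description: On etcd restore, remove stale node-password secrets and nodes that no longer exist.
```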
