
Conversation

simonfelding (Contributor)

Description

This Ansible role isn't meant to add features to RKE2, so we should just delete this entire block of code and inform users about the issue instead, since I think it's fairly common in test clusters.

This whole thing is really an upstream RKE2 issue, in the sense that RKE2 doesn't support the specific use case of deleting node VMs, restoring etcd, and then joining nodes with the same hostnames as those that were deleted: the node-password.rke2 secret mechanism cannot (or isn't supposed to) clean up after nodes that are deleted without first being removed from Kubernetes.
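For context, a minimal sketch of what clearing such a stale secret looks like; the task below is hypothetical, not part of this PR, and assumes kubectl works against the cluster:

```yaml
# Hypothetical illustration only: remove the stale per-node join-password
# secret so a rebuilt node with the same hostname can join again.
# rke2_node_name is the role's variable for the node's name.
- name: Remove stale node-password secret for a rebuilt node
  ansible.builtin.command: >-
    kubectl delete secret {{ rke2_node_name }}.node-password.rke2
    -n kube-system
  changed_when: true
```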

See discussion in #334

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • Small minor change not affecting the Ansible Role code (GitHub Actions Workflow, Documentation etc.)

How Has This Been Tested?

I tested the commands, but I haven't tested the code. I hope @michalg91 can test it for me.

@simonfelding (Contributor, Author)

Fixes #333

@michalg91 commented Jul 10, 2025

As mentioned in #334, I'd rather not fail the playbook, but instead add the ability to delete the secret behind a variable switch. That covers all scenarios and doesn't force a redesign of whole recovery-testing pipelines (like in my case). Let's work on this here; I am closing #334.

@michalg91 commented Jul 10, 2025 (review comment):

We should not delete this file; instead, put the secret-deletion and node-diffing logic here, like in the previously deleted parts, and execute it only if specific variables are set. This will cover all the cases that worked before 1.37.0: restoring a cluster with the same node names, restoring a cluster with ephemeral node names, etc.

@michalg91 commented Jul 10, 2025 (review comment):

This task should stay, but it should be triggered only if another switch is set, e.g. rke2_cleanup_secrets. This file is only run on a fresh, unprovisioned, just-restored cluster.
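A sketch of that gating (the switch name comes from this comment; the PR later settled on rke2_cleanup_on_restore, and the loop source below is a hypothetical stand-in):

```yaml
# Sketch only: keep the cleanup task, but run it solely when the user
# opts in via a variable that defaults to false.
- name: Restore etcd - cleanup <node>.node-password.rke2 secrets
  ansible.builtin.command: >-
    kubectl delete secret {{ item }}.node-password.rke2 -n kube-system
  loop: "{{ registered_node_names.stdout_lines | default([]) }}"  # hypothetical source
  when: rke2_cleanup_secrets | default(false)
```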

@michalg91

I've tested it and added some changes to @simonfelding's repository. It is working; waiting for the merge so we can review it here.

@simonfelding (Contributor, Author) commented Jul 11, 2025

Hi @michalg91, thanks!

I really do think it should fail though, and that the code doesn't belong in this project at all.

You could fix your pipeline by simply deleting the nodes from your cluster (for host in node1 node2; do kubectl delete node $host; done) before backing up etcd. Or even just delete the secrets, not the nodes.
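In playbook form, that pre-backup cleanup might look like this (a sketch; the node names are placeholders):

```yaml
# Sketch of the suggested pre-backup cleanup. Either delete the node
# objects entirely before the etcd snapshot is taken...
- name: Delete nodes before backing up etcd
  ansible.builtin.command: kubectl delete node {{ item }}
  loop: [node1, node2]

# ...or keep the node objects and remove only their join-password secrets.
- name: Delete only the node-password secrets
  ansible.builtin.command: >-
    kubectl delete secret {{ item }}.node-password.rke2 -n kube-system
  loop: [node1, node2]
```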

Would love to hear @MonolithProjects' take on it.

@michalg91 commented Jul 11, 2025

I could, but we are testing a disaster recovery scenario, not preparation for recovery. So we need full etcd recovery to work as-is, not a backup prepared in advance with the nodes already deleted. This is why I reverted first-server-restore.yaml and added an additional condition to it (worth noting that this file is only triggered when you are doing recovery on a fresh cluster, and it has behaved this way since I started using this role ~2 years ago).

About the introduced variable: if the condition is false (the default), the role will fail, as you proposed.

I think this is a win-win situation. We're not breaking anyone else's usage, and we've introduced some default safety tasks.

(The additional changes I am covering are waiting here: https://github.com/simonfelding/ansible-role-rke2/pull/1/files)

@simonfelding (Contributor, Author)

I'm okay with that toggle; looking forward to hearing what @MonolithProjects thinks :) I don't want to do more work on it before we agree on the direction we're taking.

@MonolithProjects self-assigned this Jul 21, 2025
@MonolithProjects (Collaborator)

Hi all. The cluster restoration option was added to this role two years ago, and from my point of view it is quite useful. Additionally, I don't agree that it adds a new feature to RKE2, as it actually uses a built-in command for cluster restoration. So I am in favor of fixing it rather than removing it.
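For reference, the built-in command in question is RKE2's cluster reset; roughly (snapshot name illustrative):

```yaml
# Illustrative only: the underlying RKE2 invocation that snapshot-based
# restoration is built around (run on the first server while the
# rke2-server service is stopped).
- name: Restore etcd from a snapshot
  ansible.builtin.command: >-
    rke2 server --cluster-reset
    --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/<snapshot-name>
```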

@simonfelding (Contributor, Author) commented Aug 11, 2025

OK!

@michalg91 I merged your changes. Can you verify it works as expected?

@michalg91

It's been working for me for a couple of weeks :)

@MonolithProjects added the bug label Aug 29, 2025
@MonolithProjects (Collaborator) left a comment:

Hi @simonfelding, please make those small changes. Other than that, the PR looks fine. Thx!

register: registered_node_names

- name: Restore etcd - remove old nodes
- name: Restore etcd - cleanup <node>.node-password.rke2 secrets
@MonolithProjects (Collaborator) commented:

To join this node, please recreate the file with the password, use a different node name (rke2_node_name), or remove the secret from the cluster using:
kubectl delete secret {{ rke2_node_name }}.node-password.rke2 -n kube-system
when:

rke2_snapshooter: overlayfs # legacy variable that only exists to keep backward compatibility with previous configurations

# When doing a restore, allow cleanup of old node secrets and removal of nodes that no longer exist
rke2_cleanup_on_restore: false
@MonolithProjects (Collaborator) commented:

Please also add this new variable to the README.md file and to argument_specs.yml.
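A sketch of the requested argument_specs.yml entry (the description wording is just a suggestion):

```yaml
# meta/argument_specs.yml (excerpt, suggested wording)
argument_specs:
  main:
    options:
      rke2_cleanup_on_restore:
        type: bool
        default: false
        description: On etcd restore, remove stale node-password secrets and nodes that no longer exist.
```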
