
Conversation

michalg91

Description

This fixes #333

  • The problem was in extracting hostvars via set_fact, which handled all hosts in the inventory
  • register: node_names registered the ansible_facts dict, so the next steps in this block compared a list to a dict and were triggered when not needed:
{'ansible_facts': {'node_names': ['node01', 'node02', 'node03']}, 'failed': False, 'changed': False}
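
A minimal sketch of the described fix, restricting the lookup to hosts in the cluster group and dropping the redundant register; rke2_cluster_group_name is assumed to resolve to the k8s_cluster group, as in the role defaults:

- name: Get all node names
  ansible.builtin.set_fact:
    node_names: "{{ groups[rke2_cluster_group_name] | map('extract', hostvars, 'rke2_node_name') | list }}"
  run_once: true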

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • Small minor change not affecting the Ansible Role code (GitHub Actions Workflow, Documentation etc.)

How Has This Been Tested?

Created clean nodes, set credentials for the S3 restore, then ran the playbook which invokes the role.

…ly hosts in current cluster inventory group
@michalg91 michalg91 marked this pull request as draft July 8, 2025 20:06
@michalg91
Author

michalg91 commented Jul 8, 2025

I encountered another problem (this one appeared for me when I bumped the role from 1.37.0 to 1.44.1).
I have an automated Jenkins pipeline that restores the cluster daily; it now breaks when joining worker nodes.

Commit e43df6b totally changed the logic, so right after my "fix" it is not executed anymore, since in my backup scenario all hostnames stay the same.

@simonfelding do you remember why this logic was changed? I am able to fix it, but I need to know what the use case was here, so my change will not produce any other failures.

@simonfelding
Contributor

simonfelding commented Jul 9, 2025 via email

@michalg91
Author

michalg91 commented Jul 9, 2025

Yeah, sure.
It works like this:

  1. Log in to a control-plane node and trigger rke2 etcd-snapshot save --name xxx to AWS S3
  2. Generate restore variables for the role
  3. Destroy the cluster on VMware with terraform destroy
  4. Recreate fresh nodes from the same Terraform definitions with terraform apply (so all node names are the same as before)
  5. Run ansible-role-rke2 with the provided restore settings

As I mentioned in the issue, it first broke when gathering rke2_node_name, because it also tried to find rke2_node_name on hosts outside the k8s_cluster group:

- name: Get all node names
  ansible.builtin.set_fact:
    node_names: "{{ hostvars | dict2items | map(attribute='value.rke2_node_name') }}"
  run_once: true
  register: node_names

15:44:12  TASK [lablabs.rke2 : Get all node names] ***************************************
15:44:16  fatal: [dev-02-node01]: FAILED! => 
15:44:16    msg: |-
15:44:16      The task includes an option with an undefined variable. The error was: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'rke2_node_name'. 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'rke2_node_name'
15:44:16    
15:44:16      The error appears to be in '/var/lib/jenkins/workspace/PLAT/Infra/etcd-restore-test/ansible/roles/lablabs.rke2/tasks/first_server.yml': line 169, column 7, but may
15:44:16      be elsewhere in the file depending on the exact syntax problem.
15:44:16    
15:44:16      The offending line appears to be:
15:44:16    
15:44:16    
15:44:16          - name: Get all node names
15:44:16            ^ here

So I changed it to look only in the k8s_cluster group, and also removed the register part, because it makes the variable hold a dict instead of a list of hosts. As a result a diff always existed, and it removed all my nodes' secrets (including the fresh first master node's):

18:42:41  TASK [lablabs.rke2 : Remove old <node>.node-password.rke2 secrets] *************
18:42:47  ok: [dev-02-node01] => (item=dev-02-node01)
18:42:48  ok: [dev-02-node01] => (item=dev-02-node02)
18:42:49  ok: [dev-02-node01] => (item=dev-02-node03)
18:42:49  ok: [dev-02-node01] => (item=dev-02-node4)
18:42:49  ok: [dev-02-node01] => (item=dev-02-node5)

And then it hung on node deletion, because it was deleting itself.

After "my" change the step with removing "unwanted secrets" and "unwanted nodes" are skipped because no diff exists (which is true, all my nodes have same name as before destroy).

What I meant by changed logic: in 1.37.0 the restore removed all node secrets in kube-system but skipped the master, so the playbook worked fine. Right now it tries to diff the kubectl get node output with the Ansible inventory nodes. So in my case it skips the deletion of old secrets, and when a node joins the cluster it is rejected because of mismatched credentials: Node password rejected, duplicate hostname or contents of '/etc/rancher/node/password'.
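
For illustration, a hypothetical task to list the stale per-node password secrets that a restore leaves behind in kube-system (run on the first server; the task name is illustrative only):

- name: List node-password secrets left over after restore (illustrative)
  ansible.builtin.shell: |
    set -o pipefail
    {{ rke2_data_path }}/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
    get secrets -n kube-system -o name | grep node-password.rke2
  args:
    executable: /bin/bash
  changed_when: false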

To fix this, I think we should reintroduce the previous behavior in just this step (deleting secrets), so all existing secrets except the current playbook node's will be deleted. The next step, deleting nodes that no longer exist, is fine for me with the differential check.

@michalg91 michalg91 marked this pull request as ready for review July 9, 2025 16:11
@michalg91
Author

michalg91 commented Jul 9, 2025

7518ec5

- name: Remove old <node>.node-password.rke2 secrets
  ansible.builtin.shell: |
    {{ rke2_data_path }}/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
    delete secret {{ item }}.node-password.rke2 -n kube-system 2>&1 || true
  args:
    executable: /bin/bash
  with_items: "{{ registered_node_names.stdout_lines | difference([rke2_node_name]) }}"
  changed_when: false

This additional change fixed all my problems.

@MonolithProjects MonolithProjects added the bug Something isn't working label Jul 10, 2025
@MonolithProjects MonolithProjects self-assigned this Jul 10, 2025
@simonfelding
Contributor

simonfelding commented Jul 10, 2025

Yes, because you effectively just deleted the Remove old <node>.node-password.rke2 secrets step by making the list always empty in your case; you might as well have deleted the block. Your patch makes it delete all secrets from nodes that are not the node used in the first_server_restore.yaml task. I think this is not a good solution.

I think your troubleshooting did not lead you to the correct conclusions (for example, your conclusion that the node_names variable is a dict, not a list; maybe because you omitted dict2items in your patched line 171?). Your problem is actually this error from #333:

15:44:12  TASK [lablabs.rke2 : Get all node names] ***************************************
15:44:16  fatal: [dev-02-node01]: FAILED! => 
15:44:16    msg: |-
15:44:16      The task includes an option with an undefined variable. The error was: 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'rke2_node_name'. 'ansible.vars.hostvars.HostVarsVars object' has no attribute 'rke2_node_name'

The error means that dev-02-node01 does not have a rke2_node_name set. It's set by default based on inventory_hostname in https://github.com/lablabs/ansible-role-rke2/blob/main/defaults/main.yml, so how did you manage not to have it set? Ansible sets this automatically, unless you have gather_facts: false set.
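
For reference, the default in question (as it appears in the linked defaults/main.yml; quoted from memory, so verify against the repo):

rke2_node_name: "{{ inventory_hostname }}"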

Your original error came from the Get all node names step, which happens before Remove old <node>.node-password.rke2 secrets. Since your patch has nothing to do with the step that caused the error, I think you might have just re-enabled gather_facts in your testing process, or only targeted hosts that are in the rke2_servers_group_name group.

A good fix would make line 171 get hostvars only from hosts in the rke2_servers_group_name group, if the cause is that you are targeting more hosts. That would be a much better solution :)

@michalg91
Author

michalg91 commented Jul 10, 2025

No @simonfelding, I ran some debugging and output hostvars; the problem is what I stated and explained. When I use an inventory file that contains not only Kubernetes hosts, or run any play that invokes localhost before invoking the role, it will not find rke2_node_name on hosts that are not part of the cluster group. Just check it yourself :). Set rke2_node_name only in the k8s_cluster group and debug this line by piping it to ansible.builtin.debug, with any other host outside this group.

Deleting secrets does not work in the current version because it checks for a difference between the inventory file and the kubectl output. In my case the host names didn't change (there is no difference), so when the playbook joined the reprovisioned nodes, their RKE2 credentials mismatched. So this step must delete old secrets, just like before commit e43df6b. Also, when you register under the same variable name as the fact name, it dumps a dict instead of a list, so after your commit a difference always falsely existed; in my case the third step, which deletes Kubernetes nodes, also deleted the current play host, which totally broke the recovery.

Of course, this can also be checked in debug.

Check it out.

Put this inside a debug.yaml in the role:

- name: Restore etcd
  when:
    - inventory_hostname == groups[rke2_servers_group_name].0
  block:
    - name: Get registered nodes
      ansible.builtin.shell:
        cmd: |
          set -o pipefail
          {{ rke2_data_path }}/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
          get nodes --no-headers | awk '{print $1}'
      args:
        executable: /bin/bash
      changed_when: false
      register: registered_node_names

    - name: Get all node names
      ansible.builtin.set_fact:
        node_names: "{{ hostvars | dict2items | map(attribute='value.rke2_node_name') }}"
      run_once: true
      register: node_names

    - name: Remove old <node>.node-password.rke2 secrets
      ansible.builtin.debug:
        msg: |
          {{ rke2_data_path }}/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
          delete secret {{ item }}.node-password.rke2 -n kube-system 2>&1 || true
      # args:
      #   executable: /bin/bash
      with_items: "{{ registered_node_names.stdout_lines | difference(node_names) }}"
      changed_when: false

    - name: Remove old nodes
      ansible.builtin.debug:
        msg: |
          {{ rke2_data_path }}/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
          delete node {{ item }} 2>&1 || true
      # args:
      #   executable: /bin/bash
      with_items: "{{ registered_node_names.stdout_lines | difference(node_names) }}"
      changed_when: false

    - name: variables
      ansible.builtin.debug:
        msg: |
          stdout
          {{ registered_node_names.stdout_lines }}

          node_names
          {{ node_names }}

          diffed
          {{ registered_node_names.stdout_lines | difference(node_names) }}

          diffed with node name
          {{ registered_node_names.stdout_lines | difference([rke2_node_name]) }}

and run:

- name: debug
  tags: debug
  gather_facts: yes
  become: yes
  hosts: all
  tasks:
    - ansible.builtin.include_role:
        name: lablabs.rke2
        tasks_from: debug

All in all, recreating a new cluster from an etcd snapshot should delete all previously existing secrets so newly joined nodes can join the cluster. Like I wrote before: it was working with 1.37.0 and is not working in newer releases.

@simonfelding
Contributor

simonfelding commented Jul 10, 2025

All in all, recreating a new cluster from an etcd snapshot should delete all previously existing secrets so newly joined nodes can join the cluster.

This is the part I'm worried about: is that always the case, or only in your case? You could also have a situation where you need to recover the etcd snapshot because etcd failed for some reason, but you didn't delete the nodes. Your code always deletes all the node-password.rke2 secrets (except for one). The original idea is just to delete nodes that are no longer in the Ansible inventory so the playbook can restore from an etcd backup. I made the host names a list so I could use it again to also delete the node objects in Kubernetes, not just the passwords. I put the code inside a block to make it more readable and simplify the variable use. I accidentally introduced this bug where it tries to get hostnames from outside the rke2_servers_group_name group.

I'm not 100% sure why RKE2 uses the node passwords like that, but it looks like a security feature? I think we shouldn't delete all the secrets just because etcd is being restored.

It looks to me like the problem is actually just that line 171 assumes all hosts are rke2 nodes - which is why your localhost was included:

node_names: "{{ hostvars | dict2items | map(attribute='value.rke2_node_name') }}"

Isn't the solution just to change it to something like "{{ hostvars[groups['rke2_servers_group_name']] | dict2items | map(attribute='value.rke2_node_name') }}"?
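
A sketch of that idea which avoids indexing hostvars with a list (hostvars is keyed by individual hostnames), using the extract filter instead; rke2_servers_group_name is assumed to hold the servers group name, as in the role defaults:

- name: Get all node names
  ansible.builtin.set_fact:
    node_names: "{{ groups[rke2_servers_group_name] | map('extract', hostvars, 'rke2_node_name') | list }}"
  run_once: true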

Although I do agree that your use case, where you destroy the nodes and recreate them with the same names, should be handled, I'm not sure whether it should be handled by this Ansible role, by RKE2 upstream, or by your own playbook.

From the RKE2 docs:

Agents register with the server using the cluster secret portion of the join token, along with a randomly generated node-specific password, which is stored on the agent at /etc/rancher/node/password. The server will store the passwords for individual nodes as Kubernetes secrets, and any subsequent attempts must use the same password. Node password secrets are stored in the kube-system namespace with names using the template <host>.node-password.rke2. These secrets are deleted when the corresponding Kubernetes node is deleted.

If the /etc/rancher/node directory of an agent is removed, the password file should be recreated for the agent prior to startup, or the entry removed from the server or Kubernetes cluster (depending on the RKE2 version).

I think you could solve the node-joining problem like the RKE2 docs suggest, by making your pipeline do one of the following:

  • Delete the nodes from Kubernetes before you back up etcd
  • Recreate the password on the nodes prior to startup (a sketch of this option follows after this list)
  • Use hostnames for your nodes that include a UUID, not a recycled sequence of numbers (it's good practice, for exactly the reason I believe RKE2 tries to address with the node-password.rke2 system). This is probably harder for you, since you would need Ansible to use a dynamic inventory, or have Terraform generate it for you.
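
A minimal sketch of the second option, assuming the password file is saved somewhere persistent before the VMs are destroyed; the backups/ path and the task split are illustrative only:

- name: Back up the node password before destroying the VM
  ansible.builtin.fetch:
    src: /etc/rancher/node/password
    dest: backups/    # hypothetical local directory on the controller
    flat: false       # stores as backups/<inventory_hostname>/etc/rancher/node/password

- name: Ensure the rancher node directory exists on the fresh node
  ansible.builtin.file:
    path: /etc/rancher/node
    state: directory
    mode: "0755"

- name: Recreate the node password before rke2 starts
  ansible.builtin.copy:
    src: "backups/{{ inventory_hostname }}/etc/rancher/node/password"
    dest: /etc/rancher/node/password
    owner: root
    group: root
    mode: "0600"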

@simonfelding
Contributor

simonfelding commented Jul 10, 2025

I think the reason node_names returns a dict is this:

    - name: Get all node names
      ansible.builtin.set_fact:
        node_names: "{{ hostvars | dict2items | map(attribute='value.rke2_node_name') }}"
      run_once: true 
      register: node_names

It shouldn't use register, as set_fact already registers it and makes it globally accessible as a list :) Registering the output of the fact-setting task returns a dict with the result of the registration. Nice catch!
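
A small demo of the difference; node_names_result is a hypothetical name used here to avoid shadowing the fact:

- name: Set a fact and also register the task result
  ansible.builtin.set_fact:
    node_names: ["node01", "node02", "node03"]
  register: node_names_result    # hypothetical; registering as 'node_names' would shadow the list

- name: Show the fact vs. the registered result
  ansible.builtin.debug:
    msg:
      - "{{ node_names }}"          # the list itself
      - "{{ node_names_result }}"   # a dict: {'ansible_facts': {'node_names': [...]}, 'changed': ..., 'failed': ...}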

@simonfelding
Contributor

simonfelding commented Jul 10, 2025

Actually, the entire block is misplaced/should be deleted.

- name: Restore etcd
  when: do_etcd_restore is defined or do_etcd_restore_from_s3 is defined
  block:
    - name: Get registered nodes
      ansible.builtin.shell:
        cmd: |
          set -o pipefail
          {{ rke2_data_path }}/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
          get nodes --no-headers | awk '{print $1}'
      args:
        executable: /bin/bash
      changed_when: false
      register: registered_node_names

    - name: Get all node names
      ansible.builtin.set_fact:
        node_names: "{{ hostvars | dict2items | map(attribute='value.rke2_node_name') }}"
      run_once: true
      register: node_names

    - name: Remove old <node>.node-password.rke2 secrets
      ansible.builtin.shell: |
        {{ rke2_data_path }}/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
        delete secret {{ item }}.node-password.rke2 -n kube-system 2>&1 || true
      args:
        executable: /bin/bash
      with_items: "{{ registered_node_names.stdout_lines | difference(node_names) }}"
      changed_when: false

    - name: Remove old nodes
      ansible.builtin.shell: |
        {{ rke2_data_path }}/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
        delete node {{ item }} 2>&1 || true
      args:
        executable: /bin/bash
      with_items: "{{ registered_node_names.stdout_lines | difference(node_names) }}"
      changed_when: false

It doesn't do what it says it does (restore etcd), as that was already done earlier in the file. It actually deletes nodes that are no longer present in the Ansible inventory, so they can join again after restoring etcd.

Let's say you have nodes joined that aren't managed by Ansible; this block would remove those too.

I think this whole thing is pretty delicate and should be handled with care, as the normal use of the etcd restore function is an emergency situation, not a cluster bootstrap situation like your pipeline (why?). I'd be pissed if I tried to restore etcd and had my secrets and nodes deleted without my consent.

This whole thing is kind of an upstream RKE2 issue, in the sense that RKE2 doesn't really support this specific use case of deleting node VMs, restoring etcd, and then joining nodes with the same hostnames as those that were deleted, because the node-password.rke2 system cannot (or isn't supposed to) clean up when nodes are deleted without being removed from Kubernetes. You might want to discuss it on the RKE2 issue tracker too and link this PR.

This ansible role isn't meant to add features to RKE2, so maybe we should just delete this entire block of code and inform users of this issue instead, because I think it's kind of common in test clusters.

Maybe your patch could instead do the following:

  • Delete L155-L192
  • In the beginning of remaining_nodes.yml, detect if the user is trying to join a node that already has a node-password.rke2 secret in the cluster and a different password on the node. Give an informative warning about it, with suggestions on how to resolve and avoid the problem, instead of mysteriously failing to join (a sketch of such a check follows after this list).
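
A rough sketch of such a pre-join check, run against the first server's API; the task names, the registered variable, and the gating logic are illustrative assumptions, not the role's actual code:

- name: Check for an existing node-password secret for this host
  ansible.builtin.command: >
    {{ rke2_data_path }}/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml
    get secret {{ rke2_node_name }}.node-password.rke2 -n kube-system
  delegate_to: "{{ groups[rke2_servers_group_name].0 }}"
  register: node_password_secret    # hypothetical variable name
  failed_when: false
  changed_when: false

- name: Warn when a stale node-password secret may block the join
  ansible.builtin.debug:
    msg: >-
      A node-password.rke2 secret already exists for {{ rke2_node_name }}.
      If this node was reprovisioned, joining may fail with 'Node password
      rejected'. Delete the secret or restore /etc/rancher/node/password
      before starting rke2.
  when: node_password_secret.rc == 0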

What's your take on this, @MonolithProjects?


Alternatively, I think this block belongs at the beginning of the remaining_nodes.yml file, to make it clear that it's a matter of removing nodes that aren't in the Ansible inventory. It's not directly related to restoring etcd, and it unintuitively only affects nodes that are not the first server.

If we don't just delete it, it would be super cool if you could make your patch:

  • fix the bug by getting rke2_node_name only from the cluster nodes, as proposed above
  • move the block to remaining_nodes.yml
  • change the name and the when:, so it stops being tied to etcd restore
  • make running the block optional, gated by a variable in when:.

@simonfelding
Contributor

  • Delete L155-L192
  • In the beginning of remaining_nodes.yml, detect if the user is trying to join a node that already has a node-password.rke2 secret in the cluster and a different password on the node. Give an informative warning about it, with suggestions on how to resolve and avoid the problem, instead of mysteriously failing to join.

I'll just make a PR with this now.

@michalg91
Author

I can agree that this part of the code is unnecessary, but I have covered both test case scenarios (destroying the nodes and just recovering etcd), and the second one works fine. I don't think #335 should fail; instead I'd rather introduce a switch that destroys the secrets. I think this would cover everything. We should discuss it in #335.

@michalg91 michalg91 closed this Jul 10, 2025