fix: #333 get rke2_node_name for only hosts in current cluster inventory group #334
Conversation
I encountered another problem (this one showed up for me when I bumped the role from 1.37.0 to 1.44.1). I have an automatic Jenkins pipeline that restores the cluster daily, and it now breaks when joining worker nodes. Commit e43df6b completely changed the logic, so right after my "fix" it is not executed anymore, because all hostnames keep the same names in my backup scenario. @simonfelding, do you remember why this logic was changed? I am able to fix it, but I need to know what the original case was, so that my change does not produce any other failures.
Hey, would love to help you out on this. Could you give us a bit more info on what exactly doesn't work? It's not entirely clear to me what your pipeline is doing and what the error is.
Yeah, sure.
As I mentioned in the issue, it first broke in the gathering step:
ansible-role-rke2/tasks/first_server.yml Lines 169 to 173 in 602b420
So I changed it to look in the k8s cluster group, and I also removed the `register` part, because it causes the variable to hold a dict instead of a list of hosts. Because of that, a diff always existed and the play removed all my node secrets (including the one for the fresh first master node).
It then hung on the node deletion step, because it was deleting itself. After "my" change, the steps that remove the "unwanted secrets" and "unwanted nodes" are skipped because no diff exists (which is correct: all my nodes have the same names as before the destroy). That is what I meant by changed logic. In 1.37.0 the restore removed all node secrets in kube-system but skipped the master, so the playbook worked fine. Right now it tries to diff. To fix this, I think we should reintroduce the previous behaviour for just this step (deleting secrets), so that all existing secrets except the current playbook node's are deleted. The next step, deleting nodes that no longer exist, is fine for me with the differential check.
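A minimal sketch of the change described above, assuming the role's `rke2_cluster_group_name` group variable is the "k8s cluster group" referred to here: collect `rke2_node_name` only from hosts in that inventory group and keep `node_names` a plain list (no `register`).

```yaml
# Sketch only: restrict the hostvars lookup to the cluster inventory group and
# do not register over the fact, so node_names stays a flat list of names.
- name: Get all node names
  ansible.builtin.set_fact:
    node_names: "{{ groups[rke2_cluster_group_name] | map('extract', hostvars, 'rke2_node_name') | list }}"
  run_once: true
```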
ansible-role-rke2/tasks/first_server.yml Lines 173 to 180 in 7518ec5
This additional change fixed all my problems.
Yes, because you effectively just deleted the … I think your troubleshooting did not lead you to the correct conclusions (for example your conclusion about the …).
The error means that … Your original error came from the … A good fix would make line 171 get …
No @simonfelding, I ran some debug and output hostvars; the problem is what I stated and explained. When I am using an inventory file with not only Kubernetes hosts, or running any play which invokes localhost before invoking the role, it will not find … Deleting secrets does not work with the current version because it checks for a difference between the inventory file and the kubectl output. In my case the host names didn't change (there is no difference), so when the playbook was joining the reprovisioned nodes, their rke2 credentials mismatched. So this step must delete the old secrets, just like before commit e43df6b. Also, when you register the same variable name as the fact name, it dumps a dict instead of a list, so after your commit a difference always existed falsely; in my case the third step, which deletes Kubernetes nodes, also deleted the current play host, so it totally broke the recovery. Of course this can also be checked in debug. Check it out: put this inside a debug.yaml in the role:

```yaml
- name: Restore etcd
  when:
    - inventory_hostname == groups[rke2_servers_group_name].0
  block:
    - name: Get registered nodes
      ansible.builtin.shell:
        cmd: |
          set -o pipefail
          {{ rke2_data_path }}/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
          get nodes --no-headers | awk '{print $1}'
      args:
        executable: /bin/bash
      changed_when: false
      register: registered_node_names

    - name: Get all node names
      ansible.builtin.set_fact:
        node_names: "{{ hostvars | dict2items | map(attribute='value.rke2_node_name') }}"
      run_once: true
      register: node_names

    - name: Remove old <node>.node-password.rke2 secrets
      ansible.builtin.debug:
        msg: |
          {{ rke2_data_path }}/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
          delete secret {{ item }}.node-password.rke2 -n kube-system 2>&1 || true
      # args:
      #   executable: /bin/bash
      with_items: "{{ registered_node_names.stdout_lines | difference(node_names) }}"
      changed_when: false

    - name: Remove old nodes
      ansible.builtin.debug:
        msg: |
          {{ rke2_data_path }}/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
          delete node {{ item }} 2>&1 || true
      # args:
      #   executable: /bin/bash
      with_items: "{{ registered_node_names.stdout_lines | difference(node_names) }}"
      changed_when: false

    - name: variables
      ansible.builtin.debug:
        msg: |
          stdout
          {{ registered_node_names.stdout_lines }}
          node_names
          {{ node_names }}
          diffed
          {{ registered_node_names.stdout_lines | difference(node_names) }}
          diffed with node name
          {{ registered_node_names.stdout_lines | difference([rke2_node_name]) }}
```

and run: …
All in all, recreating a new cluster from an etcd snapshot should delete all previously existing secrets so that newly joined nodes can join the cluster. Like I wrote before, it was working with …
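To make the list-versus-dict mismatch visible, here is a small debug-only sketch reusing the variable names from the snippet above (nothing new is assumed): when `register: node_names` shadows the fact, `difference()` ends up comparing the kubectl output against the registered result dict rather than against a list of node names, so the diff never shrinks.

```yaml
# Debug-only sketch: print the type of node_names, the plain fact list, and
# the diff the role actually computes, to show why it is never empty.
- name: Show why the difference never shrinks
  ansible.builtin.debug:
    msg:
      - "node_names type: {{ node_names | type_debug }}"
      - "plain list of names: {{ hostvars | dict2items | map(attribute='value.rke2_node_name') | list }}"
      - "diff used by the role: {{ registered_node_names.stdout_lines | difference(node_names) }}"
  run_once: true
```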
This is the part I'm worried about: is that always the case, or only in your case? You could also have a situation where you need to recover the etcd snapshot because etcd failed for some reason, but you didn't delete the nodes. Your code always deletes all the node-password.rke2 secrets (except for one).

The original idea is just to delete nodes that are no longer in the ansible inventory so the playbook can restore from an etcd backup. I made the host names a list so I could use it again to also delete the node objects in Kubernetes, not just the passwords. I put the code inside a block to make it more readable and simplify the variable use. I accidentally introduced this bug where it tries to get hostnames from outside the …

I'm not 100% sure why rke2 uses the node passwords like that, but it looks like a security feature? I think we shouldn't just delete all the secrets just because etcd is being restored. It looks to me like the problem is actually just that line 171 assumes all hosts are rke2 nodes, which is why your localhost was included:
ansible-role-rke2/tasks/first_server.yml Line 171 in 602b420
Isn't the solution just to change it to something like …? Although I do agree that your use case, where you destroy the nodes and recreate them with the same names, should be handled, I'm not sure whether it should be handled by this ansible role, by rke2 upstream, or by your own playbook. From the RKE2 docs: …
I think you could just solve the node-joining problem the way the RKE2 docs suggest, by either making your pipeline: …
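As an illustration only, a hedged sketch of what such a pipeline-side step might look like outside this role. It reuses the variables seen in the tasks above (`rke2_data_path`, `rke2_node_name`, `rke2_servers_group_name`) and the same kubectl command pattern; it is not the role's or the RKE2 docs' exact procedure.

```yaml
# Hypothetical pre-join step for a restore pipeline, not part of this role:
# remove the stale <node>.node-password.rke2 secret on the first server so a
# rebuilt node with the same hostname can register again. Mirrors the
# "|| true" tolerance of the role's own cleanup tasks.
- name: Delete stale node-password secret for a rebuilt node
  ansible.builtin.command: >
    {{ rke2_data_path }}/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml
    delete secret {{ rke2_node_name }}.node-password.rke2 -n kube-system
  delegate_to: "{{ groups[rke2_servers_group_name].0 }}"
  failed_when: false
  changed_when: false
```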
I think the reason …
It shouldn't use …
Actually, the entire block is misplaced/should be deleted. ansible-role-rke2/tasks/first_server.yml Lines 155 to 192 in 602b420
It doesn't do what it says it does (restore etcd), as this was already done earlier in the file. It actually deletes nodes that are no longer present in the ansible inventory so they can join again after restoring etcd. Let's say you have nodes joined that aren't managed by ansible; then this block would remove those too.

I think this whole thing is pretty delicate and should be handled with care, as the normal use of the etcd restore function is an emergency situation, not a cluster bootstrap situation like the one in your pipeline (why?). I'd be pissed if I tried to restore etcd and had my secrets and nodes deleted without my consent.

This whole thing is kind of an upstream RKE2 issue, in the sense that RKE2 doesn't really support this specific use case of deleting node VMs, restoring etcd and then joining nodes with the same hostnames as those that have been deleted, because the node-password.rke2 system cannot (or isn't supposed to) clean up when nodes are deleted without being removed from Kubernetes. You might want to discuss it on the RKE2 issue tracker too and link this PR.

This ansible role isn't meant to add features to RKE2, so maybe we should just delete this entire block of code and inform users of this issue instead, because I think it's kind of common in test clusters. Maybe your patch could instead do the following: …
What's your take on this, @MonolithProjects? Alternatively, I think this block belongs at the beginning of … If we don't just delete it, it would be super cool if you could make your patch: …
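If the "inform users instead" route is taken, here is a hedged sketch of what that could look like (the message wording and placement are assumptions; it reuses `registered_node_names` and `node_names` from the block quoted earlier and fails instead of deleting anything automatically).

```yaml
# Hedged sketch only: assumed to run right after the node lists are gathered.
- name: Refuse to clean up node secrets automatically during etcd restore
  ansible.builtin.fail:
    msg: >-
      The restored etcd snapshot contains nodes that do not match the ansible
      inventory. Remove the stale <node>.node-password.rke2 secrets in
      kube-system (and the stale node objects) manually, or rejoin the nodes
      as described in the RKE2 backup/restore documentation.
  when: registered_node_names.stdout_lines | difference(node_names) | length > 0
```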
I'll just make a PR with this now.
I can agree that this part of the code is unnecessary, but I have covered both test-case scenarios (destroying the nodes and just recovering etcd), and the second one works fine. I don't think #335 should fail; instead I'd rather introduce a switch that destroys the secrets. I think that would cover everything. We should discuss it in #335.
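A hypothetical sketch of such a switch (the variable name `rke2_restore_delete_secrets` is an assumption, not an existing role variable): default it to false and gate the secret deletion on it, deleting every secret except the current play host's, as proposed above.

```yaml
# Hypothetical switch; would default to false in defaults/main.yml:
#   rke2_restore_delete_secrets: false
# The deletion task is then only run when explicitly requested.
- name: Remove old <node>.node-password.rke2 secrets
  ansible.builtin.shell: |
    {{ rke2_data_path }}/bin/kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
    delete secret {{ item }}.node-password.rke2 -n kube-system 2>&1 || true
  args:
    executable: /bin/bash
  with_items: "{{ registered_node_names.stdout_lines | difference([rke2_node_name]) }}"
  when: rke2_restore_delete_secrets | bool
  changed_when: false
```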
Description
This fixes #333
`register: node_names` registered an ansible_fact dict, so the next steps in this block were comparing a list to a dict and were triggered when not needed.
Type of change
How Has This Been Tested?
Created clean nodes and set credentials for an S3 restore, then ran the playbook which invoked the role.