🐛 refactored cleanup for clusterctl upgrade e2e #12775
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Hi @irapandey. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/area e2e-testing
• [FAILED] [340.655 seconds]
When testing clusterctl upgrades using ClusterClass (v1.10=>current) [ClusterClass] [It] Should create a management cluster and then upgrade all the providers [testing-ira]
/Users/ira/OpenSource/cluster-api/test/e2e/clusterctl_upgrade.go:276
Captured StdOut/StdErr Output >>
Creating cluster "clusterctl-upgrade-management-xb1e91" ...
• Ensuring node image (kindest/node:v1.33.0) 🖼 ...
✓ Ensuring node image (kindest/node:v1.33.0) 🖼
• Preparing nodes 📦 ...
✓ Preparing nodes 📦
• Writing configuration 📜 ...
✓ Writing configuration 📜
• Starting control-plane 🕹️ ...
✓ Starting control-plane 🕹️
• Installing CNI 🔌 ...
✓ Installing CNI 🔌
• Installing StorageClass 💾 ...
✓ Installing StorageClass 💾
I0918 10:27:32.469345 9116 warning_handler.go:64] "Cluster refers to ClusterClass clusterctl-upgrade/quick-start, but this ClusterClass does not exist. Cluster topology has not been fully validated. The ClusterClass must be created to reconcile the Cluster"
Failed to get nodes for the cluster: Get "https://127.0.0.1:50402/api/v1/nodes": dial tcp 127.0.0.1:50402: connect: connection refused
<< Captured StdOut/StdErr Output
Timeline >>
STEP: Cleaning up any leftover clusters from previous runs @ 09/18/25 10:25:33.559
STEP: Creating a kind cluster to be used as a new management cluster @ 09/18/25 10:25:33.636
INFO: Creating a kind cluster with name "clusterctl-upgrade-management-xb1e91"
INFO: The kubeconfig file for the kind cluster is /var/folders/3w/_89wp8yx15v76yjqkhbnz6xc0000gn/T/e2e-kind2869965825
INFO: Loading image: "gcr.io/k8s-staging-cluster-api/cluster-api-controller-arm64:dev"
INFO: Image gcr.io/k8s-staging-cluster-api/cluster-api-controller-arm64:dev is present in local container image cache
INFO: Loading image: "gcr.io/k8s-staging-cluster-api/kubeadm-bootstrap-controller-arm64:dev"
INFO: Image gcr.io/k8s-staging-cluster-api/kubeadm-bootstrap-controller-arm64:dev is present in local container image cache
INFO: Loading image: "gcr.io/k8s-staging-cluster-api/kubeadm-control-plane-controller-arm64:dev"
INFO: Image gcr.io/k8s-staging-cluster-api/kubeadm-control-plane-controller-arm64:dev is present in local container image cache
INFO: Loading image: "gcr.io/k8s-staging-cluster-api/capd-manager-arm64:dev"
INFO: Image gcr.io/k8s-staging-cluster-api/capd-manager-arm64:dev is present in local container image cache
INFO: Loading image: "gcr.io/k8s-staging-cluster-api/test-extension-arm64:dev"
INFO: Image gcr.io/k8s-staging-cluster-api/test-extension-arm64:dev is present in local container image cache
INFO: Loading image: "quay.io/jetstack/cert-manager-cainjector:v1.18.2"
INFO: Image quay.io/jetstack/cert-manager-cainjector:v1.18.2 is present in local container image cache
INFO: Loading image: "quay.io/jetstack/cert-manager-webhook:v1.18.2"
INFO: Image quay.io/jetstack/cert-manager-webhook:v1.18.2 is present in local container image cache
INFO: Loading image: "quay.io/jetstack/cert-manager-controller:v1.18.2"
INFO: Image quay.io/jetstack/cert-manager-controller:v1.18.2 is present in local container image cache
STEP: Turning the new cluster into a management cluster with older versions of providers @ 09/18/25 10:26:06.005
STEP: Downloading clusterctl binary from https://github.com/kubernetes-sigs/cluster-api/releases/download/v1.10.6/clusterctl-darwin-arm64 @ 09/18/25 10:26:06.005
STEP: Initializing the new management cluster with older versions of providers @ 09/18/25 10:26:15.687
INFO: clusterctl init --config /Users/ira/OpenSource/cluster-api/_artifacts/repository/clusterctl-config.yaml --kubeconfig /var/folders/3w/_89wp8yx15v76yjqkhbnz6xc0000gn/T/e2e-kind2869965825 --wait-providers --core cluster-api:v1.10.6 --bootstrap kubeadm:v1.10.6 --control-plane kubeadm:v1.10.6 --infrastructure docker:v1.10.6
INFO: Waiting for provider controllers to be running
STEP: Waiting for deployment capd-system/capd-controller-manager to be available @ 09/18/25 10:27:27.715
INFO: Creating log watcher for controller capd-system/capd-controller-manager, pod capd-controller-manager-776d88cc44-l88lm, container manager
STEP: Waiting for deployment capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager to be available @ 09/18/25 10:27:27.833
INFO: Creating log watcher for controller capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager, pod capi-kubeadm-bootstrap-controller-manager-6b8594d56f-9sxls, container manager
STEP: Waiting for deployment capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager to be available @ 09/18/25 10:27:27.847
INFO: Creating log watcher for controller capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager, pod capi-kubeadm-control-plane-controller-manager-b6fdf8c98-mzcj7, container manager
STEP: Waiting for deployment capi-system/capi-controller-manager to be available @ 09/18/25 10:27:27.854
INFO: Creating log watcher for controller capi-system/capi-controller-manager, pod capi-controller-manager-5547b8d7d7-qvmdr, container manager
STEP: THE MANAGEMENT CLUSTER WITH THE OLDER VERSION OF PROVIDERS IS UP&RUNNING! @ 09/18/25 10:27:28.125
STEP: Creating a namespace for hosting the clusterctl-upgrade test workload cluster @ 09/18/25 10:27:28.125
INFO: Creating namespace clusterctl-upgrade
INFO: Creating event watcher for namespace "clusterctl-upgrade"
STEP: Creating a test workload cluster @ 09/18/25 10:27:28.138
INFO: Creating the workload cluster with name "clusterctl-upgrade-workload-1d7sce" using the "(default)" template (Kubernetes v1.33.0, 1 control-plane machines, 1 worker machines)
INFO: Getting the cluster template yaml
INFO: clusterctl generate cluster clusterctl-upgrade-workload-1d7sce --infrastructure --kubernetes-version v1.33.0 --worker-machine-count 1 --flavor topology --target-namespace clusterctl-upgrade --config /Users/ira/OpenSource/cluster-api/_artifacts/repository/clusterctl-config.yaml --kubeconfig /var/folders/3w/_89wp8yx15v76yjqkhbnz6xc0000gn/T/e2e-kind2869965825 --control-plane-machine-count 1
INFO: Applying the cluster template yaml to the cluster in dry-run
INFO: Applying the cluster template yaml to the cluster
STEP: Calculating expected MachineDeployment and MachinePool Machine and Node counts @ 09/18/25 10:27:43.3
STEP: Expect 3 Machines and 1 MachinePool replicas to exist @ 09/18/25 10:27:43.325
STEP: Waiting for the machines to exist @ 09/18/25 10:27:43.325
[FAILED] in [It] - /Users/ira/OpenSource/cluster-api/test/e2e/clusterctl_upgrade.go:633 @ 09/18/25 10:27:53.327
STEP: Cleaning up workload cluster (early cleanup) @ 09/18/25 10:27:53.328
INFO: Found remaining kind cluster clusterctl-upgrade-management-xb1e91, cleaning it up
INFO: Successfully deleted kind cluster clusterctl-upgrade-management-xb1e91
INFO: Found remaining kind cluster clusterctl-upgrade-workload-1d7sce, cleaning it up
INFO: Error getting pod capd-system/capd-controller-manager-776d88cc44-l88lm, container manager: Get "https://127.0.0.1:50402/api/v1/namespaces/capd-system/pods/capd-controller-manager-776d88cc44-l88lm": dial tcp 127.0.0.1:50402: connect: connection refused - error from a previous attempt: unexpected EOF
INFO: Error getting pod capi-system/capi-controller-manager-5547b8d7d7-qvmdr, container manager: Get "https://127.0.0.1:50402/api/v1/namespaces/capi-system/pods/capi-controller-manager-5547b8d7d7-qvmdr": dial tcp 127.0.0.1:50402: connect: connection refused - error from a previous attempt: unexpected EOF
INFO: Error getting pod capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager-b6fdf8c98-mzcj7, container manager: Get "https://127.0.0.1:50402/api/v1/namespaces/capi-kubeadm-control-plane-system/pods/capi-kubeadm-control-plane-controller-manager-b6fdf8c98-mzcj7": dial tcp 127.0.0.1:50402: connect: connection refused - error from a previous attempt: unexpected EOF
INFO: Error getting pod capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager-6b8594d56f-9sxls, container manager: Get "https://127.0.0.1:50402/api/v1/namespaces/capi-kubeadm-bootstrap-system/pods/capi-kubeadm-bootstrap-controller-manager-6b8594d56f-9sxls": dial tcp 127.0.0.1:50402: connect: connection refused - error from a previous attempt: unexpected EOF
INFO: Successfully deleted kind cluster clusterctl-upgrade-workload-1d7sce
STEP: Cleaning up test namespace and workload cluster @ 09/18/25 10:28:14.174
STEP: Dumping logs from the "clusterctl-upgrade-workload-1d7sce" workload cluster @ 09/18/25 10:28:14.174
[FAILED] in [DeferCleanup (Each)] - /Users/ira/OpenSource/cluster-api/test/framework/cluster_proxy.go:418 @ 09/18/25 10:31:14.152
STEP: Dumping kind cluster logs @ 09/18/25 10:31:14.153
STEP: Disposing kind management cluster proxy @ 09/18/25 10:31:14.155
STEP: Disposing kind management cluster provider @ 09/18/25 10:31:14.155
<< Timeline
[FAILED] Timed out after 10.001s.
Timed out waiting for all Machines to exist
Expected
<int64>: 0
to equal
<int64>: 3
In [It] at: /Users/ira/OpenSource/cluster-api/test/e2e/clusterctl_upgrade.go:633 @ 09/18/25 10:27:53.327
Full Stack Trace
sigs.k8s.io/cluster-api/test/e2e.ClusterctlUpgradeSpec.func2()
/Users/ira/OpenSource/cluster-api/test/e2e/clusterctl_upgrade.go:633 +0x24a8
There were additional failures detected. To view them in detail run ginkgo -vv
------------------------------
[SynchronizedAfterSuite] PASSED [0.000 seconds]
[SynchronizedAfterSuite]
/Users/ira/OpenSource/cluster-api/test/e2e/e2e_suite_test.go:186
------------------------------
[SynchronizedAfterSuite] PASSED [351.015 seconds]
[SynchronizedAfterSuite]
/Users/ira/OpenSource/cluster-api/test/e2e/e2e_suite_test.go:186
Timeline >>
STEP: Dumping logs from the bootstrap cluster @ 09/18/25 10:31:14.202
STEP: Tearing down the management cluster @ 09/18/25 10:31:14.611
INFO: Got error while streaming logs for pod capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager-75f7886785-d6ftc, container manager: context canceled
INFO: Stopped streaming logs for pod capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager-75f7886785-d6ftc, container manager: context canceled
INFO: Got error while streaming logs for pod capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager-64d6b48fc9-w4jhj, container manager: context canceled
INFO: Stopped streaming logs for pod capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager-64d6b48fc9-w4jhj, container manager: context canceled
INFO: Got error while streaming logs for pod test-extension-system/test-extension-controller-manager-7dfdd9b4d7-x7ccz, container manager: context canceled
INFO: Stopped streaming logs for pod test-extension-system/test-extension-controller-manager-7dfdd9b4d7-x7ccz, container manager: context canceled
INFO: Got error while streaming logs for pod capd-system/capd-controller-manager-5cc7d7494-4h5ds, container manager: context canceled
INFO: Got error while streaming logs for pod capi-system/capi-controller-manager-5f6b575c65-j6c6r, container manager: context canceled
INFO: Stopped streaming logs for pod capd-system/capd-controller-manager-5cc7d7494-4h5ds, container manager: context canceled
INFO: Stopped streaming logs for pod capi-system/capi-controller-manager-5f6b575c65-j6c6r, container manager: context canceled
<< Timeline
------------------------------
[ReportAfterSuite] PASSED [0.003 seconds]
[ReportAfterSuite] Autogenerated ReportAfterSuite for --junit-report
autogenerated by Ginkgo
------------------------------
Summarizing 1 Failure:
[FAIL] When testing clusterctl upgrades using ClusterClass (v1.10=>current) [ClusterClass] [It] Should create a management cluster and then upgrade all the providers [testing-ira]
/Users/ira/OpenSource/cluster-api/test/e2e/clusterctl_upgrade.go:633
Ran 1 of 38 Specs in 498.868 seconds
FAIL! -- 0 Passed | 1 Failed | 0 Pending | 37 Skipped
Ginkgo ran 1 suite in 8m35.255712542s
Test Suite Failed
ERROR: Found unexpected running containers:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4e776d0e863d ghcr.io/github/github-mcp-server:latest "stdio" 10 minutes ago Up 10 minutes confident_proskuriakova
5e2be0583ed5 ghcr.io/github/github-mcp-server:latest "stdio" About a minute ago Up About a minute vigorous_nightingale
})

It("Should create a management cluster and then upgrade all the providers", func() {
	// Clean up any leftover clusters from previous test runs at the very beginning
I would prefer if we could avoid a huge refactoring like this. It took us a while to get to the current state and there is a lot of nuance in all of this (e.g. different apiVersions being available in different test cases).
Can we try to fix the issue by only improving AfterEach?
I think it's also easier to reason through one AfterEach block than trying to figure out how all of this can add up in various success and error cases.
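For illustration, a minimal sketch of that single-AfterEach shape, assuming spec-scoped variable names and the Dispose(ctx) methods shown in the diff under discussion; this is a sketch, not the code in this PR:

```go
package e2e

import (
	"context"

	. "github.com/onsi/ginkgo/v2"

	"sigs.k8s.io/cluster-api/test/framework"
	"sigs.k8s.io/cluster-api/test/framework/bootstrap"
)

var _ = Describe("clusterctl upgrade cleanup (sketch)", func() {
	var (
		ctx                       = context.Background()
		managementClusterProvider bootstrap.ClusterProvider
		managementClusterProxy    framework.ClusterProxy
	)

	// A single spec-level AfterEach that tolerates partial setup: whatever was
	// created before a failure gets disposed, anything still nil is skipped.
	AfterEach(func() {
		if managementClusterProxy != nil {
			managementClusterProxy.Dispose(ctx)
		}
		if managementClusterProvider != nil {
			managementClusterProvider.Dispose(ctx)
		}
	})

	It("Should create a management cluster and then upgrade all the providers", func() {
		// ... create the kind management cluster, assign the two variables,
		// then run the upgrade steps ...
	})
})
```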
I can give it another go for sure - I did observe that AfterEach wasn't being executed and thought it might be because of the nested AfterEach blocks.
A couple more days working on this should make things clearer.
If not, I'll keep the changes minimal.
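As a side note on nested AfterEach, here is a small standalone Ginkgo example (made-up names, unrelated to this repo) showing that AfterEach blocks run innermost-first and only for specs that actually start:

```go
package example

import (
	"fmt"
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

func TestCleanupOrdering(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "cleanup ordering")
}

var _ = Describe("outer", func() {
	AfterEach(func() { fmt.Println("outer AfterEach (runs second)") })

	Context("inner", func() {
		AfterEach(func() { fmt.Println("inner AfterEach (runs first)") })

		It("runs both AfterEach blocks, pass or fail", func() {
			// If the suite never reaches this spec (e.g. it hangs before specs
			// start), neither AfterEach runs.
			Expect(true).To(BeTrue())
		})
	})
})
```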
I did observe that AfterEach wasn't being executed

I think that should only happen if the test case is not started (we run a bunch of code before entering the test case; see cluster-api/test/e2e/e2e_suite_test.go, line 135 at d30d3e3):
var _ = SynchronizedBeforeSuite(func() []byte {
Looking at #12578 (comment), I think your test is stuck in SynchronizedBeforeSuite. But it definitely looks like the clusterctl upgrade test is not started.
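For context, a minimal sketch of that SynchronizedBeforeSuite / SynchronizedAfterSuite split (simplified, with a made-up kubeconfig placeholder; not the actual e2e_suite_test.go):

```go
package example

import (
	. "github.com/onsi/ginkgo/v2"
)

var _ = SynchronizedBeforeSuite(func() []byte {
	// Runs on process 1 only: load the e2e config, create the bootstrap kind
	// cluster, install providers, etc. If this hangs or fails, the specs never
	// start and their AfterEach blocks never run.
	kubeconfigPath := "/tmp/bootstrap.kubeconfig" // placeholder
	return []byte(kubeconfigPath)
}, func(data []byte) {
	// Runs on every parallel process: reconnect to the shared bootstrap cluster.
	_ = string(data)
})

var _ = SynchronizedAfterSuite(func() {
	// Runs on every process after its specs finish.
}, func() {
	// Runs on process 1 only, after all processes are done: dump logs and
	// tear down the bootstrap cluster.
})
```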
managementClusterProxy.Dispose(ctx)
managementClusterProvider.Dispose(ctx)
Do we know why this doesn't work in some cases?
I would have expected it to work in cases where all tests pass (e.g. https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-mink8s-release-1-9/1951726318206849024)
Also, do we have a recent occurrence of this issue?
I couldn't find one but maybe my search is wrong: https://storage.googleapis.com/k8s-triage/index.html?text=Found%20unexpected%20running%20containers&job=.*cluster-api.*e2e.*main&xjob=.*-provider-.*
Hey @sbueringer - I tried a couple of times on a VM this time and I wasn't able to reproduce the failure
@arshadd-b - Have you seen the flake recently?
What this PR does / why we need it:
This PR uses DeferCleanup instead of AfterEach to ensure a more reliable cleanup process.
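A rough sketch of the pattern (assuming the spec-scoped variable names from clusterctl_upgrade.go; this is not the exact diff): each cleanup is registered with DeferCleanup right after the resource it owns is created, so it still runs if a later step fails, and cleanups execute in reverse registration order. Compared to AfterEach, this keeps each teardown next to the code that created the resource.

```go
package e2e

import (
	"context"

	. "github.com/onsi/ginkgo/v2"

	"sigs.k8s.io/cluster-api/test/framework"
	"sigs.k8s.io/cluster-api/test/framework/bootstrap"
)

var _ = Describe("clusterctl upgrade with DeferCleanup (sketch)", func() {
	It("Should create a management cluster and then upgrade all the providers", func() {
		ctx := context.Background()

		var (
			managementClusterProvider bootstrap.ClusterProvider
			managementClusterProxy    framework.ClusterProxy
		)

		// ... create the kind management cluster and assign the two variables ...

		// Registered immediately after creation: DeferCleanup callbacks run
		// after the spec body, last-registered first, whether the spec passed
		// or failed, so leftover kind clusters and containers get removed.
		DeferCleanup(func() {
			if managementClusterProxy != nil {
				managementClusterProxy.Dispose(ctx)
			}
			if managementClusterProvider != nil {
				managementClusterProvider.Dispose(ctx)
			}
		})

		// ... clusterctl init with the old providers, create the workload
		// cluster, then upgrade all providers to the current version ...
	})
})
```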
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #12578
/area e2e-testing