
Conversation

irapandey
Contributor

@irapandey irapandey commented Sep 17, 2025

What this PR does / why we need it:
This PR uses DeferCleanup instead of AfterEach to ensure a more reliable cleanup process.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #12578

/area e2e-testing
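
For context, the general shape of the change is to register cleanup with DeferCleanup at the point where a resource is created, instead of relying on an AfterEach block to run at the end. The sketch below uses placeholder helpers (setupManagementCluster, disposeManagementCluster) and is not the actual diff:

```go
// Minimal sketch of the pattern, with placeholder helpers; not the real test code.
package e2e_test

import (
	. "github.com/onsi/ginkgo/v2"
)

// managementCluster stands in for the real management cluster provider/proxy pair.
type managementCluster struct{ name string }

func setupManagementCluster() *managementCluster    { return &managementCluster{name: "sketch"} }
func disposeManagementCluster(_ *managementCluster) {}

var _ = Describe("clusterctl upgrade cleanup (sketch)", func() {
	It("creates a management cluster and then upgrades all the providers", func() {
		cluster := setupManagementCluster()

		// Cleanup is registered right where the resource is created, so it runs
		// even if a later step in this spec fails; DeferCleanup callbacks run in
		// reverse registration order, like defer.
		DeferCleanup(func() {
			disposeManagementCluster(cluster)
		})

		// ... remaining upgrade steps and assertions ...
	})
})
```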

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Sep 17, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/needs-area PR is missing an area label label Sep 17, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign fabriziopandini for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 17, 2025
@k8s-ci-robot
Contributor

Hi @irapandey. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@irapandey irapandey changed the title from "refactored cleanup" to "🐛 refactored cleanup for clusterctl upgrade e2e" on Sep 17, 2025
@irapandey
Contributor Author

/area e2e-testing

@k8s-ci-robot k8s-ci-robot added area/e2e-testing Issues or PRs related to e2e testing and removed do-not-merge/needs-area PR is missing an area label labels Sep 17, 2025
@irapandey
Contributor Author

• [FAILED] [340.655 seconds]
When testing clusterctl upgrades using ClusterClass (v1.10=>current) [ClusterClass] [It] Should create a management cluster and then upgrade all the providers [testing-ira]
/Users/ira/OpenSource/cluster-api/test/e2e/clusterctl_upgrade.go:276

  Captured StdOut/StdErr Output >>
  Creating cluster "clusterctl-upgrade-management-xb1e91" ...
   • Ensuring node image (kindest/node:v1.33.0) 🖼  ...
   ✓ Ensuring node image (kindest/node:v1.33.0) 🖼
   • Preparing nodes 📦   ...
   ✓ Preparing nodes 📦 
   • Writing configuration 📜  ...
   ✓ Writing configuration 📜
   • Starting control-plane 🕹️  ...
   ✓ Starting control-plane 🕹️
   • Installing CNI 🔌  ...
   ✓ Installing CNI 🔌
   • Installing StorageClass 💾  ...
   ✓ Installing StorageClass 💾
  I0918 10:27:32.469345    9116 warning_handler.go:64] "Cluster refers to ClusterClass clusterctl-upgrade/quick-start, but this ClusterClass does not exist. Cluster topology has not been fully validated. The ClusterClass must be created to reconcile the Cluster"
  Failed to get nodes for the cluster: Get "https://127.0.0.1:50402/api/v1/nodes": dial tcp 127.0.0.1:50402: connect: connection refused
  << Captured StdOut/StdErr Output

  Timeline >>
  STEP: Cleaning up any leftover clusters from previous runs @ 09/18/25 10:25:33.559
  STEP: Creating a kind cluster to be used as a new management cluster @ 09/18/25 10:25:33.636
  INFO: Creating a kind cluster with name "clusterctl-upgrade-management-xb1e91"
  INFO: The kubeconfig file for the kind cluster is /var/folders/3w/_89wp8yx15v76yjqkhbnz6xc0000gn/T/e2e-kind2869965825
  INFO: Loading image: "gcr.io/k8s-staging-cluster-api/cluster-api-controller-arm64:dev"
  INFO: Image gcr.io/k8s-staging-cluster-api/cluster-api-controller-arm64:dev is present in local container image cache
  INFO: Loading image: "gcr.io/k8s-staging-cluster-api/kubeadm-bootstrap-controller-arm64:dev"
  INFO: Image gcr.io/k8s-staging-cluster-api/kubeadm-bootstrap-controller-arm64:dev is present in local container image cache
  INFO: Loading image: "gcr.io/k8s-staging-cluster-api/kubeadm-control-plane-controller-arm64:dev"
  INFO: Image gcr.io/k8s-staging-cluster-api/kubeadm-control-plane-controller-arm64:dev is present in local container image cache
  INFO: Loading image: "gcr.io/k8s-staging-cluster-api/capd-manager-arm64:dev"
  INFO: Image gcr.io/k8s-staging-cluster-api/capd-manager-arm64:dev is present in local container image cache
  INFO: Loading image: "gcr.io/k8s-staging-cluster-api/test-extension-arm64:dev"
  INFO: Image gcr.io/k8s-staging-cluster-api/test-extension-arm64:dev is present in local container image cache
  INFO: Loading image: "quay.io/jetstack/cert-manager-cainjector:v1.18.2"
  INFO: Image quay.io/jetstack/cert-manager-cainjector:v1.18.2 is present in local container image cache
  INFO: Loading image: "quay.io/jetstack/cert-manager-webhook:v1.18.2"
  INFO: Image quay.io/jetstack/cert-manager-webhook:v1.18.2 is present in local container image cache
  INFO: Loading image: "quay.io/jetstack/cert-manager-controller:v1.18.2"
  INFO: Image quay.io/jetstack/cert-manager-controller:v1.18.2 is present in local container image cache
  STEP: Turning the new cluster into a management cluster with older versions of providers @ 09/18/25 10:26:06.005
  STEP: Downloading clusterctl binary from https://github.com/kubernetes-sigs/cluster-api/releases/download/v1.10.6/clusterctl-darwin-arm64 @ 09/18/25 10:26:06.005
  STEP: Initializing the new management cluster with older versions of providers @ 09/18/25 10:26:15.687
  INFO: clusterctl init --config /Users/ira/OpenSource/cluster-api/_artifacts/repository/clusterctl-config.yaml --kubeconfig /var/folders/3w/_89wp8yx15v76yjqkhbnz6xc0000gn/T/e2e-kind2869965825 --wait-providers --core cluster-api:v1.10.6 --bootstrap kubeadm:v1.10.6 --control-plane kubeadm:v1.10.6 --infrastructure docker:v1.10.6
  INFO: Waiting for provider controllers to be running
  STEP: Waiting for deployment capd-system/capd-controller-manager to be available @ 09/18/25 10:27:27.715
  INFO: Creating log watcher for controller capd-system/capd-controller-manager, pod capd-controller-manager-776d88cc44-l88lm, container manager
  STEP: Waiting for deployment capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager to be available @ 09/18/25 10:27:27.833
  INFO: Creating log watcher for controller capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager, pod capi-kubeadm-bootstrap-controller-manager-6b8594d56f-9sxls, container manager
  STEP: Waiting for deployment capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager to be available @ 09/18/25 10:27:27.847
  INFO: Creating log watcher for controller capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager, pod capi-kubeadm-control-plane-controller-manager-b6fdf8c98-mzcj7, container manager
  STEP: Waiting for deployment capi-system/capi-controller-manager to be available @ 09/18/25 10:27:27.854
  INFO: Creating log watcher for controller capi-system/capi-controller-manager, pod capi-controller-manager-5547b8d7d7-qvmdr, container manager
  STEP: THE MANAGEMENT CLUSTER WITH THE OLDER VERSION OF PROVIDERS IS UP&RUNNING! @ 09/18/25 10:27:28.125
  STEP: Creating a namespace for hosting the clusterctl-upgrade test workload cluster @ 09/18/25 10:27:28.125
  INFO: Creating namespace clusterctl-upgrade
  INFO: Creating event watcher for namespace "clusterctl-upgrade"
  STEP: Creating a test workload cluster @ 09/18/25 10:27:28.138
  INFO: Creating the workload cluster with name "clusterctl-upgrade-workload-1d7sce" using the "(default)" template (Kubernetes v1.33.0, 1 control-plane machines, 1 worker machines)
  INFO: Getting the cluster template yaml
  INFO: clusterctl generate cluster clusterctl-upgrade-workload-1d7sce --infrastructure  --kubernetes-version v1.33.0 --worker-machine-count 1 --flavor topology --target-namespace clusterctl-upgrade --config /Users/ira/OpenSource/cluster-api/_artifacts/repository/clusterctl-config.yaml --kubeconfig /var/folders/3w/_89wp8yx15v76yjqkhbnz6xc0000gn/T/e2e-kind2869965825 --control-plane-machine-count 1
  INFO: Applying the cluster template yaml to the cluster in dry-run
  INFO: Applying the cluster template yaml to the cluster
  STEP: Calculating expected MachineDeployment and MachinePool Machine and Node counts @ 09/18/25 10:27:43.3
  STEP: Expect 3 Machines and 1 MachinePool replicas to exist @ 09/18/25 10:27:43.325
  STEP: Waiting for the machines to exist @ 09/18/25 10:27:43.325
  [FAILED] in [It] - /Users/ira/OpenSource/cluster-api/test/e2e/clusterctl_upgrade.go:633 @ 09/18/25 10:27:53.327
  STEP: Cleaning up workload cluster (early cleanup) @ 09/18/25 10:27:53.328
  INFO: Found remaining kind cluster clusterctl-upgrade-management-xb1e91, cleaning it up
  INFO: Successfully deleted kind cluster clusterctl-upgrade-management-xb1e91
  INFO: Found remaining kind cluster clusterctl-upgrade-workload-1d7sce, cleaning it up
  INFO: Error getting pod capd-system/capd-controller-manager-776d88cc44-l88lm, container manager: Get "https://127.0.0.1:50402/api/v1/namespaces/capd-system/pods/capd-controller-manager-776d88cc44-l88lm": dial tcp 127.0.0.1:50402: connect: connection refused - error from a previous attempt: unexpected EOF
  INFO: Error getting pod capi-system/capi-controller-manager-5547b8d7d7-qvmdr, container manager: Get "https://127.0.0.1:50402/api/v1/namespaces/capi-system/pods/capi-controller-manager-5547b8d7d7-qvmdr": dial tcp 127.0.0.1:50402: connect: connection refused - error from a previous attempt: unexpected EOF
  INFO: Error getting pod capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager-b6fdf8c98-mzcj7, container manager: Get "https://127.0.0.1:50402/api/v1/namespaces/capi-kubeadm-control-plane-system/pods/capi-kubeadm-control-plane-controller-manager-b6fdf8c98-mzcj7": dial tcp 127.0.0.1:50402: connect: connection refused - error from a previous attempt: unexpected EOF
  INFO: Error getting pod capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager-6b8594d56f-9sxls, container manager: Get "https://127.0.0.1:50402/api/v1/namespaces/capi-kubeadm-bootstrap-system/pods/capi-kubeadm-bootstrap-controller-manager-6b8594d56f-9sxls": dial tcp 127.0.0.1:50402: connect: connection refused - error from a previous attempt: unexpected EOF
  INFO: Successfully deleted kind cluster clusterctl-upgrade-workload-1d7sce
  STEP: Cleaning up test namespace and workload cluster @ 09/18/25 10:28:14.174
  STEP: Dumping logs from the "clusterctl-upgrade-workload-1d7sce" workload cluster @ 09/18/25 10:28:14.174
  [FAILED] in [DeferCleanup (Each)] - /Users/ira/OpenSource/cluster-api/test/framework/cluster_proxy.go:418 @ 09/18/25 10:31:14.152
  STEP: Dumping kind cluster logs @ 09/18/25 10:31:14.153
  STEP: Disposing kind management cluster proxy @ 09/18/25 10:31:14.155
  STEP: Disposing kind management cluster provider @ 09/18/25 10:31:14.155
  << Timeline

  [FAILED] Timed out after 10.001s.
  Timed out waiting for all Machines to exist
  Expected
      <int64>: 0
  to equal
      <int64>: 3
  In [It] at: /Users/ira/OpenSource/cluster-api/test/e2e/clusterctl_upgrade.go:633 @ 09/18/25 10:27:53.327

  Full Stack Trace
    sigs.k8s.io/cluster-api/test/e2e.ClusterctlUpgradeSpec.func2()
    	/Users/ira/OpenSource/cluster-api/test/e2e/clusterctl_upgrade.go:633 +0x24a8

  There were additional failures detected.  To view them in detail run ginkgo -vv
------------------------------
[SynchronizedAfterSuite] PASSED [0.000 seconds]
[SynchronizedAfterSuite] 
/Users/ira/OpenSource/cluster-api/test/e2e/e2e_suite_test.go:186
------------------------------
[SynchronizedAfterSuite] PASSED [351.015 seconds]
[SynchronizedAfterSuite] 
/Users/ira/OpenSource/cluster-api/test/e2e/e2e_suite_test.go:186

  Timeline >>
  STEP: Dumping logs from the bootstrap cluster @ 09/18/25 10:31:14.202
  STEP: Tearing down the management cluster @ 09/18/25 10:31:14.611
  INFO: Got error while streaming logs for pod capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager-75f7886785-d6ftc, container manager: context canceled
  INFO: Stopped streaming logs for pod capi-kubeadm-control-plane-system/capi-kubeadm-control-plane-controller-manager-75f7886785-d6ftc, container manager: context canceled
  INFO: Got error while streaming logs for pod capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager-64d6b48fc9-w4jhj, container manager: context canceled
  INFO: Stopped streaming logs for pod capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager-64d6b48fc9-w4jhj, container manager: context canceled
  INFO: Got error while streaming logs for pod test-extension-system/test-extension-controller-manager-7dfdd9b4d7-x7ccz, container manager: context canceled
  INFO: Stopped streaming logs for pod test-extension-system/test-extension-controller-manager-7dfdd9b4d7-x7ccz, container manager: context canceled
  INFO: Got error while streaming logs for pod capd-system/capd-controller-manager-5cc7d7494-4h5ds, container manager: context canceled
  INFO: Got error while streaming logs for pod capi-system/capi-controller-manager-5f6b575c65-j6c6r, container manager: context canceled
  INFO: Stopped streaming logs for pod capd-system/capd-controller-manager-5cc7d7494-4h5ds, container manager: context canceled
  INFO: Stopped streaming logs for pod capi-system/capi-controller-manager-5f6b575c65-j6c6r, container manager: context canceled
  << Timeline
------------------------------
[ReportAfterSuite] PASSED [0.003 seconds]
[ReportAfterSuite] Autogenerated ReportAfterSuite for --junit-report
autogenerated by Ginkgo
------------------------------

Summarizing 1 Failure:
  [FAIL] When testing clusterctl upgrades using ClusterClass (v1.10=>current) [ClusterClass] [It] Should create a management cluster and then upgrade all the providers [testing-ira]
  /Users/ira/OpenSource/cluster-api/test/e2e/clusterctl_upgrade.go:633

Ran 1 of 38 Specs in 498.868 seconds
FAIL! -- 0 Passed | 1 Failed | 0 Pending | 37 Skipped


Ginkgo ran 1 suite in 8m35.255712542s

Test Suite Failed
ERROR: Found unexpected running containers:

CONTAINER ID   IMAGE                                     COMMAND   CREATED              STATUS              PORTS     NAMES
4e776d0e863d   ghcr.io/github/github-mcp-server:latest   "stdio"   10 minutes ago       Up 10 minutes                 confident_proskuriakova
5e2be0583ed5   ghcr.io/github/github-mcp-server:latest   "stdio"   About a minute ago   Up About a minute             vigorous_nightingale
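
The trailing "Found unexpected running containers" error is produced by a post-suite check that fails whenever containers are still running after the tests finish. How that check is implemented is not shown here; the Go sketch below is only a rough illustration of the idea (the real check may well be a shell script):

```go
// Rough sketch of a post-suite leftover-container check; assumes the check just
// shells out to `docker ps`. Not the project's actual script.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func main() {
	// List every running container as "ID IMAGE NAMES".
	out, err := exec.Command("docker", "ps", "--format", "{{.ID}} {{.Image}} {{.Names}}").Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, "failed to list containers:", err)
		os.Exit(1)
	}
	if running := strings.TrimSpace(string(out)); running != "" {
		// Anything still running after the suite is treated as a leak.
		fmt.Fprintln(os.Stderr, "ERROR: Found unexpected running containers:")
		fmt.Fprintln(os.Stderr, running)
		os.Exit(1)
	}
}
```

In this particular run the flagged containers are unrelated local github-mcp-server containers, which a blanket check like this would flag just the same.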

})

It("Should create a management cluster and then upgrade all the providers", func() {
// Clean up any leftover clusters from previous test runs at the very beginning
Member

@sbueringer sbueringer Sep 18, 2025

I would prefer if we could avoid a huge refactoring like this. It took us a while to get to the current state and there is a lot of nuance in all of this (e.g. different apiVersions being available in different test cases).

Can we try to fix the issue by only improving AfterEach?

I think it's also easier to reason through one AfterEach block than to try to figure out how all of this adds up in the various success and error cases.
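
To make the suggestion concrete, here is a hedged sketch of a single, defensive AfterEach. The variable and method names mirror the diff and log in this thread (framework.ClusterProxy and bootstrap.ClusterProvider from the test framework), but the surrounding structure is illustrative rather than the real spec:

```go
// Illustrative only; the real spec assigns these variables during setup.
package e2e_test

import (
	"context"

	. "github.com/onsi/ginkgo/v2"

	"sigs.k8s.io/cluster-api/test/framework"
	"sigs.k8s.io/cluster-api/test/framework/bootstrap"
)

var _ = Describe("clusterctl upgrade cleanup via AfterEach (sketch)", func() {
	var (
		ctx                       = context.Background()
		managementClusterProvider bootstrap.ClusterProvider
		managementClusterProxy    framework.ClusterProxy
	)

	AfterEach(func() {
		// Each step is guarded so that a failure before (or during) setup still
		// disposes whatever was actually created, and a nil proxy or provider is
		// simply skipped instead of panicking.
		if managementClusterProxy != nil {
			managementClusterProxy.Dispose(ctx)
		}
		if managementClusterProvider != nil {
			managementClusterProvider.Dispose(ctx)
		}
	})

	// The BeforeEach / It bodies that assign managementClusterProvider and
	// managementClusterProxy are omitted here.
})
```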

Contributor Author

I can give it another go for sure - I did observe that AfterEach wasn't being executed and thought it might be because of the nested AfterEach blocks.

I think a couple more days working on this might make it click.

If not, I'll keep the changes minimal.

Member

@sbueringer sbueringer Sep 18, 2025

I did observe that AfterEach wasn't being executed

I think that should only happen if the test case is not started (we run a bunch of code before entering the test case, see:

var _ = SynchronizedBeforeSuite(func() []byte {
)
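
For readers following along: SynchronizedBeforeSuite runs before any spec, so if it hangs or fails, no It body starts and no per-spec AfterEach or DeferCleanup ever fires. A minimal, generic sketch of that ordering (not the suite's actual setup code):

```go
// Generic ordering sketch, not the suite's actual setup code.
package e2e_test

import (
	. "github.com/onsi/ginkgo/v2"
)

var _ = SynchronizedBeforeSuite(func() []byte {
	// Runs once (on process 1) before any spec starts; if this hangs or fails,
	// no It body runs and therefore no AfterEach or DeferCleanup runs either.
	GinkgoWriter.Println("suite-level setup, e.g. building the bootstrap cluster")
	return nil
}, func(_ []byte) {
	// Runs on every parallel process once process 1 has finished.
})

var _ = Describe("ordering (sketch)", func() {
	AfterEach(func() {
		// Only runs for specs that actually started.
		GinkgoWriter.Println("per-spec cleanup")
	})

	It("runs only after SynchronizedBeforeSuite has completed", func() {})
})
```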

Member

@sbueringer sbueringer Sep 18, 2025

Looking at #12578 (comment), I think your test is stuck in SynchronizedBeforeSuite. But it definitely looks like the clusterctl upgrade test was never started.

Comment on lines -823 to -824
managementClusterProxy.Dispose(ctx)
managementClusterProvider.Dispose(ctx)
Member

@sbueringer sbueringer Sep 18, 2025

Do we know why this doesn't work in some cases?

I would have expected it to work in cases where all tests pass (e.g. https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-mink8s-release-1-9/1951726318206849024)
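
For the failing case at least, the mechanics are straightforward: once an assertion fails, Ginkgo aborts the It body, so Dispose calls placed at the end of the spec are never reached, while a DeferCleanup registered earlier still fires. Why containers would also be left behind in an all-green run is a separate question. A minimal, generic illustration of the abort behavior (not the real test code):

```go
// Self-contained illustration; the failing assertion mirrors the
// "Expected 0 to equal 3" failure in the log above.
package e2e_test

import (
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

var _ = Describe("why trailing cleanup can be skipped (sketch)", func() {
	It("never reaches code placed after a failed assertion", func() {
		DeferCleanup(func() {
			// Still runs even though the assertion below fails.
			GinkgoWriter.Println("DeferCleanup ran")
		})

		Expect(0).To(Equal(3)) // fails and aborts the spec here

		// Never reached; this is roughly where the removed
		// managementClusterProxy.Dispose(ctx) / managementClusterProvider.Dispose(ctx)
		// calls sat at the end of the spec body.
		GinkgoWriter.Println("trailing cleanup would run here")
	})
})
```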

Member

Also, do we have a recent occurrence of this issue?

I couldn't find one but maybe my search is wrong: https://storage.googleapis.com/k8s-triage/index.html?text=Found%20unexpected%20running%20containers&job=.*cluster-api.*e2e.*main&xjob=.*-provider-.*

Contributor Author

Hey @sbueringer - this time I tried a couple of runs on a VM and wasn't able to reproduce the failure.

@arshadd-b - Have you seen the flake recently?

Labels: area/e2e-testing, cncf-cla: yes, needs-ok-to-test, size/L

Successfully merging this pull request may close these issues: [e2e test flake] Found unexpected running containers