[🐛 Bug]: Nodes stuck in DRAINING #2654

Open

farioas opened this issue Feb 12, 2025 · 20 comments
@farioas

farioas commented Feb 12, 2025

What happened?

I'm using KEDA with drain session = 1.

After the session finishes, the node gets stuck and won't deprovision.
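
For context, "drain session = 1" corresponds to the node flag --drain-after-session-count 1 (visible in the node config dump later in this thread). A minimal sketch of the equivalent node container setting, assuming the SE_DRAIN_AFTER_SESSION_COUNT env var that docker-selenium maps to that flag:

# node container env (assumption: maps to --drain-after-session-count)
- name: SE_DRAIN_AFTER_SESSION_COUNT
  value: "1"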

At the same time I see these messages in the Grid:

2025-02-12 17:50:57,374 DEBUG [Probe.Liveness] - Session Queue Size: 60, Session Count: 0, Max Session: 0
2025-02-12 17:50:57,375 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.

Command used to start Selenium Grid with Docker (or Kubernetes)

Helm chart 0.39.2
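
For reference, a deployment of this chart version is typically applied along these lines (release name, namespace, and values file name are assumptions, not taken from the report):

helm repo add docker-selenium https://www.selenium.dev/docker-selenium
helm upgrade --install selenium-grid docker-selenium/selenium-grid \
  --namespace selenium-grid --create-namespace \
  --version 0.39.2 \
  -f values-override.yaml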

Relevant log output

16:31:38.418 INFO [NodeServer$2.lambda$start$1] - Sending registration event...
16:31:43.421 INFO [NodeServer$2.lambda$start$1] - Sending registration event...
16:31:46.218 INFO [NodeServer.lambda$createHandlers$2] - Node has been added
16:31:48.445 INFO [LocalNode.newSession] - Session created by the Node. Id: d712a9e4ffc7956978a907188bd423cd, Caps: Capabilities {acceptInsecureCerts: false, browserName: chrome, browserVersion: 133.0.6943.53, chrome: {chromedriverVersion: 133.0.6943.53 (9a80935019b0..., userDataDir: /tmp/.org.chromium.Chromium...}, fedcm:accounts: true, goog:chromeOptions: {debuggerAddress: localhost:40295}, networkConnectionEnabled: false, pageLoadStrategy: normal, platformName: linux, proxy: Proxy(), se:bidiEnabled: false, se:cdp: wss://selenium-grid.XXXX..., se:cdpVersion: 133.0.6943.53, se:containerName: selenium-grid-selenium-node..., se:noVncPort: 7900, se:vnc: wss://selenium-grid.appx.hu..., se:vncEnabled: true, se:vncLocalAddress: ws://10.1.118.22:7900, setWindowRect: true, strictFileInteractability: false, timeouts: {implicit: 0, pageLoad: 300000, script: 30000}, unhandledPromptBehavior: dismiss and notify, webauthn:extension:credBlob: true, webauthn:extension:largeBlob: true, webauthn:extension:minPinLength: true, webauthn:extension:prf: true, webauthn:virtualAuthenticators: true}
16:31:48.447 INFO [LocalNode.checkSessionCount] - Draining Node, configured sessions value (1) has been reached.
2025-02-12 16:32:13,351 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:32:18,227 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:32:23,305 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:32:28,931 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:32:33,307 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:32:36,334 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:32:43,333 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:32:48,730 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:32:53,351 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:32:58,129 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:33:03,305 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:33:07,628 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:33:13,355 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:33:19,381 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:33:23,336 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:33:29,363 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:33:33,359 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:33:39,389 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:33:43,332 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:33:49,359 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:33:53,346 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:33:56,529 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:34:03,307 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:34:09,031 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:34:13,347 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:34:17,729 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:34:23,327 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:34:26,732 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:34:33,311 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:34:36,728 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:34:43,322 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:34:46,835 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:34:53,354 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:34:57,727 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:35:03,372 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:35:06,731 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:35:13,325 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:35:16,430 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:35:23,345 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:35:26,432 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:35:33,304 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:35:37,327 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:35:43,334 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:35:46,734 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:35:53,322 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:35:56,828 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:36:03,303 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:36:06,827 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:36:13,323 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:36:17,827 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:36:23,302 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:36:26,529 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:36:33,351 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:36:38,034 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:36:43,320 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:36:47,637 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:36:53,321 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:36:56,927 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:37:03,322 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:37:06,828 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:37:13,333 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:37:17,236 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:37:23,351 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:37:28,130 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:37:33,323 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:37:37,428 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:37:43,362 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:37:46,930 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:37:53,323 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:37:56,526 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:38:03,303 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:38:07,530 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:38:13,322 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:38:17,226 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:38:23,321 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:38:26,527 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:38:33,343 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:38:38,127 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:38:43,311 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:38:48,227 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:38:53,301 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:38:57,637 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:39:03,310 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING
2025-02-12 16:39:06,336 [Probe.Liveness] - Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.
2025-02-12 16:39:13,321 [Probe.Liveness] - Node responds the ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 with status: DRAINING

Operating System

EKS

Docker Selenium version (image tag)

4.28.1-20250202

Selenium Grid chart version (chart version)

0.39.2


@farioas, thank you for creating this issue. We will troubleshoot it as soon as we can.


Info for maintainers

Triage this issue by using labels.

If information is missing, add a helpful comment and then I-issue-template label.

If the issue is a question, add the I-question label.

If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.

If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.

After troubleshooting the issue, please add the R-awaiting answer label.

Thank you!

@VietND96
Member

Is this a Deployment or a Job?

@farioas
Author

farioas commented Feb 13, 2025

It's a Job.

@VietND96
Member

I think this is expected behavior.
The message

16:31:48.447 INFO [LocalNode.checkSessionCount] - Draining Node, configured sessions value (1) has been reached.

informs that max-sessions has been reached, and the Node status is switching from UP to DRAINING. The node will be drained once the current session completes.
I am not sure whether the Node still had a session ongoing at that point. I will probably update the probe to print the number of ongoing sessions together with the Node status.
Regarding the message

Node ID: bb6c2956-d79f-4727-99da-a96dd2ad0ba6 is not found in the Grid. Node is not ready.

probably the Node could not reach the Hub status endpoint to double-check the NodeId there. Can you run printenv in one of the Pods to dump all container env vars so I can check further?
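
For reference, a one-liner along these lines captures that output (namespace and pod name are illustrative):

kubectl -n selenium-grid exec selenium-grid-selenium-node-chrome-xxxxx-xxxxx -- printenv | sort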

@farioas
Author

farioas commented Feb 18, 2025

I already tried to work around this strange Hub behavior, where it kills itself because of:

It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.

This worked for a while:

hub:
  livenessProbe:
    failureThreshold: 50

After a hub restart, the old nodes (provisioned but never given a task) do not reconnect to the grid, yet KEDA sees all the provisioned nodes and does nothing - so we end up with an overfilled queue and idle provisioned nodes.

For now, I've been able to avoid this behavior by switching to distributed mode with Redis as an external data store.

@VietND96
Member

That is the liveness probe in the Hub: it fails once the queue is up but the session count stays at 0 for a while, until failureThreshold is reached.
From your description, I understand the issue is that the Hub gets restarted. After the restart, the existing Nodes cannot register back with the Hub, so no Node is attached to the Hub while a number of Node pods are still running. That confuses the KEDA scaler into thinking there are already enough nodes, so no further scaling happens.
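
To make the scaler side concrete: the chart generates a KEDA ScaledJob whose trigger queries the Grid for queue size and session counts, roughly shaped like this sketch (field names follow KEDA's selenium-grid scaler; the exact URL and values depend on the chart version and release name):

triggers:
  - type: selenium-grid
    metadata:
      url: http://selenium-grid-selenium-router.selenium-grid:4444/graphql
      browserName: chrome
      platformName: linux

If no Node ever registers, the scaler still sees the queued requests, but the ScaledJob's running-job count caps how many new jobs it creates, which matches the stall described above.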

@VietND96
Member

I also discovered a scenario where the Hub gets restarted (e.g., it takes over 1 minute to become fully ready). In the meantime, Node pods are scaled up and start registration; the register-period decides how long a Node will try to register with the Distributor, and after that the Node will not attempt to register again. In the Helm chart, if the Node startup probe is enabled, the probe period is used as that register period (around 60s as I see). If the Hub comes back after this period, the Node pod keeps running there and does nothing. This leads to the above scenario where autoscaling works incorrectly.
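
One way to widen that registration window is to raise the node's register period and cycle; a sketch using the env vars docker-selenium reads for the --register-period/--register-cycle flags (values illustrative, and assuming the startup probe window is raised to match):

chromeNode:
  extraEnvironmentVariables:
    - name: SE_NODE_REGISTER_PERIOD   # assumption: maps to --register-period (seconds)
      value: "300"
    - name: SE_NODE_REGISTER_CYCLE    # assumption: maps to --register-cycle (seconds)
      value: "10"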

@farioas
Author

farioas commented Feb 18, 2025

Even with isolated components, the distributor still gets restarted:

2025-02-17 03:10:22,642 DEBUG [Probe.Liveness] - Distributor is healthy.
2025-02-17 03:11:22,650 DEBUG [Probe.Liveness] - Distributor is healthy.
2025-02-17 03:12:22,661 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:12:22,662 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:13:22,661 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:13:22,663 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:14:22,661 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:14:22,662 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:15:22,680 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:15:22,681 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:16:22,669 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:16:22,670 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:17:22,680 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:17:22,681 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:18:22,653 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:18:22,654 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:19:22,660 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:19:22,661 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:20:22,650 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:20:22,651 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:21:22,694 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:21:22,695 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:22:22,682 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:22:22,683 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:23:22,668 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:23:22,669 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:24:22,685 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:24:22,686 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:25:22,671 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:25:22,672 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:26:22,712 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:26:22,713 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:27:22,650 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:27:22,651 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:28:22,684 DEBUG [Probe.Liveness] - Session Queue Size: 20, Session Count: 0, Max Session: 0
2025-02-17 03:28:22,685 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:29:22,674 DEBUG [Probe.Liveness] - Session Queue Size: 23, Session Count: 0, Max Session: 0
2025-02-17 03:29:22,675 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:30:22,694 DEBUG [Probe.Liveness] - Session Queue Size: 23, Session Count: 0, Max Session: 0
2025-02-17 03:30:22,695 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:31:22,705 DEBUG [Probe.Liveness] - Session Queue Size: 23, Session Count: 0, Max Session: 0
2025-02-17 03:31:22,707 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:32:22,656 DEBUG [Probe.Liveness] - Session Queue Size: 23, Session Count: 0, Max Session: 0
2025-02-17 03:32:22,657 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:33:22,685 DEBUG [Probe.Liveness] - Session Queue Size: 23, Session Count: 0, Max Session: 0
2025-02-17 03:33:22,686 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:34:22,667 DEBUG [Probe.Liveness] - Session Queue Size: 23, Session Count: 0, Max Session: 0
2025-02-17 03:34:22,668 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:35:22,676 DEBUG [Probe.Liveness] - Session Queue Size: 23, Session Count: 0, Max Session: 0
2025-02-17 03:35:22,677 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:36:22,685 DEBUG [Probe.Liveness] - Session Queue Size: 43, Session Count: 0, Max Session: 0
2025-02-17 03:36:22,687 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:37:22,692 DEBUG [Probe.Liveness] - Session Queue Size: 43, Session Count: 0, Max Session: 0
2025-02-17 03:37:22,694 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:38:22,670 DEBUG [Probe.Liveness] - Session Queue Size: 43, Session Count: 0, Max Session: 0
2025-02-17 03:38:22,671 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:39:22,697 DEBUG [Probe.Liveness] - Session Queue Size: 43, Session Count: 0, Max Session: 0
2025-02-17 03:39:22,699 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:40:22,694 DEBUG [Probe.Liveness] - Session Queue Size: 43, Session Count: 0, Max Session: 0
2025-02-17 03:40:22,695 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-02-17 03:41:22,693 DEBUG [Probe.Liveness] - Session Queue Size: 43, Session Count: 0, Max Session: 0
2025-02-17 03:41:22,694 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
Trapped SIGTERM/SIGINT/x so shutting down supervisord...
2025-02-17 03:41:23,517 WARN received SIGTERM indicating exit request
2025-02-17 03:41:23,518 INFO waiting for selenium-grid-distributor to die
2025-02-17 03:41:24,519 WARN stopped: selenium-grid-distributor (terminated by SIGTERM)

@farioas
Author

farioas commented Feb 20, 2025

Today it happened again, and I observed strange resource consumption in the distributor and router.
The distributor consumed 2 cores:
[screenshot: distributor CPU usage]
The router consumed 6 cores:
[screenshot: router CPU usage]

At the same time KEDA provisioned 200 pods, which is expected for nightly testing:

2025-02-20T04:17:03Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 200}

But none of these pods were utilized by the hub.
Router logs:

Feb 20, 2025 3:41:11 AM org.openqa.selenium.remote.tracing.SpanWrappedHttpHandler execute
WARNING: Unable to execute request: java.util.concurrent.TimeoutException
Build info: version: '4.28.1', revision: '73f5ad48a2'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.1.124-134.200.amzn2023.x86_64', java.version: '17.0.13'
Driver info: driver.version: unknown
org.openqa.selenium.TimeoutException: java.util.concurrent.TimeoutException
Build info: version: '4.28.1', revision: '73f5ad48a2'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.1.124-134.200.amzn2023.x86_64', java.version: '17.0.13'
Driver info: driver.version: unknown
	at org.openqa.selenium.remote.http.jdk.JdkHttpClient.execute(JdkHttpClient.java:424)
	at org.openqa.selenium.remote.tracing.TracedHttpClient.execute(TracedHttpClient.java:54)
	at org.openqa.selenium.grid.security.AddSecretFilter.lambda$apply$0(AddSecretFilter.java:40)
	at org.openqa.selenium.remote.http.Filter.lambda$andFinally$1(Filter.java:55)
	at org.openqa.selenium.grid.sessionqueue.remote.RemoteNewSessionQueue.addToQueue(RemoteNewSessionQueue.java:110)
	at org.openqa.selenium.grid.sessionqueue.NewSessionQueue.lambda$new$0(NewSessionQueue.java:68)
	at org.openqa.selenium.remote.http.Route$TemplatizedRoute.handle(Route.java:192)
	at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
	at org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:360)
	at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
	at org.openqa.selenium.grid.sessionqueue.NewSessionQueue.execute(NewSessionQueue.java:128)
	at org.openqa.selenium.remote.tracing.SpanWrappedHttpHandler.execute(SpanWrappedHttpHandler.java:87)
	at org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:63)
	at org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:360)
	at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
	at org.openqa.selenium.grid.router.Router.execute(Router.java:89)
	at org.openqa.selenium.grid.web.EnsureSpecCompliantResponseHeaders.lambda$apply$0(EnsureSpecCompliantResponseHeaders.java:34)
	at org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:63)
	at org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:360)
	at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
	at org.openqa.selenium.remote.http.Route$NestedRoute.handle(Route.java:270)
	at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
	at org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:360)
	at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
	at org.openqa.selenium.grid.security.BasicAuthenticationFilter.lambda$apply$0(BasicAuthenticationFilter.java:54)
	at org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:63)
	at org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:360)
	at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
	at org.openqa.selenium.remote.AddWebDriverSpecHeaders.lambda$apply$0(AddWebDriverSpecHeaders.java:35)
	at org.openqa.selenium.remote.ErrorFilter.lambda$apply$0(ErrorFilter.java:44)
	at org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:63)
	at org.openqa.selenium.remote.ErrorFilter.lambda$apply$0(ErrorFilter.java:44)
	at org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:63)
	at org.openqa.selenium.netty.server.SeleniumHandler.lambda$channelRead0$0(SeleniumHandler.java:49)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.util.concurrent.TimeoutException
	at java.base/java.util.concurrent.CompletableFuture$Timeout.run(CompletableFuture.java:2874)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	... 3 more

Distributor logs:

Exception in thread "Local Distributor - Session Creation" org.openqa.selenium.WebDriverException: Command timed out after 1 minute(s)
Build info: version: '4.28.1', revision: '73f5ad48a2'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.1.127-135.201.amzn2023.x86_64', java.version: '17.0.13'
Driver info: driver.version: unknown
	at jdk.internal.reflect.GeneratedConstructorAccessor15.newInstance(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:500)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:481)
	at org.openqa.selenium.remote.ErrorCodec.decode(ErrorCodec.java:167)
	at org.openqa.selenium.grid.web.Values.get(Values.java:47)
	at org.openqa.selenium.grid.sessionmap.remote.RemoteSessionMap.makeRequest(RemoteSessionMap.java:118)
	at org.openqa.selenium.grid.sessionmap.remote.RemoteSessionMap.add(RemoteSessionMap.java:82)
	at org.openqa.selenium.grid.distributor.local.LocalDistributor.newSession(LocalDistributor.java:598)
	at org.openqa.selenium.grid.distributor.local.LocalDistributor$NewSessionRunnable.handleNewSessionRequest(LocalDistributor.java:861)
	at org.openqa.selenium.grid.distributor.local.LocalDistributor$NewSessionRunnable.lambda$run$1(LocalDistributor.java:818)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)

@farioas
Author

farioas commented Feb 20, 2025

@VietND96 If you're interested, I can send the distributor and router logs from the past few hours to your email for troubleshooting.

@VietND96
Member

Can you also share the values file used to deploy the setup (just the set of configs that differ from the default values.yaml)?
And also the code where you initialize RemoteWebDriver (including all browser args and capabilities), so I can reproduce?

@VietND96
Member

I just saw that your issue seems similar to #2655, where the Distributor and Router get rebooted.

@farioas
Author

farioas commented Feb 20, 2025

I sent my logs and values to your email.

Here's my override for the default values.yaml:

global:
  seleniumGrid:
    stdoutProbeLog: true
    nodesImageTag: nightly # Pull the latest stable chrome image all the time (other options: dev | beta)

ingress:
  hostname: my.app.host
  nginx:
    useHttp2: false
  tls:
    - secretName: my.app.host-tls-secret
      hosts:
        - my.app.host

autoscaling:
  patchObjectFinalizers:
    enabled: false # doesn't work with argocd
  annotations: ! # remove helm hooks
  enableWithExistingKEDA: true
  scaledJobOptions:
    scalingStrategy:
      strategy: default
  scaledOptions:
    minReplicaCount: 0
    maxReplicaCount: 200
    pollingInterval: 10

chromeNode:
  enabled: true
  annotations:
    karpenter.sh/do-not-disrupt: "true" # protect from karpenter evict
  resources:
    requests:
      memory: "2Gi"
      cpu: "1"
    limits:
      memory: "2Gi"
      cpu:
  extraVolumeMounts:
    - name: downloads
      mountPath: /home/seluser/Downloads
  extraVolumes:
    - name: downloads
      emptyDir: {}
  extraEnvironmentVariables:
    - name: SE_BROWSER_ARGS_DISABLE_FEATURES
      value: "--disable-features=OptimizationGuideModelDownloading,OptimizationHintsFetching,OptimizationTargetPrediction,OptimizationHints"
    - name: SE_BROWSER_ARGS_DISABLE_SEARCH_ENGINE
      value: "--disable-search-engine-choice-screen"
    - name: SE_JAVA_OPTS
      value: "-Dselenium.LOGGER.level=SEVERE -Dwebdriver.chrome.whitelistedIps="
    - name: SE_VNC_NO_PASSWORD
      value: "true"
    - name: SE_SCREEN_WIDTH
      value: "1920"
    - name: SE_SCREEN_HEIGHT
      value: "1200"
    - name: SE_NODE_SESSION_TIMEOUT
      value: "1800"
    - name: SE_NODE_ENABLE_MANAGED_DOWNLOADS
      value: "true"
  readinessProbe:
    enabled: false
  livenessProbe:
    enabled: true
  hpa:
    unsafeSsl: 'false'
    platformName: "Linux"

firefoxNode:
  enabled: true
  resources:
    requests:
      memory: "2Gi"
      cpu: "1"
    limits:
      memory: "2Gi"
      cpu:
  extraVolumeMounts:
    - name: downloads
      mountPath: /home/seluser/Downloads
  extraVolumes:
    - name: downloads
      emptyDir: { }
  extraEnvironmentVariables:
    - name: SE_VNC_NO_PASSWORD
      value: "true"
    - name: SE_SCREEN_WIDTH
      value: "1920"
    - name: SE_SCREEN_HEIGHT
      value: "1200"
  readinessProbe:
    enabled: false
  livenessProbe:
    enabled: true
  hpa:
    platformName: "Linux"

edgeNode:
  enabled: false

redis:
  enabled: true

isolateComponents: true
components:
  sessionMap:
    resources:
      requests:
        cpu: "1"
        memory: "2Gi"
      limits:
        cpu:
        memory: "2Gi"
    externalDatastore:
      enabled: true
      backend: redis
  sessionQueue:
    extraEnvironmentVariables:
      - name: SE_SESSION_REQUEST_TIMEOUT
        value: "1800"
  router:
    resources:
      requests:
        cpu: "1"
        memory: "2Gi"
      limits:
        cpu:
        memory: "2Gi"
  distributor:
    resources:
      requests:
        cpu: "2"
        memory: "4Gi"
      limits:
        cpu:
        memory: "8Gi"

basicAuth:
  enabled: true
  username: user
  password: 123456
  embeddedUrl: true

webdriver init snippet:

# pom.xml dep
<dependency>
    <groupId>io.github.bonigarcia</groupId>
    <artifactId>webdrivermanager</artifactId>
    <version>5.9.2</version>
    <scope>test</scope>
</dependency>
chromeOpts.setCapability("browserName", "chrome");
chromeOpts.setCapability("platformName", "Linux");

try {
    Map<String, Object> prefsMap = new HashMap<>();
    prefsMap.put("profile.default_content_settings.popups", 0);
    prefsMap.put("download.default_directory", "/home/seluser/Downloads/");
    chromeOpts.setExperimentalOption("prefs", prefsMap);

    WebDriver driver;

    driver = WebDriverManager.chromedriver()
            .remoteAddress(new URL(gridUrl))
            .capabilities(chromeOpts)
            .create();

    driver.manage().window().maximize();
    FileHandler.setCurrentDownloadFolderName(driver, downloadFolderName);
    FileHandler.setCurrentDownloadFolderPath(driver, downloadFilepath);

    return driver;

} catch (MalformedURLException e) {
    throw new RuntimeException(e);
}
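
For comparison, the same initialization without WebDriverManager, as a minimal self-contained sketch using plain RemoteWebDriver (gridUrl and the download directory come from the snippet above; the class and method names are illustrative):

import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.RemoteWebDriver;

public class GridSessionFactory {
    // gridUrl e.g. https://user:password@my.app.host (basic auth embedded, per the values above)
    static WebDriver create(String gridUrl) throws MalformedURLException {
        ChromeOptions chromeOpts = new ChromeOptions();
        chromeOpts.setCapability("platformName", "Linux");

        // Route downloads into the folder mounted in the node pod
        Map<String, Object> prefs = new HashMap<>();
        prefs.put("profile.default_content_settings.popups", 0);
        prefs.put("download.default_directory", "/home/seluser/Downloads/");
        chromeOpts.setExperimentalOption("prefs", prefs);

        return new RemoteWebDriver(new URL(gridUrl), chromeOpts);
    }
}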

@VietND96
Member

In the meantime, can you validate with the latest version 4.29.0?

@farioas
Author

farioas commented Feb 24, 2025

Thanks, just applied it, I think I'll have results within this week.

@farioas
Author

farioas commented Mar 10, 2025

After 2 weeks of testing no issues so far, thanks

@farioas farioas closed this as completed Mar 10, 2025
@VietND96
Member

After 2 weeks of testing no issues so far, thanks

Great to hear that 🎉

@farioas
Author

farioas commented Mar 12, 2025

Seems like the issue has returned:

2025-03-12 16:03:19,067 DEBUG [Probe.Liveness] - Distributor is healthy.
2025-03-12 16:04:19,076 DEBUG [Probe.Liveness] - Distributor is healthy.
2025-03-12 16:05:19,057 DEBUG [Probe.Liveness] - Distributor is healthy.
2025-03-12 16:06:19,055 DEBUG [Probe.Liveness] - Distributor is healthy.
2025-03-12 16:07:19,060 DEBUG [Probe.Liveness] - Distributor is healthy.
2025-03-12 16:08:19,090 DEBUG [Probe.Liveness] - Session Queue Size: 19, Session Count: 0, Max Session: 0
2025-03-12 16:08:19,091 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
2025-03-12 16:09:19,122 DEBUG [Probe.Liveness] - Session Queue Size: 19, Session Count: 0, Max Session: 0
2025-03-12 16:09:19,123 DEBUG [Probe.Liveness] - It seems the Distributor is delayed in processing a new session in the queue. Probe checks failed.
16:09:36.285 INFO [Node.<init>] - Binding additional locator mechanisms: relative
16:09:36.580 INFO [GridModel.setAvailability] - Switching Node 1c4edfdf-bec5-45f1-a68d-fd7efdbeff55 (uri: http://10.1.95.42:5555) from DOWN to UP
16:09:36.580 INFO [LocalDistributor.add] - Added node 1c4edfdf-bec5-45f1-a68d-fd7efdbeff55 at http://10.1.95.42:5555. Health check every 120s

KEDA logs:

2025-03-12T16:07:11Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 0}
2025-03-12T16:07:11Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 0}
2025-03-12T16:07:21Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 0}
2025-03-12T16:07:21Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 0}
2025-03-12T16:07:21Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 0}
2025-03-12T16:07:21Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 0}
2025-03-12T16:07:31Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 0}
2025-03-12T16:07:31Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 0}
2025-03-12T16:07:31Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 0}
2025-03-12T16:07:31Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 0}
2025-03-12T16:07:41Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 0}
2025-03-12T16:07:41Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 0}
2025-03-12T16:07:41Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 0}
2025-03-12T16:07:41Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 0}
2025-03-12T16:07:51Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 0}
2025-03-12T16:07:51Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 0}
2025-03-12T16:07:51Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 0}
2025-03-12T16:07:51Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 0}
2025-03-12T16:08:01Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 0}
2025-03-12T16:08:01Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 0}
2025-03-12T16:08:01Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 0}
2025-03-12T16:08:01Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 0}
2025-03-12T16:08:11Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 0}
2025-03-12T16:08:11Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 0}
2025-03-12T16:08:11Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 0}
2025-03-12T16:08:11Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 0}
2025-03-12T16:08:11Z	INFO	scaleexecutor	Creating jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Effective number of max jobs": 19}
2025-03-12T16:08:11Z	INFO	scaleexecutor	Creating jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of jobs": 19}
2025-03-12T16:08:11Z	INFO	scaleexecutor	Created jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of jobs": 19}
2025-03-12T16:08:21Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 0}
2025-03-12T16:08:21Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 0}
2025-03-12T16:08:21Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 19}
2025-03-12T16:08:21Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 19}
2025-03-12T16:08:21Z	INFO	scaleexecutor	No need to create jobs - all requested jobs already exist	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "jobs": 0}
2025-03-12T16:08:31Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 0}
2025-03-12T16:08:31Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 0}
2025-03-12T16:08:31Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 19}
2025-03-12T16:08:31Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 19}
2025-03-12T16:08:31Z	INFO	scaleexecutor	No need to create jobs - all requested jobs already exist	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "jobs": 0}
2025-03-12T16:08:41Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 0}
2025-03-12T16:08:41Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 0}
2025-03-12T16:08:41Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 19}
2025-03-12T16:08:41Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 19}
2025-03-12T16:08:41Z	INFO	scaleexecutor	No need to create jobs - all requested jobs already exist	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "jobs": 0}
2025-03-12T16:08:51Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 0}
2025-03-12T16:08:51Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 0}
2025-03-12T16:08:51Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 19}
2025-03-12T16:08:51Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 19}
2025-03-12T16:08:51Z	INFO	scaleexecutor	No need to create jobs - all requested jobs already exist	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "jobs": 0}
2025-03-12T16:09:01Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 0}
2025-03-12T16:09:01Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 0}
2025-03-12T16:09:01Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 19}
2025-03-12T16:09:01Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 19}
2025-03-12T16:09:01Z	INFO	scaleexecutor	No need to create jobs - all requested jobs already exist	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "jobs": 0}
2025-03-12T16:09:11Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 0}
2025-03-12T16:09:11Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 0}
2025-03-12T16:09:11Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 19}
2025-03-12T16:09:11Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 19}
2025-03-12T16:09:11Z	INFO	scaleexecutor	No need to create jobs - all requested jobs already exist	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "jobs": 0}
2025-03-12T16:09:21Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 0}
2025-03-12T16:09:21Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 0}
2025-03-12T16:09:21Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 19}
2025-03-12T16:09:21Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 19}
2025-03-12T16:09:21Z	INFO	scaleexecutor	No need to create jobs - all requested jobs already exist	{"scaledJob.Name": "selenium-grid-selenium-node-chrome", "scaledJob.Namespace": "selenium-grid", "jobs": 0}
2025-03-12T16:09:31Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of running Jobs": 0}
2025-03-12T16:09:31Z	INFO	scaleexecutor	Scaling Jobs	{"scaledJob.Name": "selenium-grid-selenium-node-firefox", "scaledJob.Namespace": "selenium-grid", "Number of pending Jobs": 0}

Also reproducible after upgrading to chart 0.40.1.

@farioas farioas reopened this Mar 12, 2025
@farioas
Author

farioas commented Mar 14, 2025

❯ k get pods -n selenium-grid
NAME                                                    READY   STATUS    RESTARTS   AGE
selenium-grid-redis-master-0                            1/1     Running   0          6h52m
selenium-grid-selenium-distributor-57b4f9b486-q5bxl     1/1     Running   0          6h59m
selenium-grid-selenium-event-bus-fc79c4f9d-lxhmz        1/1     Running   0          6h42m
selenium-grid-selenium-node-chrome-28llw-sr5r6          1/1     Running   0          19m
selenium-grid-selenium-node-chrome-5xglm-7gfwl          1/1     Running   0          19m
selenium-grid-selenium-node-chrome-64v4l-f27gg          1/1     Running   0          21m
selenium-grid-selenium-node-chrome-8gtn8-sxg8x          1/1     Running   0          19m
selenium-grid-selenium-node-chrome-b2jz9-l7r9h          1/1     Running   0          25m
selenium-grid-selenium-node-chrome-bc6x9-75hrj          1/1     Running   0          20m
selenium-grid-selenium-node-chrome-cqn5m-qs69s          1/1     Running   0          19m
selenium-grid-selenium-node-chrome-h6dgt-lv6nm          1/1     Running   0          23m
selenium-grid-selenium-node-chrome-hxcrb-66ffg          1/1     Running   0          25m
selenium-grid-selenium-node-chrome-kggj4-kcf86          1/1     Running   0          19m
selenium-grid-selenium-node-chrome-kzvx2-42s4x          1/1     Running   0          25m
selenium-grid-selenium-node-chrome-ls7d2-jg2ns          1/1     Running   0          20m
selenium-grid-selenium-node-chrome-m2tb8-f2skm          1/1     Running   0          20m
selenium-grid-selenium-node-chrome-m59qb-9cd2f          1/1     Running   0          20m
selenium-grid-selenium-node-chrome-pxnd4-r6jpn          1/1     Running   0          20m
selenium-grid-selenium-node-chrome-qzthw-rwh24          1/1     Running   0          19m
selenium-grid-selenium-node-chrome-vvmmm-g9r7s          1/1     Running   0          30h
selenium-grid-selenium-node-chrome-wnjmk-xff5q          1/1     Running   0          20m
selenium-grid-selenium-router-7c98c9d8d5-66vbq          1/1     Running   0          6h52m
selenium-grid-selenium-session-map-cf95f846b-jfn6z      1/1     Running   0          6h52m
selenium-grid-selenium-session-queue-7c44c456d8-8q975   1/1     Running   0          6h49m

There are some nodes that have been up and running for 30h, while they should have been downscaled or have failed to register:

❯ k -n selenium-grid logs selenium-grid-selenium-node-chrome-vvmmm-g9r7s
Defaulted container "selenium-grid-selenium-node-chrome" out of: selenium-grid-selenium-node-chrome, pre-puller-selenium-grid-selenium-node-chrome (init)
2025-03-13 02:37:54,579 INFO Included extra file "/etc/supervisor/conf.d/chrome-cleanup.conf" during parsing
2025-03-13 02:37:54,579 INFO Included extra file "/etc/supervisor/conf.d/recorder.conf" during parsing
2025-03-13 02:37:54,579 INFO Included extra file "/etc/supervisor/conf.d/selenium.conf" during parsing
2025-03-13 02:37:54,579 INFO Included extra file "/etc/supervisor/conf.d/uploader.conf" during parsing
2025-03-13 02:37:54,582 INFO RPC interface 'supervisor' initialized
2025-03-13 02:37:54,582 INFO supervisord started with pid 8
2025-03-13 02:37:55,588 INFO spawned: 'xvfb' with pid 9
2025-03-13 02:37:55,590 INFO spawned: 'vnc' with pid 10
2025-03-13 02:37:55,592 INFO spawned: 'novnc' with pid 11
2025-03-13 02:37:55,594 INFO spawned: 'selenium-node' with pid 12
2025-03-13 02:37:55,608 INFO success: selenium-node entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
Appending Selenium option: --enable-managed-downloads true
Appending Selenium option: --register-period 120
Appending Selenium option: --register-cycle 5
Appending Selenium option: --register-shutdown-on-failure
Appending Selenium option: --heartbeat-period 30
Appending Selenium option: --log-level INFO
Appending Selenium option: --http-logs false
Appending Selenium option: --structured-logs false
Generating Selenium Config
Appending Selenium option: --tracing false
Tracing is disabled
Appending Selenium option: --bind-host false
Appending Selenium option: --config /opt/selenium/config.toml
Selenium Grid Node configuration:
[events]
    publish = "tcp://selenium-grid-selenium-event-bus:4442"
    subscribe = "tcp://selenium-grid-selenium-event-bus:4443"

[server]
host = "10.1.60.149"
port = "5555"
[node]
grid-url = "https://XXXX"
session-timeout = 1800
override-max-sessions = false
detect-drivers = false
drain-after-session-count = 1
max-sessions = 1

[[node.driver-configuration]]
display-name = "chrome"
stereotype = '{"browserName": "chrome", "browserVersion": "", "platformName": "Linux", "goog:chromeOptions": {"binary": "/usr/bin/google-chrome"}, "se:containerName": "selenium-grid-selenium-node-chrome-vvmmm-g9r7s", "container:hostname": "selenium-grid-selenium-node-chrome-vvmmm-g9r7s"}'
max-sessions = 1

Starting Selenium Grid Node...
Using JAVA_OPTS: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/selenium/logs -Dselenium.LOGGER.level=SEVERE -Dwebdriver.chrome.whitelistedIps= -Dwebdriver.remote.enableTracing=false -Dwebdriver.httpclient.version=HTTP_1_1
02:37:56.402 INFO [LoggingOptions.configureLogEncoding] - Using the system default encoding
02:37:56.408 INFO [LoggingOptions.getTracer] - Using null tracer
02:37:56.456 INFO [UnboundZmqEventBus.<init>] - Connecting to tcp://selenium-grid-selenium-event-bus:4442 and tcp://selenium-grid-selenium-event-bus:4443
02:37:56.504 INFO [UnboundZmqEventBus.<init>] - Sockets created
2025-03-13 02:37:56,616 INFO success: xvfb entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2025-03-13 02:37:56,616 INFO success: vnc entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2025-03-13 02:37:56,616 INFO success: novnc entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
02:37:57.505 INFO [UnboundZmqEventBus.<init>] - Event bus ready
02:37:57.641 INFO [NodeServer.createHandlers] - Reporting self as: http://10.1.60.149:5555
02:37:57.645 INFO [LoggingOptions.getTracer] - Using null tracer
02:37:57.661 INFO [NodeOptions.getSessionFactories] - Detected 8 available processors
02:37:57.712 INFO [NodeOptions.report] - Adding chrome for {"browserName": "chrome","browserVersion": "","container:hostname": "selenium-grid-selenium-node-chrome-vvmmm-g9r7s","goog:chromeOptions": {"binary": "\u002fusr\u002fbin\u002fgoogle-chrome"},"platformName": "linux","se:containerName": "selenium-grid-selenium-node-chrome-vvmmm-g9r7s","se:downloadsEnabled": true,"se:noVncPort": 7900,"se:vncEnabled": true} 1 times
02:37:57.753 INFO [Node.<init>] - Binding additional locator mechanisms: relative
02:37:57.919 INFO [NodeServer$2.start] - Starting registration process for Node http://10.1.60.149:5555
02:37:57.921 INFO [NodeServer.execute] - Started Selenium node 4.29.0 (revision 18ae989): http://10.1.60.149:5555
02:37:57.933 INFO [NodeServer$2.lambda$start$2] - Sending registration event...
02:37:58.031 INFO [NodeServer.lambda$createHandlers$2] - Node has been added
02:37:58.666 INFO [LocalNode.newSession] - Session created by the Node. Id: c1a53540374088557a8c27905f0cf44a, Caps: Capabilities {acceptInsecureCerts: false, browserName: chrome, browserVersion: 134.0.6998.35, chrome: {chromedriverVersion: 134.0.6998.35 (ea6ef4c2ac15..., userDataDir: /tmp/.org.chromium.Chromium...}, fedcm:accounts: true, goog:chromeOptions: {debuggerAddress: localhost:32891}, networkConnectionEnabled: false, pageLoadStrategy: normal, platformName: linux, proxy: Proxy(), se:bidiEnabled: false, se:cdp: wss://selenium-grid.appx.hu..., se:cdpVersion: 134.0.6998.35, se:containerName: selenium-grid-selenium-node..., se:downloadsEnabled: true, se:noVncPort: 7900, se:vnc: wss://selenium-grid.appx.hu..., se:vncEnabled: true, se:vncLocalAddress: ws://10.1.60.149:7900, setWindowRect: true, strictFileInteractability: false, timeouts: {implicit: 0, pageLoad: 300000, script: 30000}, unhandledPromptBehavior: dismiss and notify, webauthn:extension:credBlob: true, webauthn:extension:largeBlob: true, webauthn:extension:minPinLength: true, webauthn:extension:prf: true, webauthn:virtualAuthenticators: true}
02:37:58.668 INFO [LocalNode.checkSessionCount] - Draining Node, configured sessions value (1) has been reached.
02:37:58.716 INFO [ProxyNodeWebsockets.createWsEndPoint] - Establishing connection to ws://localhost:32891/devtools/browser/ca3c1b6a-cdb8-47d6-9b83-e152fc17cb7e
02:39:35.297 WARN [ProxyNodeWebsockets$ForwardingListener.onError] - Error proxying websocket command
java.net.SocketException: Connection reset
	at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:401)
	at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:434)
	at java.net.http/jdk.internal.net.http.SocketTube.readAvailable(SocketTube.java:1178)
	at java.net.http/jdk.internal.net.http.SocketTube$InternalReadPublisher$InternalReadSubscription.read(SocketTube.java:841)
	at java.net.http/jdk.internal.net.http.SocketTube$SocketFlowTask.run(SocketTube.java:181)
	at java.net.http/jdk.internal.net.http.common.SequentialScheduler$SchedulableTask.run(SequentialScheduler.java:207)
	at java.net.http/jdk.internal.net.http.common.SequentialScheduler.runOrSchedule(SequentialScheduler.java:280)
	at java.net.http/jdk.internal.net.http.common.SequentialScheduler.runOrSchedule(SequentialScheduler.java:233)
	at java.net.http/jdk.internal.net.http.SocketTube$InternalReadPublisher$InternalReadSubscription.signalReadable(SocketTube.java:782)
	at java.net.http/jdk.internal.net.http.SocketTube$InternalReadPublisher$ReadEvent.signalEvent(SocketTube.java:965)
	at java.net.http/jdk.internal.net.http.SocketTube$SocketFlowEvent.handle(SocketTube.java:253)
	at java.net.http/jdk.internal.net.http.HttpClientImpl$SelectorManager.handleEvent(HttpClientImpl.java:1469)
	at java.net.http/jdk.internal.net.http.HttpClientImpl$SelectorManager.lambda$run$3(HttpClientImpl.java:1414)
	at java.base/java.util.ArrayList.forEach(ArrayList.java:1596)
	at java.net.http/jdk.internal.net.http.HttpClientImpl$SelectorManager.run(HttpClientImpl.java:1414)
02:39:35.383 INFO [SessionSlot.stop] - Stopping session c1a53540374088557a8c27905f0cf44a
02:39:35.383 INFO [LocalNode.stopTimedOutSession] - Node draining complete!

@VietND96
Member

There are some nodes that have been up and running for 30h, while they should have been downscaled or have failed to register:

I am checking this case. It looks like the Node went through 02:39:35.383 INFO [LocalNode.stopTimedOutSession] - Node draining complete! but didn't register that drain-after-session-count had reached 1, so it never shut the Node down completely.
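
In the meantime, a quick way to spot nodes stuck in DRAINING from the Grid side is the status endpoint; a sketch (host and credentials taken from the values above, jq filter illustrative):

curl -s -u user:123456 https://my.app.host/status \
  | jq '.value.nodes[] | {id, availability, sessions: ([.slots[] | select(.session != null)] | length)}'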
