Skip to content

1026274: Worker Framework Shutdown hanging #226

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: develop
Choose a base branch
from
Open

Conversation

andyreidz
Copy link
Member

@andyreidz andyreidz commented Jun 13, 2025

Deadlock was caused by the use of System.exit in the test worker, using this opportunity to simplify shutdown logic and use a non-configurable timeout.

Update the shutdownIncoming method to make the shutdown permanent so GatedHealthProvider does not trigger a reconnect.

[2025-08-05 13:27:46.140Z #110.028 INFO  -            -   ] c.g.w.core.WorkerApplication: Worker stop requested.
[2025-08-05 13:27:46.145Z #110.028 INFO  -            -   ] c.g.w.core.WorkerApplication: Allowing 10 backlog tasks to complete, 1 currently active.
[2025-08-05 13:28:01.150Z #110.028 INFO  -            -   ] c.g.w.core.WorkerApplication: Allowing 7 backlog tasks to complete, 1 currently active.
...
[2025-08-05 13:28:31.160Z #110.028 INFO  -            -   ] c.g.w.core.WorkerApplication: Allowing 1 backlog tasks to complete, 1 currently active.
[2025-08-05 13:28:46.197Z #110.028 INFO  -            -   ] c.g.w.core.WorkerApplication: Worker stopped.

@buildmachine-sou-jenkins2
Copy link
Contributor

@andyreidz andyreidz marked this pull request as ready for review June 13, 2025 12:45
@andyreidz andyreidz self-assigned this Jun 13, 2025
@buildmachine-sou-jenkins2
Copy link
Contributor

The Documentation QA site for this branch has been built:
https://workerframework-ci-worker-framework-1026274-bafe1a.glpages.otxlab.net

@andyreidz andyreidz requested a review from dermot-hardy June 13, 2025 13:00
@andyreidz andyreidz marked this pull request as draft June 13, 2025 13:25
@andyreidz andyreidz removed the request for review from dermot-hardy June 13, 2025 13:25
@andyreidz
Copy link
Member Author

More thought is required for BulkWorkerThreadPool

@andyreidz andyreidz marked this pull request as ready for review August 5, 2025 13:32
@andyreidz andyreidz requested a review from dermot-hardy August 6, 2025 08:41
@andyreidz andyreidz requested a review from dermot-hardy August 7, 2025 09:58

final long startTime = System.currentTimeMillis();

while(wtp.getBacklogSize() > 0 && System.currentTimeMillis() - startTime < SHUTDOWN_DURATION) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure about this strategy of trying to clear the backlog during shutdown? Trying to complete in-progress tasks makes sense but if I'm understanding this right it is intentionally allowing new tasks to begin that haven't been passed to the worker code yet. Wouldn't we be better to clear these backlog queues and let the next worker instance pick them up?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The prefetched message have their delivery count increased and this goes towards the poison message identification.
I take your point that starting more messages increases the chances they get terminated after the timeout.
Let's aim to talk it through at some stage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants