Avoid stack overflow in IndicesClusterStateService applyClusterState #132536

Open
wants to merge 14 commits into
base: main
from

Contributor

@albertzaharovits albertzaharovits commented Aug 7, 2025

Every cluster state applied in the IndicesClusterStateService can append a new RefCountingListener to a chain of such listeners. If the chain grows too long, the unlucky thread that decrements the ref count of the head of the chain to 0 ends up calling each listener in turn and, assuming every ref count along the way also drops to 0, traverses the whole chain on its own stack, possibly resulting in a StackOverflowError.

This fix chains at most 10 RefCountingListeners; the 11th one is forked onto a generic thread when it comes to be executed.
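
To illustrate the idea outside of Elasticsearch, here is a minimal standalone sketch in plain Java. It does not use RefCountingListener or any Elasticsearch class; `BoundedListenerChainSketch`, `runChain`, and the cap passed to it are hypothetical names and values chosen for the example. Each link completes the next one on the same thread until the cap is reached, at which point the rest of the chain continues on an executor with a fresh stack.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative only: models a listener chain as nested calls, not the real ES listeners.
class BoundedListenerChainSketch {

    /**
     * Completes a chain of `length` links and returns the deepest nesting seen on any one thread.
     * A link completes its successor inline unless `maxDepth` nested completions have already
     * happened on the current thread; in that case the successor is forked to `executor`.
     */
    static int runChain(int length, int maxDepth, ExecutorService executor) {
        AtomicInteger deepest = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(1);
        completeLink(length, 1, maxDepth, executor, deepest, done);
        try {
            if (!done.await(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("chain did not complete in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return deepest.get();
    }

    private static void completeLink(
        int remaining, int depth, int maxDepth,
        ExecutorService executor, AtomicInteger deepest, CountDownLatch done
    ) {
        deepest.accumulateAndGet(depth, Math::max);
        if (remaining == 1) {
            done.countDown(); // last link in the chain
            return;
        }
        if (depth >= maxDepth) {
            // Cap reached: continue on another thread, which starts again at depth 1.
            executor.execute(() -> completeLink(remaining - 1, 1, maxDepth, executor, deepest, done));
        } else {
            // Below the cap: complete the next link inline, one frame deeper.
            completeLink(remaining - 1, depth + 1, maxDepth, executor, deepest, done);
        }
    }

    public static void main(String[] args) {
        ExecutorService generic = Executors.newSingleThreadExecutor();
        try {
            System.out.println("unbounded depth: " + runChain(1_000, Integer.MAX_VALUE, generic));
            System.out.println("bounded depth:   " + runChain(1_000, 10, generic));
        } finally {
            generic.shutdown();
        }
    }
}
```

Without a cap the nesting depth equals the chain length, which is what eventually exhausts the stack; with a cap it never exceeds the cap, at the cost of a thread hand-off every time the cap is hit.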

@albertzaharovits albertzaharovits self-assigned this Aug 7, 2025
@albertzaharovits albertzaharovits added >bug :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v9.2.0 v8.19.2 v9.1.2 labels Aug 7, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Coordination Meta label for Distributed Coordination team label Aug 7, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

@elasticsearchmachine
Collaborator

Hi @albertzaharovits, I've created a changelog YAML for you.

@albertzaharovits
Contributor Author

Honestly, I think I prefer that every chained listener be executed on a generic thread, for code simplicity's sake.

Contributor

@DaveCTurner DaveCTurner left a comment

I'd rather we didn't extend the chain in the (overwhelmingly common) case where the cluster state update doesn't close any more shards.

Also can you cover this in a test?

@@ -274,8 +275,26 @@ public synchronized void applyClusterState(final ClusterChangedEvent event) {
lastClusterStateShardsClosedListener = new SubscribableListener<>();
currentClusterStateShardsClosedListeners = new RefCountingListener(lastClusterStateShardsClosedListener);
try {
previousShardsClosedListener.addListener(currentClusterStateShardsClosedListeners.acquire());
Contributor

Hmm are you sure we should move all this listener stuff below doApplyClusterState()?

Contributor Author

I can't think of any impact to execution.

Contributor Author

But I've put it back at the original place.

@albertzaharovits
Contributor Author

> I'd rather we didn't extend the chain in the (overwhelmingly common) case where the cluster state update doesn't close any more shards.

Pushed 3a00599

@albertzaharovits
Contributor Author

@DaveCTurner can you take another look please?

I've changed the code to avoid linking listeners when the applied cluster state doesn't close any shards.
I've also added a test asserting that every runnable registered before the oldest incomplete shard-close listener is run, while the rest are not.
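
As a rough standalone sketch of the "don't link when nothing is closed" idea (plain Java with CompletableFuture, not the actual IndicesClusterStateService code; `SkipLinkSketch`, `applyUpdate`, and `linkCount` are hypothetical names for this example): an update that closes no shards reuses the previous listener, so the chain only grows when there is something new to wait for.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Illustrative only: models "all shards closed so far" as a single future.
class SkipLinkSketch {

    // Completes once every shard close requested by any update so far has finished.
    private CompletableFuture<Void> lastShardsClosed = CompletableFuture.completedFuture(null);
    private int links = 0;

    /** Applies one update; `shardCloses` holds one future per shard this update closes. */
    synchronized void applyUpdate(List<CompletableFuture<Void>> shardCloses) {
        if (shardCloses.isEmpty()) {
            // Common case: nothing closed, so keep the previous listener and add no link.
            return;
        }
        CompletableFuture<Void> closedInThisUpdate =
            CompletableFuture.allOf(shardCloses.toArray(new CompletableFuture<?>[0]));
        // New link: waits for the earlier closes and for the closes in this update.
        lastShardsClosed = CompletableFuture.allOf(lastShardsClosed, closedInThisUpdate);
        links++;
    }

    synchronized CompletableFuture<Void> allShardsClosed() {
        return lastShardsClosed;
    }

    synchronized int linkCount() {
        return links;
    }

    public static void main(String[] args) {
        SkipLinkSketch service = new SkipLinkSketch();
        for (int i = 0; i < 100; i++) {
            service.applyUpdate(List.of()); // updates that close no shards
        }
        CompletableFuture<Void> shardClose = new CompletableFuture<>();
        service.applyUpdate(List.of(shardClose)); // one update that closes a shard
        System.out.println("links after 101 updates: " + service.linkCount());
        shardClose.complete(null);
        System.out.println("all closed: " + service.allShardsClosed().isDone());
    }
}
```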

Labels
>bug :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Team:Distributed Coordination Meta label for Distributed Coordination team v8.19.3 v9.1.3 v9.2.0
3 participants