Avoid stack overflow in IndicesClusterStateService applyClusterState #132536

albertzaharovits · 2025-08-07T13:11:40Z

Every cluster state applied in the IndicesClusterStateService can append a new RefCountingListener to a chain of such listeners. If the chain is too long, the unlucky thread that decreases the ref count to 0 for the head of the chain ends up calling each listener in turn and, assuming all ref counts are thereby decreased to 0, traverses the whole chain on its own thread stack, possibly resulting in a StackOverflowError.
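To illustrate the failure mode described above, here is a minimal sketch, not the Elasticsearch code. `SimpleListener` is a hypothetical stand-in for a subscribable listener whose subscribers run inline on the completing thread, and `maxNestingForChain` is an invented helper that counts how deeply the callbacks nest when the head of the chain is completed.

```java
import java.util.ArrayList;
import java.util.List;

final class ListenerChainDepth {
    // Hypothetical stand-in: completing a listener runs its subscribers inline.
    static final class SimpleListener {
        private final List<Runnable> subscribers = new ArrayList<>();
        private boolean done;

        void addListener(Runnable r) {
            if (done) {
                r.run();
            } else {
                subscribers.add(r);
            }
        }

        void complete() {
            done = true;
            for (Runnable r : subscribers) {
                r.run();
            }
        }
    }

    private static int depth;
    private static int maxDepth;

    // Builds a chain in which each listener completes the next one,
    // completes the head, and reports the deepest callback nesting seen.
    static int maxNestingForChain(int length) {
        depth = 0;
        maxDepth = 0;
        SimpleListener head = new SimpleListener();
        SimpleListener previous = head;
        for (int i = 0; i < length; i++) {
            SimpleListener current = new SimpleListener();
            previous.addListener(() -> {
                depth++;
                maxDepth = Math.max(maxDepth, depth);
                current.complete(); // recursion: runs the next link on the same stack
                depth--;
            });
            previous = current;
        }
        head.complete();
        return maxDepth;
    }

    public static void main(String[] args) {
        System.out.println(maxNestingForChain(10));   // prints 10
        System.out.println(maxNestingForChain(1000)); // prints 1000
    }
}
```

The nesting grows linearly with the chain length, so a long enough chain would exhaust the stack of whichever thread completes the head.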

This fix chains at most 8 RefCountingListeners; the next one is forked onto a generic thread when it gets to execution.
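The bounding idea can be sketched roughly as follows. This is an illustration under assumptions, not the change in this PR: `SimpleListener`, `MAX_CHAIN_LENGTH`, and `maxNestingForChain` are hypothetical names, and a plain `Executor` stands in for the generic thread pool. Once 8 links have been chained inline, the next link is handed to the executor, so it starts on a fresh stack.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class BoundedListenerChain {
    static final int MAX_CHAIN_LENGTH = 8; // hypothetical constant, value from the description

    // Hypothetical stand-in: completing a listener runs its subscribers inline.
    static final class SimpleListener {
        private final List<Runnable> subscribers = new ArrayList<>();
        private boolean done;

        void addListener(Runnable r) {
            synchronized (this) {
                if (done == false) {
                    subscribers.add(r);
                    return;
                }
            }
            r.run();
        }

        void complete() {
            List<Runnable> toRun;
            synchronized (this) {
                done = true;
                toRun = new ArrayList<>(subscribers);
                subscribers.clear();
            }
            toRun.forEach(Runnable::run);
        }
    }

    // Chains `length` links, forking after every MAX_CHAIN_LENGTH inline links,
    // and returns the deepest callback nesting observed on any single thread.
    static int maxNestingForChain(int length, Executor generic) {
        AtomicInteger maxDepth = new AtomicInteger();
        ThreadLocal<Integer> depth = ThreadLocal.withInitial(() -> 0);
        CountDownLatch tailDone = new CountDownLatch(1);
        SimpleListener head = new SimpleListener();
        SimpleListener previous = head;
        int chainLength = 0;
        for (int i = 0; i < length; i++) {
            SimpleListener current = new SimpleListener();
            Runnable completeCurrent = () -> {
                int d = depth.get() + 1;
                depth.set(d);
                maxDepth.accumulateAndGet(d, Math::max);
                current.complete();
                depth.set(d - 1);
            };
            if (++chainLength > MAX_CHAIN_LENGTH) {
                chainLength = 1; // the forked link starts a new inline run
                previous.addListener(() -> generic.execute(completeCurrent));
            } else {
                previous.addListener(completeCurrent);
            }
            previous = current;
        }
        previous.addListener(tailDone::countDown);
        head.complete();
        try {
            if (tailDone.await(10, TimeUnit.SECONDS) == false) {
                throw new IllegalStateException("chain did not complete");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return maxDepth.get();
    }

    public static void main(String[] args) {
        ExecutorService generic = Executors.newSingleThreadExecutor();
        try {
            System.out.println(maxNestingForChain(5, generic));   // prints 5
            System.out.println(maxNestingForChain(100, generic)); // prints 8
        } finally {
            generic.shutdown();
        }
    }
}
```

In this sketch the nesting on any one thread never exceeds the bound, regardless of how long the logical chain is.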

elasticsearchmachine · 2025-08-07T13:12:06Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

elasticsearchmachine · 2025-08-07T13:12:06Z

Hi @albertzaharovits, I've created a changelog YAML for you.

albertzaharovits · 2025-08-07T13:12:54Z

Honestly, I think I prefer that every chained listener be executed on a generic thread, for code simplicity's sake.

DaveCTurner

I'd rather we didn't extend the chain in the (overwhelmingly common) case where the cluster state update doesn't close any more shards.

Also can you cover this in a test?

DaveCTurner · 2025-08-07T14:36:04Z

server/src/main/java/org/elasticsearch/indices/cluster/IndicesClusterStateService.java

@@ -274,8 +275,26 @@ public synchronized void applyClusterState(final ClusterChangedEvent event) {
        lastClusterStateShardsClosedListener = new SubscribableListener<>();
        currentClusterStateShardsClosedListeners = new RefCountingListener(lastClusterStateShardsClosedListener);
        try {
-            previousShardsClosedListener.addListener(currentClusterStateShardsClosedListeners.acquire());


Hmm are you sure we should move all this listener stuff below doApplyClusterState()?

I can't think of any impact to execution.

But I've put it back at the original place.

albertzaharovits · 2025-08-08T09:58:57Z

> I'd rather we didn't extend the chain in the (overwhelmingly common) case where the cluster state update doesn't close any more shards.

Pushed 3a00599

albertzaharovits · 2025-08-11T14:42:13Z

@DaveCTurner can you take another look please?

I've changed the code to avoid linking listeners when the applied cluster state doesn't close any shards.
I've also added a test that asserts that all the runnables before the oldest shard close listener that's not complete are run, while the others are not.
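A rough sketch of the behaviour described in this comment, under stated assumptions: `SkipEmptyLinks`, `SimpleListener`, `applyState`, and `onShardsClosed` are hypothetical names for illustration, not the API of IndicesClusterStateService or of the new test. A state that closes no shards reuses the previous listener instead of adding a link, and a runnable only runs once every shard closed by states up to its registration has finished closing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

final class SkipEmptyLinks {
    // Hypothetical stand-in: completing a listener runs its subscribers inline.
    static final class SimpleListener {
        private final List<Runnable> subscribers = new ArrayList<>();
        private boolean done;

        void addListener(Runnable r) {
            if (done) {
                r.run();
            } else {
                subscribers.add(r);
            }
        }

        void complete() {
            done = true;
            for (Runnable r : new ArrayList<>(subscribers)) {
                r.run();
            }
        }
    }

    private SimpleListener last = new SimpleListener();
    int links = 0; // number of listeners chained so far

    SkipEmptyLinks() {
        last.complete(); // nothing is closing initially
    }

    // Returns one Runnable per closing shard; running it signals that the shard has closed.
    List<Runnable> applyState(int closingShards) {
        if (closingShards == 0) {
            return Collections.emptyList(); // no new link: keep the previous listener
        }
        SimpleListener current = new SimpleListener();
        // One ref per closing shard plus one held until the previous link completes.
        AtomicInteger refs = new AtomicInteger(closingShards + 1);
        Runnable release = () -> {
            if (refs.decrementAndGet() == 0) {
                current.complete();
            }
        };
        last.addListener(release);
        last = current;
        links++;
        return Collections.nCopies(closingShards, release);
    }

    // Runs the action once all shards closed by states applied so far have closed.
    void onShardsClosed(Runnable action) {
        last.addListener(action);
    }

    public static void main(String[] args) {
        SkipEmptyLinks service = new SkipEmptyLinks();
        List<String> ran = new ArrayList<>();
        List<Runnable> first = service.applyState(1);
        service.onShardsClosed(() -> ran.add("r1"));
        service.applyState(0); // closes nothing, adds no link
        service.onShardsClosed(() -> ran.add("r2"));
        List<Runnable> third = service.applyState(1);
        service.onShardsClosed(() -> ran.add("r3"));
        first.get(0).run();
        System.out.println(ran + " links=" + service.links); // prints [r1, r2] links=2
        third.get(0).run();
        System.out.println(ran); // prints [r1, r2, r3]
    }
}
```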

sometimes fork the thread

054f5ee

albertzaharovits requested a review from DaveCTurner August 7, 2025 13:11

albertzaharovits self-assigned this Aug 7, 2025

albertzaharovits added the >bug, :Distributed Coordination/Cluster Coordination, v9.2.0, v8.19.2, and v9.1.2 labels Aug 7, 2025

elasticsearchmachine added the Team:Distributed Coordination label Aug 7, 2025

Update docs/changelog/132536.yaml

73e3304

DaveCTurner reviewed Aug 7, 2025

albertzaharovits added 2 commits August 8, 2025 10:24

Merge branch 'main' into fix-3855

fa49097

Avoid chaining when no shard has been closed

3a00599

albertzaharovits added 8 commits August 8, 2025 12:59

Merge branch 'main' into fix-3855

e18cbd5

ooops

e07e36d

Test skeleton

c6f1d0b

Test WIP

447e140

WIP no threadpool

40bfdf7

Merge branch 'main' into fix-3855

027380e

test done

13eecf9

nit

b6d6742

albertzaharovits force-pushed the fix-3855 branch from f4c5977 to b6d6742 Compare August 11, 2025 14:35

albertzaharovits requested a review from DaveCTurner August 11, 2025 14:36

[CI] Auto commit changes from spotless

fa44494

elasticsearchmachine removed the v8.19.2 label Aug 11, 2025

elasticsearchmachine added v8.19.3 v9.1.3 and removed v9.1.2 labels Aug 11, 2025

Merge branch 'main' into fix-3855

2d5d9ca
