@liuml07 (Member) commented Oct 11, 2025

What is the purpose of the change

https://issues.apache.org/jira/browse/FLINK-38499

Currently, the Curator framework used by ZooKeeper-based HA uses an exponential backoff retry policy whose max sleep time is unbounded. When the retryCount is large, the sleep between retries can grow without limit, and recovery from ZK issues may be unreasonably slow.

In my day job, we carry a critical patch that limits the max sleep time, added after we saw multiple ZK issues in the past. In other Apache projects, BoundedExponentialBackoffRetry is widely used, for example in Fluss, Druid, Hudi, BookKeeper, and Phoenix, to name a few.

This Jira proposes to limit the max sleep time by using BoundedExponentialBackoffRetry, with a deliberately high default value to start with. Users can change the limit via a new config option.
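To illustrate why the cap matters, here is a minimal stdlib-only sketch. It is not Curator's or Flink's code. It assumes the sleep for a retry is the base wait multiplied by a random factor between 1 and 2^(retryCount + 1) - 1, which is my understanding of Curator's exponential backoff, and it computes only the worst case. The class name, method names, and the 1000 ms base wait are hypothetical; the 30 second cap is the default proposed in this PR.

```java
import java.time.Duration;

/** Stdlib-only model of exponential backoff with and without a cap (not Curator's actual code). */
public class BackoffCapSketch {

    /**
     * Worst-case sleep for a retry, assuming sleep = base * random(1 .. 2^(retryCount + 1) - 1).
     * The upper bound is reached with the largest multiplier.
     */
    public static long worstCaseSleepMs(long baseSleepMs, int retryCount) {
        long maxMultiplier = (1L << (retryCount + 1)) - 1;
        return baseSleepMs * maxMultiplier;
    }

    /** The same worst case, clamped to the configured maximum wait. */
    public static long cappedSleepMs(long baseSleepMs, Duration maxWait, int retryCount) {
        return Math.min(maxWait.toMillis(), worstCaseSleepMs(baseSleepMs, retryCount));
    }

    public static void main(String[] args) {
        long baseSleepMs = 1_000L;                  // hypothetical base wait between retries
        Duration maxWait = Duration.ofSeconds(30);  // default cap proposed in this PR
        for (int retry = 0; retry <= 10; retry += 2) {
            System.out.printf(
                    "retry=%d uncapped<=%dms capped<=%dms%n",
                    retry,
                    worstCaseSleepMs(baseSleepMs, retry),
                    cappedSleepMs(baseSleepMs, maxWait, retry));
        }
    }
}
```

Under these assumptions, the worst-case uncapped sleep at retry 10 is 2,047,000 ms (about 34 minutes), while the capped value never exceeds 30,000 ms.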

Brief change log

  1. Added a new configuration option for HA:
     • Key: high-availability.zookeeper.client.max-retry-wait
     • Type: Duration
     • Default: 30 seconds (30000ms)
     • Description: Caps exponential backoff to prevent excessively long waits between retries
  2. Updated retry policy in ZooKeeperUtils
  3. Updated test files to use the new retry policy

Verifying this change

Updated existing tests. Ported from internally tested patch.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@flinkbot (Collaborator) commented Oct 11, 2025

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

@liuml07 liuml07 force-pushed the FLINK-38499 branch 2 times, most recently from 00f293c to b8964bf Compare October 14, 2025 00:18