[SPARK-53157][CORE] Decouple driver and executor heartbeat polling intervals #51885

Open. ForVic wants to merge 2 commits into base: master.


@ForVic ForVic commented Aug 6, 2025

What changes were proposed in this pull request?

Add a config, spark.driver.heartbeatInterval, and schedule driver heartbeats at that interval.
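
As a rough illustration of how the two settings could be supplied independently, here is a minimal sketch that builds `--conf` arguments for spark-submit. `heartbeat_conf_args` is a hypothetical helper, not part of Spark. `spark.driver.heartbeatInterval` is the config proposed by this PR and is only available if the change is merged; the sketch assumes it accepts the same time-string format (for example `1s`) as the existing `spark.executor.heartbeatInterval`.

```python
from typing import List


def heartbeat_conf_args(driver_interval: str, executor_interval: str = "10s") -> List[str]:
    """Build spark-submit --conf arguments for the two heartbeat intervals.

    Hypothetical helper for illustration only.
    """
    settings = {
        # Proposed by this PR; assumed to take a time string such as "1s".
        "spark.driver.heartbeatInterval": driver_interval,
        # Existing Spark setting; its documented default is 10s.
        "spark.executor.heartbeatInterval": executor_interval,
    }
    args: List[str] = []
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Heartbeat the driver every second while leaving executors at 10s.
print(" ".join(heartbeat_conf_args("1s")))
```

The point of the sketch is only that the driver value can change while the executor value stays the same.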

Why are the changes needed?

Decouple the driver and executor heartbeat intervals. Because memory metrics are only sampled at each reporting interval, we do not have a fully accurate view of stats at drivers and executors. This is most noticeable at the driver, which lacks the larger sample size that metrics from the N executors in an application provide.

This change provides a way to increase (or, more generally, change) the rate at which metrics are collected at the driver, to help overcome the sampling problem without requiring users to also increase executor heartbeat frequencies.
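
The sampling problem can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: the trace values are hypothetical (not measurements from this PR), and `sampled_peak` is an illustrative function that stands in for "read the metric once per heartbeat", not Spark's actual collection code.

```python
def sampled_peak(trace, interval):
    """Return the largest value seen when reading `trace` every `interval` ticks."""
    return max(trace[::interval])


# Hypothetical heap usage, one value per second: a steady baseline of 2.0
# with a 3-second spike to 5.0 (illustrative numbers only).
trace = [2.0] * 60
trace[33:36] = [5.0, 5.0, 5.0]

# Reading every 10 ticks hits t = 0, 10, ..., 50 and misses the spike.
print(sampled_peak(trace, 10))  # 2.0
# Reading every tick sees the spike.
print(sampled_peak(trace, 1))   # 5.0
```

A single process with one coarse sampler can miss a short spike entirely, which is why a shorter interval at the driver can report a higher, more realistic peak.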

Does this PR introduce any user-facing change?

Yes, it introduces a new Spark config, spark.driver.heartbeatInterval.

How was this patch tested?

Verified that metric collection improved when the sampling rate was increased, and that the number of events matched expectations when the rate was changed.
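
For the event-count check, a back-of-the-envelope estimate is simply run duration divided by heartbeat interval. The sketch below assumes a fixed-rate schedule and ignores startup delay and scheduling jitter; `expected_heartbeats` is a hypothetical helper, and real counts may differ by a few events.

```python
def expected_heartbeats(duration_s: float, interval_s: float) -> int:
    """Approximate number of heartbeats emitted over `duration_s` seconds."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Fixed-rate assumption: one heartbeat per full interval elapsed.
    return int(duration_s // interval_s)


# A 10-minute run at a 10s interval versus a 1s interval.
print(expected_heartbeats(600, 10))  # 60
print(expected_heartbeats(600, 1))   # 600
```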

Methodology for validating that a shorter driver heartbeat interval (more frequent heartbeats) improves memory metric collection:

  1. Using a 6gb driver heap, wrote a job that broadcasts a table, gradually increasing the size of the table until OOM.
  2. Increased driver memory to 10gb, large enough for the same broadcast to succeed.
  3. Repeated this job and tracked the peak memory usage written to the event log.
  4. After repeated experiments, observed that the typical (median) peak heap usage tracked was <=5GiB.
  5. Applied this change and decreased the driver heartbeat interval.
  6. Re-ran the same jobs with a 10gb heap and saw that the typical peak memory usage tracked was ~8GiB, more accurately reflecting the increased memory needs.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Aug 6, 2025
@ForVic ForVic marked this pull request as ready for review August 6, 2025 21:40
ForVic commented Aug 6, 2025

cc @mridulm @robreeves @shardulm94
