
Conversation

@0xffff-zhiyan (Contributor) commented Aug 14, 2025

The JMX metric RequestHandlerAvgIdlePercent reports a value close to 2 in
combined KRaft mode, but it is expected to be between 0 and 1.

[Screenshot 2025-08-14 at 12:41:42]

This issue is specific to KRaft combined mode: the controller and the broker
share the same Meter object (defined in
RequestThreadIdleMeter#requestThreadIdleMeter), but they use separate
KafkaRequestHandlerPool objects, each with
threadPoolSize == KafkaConfig.numIoThreads. When calculating idle time, each
pool divides by its own numIoThreads value before reporting to the shared
meter, and the Meter computes the final result by accumulating the values
reported by all threads. Since there are actually 2 × numIoThreads threads
contributing to the metric, the denominator should be doubled to get the
correct average.
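
A minimal sketch of the accumulation problem, using a stand-in for the shared Yammer Meter (SharedMeter and the other names below are illustrative, not the actual Kafka code): each pool scales its threads' idle time by its own size only, so two fully idle pools sum to ~2 on the shared meter.

object IdleMeterSketch {
  // Stand-in for the shared Yammer Meter: it simply accumulates whatever each
  // handler thread reports.
  final class SharedMeter {
    private var total = 0.0
    def mark(ratio: Double): Unit = synchronized { total += ratio }
    def value: Double = synchronized { total }
  }

  def main(args: Array[String]): Unit = {
    val numIoThreads = 8
    val sharedMeter = new SharedMeter

    // Broker pool: each of its numIoThreads threads is (say) fully idle and
    // reports 1.0 / numIoThreads, so the pool contributes ~1.0 in total.
    (1 to numIoThreads).foreach(_ => sharedMeter.mark(1.0 / numIoThreads))

    // Controller pool in combined mode does the same with its own numIoThreads,
    // contributing another ~1.0 to the very same meter.
    (1 to numIoThreads).foreach(_ => sharedMeter.mark(1.0 / numIoThreads))

    // Result: ~2.0, even though no single thread is more than 100% idle.
    println(f"reported idle percent = ${sharedMeter.value}%.2f")
  }
}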

POC in local environment: [Screenshot 2025-08-14 at 14:27:32]

Changes:

  1. When creating dataPlaneRequestHandlerPool, pass
     KafkaConfig.isKRaftCombinedMode; in combined mode the pool adjusts the
     idle calculation by dividing by 2 * totalHandlerThreads, i.e. the total
     thread count across both pools (see the sketch after this list).
  2. When resizing the thread pool, update the pool size each time a thread is
     created or destroyed so the size stays up to date.
  3. Add a unit test mocking combined mode, where the broker and controller
     share the Meter but use separate KafkaRequestHandlerPool instances.
  4. Add a unit test for thread pool resizing.
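
A rough sketch of the adjusted idle calculation described in change 1, assuming both pools are configured with the same numIoThreads (the class and method names here are illustrative, not the actual KafkaRequestHandlerPool code):

// Illustrative only: in combined mode two equally sized pools feed the shared
// meter, so each thread's idle time is divided by the thread count across both
// pools instead of the size of its own pool.
class IdleRatioSketch(poolSize: Int, isCombinedMode: Boolean) {
  private def effectiveThreadCount: Int =
    if (isCombinedMode) 2 * poolSize else poolSize

  // Fraction of the measurement interval this thread spent idle, scaled so
  // that the contributions of all threads from both pools sum to at most 1.
  def idleContribution(idleTimeNs: Long, intervalNs: Long): Double =
    idleTimeNs.toDouble / intervalNs / effectiveThreadCount
}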

https://issues.apache.org/jira/browse/KAFKA-19606

github-actions bot added the triage (PRs from the community) and core (Kafka Broker) labels on Aug 14, 2025
@kevin-wu24 (Contributor) left a comment


Thanks for the changes @0xffff-zhiyan. Left a couple of comments. Can you attach the screenshots showcasing this as a comment in this thread?

Comment on lines +268 to +276
assertEquals(3, pool.threadPoolSize.get)

// grow to 5
pool.resizeThreadPool(5)
assertEquals(5, pool.threadPoolSize.get)

// shrink to 2
pool.resizeThreadPool(2)
assertEquals(2, pool.threadPoolSize.get)
Contributor

I'm not sure this test tests anything new, since the previous implementation would pass it as well.

Contributor Author (@0xffff-zhiyan)

This only tests the pool’s resize, including expansion and shrinking, since that’s the part I modified. The previous tests didn’t specifically cover this, so this ensures the pool resizes to the correct size.

Thread.sleep(200)
value = sharedMeter.oneMinuteRate()
}
assertTrue(value >= 0.0 && value <= 1.05, s"idle percent should be within [0,1], got $value")
Contributor

Shouldn't this assert be [0,1], not [0,1.05]?

Contributor Author (@0xffff-zhiyan)

Yes, but I want to allow a small amount of fluctuation to tolerate minor measurement errors.

@kevin-wu24 (Contributor) commented Aug 15, 2025

Why would there be a measurement error? Ever having a measurement of >1 indicates a bug in our logic determining the denominator.

@jsancio (Member) left a comment

Thanks for the improvement @0xffff-zhiyan. I have one high-level suggestion.

Comment on lines +96 to +97
nodeName: String = "broker",
val isCombinedMode: Boolean = false,
Member

This is a very ad-hoc solution which assumes one or two request handler pool(s). To solve this problem, I see two options:

  1. Keep track of the number of threads in a pool and the number of threads across all of the request handler pools. The metrics would use the global count when computing the average thread idle ratio.
  2. Have two different metrics for the broker's request handler versus the controller's request handler. This would of course require a KIP.

I am leaning toward option 1 for now. In the future we can always do option 2, which would at a high level define 3 metrics:

  1. RequestHandlerAvgIdlePercent, which reports the thread idle ratio for all of the request pools
  2. BrokerRequestHandlerAvgIdlePercent, which reports the thread idle ratio for the broker request pool
  3. ControllerRequestHandlerAvgIdlePercent, which reports the thread idle ratio for the controller request pool
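
A possible shape for option 1, as a hypothetical sketch (GlobalHandlerThreads and PoolSketch are invented names for illustration; the real change would live in KafkaRequestHandlerPool):

import java.util.concurrent.atomic.AtomicInteger

// Hypothetical sketch: every request handler pool registers its threads with a
// shared counter, and each thread's idle contribution divides by the global
// count rather than the size of any single pool.
object GlobalHandlerThreads {
  private val total = new AtomicInteger(0)
  def add(n: Int): Unit = total.addAndGet(n)
  def get: Int = total.get()
}

class PoolSketch(initialSize: Int) {
  private var currentSize = initialSize
  GlobalHandlerThreads.add(initialSize)

  def resize(newSize: Int): Unit = synchronized {
    GlobalHandlerThreads.add(newSize - currentSize)
    currentSize = newSize
  }

  // The accumulated average stays within [0, 1] regardless of how many pools
  // exist, because every contribution uses the same global denominator.
  def idleContribution(idleTimeNs: Long, intervalNs: Long): Double =
    idleTimeNs.toDouble / intervalNs / GlobalHandlerThreads.get
}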

github-actions bot removed the triage (PRs from the community) label on Aug 16, 2025