-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-53690][SS] Fix exponential formatting of avgOffsetsBehindLatest in Kafka sources object in progress json #52445
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…t in Kafka's source metrics in json
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks good to me. It'd be ideal if we could come up with test in Kafka data source to verify the change in e2e. If the number should be huge and it's not appropriate for testing, please let me know and we can skip adding e2e test.
("metrics" -> safeMapToJValue[String]( | ||
metrics, | ||
(metricsName, s) => | ||
// SPARK-53690: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically speaking, we shouldn't have a Kafka specific handling in here and instead should build some sort of API to indicate whether we have to remove exponential format on specific metric or not.
But it's a single occurrence and Kafka is a built-in connector, so I guess it is OK. Safer approach is to check whether the source is Kafka, but maybe the metric name is not very common to be conflicted with others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, i agree. Since its just one metrics from kafka, it should be fine to be part of this.
| "total" : 100 | ||
| }, | ||
| "stateOperators" : [ ], | ||
| "sources" : [ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a consistent way to trigger scientific notation in Kafka data source test? I understand this is the simplest way to test this, but we'd like to make sure it works with actual Kafka data source.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, i gave a try to run a very large input batch from kafka, however, the test case got stuck for 10+ minutes. I am unsure if having a real bulky test case would be optimal for code builds. What are your thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's OK to skip it. Instead, could you please update the json format string to reflect the real Kafka read? Could you please paste the event json (you can redact some information) you saw the issue?
You can change the number for the test but let's not make this be arbitrary in overall and make it look like a real one.
What changes were proposed in this pull request?
This PR fixes an issue where
avgOffsetsBehindLatest
metric of Kafka sources object from streaming progress metrics JSON were displayed in scientific notation (e.g., 2.70941269E8). The fix uses safe Decimal casting to ensure values are displayed in a more human-readable format.Before change:
After change:
Why are the changes needed?
Current formatting is not user-friendly. A user can easily interpret
2.70941269E8
as2.7
instead of270,941,269
, as E can be missed to be spotted. This fix will improve the readability of Spark Structured Streaming progress metrics JSON.Does this PR introduce any user-facing change?
No
How was this patch tested?
Run this Maven test:
Results:
Was this patch authored or co-authored using generative AI tooling?
No