[SPARK-53696][PYTHON][CONNECT][SQL] Default to bytes for BinaryType in PySpark #52467
Conversation
```diff
 safecheck,
 input_types,
-int_to_decimal_coercion_enabled=False,
+int_to_decimal_coercion_enabled,
```
The default value for int_to_decimal_coercion_enabled is not used at all.
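A minimal sketch of the pattern this comment points at, assuming every call site passes the flag explicitly; the class name below is hypothetical, only safecheck, input_types, and int_to_decimal_coercion_enabled appear in the diff above:

```python
# Hypothetical serializer mirroring the diff above. Since callers always pass
# int_to_decimal_coercion_enabled explicitly, a "=False" default would never be
# exercised, so the parameter is made required instead.
class ArrowBatchSerializer:
    def __init__(self, safecheck, input_types, int_to_decimal_coercion_enabled):
        self.safecheck = safecheck
        self.input_types = input_types
        self.int_to_decimal_coercion_enabled = int_to_decimal_coercion_enabled
```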
python/pyspark/sql/tests/arrow/test_arrow_binary_as_bytes_udf.py
Can we also add this change to the migration guide?
Where shall I add it? I found that the PySpark migration guide is archived: https://github.com/apache/spark/blob/master/docs/pyspark-migration-guide.md
```python
def _get_binary_as_bytes(self) -> bool:
    """Get the binary_as_bytes configuration value from the Spark session."""
    conf_value = self._session.conf.get("spark.sql.execution.pyspark.binaryAsBytes", "true")
    return (conf_value or "true").lower() == "true"
```
Which case is (conf_value or "true") for?
The method self._session.conf.get returns only an Optional[str]. I think the conf is retrieved from the Spark Connect server (code pointer). This way, mypy does not complain about type incompatibility.
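A small self-contained sketch of the typing issue being discussed, assuming conf.get is annotated as returning Optional[str] even when a default is supplied:

```python
from typing import Optional

def _parse_bool_conf(conf_value: Optional[str]) -> bool:
    # mypy treats conf_value as possibly None, so the "or" fallback narrows it
    # to str before calling .lower(), avoiding a type-incompatibility error.
    return (conf_value or "true").lower() == "true"

assert _parse_bool_conf("TRUE") is True
assert _parse_bool_conf(None) is True
assert _parse_bool_conf("false") is False
```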
What changes were proposed in this pull request?
Currently, BinaryType is mapped inconsistently in PySpark: in some cases it is mapped to bytearray, and in other cases to bytes.
This complicates the data mapping model. With this PR, BinaryType will be consistently mapped to bytes in all of these cases.
We gate the change with a SQL conf and enable the conversion to bytes by default.
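A hedged sketch of the intended user-visible behavior; the conf name comes from the diff above, but exactly which code paths (collect, UDF inputs, etc.) switch from bytearray to bytes is an assumption here:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import BinaryType, StructField, StructType

spark = SparkSession.builder.getOrCreate()
# Enabled by default with this PR; set explicitly here only for illustration.
spark.conf.set("spark.sql.execution.pyspark.binaryAsBytes", "true")

df = spark.createDataFrame(
    [(b"\x00\x01",)], StructType([StructField("b", BinaryType())])
)
value = df.collect()[0]["b"]
print(type(value))  # expected <class 'bytes'>; previously bytearray in some of these paths
```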
This PR is based on #52370
Why are the changes needed?
bytes is more efficient, as it is immutable and requires zero copies.
Does this PR introduce any user-facing change?
Yes. For the aforementioned cases where BinaryType was mapped to bytearray, we changed the mapping to bytes.
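For code that still relies on bytearray, the change is gated by the SQL conf introduced in this PR; a brief sketch of opting out for a session (the conf name is from this PR, the rollback semantics are an assumption):

```python
# Revert to the previous mapping (BinaryType -> bytearray) for this session.
spark.conf.set("spark.sql.execution.pyspark.binaryAsBytes", "false")
```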
Many tests.
Was this patch authored or co-authored using generative AI tooling?
Yes, with the help of Claude Code.