
Conversation


@xianzhe-databricks commented on Sep 26, 2025

What changes were proposed in this pull request?

Currently, BinaryType is mapped inconsistently in PySpark:

Cases where it is mapped to bytearray:

  1. regular UDF without arrow optimization.
  2. regular UDF with arrow optimization, and without legacy pandas conversion.
  3. Spark Connect DataFrame APIs, e.g. df.collect(), df.toLocalIterator(), df.foreachPartition()
  4. Data source read & write.

Cases where it is mapped to bytes:

  1. regular UDF with arrow optimization and legacy pandas conversion.
  2. Classic DataFrame APIs, e.g. df.collect(), df.toLocalIterator(), df.foreachPartition().

This complicates the data mapping model. With this PR, BinaryType is consistently mapped to bytes in all of the aforementioned cases.
The change is gated behind the SQL conf spark.sql.execution.pyspark.binaryAsBytes, with the conversion to bytes enabled by default.

This PR is based on #52370
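
For illustration, a minimal sketch of the intended behavior under the new conf (this assumes a running SparkSession named `spark`; before this change, whether bytes or bytearray shows up depends on the execution path listed above):

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Enable the new behavior (proposed as the default in this PR).
spark.conf.set("spark.sql.execution.pyspark.binaryAsBytes", "true")

df = spark.createDataFrame([(bytearray(b"abc"),)], "b binary")

# A UDF that reports which Python type it receives for the BinaryType column.
report_type = udf(lambda b: type(b).__name__, StringType())
df.select(report_type("b")).show()       # expected: bytes (previously bytearray on some paths)

# DataFrame APIs such as collect() are expected to surface bytes as well.
print(type(df.collect()[0].b).__name__)  # expected: bytes
```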

Why are the changes needed?

bytes is more efficient, as it is immutable and can be surfaced without an extra copy.
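
As a small illustration of the immutability point (a generic Python sketch, not code from this PR):

```python
b = b"abc"              # bytes: immutable and hashable
ba = bytearray(b"abc")  # bytearray: mutable

hash(b)        # works; bytes can be used as dict keys and shared safely
# hash(ba)     # would raise TypeError: unhashable type: 'bytearray'

ba[0] = 0x78   # callers can mutate a bytearray in place,
               # so defensive copies are often needed when sharing it
```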

Does this PR introduce any user-facing change?

Yes. For the aforementioned cases where BinaryType was mapped to bytearray, the mapping is changed to bytes.

How was this patch tested?

Many tests.

Was this patch authored or co-authored using generative AI tooling?

Yes, with the help of Claude Code.

safecheck,
input_types,
int_to_decimal_coercion_enabled=False,
int_to_decimal_coercion_enabled,
@xianzhe-databricks (Author):

The default value for int_to_decimal_coercion_enabled is not used at all.

@github-actions bot added the DOCS label on Sep 29, 2025
@xianzhe-databricks changed the title from "[SPARK-53696][PYTHON]Default to bytes for BinaryType in PySpark UDF" to "[SPARK-53696][PYTHON]Default to bytes for BinaryType in PySpark arrow UDF" on Sep 29, 2025
@github-actions bot added the BUILD label on Sep 29, 2025
@allisonwang-db (Contributor) left a comment:

Can we also add this change to the migration guide?

@xianzhe-databricks changed the title from "[SPARK-53696][PYTHON]Default to bytes for BinaryType in PySpark arrow UDF" to "[SPARK-53696][PYTHON]Default to bytes for BinaryType in PySpark" on Sep 30, 2025
@github-actions bot removed the BUILD label on Sep 30, 2025
@xianzhe-databricks marked this pull request as ready for review on September 30, 2025 15:06
@xianzhe-databricks (Author):

> Can we also add this change to the migration guide?

Where shall I add it? I found that the PySpark migration guide is archived: https://github.com/apache/spark/blob/master/docs/pyspark-migration-guide.md

@github-actions bot added the AVRO label on Oct 1, 2025
def _get_binary_as_bytes(self) -> bool:
"""Get the binary_as_bytes configuration value from Spark session."""
conf_value = self._session.conf.get("spark.sql.execution.pyspark.binaryAsBytes", "true")
return (conf_value or "true").lower() == "true"
A reviewer (Member) commented:

Which case is (conf_value or "true") for?

@xianzhe-databricks (Author):

The method self._session.conf.get returns an Optional[str]; I think the conf is retrieved from the Spark Connect server (code pointer).

This way, mypy does not complain about type incompatibility.
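
For context, a hypothetical stand-alone sketch of the mypy concern (get_conf below is a stand-in for self._session.conf.get, which is annotated to return Optional[str]):

```python
from typing import Optional

def get_conf(key: str, default: str) -> Optional[str]:
    # Stand-in: the real conf.get may return None per its type annotation,
    # even when a default is supplied.
    return default

conf_value = get_conf("spark.sql.execution.pyspark.binaryAsBytes", "true")
# conf_value.lower() alone would fail mypy's Optional check;
# (conf_value or "true") narrows the type to str before calling .lower().
binary_as_bytes = (conf_value or "true").lower() == "true"
```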

@xianzhe-databricks changed the title from "[SPARK-53696][PYTHON]Default to bytes for BinaryType in PySpark" to "[SPARK-53696][PYTHON][CONNECT]Default to bytes for BinaryType in PySpark" on Oct 2, 2025
@xianzhe-databricks changed the title from "[SPARK-53696][PYTHON][CONNECT]Default to bytes for BinaryType in PySpark" to "[SPARK-53696][PYTHON][CONNECT][SQL]Default to bytes for BinaryType in PySpark" on Oct 2, 2025