[SPARK-53738][SQL] PlannedWrite should preserve custom sort order when query output contains literal #52474

pan3793 · 2025-09-27T01:57:29Z

What changes were proposed in this pull request?

This PR fixes a bug in plannedWrite that occurs when the query outputs a literal for a partition column.

CREATE TABLE t (i INT, j INT, k STRING) USING PARQUET PARTITIONED BY (k);

INSERT OVERWRITE t SELECT j AS i, i AS j, '0' as k FROM t0 SORT BY k, i;

The evaluation of FileFormatWriter.orderingMatched fails because SortOrder(Literal) is eliminated by EliminateSorts.

The idea is to expose and keep "constant order" expressions from child.outputOrdering.

Why are the changes needed?

V1Writes overrides the custom sort order when the query's output ordering does not satisfy the required ordering. Before SPARK-53707, when the query output contained literals in partition columns, this check produced a false negative, so the custom sort order did not take effect.

SPARK-53707 fixes the issue accidentally (and partially) by adding a Project of query in V1Writes.

Before SPARK-53707

Sort [0 ASC NULLS FIRST, i#280 ASC NULLS FIRST], false
+- Project [j#287 AS i#280, i#286 AS j#281, 0 AS k#282]
   +- Relation spark_catalog.default.t0[i#286,j#287,k#288] parquet

After SPARK-53707

Project [i#284, j#285, 0 AS k#290]
+- Sort [0 ASC NULLS FIRST, i#284 ASC NULLS FIRST], false
   +- Project [i#284, j#285]
      +- Relation spark_catalog.default.t0[i#284,j#285,k#286] parquet

This PR fixes the issue thoroughly, with a new UT added.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

UT is added.

Was this patch authored or co-authored using generative AI tooling?

No.

pan3793 · 2025-09-27T04:45:27Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala

+    val literalColumns = queryOutput.flatMap { ne => isLiteral(ne, ne.name) }
+
+    // We should first sort by dynamic partition columns, then bucket id, and finally sorting
+    // columns, then drop literal columns


Here, SortOrder(Literal) has been eliminated by EliminateSorts

So you basically change requiredOrdering and drop those columns that are defined as literals in outputExpressions of the top OrderPreservingUnaryExecNode node. But what if a literal definition is not in the top node?

Shouldn't we do the other way around and fix actualOrdering? I mean if we have a query:

+- Project [i, j, 0 AS k]
   + ...
      +- Sort [i]
         +- Relation

Then shouldn't actualOrdering (outputOrdering of the Project) be Seq(SortOrder(k), SortOrder(i)), as we know that k is a constant? I.e. Project could prepend its constants to the alias-transformed child.outputOrdering.

And similarly, when we have:

+- Sort [i]
   + ...
      +- Project [i, j, 0 AS k]
         +- Relation

Then shouldn't outputOrdering of the Project node be Seq(SortOrder(k)), and outputOrdering of the Sort be Seq(SortOrder(k), SortOrder(i))? I.e. Project could somehow mark that SortOrder(k) as "constant order", and Sort should just extend "constant order" expressions from child.outputOrdering with the new order expressions (i).
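To make the proposal above concrete, here is a minimal sketch of how "constant order" could propagate through Project and Sort in both plan shapes. It is a toy model, not Spark's implementation: `Order`, `projectOrdering`, `sortOrdering` and `satisfies` are hypothetical names, columns are plain strings, and the sketch appends constant orders rather than prepending them, which does not matter for the check because constant columns are skipped.

```scala
// Toy model only: none of these types are Spark classes.
object ConstantOrderSketch {
  sealed trait Direction
  case object Ascending extends Direction
  case object Constant extends Direction // the column holds a single value

  final case class Order(column: String, direction: Direction)

  // Project: keep the child's ordering and add a constant order per literal output.
  def projectOrdering(childOrdering: Seq[Order], literalOutputs: Seq[String]): Seq[Order] =
    childOrdering ++ literalOutputs.map(Order(_, Constant))

  // Sort: the new sort keys replace the child's ordering, but constant orders survive.
  def sortOrdering(childOrdering: Seq[Order], sortKeys: Seq[String]): Seq[Order] =
    sortKeys.map(Order(_, Ascending)) ++ childOrdering.filter(_.direction == Constant)

  // The required ordering is satisfied if, after dropping columns known to be
  // constant from both sides, the required columns are a prefix of the actual ones.
  def satisfies(actual: Seq[Order], required: Seq[String]): Boolean = {
    val constants = actual.collect { case Order(c, Constant) => c }.toSet
    val effectiveRequired = required.filterNot(constants.contains)
    val effectiveActual = actual.map(_.column).filterNot(constants.contains)
    effectiveActual.startsWith(effectiveRequired)
  }
}
```

In this model both plan shapes report an ordering that satisfies a required ordering of (k, i), while an ordering of (i) alone, with no knowledge that k is constant, does not.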

@peter-toth thanks for your tips. That sounds reasonable, and I have updated the code to follow this approach; it is effective both with and without SPARK-53707.

pan3793 · 2025-09-27T04:47:31Z

cc @ulysses-you @peter-toth


pan3793 · 2025-09-27T23:46:32Z

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/V1WriteCommandSuite.scala


    val listener = new QueryExecutionListener {
      override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
+        val conf = qe.sparkSession.sessionState.conf


This is a bugfix. The listener runs in another thread, so without this change conf.getConf actually reads the conf from the thread-local, which may cause issues when tests run concurrently.
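The pitfall described here can be illustrated without Spark. The sketch below is a simplified stand-in, assuming only that the conf is looked up through a thread-local: `activeConf`, `setActive`, `currentConf` and `onOtherThread` are hypothetical names, and the "plannedWrite" key is made up for the example.

```scala
object ThreadLocalConfSketch {
  // Hypothetical stand-in for a conf that is resolved through a thread-local.
  private val activeConf = new ThreadLocal[Map[String, String]] {
    override def initialValue(): Map[String, String] = Map("plannedWrite" -> "default")
  }

  def setActive(conf: Map[String, String]): Unit = activeConf.set(conf)
  def currentConf: Map[String, String] = activeConf.get()

  // Evaluates `body` on a fresh thread, the way a listener callback might run,
  // and returns its result to the caller.
  def onOtherThread[T](body: => T): T = {
    var result: Option[T] = None
    val t = new Thread(() => { result = Some(body) })
    t.start()
    t.join()
    result.get
  }
}
```

A lookup performed on the other thread sees the thread-local default, whereas a conf captured explicitly on the calling thread (analogous to reading it from qe.sparkSession.sessionState.conf) carries the intended value across threads.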

pan3793 · 2025-09-28T03:54:07Z

cc @cloud-fan, could you please take a look?

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

peter-toth · 2025-09-28T07:59:51Z

@pan3793, I can take a deeper look at this PR tomorrow, but the OrderPreservingUnaryNode-related changes look a bit unexpected to me at first sight. Can you please elaborate on why those are needed?

pan3793 · 2025-09-29T03:08:31Z

@peter-toth Let me take this example to explain what I'm trying to do,

CREATE TABLE t (i INT, j INT, k STRING) USING PARQUET PARTITIONED BY (k);

INSERT OVERWRITE t SELECT j AS i, i AS j, '0' as k FROM t0 SORT BY k, i;

In V1Writes.prepareQuery, the query looks like

Sort [0 ASC NULLS FIRST, i#280 ASC NULLS FIRST], false
+- Project [j#287 AS i#280, i#286 AS j#281, 0 AS k#282]
   +- Relation spark_catalog.default.t0[i#286,j#287,k#288] parquet

and query.outputOrdering is [0 ASC NULLS FIRST, i#280 ASC NULLS FIRST], while requiredOrdering is [k#282 ASC NULLS FIRST], thus orderingMatched will be false, then Sort(requiredOrdering, global = false, empty2NullPlan) will be added on top.

The idea is to leverage the alias information in Sort so that outputOrdering knows k is an alias of the literal 0, and thus outputOrdering can satisfy requiredOrdering.

BUT, when I debugged it last night, I found the issue had gone magically, and in V1Writes.prepareQuery, the query looked like:

Project [i#284, j#285, 0 AS k#290]
+- Sort [0 ASC NULLS FIRST, i#284 ASC NULLS FIRST], false
   +- Project [i#284, j#285]
      +- Relation spark_catalog.default.t0[i#284,j#285,k#286] parquet

After some investigation, I found this was accidentally fixed by SPARK-53707 (#52449), which got merged just a few days ago (I happened to start constructing the UT before it got in ...). It fixes the issue by adding a Project on the Sort, in the PreprocessTableInsertion rule.

Note: the physical plan change in this PR is still required to satisfy the UT.

Now, I'm not sure if this is still an issue ...

peter-toth · 2025-09-29T11:44:56Z

I see, thanks for the details @pan3793.

I feel that a more comprehensive fix would be to not change requiredOrdering but deal with constant expressions in outputOrdering of both Project and Sort: #52474 (comment)

But this is a complex topic, so @ulysses-you or @cloud-fan or others might have better ideas.

peter-toth · 2025-09-30T13:22:39Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala

+    newOrdering.takeWhile(_.isDefined).flatten.toSeq ++ outputExpressions.filter {
+      case Alias(child, _) => child.foldable
+      case expr => expr.foldable
+    }.map(SortOrder(_, Ascending).copy(isConstant = true))


Hm, do we need to add the whole Alias to SortOrder expression or could adding only the generated attribute work?
Also, I wonder if it would be a breaking change to add Constant as a new SortDirection instead of using a boolean flag?

@peter-toth the Constant SortDirection sounds like a good idea, I have updated the code to use it.

I have also tried adding only the generated attribute (I left the code as comments):

newOrdering.takeWhile(_.isDefined).flatten.toSeq ++ outputExpressions.flatMap {
  case alias @ Alias(child, _) if child.foldable => Some(SortOrder(alias.toAttribute, Constant))
  case expr if expr.foldable => Some(SortOrder(expr, Constant))
  case _ => None
}

but two tests fail (I haven't figured out the root cause):

[info] CachedTableSuite:
...
[info] - SPARK-36120: Support cache/uncache table with TimestampNTZ type *** FAILED *** (43 milliseconds)
[info] AttributeSet(TIMESTAMP_NTZ '2021-01-01 00:00:00'#17739) was not empty The optimized logical plan has missing inputs:
[info] InMemoryRelation [TIMESTAMP_NTZ '2021-01-01 00:00:00'#17776], StorageLevel(disk, memory, deserialized, 1 replicas)
[info] +- *(1) Project [2021-01-01 00:00:00 AS TIMESTAMP_NTZ '2021-01-01 00:00:00'#17739]
[info] +- *(1) Scan OneRowRelation[] (QueryTest.scala:241)
...
[info] - SPARK-52692: Support cache/uncache table with Time type *** FAILED *** (58 milliseconds)
[info] AttributeSet(TIME '22:00:00'#18852) was not empty The optimized logical plan has missing inputs:
[info] InMemoryRelation [TIME '22:00:00'#18889], StorageLevel(disk, memory, deserialized, 1 replicas)
[info] +- *(1) Project [22:00:00 AS TIME '22:00:00'#18852]
[info] +- *(1) Scan OneRowRelation[] (QueryTest.scala:241)
...

The problem seems to be that InMemoryRelation.withOutput() doesn't remap outputOrdering. Because outputOrdering is present in InMemoryRelation as a case class argument, the unmapped ordering attributes are considered missing inputs.

This seems to be another hidden issue with InMemoryRelation.outputOrdering that got exposed by this change.
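The failure mode can be reproduced in a small model. The following is a sketch under simplifying assumptions, not the actual InMemoryRelation code: `Attr`, `Relation`, `withOutputNaive`, `withOutputRemapped` and `missingInputs` are hypothetical names, and attributes are matched by position when the output is replaced.

```scala
object WithOutputSketch {
  // An attribute is identified by its name plus an expression id.
  final case class Attr(name: String, exprId: Long)

  final case class Relation(output: Seq[Attr], outputOrdering: Seq[Attr]) {
    // Naive version: swaps the output but leaves the ordering pointing at old ids.
    def withOutputNaive(newOutput: Seq[Attr]): Relation =
      copy(output = newOutput)

    // Remapping version: rewrites the ordering through an old -> new attribute map.
    def withOutputRemapped(newOutput: Seq[Attr]): Relation = {
      val mapping = output.zip(newOutput).toMap
      copy(
        output = newOutput,
        outputOrdering = outputOrdering.map(a => mapping.getOrElse(a, a)))
    }

    // Attributes referenced by the ordering that the relation does not produce.
    def missingInputs: Set[Attr] = outputOrdering.toSet -- output.toSet
  }
}
```

With the naive version, an ordering on the attribute with id 17739 is left dangling once the output is replaced by the attribute with id 17776, which mirrors the "missing inputs" message in the failing tests; the remapping version leaves nothing missing.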

I opened a small PR into this PR: pan3793#2. Hopefully it helps fix the above tests.

@peter-toth Many thanks for your professionalism and patience! I tested locally, and it did fix the issue. I have learned a lot from your review.

It's a pleasure working with you @pan3793!

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/V1Writes.scala

peter-toth · 2025-09-30T13:46:52Z

Thank you @pan3793. I like the new approach; I just have a few minor suggestions.
Can you please leave some comments in SortOrder to describe the new flag/ordering, and update the PR description to explain how constant order and its propagation solve the issue?

pan3793 · 2025-09-30T14:35:28Z

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala


+  override def makeCopy(newArgs: Array[AnyRef]): LogicalPlan = {
+    val copied = super.makeCopy(newArgs).asInstanceOf[InMemoryRelation]
+    copied.statsOfPlanToCache = this.statsOfPlanToCache


I feel this is a hidden bug just exposed by this change.

peter-toth · 2025-09-30T18:47:52Z

We should adjust SortOrder.orderingSatisfies() too.

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala

InMemoryRelation.withOutput fix

pan3793 · 2025-10-01T11:26:06Z

I believe all code issues are fixed. Let's wait for another round of CI; I will update the comment soon.

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala

peter-toth · 2025-10-02T09:56:25Z

cc @cloud-fan , @ulysses-you

github-actions bot added the SQL label Sep 27, 2025

pan3793 commented Sep 27, 2025


pan3793 marked this pull request as ready for review September 27, 2025 04:47

github-actions bot added the BUILD label Sep 27, 2025

[SPARK-53738][SQL] PlannedWrite should preserve custom sort order when query output contains literal

a0aa9f4

pan3793 force-pushed the SPARK-53738 branch from 9b549cd to a0aa9f4 September 27, 2025 23:21

github-actions bot removed the BUILD label Sep 27, 2025

pan3793 commented Sep 27, 2025

peter-toth reviewed Sep 28, 2025

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala

peter-toth reviewed Sep 28, 2025

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

semantic

2a9613b

pan3793 marked this pull request as draft September 28, 2025 16:32

pan3793 mentioned this pull request Sep 29, 2025

[SPARK-53707] Improve attribute metadata handling. #52449

Closed

revert unnecessary changes

2a7361c

pan3793 marked this pull request as ready for review September 29, 2025 10:11

constant order

8fdb230

peter-toth reviewed Sep 30, 2025

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/V1Writes.scala

fix npe

376e2b6

pan3793 commented Sep 30, 2025

constant order direction

d439be1

peter-toth reviewed Oct 1, 2025

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/AliasAwareOutputExpression.scala

fix

ad09914

Merge pull request #2 from peter-toth/SPARK-53738

0430f18

InMemoryRelation.withOutput fix

Fix InMemoryRelation.doCanonicalize

2b1f8a5

peter-toth reviewed Oct 1, 2025

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala

pan3793 added 2 commits October 1, 2025 22:51

simplify

9ace506

comment

7ee4d92

peter-toth approved these changes Oct 2, 2025
