test: add SQL tests documenting Spark encode behavior#3975

Draft
andygrove wants to merge 1 commit into apache:main from andygrove:test-encode-sql

@andygrove commented Apr 17, 2026

Which issue does this PR close?

Closes #.

Rationale for this change

There is an upstream DataFusion PR to add a Spark-compatible encode expression: apache/datafusion#21331.

This PR adds SQL tests to Comet that document Spark's current encode behavior, which should make the DataFusion PR easier to review.

What changes are included in this PR?

Two files under spark/src/test/resources/sql-tests/expressions/string/:

  • encode.sql runs on every supported Spark version. It covers UTF-8, US-ASCII, ISO-8859-1, UTF-16 (with BOM), UTF-16BE, UTF-16LE, and UTF-32 (no BOM), plus emoji / surrogate pairs, empty strings, NULL inputs for both arguments, case-insensitive charset names, column versus literal arguments, and binary input with both valid and invalid UTF-8 bytes.
  • encode_strict.sql is gated by MinSparkVersion: 4.0. It pins Spark's charset whitelist (rejecting UTF-32BE, UTF-32LE, UTF8, UTF16, UTF16BE, ASCII, LATIN1, ISO88591, and EBCDIC with expect_error(INVALID_PARAMETER_VALUE.CHARSET)) and the raise-on-unmappable behavior (expect_error(MALFORMED_CHARACTER_CODING) for é in US-ASCII, Ā in ISO-8859-1, and an emoji in US-ASCII).
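
For reviewers who want a quick mental model of what the two files pin down, here is a rough Python sketch. It is not Spark's or Comet's implementation; `encode_sketch`, `InvalidCharset`, and `MalformedCharacterCoding` are hypothetical names, and the big-endian byte order for UTF-16 and UTF-32, as well as the empty-string handling, are assumptions based on how JVM charset encoders usually behave.

```python
import codecs


class InvalidCharset(Exception):
    """Stands in for Spark's INVALID_PARAMETER_VALUE.CHARSET error."""


class MalformedCharacterCoding(Exception):
    """Stands in for Spark's MALFORMED_CHARACTER_CODING error."""


# Charset names that encode.sql exercises as accepted (from the PR description).
ACCEPTED = {"UTF-8", "US-ASCII", "ISO-8859-1",
            "UTF-16", "UTF-16BE", "UTF-16LE", "UTF-32"}

# Hypothetical mapping to Python codecs with an explicit byte order.
# Assumption: plain UTF-16 and UTF-32 are big-endian, as on the JVM.
PY_CODEC = {
    "UTF-8": "utf-8",
    "US-ASCII": "ascii",
    "ISO-8859-1": "latin-1",
    "UTF-16": "utf-16-be",
    "UTF-16BE": "utf-16-be",
    "UTF-16LE": "utf-16-le",
    "UTF-32": "utf-32-be",
}


def encode_sketch(value, charset):
    """Approximate the strict (Spark 4.0) behavior described in this PR."""
    # A NULL in either argument yields NULL.
    if value is None or charset is None:
        return None
    # Charset names are matched case-insensitively against the whitelist.
    name = charset.upper()
    if name not in ACCEPTED:
        raise InvalidCharset(charset)
    try:
        body = value.encode(PY_CODEC[name])
    except UnicodeEncodeError as exc:
        # Unmappable characters raise instead of being replaced.
        raise MalformedCharacterCoding(charset) from exc
    # UTF-16 carries a BOM; UTF-32 does not.
    # Assumption: no BOM is emitted for an empty string.
    if name == "UTF-16" and body:
        return codecs.BOM_UTF16_BE + body
    return body
```

Under these assumptions, `encode_sketch("A", "UTF-16").hex()` gives `feff0041`, while `encode_sketch("A", "UTF8")` raises `InvalidCharset` and `encode_sketch("é", "US-ASCII")` raises `MalformedCharacterCoding`.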

All positive queries use query spark_answer_only because Comet currently falls back to Spark for encode. Error cases use query expect_error(...), which also works through the fallback path.

How are these changes tested?

Ran the new tests locally against both the default Spark 3.5 profile and the Spark 4.0 profile:

  • ./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite encode" -Dtest=none passes encode.sql and skips encode_strict.sql (as expected, since the strict file is gated by MinSparkVersion: 4.0).
  • ./mvnw -Pspark-4.0 test -Dsuites="org.apache.comet.CometSqlFileTestSuite encode" -Dtest=none passes both files.
