test: add SQL tests documenting Spark encode behavior #3975
Draft
andygrove wants to merge 1 commit into apache:main from
Which issue does this PR close?
Closes #.
Rationale for this change
There is an upstream DataFusion PR to add a Spark-compatible `encode` expression: apache/datafusion#21331. This PR adds tests to Comet to make it easier to review the DataFusion PR.
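For context, Spark's `encode(str, charset)` returns the binary encoding of a string in the given charset. A rough sketch of the kind of behavior the tests document, assuming Java's standard charset encoders (which is what Spark delegates to); results shown as hex bytes:

```sql
SELECT encode('abc', 'UTF-8');    -- 61 62 63
SELECT encode('abc', 'UTF-16BE'); -- 00 61 00 62 00 63
SELECT encode('abc', 'UTF-16');   -- big-endian BOM prepended: FE FF 00 61 00 62 00 63
```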
What changes are included in this PR?
Two files under `spark/src/test/resources/sql-tests/expressions/string/`:

- `encode.sql` runs on every supported Spark version. It covers UTF-8, US-ASCII, ISO-8859-1, UTF-16 (with BOM), UTF-16BE, UTF-16LE, and UTF-32 (no BOM), plus emoji / surrogate pairs, empty strings, NULL inputs for both arguments, case-insensitive charset names, column versus literal arguments, and binary input with both valid and invalid UTF-8 bytes.
- `encode_strict.sql` is gated by `MinSparkVersion: 4.0`. It pins Spark's charset whitelist (rejecting `UTF-32BE`, `UTF-32LE`, `UTF8`, `UTF16`, `UTF16BE`, `ASCII`, `LATIN1`, `ISO88591`, and `EBCDIC` with `expect_error(INVALID_PARAMETER_VALUE.CHARSET)`) and the raise-on-unmappable behavior (`expect_error(MALFORMED_CHARACTER_CODING)` for `é` in US-ASCII, `Ā` in ISO-8859-1, and an emoji in US-ASCII).

All positive queries use `query spark_answer_only` because Comet currently falls back to Spark for `encode`, and error cases use `query expect_error(...)`, which works through the fallback path as well. A sketch of both files follows below.
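To make the layout concrete, here is a minimal sketch of entries in the two files. The directive lines (`query spark_answer_only`, `query expect_error(...)`, `MinSparkVersion: 4.0`) are taken from the description above, but the exact grammar that `CometSqlFileTestSuite` parses is an assumption here, not copied from the diff:

```sql
-- Sketch of encode.sql (runs on all supported Spark versions):
query spark_answer_only
SELECT encode('hello', 'UTF-8');

-- Charset names are matched case-insensitively.
query spark_answer_only
SELECT encode('hello', 'utf-8');

-- NULL handling for both arguments.
query spark_answer_only
SELECT encode(NULL, 'UTF-8'), encode('hello', NULL);

-- Sketch of encode_strict.sql (Spark 4.0+ only):
MinSparkVersion: 4.0

-- Aliases outside Spark's charset whitelist are rejected.
query expect_error(INVALID_PARAMETER_VALUE.CHARSET)
SELECT encode('abc', 'UTF8');

-- Unmappable characters raise instead of being silently replaced.
query expect_error(MALFORMED_CHARACTER_CODING)
SELECT encode('é', 'US-ASCII');
```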
How are these changes tested?

Ran the new tests locally against both the default Spark 3.5 profile and the Spark 4.0 profile:

- `./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite encode" -Dtest=none` passes `encode.sql` and skips `encode_strict.sql` (as expected, since the strict file is gated by `MinSparkVersion: 4.0`).
- `./mvnw -Pspark-4.0 test -Dsuites="org.apache.comet.CometSqlFileTestSuite encode" -Dtest=none` passes both files.