
Add Spark 4.0 support via deequ:2.0.14-spark-4.0 #259

Open
m-aciek wants to merge 5 commits into awslabs:master from m-aciek:spark-4-support

Conversation


@m-aciek m-aciek commented Mar 26, 2026

Closes #258

Summary

  • Add "4.0": "com.amazon.deequ:deequ:2.0.14-spark-4.0" to SPARK_TO_DEEQU_COORD_MAPPING in configs.py
  • Widen pyspark optional dep from >=2.4.7,<3.4.0 to >=2.4.7,<5.0.0 in pyproject.toml
  • Replace scala.collection.JavaConversions (removed in Scala 2.13) with JavaConverters in scala_utils.py and profiles.py
  • Replace scala.collection.Seq.empty() (inaccessible via Py4J in Scala 2.13) with an empty Java list converted via to_scala_seq in analyzers.py and checks.py
  • Add Spark 4.0.0 to the CI matrix with Java 17; restructure matrix to use include: style so each Spark version carries its required Java version
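The first bullet can be sketched as follows. The mapping name and the "4.0" entry come from the PR; the lookup helper and its error handling are hypothetical illustrations of how such a mapping is typically consumed, not pydeequ's actual code.

```python
# Coordinate mapping as described in the PR; the existing 3.x entries
# are omitted here because they are not quoted in the PR text.
SPARK_TO_DEEQU_COORD_MAPPING = {
    "4.0": "com.amazon.deequ:deequ:2.0.14-spark-4.0",
}


def deequ_maven_coord(spark_version: str) -> str:
    """Resolve the Deequ Maven coordinate for a full Spark version string.

    Hypothetical helper: matches on the "major.minor" prefix so that
    e.g. "4.0.0" and "4.0.1" both resolve to the "4.0" entry.
    """
    major_minor = ".".join(spark_version.split(".")[:2])
    try:
        return SPARK_TO_DEEQU_COORD_MAPPING[major_minor]
    except KeyError:
        raise RuntimeError(f"No Deequ release mapped for Spark {major_minor}")
```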

Root causes fixed

Spark 4 uses Scala 2.13, which introduced two breaking changes affecting pydeequ:

  1. scala.collection.JavaConversions was removed — replaced by JavaConverters with explicit .asScala()/.asJava() calls
  2. scala.collection.Seq.empty() is not accessible via Py4J reflection — replaced with to_scala_seq(jvm, jvm.java.util.ArrayList()) which constructs an empty Scala Seq via the already-fixed converter
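The replacement in point 1 can be illustrated with a minimal sketch of the Py4J call pattern. This is not pydeequ's exact code; the `asScalaBuffer` method name is an assumption about which `JavaConverters` entry point is reached through the gateway.

```python
def to_scala_seq(jvm, java_list):
    """Convert a Java List into a Scala Seq through a Py4J gateway.

    Illustrative sketch: routes through scala.collection.JavaConverters,
    which exists in both Scala 2.12 and 2.13, instead of
    scala.collection.JavaConversions, which Scala 2.13 removed.
    """
    return jvm.scala.collection.JavaConverters.asScalaBuffer(java_list).toSeq()
```

In real pydeequ code, `jvm` is the `SparkContext._jvm` gateway object; here it can be any object exposing the same attribute chain.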

Test plan

  • All 99 existing tests pass with SPARK_VERSION=4.0.0 / pyspark==4.0.0
  • CI matrix extended to cover Spark 4.0.0 with Java 17
  • Existing Spark 3.x matrix entries unchanged

PR authored with assistance from Claude Code

m-aciek added 4 commits March 26, 2026 16:14
- Add "4.0" entry to SPARK_TO_DEEQU_COORD_MAPPING in configs.py
- Widen pyspark optional dep bound to <5.0.0 in pyproject.toml
- Replace scala.collection.JavaConversions (removed in Scala 2.13) with
  JavaConverters in scala_utils.py and profiles.py
- Replace scala.collection.Seq.empty() (inaccessible via Py4J in Scala 2.13)
  with to_scala_seq(jvm, jvm.java.util.ArrayList()) in analyzers.py and checks.py
- Add Spark 4.0.0 to CI matrix with Java 17; use include: style to pair
  each Spark version with its required Java version

Fixes awslabs#258
PySpark 4.0 requires Python >=3.9. Update the CI matrix to carry a
PYTHON_VERSION per entry (3.8 for Spark 3.x, 3.9 for Spark 4.x) and
use it in the setup-python step.
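An `include:`-style matrix carrying a per-entry Python version might look like the fragment below. Only the Spark 4.0.0 / Java 17 / Python 3.9 pairing comes from the PR; the Spark 3.x entry's exact versions and the step layout are assumed for illustration.

```yaml
# Hypothetical workflow fragment, not copied from the repository.
strategy:
  matrix:
    include:
      - SPARK_VERSION: "3.3.0"   # assumed 3.x entry
        JAVA_VERSION: "11"
        PYTHON_VERSION: "3.8"
      - SPARK_VERSION: "4.0.0"
        JAVA_VERSION: "17"
        PYTHON_VERSION: "3.9"
steps:
  - uses: actions/setup-python@v5
    with:
      python-version: ${{ matrix.PYTHON_VERSION }}
```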

Split the pyspark optional dep in pyproject.toml into two
version-marker entries so poetry can resolve correctly on both
Python 3.8 (pyspark <4.0) and Python 3.9+ (pyspark <5.0).
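Using Poetry's multiple-constraints syntax, the split might look like this; only the `<4.0` / `<5.0` bounds and the Python 3.9 cutoff come from the commit message, the rest is an assumed sketch.

```toml
# Hypothetical pyproject.toml fragment.
[tool.poetry.dependencies]
pyspark = [
    { version = ">=2.4.7,<4.0", python = "<3.9", optional = true },
    { version = ">=2.4.7,<5.0", python = ">=3.9", optional = true },
]
```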
The previous fix passed Stream$Empty$ to deequ constructors/methods via
to_scala_seq(jvm, ArrayList()), which Py4J's reflection-based overload
resolution rejects in Scala 2.12 (Spark 3.x).

Add empty_scala_seq() helper that uses JavaConverters.toList() instead
of toSeq(). This produces immutable.Nil (an empty List), which deequ
accepts as Seq[_] in both Scala 2.12 and 2.13, and is correctly matched
by Py4J constructor/method lookup in both versions.
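The helper described above can be sketched as follows. The exact `JavaConverters` call chain in pydeequ may differ; `asScalaBuffer` is an assumption, while the `toList()`-instead-of-`toSeq()` choice is the point the commit message makes.

```python
def empty_scala_seq(jvm):
    """Build an empty Scala Seq that Py4J can match in Scala 2.12 and 2.13.

    Illustrative sketch: converting an empty java.util.ArrayList and
    calling toList() (rather than toSeq()) yields
    scala.collection.immutable.Nil, which deequ accepts wherever a
    Seq[_] is expected and which Py4J's reflection-based overload
    resolution matches under both Scala versions.
    """
    empty_java_list = jvm.java.util.ArrayList()
    converters = jvm.scala.collection.JavaConverters
    return converters.asScalaBuffer(empty_java_list).toList()
```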
Author

m-aciek commented Apr 10, 2026

This is now ready for review; CI tests pass on my fork: https://github.com/m-aciek/python-deequ/actions/runs/24196839467

Contributor

@chenliu0831 chenliu0831 left a comment


LGTM. I'm not sure if we would like to keep maintaining the Py4j approach though.
