Skip to content

fix: reject vacuous equivalence and deserialization error false matches in Java comparator#1651

Merged
mashraf-222 merged 4 commits intoomni-javafrom
fix/java-comparator-vacuous-equivalence
Mar 3, 2026
Merged

fix: reject vacuous equivalence and deserialization error false matches in Java comparator#1651
mashraf-222 merged 4 commits intoomni-javafrom
fix/java-comparator-vacuous-equivalence

Conversation

@mashraf-222
Copy link
Contributor

Problems fixed

The Java Comparator (Comparator.java) could return equivalent=true despite having zero evidence of actual equivalence. Two distinct correctness gaps existed:

Issue 1 — Vacuous equivalence: If both SQLite databases were empty (e.g., test infrastructure failed before any function ran), or if every invocation was skipped due to placeholder or deserialization errors, the diffs list stayed empty and equivalent=true was returned. No guard existed anywhere in the chain — not in the Java comparator, not in the Python wrapper, and not in the function optimizer. This could silently approve a broken optimization that produced no return values at all.

Issue 2 — DeserializationError false match: When Kryo deserialization fails (e.g., class version mismatch after optimization changed imports), safeDeserialize() returns a synthetic Map{"__type":"DeserializationError","error":"..."}. If deserialization failed identically on both sides, the two error maps would pass compareMaps() as equivalent — even though the actual return values were never compared. This is analogous to comparing two broken thermometers that both show "ERR" and concluding the temperature is the same.

Both gaps are modeled after guards already present in the Python equivalence pipeline: equivalence.py:32-33 rejects empty results, equivalence.py:145-146 rejects all-timeout scenarios, and comparator.py:103-107 rejects PicklePlaceholderAccessError.

Root causes

  1. The equivalence check was diffs.isEmpty() with no minimum-comparison requirement. An empty diff list is the default state before any comparisons run, so zero comparisons trivially satisfied the condition.

  2. safeDeserialize() converts deserialization failures into a Map with a __type key, but no code checked for this sentinel value before feeding it into compareMaps(). Two identical error maps passed deep comparison as a legitimate match.

  3. There was no observability into what the comparator actually did — no counters for how many invocations were compared vs skipped, making these silent failures invisible in logs.

Solutions implemented

Defense-in-depth across both Java and Python layers:

Java Comparator (Comparator.java):

  • Added three tracking counters: actualComparisons, skippedPlaceholders, skippedDeserializationErrors
  • Changed the equivalence condition from diffs.isEmpty() to diffs.isEmpty() && actualComparisons > 0 — equivalence now requires at least one real comparison
  • Added isDeserializationError() helper that detects synthetic error maps before they reach compareMaps()
  • Invocations with deserialization errors on either side are skipped (not counted as comparisons, not counted as diffs)
  • All three counters are included in the JSON output and in a stderr diagnostic line

Python wrapper (comparator.py):

  • Independent zero-comparisons guard: if actualComparisons == 0, return NOT equivalent regardless of what Java reported
  • Backward compatible: old JARs that don't output actualComparisons default to -1, which doesn't trigger the guard
  • Enhanced logging with all new counter fields for debugging
  • Fixed stale docstring that said "Gson" when the implementation uses Kryo

Testability refactor:

  • Extracted compareDatabases() from main() as a package-private static method, since main() calls System.exit() which kills the test runner
  • main() is now a thin wrapper that calls compareDatabases(), prints the result, and exits

Code changes

codeflash-java-runtime/src/main/java/com/codeflash/Comparator.java

  • Extracted core logic from main() into static String compareDatabases(String, String) for testability
  • Added isDeserializationError(Object) helper — checks if an object is a synthetic {"__type":"DeserializationError",...} map
  • Added actualComparisons, skippedPlaceholders, skippedDeserializationErrors counters
  • Changed equivalence condition: diffs.isEmpty() && actualComparisons > 0
  • Added all counters to JSON output fields
  • Added [codeflash-comparator] stderr diagnostic line with all counter values

codeflash/languages/java/comparator.py

  • Extract actualComparisons, skippedPlaceholders, skippedDeserializationErrors from Java JSON output
  • Added zero-comparisons guard: if actual_comparisons == 0: return False, [] with WARNING log
  • Added stderr debug logging for Java comparator output
  • Updated info log to include all new counter fields
  • Fixed docstring: "Gson" → "Kryo"

codeflash-java-runtime/src/test/java/com/codeflash/ComparatorCorrectnessTest.java (new)

9 test cases exercising all correctness scenarios via temporary SQLite databases:

  • testEmptyDatabases — vacuous equivalence guard (zero rows)
  • testAllPlaceholderSkips — all invocations skipped due to KryoPlaceholder
  • testDeserializationErrorSkipped — corrupted Kryo data triggers deserialization error skip
  • testMixedRealAndPlaceholder — mix of real comparisons and placeholder skips
  • testNormalHappyPath — standard matching return values
  • testNormalMismatch — differing return values produce diffs
  • testVoidMethodsBothNull — void methods count as real comparisons
  • testOneSideEmpty — asymmetric failure (one DB empty)
  • testIsDeserializationError — unit test for the helper method

Testing

Unit tests:

mvn test (in codeflash-java-runtime/)
Tests run: 236, Failures: 0, Errors: 0, Skipped: 0 — BUILD SUCCESS

All 9 new tests pass alongside 227 existing tests.

E2E validation — Fibonacci (--no-pr):

[codeflash-comparator] total=10 compared=10 skipped_placeholders=0 skipped_deser_errors=0 diffs=0 equivalent=true
Java comparison: equivalent (10 invocations, 10 compared, 0 placeholder skips, 0 deser skips, 0 diffs)

Optimization found and marked as success.

E2E validation — Aerospike Utf8.encodedLength (with PR):

[codeflash-comparator] total=3 compared=3 skipped_placeholders=0 skipped_deser_errors=0 diffs=0 equivalent=true
Java comparison: equivalent (3 invocations, 3 compared, 0 placeholder skips, 0 deser skips, 0 diffs)

Completed without errors (no optimization found — function already optimal).

E2E validation — Calculator.calculateStats (Map return type, --no-pr):

[codeflash-comparator] total=1 compared=1 skipped_placeholders=0 skipped_deser_errors=0 diffs=0 equivalent=true
Java comparison: equivalent (1 invocations, 1 compared, 0 placeholder skips, 0 deser skips, 0 diffs)

Complex object (Map<String, Double>) through Kryo round-trip. Optimization found and succeeded.

E2E validation — NetworkStats.getConnectionInfo (Socket in return value, --no-pr):

[codeflash-comparator] total=2 compared=0 skipped_placeholders=2 skipped_deser_errors=0 diffs=0 equivalent=false
Java comparison: no actual comparisons performed (total=2, skipped_placeholders=2, skipped_deser_errors=0). Treating as NOT equivalent.

Confirmed the placeholder skip path and the Python zero-comparisons guard both work correctly in a real E2E flow.

Impact

  • Prevents silently accepting broken optimizations that produce no return values or only deserialization errors
  • Adds observability into the comparison process — every run now logs exactly how many invocations were compared, skipped, or diffed
  • Backward compatible — old JARs without the new fields continue to work with the updated Python wrapper
  • Modeled after existing Python-side guards for consistency across language pipelines

🤖 Generated with Claude Code

@mashraf-222 mashraf-222 requested review from HeshamHM28 and misrasaurabh1 and removed request for misrasaurabh1 February 24, 2026 20:01
mashraf-222 and others added 2 commits March 3, 2026 19:40
…es in Java comparator

Guard against two correctness gaps where the comparator could return
equivalent=true with zero evidence: empty databases and identical
DeserializationError maps. Add tracking counters and stderr logging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…guard

The test previously expected empty databases to return equivalent=True,
which was the exact bug being fixed. Updated to assert equivalent=False.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mashraf-222 mashraf-222 force-pushed the fix/java-comparator-vacuous-equivalence branch from fde20ed to e6ca15c Compare March 3, 2026 19:41
@claude
Copy link
Contributor

claude bot commented Mar 3, 2026

Claude encountered an error —— View job


PR Review

  • Gathering context
  • Triage — substantive code changes found
  • Lint & typecheck — formatting issue found in comparator.py, fixing now
  • Resolve stale threads — none found
  • Code review
  • Duplicate detection
  • Test coverage
  • Summary comment
  • Simplify pass
  • Check optimization PRs

mashraf-222 and others added 2 commits March 3, 2026 19:59
The resources JAR used by CI unit tests was stale and didn't include
the Comparator changes from this PR. This caused test_comparator_empty_table
to fail on all CI runners.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The unit-tests workflow relied on a pre-committed JAR binary in
resources/ which could become stale when Comparator.java changes.
Now the workflow builds the JAR from source and installs it to the
local Maven repo, matching what java-e2e and fibonacci-nogit already do.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the workflow-modified This PR modifies GitHub Actions workflows label Mar 3, 2026
@mashraf-222 mashraf-222 merged commit dc7caa3 into omni-java Mar 3, 2026
25 of 31 checks passed
@mashraf-222 mashraf-222 deleted the fix/java-comparator-vacuous-equivalence branch March 3, 2026 20:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

workflow-modified This PR modifies GitHub Actions workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant