fix: reject vacuous equivalence and deserialization error false matches in Java comparator#1651
Merged
mashraf-222 merged 4 commits intoomni-javafrom Mar 3, 2026
Merged
Conversation
…es in Java comparator Guard against two correctness gaps where the comparator could return equivalent=true with zero evidence: empty databases and identical DeserializationError maps. Add tracking counters and stderr logging. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…guard The test previously expected empty databases to return equivalent=True, which was the exact bug being fixed. Updated to assert equivalent=False. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fde20ed to
e6ca15c
Compare
Contributor
|
Claude encountered an error —— View job PR Review
|
The resources JAR used by CI unit tests was stale and didn't include the Comparator changes from this PR. This caused test_comparator_empty_table to fail on all CI runners. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The unit-tests workflow relied on a pre-committed JAR binary in resources/ which could become stale when Comparator.java changes. Now the workflow builds the JAR from source and installs it to the local Maven repo, matching what java-e2e and fibonacci-nogit already do. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Problems fixed
The Java Comparator (
Comparator.java) could returnequivalent=truedespite having zero evidence of actual equivalence. Two distinct correctness gaps existed:Issue 1 — Vacuous equivalence: If both SQLite databases were empty (e.g., test infrastructure failed before any function ran), or if every invocation was skipped due to placeholder or deserialization errors, the
diffslist stayed empty andequivalent=truewas returned. No guard existed anywhere in the chain — not in the Java comparator, not in the Python wrapper, and not in the function optimizer. This could silently approve a broken optimization that produced no return values at all.Issue 2 — DeserializationError false match: When Kryo deserialization fails (e.g., class version mismatch after optimization changed imports),
safeDeserialize()returns a syntheticMap{"__type":"DeserializationError","error":"..."}. If deserialization failed identically on both sides, the two error maps would passcompareMaps()as equivalent — even though the actual return values were never compared. This is analogous to comparing two broken thermometers that both show "ERR" and concluding the temperature is the same.Both gaps are modeled after guards already present in the Python equivalence pipeline:
equivalence.py:32-33rejects empty results,equivalence.py:145-146rejects all-timeout scenarios, andcomparator.py:103-107rejectsPicklePlaceholderAccessError.Root causes
The equivalence check was
diffs.isEmpty()with no minimum-comparison requirement. An empty diff list is the default state before any comparisons run, so zero comparisons trivially satisfied the condition.safeDeserialize()converts deserialization failures into a Map with a__typekey, but no code checked for this sentinel value before feeding it intocompareMaps(). Two identical error maps passed deep comparison as a legitimate match.There was no observability into what the comparator actually did — no counters for how many invocations were compared vs skipped, making these silent failures invisible in logs.
Solutions implemented
Defense-in-depth across both Java and Python layers:
Java Comparator (
Comparator.java):actualComparisons,skippedPlaceholders,skippedDeserializationErrorsdiffs.isEmpty()todiffs.isEmpty() && actualComparisons > 0— equivalence now requires at least one real comparisonisDeserializationError()helper that detects synthetic error maps before they reachcompareMaps()Python wrapper (
comparator.py):actualComparisons == 0, returnNOT equivalentregardless of what Java reportedactualComparisonsdefault to-1, which doesn't trigger the guardTestability refactor:
compareDatabases()frommain()as a package-private static method, sincemain()callsSystem.exit()which kills the test runnermain()is now a thin wrapper that callscompareDatabases(), prints the result, and exitsCode changes
codeflash-java-runtime/src/main/java/com/codeflash/Comparator.javamain()intostatic String compareDatabases(String, String)for testabilityisDeserializationError(Object)helper — checks if an object is a synthetic{"__type":"DeserializationError",...}mapactualComparisons,skippedPlaceholders,skippedDeserializationErrorscountersdiffs.isEmpty() && actualComparisons > 0[codeflash-comparator]stderr diagnostic line with all counter valuescodeflash/languages/java/comparator.pyactualComparisons,skippedPlaceholders,skippedDeserializationErrorsfrom Java JSON outputif actual_comparisons == 0: return False, []with WARNING logcodeflash-java-runtime/src/test/java/com/codeflash/ComparatorCorrectnessTest.java(new)9 test cases exercising all correctness scenarios via temporary SQLite databases:
testEmptyDatabases— vacuous equivalence guard (zero rows)testAllPlaceholderSkips— all invocations skipped due to KryoPlaceholdertestDeserializationErrorSkipped— corrupted Kryo data triggers deserialization error skiptestMixedRealAndPlaceholder— mix of real comparisons and placeholder skipstestNormalHappyPath— standard matching return valuestestNormalMismatch— differing return values produce diffstestVoidMethodsBothNull— void methods count as real comparisonstestOneSideEmpty— asymmetric failure (one DB empty)testIsDeserializationError— unit test for the helper methodTesting
Unit tests:
All 9 new tests pass alongside 227 existing tests.
E2E validation — Fibonacci (--no-pr):
Optimization found and marked as success.
E2E validation — Aerospike Utf8.encodedLength (with PR):
Completed without errors (no optimization found — function already optimal).
E2E validation — Calculator.calculateStats (Map return type, --no-pr):
Complex object (Map<String, Double>) through Kryo round-trip. Optimization found and succeeded.
E2E validation — NetworkStats.getConnectionInfo (Socket in return value, --no-pr):
Confirmed the placeholder skip path and the Python zero-comparisons guard both work correctly in a real E2E flow.
Impact
🤖 Generated with Claude Code