Skip to content

fix: add SketchHashable trait to fix inconsistent hashing across types#111

Open
ZENOTME wants to merge 1 commit intoapache:mainfrom
ZENOTME:theta-canonicalization
Open

fix: add SketchHashable trait to fix inconsistent hashing across types#111
ZENOTME wants to merge 1 commit intoapache:mainfrom
ZENOTME:theta-canonicalization

Conversation

@ZENOTME
Copy link
Copy Markdown
Contributor

@ZENOTME ZENOTME commented Mar 24, 2026

XXXSketch::update previously accepted arbitrary T: Hash and fed Rust-native Hash semantics directly into the hash path. This diverges from the Java/C++ datasketches implementations, which use explicit canonical update semantics for numeric, floating-point, string, and byte-oriented inputs.

Concretely, this caused several cross-language mismatches:

  • narrow integers were not canonicalized through the same signed-widening path as Java/C++
  • str / byte-oriented inputs followed Rust structural Hash semantics instead of raw-byte semantics
  • the default integer-stream path (for example for i in 0..n { sketch.update(i) }) could produce different retained hashes / estimates from the C++ implementation

This PR introduce SketchHashable trait to unify the hash behaviour to fix inconsistent hashing across types

@ZENOTME ZENOTME force-pushed the theta-canonicalization branch from adb49dd to 5591bcd Compare March 24, 2026 17:10
@ZENOTME
Copy link
Copy Markdown
Contributor Author

ZENOTME commented Mar 24, 2026

Seems CPC update and HLL update may have the same problem. cc @tisonkun @PsiACE

…s sketches

- Updated various sketch implementations (CpcSketch, HllSketch, ThetaSketch) to use the new SketchHashable trait for hashing values, ensuring consistent behavior across different data types.
- Simplified update methods to accept types implementing SketchHashable instead of Hash.
- Added tests to verify that integer, string, and float inputs are handled consistently across sketches.
@ZENOTME ZENOTME force-pushed the theta-canonicalization branch from 5591bcd to f565ade Compare April 11, 2026 04:56
@ZENOTME ZENOTME changed the title fix(theta): add ThetaHashable trait to fix inconsistent hashing across types fix: add SketchHashable trait to fix inconsistent hashing across types Apr 11, 2026
@ZENOTME ZENOTME requested review from PsiACE, notfilippo and tisonkun and removed request for PsiACE and tisonkun April 11, 2026 05:42
Copy link
Copy Markdown
Member

@tisonkun tisonkun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get the point clear. Could you provide a failing test to show how the original impl is wrong?

@ZENOTME
Copy link
Copy Markdown
Contributor Author

ZENOTME commented Apr 11, 2026

I don't get the point clear. Could you provide a failing test to show how the original impl is wrong?

e.g. for u32 in Rust, the following test fails before this PR

fn test_unsigned_narrow_integer_inputs_follow_cpp_signed_path() {
      let mut u32_sketch = ThetaSketch::builder().build();
      u32_sketch.update(u32::MAX);

      let mut signed_sketch = ThetaSketch::builder().build();
      signed_sketch.update(-1i64);

      assert_eq!(u32_sketch.iter().next(), signed_sketch.iter().next());
 }

but in c++ , follwing is success

TEST_CASE("theta sketch: unsigned narrow integer inputs follow cpp
  signed path", "[theta_sketch]") {
    update_theta_sketch u32_sketch =
      update_theta_sketch::builder().build();
    u32_sketch.update(uint32_t(UINT32_MAX));

    update_theta_sketch signed_sketch =
      update_theta_sketch::builder().build();
    signed_sketch.update(int64_t(-1));

    REQUIRE(u32_sketch.get_num_retained() == 1);
    REQUIRE(signed_sketch.get_num_retained() == 1);
    REQUIRE(*u32_sketch.begin() == *signed_sketch.begin());
  }

because in c++, uint32 will be convert to int32_t and then int32_t. So in Rust to make it compatible with this behaviour, we introduce SketchHashable.

 void update(uint32_t value) {
    update(static_cast<int32_t>(value));
  }

  void update(int32_t value) {
    update(static_cast<int64_t>(value));
  }

  void update(int64_t value) {
    update(&value, sizeof(value));
  }

Copy link
Copy Markdown
Member

@tisonkun tisonkun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get the point now.

But I think we need to consider how Rust's Hash would be recognized in datasketches-rust then, instead of randomly patch some code paths.

That said, we still have sketches like BloomFilter and FrequentItemsSketch that depend on Rust's std::hash::Hash. If we partially "fix" for CPC/HLL/Theta, then it creates a new splitting point among sketches.

To replace std::hash::Hash with datasketches' own Hash trait is one possible way to go, but then we need to consider whether this should be sealed or users can implement for their own types (technically, users can direct their own type to an u64 hashed value and then use our sealed u64 hash impl, but it may or may not desireable).

@ZENOTME
Copy link
Copy Markdown
Contributor Author

ZENOTME commented Apr 11, 2026

That said, we still have sketches like BloomFilter and FrequentItemsSketch that depend on Rust's std::hash::Hash. If we partially "fix" for CPC/HLL/Theta, then it creates a new splitting point among sketches.

Yes, we should ensure that BloomFilter and FrequentItemsSketch are also consistent with the other implementations. However, I noticed that even in the C++ version, BloomFilter and Theta follow different canonicalization paths:

// bloom_filter
template<typename A>
void bloom_filter_alloc<A>::update(uint32_t item) {
  update(static_cast<uint64_t>(item));
}

// theta
template<typename A>
void update_theta_sketch_alloc<A>::update(uint32_t value) {
  update(static_cast<int32_t>(value));
}

template<typename A>
void update_theta_sketch_alloc<A>::update(int32_t value) {
  update(static_cast<int64_t>(value));
}

Instead of introducing multiple hashing behaviors using multiple type, I think it maybe more reasonable to define a single canonical hashing model and make all sketches follow it.

To replace std::hash::Hash with datasketches' own Hash trait is one possible way to go, but then we need to consider whether this should be sealed or users can implement for their own types

For now, I made SketchHashable a sealed trait to ensure behavior consistent with the Java and C++ implementations. Users can only call update with the supported types.

@tisonkun
Copy link
Copy Markdown
Member

Loop in @AlexanderSaydakov @leerho @proost @freakyzoidberg for general discussion on how DataSketches' sketches should handle hashable value.

Certain programming languages can have their own Hash/Hashable trait/interface and it seems the current impls vary one another.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants