fix: add SketchHashable trait to fix inconsistent hashing across types by ZENOTME · Pull Request #111 · apache/datasketches-rust

ZENOTME · 2026-03-24T16:57:18Z

XXXSketch::update previously accepted arbitrary T: Hash and fed Rust-native Hash semantics directly into the hash path. This diverges from the Java/C++ datasketches implementations, which use explicit canonical update semantics for numeric, floating-point, string, and byte-oriented inputs.

Concretely, this caused several cross-language mismatches:

narrow integers were not canonicalized through the same signed-widening path as Java/C++
str / byte-oriented inputs followed Rust structural Hash semantics instead of raw-byte semantics
the default integer-stream path (for example for i in 0..n { sketch.update(i) }) could produce different retained hashes / estimates from the C++ implementation

This PR introduce SketchHashable trait to unify the hash behaviour to fix inconsistent hashing across types

ZENOTME · 2026-03-24T17:19:03Z

Seems CPC update and HLL update may have the same problem. cc @tisonkun @PsiACE

…s sketches - Updated various sketch implementations (CpcSketch, HllSketch, ThetaSketch) to use the new SketchHashable trait for hashing values, ensuring consistent behavior across different data types. - Simplified update methods to accept types implementing SketchHashable instead of Hash. - Added tests to verify that integer, string, and float inputs are handled consistently across sketches.

tisonkun

I don't get the point clear. Could you provide a failing test to show how the original impl is wrong?

ZENOTME · 2026-04-11T06:44:40Z

I don't get the point clear. Could you provide a failing test to show how the original impl is wrong?

e.g. for u32 in Rust, the following test fails before this PR

fn test_unsigned_narrow_integer_inputs_follow_cpp_signed_path() {
      let mut u32_sketch = ThetaSketch::builder().build();
      u32_sketch.update(u32::MAX);

      let mut signed_sketch = ThetaSketch::builder().build();
      signed_sketch.update(-1i64);

      assert_eq!(u32_sketch.iter().next(), signed_sketch.iter().next());
 }

but in c++ , follwing is success

TEST_CASE("theta sketch: unsigned narrow integer inputs follow cpp
  signed path", "[theta_sketch]") {
    update_theta_sketch u32_sketch =
      update_theta_sketch::builder().build();
    u32_sketch.update(uint32_t(UINT32_MAX));

    update_theta_sketch signed_sketch =
      update_theta_sketch::builder().build();
    signed_sketch.update(int64_t(-1));

    REQUIRE(u32_sketch.get_num_retained() == 1);
    REQUIRE(signed_sketch.get_num_retained() == 1);
    REQUIRE(*u32_sketch.begin() == *signed_sketch.begin());
  }

because in c++, uint32 will be convert to int32_t and then int32_t. So in Rust to make it compatible with this behaviour, we introduce SketchHashable.

 void update(uint32_t value) {
    update(static_cast<int32_t>(value));
  }

  void update(int32_t value) {
    update(static_cast<int64_t>(value));
  }

  void update(int64_t value) {
    update(&value, sizeof(value));
  }

tisonkun

I get the point now.

But I think we need to consider how Rust's Hash would be recognized in datasketches-rust then, instead of randomly patch some code paths.

That said, we still have sketches like BloomFilter and FrequentItemsSketch that depend on Rust's std::hash::Hash. If we partially "fix" for CPC/HLL/Theta, then it creates a new splitting point among sketches.

To replace std::hash::Hash with datasketches' own Hash trait is one possible way to go, but then we need to consider whether this should be sealed or users can implement for their own types (technically, users can direct their own type to an u64 hashed value and then use our sealed u64 hash impl, but it may or may not desireable).

ZENOTME · 2026-04-11T11:06:56Z

That said, we still have sketches like BloomFilter and FrequentItemsSketch that depend on Rust's std::hash::Hash. If we partially "fix" for CPC/HLL/Theta, then it creates a new splitting point among sketches.

Yes, we should ensure that BloomFilter and FrequentItemsSketch are also consistent with the other implementations. However, I noticed that even in the C++ version, BloomFilter and Theta follow different canonicalization paths:

// bloom_filter
template<typename A>
void bloom_filter_alloc<A>::update(uint32_t item) {
  update(static_cast<uint64_t>(item));
}

// theta
template<typename A>
void update_theta_sketch_alloc<A>::update(uint32_t value) {
  update(static_cast<int32_t>(value));
}

template<typename A>
void update_theta_sketch_alloc<A>::update(int32_t value) {
  update(static_cast<int64_t>(value));
}

Instead of introducing multiple hashing behaviors using multiple type, I think it maybe more reasonable to define a single canonical hashing model and make all sketches follow it.

To replace std::hash::Hash with datasketches' own Hash trait is one possible way to go, but then we need to consider whether this should be sealed or users can implement for their own types

For now, I made SketchHashable a sealed trait to ensure behavior consistent with the Java and C++ implementations. Users can only call update with the supported types.

tisonkun · 2026-04-11T11:31:46Z

Loop in @AlexanderSaydakov @leerho @proost @freakyzoidberg for general discussion on how DataSketches' sketches should handle hashable value.

Certain programming languages can have their own Hash/Hashable trait/interface and it seems the current impls vary one another.

ZENOTME force-pushed the theta-canonicalization branch from adb49dd to 5591bcd Compare March 24, 2026 17:10

ZENOTME force-pushed the theta-canonicalization branch from 5591bcd to f565ade Compare April 11, 2026 04:56

ZENOTME changed the title ~~fix(theta): add ThetaHashable trait to fix inconsistent hashing across types~~ fix: add SketchHashable trait to fix inconsistent hashing across types Apr 11, 2026

ZENOTME requested review from PsiACE, notfilippo and tisonkun and removed request for PsiACE and tisonkun April 11, 2026 05:42

tisonkun requested changes Apr 11, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: add SketchHashable trait to fix inconsistent hashing across types#111

fix: add SketchHashable trait to fix inconsistent hashing across types#111
ZENOTME wants to merge 1 commit intoapache:mainfrom
ZENOTME:theta-canonicalization

ZENOTME commented Mar 24, 2026 •

edited

Loading

Uh oh!

ZENOTME commented Mar 24, 2026

Uh oh!

tisonkun left a comment

Uh oh!

ZENOTME commented Apr 11, 2026

Uh oh!

tisonkun left a comment

Uh oh!

ZENOTME commented Apr 11, 2026

Uh oh!

tisonkun commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ZENOTME commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ZENOTME commented Mar 24, 2026

Uh oh!

tisonkun left a comment

Choose a reason for hiding this comment

Uh oh!

ZENOTME commented Apr 11, 2026

Uh oh!

tisonkun left a comment

Choose a reason for hiding this comment

Uh oh!

ZENOTME commented Apr 11, 2026

Uh oh!

tisonkun commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ZENOTME commented Mar 24, 2026 •

edited

Loading