fix: add SketchHashable trait to fix inconsistent hashing across types#111
fix: add SketchHashable trait to fix inconsistent hashing across types#111ZENOTME wants to merge 1 commit intoapache:mainfrom
Conversation
adb49dd to
5591bcd
Compare
…s sketches - Updated various sketch implementations (CpcSketch, HllSketch, ThetaSketch) to use the new SketchHashable trait for hashing values, ensuring consistent behavior across different data types. - Simplified update methods to accept types implementing SketchHashable instead of Hash. - Added tests to verify that integer, string, and float inputs are handled consistently across sketches.
5591bcd to
f565ade
Compare
tisonkun
left a comment
There was a problem hiding this comment.
I don't get the point clear. Could you provide a failing test to show how the original impl is wrong?
e.g. for u32 in Rust, the following test fails before this PR but in c++ , follwing is success because in c++, uint32 will be convert to int32_t and then int32_t. So in Rust to make it compatible with this behaviour, we introduce |
tisonkun
left a comment
There was a problem hiding this comment.
I get the point now.
But I think we need to consider how Rust's Hash would be recognized in datasketches-rust then, instead of randomly patch some code paths.
That said, we still have sketches like BloomFilter and FrequentItemsSketch that depend on Rust's std::hash::Hash. If we partially "fix" for CPC/HLL/Theta, then it creates a new splitting point among sketches.
To replace std::hash::Hash with datasketches' own Hash trait is one possible way to go, but then we need to consider whether this should be sealed or users can implement for their own types (technically, users can direct their own type to an u64 hashed value and then use our sealed u64 hash impl, but it may or may not desireable).
Yes, we should ensure that BloomFilter and FrequentItemsSketch are also consistent with the other implementations. However, I noticed that even in the C++ version, BloomFilter and Theta follow different canonicalization paths: Instead of introducing multiple hashing behaviors using multiple type, I think it maybe more reasonable to define a single canonical hashing model and make all sketches follow it.
For now, I made |
|
Loop in @AlexanderSaydakov @leerho @proost @freakyzoidberg for general discussion on how DataSketches' sketches should handle hashable value. Certain programming languages can have their own Hash/Hashable trait/interface and it seems the current impls vary one another. |
XXXSketch::updatepreviously accepted arbitraryT: Hashand fed Rust-native Hash semantics directly into the hash path. This diverges from the Java/C++ datasketches implementations, which use explicit canonical update semantics for numeric, floating-point, string, and byte-oriented inputs.Concretely, this caused several cross-language mismatches:
for example for i in 0..n { sketch.update(i) }) could produce different retained hashes / estimates from the C++ implementationThis PR introduce SketchHashable trait to unify the hash behaviour to fix inconsistent hashing across types