Conversation
notfilippo
left a comment
There was a problem hiding this comment.
The use case makes sense.
I was thinking we could also potentially accept in the sketch any value that implements a trait with a coupon() method. Then we could implement this trait by default for all Hash, and leave it implementable by users to support this use case with a bit more intentionality.
Anyhow that's for a follow-up. This change does not break the current API so it looks good on my side.
|
Thank you! |
|
We've historically treated this as an anti-pattern. It makes it really, really easy to destroy the contents of a sketch by, say, mixing hash functions. For this use case, I'd have said you're better off just submitting hashes and letting the sketch hash again. |
|
@jmalkin I definitely agree in principle. Would a better comment on the method be enough to warn about the side effects of this? In my opinion preventing advanced use cases like this hinders adoption and I would prefer an approach where we supply all the tools, while warning about the risks. |
We need update_with_coupon to be public for the following reason.
We are running the equivalent of a
some_field is dictionary encoded. Fetching the actual string value from value id associated to it is not a cheap operation.
For this reason, we first perform the group by, collecting, for each bucket the value ids encounterred.
We then compute once and for all the mapping from term_id to term and store it in a cache.
We can then reuse this mapping when building the hll sketch for each independent bucket.
In order to avoid dealing with strings, we would like to store, not the strings, but their coupons.