fix: render binary columns as hex in DataFrame::describe() #21728
Open
diegoQuinas wants to merge 5 commits into apache:main from
`describe()` previously returned `null` for min/max on `Binary` columns (due to an exclusion filter) and crashed with a cast error on `LargeBinary`, `BinaryView`, and `FixedSizeBinary` columns. Stop excluding `Binary` from the min/max aggregations and render binary results using Arrow's `ArrayFormatter`, which produces lowercase hex. This gives users a meaningful value range for columns holding hashes, UUIDs, or fingerprints, while matching Arrow's default display.
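To make the target format concrete, here is a minimal std-only sketch of lowercase hex rendering. It is not the PR's code: the PR relies on Arrow's `ArrayFormatter`, and the helper name `to_lower_hex` is hypothetical, introduced only for illustration.

```rust
use std::fmt::Write;

/// Hypothetical helper (not part of DataFusion or Arrow): renders a byte
/// slice as lowercase hex, two digits per byte, with no prefix or separator.
fn to_lower_hex(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len() * 2);
    for byte in bytes {
        // `{:02x}` zero-pads each byte to two lowercase hex digits.
        write!(out, "{:02x}", byte).expect("writing to a String cannot fail");
    }
    out
}
```

Under this sketch, `to_lower_hex(&[0x00, 0x01])` yields `"0001"` and `to_lower_hex(&[0xff, 0xee])` yields `"ffee"`, the same shape as the example values quoted in the PR description.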
Contributor
Does this supersede |
Author
Yes, I think it does. A quick comparison:
So the |
Jefffrey
reviewed
Apr 21, 2026
Co-authored-by: Jeffrey Vo <jeffrey.vo.australia@gmail.com>
…afusion into feat/describe-binary-hex
Which issue does this PR close?

Rationale for this change

`DataFrame::describe()` is the standard way to get a statistical summary of a DataFrame (count, null_count, mean, std, min, max, median per column). Today it handles binary-like columns poorly:

- For `Binary`, an exclusion filter in the `min`/`max` aggregations caused both to be reported as `null`, losing useful information for columns that hold hashes, UUIDs, fingerprints, or other content-addressed identifiers.
- For `LargeBinary`, `BinaryView`, and `FixedSizeBinary`, the filter did not apply, so `min`/`max` ran successfully, but the display step then tried to `cast(column, Utf8)`, which Arrow correctly rejects, producing an `ArrowError::CastError` that bubbled up and failed the whole `describe()` call.

The fix in this PR is aligned with what the issue proposes: stop filtering `Binary` from the aggregations and render binary outputs as lowercase hex (matching Arrow's default display of binary arrays).

What changes are included in this PR?

`datafusion/core/src/dataframe/mod.rs`:

- Remove `DataType::Binary` from the `min`/`max` exclusion filter (now only `Boolean` is excluded, an exclusion that remains meaningful for a statistical summary).
- Add a display path for `Binary`, `LargeBinary`, `BinaryView`, and `FixedSizeBinary` that uses `arrow::util::display::ArrayFormatter` with default options, which renders bytes as lowercase hex.
- Use `FormatOptions` unqualified in `DataFrame::to_string()` for consistency.

Are these changes tested?

Yes, a new integration test `describe_binary_columns` in `datafusion/core/tests/dataframe/describe.rs` builds an in-memory `RecordBatch` with one column per binary-like type and asserts the full `describe()` output via an inline `insta` snapshot. The test covers non-null values and a null row per column, so it exercises both `null_count` and the hex rendering path for `min`/`max`.

All existing `describe` tests continue to pass unchanged.

Are there any user-facing changes?

Yes, this is a visible behavior change for `DataFrame::describe()`:

- Before: `min`/`max` on `Binary` columns were `null`; other binary-like types caused a cast error.
- After: `min`/`max` on all binary-like types render as lowercase hex strings (e.g. `"0001"`, `"ffee"`).

No public API changes.
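The behavior described above can be illustrated with a small std-only model of a nullable binary column. This is a sketch under stated assumptions, not DataFusion code: `BinarySummary`, `summarize`, and `hex` are hypothetical names, and it assumes `min`/`max` compare values lexicographically byte by byte, which is how Rust orders `Vec<u8>`.

```rust
/// Hypothetical summary of one binary column; not a DataFusion type.
#[derive(Debug, PartialEq)]
struct BinarySummary {
    null_count: usize,
    min: Option<String>,
    max: Option<String>,
}

/// Hypothetical helper: lowercase hex, two digits per byte.
fn hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// Counts nulls and renders the smallest and largest non-null values as hex.
fn summarize(column: &[Option<Vec<u8>>]) -> BinarySummary {
    let null_count = column.iter().filter(|v| v.is_none()).count();
    // `flatten` skips the `None` entries; `Vec<u8>` is ordered
    // lexicographically, so `min`/`max` compare byte by byte.
    let min = column.iter().flatten().min().map(|v| hex(v));
    let max = column.iter().flatten().max().map(|v| hex(v));
    BinarySummary { null_count, min, max }
}
```

For a column holding `[0x00, 0x01]`, a null, and `[0xff, 0xee]`, this model reports a `null_count` of 1, a `min` of `"0001"`, and a `max` of `"ffee"`. A column with only nulls yields `None` for both `min` and `max`.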