
fix: render binary columns as hex in DataFrame::describe() #21728

Open
diegoQuinas wants to merge 5 commits into apache:main from diegoQuinas:feat/describe-binary-hex

@diegoQuinas

Which issue does this PR close?

Rationale for this change

DataFrame::describe() is the standard way to get a statistical summary of a DataFrame (count, null_count, mean, std, min, max, median per column). Today it handles binary-like columns poorly:

  • For Binary, an exclusion filter in min/max aggregations caused both to be reported as null, losing useful information for columns that hold hashes, UUIDs, fingerprints, or other content-addressed identifiers.
  • For LargeBinary, BinaryView, and FixedSizeBinary, the filter did not apply, so min/max ran successfully but then the display step tried to cast(column, Utf8), which Arrow correctly rejects, producing an ArrowError::CastError that bubbled up and failed the whole describe() call.

The fix in this PR is aligned with what the issue proposes: stop filtering Binary from the aggregations and render binary outputs as lowercase hex (matching Arrow's default display of binary arrays).

What changes are included in this PR?

  • datafusion/core/src/dataframe/mod.rs:
    • Drop DataType::Binary from the min/max exclusion filter (now only Boolean is excluded, which is still meaningful for a statistical summary).
    • Add a dedicated display branch for Binary, LargeBinary, BinaryView, and FixedSizeBinary that uses arrow::util::display::ArrayFormatter with default options, which renders bytes as lowercase hex.
    • Tidy a now-stale comment that referenced the previous binary filter.
    • Drive-by: use the newly imported FormatOptions unqualified in DataFrame::to_string() for consistency.
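
To illustrate the output format the new display branch aims for, here is a minimal std-only sketch of lowercase hex rendering. It does not call `ArrayFormatter`; the helper name `to_lower_hex` is hypothetical and exists only for illustration, whereas the PR itself delegates formatting to Arrow.

```rust
use std::fmt::Write;

/// Hypothetical helper (not part of DataFusion or Arrow): renders a byte
/// slice as lowercase hex, two characters per byte.
fn to_lower_hex(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len() * 2);
    for byte in bytes {
        // Writing into a String cannot fail, so unwrapping is safe here.
        write!(out, "{byte:02x}").unwrap();
    }
    out
}
```

Under this sketch, the bytes `[0x00, 0x01]` render as `0001` and `[0xff, 0xee]` render as `ffee`, the same shape as the examples in the user-facing changes section.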

Are these changes tested?

Yes. A new integration test, describe_binary_columns in datafusion/core/tests/dataframe/describe.rs, builds an in-memory RecordBatch with one column per binary-like type and asserts the full describe() output via an inline insta snapshot. Each column contains non-null values plus one null row, so the test exercises both null_count and the hex rendering path for min/max.

All existing describe tests continue to pass unchanged.
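
The behavior the test checks can be modelled without Arrow. The sketch below is a std-only illustration, not the test itself: `BinarySummary`, `hex`, and `summarize_binary` are hypothetical names, nulls are modelled as `None`, and min/max are assumed to follow byte-wise lexicographic ordering (which is how Rust orders `&[u8]`).

```rust
/// Hypothetical summary of one binary column (illustration only).
#[derive(Debug, PartialEq)]
struct BinarySummary {
    count: usize,
    null_count: usize,
    min: Option<String>,
    max: Option<String>,
}

/// Renders bytes as lowercase hex, two characters per byte.
fn hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

/// Counts non-null and null entries, then renders the smallest and largest
/// non-null values as hex. Assumes lexicographic byte ordering.
fn summarize_binary(column: &[Option<&[u8]>]) -> BinarySummary {
    let non_null: Vec<&[u8]> = column.iter().flatten().copied().collect();
    BinarySummary {
        count: non_null.len(),
        null_count: column.len() - non_null.len(),
        min: non_null.iter().min().map(|b| hex(b)),
        max: non_null.iter().max().map(|b| hex(b)),
    }
}
```

For a column holding `ffee`, a null, and `0001`, this model reports a count of 2, a null_count of 1, a min of `0001`, and a max of `ffee`. An all-null or empty column yields `None` for both min and max.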

Are there any user-facing changes?

Yes — this is a visible behavior change for DataFrame::describe():

  • Before: min/max on Binary columns were null; other binary-like types caused a cast error.
  • After: min/max on all binary-like types render as lowercase hex strings (e.g. "0001", "ffee").

No public API changes.


`describe()` previously returned `null` for min/max on `Binary` columns (due to an exclusion filter) and crashed with a cast error on `LargeBinary`, `BinaryView`, and `FixedSizeBinary` columns.

Stop excluding `Binary` from the min/max aggregations and render binary results using Arrow's `ArrayFormatter`, which produces lowercase hex. This gives users a meaningful value range for columns holding hashes, UUIDs, or fingerprints, while matching Arrow's default display.
@github-actions (bot) added the core (Core DataFusion crate) label on Apr 19, 2026
@Jefffrey
Contributor

@diegoQuinas
Author

Yes, I think it does. A quick comparison:

So the FixedSizeBinary case #21455 targets is handled here, plus LargeBinary/BinaryView, and users get real values instead of null. Happy to close #21455 in favor of this one if reviewers agree — or, if the null fallback is preferred over hex rendering, I can rework this PR on top of #21455.

Contributor

@Jefffrey left a comment

Similar to my comment on #21455, I think we should take this opportunity to fix the docstring of describe here, as it states that only numeric types are supported.

Comment thread datafusion/core/tests/dataframe/describe.rs Outdated
Comment thread datafusion/core/src/dataframe/mod.rs Outdated

Development

Successfully merging this pull request may close these issues.

describe() returns null for min/max on binary columns — render as hex instead
