[SPARK-55568][SQL] Separate schema construction from field stats collection#54343

Open
qlong wants to merge 1 commit into apache:master from qlong:SPARK-55568-optimize-variant-schema-inference
Conversation


@qlong qlong commented Feb 17, 2026

Why are the changes needed?

Variant shredding schema inference is expensive and can take over 100ms per file. Replace fold-based schema merging with deferred schema construction using single-pass field statistics collection.

Previous approach:

  • Used foldLeft to build and merge complete schemas for each row
  • Merged schemas repeatedly across 4096 rows
  • High allocation overhead from recursive schema construction
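To make the allocation cost concrete, here is a hedged sketch of the fold-based pattern being replaced. The names (inferRowSchema, mergeSchemas, inferBatchSchema) are illustrative, not the actual Spark internals, and schemas are simplified to name-to-type maps:

```scala
// Simplified model: a schema is a map from field name to type name.
def inferRowSchema(row: Map[String, String]): Map[String, String] = row

// Merge two schemas field by field; conflicting types widen to "variant".
def mergeSchemas(a: Map[String, String], b: Map[String, String]): Map[String, String] =
  b.foldLeft(a) { case (acc, (name, tpe)) =>
    acc.get(name) match {
      case Some(t) if t == tpe => acc                          // types agree
      case Some(_)             => acc.updated(name, "variant") // type conflict widens
      case None                => acc + (name -> tpe)          // new field
    }
  }

// One full schema is built and merged per row: across 4096 rows this
// allocates thousands of short-lived intermediate maps.
def inferBatchSchema(rows: Seq[Map[String, String]]): Map[String, String] =
  rows.foldLeft(Map.empty[String, String]) { (acc, row) =>
    mergeSchemas(acc, inferRowSchema(row))
  }
```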

New approach:

  • Separates schema construction from field statistics collection, avoiding excessive intermediate allocations and repeated merges
  • Performs a single-pass field traversal with a flat statistics registry that tracks field types and row counts
  • Uses lastSeenRow to deduplicate fields that appear more than once within a row
  • Defers schema construction until all rows have been processed
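The bullets above can be sketched as follows. This is a hedged illustration of the technique, not the actual PR code: a flat mutable registry accumulates per-field type and count statistics in a single pass, lastSeenRow prevents double-counting a field that appears twice in one row, and the schema is materialized only once at the end. All names (FieldStats, inferBatchSchema) are assumptions; rows are modeled as flat sequences of (field name, type) pairs.

```scala
import scala.collection.mutable

// Per-field statistics entry in the flat registry.
final class FieldStats(var dataType: String) {
  var rowCount: Int = 0
  var lastSeenRow: Int = -1 // index of the last row this field was counted in
}

def inferBatchSchema(rows: Seq[Seq[(String, String)]]): Map[String, String] = {
  val registry = mutable.LinkedHashMap.empty[String, FieldStats]
  rows.zipWithIndex.foreach { case (row, rowIdx) =>
    row.foreach { case (name, tpe) =>
      val stats = registry.getOrElseUpdate(name, new FieldStats(tpe))
      if (stats.dataType != tpe) stats.dataType = "variant" // type conflict widens
      if (stats.lastSeenRow != rowIdx) { // lastSeenRow dedupes within a row
        stats.rowCount += 1
        stats.lastSeenRow = rowIdx
      }
    }
  }
  // Deferred construction: build the schema once, after all rows are processed.
  registry.iterator.map { case (name, s) => name -> s.dataType }.toMap
}
```

Because the registry is mutated in place, no intermediate schemas are allocated per row, which is where the sparse-data speedup comes from: absent fields cost nothing.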

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Functional test:

  • All existing unit tests pass

Performance vs master:

  • Tested scenarios with varying field counts, array sizes, and batch sizes (1-4096 rows, 10-200 fields, varying nesting depths and sparsity patterns)
  • Average 1.5x speedup across test scenarios
  • 1.5x-1.6x faster on array-heavy workloads
  • 11.5x faster on sparse data (10% field presence)
  • Consistent performance across multiple runs
  • 96% of tests show improvement

Was this patch authored or co-authored using generative AI tooling?

Co-authored with Claude Sonnet 4.5

Issue: https://issues.apache.org/jira/browse/SPARK-55568
