[SPARK-55568][SQL] Separate schema construction from field stats collection#54343

Open
qlong wants to merge 1 commit into apache:master from qlong:SPARK-55568-optimize-variant-schema-inference
Conversation


@qlong qlong commented Feb 17, 2026

Why are the changes needed?

Variant shredding schema inference is expensive and can take over 100ms per file. Replace fold-based schema merging with deferred schema construction using single-pass field statistics collection.

Previous approach:

  • Used foldLeft to build and merge complete schemas for each row
  • Merged schemas repeatedly across 4096 rows
  • High allocation overhead from recursive schema construction
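To make the allocation cost concrete, here is a hedged sketch of the fold-based pattern being replaced. The names (inferRowSchema, mergeSchemas, inferBatchSchema) are illustrative, not the actual Spark internals, and schemas are simplified to name-to-type maps:

```scala
// Simplified model: a schema is a map from field name to type name.
def inferRowSchema(row: Map[String, String]): Map[String, String] = row

// Merge two schemas field by field; conflicting types widen to "variant".
def mergeSchemas(a: Map[String, String], b: Map[String, String]): Map[String, String] =
  b.foldLeft(a) { case (acc, (name, tpe)) =>
    acc.get(name) match {
      case Some(t) if t == tpe => acc                          // types agree
      case Some(_)             => acc.updated(name, "variant") // type conflict widens
      case None                => acc + (name -> tpe)          // new field
    }
  }

// One full schema is built and merged per row: across 4096 rows this
// allocates thousands of short-lived intermediate maps.
def inferBatchSchema(rows: Seq[Map[String, String]]): Map[String, String] =
  rows.foldLeft(Map.empty[String, String]) { (acc, row) =>
    mergeSchemas(acc, inferRowSchema(row))
  }
```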

New approach:

  • Separates schema construction from field statistics collection, avoiding excessive intermediate allocations and repeated merges
  • Performs a single-pass field traversal with a flat statistics registry that tracks field types and row counts
  • Uses lastSeenRow to deduplicate fields that appear more than once within a row
  • Defers schema construction until all rows have been processed
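The bullets above can be sketched as follows. This is a hedged illustration of the technique, not the actual PR code: a flat mutable registry accumulates per-field type and count statistics in a single pass, lastSeenRow prevents double-counting a field that appears twice in one row, and the schema is materialized only once at the end. All names (FieldStats, inferBatchSchema) are assumptions; rows are modeled as flat sequences of (field name, type) pairs.

```scala
import scala.collection.mutable

// Per-field statistics entry in the flat registry.
final class FieldStats(var dataType: String) {
  var rowCount: Int = 0
  var lastSeenRow: Int = -1 // index of the last row this field was counted in
}

def inferBatchSchema(rows: Seq[Seq[(String, String)]]): Map[String, String] = {
  val registry = mutable.LinkedHashMap.empty[String, FieldStats]
  rows.zipWithIndex.foreach { case (row, rowIdx) =>
    row.foreach { case (name, tpe) =>
      val stats = registry.getOrElseUpdate(name, new FieldStats(tpe))
      if (stats.dataType != tpe) stats.dataType = "variant" // type conflict widens
      if (stats.lastSeenRow != rowIdx) { // lastSeenRow dedupes within a row
        stats.rowCount += 1
        stats.lastSeenRow = rowIdx
      }
    }
  }
  // Deferred construction: build the schema once, after all rows are processed.
  registry.iterator.map { case (name, s) => name -> s.dataType }.toMap
}
```

Because the registry is mutated in place, no intermediate schemas are allocated per row, which is where the sparse-data speedup comes from: absent fields cost nothing.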

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Functional test:

  • All existing unit tests pass

Performance vs master:

  • Tested scenarios with varying field counts, array sizes, and batch sizes (1-4096 rows, 10-200 fields, varying nesting depths and sparsity patterns)
  • Average 1.5x speedup across test scenarios
  • 1.5x-1.6x faster on array-heavy workloads
  • 11.5x faster on sparse data (10% field presence)
  • Consistent performance across multiple runs
  • 96% of tests show improvement

Was this patch authored or co-authored using generative AI tooling?

Co-authored with Claude Sonnet 4.5

Issue: https://issues.apache.org/jira/browse/SPARK-55568
