
feat: abstract FormatFileReader and introduce ORC & avro readers#225

Open
JingsongLi wants to merge 5 commits into apache:main from JingsongLi:formats

Conversation

Contributor

@JingsongLi commented Apr 7, 2026

Purpose

This PR introduces the FormatFileReader trait abstraction layer, which splits reader.rs into:

  • reader.rs - format-agnostic read orchestration (schema evolution, deletion vectors, row-range merging)
  • format/parquet.rs - Parquet format reading
  • format/orc.rs - ORC format reading (new)
  • format/avro.rs - Avro format reading (new)

The core trait interface is:

```rust
/// Format-agnostic file reader that produces Arrow RecordBatch streams.
///
/// Each implementation (Parquet, ORC, ...) handles:
/// - Column projection
/// - Predicate pushdown (row-group/stripe pruning + row-level filtering)
/// - Row range selection
#[async_trait]
pub(crate) trait FormatFileReader: Send + Sync {
    /// Read a single data file, returning a stream of RecordBatches
    /// containing only the projected columns (using names from the file's schema).
    ///
    /// `row_selection` is a pre-merged list of 0-based inclusive row ranges
    /// (DV + row_ranges already combined by the caller).
    async fn read_batch_stream(
        &self,
        reader: Box<dyn FileRead>,
        file_size: u64,
        read_fields: &[DataField],
        predicates: Option<&FilePredicates>,
        batch_size: Option<usize>,
        row_selection: Option<Vec<RowRange>>,
    ) -> crate::Result<ArrowRecordBatchStream>;
}
```
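As a rough illustration of how a caller might pick an implementation per file, here is a minimal sketch. The `FileFormat` enum and extension-based dispatch are assumptions made for the example, not the PR's actual mechanism (the real format is likely recorded in table metadata rather than inferred from the path):

```rust
// Illustrative only: a format tag a caller could resolve before choosing
// a FormatFileReader implementation (Parquet, ORC, or Avro).
#[derive(Debug, PartialEq)]
enum FileFormat {
    Parquet,
    Orc,
    Avro,
}

// Hypothetical helper: infer the format from the file extension.
fn format_from_path(path: &str) -> Option<FileFormat> {
    match path.rsplit('.').next()? {
        "parquet" => Some(FileFormat::Parquet),
        "orc" => Some(FileFormat::Orc),
        "avro" => Some(FileFormat::Avro),
        _ => None,
    }
}
```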

Brief change log

Tests

API and Format

Documentation

@JingsongLi JingsongLi changed the title [WIP] feat: Abstract FormatFileReader and introduce ORC & avro readers feat: Abstract FormatFileReader and introduce ORC & avro readers Apr 7, 2026
@JingsongLi JingsongLi changed the title feat: Abstract FormatFileReader and introduce ORC & avro readers feat: abstract FormatFileReader and introduce ORC & avro readers Apr 7, 2026
```rust
.collect();
Arc::new(arr)
}
DataType::Decimal(d) => {
```
Contributor
This decimal path is not safe. By the time we get here, the Avro value has already been converted into serde_json::Value, so as_i64() can silently turn valid decimal values into nulls. We should decode the decimal with a typed path instead of going through JSON values here.
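A typed path could start from the two's-complement big-endian bytes Avro uses to encode decimals, yielding the unscaled integer directly instead of routing through serde_json and as_i64(). A minimal stdlib-only sketch (the real reader would presumably feed this unscaled value into an Arrow decimal builder; the helper name is hypothetical):

```rust
// Decode an Avro decimal's two's-complement big-endian bytes into an
// i128 unscaled value. Returns None for empty input or values wider
// than 16 bytes (which would overflow i128).
fn decode_decimal_be(bytes: &[u8]) -> Option<i128> {
    if bytes.is_empty() || bytes.len() > 16 {
        return None;
    }
    // Sign-extend: fill with 0xFF when the high bit is set, else 0x00.
    let fill = if bytes[0] & 0x80 != 0 { 0xFF } else { 0x00 };
    let mut buf = [fill; 16];
    buf[16 - bytes.len()..].copy_from_slice(bytes);
    Some(i128::from_be_bytes(buf))
}
```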

```rust
.collect();
Arc::new(arr)
}
DataType::Binary(_) | DataType::VarBinary(_) => {
```
Contributor
Binary values do not look correct on this path either. bytes and fixed should stay as raw bytes, but this branch assumes they come in as Value::String. That can either fail during deserialization or lose the original bytes semantics.
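A fix along the lines suggested would match the bytes and fixed variants before anything is stringified. The enum below is a stand-in for the Avro value type (the real code would match on apache_avro's typed values), so the names are illustrative:

```rust
// Stand-in for the Avro value enum; only the variants relevant to
// binary columns are modeled here.
enum AvroVal {
    Bytes(Vec<u8>),
    Fixed(usize, Vec<u8>),
    String(String),
}

// Keep bytes/fixed as raw bytes; only a genuine string value is
// re-encoded as UTF-8, instead of assuming everything arrives as a string.
fn to_binary(v: &AvroVal) -> Option<Vec<u8>> {
    match v {
        AvroVal::Bytes(b) => Some(b.clone()),
        AvroVal::Fixed(_, b) => Some(b.clone()),
        AvroVal::String(s) => Some(s.as_bytes().to_vec()),
    }
}
```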

```rust
op,
literals,
} => {
let file_index = mapping.get(*index).copied().flatten()?;
```
Contributor
Dropping the leaf here changes the schema-evolution semantics for missing columns. For an older file that does not contain new_col, filters like new_col = 1, new_col > 0, or new_col IS NOT NULL should become “cannot match”, not disappear entirely. Once we remove the leaf at remap time, the downstream pruning/read path never sees that constraint and can still read rows from old files.
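One way to preserve the constraint is to make the remap return a tri-state instead of dropping the leaf, so pruning still sees "cannot match" for predicates on absent columns. The enum and helper below are a sketch of that idea with invented names, not the PR's code:

```rust
// Result of remapping one predicate leaf against an older file's schema.
#[derive(Debug, PartialEq)]
enum RemappedLeaf {
    Keep(usize),  // column exists in the file; carries the file-side index
    CannotMatch,  // e.g. `new_col = 1` or `new_col IS NOT NULL` on a file without new_col
    AlwaysMatch,  // e.g. `new_col IS NULL` on a file without new_col
}

// `is_null_check` marks an IS NULL predicate: a missing column reads as
// null, so IS NULL always matches while other predicates never can.
fn remap_leaf(mapping: &[Option<usize>], index: usize, is_null_check: bool) -> RemappedLeaf {
    match mapping.get(index).copied().flatten() {
        Some(file_index) => RemappedLeaf::Keep(file_index),
        None if is_null_check => RemappedLeaf::AlwaysMatch,
        None => RemappedLeaf::CannotMatch,
    }
}
```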

```rust
fn unwrap_avro_union(v: &AvroValue) -> Option<&AvroValue> {
match v {
AvroValue::Null => None,
AvroValue::Object(map) if map.len() == 1 => {
```
Contributor
This Object path is no longer safe after adding map/row support. unwrap_avro_union() still treats any single-entry object as a union wrapper, but a valid Avro map with one entry or a row with one field has the same shape here. That means values like {"c": 30} can be unwrapped to 30, and the later map/row builders will read them back as an empty map or an all-null struct. We need to separate union wrappers from real object values instead of using len() == 1 here.
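One way to separate the two cases is to consult the schema at the unwrap site, so a single-entry object is only treated as a union wrapper when the schema says that position is a union. The value and schema enums below are simplified stand-ins invented for the sketch:

```rust
use std::collections::BTreeMap;

// Simplified stand-in for the JSON-like value the reader sees.
#[derive(Debug, PartialEq)]
enum Json {
    Null,
    Int(i64),
    Object(BTreeMap<String, Json>),
}

// Simplified schema tag for the site being decoded.
enum SchemaKind {
    Union,
    Map,
    Record,
}

// Unwrap a union wrapper only when the schema says this is a union;
// a one-entry map or one-field record is real data and passes through.
fn unwrap_union<'a>(v: &'a Json, schema: &SchemaKind) -> Option<&'a Json> {
    match (schema, v) {
        (SchemaKind::Union, Json::Object(map)) if map.len() == 1 => map.values().next(),
        (_, Json::Null) => None,
        _ => Some(v),
    }
}
```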
