feat: abstract FormatFileReader and introduce ORC & avro readers#225
feat: abstract FormatFileReader and introduce ORC & avro readers#225JingsongLi wants to merge 5 commits intoapache:mainfrom
Conversation
| .collect(); | ||
| Arc::new(arr) | ||
| } | ||
| DataType::Decimal(d) => { |
There was a problem hiding this comment.
This decimal path is not safe. By the time we get here, the Avro value has already been converted into serde_json::Value, so as_i64() can silently turn valid decimal values into nulls. We should decode the decimal with a typed path instead of going through JSON values here.
| .collect(); | ||
| Arc::new(arr) | ||
| } | ||
| DataType::Binary(_) | DataType::VarBinary(_) => { |
There was a problem hiding this comment.
Binary values do not look correct on this path either. bytes and fixed should stay as raw bytes, but this branch assumes they come in as Value::String. That can either fail during deserialization or lose the original bytes semantics.
crates/paimon/src/arrow/filtering.rs
Outdated
| op, | ||
| literals, | ||
| } => { | ||
| let file_index = mapping.get(*index).copied().flatten()?; |
There was a problem hiding this comment.
Dropping the leaf here changes the schema-evolution semantics for missing columns. For an older file that does not contain new_col, filters like new_col = 1, new_col > 0, or new_col IS NOT NULL should become “cannot match”, not disappear entirely. Once we remove the leaf at remap time, the downstream pruning/read path never sees that constraint and can still read rows from old files.
| fn unwrap_avro_union(v: &AvroValue) -> Option<&AvroValue> { | ||
| match v { | ||
| AvroValue::Null => None, | ||
| AvroValue::Object(map) if map.len() == 1 => { |
There was a problem hiding this comment.
This Object path is no longer safe after adding map/row support. unwrap_avro_union() still treats any single-entry object as a union wrapper, but a valid Avro map with one entry or a row with one field has the same shape here. That means values like {"c": 30} can be unwrapped to 30, and the later map/row builders will read them back as an empty map or an all-null struct. We need to separate union wrappers from real object values instead of using len() == 1 here.
Purpose
PR introduced the FormatFileReader trait abstraction layer, which splits the reader.rs into:
Core trait interface is:
Brief change log
Tests
API and Format
Documentation