[HUDI-18383] Support selective meta field population #18384
prashantwason wants to merge 2 commits into apache:master
Conversation
Add `hoodie.meta.fields.to.exclude` config to selectively skip meta field population. Excluded fields are written as null for optimal Parquet storage savings while retaining incremental query capability via `_hoodie_commit_time`. Covers all 4 write paths: Avro, Spark InternalRow, Spark SQL row-writer, and Flink. Uses a pre-computed `boolean[5]` for zero-overhead per-row checks. Closes apache#18383

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```java
    : null;
metaFields[2] = populateField[2] ? recordKey : null;
metaFields[3] = populateField[3] ? row.getUTF8String(HoodieRecord.PARTITION_PATH_META_FIELD_ORD) : null;
metaFields[4] = populateField[4] ? fileName : null;
```
So the metadata fields are still in the table schema; they are just selectively left unpopulated.
suryaprasanna left a comment:
Can we have unit tests for this change?
```java
  return getBooleanOrDefault(HoodieTableConfig.POPULATE_META_FIELDS);
}

public Set<String> getMetaFieldsToExclude() {
```
Should this be private?
```java
flags[2] = !excluded.contains(HoodieRecord.RECORD_KEY_METADATA_FIELD);
flags[3] = !excluded.contains(HoodieRecord.PARTITION_PATH_METADATA_FIELD);
flags[4] = !excluded.contains(HoodieRecord.FILENAME_METADATA_FIELD);
return flags;
```
Do we need to include OPERATION_METADATA_FIELD as well?
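For reference, a self-contained sketch (not the actual Hudi code) of how such a `boolean[5]` can be derived from the exclude set. The meta field names match Hudi's five meta columns; the class and method names here are illustrative:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class MetaFieldFlags {
  // Hudi's five meta columns, in ordinal order 0..4.
  static final String[] META_FIELDS = {
      "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
      "_hoodie_partition_path", "_hoodie_file_name"};

  // One flag per ordinal: true means "populate", false means "write null".
  static boolean[] populationFlags(Set<String> excluded) {
    boolean[] flags = new boolean[META_FIELDS.length];
    for (int i = 0; i < META_FIELDS.length; i++) {
      flags[i] = !excluded.contains(META_FIELDS[i]);
    }
    return flags;
  }

  public static void main(String[] args) {
    Set<String> excluded = new HashSet<>(
        Arrays.asList("_hoodie_record_key", "_hoodie_file_name"));
    // Computed once per writer; per-row checks are then plain array reads.
    System.out.println(Arrays.toString(populationFlags(excluded)));
    // prints [true, true, false, true, false]
  }
}
```

Computing the set lookups once in the constructor is what makes the per-row check allocation-free, as the PR description notes.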
```java
private final String fileId;
private final boolean preserveHoodieMetadata;
private final boolean skipMetadataWrite;
private final boolean[] populateField;
```
Should this be named something like populateIndividualMetaFields instead of populateField? What do you think?
```java
private final UTF8String instantTime;

private final boolean populateMetaFields;
private final boolean[] populateField;
```
Should this be named something like populateIndividualMetaFields instead of populateField? What do you think?
```java
row.update(COMMIT_SEQNO_METADATA_FIELD.ordinal(), UTF8String.fromString(seqIdGenerator.apply(recordCount)));
row.update(RECORD_KEY_METADATA_FIELD.ordinal(), recordKey);
row.update(PARTITION_PATH_METADATA_FIELD.ordinal(), UTF8String.fromString(partitionPath));
row.update(FILENAME_METADATA_FIELD.ordinal(), fileName);
```
Do we need to include OPERATION_METADATA_FIELD as well?
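A simplified, hypothetical sketch of how per-row population is gated by the pre-computed flags (plain Strings stand in for Spark's UTF8String; the class and method names are illustrative, not the PR's actual code):

```java
public class GatedMetaUpdate {
  // Ordinals mirror the meta field order: commit time, commit seqno,
  // record key, partition path, file name.
  static void fillMetaFields(String[] row, boolean[] populateField, String[] values) {
    for (int ord = 0; ord < populateField.length; ord++) {
      // Excluded fields stay null; Parquet stores nulls as definition-level
      // bit flags, taking zero data bytes.
      row[ord] = populateField[ord] ? values[ord] : null;
    }
  }

  public static void main(String[] args) {
    // record key and file name excluded
    boolean[] populateField = {true, true, false, true, false};
    String[] row = new String[5];
    fillMetaFields(row, populateField,
        new String[] {"20240101093000000", "seq-0", "key-1", "2024/01/01", "file-1.parquet"});
    System.out.println(java.util.Arrays.toString(row));
    // prints [20240101093000000, seq-0, null, 2024/01/01, null]
  }
}
```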
```scala
  sparkKeyGenerator
}

val populateField = config.getMetaFieldPopulationFlags
```
Should this be named something like populateIndividualMetaFields instead of populateField? What do you think?
```java
    .withDocumentation("When enabled, populates all meta fields. When disabled, no meta fields are populated "
        + "and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing");

public static final ConfigProperty<String> META_FIELDS_TO_EXCLUDE = ConfigProperty
```
This would require a table version upgrade; I'm not sure how we want to track it as part of the next version.
```java
}

public HoodieAvroOrcWriter(String instantTime, StoragePath file, HoodieOrcConfig config, HoodieSchema schema,
                           TaskContextSupplier taskContextSupplier, boolean[] populateField) throws IOException {
```
Should this be named something like populateIndividualMetaFields instead of populateField? What do you think?
…LDS_TO_EXCLUDE Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Describe the issue this Pull Request addresses
Closes #18383
Discussion: #17959
Currently `hoodie.populate.meta.fields` is all-or-nothing: either all 5 meta columns are populated, or none are (all get empty strings). Users who disable it to save storage lose incremental query capability (which requires `_hoodie_commit_time`). Fields like `_hoodie_record_key`, `_hoodie_partition_path`, and `_hoodie_file_name` can be virtualized and don't need physical storage.

Summary and Changelog
Adds a `hoodie.meta.fields.to.exclude` config for selective meta field population. Excluded meta fields are written as null (not empty string) for optimal Parquet storage savings (nulls take zero data bytes, stored as bit flags in definition levels).

Changes:
- New `hoodie.meta.fields.to.exclude` config property in `HoodieTableConfig`
- `getMetaFieldPopulationFlags()` in `HoodieWriteConfig`, returning a pre-computed `boolean[5]` array indexed by meta field ordinal
- Avro write path (`HoodieAvroParquetWriter`, `HoodieAvroOrcWriter`, `HoodieAvroHFileWriter`) via a new `prepRecordWithMetadata()` overload
- Spark InternalRow write path (`HoodieSparkParquetWriter`) via a conditional `updateRecordMetadata()`
- Spark SQL row-writer path (`HoodieRowCreateHandle`, `HoodieDatasetBulkInsertHelper`) via a conditional meta field array
- Flink write path (`HoodieRowDataCreateHandle`) via conditional values in `HoodieRowDataCreation.create()`
- Bloom filter handling when `_hoodie_record_key` is excluded (the bloom filter indexes record keys)
- `AbstractHoodieRowData.getString()` updated to handle null meta columns without NPE

Example config:
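A minimal illustrative config, using the property added in this PR (the specific excluded fields here are just an example; `_hoodie_commit_time` is kept so incremental queries keep working):

```properties
hoodie.populate.meta.fields=true
# Excluded meta fields are written as null instead of computed values
hoodie.meta.fields.to.exclude=_hoodie_record_key,_hoodie_partition_path,_hoodie_file_name
```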
Impact
New config `hoodie.meta.fields.to.exclude` (default: empty). No behavior change for existing users. When configured, excluded meta fields are written as null instead of computed values. Public API addition only (new config property).

Risk Level
Low. Additive change; default behavior is unchanged (an empty exclude list means all fields are populated). The `boolean[5]` array is pre-computed once per writer constructor, with zero per-row allocation overhead.

Documentation Update
New config `hoodie.meta.fields.to.exclude` added with inline documentation. Valid values are the 5 meta field names: `_hoodie_commit_time`, `_hoodie_commit_seqno`, `_hoodie_record_key`, `_hoodie_partition_path`, `_hoodie_file_name`. Only effective when `hoodie.populate.meta.fields=true`.

Contributor's checklist