Summary
The US population simulation path accepts custom/synthetic datasets whose person table uses bare relationship ID columns such as household_id, tax_unit_id, spm_unit_id, family_id, and marital_unit_id. However, when writing the output dataset, PolicyEngineUSLatest.run() only preserves person relationship IDs when the input columns use the prefixed convention, e.g. person_household_id.
That means a custom dataset using the bare-ID convention can successfully build and run a US simulation, but the resulting output person table loses the relationship columns needed for downstream entity mapping.
This is unlikely to affect the official managed/enhanced US datasets, which appear to use the prefixed person_*_id convention. It is most likely to show up in custom, synthetic, hand-built, or test datasets where bare IDs are simpler and are already supported elsewhere in .py.
Why this looks unintended
The broader codebase explicitly supports both conventions:
- build_entity_relationships() resolves person_{entity}_id first and falls back to {entity}_id for custom datasets.
- PolicyEngineUSLatest._build_simulation_from_dataset() also handles both person_X_id and X_id naming conventions.
- Several US fixtures/tests use bare household_id, tax_unit_id, spm_unit_id, etc.
The output writer is the inconsistent path: it comments that person-level group ID columns are needed for downstream joins, but only copies columns that start with person_.
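To make the inconsistency concrete, here is a minimal sketch of what a prefix-only filter does to a bare-ID person table. It is not the actual writer code: the person table is modeled as a plain dict of columns, and `prefixed_columns` is a hypothetical helper that only mimics the "starts with person_" rule described above.

```python
# Hypothetical stand-in for a person table: column name -> list of values.
person = {
    "person_id": [1, 2, 3],
    "household_id": [10, 10, 11],
    "spm_unit_id": [100, 100, 101],
    "age": [34, 6, 71],
}


def prefixed_columns(columns):
    """Keep only columns that follow the prefixed convention (illustrative)."""
    return [c for c in columns if c.startswith("person_")]


kept = prefixed_columns(person)

# Relationship IDs that a prefix-only filter silently leaves behind.
dropped_ids = [c for c in person if c.endswith("_id") and c not in kept]

print(kept)
print(dropped_ids)
```

Under that rule only person_id survives, so the bare household_id and spm_unit_id columns never reach the output person table.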
Observed failure mode
Using a small custom US dataset with bare person relationship IDs:
- Simulation.run() completes.
- sim.output_dataset.data.person lacks household_id, tax_unit_id, spm_unit_id, family_id, and marital_unit_id.
- calculate_us_poverty_rates(sim) then fails when mapping SPM-unit poverty to people with:
  ValueError: Unsupported mapping from spm_unit to person
Likely scope
Low likelihood for official production/managed US datasets. Medium likelihood for custom US population simulations. High likelihood for workflows that run poverty outputs or economic_impact_analysis on small hand-built/synthetic datasets.
Possible fix
When writing the US output person table, preserve the relationship IDs regardless of whether the input used the prefixed or bare convention. For each US group entity, copy from person_{entity}_id if present, otherwise copy from {entity}_id if present, into the canonical bare output column used by map_to_entity().
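A minimal sketch of that fallback, under stated assumptions: the person table is modeled as a plain dict of columns rather than the real dataset object, and `resolve_relationship_ids` and `US_GROUP_ENTITIES` are hypothetical names introduced here for illustration, not existing API.

```python
# Group entities named in this issue; the real list may live elsewhere.
US_GROUP_ENTITIES = ("household", "tax_unit", "spm_unit", "family", "marital_unit")


def resolve_relationship_ids(person, entities=US_GROUP_ENTITIES):
    """Return {bare_column: values} for every relationship ID found.

    Prefers person_{entity}_id, falls back to {entity}_id, and always
    writes to the bare {entity}_id name. Entities with neither column
    are skipped rather than raising.
    """
    resolved = {}
    for entity in entities:
        bare = f"{entity}_id"
        prefixed = f"person_{entity}_id"
        if prefixed in person:
            resolved[bare] = person[prefixed]
        elif bare in person:
            resolved[bare] = person[bare]
    return resolved


# Usage: one input per naming convention yields the same output columns.
bare_input = {"person_id": [1, 2], "household_id": [10, 10], "spm_unit_id": [7, 7]}
prefixed_input = {
    "person_id": [1, 2],
    "person_household_id": [10, 10],
    "person_spm_unit_id": [7, 7],
}

from_bare = resolve_relationship_ids(bare_input)
from_prefixed = resolve_relationship_ids(prefixed_input)
```

Writing to a single canonical column name keeps downstream joins independent of which convention the input dataset happened to use.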