US output datasets drop bare person relationship IDs for custom datasets #335

@anth-volk

Summary

The US population simulation path accepts custom/synthetic datasets whose person table uses bare relationship ID columns such as household_id, tax_unit_id, spm_unit_id, family_id, and marital_unit_id. However, when writing the output dataset, PolicyEngineUSLatest.run() only preserves person relationship IDs when the input columns use the prefixed convention, e.g. person_household_id.

That means a custom dataset using the bare-ID convention can successfully build and run a US simulation, but the resulting output person table loses the relationship columns needed for downstream entity mapping.

This is unlikely to affect the official managed/enhanced US datasets, which appear to use the prefixed person_*_id convention. It is most likely to show up in custom, synthetic, hand-built, or test datasets where bare IDs are simpler and are already supported elsewhere in .py.

Why this looks unintended

The broader codebase explicitly supports both conventions:

  • build_entity_relationships() resolves person_{entity}_id first and falls back to {entity}_id for custom datasets.
  • PolicyEngineUSLatest._build_simulation_from_dataset() also handles both person_X_id and X_id naming conventions.
  • Several US fixtures/tests use bare household_id, tax_unit_id, spm_unit_id, etc.

The output writer is the inconsistent path: it comments that person-level group ID columns are needed for downstream joins, but only copies columns that start with person_.
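The mismatch can be illustrated with a minimal sketch. This is not the actual output-writer code: `prefixed_only` and `US_GROUP_ENTITIES` are hypothetical names, and the filter is only an assumption about how a "starts with person_" check would behave on each naming convention.

```python
# Hypothetical illustration only; not the real PolicyEngineUSLatest.run() code.
US_GROUP_ENTITIES = ["household", "tax_unit", "spm_unit", "family", "marital_unit"]


def prefixed_only(columns):
    """Keep only column names that start with 'person_' (assumed writer behavior)."""
    return [c for c in columns if c.startswith("person_")]


# Person-table column names under each convention.
bare_columns = ["person_id"] + [f"{e}_id" for e in US_GROUP_ENTITIES]
prefixed_columns = ["person_id"] + [f"person_{e}_id" for e in US_GROUP_ENTITIES]

kept_from_bare = prefixed_only(bare_columns)
kept_from_prefixed = prefixed_only(prefixed_columns)

# With bare IDs, every group relationship column is filtered out.
dropped_from_bare = [c for c in bare_columns if c not in kept_from_bare]
```

Under this assumption, a bare-ID dataset keeps only `person_id`, while a prefixed dataset keeps all five relationship columns.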

Observed failure mode

Using a small custom US dataset with bare person relationship IDs:

  • Simulation.run() completes.
  • sim.output_dataset.data.person lacks household_id, tax_unit_id, spm_unit_id, family_id, and marital_unit_id.
  • calculate_us_poverty_rates(sim) then fails when mapping SPM-unit poverty to people with:
ValueError: Unsupported mapping from spm_unit to person

Likely scope

  • Low likelihood for official production/managed US datasets.
  • Medium likelihood for custom US population simulations.
  • High likelihood for small hand-built/synthetic datasets that call poverty outputs or economic_impact_analysis.

Possible fix

When writing the US output person table, preserve the relationship IDs regardless of whether the input used the prefixed or bare convention. For each US group entity, copy from person_{entity}_id if present, otherwise copy from {entity}_id if present, into the canonical bare output column used by map_to_entity().
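A rough sketch of that fallback, assuming the person table behaves like a mapping from column name to values (a pandas DataFrame supports the same `in` and `[]` operations). `resolve_relationship_ids` and `US_GROUP_ENTITIES` are hypothetical names for illustration, not existing functions in the codebase.

```python
# Hypothetical helper; the names and structure here are assumptions, not existing API.
US_GROUP_ENTITIES = ["household", "tax_unit", "spm_unit", "family", "marital_unit"]


def resolve_relationship_ids(person_table):
    """Collect person-level group IDs under the canonical bare column names.

    For each group entity, prefer person_{entity}_id and fall back to
    {entity}_id. Entities with neither column are skipped.
    """
    resolved = {}
    for entity in US_GROUP_ENTITIES:
        prefixed = f"person_{entity}_id"
        bare = f"{entity}_id"
        if prefixed in person_table:
            resolved[bare] = person_table[prefixed]
        elif bare in person_table:
            resolved[bare] = person_table[bare]
    return resolved


# Usage: a custom dataset with bare IDs and an official-style one with prefixed IDs.
bare_input = {"person_id": [1, 2], "household_id": [10, 10], "spm_unit_id": [20, 21]}
prefixed_input = {"person_id": [1, 2], "person_household_id": [10, 10]}

from_bare = resolve_relationship_ids(bare_input)
from_prefixed = resolve_relationship_ids(prefixed_input)
```

Both inputs end up with `household_id` in the output, so downstream mapping sees the same column name regardless of the input convention.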
