Preserve authoritative SPM unit ids through synthesis (#113) #118

Merged: MaxGhenis merged 3 commits into main from fix/preserve-spm-unit-ids on May 31, 2026
@MaxGhenis
Contributor

Part of #113. Implements the maintainer directive: keep the source SPM unit ids; tax-unit ids are unreliable, so they are reconstructed rather than preserved.

The bug

The source records carry eCPS-quality authoritative entity ids. Measured on the real CPS ASEC cache, spm_unit_id averages 1.0428 units/household and tax_unit_id averages 1.3746. But _assign_family_and_spm_units preserved spm_unit_id only when every record had an id (_normalized_complete_existing_group_ids returns None on any missing id). Synthesis leaves some records without an id, so the whole column was discarded and SPM units were regenerated as one per household (~1.00/hh). That accounts for a large share of the #113 SPM gap (output measured at 1.005–1.015/hh).
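The all-or-nothing behavior described above can be sketched roughly as follows. This is a simplified, list-based illustration, not the repository's code: `complete_group_ids_or_none` and `assign_spm_units_old` are hypothetical stand-ins for _normalized_complete_existing_group_ids and the SPM part of _assign_family_and_spm_units, and `None` is assumed to mark a missing id.

```python
from typing import Hashable, Optional, Sequence


def complete_group_ids_or_none(
    group_ids: Sequence[Optional[Hashable]],
) -> Optional[list]:
    """Hypothetical stand-in: normalize ids only if none are missing."""
    # A single missing id discards the whole column.
    if any(g is None for g in group_ids):
        return None
    labels: dict = {}
    return [labels.setdefault(g, len(labels)) for g in group_ids]


def assign_spm_units_old(
    household_ids: Sequence[Hashable],
    spm_ids: Sequence[Optional[Hashable]],
) -> list:
    """Hypothetical sketch of the old behavior: preserve all ids or none."""
    preserved = complete_group_ids_or_none(spm_ids)
    if preserved is not None:
        return preserved
    # Fallback: regenerate one SPM unit per household (~1.00 units/hh).
    labels: dict = {}
    return [labels.setdefault(hh, len(labels)) for hh in household_ids]


households = ["a", "a", "b"]
print(assign_spm_units_old(households, [10, 11, 12]))    # [0, 1, 2]
print(assign_spm_units_old(households, [10, 11, None]))  # [0, 0, 1]
```

In the second call, one missing id in household "b" is enough to erase the two-unit split in household "a", which is the collapse this PR addresses.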

The fix

_preserve_present_group_ids: keep every present id's grouping; fold rows with a missing id into their household's existing unit (never fabricating a spurious one); give fully-missing households a single fallback unit. SPM assignment now uses it; tax-unit construction is unchanged (reconstructed, not preserved).
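A minimal sketch of the three rules above, written against plain lists rather than the project's data structures. The function name, signature, and use of `None` for a missing id are illustrative assumptions, and so is the choice to fold missing rows into the first present unit seen in their household (the description only says "their household's existing unit").

```python
from typing import Hashable, Optional, Sequence


def preserve_present_group_ids(
    household_ids: Sequence[Hashable],
    group_ids: Sequence[Optional[Hashable]],
) -> Optional[list]:
    """Hypothetical sketch of _preserve_present_group_ids (not the real signature)."""
    if len(household_ids) != len(group_ids):
        raise ValueError("household_ids and group_ids must have the same length")
    # Nothing to preserve: let the caller regenerate units.
    if all(g is None for g in group_ids):
        return None

    # First present id per household; rows with a missing id fold into it.
    # (Assumption: the real code may pick the target unit differently.)
    first_present: dict = {}
    for hh, g in zip(household_ids, group_ids):
        if g is not None and hh not in first_present:
            first_present[hh] = g

    labels: dict = {}
    result = []
    for hh, g in zip(household_ids, group_ids):
        if g is not None:
            key = (hh, "present", g)  # keep every present id's grouping
        elif hh in first_present:
            key = (hh, "present", first_present[hh])  # fold, never fabricate
        else:
            key = (hh, "fallback", None)  # one unit for a fully-missing household
        result.append(labels.setdefault(key, len(labels)))
    return result


households = ["a", "a", "a", "b", "b", "c"]
spm_ids = [10, 11, None, None, None, 12]
print(preserve_present_group_ids(households, spm_ids))  # [0, 1, 0, 2, 2, 3]
```

Household "a" keeps its two units and absorbs the row with a missing id, household "b" gets a single fallback unit, and household "c" is untouched. Keying on the (household, id) pair is a defensive choice in this sketch so that an id reused across households cannot merge them.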

Validation (real CPS ASEC, 55,762 households)

| SPM ids missing | new spm/hh | old behavior |
|---|---|---|
| 0% | 1.0428 (exact) | kept |
| 20% (random) | 1.0304 | discarded → ~1.00 |
| 50% (random) | 1.0148 | discarded → ~1.00 |
| 30% of whole households | 1.0302 | discarded → ~1.00 |

Present households keep 1.04; only genuinely-missing ones degrade — no inflation, no collapse.
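For reference, the spm/hh figure is a units-per-household ratio. A minimal sketch of how such a ratio could be computed is below; `units_per_household` is a hypothetical helper, not the script used for the validation, and it assumes a unit is identified by its (household, unit id) pair.

```python
from typing import Hashable, Sequence


def units_per_household(
    household_ids: Sequence[Hashable],
    unit_ids: Sequence[Hashable],
) -> float:
    """Hypothetical metric: distinct units divided by distinct households."""
    if len(household_ids) != len(unit_ids):
        raise ValueError("household_ids and unit_ids must have the same length")
    households = set(household_ids)
    # Count a unit once per (household, unit id) pair.
    units = set(zip(household_ids, unit_ids))
    return len(units) / len(households)


households = ["a", "a", "a", "b", "b", "c"]
preserved = [0, 1, 0, 2, 2, 3]   # household "a" keeps two units
collapsed = [0, 0, 0, 1, 1, 2]   # one unit per household
print(units_per_household(households, preserved))  # 4 units / 3 households
print(units_per_household(households, collapsed))  # 1.0
```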

Tests

5 new tests (test_us_spm_preservation.py): present ids preserved, missing ids fold into an existing unit (not split), fully-missing household → fallback unit, all ids missing → None, end-to-end partial preservation.

🤖 Generated with Claude Code

SPM unit ids from the source are eCPS-quality (~1.04 units/household). The old
all-or-nothing preservation discarded the whole column if ANY id was missing,
collapsing to one SPM unit per household (~1.00) -- a large share of the #113
SPM gap. Add _preserve_present_group_ids: keep every present id's grouping, fold
missing rows into their household's existing unit (never fabricating one), and
give fully-missing households a single fallback unit. Tax-unit ids are NOT
preserved -- they are reconstructed (they are unreliable; see #113).

Validated on real CPS ASEC: spm/hh stays ~1.04 with present ids and degrades
gracefully with missing ones (vs 1.00 before). 5 tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MaxGhenis and others added 2 commits May 31, 2026 16:48
@MaxGhenis MaxGhenis merged commit 525a194 into main May 31, 2026
4 checks passed
@MaxGhenis MaxGhenis deleted the fix/preserve-spm-unit-ids branch May 31, 2026 20:56