L0 regularization paper (WIP) #685

Draft
juaristi22 wants to merge 9 commits into main from maria/l0-paper

Conversation

@juaristi22
Collaborator

No description provided.

juaristi22 marked this pull request as draft on April 3, 2026, 05:50
Collaborator

@baogorek left a comment


First round of feedback!

  1. I think Author ordering is perfect
  2. Paragraphs are one long line in a text editor. Hey, we've all got text wrap. But I would recommend having an AI split the lines up very loosely, to roughly 100-120 characters. If you notice, all my comments show up at the beginning of the paragraph, because the paragraph is the line.
  3. Other methods you may want to consider for the background:
    a) Covariate Balancing in Causal Inference
    b) Small Area Estimation
  4. We've got to figure out how to talk about states + DC and districts
  5. I'm less sure after reading that GREG and IPF are your "competitors." Let's chat.

Comment thread paper-l0/sections/abstract.tex Outdated
\begin{abstract}
Tax-benefit microsimulation models typically operate at the national level, using household survey weights calibrated to aggregate population targets. Subnational analysis---at the level of states, congressional districts, or local authorities---requires datasets that simultaneously satisfy geographic distributional constraints while preserving household-level detail. We present a method based on $L_0$ regularization that jointly optimizes survey weight magnitudes and sparsity to produce calibrated subnational microsimulation datasets.

Our approach builds on the Hard Concrete distribution \citep{louizos2018}, which induces exact sparsity by multiplying each household's weight by a learned stochastic gate that collapses to a deterministic zero or one at inference time. We parameterize each gate with a log-alpha and temperature parameter, and jointly optimize these alongside log-transformed weight magnitudes using a single loss function combining scale-invariant relative calibration error, an $L_0$ sparsity penalty on the expected count of active households, and a light $L_2$ regularizer on weight magnitudes.
Collaborator

This is good. There is one thing that we do differently from Louizos et al that I want to talk to you about in person.
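For concreteness, a minimal PyTorch sketch of the gate-and-loss mechanics described in the abstract paragraph above, on toy data and with illustrative parameter names (this is the standard Hard Concrete construction, not the l0-python API):

    import torch

    torch.manual_seed(0)
    n_households, n_targets = 1000, 20
    M = torch.rand(n_targets, n_households)                # toy calibration matrix
    targets = M @ torch.full((n_households,), 2.0)         # toy, achievable targets
    init_weights = torch.full((n_households,), 2.0)

    gamma, zeta, beta = -0.1, 1.1, 0.35                    # stretch limits and temperature
    lambda_l0, lambda_l2 = 1e-4, 1e-6
    log_alpha = torch.full((n_households,), 3.0, requires_grad=True)  # gates start mostly open
    log_w = torch.log(init_weights).clone().requires_grad_(True)      # log weight magnitudes

    def sample_gates(log_alpha, training=True):
        if training:
            u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + log_alpha) / beta)  # Concrete sample
        else:
            s = torch.sigmoid(log_alpha)                                     # deterministic at inference
        return (s * (zeta - gamma) + gamma).clamp(0.0, 1.0)                  # stretch, then clip to exact 0/1

    opt = torch.optim.Adam([log_alpha, log_w], lr=0.05)
    for _ in range(200):
        z = sample_gates(log_alpha)
        weights = z * log_w.exp()                                            # gated, positive weights
        rel_err = ((M @ weights - targets) / targets).pow(2).mean()          # scale-invariant relative error
        expected_active = torch.sigmoid(
            log_alpha - beta * torch.log(torch.tensor(-gamma / zeta))).sum() # expected count of open gates
        loss = rel_err + lambda_l0 * expected_active + lambda_l2 * log_w.exp().pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()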

Comment thread paper-l0/sections/abstract.tex Outdated


The pipeline begins with the US Current Population Survey. Each household record is cloned multiple times and assigned to random census blocks drawn from a population-weighted distribution. Program participation indicators are re-randomized per geographic assignment using local take-up rates. Each clone is then run through \policyengine{}'s tax-benefit microsimulation engine to generate geography-specific outputs. The $L_0$ optimizer selects which household-geography combinations to retain, calibrating simultaneously against approximately 37,800 targets across three geographic levels. The sparsity penalty is configurable: a higher penalty produces a compact national dataset of approximately 50,000 records, while a lower penalty yields a larger dataset of approximately 3--4 million records covering all 436 congressional districts and 50 states individually. The method is implemented as the open-source \texttt{l0-python} PyTorch package.
Collaborator

Paragraph feels less like an abstract and more like a body paragraph. I know it's still a WIP, but really the abstract needs to have cold, hard stats in it that show performance. (We'll get them.)

Collaborator

Uggh, one other thing we have to tackle: technically there are not 436 congressional districts. I was about to tell you to use 51 "state equivalents," but that's not exactly correct either. A TODO.

Comment thread paper-l0/sections/introduction.tex Outdated
\section{Introduction}
\label{sec:introduction}

Microsimulation models estimate the effects of tax and benefit policies on households by applying program rules to individual-level microdata. Most operational models---including those maintained by the Congressional Budget Office \citep{cbo2018}, the Joint Committee on Taxation \citep{jct2023}, and the Tax Policy Center \citep{tpc2024}---operate at the national level. They calibrate household survey weights to aggregate administrative totals such as total income tax revenue, program enrollment counts, and demographic benchmarks, then use the reweighted dataset to simulate policy reforms.
Collaborator

"They calibrate" -> do we know that for sure?

Comment thread paper-l0/sections/introduction.tex Outdated

Subnational policy analysis introduces a fundamentally different calibration challenge. Rather than matching a single set of national aggregates, the microdata must simultaneously reproduce distributional statistics at multiple geographic levels: congressional districts, states, and the nation as a whole. A dataset calibrated for the state of California must match California-specific IRS income totals, SNAP participation counts, Medicaid enrollment, and age distributions, while remaining consistent with national budget projections from the CBO and tax expenditure estimates from the JCT. Across 436 congressional districts and 50 states, this produces approximately 37,800 simultaneous calibration targets.

Existing calibration methods scale poorly to this setting. Iterative proportional fitting \citep[IPF;][]{deming1940, ireland1968} adjusts weights along one dimension at a time, cycling through marginal constraints until convergence. IPF handles cross-classified tables but does not naturally accommodate hierarchical geographic constraints---district targets must sum to state targets, which must sum to national targets---without ad hoc post-processing. Generalized regression (GREG) estimators \citep{deville1992, sarndal2007} solve a constrained optimization problem that minimizes distance from initial weights subject to exact calibration constraints. GREG produces a closed-form solution for moderate numbers of constraints but becomes computationally intractable and numerically unstable as the constraint count approaches the tens of thousands.
Collaborator

IPF is also just for categorical variables (which you do indicate with "cross-classified tables," but it may make sense to call that out explicitly), and in classical raking there is no "almost." You either get the calibration perfect or it failed to converge. You may want to comment that this is also called "raking." There is a YouTube video called something like "generalized raking" where the presenter talks about relaxing that.

My understanding is that GREG too seeks to exactly match the targets and it's a failure if it can't (not just positive loss), but check me on that. GREG does permit quantitative, in addition to qualitative variables. But it will happily use negative weights, which is annoying.
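On the raking point, a minimal sketch of classical IPF on two categorical margins may be useful to cite or adapt; written in plain Python rather than any particular survey package, it either matches both margins exactly or simply never converges:

    import numpy as np

    # Toy microdata: each unit has two categorical attributes and a design weight.
    rng = np.random.default_rng(0)
    n = 500
    region = rng.integers(0, 3, n)                     # three regions
    tenure = rng.integers(0, 2, n)                     # two tenure classes
    w = np.ones(n)

    region_targets = np.array([180.0, 160.0, 160.0])   # both margins must imply the same total,
    tenure_targets = np.array([300.0, 200.0])          # otherwise raking cannot converge

    for _ in range(100):
        for cat, margin in ((region, region_targets), (tenure, tenure_targets)):
            current = np.bincount(cat, weights=w, minlength=len(margin))
            w *= (margin / current)[cat]               # proportional adjustment along one margin
        fitted = np.bincount(region, weights=w, minlength=3)
        if np.allclose(fitted, region_targets, rtol=1e-8):
            break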

Collaborator

"becomes computationally intractable and numerically unstable as the constraint count approaches the tens of thousands." A source would be helpful here. Is this because, as I suggested above, the procedure simply fails when it can't match the equations?

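On the GREG question, the chi-square (linear) calibration estimator has a closed form that hits the targets exactly whenever the linear system is solvable, and nothing in it keeps individual weights non-negative. A generic textbook sketch, not the benchmark's GREG runner:

    import numpy as np

    rng = np.random.default_rng(1)
    n, k = 1000, 5
    X = rng.random((n, k))                   # auxiliary variables per unit
    d = np.full(n, 2.0)                      # initial design weights
    t = X.T @ d * rng.uniform(0.7, 1.3, k)   # perturbed calibration targets

    # Chi-square distance: w = d + diag(d) X (X' diag(d) X)^{-1} (t - X' d)
    lam = np.linalg.solve(X.T @ (d[:, None] * X), t - X.T @ d)
    w = d * (1.0 + X @ lam)

    assert np.allclose(X.T @ w, t)           # constraints hold exactly
    print("share of negative weights:", np.mean(w < 0))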
Comment thread paper-l0/sections/introduction.tex Outdated


Spatial microsimulation methods take a different approach, often distinguishing between reweighting methods and synthetic reconstruction methods for constructing small-area microdata \citep{tanton2014review}. Within this broader literature, researchers have used combinatorial optimization and simulated annealing \citep{williamson1998, huang2001, harland2012} as well as deterministic reweighting \citep{tanton2011, lovelace2016}. These methods typically operate at a single geographic level and require separate calibration runs for each area, making joint multi-level calibration difficult.
Collaborator

So even though you have "different" in the first sentence, you've switched from classical statistics to the microsimulation literature, which could be jarring. I think flowing through the different conceptual frameworks is going to require some artistic thinking here.

@@ -0,0 +1,158 @@
\section{Methodology}
\label{sec:methodology}

Collaborator

Pure reader note: By the time we've gotten to methodology, there's been a lot of methodology!

Comment thread paper-l0/sections/methodology.tex Outdated

\subsubsection{Block sampling}

Census blocks are the finest geographic unit in the decennial census. Each block maps deterministically to a congressional district, county, tract, and state. The sampling distribution $P_{\text{pop}}(\text{block})$ is proportional to the block's share of the national population. Drawing blocks rather than congressional districts ensures fine-grained geographic variation within districts and enables derivation of county-level variables (Section~\ref{sec:stage4}).
Collaborator

And of course, now we've got adjusted gross income in the formula

Collaborator

Ohhh, I see that's mentioned below. Hmm, I wonder if it's best to just show the formula all at once.

P_{AGI}(b) reads a bit strange since that would imply that AGI is the random variable. I think we want a multinomial distribution over all census blocks, and I do think it would be best to define that multinomial distribution directly.
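Along the lines of defining the multinomial directly, a minimal sketch of population-weighted block draws; the block table, column names, and IDs here are purely illustrative, and any AGI adjustment would modify the probabilities before sampling:

    import numpy as np
    import pandas as pd

    # Hypothetical block table: one row per census block with its resident population.
    blocks = pd.DataFrame({
        "block_geoid": ["060710001001000", "481130002002001", "360470003003002"],
        "state_fips": [6, 48, 36],
        "district": ["601", "4830", "3607"],
        "population": [1200, 800, 2000],
    })

    # Multinomial probabilities P_pop(block) proportional to population share.
    p = (blocks["population"] / blocks["population"].sum()).to_numpy()

    rng = np.random.default_rng(42)
    n_clones = 430
    draws = rng.choice(len(blocks), size=n_clones, replace=True, p=p)   # one block per clone
    assignments = blocks.iloc[draws][["block_geoid", "state_fips", "district"]].reset_index(drop=True)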

Comment thread paper-l0/sections/methodology.tex Outdated

\subsubsection{Per-state parallel simulation}

The matrix is populated by running each household through \policyengine{}'s tax-benefit microsimulation engine. Because many target variables depend on state-specific tax and benefit rules, a separate simulation is required for each state. A parallel dispatcher sends one job per unique state FIPS code to a pool of worker processes. Each worker creates a fresh \texttt{Microsimulation} instance, overwrites every household's \texttt{state\_fips} with the target state, invalidates cached downstream variables, and calculates all target variables at the household and person levels, accounting for differences in state legislation.
Collaborator

ACA PTC has county-specific rules, and eventually other policies are coming that will operate at lower geographic levels. Though admittedly, in the run I just ran, I'm only using the state level. A plug here for #598, which has been getting buried. We've got to figure out how to set the geographic level. Oh, that's right, Max had the idea for that and it's documented in that issue.
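For readers unfamiliar with the dispatcher pattern, a schematic of the per-state fan-out might look like the sketch below; simulate_state is a hypothetical stand-in for the real worker, which builds a fresh Microsimulation, overwrites state_fips, invalidates cached variables, and recomputes the target variables:

    from concurrent.futures import ProcessPoolExecutor

    STATE_FIPS = [1, 2, 4, 5, 6]   # one entry per unique state FIPS code in the real run

    def simulate_state(fips: int) -> dict:
        """Hypothetical worker: rerun the engine with every household moved to one state."""
        # In the real pipeline this constructs a Microsimulation, sets state_fips for all
        # households, invalidates downstream caches, and returns state-specific outputs.
        return {"state_fips": fips, "columns": None}

    if __name__ == "__main__":
        with ProcessPoolExecutor(max_workers=8) as pool:
            results = list(pool.map(simulate_state, STATE_FIPS))   # one job per state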

Comment thread paper-l0/sections/methodology.tex Outdated

\subsubsection{Hyperparameters}

Table~\ref{tab:hyperparameters} lists the optimization hyperparameters with their values and roles. The stretch parameters $\gamma = -0.1$ and $\zeta = 1.1$ follow the original Hard Concrete paper, placing approximately 9\% of the sigmoid's mass below 0 and above 1, which is what allows clipping to produce exact zeros and ones.
Collaborator

Note that we've kept most of Louizos's parameters. I have been setting beta back to .67, which was what they used, but honestly I don't have a good reason, and .35 works as well. The initial probability is .999, and that's just from trial and error. I'd set it to 1 if I could (it blows up). It's intended to be the proportion of gates that are open at the start. I always saw worse performance when I started with some gates closed. If they start open, they will close (eventually) though it will take more epochs. I know this is a "trust me bro" style of argument.
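One way to make the 0.999 starting point concrete: under the Hard Concrete penalty, the probability that a gate is active is sigmoid(log_alpha - beta * log(-gamma / zeta)), so the initial log_alpha can be solved from the desired open probability. A small sketch (not necessarily how l0-python initializes it):

    import math

    gamma, zeta, beta = -0.1, 1.1, 0.35
    p_open = 0.999                 # desired initial probability that a gate is active

    # Invert P(gate active) = sigmoid(log_alpha - beta * log(-gamma / zeta)).
    log_alpha_init = math.log(p_open / (1 - p_open)) + beta * math.log(-gamma / zeta)
    print(log_alpha_init)          # about 6.07; p_open = 1 would need +inf, hence "it blows up"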

Comment thread paper-l0/sections/results.tex Outdated

\subsubsection{Unachievable targets}

Of the approximately 37,800 targets, \tbc[count] are marked unachievable (row sum zero in the calibration matrix). These correspond to congressional districts where no clones carry nonzero values for the target variable. Increasing the clone count from 430 reduces the number of unachievable targets, at the cost of a larger calibration matrix.
Collaborator

I know 37,800 is just a placeholder. This is inflated by the fact that multiple years of the same target can be in the database, etc.
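The row-sum-zero check itself is a one-liner against the sparse calibration matrix; a sketch with illustrative names:

    import numpy as np
    from scipy import sparse

    # M: targets x household-clone calibration matrix, typically very sparse.
    M = sparse.random(100, 5000, density=0.01, format="csr", random_state=0)

    row_sums = np.asarray(M.sum(axis=1)).ravel()
    unachievable = np.flatnonzero(row_sums == 0)     # no clone contributes to these targets
    print(f"{unachievable.size} of {M.shape[0]} targets are unachievable")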

juaristi22 force-pushed the maria/l0-paper branch 2 times, most recently from b200851 to 6de667d on April 23, 2026, 16:49
juaristi22 and others added 9 commits April 30, 2026 16:12
Paper: "L0 regularization for subnational microsimulation calibration"
targeting the International Journal of Microsimulation.

Current state of the manuscript:
- Full paper structure: abstract, introduction, background, data,
  methodology, results, discussion, conclusion, appendix
- Formal survey calibration problem definition with GREG and IPF
  explained in depth, including benefits, drawbacks, and current
  practice in operational models (CBO, JCT, TPC, EUROMOD, TAXSIM)
- Four-stage pipeline methodology (clone, matrix, L0 optimize,
  assemble) documented against the pipeline source code
- Detailed appendix target tables populated from policy_data.db
  (37,758 targets: 33,572 district, 4,080 state, 106 national)
- All writing in US English, citations linked via plainnat/natbib

Still TODO:
- [ ] Implement IPF and GREG baselines on the same calibration matrix
      to populate the comparison table (tables/comparison.tex)
- [ ] Run calibration experiments and fill in all [TBC] placeholders
      in the results section (accuracy, sparsity, convergence, ESS)
- [ ] Generate convergence curve figure from calibration_log.csv
- [ ] Select and run a subnational policy application example
      (Section 5.5 — candidate: EITC expansion across CDs)
- [ ] Review pipeline methodology section against latest code for
      accuracy (clone-and-assign, matrix builder, assembly steps)
- [ ] Review and deepen background section: verify claims about
      GREG/IPF limitations, add any missing related work
- [ ] Resolve pre-existing overfull hbox warnings (long URLs in
      conclusion, hyperparameters table width)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Introduce a paper benchmarking scaffold that compares L0 and GREG on the same exported calibration matrix while routing IPF through a separate automatic preprocessing step that reconstructs IPF-ready unit and target inputs from the saved package metadata. The scaffold includes two R runners, manifest-driven bundle export, common scoring against the shared matrix, environment setup helpers, and end-to-end tests for the runner schemas.
Replace the old order-dependent sequential IPF flow with a validated one-call path. The converter now keeps only closed categorical systems, derives complements only from authored parent totals, reports explicit drop reasons for non-runnable target families, and surfaces incompatible-total failures through structured diagnostics instead of silently chaining margin blocks.

Update the benchmark harness to export retained-authored IPF scoring artifacts, validate the external-IPF input contract, and let L0/GREG opt into scoring on the same retained-authored subset for apples-to-apples comparisons. Refresh the walkthrough notebook, README, example manifest, and add regression coverage for closure logic, runner behavior, export validation, and the new scoring contract.
Methodological
--------------
- Add `--train-on {shared_requested,ipf_retained_authored}` to
  benchmark_cli so L0 and GREG can be fit on the same target subset IPF
  was given. Matched runs write `{method}_matched_summary.json` so they
  don't overwrite the full-info `{method}_summary.json`. Both runs record
  `training_target_set` and `scoring_target_set` in the summary.
- Add `seed` parameter to `fit_l0_weights` (seeds torch, CUDA when
  present, numpy). Default None preserves prior non-deterministic
  behavior. The benchmark CLI plumbs `seed` from `method_options.l0` so
  paper benchmark runs are reproducible.
- Align IPF `bound` default in benchmark_cli to 10.0 to match the example
  manifest; the prior 4.0 was too tight for real district-level fits.
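A minimal sketch of the seeding behavior described in the second bullet above (torch, CUDA when present, numpy); this is not the actual fit_l0_weights signature, just the effect of passing seed:

    import numpy as np
    import torch

    def seed_everything(seed):
        """Make an L0 fit reproducible; None preserves the prior non-deterministic behavior."""
        if seed is None:
            return
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
        np.random.seed(seed)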

Correctness
-----------
- Fix `is_authored` dedupe key in `_try_close_binary_subset` — collapsing
  the dict on `(geo, cell)` is correct; the prior key allowed an authored
  + derived duplicate to slip through. Add an assertion for the genuine
  bug case where the same cell would arrive with conflicting authoring.
- Fail loudly with `IPFConversionError` (and a diagnostics file) when the
  package's `targets_df` lacks `target_id`. The previous silent skip left
  downstream `--score-on ipf_retained_authored` to surface a misleading
  FileNotFoundError.
- Wrap converter `ValueError` / `FileNotFoundError` raise sites in
  `IPFConversionError` so `ipf_conversion_diagnostics.json` is always
  written before export aborts.

Tests added
-----------
- ipf_conversion: `negative_derived_complement` guardrail,
  `incompatible_totals` consistency check, `unsupported_partial_margin`
  3+ category subset case.
- benchmarking_runners: seed wire-up (CLI → fit_l0_weights), `--train-on`
  retained-subset routing + fail-when-subset-missing, one-call IPF on a
  coherent person + household problem, derived-complement cell fitted
  through surveysd::ipf.
- benchmark_export: missing-`target_id` fail-loud path with diagnostics.

Documentation
-------------
- README: matched two-fit recipe, L0 seed setting, GREG-linear's
  negative-weight regime (with `negative_weight_share` as the surfaced
  metric), strictness contract, open-question 1 resolution.
- 02_ipf_walkthrough.ipynb: new section 10g demonstrating binary-subset
  closure via an authored parent total (the safe counterpart to the
  dropped open-subset case); one-call statement tightened to "one
  exported IPF run equals one coherent closed categorical system."

Still deferred
--------------
- Entity-aware extension for `tax_unit_count` / `spm_unit_count`.
- `survey:::grake` private-namespace migration / R package pinning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the deferred-future framing ("`tax_unit_count` and `spm_unit_count`
remain outside the core household/person IPF path *in this pass*") with the
methodological reason: `surveysd::ipf` supports exactly two entity levels
natively (`conP` for row-level / person-style constraints, `conH` for
constraints aggregated by `hid`), and there is no generalised mechanism
for additional counted entities such as `tax_unit` or `spm_unit`.

Producing a single weight vector that simultaneously satisfies targets at
three or more entity levels is not possible with classical IPF. Running
IPF separately per scope and aggregating would give it more degrees of
freedom than L0 / GREG (each pass solves a smaller subproblem with its
own freedom) and would not be a like-for-like comparison. The benchmark
therefore restricts IPF to `person_count` and `household_count` —
together or alone — and drops other count families at the count check.
Those targets remain in the shared sparse system that L0 and GREG fit,
so the cross-method comparison on the IPF-feasible subset stays
apples-to-apples via `--score-on ipf_retained_authored`.

Updated the README's methodology section and the IPF inputs subsection
to articulate the constraint, and tightened the `_target_scope` error
message in `ipf_conversion.py` to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tor, fixes

This commit lands the production scaffolding called for by paper-l0/BENCHMARK_PLAN.md
so the L0 / GREG / IPF benchmark can be launched end-to-end against the saved
calibration package without further code changes.

Tier manifests (paper-l0/benchmarking/manifests/tier*.json)
  All ten paper-reported manifests, each reading from one shared
  calibration_package.pkl and differing only in target_filters:

    tier1_mixed.json            L0 + GREG on the full filtered slice (count +
                                dollar combined) over a five-state, ten-district
                                geography subset plus national targets.
    tier1_ipf.json              L0 + IPF on the same slice; IPF retains the
                                authored closed subset and the matched runs use
                                --train-on ipf_retained_authored.
    tier2_scaling_250 ...
    tier2_scaling_largest.json  Scaling ladder (250 / 500 / 1k / 2.5k / 5k / 10k
                                / largest coherent pre-production subset) over
                                the same package, growing geography coverage to
                                grow the target set instead of the unit universe.
    tier3_production.json       Least-filtered view; failures are reportable
                                results (status=failed with notes) rather than
                                missing rows.

  All tier manifests opt into method_options.ipf.return_na = true so non-
  convergence surfaces NaN and the runner converts that into a visible runtime
  error instead of writing NaN-laden weights to disk.

  Geographic ID format is the convention the saved package actually uses: state
  FIPS without leading zeros for single-digit states (e.g. CA = "6", CA-01 =
  "601"). The two pre-existing example manifests are updated to match.

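A hypothetical sketch of a manifest's shape, using only field names mentioned in these commit messages (target_filters, max_targets, method_options.l0.seed, method_options.ipf.return_na, and the FIPS-style geography IDs); the real JSON files may nest or name things differently:

    manifest = {
        "target_filters": {
            "max_targets": 500,
            "geographies": ["6", "601"],       # CA and CA-01 in the package's ID convention
        },
        "method_options": {
            "l0": {"seed": 0},
            "ipf": {"return_na": True},
        },
    }
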
Orchestrator (paper-l0/benchmarking/run_benchmark_suite.py)
  Exports each manifest, runs every method declared in it, and (when IPF is in
  the manifest) schedules matched-input rows for the other methods via
  --train-on ipf_retained_authored --score-on ipf_retained_authored. Aggregates
  per-manifest summaries into one tier_<n>_summary.csv per tier plus a unified
  suite_summary.csv. Failures (export-time IPFConversionError, runner non-zero
  exits, missing summary files) are recorded as status=failed rows with the
  captured reason in notes; the orchestrator never aborts the suite, which is
  load-bearing for Tier 3's "completed-or-failed" reporting story.

  Tier 2 rungs are discovered and ordered by target_filters.max_targets
  (smallest first; uncapped rungs sort last) so the summary reads top-to-bottom
  in increasing target count regardless of filename sort order.

  Both run_benchmark_suite.py and benchmark_cli.py prepend the in-tree repo
  root to sys.path before importing policyengine_us_data. This avoids being
  shadowed by an editable install of the same package name pointing at a
  sibling repo, which previously caused fit_l0_weights() to lose the seed
  parameter at script-invocation time only (not when imported from another
  module).

IPF runner: returnNA=FALSE design task closed
  paper-l0/benchmarking/runners/ipf_runner.R now accepts an optional 11th
  positional argument return_na (default TRUE for backwards compatibility).
  When return_na is TRUE and surveysd::ipf returns any NaN weights -- the
  silent-non-convergence path called out in the BENCHMARK_PLAN's "Immediate
  next design tasks" -- the runner exits with a clear error rather than
  writing NaN weights. The Python CLI plumbs method_options.ipf.return_na
  through to the runner so manifests control the behavior declaratively.

Modal-artifact ingest (paper-l0/benchmarking/patch_package_paths.py)
  The Modal-built calibration_package.pkl records absolute paths from the
  build container (/pipeline/artifacts/<run_id>/...) for metadata.dataset_path
  and metadata.db_path. The new helper rewrites those paths in-place to point
  at local copies of the dataset H5 and policy_data.db so the rest of the
  benchmark scaffold (and especially benchmark_export.build_ipf_inputs, which
  has hard existence checks on both paths) runs unchanged on a local checkout.

Documentation (paper-l0/benchmarking/README.md)
  New "Paper-reported tiers" section: which manifests belong to which tier,
  what each method does at each rung, the run_benchmark_suite.py entry point
  with example invocations, and the per-tier summary CSV layout.

Verification done before committing
  Three end-to-end smoke runs were exercised against the production
  calibration package (17,736 targets x 5,159,570 cloned units, n_clones=430)
  to verify the full pipeline:

    1. count-only smoke (29 targets, 100 epochs L0): all five rows green --
       L0 + GREG full-info, IPF on retained-authored subset, plus matched L0
       and GREG runs trained and scored on the IPF subset.
    2. mixed-target smoke (count + dollar combined): L0 and GREG both
       completed end-to-end; this exercises the Tier 1 / Tier 2 mixed-target
       path that count_like_only=true cannot reach.
    3. forced non-convergence (max_iter=5, epsP=1e-20): surveysd::ipf returned
       NaN weights, the runner stopped with the new "did not converge"
       message, and the orchestrator recorded status=failed with the full R
       traceback in notes.

  The smoke manifests, scratch run dirs, and download scratch were removed
  from the repo before committing. The plans (paper-l0/BENCHMARK_PLAN.md and
  outline.md) are intentionally not included in this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>