L0 regularization paper (WIP) #685

Draft
juaristi22 wants to merge 9 commits into main from maria/l0-paper

Conversation

@juaristi22
Collaborator

No description provided.

juaristi22 marked this pull request as draft on April 3, 2026, 05:50
Collaborator

@baogorek left a comment


First round of feedback!

  1. I think Author ordering is perfect
  2. Paragraphs are one long line in a text editor. Hey, we've all got text wrap. But I would recommend having an AI split the lines up very loosely, to roughly 100-120 characters. If you notice, all my comments show up at the beginning of the paragraph, because the paragraph is the line.
  3. Other methods you may want to consider for the background:
    a) Covariate Balancing in Causal Inference
    b) Small Area Estimation
  4. We've got to figure out how to talk about states + DC and districts
  5. I'm less sure after reading that GREG and IPF are your "competitors." Let's chat.

Comment thread paper-l0/sections/abstract.tex Outdated
\begin{abstract}
Tax-benefit microsimulation models typically operate at the national level, using household survey weights calibrated to aggregate population targets. Subnational analysis---at the level of states, congressional districts, or local authorities---requires datasets that simultaneously satisfy geographic distributional constraints while preserving household-level detail. We present a method based on $L_0$ regularization that jointly optimizes survey weight magnitudes and sparsity to produce calibrated subnational microsimulation datasets.

Our approach builds on the Hard Concrete distribution \citep{louizos2018}, which induces exact sparsity by multiplying each household's weight by a learned stochastic gate that collapses to a deterministic zero or one at inference time. We parameterize each gate with a log-alpha and temperature parameter, and jointly optimize these alongside log-transformed weight magnitudes using a single loss function combining scale-invariant relative calibration error, an $L_0$ sparsity penalty on the expected count of active households, and a light $L_2$ regularizer on weight magnitudes.
Collaborator

This is good. There is one thing that we do differently from Louizos et al that I want to talk to you about in person.
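For concreteness, a minimal PyTorch sketch of the gate-and-loss mechanics described in the abstract paragraph above, on toy data and with illustrative parameter names (this is the standard Hard Concrete construction, not the l0-python API):

    import torch

    torch.manual_seed(0)
    n_households, n_targets = 1000, 20
    M = torch.rand(n_targets, n_households)                # toy calibration matrix
    targets = M @ torch.full((n_households,), 2.0)         # toy, achievable targets
    init_weights = torch.full((n_households,), 2.0)

    gamma, zeta, beta = -0.1, 1.1, 0.35                    # stretch limits and temperature
    lambda_l0, lambda_l2 = 1e-4, 1e-6
    log_alpha = torch.full((n_households,), 3.0, requires_grad=True)  # gates start mostly open
    log_w = torch.log(init_weights).clone().requires_grad_(True)      # log weight magnitudes

    def sample_gates(log_alpha, training=True):
        if training:
            u = torch.rand_like(log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + log_alpha) / beta)  # Concrete sample
        else:
            s = torch.sigmoid(log_alpha)                                     # deterministic at inference
        return (s * (zeta - gamma) + gamma).clamp(0.0, 1.0)                  # stretch, then clip to exact 0/1

    opt = torch.optim.Adam([log_alpha, log_w], lr=0.05)
    for _ in range(200):
        z = sample_gates(log_alpha)
        weights = z * log_w.exp()                                            # gated, positive weights
        rel_err = ((M @ weights - targets) / targets).pow(2).mean()          # scale-invariant relative error
        expected_active = torch.sigmoid(
            log_alpha - beta * torch.log(torch.tensor(-gamma / zeta))).sum() # expected count of open gates
        loss = rel_err + lambda_l0 * expected_active + lambda_l2 * log_w.exp().pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()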

Comment thread paper-l0/sections/abstract.tex Outdated


The pipeline begins with the US Current Population Survey. Each household record is cloned multiple times and assigned to random census blocks drawn from a population-weighted distribution. Program participation indicators are re-randomized per geographic assignment using local take-up rates. Each clone is then run through \policyengine{}'s tax-benefit microsimulation engine to generate geography-specific outputs. The $L_0$ optimizer selects which household-geography combinations to retain, calibrating simultaneously against approximately 37,800 targets across three geographic levels. The sparsity penalty is configurable: a higher penalty produces a compact national dataset of approximately 50,000 records, while a lower penalty yields a larger dataset of approximately 3--4 million records covering all 436 congressional districts and 50 states individually. The method is implemented as the open-source \texttt{l0-python} PyTorch package.
Collaborator

Paragraph feels less like an abstract and more like a body paragraph. I know it's still a WIP, but really the abstract needs to have cold, hard stats in it that show performance. (We'll get them.)

Collaborator

Uggh, one other thing we have to tackle: technically there are not 436 congressional districts. I was about to tell you to use 51 "state equivalents," but that's not exactly correct either. A TODO.

Comment thread paper-l0/sections/introduction.tex Outdated
\section{Introduction}
\label{sec:introduction}

Microsimulation models estimate the effects of tax and benefit policies on households by applying program rules to individual-level microdata. Most operational models---including those maintained by the Congressional Budget Office \citep{cbo2018}, the Joint Committee on Taxation \citep{jct2023}, and the Tax Policy Center \citep{tpc2024}---operate at the national level. They calibrate household survey weights to aggregate administrative totals such as total income tax revenue, program enrollment counts, and demographic benchmarks, then use the reweighted dataset to simulate policy reforms.
Collaborator

"They calibrate" -> do we know that for sure?

Comment thread paper-l0/sections/introduction.tex Outdated

Subnational policy analysis introduces a fundamentally different calibration challenge. Rather than matching a single set of national aggregates, the microdata must simultaneously reproduce distributional statistics at multiple geographic levels: congressional districts, states, and the nation as a whole. A dataset calibrated for the state of California must match California-specific IRS income totals, SNAP participation counts, Medicaid enrollment, and age distributions, while remaining consistent with national budget projections from the CBO and tax expenditure estimates from the JCT. Across 436 congressional districts and 50 states, this produces approximately 37,800 simultaneous calibration targets.

Existing calibration methods scale poorly to this setting. Iterative proportional fitting \citep[IPF;][]{deming1940, ireland1968} adjusts weights along one dimension at a time, cycling through marginal constraints until convergence. IPF handles cross-classified tables but does not naturally accommodate hierarchical geographic constraints---district targets must sum to state targets, which must sum to national targets---without ad hoc post-processing. Generalized regression (GREG) estimators \citep{deville1992, sarndal2007} solve a constrained optimization problem that minimizes distance from initial weights subject to exact calibration constraints. GREG produces a closed-form solution for moderate numbers of constraints but becomes computationally intractable and numerically unstable as the constraint count approaches the tens of thousands.
Collaborator

IPF is also just for categorical variables (which you do indicate with "cross-classified tables," but it may make sense to call that out explicitly), and in classical raking there is no "almost." You either get the calibration perfect or it failed to converge. You may want to comment that this is also called "raking." There is a YouTube video called something like "generalized raking" where the presenter talks about relaxing that.

My understanding is that GREG too seeks to exactly match the targets and it's a failure if it can't (not just positive loss), but check me on that. GREG does permit quantitative, in addition to qualitative variables. But it will happily use negative weights, which is annoying.
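On the raking point, a minimal sketch of classical IPF on two categorical margins may be useful to cite or adapt; written in plain Python rather than any particular survey package, it either matches both margins exactly or simply never converges:

    import numpy as np

    # Toy microdata: each unit has two categorical attributes and a design weight.
    rng = np.random.default_rng(0)
    n = 500
    region = rng.integers(0, 3, n)                     # three regions
    tenure = rng.integers(0, 2, n)                     # two tenure classes
    w = np.ones(n)

    region_targets = np.array([180.0, 160.0, 160.0])   # both margins must imply the same total,
    tenure_targets = np.array([300.0, 200.0])          # otherwise raking cannot converge

    for _ in range(100):
        for cat, margin in ((region, region_targets), (tenure, tenure_targets)):
            current = np.bincount(cat, weights=w, minlength=len(margin))
            w *= (margin / current)[cat]               # proportional adjustment along one margin
        fitted = np.bincount(region, weights=w, minlength=3)
        if np.allclose(fitted, region_targets, rtol=1e-8):
            break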

Collaborator

"becomes computationally intractable and numerically unstable as the constraint count approaches the tens of thousands." A source would be helpful here. Is this because, as I suggested above, the procedure simply fails when it can't match the equations?

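On the GREG question, the chi-square (linear) calibration estimator has a closed form that hits the targets exactly whenever the linear system is solvable, and nothing in it keeps individual weights non-negative. A generic textbook sketch, not the benchmark's GREG runner:

    import numpy as np

    rng = np.random.default_rng(1)
    n, k = 1000, 5
    X = rng.random((n, k))                   # auxiliary variables per unit
    d = np.full(n, 2.0)                      # initial design weights
    t = X.T @ d * rng.uniform(0.7, 1.3, k)   # perturbed calibration targets

    # Chi-square distance: w = d + diag(d) X (X' diag(d) X)^{-1} (t - X' d)
    lam = np.linalg.solve(X.T @ (d[:, None] * X), t - X.T @ d)
    w = d * (1.0 + X @ lam)

    assert np.allclose(X.T @ w, t)           # constraints hold exactly
    print("share of negative weights:", np.mean(w < 0))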
Comment thread paper-l0/sections/introduction.tex Outdated


Spatial microsimulation methods take a different approach, often distinguishing between reweighting methods and synthetic reconstruction methods for constructing small-area microdata \citep{tanton2014review}. Within this broader literature, researchers have used combinatorial optimization and simulated annealing \citep{williamson1998, huang2001, harland2012} as well as deterministic reweighting \citep{tanton2011, lovelace2016}. These methods typically operate at a single geographic level and require separate calibration runs for each area, making joint multi-level calibration difficult.
Collaborator

So even though you have "different" in the first sentence, you've switched from classical statistics to the microsimulation literature, which could be jarring. I think flowing through the different conceptual frameworks is going to require some artistic thinking here.

@@ -0,0 +1,158 @@
\section{Methodology}
\label{sec:methodology}

Collaborator

Pure reader note: By the time we've gotten to methodology, there's been a lot of methodology!

Comment thread paper-l0/sections/methodology.tex Outdated

\subsubsection{Block sampling}

Census blocks are the finest geographic unit in the decennial census. Each block maps deterministically to a congressional district, county, tract, and state. The sampling distribution $P_{\text{pop}}(\text{block})$ is proportional to the block's share of the national population. Drawing blocks rather than congressional districts ensures fine-grained geographic variation within districts and enables derivation of county-level variables (Section~\ref{sec:stage4}).
Collaborator

And of course, now we've got adjusted gross income in the formula

Collaborator

Ohhh, I see that's mentioned below. Hmm, I wonder if it's best to just show the formula all at once.

P_{AGI}(b) reads a bit strange since that would imply that AGI is the random variable. I think we want a multinomial distribution over all census blocks, and I do think it would be best to define that multinomial distribution directly.
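Along the lines of defining the multinomial directly, a minimal sketch of population-weighted block draws; the block table, column names, and IDs here are purely illustrative, and any AGI adjustment would modify the probabilities before sampling:

    import numpy as np
    import pandas as pd

    # Hypothetical block table: one row per census block with its resident population.
    blocks = pd.DataFrame({
        "block_geoid": ["060710001001000", "481130002002001", "360470003003002"],
        "state_fips": [6, 48, 36],
        "district": ["601", "4830", "3607"],
        "population": [1200, 800, 2000],
    })

    # Multinomial probabilities P_pop(block) proportional to population share.
    p = (blocks["population"] / blocks["population"].sum()).to_numpy()

    rng = np.random.default_rng(42)
    n_clones = 430
    draws = rng.choice(len(blocks), size=n_clones, replace=True, p=p)   # one block per clone
    assignments = blocks.iloc[draws][["block_geoid", "state_fips", "district"]].reset_index(drop=True)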

Comment thread paper-l0/sections/methodology.tex Outdated

\subsubsection{Per-state parallel simulation}

The matrix is populated by running each household through \policyengine{}'s tax-benefit microsimulation engine. Because many target variables depend on state-specific tax and benefit rules, a separate simulation is required for each state. A parallel dispatcher sends one job per unique state FIPS code to a pool of worker processes. Each worker creates a fresh \texttt{Microsimulation} instance, overwrites every household's \texttt{state\_fips} with the target state, invalidates cached downstream variables, and calculates all target variables at the household and person levels, accounting for differences in state legislation.
Collaborator

ACA PTC has county-specific rules, and eventually other policies are coming that will operate at lower geographic levels. Though admittedly, in the run I just ran, I'm only using the state level. A plug here for #598, which has been getting buried. We've got to figure out how to set the geographic level. Oh, that's right, Max had the idea for that and it's documented in that issue.
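For readers unfamiliar with the dispatcher pattern, a schematic of the per-state fan-out might look like the sketch below; simulate_state is a hypothetical stand-in for the real worker, which builds a fresh Microsimulation, overwrites state_fips, invalidates cached variables, and recomputes the target variables:

    from concurrent.futures import ProcessPoolExecutor

    STATE_FIPS = [1, 2, 4, 5, 6]   # one entry per unique state FIPS code in the real run

    def simulate_state(fips: int) -> dict:
        """Hypothetical worker: rerun the engine with every household moved to one state."""
        # In the real pipeline this constructs a Microsimulation, sets state_fips for all
        # households, invalidates downstream caches, and returns state-specific outputs.
        return {"state_fips": fips, "columns": None}

    if __name__ == "__main__":
        with ProcessPoolExecutor(max_workers=8) as pool:
            results = list(pool.map(simulate_state, STATE_FIPS))   # one job per state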

Comment thread paper-l0/sections/methodology.tex Outdated

\subsubsection{Hyperparameters}

Table~\ref{tab:hyperparameters} lists the optimization hyperparameters with their values and roles. The stretch parameters $\gamma = -0.1$ and $\zeta = 1.1$ follow the original Hard Concrete paper, placing approximately 9\% of the sigmoid's mass below 0 and above 1, which is what allows clipping to produce exact zeros and ones.
Collaborator

Note that we've kept most of Louizos's parameters. I have been setting beta back to .67, which was what they used, but honestly I don't have a good reason, and .35 works as well. The initial probability is .999, and that's just from trial and error. I'd set it to 1 if I could (it blows up). It's intended to be the proportion of gates that are open at the start. I always saw worse performance when I started with some gates closed. If they start open, they will close (eventually) though it will take more epochs. I know this is a "trust me bro" style of argument.
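One way to make the 0.999 starting point concrete: under the Hard Concrete penalty, the probability that a gate is active is sigmoid(log_alpha - beta * log(-gamma / zeta)), so the initial log_alpha can be solved from the desired open probability. A small sketch (not necessarily how l0-python initializes it):

    import math

    gamma, zeta, beta = -0.1, 1.1, 0.35
    p_open = 0.999                 # desired initial probability that a gate is active

    # Invert P(gate active) = sigmoid(log_alpha - beta * log(-gamma / zeta)).
    log_alpha_init = math.log(p_open / (1 - p_open)) + beta * math.log(-gamma / zeta)
    print(log_alpha_init)          # about 6.07; p_open = 1 would need +inf, hence "it blows up"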

Comment thread paper-l0/sections/results.tex Outdated

\subsubsection{Unachievable targets}

Of the approximately 37,800 targets, \tbc[count] are marked unachievable (row sum zero in the calibration matrix). These correspond to congressional districts where no clones carry nonzero values for the target variable. Increasing the clone count from 430 reduces the number of unachievable targets, at the cost of a larger calibration matrix.
Collaborator

I know 37,800 is just a placeholder. This is inflated by the fact that multiple years of the same target can be in the database, etc.
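The row-sum-zero check itself is a one-liner against the sparse calibration matrix; a sketch with illustrative names:

    import numpy as np
    from scipy import sparse

    # M: targets x household-clone calibration matrix, typically very sparse.
    M = sparse.random(100, 5000, density=0.01, format="csr", random_state=0)

    row_sums = np.asarray(M.sum(axis=1)).ravel()
    unachievable = np.flatnonzero(row_sums == 0)     # no clone contributes to these targets
    print(f"{unachievable.size} of {M.shape[0]} targets are unachievable")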

juaristi22 force-pushed the maria/l0-paper branch 2 times, most recently from b200851 to 6de667d on April 23, 2026, 16:49
juaristi22 and others added 9 commits April 30, 2026 16:12
Paper: "L0 regularization for subnational microsimulation calibration"
targeting the International Journal of Microsimulation.

Current state of the manuscript:
- Full paper structure: abstract, introduction, background, data,
  methodology, results, discussion, conclusion, appendix
- Formal survey calibration problem definition with GREG and IPF
  explained in depth, including benefits, drawbacks, and current
  practice in operational models (CBO, JCT, TPC, EUROMOD, TAXSIM)
- Four-stage pipeline methodology (clone, matrix, L0 optimize,
  assemble) documented against the pipeline source code
- Detailed appendix target tables populated from policy_data.db
  (37,758 targets: 33,572 district, 4,080 state, 106 national)
- All writing in US English, citations linked via plainnat/natbib

Still TODO:
- [ ] Implement IPF and GREG baselines on the same calibration matrix
      to populate the comparison table (tables/comparison.tex)
- [ ] Run calibration experiments and fill in all [TBC] placeholders
      in the results section (accuracy, sparsity, convergence, ESS)
- [ ] Generate convergence curve figure from calibration_log.csv
- [ ] Select and run a subnational policy application example
      (Section 5.5 — candidate: EITC expansion across CDs)
- [ ] Review pipeline methodology section against latest code for
      accuracy (clone-and-assign, matrix builder, assembly steps)
- [ ] Review and deepen background section: verify claims about
      GREG/IPF limitations, add any missing related work
- [ ] Resolve pre-existing overfull hbox warnings (long URLs in
      conclusion, hyperparameters table width)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Introduce a paper benchmarking scaffold that compares L0 and GREG on the same exported calibration matrix while routing IPF through a separate automatic preprocessing step that reconstructs IPF-ready unit and target inputs from the saved package metadata. The scaffold includes two R runners, manifest-driven bundle export, common scoring against the shared matrix, environment setup helpers, and end-to-end tests for the runner schemas.
Replace the old order-dependent sequential IPF flow with a validated one-call path. The converter now keeps only closed categorical systems, derives complements only from authored parent totals, reports explicit drop reasons for non-runnable target families, and surfaces incompatible-total failures through structured diagnostics instead of silently chaining margin blocks.

Update the benchmark harness to export retained-authored IPF scoring artifacts, validate the external-IPF input contract, and let L0/GREG opt into scoring on the same retained-authored subset for apples-to-apples comparisons. Refresh the walkthrough notebook, README, example manifest, and add regression coverage for closure logic, runner behavior, export validation, and the new scoring contract.
Methodological
--------------
- Add `--train-on {shared_requested,ipf_retained_authored}` to
  benchmark_cli so L0 and GREG can be fit on the same target subset IPF
  was given. Matched runs write `{method}_matched_summary.json` so they
  don't overwrite the full-info `{method}_summary.json`. Both runs record
  `training_target_set` and `scoring_target_set` in the summary.
- Add `seed` parameter to `fit_l0_weights` (seeds torch, CUDA when
  present, numpy). Default None preserves prior non-deterministic
  behavior. The benchmark CLI plumbs `seed` from `method_options.l0` so
  paper benchmark runs are reproducible.
- Align IPF `bound` default in benchmark_cli to 10.0 to match the example
  manifest; the prior 4.0 was too tight for real district-level fits.
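A minimal sketch of the seeding behavior described in the second bullet above (torch, CUDA when present, numpy); this is not the actual fit_l0_weights signature, just the effect of passing seed:

    import numpy as np
    import torch

    def seed_everything(seed):
        """Make an L0 fit reproducible; None preserves the prior non-deterministic behavior."""
        if seed is None:
            return
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
        np.random.seed(seed)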

Correctness
-----------
- Fix `is_authored` dedupe key in `_try_close_binary_subset` — collapsing
  the dict on `(geo, cell)` is correct; the prior key allowed an authored
  + derived duplicate to slip through. Add an assertion for the genuine
  bug case where the same cell would arrive with conflicting authoring.
- Fail loudly with `IPFConversionError` (and a diagnostics file) when the
  package's `targets_df` lacks `target_id`. The previous silent skip left
  downstream `--score-on ipf_retained_authored` to surface a misleading
  FileNotFoundError.
- Wrap converter `ValueError` / `FileNotFoundError` raise sites in
  `IPFConversionError` so `ipf_conversion_diagnostics.json` is always
  written before export aborts.

Tests added
-----------
- ipf_conversion: `negative_derived_complement` guardrail,
  `incompatible_totals` consistency check, `unsupported_partial_margin`
  3+ category subset case.
- benchmarking_runners: seed wire-up (CLI → fit_l0_weights), `--train-on`
  retained-subset routing + fail-when-subset-missing, one-call IPF on a
  coherent person + household problem, derived-complement cell fitted
  through surveysd::ipf.
- benchmark_export: missing-`target_id` fail-loud path with diagnostics.

Documentation
-------------
- README: matched two-fit recipe, L0 seed setting, GREG-linear's
  negative-weight regime (with `negative_weight_share` as the surfaced
  metric), strictness contract, open-question 1 resolution.
- 02_ipf_walkthrough.ipynb: new section 10g demonstrating binary-subset
  closure via an authored parent total (the safe counterpart to the
  dropped open-subset case); one-call statement tightened to "one
  exported IPF run equals one coherent closed categorical system."

Still deferred
--------------
- Entity-aware extension for `tax_unit_count` / `spm_unit_count`.
- `survey:::grake` private-namespace migration / R package pinning.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the deferred-future framing ("`tax_unit_count` and `spm_unit_count`
remain outside the core household/person IPF path *in this pass*") with the
methodological reason: `surveysd::ipf` supports exactly two entity levels
natively (`conP` for row-level / person-style constraints, `conH` for
constraints aggregated by `hid`), and there is no generalised mechanism
for additional counted entities such as `tax_unit` or `spm_unit`.

Producing a single weight vector that simultaneously satisfies targets at
three or more entity levels is not possible with classical IPF. Running
IPF separately per scope and aggregating would give it more degrees of
freedom than L0 / GREG (each pass solves a smaller subproblem with its
own freedom) and would not be a like-for-like comparison. The benchmark
therefore restricts IPF to `person_count` and `household_count` —
together or alone — and drops other count families at the count check.
Those targets remain in the shared sparse system that L0 and GREG fit,
so the cross-method comparison on the IPF-feasible subset stays
apples-to-apples via `--score-on ipf_retained_authored`.

Updated the README's methodology section and the IPF inputs subsection
to articulate the constraint, and tightened the `_target_scope` error
message in `ipf_conversion.py` to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tor, fixes

This commit lands the production scaffolding called for by paper-l0/BENCHMARK_PLAN.md
so the L0 / GREG / IPF benchmark can be launched end-to-end against the saved
calibration package without further code changes.

Tier manifests (paper-l0/benchmarking/manifests/tier*.json)
  All ten paper-reported manifests, each reading from one shared
  calibration_package.pkl and differing only in target_filters:

    tier1_mixed.json            L0 + GREG on the full filtered slice (count +
                                dollar combined) over a five-state, ten-district
                                geography subset plus national targets.
    tier1_ipf.json              L0 + IPF on the same slice; IPF retains the
                                authored closed subset and the matched runs use
                                --train-on ipf_retained_authored.
    tier2_scaling_250 ...
    tier2_scaling_largest.json  Scaling ladder (250 / 500 / 1k / 2.5k / 5k / 10k
                                / largest coherent pre-production subset) over
                                the same package, growing geography coverage to
                                grow the target set instead of the unit universe.
    tier3_production.json       Least-filtered view; failures are reportable
                                results (status=failed with notes) rather than
                                missing rows.

  All tier manifests opt into method_options.ipf.return_na = true so non-
  convergence surfaces NaN and the runner converts that into a visible runtime
  error instead of writing NaN-laden weights to disk.

  Geographic ID format is the convention the saved package actually uses: state
  FIPS without leading zeros for single-digit states (e.g. CA = "6", CA-01 =
  "601"). The two pre-existing example manifests are updated to match.

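A hypothetical sketch of a manifest's shape, using only field names mentioned in these commit messages (target_filters, max_targets, method_options.l0.seed, method_options.ipf.return_na, and the FIPS-style geography IDs); the real JSON files may nest or name things differently:

    manifest = {
        "target_filters": {
            "max_targets": 500,
            "geographies": ["6", "601"],       # CA and CA-01 in the package's ID convention
        },
        "method_options": {
            "l0": {"seed": 0},
            "ipf": {"return_na": True},
        },
    }
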
Orchestrator (paper-l0/benchmarking/run_benchmark_suite.py)
  Exports each manifest, runs every method declared in it, and (when IPF is in
  the manifest) schedules matched-input rows for the other methods via
  --train-on ipf_retained_authored --score-on ipf_retained_authored. Aggregates
  per-manifest summaries into one tier_<n>_summary.csv per tier plus a unified
  suite_summary.csv. Failures (export-time IPFConversionError, runner non-zero
  exits, missing summary files) are recorded as status=failed rows with the
  captured reason in notes; the orchestrator never aborts the suite, which is
  load-bearing for Tier 3's "completed-or-failed" reporting story.

  Tier 2 rungs are discovered and ordered by target_filters.max_targets
  (smallest first; uncapped rungs sort last) so the summary reads top-to-bottom
  in increasing target count regardless of filename sort order.

  Both run_benchmark_suite.py and benchmark_cli.py prepend the in-tree repo
  root to sys.path before importing policyengine_us_data. This avoids being
  shadowed by an editable install of the same package name pointing at a
  sibling repo, which previously caused fit_l0_weights() to lose the seed
  parameter at script-invocation time only (not when imported from another
  module).

IPF runner: returnNA=FALSE design task closed
  paper-l0/benchmarking/runners/ipf_runner.R now accepts an optional 11th
  positional argument return_na (default TRUE for backwards compatibility).
  When return_na is TRUE and surveysd::ipf returns any NaN weights -- the
  silent-non-convergence path called out in the BENCHMARK_PLAN's "Immediate
  next design tasks" -- the runner exits with a clear error rather than
  writing NaN weights. The Python CLI plumbs method_options.ipf.return_na
  through to the runner so manifests control the behavior declaratively.

Modal-artifact ingest (paper-l0/benchmarking/patch_package_paths.py)
  The Modal-built calibration_package.pkl records absolute paths from the
  build container (/pipeline/artifacts/<run_id>/...) for metadata.dataset_path
  and metadata.db_path. The new helper rewrites those paths in-place to point
  at local copies of the dataset H5 and policy_data.db so the rest of the
  benchmark scaffold (and especially benchmark_export.build_ipf_inputs, which
  has hard existence checks on both paths) runs unchanged on a local checkout.

Documentation (paper-l0/benchmarking/README.md)
  New "Paper-reported tiers" section: which manifests belong to which tier,
  what each method does at each rung, the run_benchmark_suite.py entry point
  with example invocations, and the per-tier summary CSV layout.

Verification done before committing
  Three end-to-end smoke runs were exercised against the production
  calibration package (17,736 targets x 5,159,570 cloned units, n_clones=430)
  to verify the full pipeline:

    1. count-only smoke (29 targets, 100 epochs L0): all five rows green --
       L0 + GREG full-info, IPF on retained-authored subset, plus matched L0
       and GREG runs trained and scored on the IPF subset.
    2. mixed-target smoke (count + dollar combined): L0 and GREG both
       completed end-to-end; this exercises the Tier 1 / Tier 2 mixed-target
       path that count_like_only=true cannot reach.
    3. forced non-convergence (max_iter=5, epsP=1e-20): surveysd::ipf returned
       NaN weights, the runner stopped with the new "did not converge"
       message, and the orchestrator recorded status=failed with the full R
       traceback in notes.

  The smoke manifests, scratch run dirs, and download scratch were removed
  from the repo before committing. The plans (paper-l0/BENCHMARK_PLAN.md and
  outline.md) are intentionally not included in this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>