Add main residence value to LA calibration #371
Generalises targets/sources/mhclg_regional_land.py to local-authority level. Each LA's share of national household land is proportional to households x avg_house_price, scaled to the ONS National Balance Sheet household-land series.

Inputs (all already used elsewhere in the repo):
- storage/la_land_values.csv: 360 LAs with households (from the existing local_authority_weights.h5 matrix) and avg_house_price (HM Land Registry UK HPI Dec 2025).
- _land.HOUSEHOLD_LAND_VALUES for the national anchor.

Tests cover CSV data quality, share/target aggregation, sensible ordering (K&C > Blackpool by >3x, London boroughs in top quintile), and registry integration.

Updates test_regional_land_value_targets.py to filter by GeographicLevel.REGION now that LA targets share the same name prefix.

Closes #370

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
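The apportionment described in this first commit can be sketched as follows. This is a minimal illustration, not the code in la_land.py: the function name, the LA labels, and every number are made up for the example.

```python
def apportion_national_total(las, national_total):
    """Split national_total across LAs in proportion to households * avg_house_price.

    `las` maps an LA code to a (households, avg_house_price) pair.
    Hypothetical helper for illustration only.
    """
    weights = {code: hh * price for code, (hh, price) in las.items()}
    total_weight = sum(weights.values())
    return {code: national_total * w / total_weight for code, w in weights.items()}


# Made-up inputs: (households, average house price in GBP).
example_las = {
    "LA_A": (60_000, 1_200_000.0),
    "LA_B": (65_000, 130_000.0),
    "LA_C": (1_000, 400_000.0),
}

# Made-up national anchor of GBP 1,000bn.
example_targets = apportion_national_total(example_las, national_total=1_000e9)
```

Because every LA target is a share of the same total, one bad households value in a single row shifts every other LA's target, which is the failure mode the review below picks up.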
|
Blocker: data bug in Impact: IoS alone absorbs 8.6 % of the national household share ( Quick verification: Looks like a UK-HPI 'national-total-as-fallback' path leaked into one LA row. Likely two lines to fix:
Happy to approve once that's in. The methodology itself is sound — mirrors |
The E06000053 row carried households=2,492,115 (roughly the South West region total) from an upstream fallback that fired during CSV generation. Real IoS has ~1,115 households per ONS mid-2023. With the bug, IoS absorbed 7.85% of the national property-wealth share, understating every other LA's 2024 target by ~8.5% (e.g. K&C moved from £42.6bn to £46.2bn after the fix).

Two new tests prevent the regression:
- test_households_within_plausible_range: bounds every LA to [500, 500_000] so any future 10x+ outlier fails immediately.
- test_isles_of_scilly_households_are_thousands_not_millions: tight [500, 5_000] bound on the specific row that leaked.

Methodology unchanged; LA targets still sum to the ONS national household-land series within 1e-6.
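The two bounds tests named above could look roughly like this. It is a sketch under assumptions: the helper name and the second row (LA_EXAMPLE, 80,000 households) are hypothetical, while the E06000053 values and the bounds come from the commit message.

```python
def find_household_outliers(rows, lower=500, upper=500_000):
    """Return the codes whose household count falls outside [lower, upper]."""
    return [code for code, households in rows if not (lower <= households <= upper)]


# The leaked value from the commit message, plus one made-up plausible row.
buggy_rows = [("E06000053", 2_492_115), ("LA_EXAMPLE", 80_000)]
fixed_rows = [("E06000053", 1_115), ("LA_EXAMPLE", 80_000)]

# General bound applied to every LA.
buggy_outliers = find_household_outliers(buggy_rows)
fixed_outliers = find_household_outliers(fixed_rows)

# Tighter bound for the specific row that leaked.
scilly_outliers = find_household_outliers([("E06000053", 1_115)], lower=500, upper=5_000)
```

The general bound catches any order-of-magnitude leak, while the tight bound guards the one row known to have gone wrong.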
|
@MaxGhenis thanks — fixed in 3ed729c. Data fix
Quantified impact of the fix
Tests added
Full suite: 20/20 pass locally via

Generation-path note: the 2,492,115 figure matches the South West regional household total, so the fallback that fired during CSV generation was a regional sum, not "national-avg" as the PR body suggested. I'll correct the PR description; worth flagging for whoever regenerates the CSV next. |
The targets added in the previous commits were registered but inert —
datasets/local_areas/local_authorities/loss.py never built a column for
them, so the LA reweighter could not see them. This adds the
ons/household_land_value column to the LA target matrix:
- matrix entry: per-household household_land_value (from policyengine-uk).
- y entry: 360-vector of per-LA targets at the calibration year, taken
from la_land._compute_la_targets and reordered to match
local_authorities_2021.csv so the country mask and target indices
agree at every position.
The year is selected from time_period; if it is outside
HOUSEHOLD_LAND_VALUES (defined for 2021–2026) the latest known year is
used as a fallback.
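The two mechanics in this commit message (reordering the target dict to match the LA code list, and falling back to the latest known year) can be sketched as below. Both function names are hypothetical and the codes and values are made up; only the 2021-2026 range and the fallback rule come from the text above.

```python
def align_targets(targets_by_code, la_codes):
    """Return target values in la_codes order so positions agree with the mask."""
    missing = [code for code in la_codes if code not in targets_by_code]
    if missing:
        raise KeyError(f"no target for LA codes: {missing}")
    return [targets_by_code[code] for code in la_codes]


def select_year(time_period, known_years):
    """Use the requested year if known, otherwise the latest known year."""
    year = int(time_period)
    known = sorted(known_years)
    return year if year in known else known[-1]


# Made-up dict deliberately stored in a different order from la_codes.
example_targets = {"LA_B": 2.0, "LA_A": 1.0, "LA_C": 3.0}
example_codes = ["LA_A", "LA_B", "LA_C"]
aligned = align_targets(example_targets, example_codes)

known_years = range(2021, 2027)  # 2021-2026, as stated in the commit message
in_range = select_year(2024, known_years)
out_of_range = select_year(2030, known_years)
```

Raising on a missing code, rather than silently skipping it, keeps the vector at the expected length so a positional mismatch cannot go unnoticed.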
New tests in test_la_loss_land_value.py cover both layers:
- target dict ↔ la_codes ordering, finite-positive vector, sum-to-
national for 2024/2025/2026 (no Microsimulation needed).
- full create_local_authority_target_matrix build (gated on the
enhanced FRS fixture): column presence, length 360, sum-to-national
for the calibration year, ordering matches la_codes, all positive,
and matrix column equals sim.calculate("household_land_value").
Closes the "out of scope" follow-up flagged in the original PR body.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@MaxGhenis — for context, here is the full set of LA-level targets the reweighter trains on after this PR (from
All targets follow the same shape — |
…onment)
Replaces the imputed land-value target with a main-residence-value
target built from observed LA-level inputs, mirroring the existing
private-rent block:
target_la = avg_house_price_la × ownership_share_la × n_households_la
(HMLR HPI) × (English Housing Survey) × (Census)
Per @MaxGhenis's standup note (28 Apr): minimise target manipulation by
calibrating on observable LA-level housing indicators rather than
apportioning a national ONS land-value total across LAs. The new
target uses the same shape as the rent target (median × share × count),
including the national-share fallback for LAs missing any input.
Changes:
- la_land.py: drop HOUSEHOLD_LAND_VALUES dependency; new
load_la_avg_prices() helper; _compute_la_targets() returns
observed-product £ per LA; targets renamed
housing/main_residence_value/{code}, source=hmlr.
- loss.py: replace the apportionment block with the rent-style
inline pattern (merge avg_price into tenure_merged, target =
price × ownership × households, na-fallback to
national_property × la_household_share).
- Tests: drop "sums to ONS national" assertions; assert per-LA
target equals observed product exactly. Layer-2 FRS-gated tests
updated to use main_residence_value column.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
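A minimal sketch of the rent-style pattern this commit describes: the per-LA target is the product of the three inputs, with the national-share fallback when any input is missing. The function name and all numbers are illustrative assumptions; the real block in loss.py works on merged DataFrame columns rather than scalars.

```python
import math


def main_residence_target(avg_price, ownership_share, households,
                          national_total, household_share):
    """price x ownership x households, or national x LA household share if any input is missing."""
    inputs = (avg_price, ownership_share, households)
    if any(v is None or math.isnan(v) for v in inputs):
        return national_total * household_share
    return avg_price * ownership_share * households


# Made-up LA with all three inputs present.
covered = main_residence_target(300_000.0, 0.6, 50_000,
                                national_total=1e12, household_share=0.002)

# Made-up LA with no ownership share (for example, outside the survey's coverage).
fallback = main_residence_target(250_000.0, None, 40_000,
                                 national_total=1e12, household_share=0.0015)
```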
|
The LA target now uses directly observed housing indicators ( Both go through the same Matrix variable changed from 72 passed, no regressions. Sanity check: K&C target £27.8bn vs Blackpool £5.1bn, ordering and level both look right. |
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four FRS-fixture-gated tests exercising properties the optimiser relies on:
- y has no NaN entries (NaN would propagate silently through the optimiser).
- Non-English LAs use the national-share fallback (positive, non-NaN values), since EHS coverage is England-only.
- matrix column has non-zero variance, so the new target carries calibration signal rather than being inert.
- Sum of English LA targets is in the same order of magnitude (0.5x-3x) as the implied initial English main-residence-value, so the calibrator can plausibly reach the target via reweighting rather than 100x weight inflation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
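Three of the properties listed in this commit can be checked with a few lines of plain Python. This is a simplified sketch, not the repository's tests: the function name is hypothetical, it compares all targets rather than English LAs only, and the inputs are made up.

```python
import math
from statistics import pvariance


def check_target_invariants(y, matrix_col, weights, low=0.5, high=3.0):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if any(math.isnan(v) for v in y):
        problems.append("y contains NaN")
    if pvariance(matrix_col) == 0:
        problems.append("matrix column has zero variance")
    # Weighted total implied by the starting weights, compared with the target sum.
    implied = sum(w * x for w, x in zip(weights, matrix_col))
    ratio = sum(y) / implied if implied else math.inf
    if not (low <= ratio <= high):
        problems.append("target sum is outside the reachable range")
    return problems


# Made-up example: implied total 15, target sum 30, ratio 2.0.
ok = check_target_invariants(y=[10.0, 20.0],
                             matrix_col=[1.0, 2.0, 0.0],
                             weights=[5.0, 5.0, 5.0])

# Made-up example with a constant (inert) matrix column.
inert = check_target_invariants(y=[30.0],
                                matrix_col=[1.0, 1.0],
                                weights=[10.0, 10.0])
```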
…age caveat

Per @MaxGhenis PR review: the target value is a constructed proxy (avg HMLR price × EHS ownership share × Census households), not a directly observed LA total of main residence value. The earlier PR description and code comments overstated this.

Substantive lineage gap that the docs now flag explicitly:
- Matrix col main_residence_value (policyengine-uk) is WAS-imputed household stock wealth, regionally uprated.
- Target uses HMLR UK HPI 'Average Price', a transaction-weighted geography-period price index, not an observed stock total of owner-occupied residences.
- Two different price concepts on the two sides of the constraint. The product is a defensible identity, but it is a derived proxy, not a direct benchmark.

Behaviour unchanged. This commit only updates the docstring in la_land.py and the comment in loss.py to call the target "derived proxy" rather than "directly observed".

A separate policy question (whether derived proxy targets should sit at full training weight alongside direct VOA/HMRC/ONS/DWP targets, or be soft-weighted) is being tracked separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Addressed the framing issue flagged in the review (commit
No behaviour change in this commit — only documentation/labelling, since you said "#371 is not necessarily wrong, but it should be explicitly treated as a derived/proxy calibration target, not described as direct." Open policy question that I'm not solving here: "If the standard is 'only calibrate direct official targets,' #371 should not be a hard training target as written." This applies repo-wide — |
|
Follow-up on the direct-target discussion: I pushed c330b44 to make housing/main_residence_value validation-only by default instead of a training target. The target is still useful as a proxy diagnostic, but HMLR average price x EHS ownership share x Census households is not a direct LA stock-value target and crosses source/concept boundaries. This also fixes the docstring so it no longer claims soft weighting unless the optimizer actually implements it.

Verification: uv run pytest policyengine_uk_data/tests/test_la_land_value_targets.py policyengine_uk_data/tests/test_la_loss_land_value.py -q; ruff check/format on touched files.
|
@MaxGhenis — nit on the test side. The new A behaviourally meaningful test would run a small calibration (or use the toy calibrator already in Happy to take a swing at it as a tiny follow-up if you want. Not blocking. |
|
Updated based on Max's target-standard call: nonconforming/proxy quantities should not live in the targets database or calibration target matrix, even as validation-only targets. Pushed Recommendation: close this PR rather than merge it. If we want a property-value diagnostic later, it should live outside the calibration target registry/matrix and be labelled as diagnostics, not as a target. |
|
@MaxGhenis you removed all the changes in this PR in ecfd6c3 — the diff against main is now empty. Do you want to close this PR? |
|
Closing per target-standard decision: this PR now has an empty diff after removing the nonconforming/proxy LA property-value target from the target registry and calibration matrix. If we add property-value diagnostics later, they should live outside the targets database/matrix. |
…/net (#374)

* Add LA-level council tax calibration targets

Two families of LA-level targets, covering all 360 LAs in local_authorities_2021.csv, built from four public sources:

- `ons/council_tax_band_d/{code}` (350 targets): average Band D council tax inclusive of all precepts per billing authority. Sources: MHCLG *Council Tax levels set by local authorities in England 2026-27*, Welsh Government *Council Tax levels April 2026 to March 2027*, Scottish Government *Council Tax Assumptions 2025*. All 296 English + 22 Welsh + 32 Scottish LAs covered.
- `ons/council_tax_band_count/{code}/{band}` (2,541 targets): number of dwellings per band A-H per LA. Source: VOA *Council Tax: Stock of Properties, 2025*. Covers England + Wales (318 LAs × ~8 bands, minus City of London Band A which is VOA-suppressed).

NI is excluded: domestic rates, not council tax. Scotland band counts are not in VOA; Scottish Assessors publishes them separately and is a follow-up.

Files
-----
- `storage/la_council_tax.csv` (31 KB, 360 rows): canonical CSV joining DLUHC Table 10 column 17, Welsh Table 1 "Overall average band D", Scottish Gov "CT by Band 2025-26" Band D column, and VOA CTSOP1.0 bands A-H onto the reference LA list.
- Post-2023 South Yorkshire E-codes (E08000038/39) re-mapped to pre-2023 codes (E08000016/19) to match the reference list.
- Scottish ampersand/double-space naming normalised ("Argyll & Bute" → "Argyll and Bute", etc.).
- `targets/sources/la_council_tax.py`: reads the CSV, emits Target objects at geographic_level=LOCAL_AUTHORITY with per-country year tagging and per-country reference URL.

Testing
-------
22 hermetic tests (no network access, no baseline fixture needed):

Structure
- Row count matches local_authorities_2021.csv.
- Every expected column present.
- Four UK country codes represented.
- Every LA code matches the reference list.

Value plausibility (the #371 lesson)
- Band D amount in [£900, £3,500] for every row with a value.
- Total dwellings in [200, 800,000] for every row with a value.
- Explicit Isles of Scilly regression test: total dwellings in [500, 5,000], not the 2.49M outlier that slipped into #371.
- Band A-H counts sum to total dwellings within 20-property slack (VOA 10-property suppression allowance).
- Every band-count target value ≤ 500k (largest LA stock).

Coverage expectations
- Every English, Welsh and Scottish LA has a Band D value.
- Northern Ireland has no council tax flagged (has_council_tax=False).

Spot-checks of published facts
- Wandsworth (E09000032) and Westminster (E09000033) are the two lowest-Band-D English LAs (catches row-swap bugs).
- Scottish average Band D is £500+ below English average.

Target-API invariants
- get_targets() returns a non-empty list without network access.
- Band D target count matches the CSV's non-null Band D count.
- Band count target count matches Σ non-null band columns.
- Every target carries geographic_level=LOCAL_AUTHORITY and a geo_code.
- Band D targets use Unit.GBP; band count targets use Unit.COUNT with is_count=True.
- Every target has at least one year of values.

Sources
-------
- MHCLG (England 2026-27): https://www.gov.uk/government/statistics/council-tax-levels-set-by-local-authorities-in-england-2026-to-2027
- Welsh Government (Wales 2026-27): https://www.gov.wales/council-tax-levels-april-2026-march-2027-html
- Scottish Government (Scotland 2025-26): https://www.gov.scot/publications/council-tax-datasets/
- VOA (England + Wales 2025): https://www.gov.uk/government/statistics/council-tax-stock-of-properties-2025

Out of scope for this PR (follow-ups)
-------------------------------------
- Wiring these targets into datasets/local_areas/local_authorities/loss.py so the LA reweighting actually calibrates on them. Planned follow-up PR.
- Scottish Assessors per-LA chargeable-dwellings to fill the Scotland band-count gap.
- Council Tax Support caseload per LA (DWP StatXplore).
- Single Person Discount rate per LA (CIPFA).
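The "band counts sum to total within 20-property slack" check from the testing list can be sketched like this. The helper name and the counts are made up; only the slack of 20 and the idea of tolerating suppressed (null) bands come from the commit message.

```python
def bands_match_total(band_counts, total_dwellings, slack=20):
    """True if the non-null band counts sum to the published total within `slack`."""
    observed = [count for count in band_counts if count is not None]
    return abs(sum(observed) - total_dwellings) <= slack


# Made-up LA with one suppressed band (None): bands sum to 350 against a total of 360.
within_slack = bands_match_total([100, 200, None, 50], total_dwellings=360)

# Made-up LA whose bands fall 50 short of the published total.
outside_slack = bands_match_total([100, 200, 50], total_dwellings=400)
```

For the check to be meaningful, total_dwellings has to come from an independent published column rather than from summing the same band columns.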
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address review: add Welsh Band I, source totals from VOA, tidy module

Review points addressed:
- Add count_band_I column to la_council_tax.csv, populated for all 22 Welsh LAs (Wales revalued in 2005 and introduced a 9th band). Cardiff 1480, Monmouthshire 670, Vale of Glamorgan 1060, etc. English rows keep Band I null; VOA marks it [z] (not applicable).
- Re-source total_dwellings from VOA "All properties" column instead of deriving it as the sum of A-H. Previously Σ(A..H) was used for both sides of test_band_counts_sum_to_total, making the test self-referential; now it validates against the published total with a 20-property slack for VOA rounding.
- Rename count columns symmetrically: band_A..band_H + band_D_count → count_band_A..count_band_I. Removes the lopsided band_D_count name that existed only to avoid clashing with band_d_amount.
- Align band-count target names with voa_council_tax.py: voa/council_tax/{code}/{band} (was ons/council_tax_band_count/...); variable="council_tax_band" (was council_tax_band_count, which is not a real PolicyEngine-UK variable); drop breakdown_variable to match the regional VOA module.
- Cache the CSV read with @lru_cache(maxsize=1), matching voa_council_tax.
- Update module docstring: "A-H in England/Scotland, A-I in Wales".

Tests:
- New: test_welsh_las_have_band_i (all 22 Welsh LAs populated).
- New: test_english_las_have_no_band_i (guard against spurious fills).
- New: test_cardiff_band_i_matches_published_figure (~1,480 per VOA 2025).

Final target counts:
- 350 Band D amount targets (unchanged).
- 2,563 band-count targets, up from 2,541: +22 Welsh Band I plus two band-H rows that were null due to the earlier truncation.
* Satisfy ruff format on la_council_tax.py

* Wire LA council-tax band-count targets into the calibration loss matrix

The targets registered in la_council_tax.py were inert: the LA target matrix had no columns for them, so the reweighter could not see them. This wires the eight VOA Council Tax Stock-of-Properties band-count targets (A-H) into the LA loss matrix:
- matrix entry: per-household indicator 1[council_tax_band == B] from policyengine-uk.
- y entry: 360-vector of per-LA dwelling counts from storage/la_council_tax.csv.

For LAs without VOA data, namely Scottish LAs (the VOA summary tables don't cover Scotland) and Northern Irish LAs (no council tax), the value falls back to national_count × la_household_share, matching the existing tenure block's fallback pattern.

Two targets are deliberately not wired in this pass:
- Band I: Wales-only and mostly null in the CSV.
- The Band D £ amount (ons/council_tax_band_d/{code}): a per-rate quantity that does not fit the linear matrix-times-weights aggregation. Wiring it as total council-tax revenue would need Scotland-specific band ratios (different from England/Wales after 2017) and is worth a separate PR.

New tests in test_la_loss_council_tax.py cover both layers:
- Light: CSV joins to every LA code, the eight count_band_{X} columns exist, E/W rows are populated, Scotland is null as documented, and NI has has_council_tax=False.
- Full build (gated on enhanced FRS fixture): all eight columns present in matrix and y; y vectors length 360, finite and positive; matrix entries are 0/1 indicators with rows summing to ≤1; y matches the CSV verbatim for an English LA (Hartlepool); Scotland and NI LAs receive a positive fallback rather than NaN or zero.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add LA-level net council tax £ target alongside band counts

Wires the second FRS data point into the LA reweighter, addressing the 28 Apr standup ALIGNED decision: "calibrate the two FRS data points as the council tax information is provided after deductions."

Both sides of the new constraint are net of CTR:
- matrix col = council_tax_less_benefit (gross − CTR benefit)
- y = directly observed net council tax requirement per LA

Sources (no national-total apportionment, all directly published):
- England (296 LAs): MHCLG Council Taxbase 2025, Table 1.35 "Tax base after allowance for council tax support" × Band D amount. Sums to £47.4bn, within 3.4% of the MHCLG Table 1 published England Council Tax Requirement of £45.86bn (small gap from year mismatch: 2025 taxbase × 2026-27 Band D).
- Wales (22 LAs): Welsh Government "Council Tax Levels April 2026 to March 2027" Table 3 "Council tax income (£m)". Sums to £2.45bn.
- Scotland (32) and NI (10): no source wired; loss.py routes through the existing national × la_household_share fallback, same pattern as the band-count target and the rent target.

Mirrors the rent block in loss.py: load CSV → merge into ct_merged → matrix col / y assignment / has_data mask / national-share fallback.

Files:
- storage/la_council_tax.csv: new column total_council_tax_net.
- targets/sources/la_council_tax.py: load_la_net_council_tax() + Target objects named housing/council_tax_net/{code}.
- datasets/local_areas/local_authorities/loss.py: housing/council_tax_net block immediately after the band-count block.
- tests/test_la_loss_council_tax.py: 11 new tests (4 layer-1 + 7 layer-2) covering CSV column presence, country coverage, value range, England-total ballpark vs MHCLG, matrix-col correctness, na-fallback behaviour, calibratability sanity check.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Fix gross/net mismatch in OBR national council tax compute

OBR EFO Table 4.1 reports "Total net council tax receipts", i.e. net of council tax reduction (CTR). The matching household-level signal is council_tax_less_benefit (= gross council tax − CTR award), not council_tax (which is the gross liability before CTR per its docstring "Gross amount spent on Council Tax, before discounts").

Calibrating gross household values against a net national target systematically pulls weights down to fit (Σ w × gross > Σ w × net), leaking bias into adjacent national targets that share the weight vector.

Order-of-magnitude sanity (UK 2024-25):
- Σ w × council_tax (gross) ≈ £55bn
- Σ w × council_tax_less_benefit (net) ≈ £47bn
- OBR Table 4.1 "Total net council tax" ≈ £44bn

After the fix, the council tax constraint is internally consistent (both sides net) and aligns with Max's 28 Apr standup decision on FRS-net-of-CTR alignment. Pairs naturally with the LA-level housing/council_tax_net target this PR adds: both use the same net variable.

Adds three regression tests pinning the net-variable contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Zero NI council tax targets instead of fabricating fallbacks

Northern Ireland uses domestic rates, not council tax. The CSV's has_council_tax flag has been False for NI from the original commit, but loss.py was ignoring it and assigning national × la_household_share to NI LAs for both band counts and the new net £ column.

Effect: the optimiser was being told "NI households should pay this much council tax" with a positive target, while every NI household has council_tax_band == None and council_tax_less_benefit == 0, an unsatisfiable constraint that wastes loss the optimiser cannot drive to zero.

Reported by @MaxGhenis in PR review.

Fix: read has_council_tax from the CSV, gate the np.where so NI LAs get y == 0 for all 9 council-tax columns. Direct-value and fallback paths unchanged for E/W/S.

Updates two tests that previously asserted positive fallback for NI; adds explicit zero-NI assertion for housing/council_tax_net.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Document derived/proxy nature + lineage drift for #374 CT targets

Per @MaxGhenis PR review: both council-tax LA targets are derived proxies, not direct matches for the matrix-side variables. The PR description and code comments earlier overstated this.

- voa/council_tax/{A..H}: target counts VOA dwellings (E&W only, includes exempt/empty/second homes); matrix counts policyengine-uk households. Banding ratios differ in Scotland post-2017 and Wales has Band I.
- housing/council_tax_net: target value is MHCLG taxbase × Band D (taxbase = Band D equivalent dwellings adjusted for ~7 discount/premium/exemption classes); matrix col is FRS-reported council_tax_less_benefit (household-reported gross less reported CTB). Same intent, different construction paths.

Documentation only: no code, data, or test behaviour change. The la_council_tax.py docstring now has an explicit "Lineage caveats" section, and loss.py block comments label both targets as derived/proxy with cross-reference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Mask unavailable LA council tax targets

* Remove redundant council tax availability gate

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Max Ghenis <mghenis@gmail.com>
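The gating described in the "Zero NI council tax targets" commit above can be sketched per LA as follows. This is a scalar illustration under assumptions: the function name and numbers are made up, the real code gates an np.where over whole columns, and the two later commits in the same squash ("Mask unavailable LA council tax targets", "Remove redundant council tax availability gate") may have changed the final behaviour.

```python
def council_tax_target(has_council_tax, direct_value, national_total, household_share):
    """Per-LA y value: zero where council tax does not exist, else direct value or fallback."""
    if not has_council_tax:
        # No council tax in this LA, so a positive target could never be satisfied.
        return 0.0
    if direct_value is None:
        # No published figure: fall back to the LA's household share of the national total.
        return national_total * household_share
    return direct_value


# Made-up values for the three paths.
ni_value = council_tax_target(False, None, national_total=1e9, household_share=0.01)
fallback_value = council_tax_target(True, None, national_total=1e9, household_share=0.01)
direct_value = council_tax_target(True, 5e6, national_total=1e9, household_share=0.01)
```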
What this PR does
Adds a derived proxy LA-level main-residence-value calibration target to datasets/local_areas/local_authorities/loss.py. Per-LA target is a constructed product, not a directly observed total: avg_price × ownership_share × n_households.

Same multiplicative shape as the existing private-rent target (median_rent × renter_share × n_households). LAs missing any input (Wales / Scotland / NI; EHS is England-only) fall through to the national_property × la_household_share fallback, identical to how the tenure target handles missing LAs.

Lineage caveat (flagged in review by @MaxGhenis)

This is a derived/proxy target, not a direct benchmark:
- main_residence_value (policyengine-uk) is WAS-imputed stock wealth, regionally uprated via property-wealth intensity ratios.
- avg_price × ownership × n_households is a defensible identity ("if every owner-occupied dwelling were valued at the LA HPI average, the total would be £X") but the two sides of the calibration constraint reference different price concepts.

A separate policy question (whether derived/proxy targets like this should sit at full training weight alongside directly observed targets such as HMRC SPI, ONS pop, DWP UC and VOA dwellings, or be soft-weighted) is being tracked separately and is not blocking this PR.
Closes #370.
Files
New
- policyengine_uk_data/storage/la_land_values.csv: 360 rows: code, name, households, avg_house_price. avg_house_price from HM Land Registry UK HPI Dec 2025 with name-based fallback for re-allocated codes (Sheffield E08000019 → E08000039), NI country-level fallback for missing LGD months, national-avg fallback for the Isles of Scilly.
- policyengine_uk_data/targets/sources/la_land.py: load_la_avg_prices(), _compute_la_targets() (observed-input product, no national-total apportionment), get_targets() returning Target objects named housing/main_residence_value/{code} with source=hmlr, geographic_level=LOCAL_AUTHORITY.
- tests/test_la_land_value_targets.py, 8 in tests/test_la_loss_land_value.py.
- changelog.d/370.md.

Modified

- datasets/local_areas/local_authorities/loss.py: adds the housing/main_residence_value column following the rent-block pattern: merge avg_house_price into tenure_merged, compute target inline, apply np.where(has_property, target, national * la_household_share) fallback. Same shape as the surrounding tenure / rent / ONS-income blocks.

Tests

26 new tests cover:
- [500, 5_000]).
- avg_price × ownership_share × n_households exactly; all-positive; English LAs covered (Wales/Scotland/NI fall through to the loss.py national-share fallback by design; same behaviour as the existing tenure target, which only has EHS England data).
- Target objects produced (one per English LA), all tagged local_authority and source=hmlr.
- housing/main_residence_value column present in both matrix and y; per-LA y equals the observed-input product for covered LAs; matrix column equals sim.calculate("main_residence_value") (gated on enhanced-FRS fixture).

Full run including adjacent suites (regional land, target DB, target registry, release manifest): 72 passed, 15 skipped (FRS-fixture-gated), no regressions.
Sanity check
Both ordering (K&C ≫ Blackpool) and absolute level (£10s of bn per LA) look right.
Sources (constructed inputs, not direct LA totals)
- la_tenure.xlsx: ownership shares, England only.
- la_count_households.xlsx: household counts.

Related