Fix flaky prometheus_updater_spec by cleaning up PROCESS_TYPE env var #4868

Open
joyvuu-dave wants to merge 1 commit into cloudfoundry:main from joyvuu-dave:fix/process-type-env-leak-flaky-prometheus-tests

joyvuu-dave commented Feb 20, 2026

Fix flaky prometheus_updater_spec caused by PROCESS_TYPE env var leaking between tests

Problem

The PrometheusUpdater spec has been flaky since PR #4749 (commit 9c9a9d51d, merged Jan 16 2026), which introduced ExecutionContext and made metric registration conditional on ENV['PROCESS_TYPE']. CI runs frequently fail with errors like:

NoMethodError:
  undefined method `set' for nil
# ./lib/cloud_controller/metrics/prometheus_updater.rb:131:in `update_gauge_metric'

Root Cause

Three spec files set ENV['PROCESS_TYPE'] (via set_process_type_env) but never restore it:

  • connection_metrics_spec.rb — sets it to cc-worker and puma_worker. The cc-worker value is the direct cause of the flakiness: it maps to CC_WORKER, which only registers DB_CONNECTION_POOL_METRICS, DELAYED_JOB_METRICS, and VITAL_METRICS. The puma_worker value happens to be harmless today (API_PUMA_WORKER registers all metrics), but it is still pollution.
  • puma_runner_spec.rb — sets it to puma_worker (via before_worker_boot callback). Also harmless today for the same reason, but still pollution.
  • runner_spec.rb — sets it to main (via Runner#initialize), which maps to API_PUMA_MAIN — the same context the test environment defaults to. Harmless today, but still pollution.

With randomized test ordering, when connection_metrics_spec.rb's cc-worker context runs before prometheus_updater_spec, the leaked value changes the behavior of ExecutionContext.from_process_type_env:

  • Normal state: PROCESS_TYPE is unset → falls back to CC_TEST=true check → returns API_PUMA_MAIN → registers all metrics
  • Polluted state: PROCESS_TYPE=cc-worker → returns CC_WORKER → registers only a subset of metrics → metrics like :cc_deployments_in_progress_total are never registered → @registry.get(...) returns nil
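
The failure chain above can be reproduced in isolation. The following is a minimal sketch, not the real Cloud Controller code: SketchExecutionContext, SketchRegistry, SketchGauge, update_gauge and the :cc_db_connection_pool metric name are hypothetical stand-ins, and the environment is passed as a plain hash so the sketch does not touch the real ENV.

```ruby
# Hypothetical stand-in for ExecutionContext: picks the metric set to
# register based on PROCESS_TYPE. The real mapping has more process types.
module SketchExecutionContext
  ALL_METRICS = %i[cc_deployments_in_progress_total cc_db_connection_pool].freeze
  WORKER_METRICS = %i[cc_db_connection_pool].freeze # hypothetical subset

  def self.metrics_for(env)
    case env['PROCESS_TYPE']
    when 'cc-worker' then WORKER_METRICS
    else ALL_METRICS # unset (test default) behaves like API_PUMA_MAIN
    end
  end
end

# Hypothetical gauge with the same `set` interface the error message refers to.
class SketchGauge
  attr_reader :value

  def set(value)
    @value = value
  end
end

# Hypothetical registry: `get` returns nil for metrics that were never registered.
class SketchRegistry
  def initialize(metric_names)
    @gauges = metric_names.to_h { |name| [name, SketchGauge.new] }
  end

  def get(name)
    @gauges[name]
  end
end

def update_gauge(env, name, value)
  registry = SketchRegistry.new(SketchExecutionContext.metrics_for(env))
  registry.get(name).set(value) # raises NoMethodError when get returns nil
end

# Clean state: the metric is registered, so the update succeeds.
update_gauge({}, :cc_deployments_in_progress_total, 3)

# Polluted state: the metric is missing from the worker subset.
begin
  update_gauge({ 'PROCESS_TYPE' => 'cc-worker' }, :cc_deployments_in_progress_total, 3)
rescue NoMethodError => e
  puts "polluted run failed: #{e.message}"
end
```

The point of the sketch is only that the failure depends on the value of PROCESS_TYPE at the time metrics are registered, not on anything inside prometheus_updater_spec itself.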

Note: connection_metrics_spec.rb was already setting ENV['PROCESS_TYPE'] without cleanup before PR #4749, but it didn't matter then because PrometheusUpdater registered all metrics unconditionally. PR #4749 made registration conditional, which turned the pre-existing env var pollution into a flaky test.

Fix

Add around blocks to all three polluting specs that save and restore ENV['PROCESS_TYPE']:

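around do |example|
  # Remember the value (nil if unset) before the example can change it.
  original_process_type = ENV.fetch('PROCESS_TYPE', nil)
  example.run
ensure
  # Restore the original state, deleting the variable if it was unset.
  if original_process_type.nil?
    ENV.delete('PROCESS_TYPE')
  else
    ENV['PROCESS_TYPE'] = original_process_type
  end
end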
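
The same save-and-restore logic can be exercised outside RSpec. This is a sketch only: with_preserved_env is a hypothetical helper name and is not part of this PR, which adds the around block directly to each spec.

```ruby
# Hypothetical helper (not part of the PR): runs a block and then puts
# ENV[key] back to whatever it was before, even if the block raises.
def with_preserved_env(key)
  original = ENV.fetch(key, nil)
  yield
ensure
  if original.nil?
    ENV.delete(key) # the variable was unset before, so unset it again
  else
    ENV[key] = original
  end
end

before = ENV.fetch('PROCESS_TYPE', nil)

with_preserved_env('PROCESS_TYPE') do
  ENV['PROCESS_TYPE'] = 'cc-worker' # simulates a spec mutating the env var
end

# The mutation made inside the block does not leak past it.
puts ENV.fetch('PROCESS_TYPE', nil) == before
```

Restoring in an ensure clause matters because a failing example would otherwise skip the cleanup and leak the value anyway.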

Verification

Reproducible with --seed 8:

bundle exec rspec --seed 8 \
  spec/unit/lib/sequel/extensions/connection_metrics_spec.rb \
  spec/unit/lib/cloud_controller/runners/puma_runner_spec.rb \
  spec/unit/lib/cloud_controller/runner_spec.rb \
  spec/unit/lib/cloud_controller/metrics/prometheus_updater_spec.rb
  • Without fix (on main): 74 examples, 11 failures — the exact same 11 prometheus_updater_spec failures seen in CI
  • With fix: 74 examples, 0 failures

Additionally tested with seeds 1–7, 9, 10, 11111, 22222, 33333, 44444, 55555, 66666, 12345, and 67890 — all pass.

  • I have reviewed the contributing guide

  • I have viewed, signed, and submitted the Contributor License Agreement

  • I have made this pull request to the main branch

  • I have run all the unit tests using bundle exec rake

  • I have run CF Acceptance Tests

Three spec files (runner_spec, puma_runner_spec, connection_metrics_spec)
set ENV['PROCESS_TYPE'] via set_process_type_env but never restored it.
With randomized test ordering, a leaked value of 'cc-worker' (from
connection_metrics_spec) caused ExecutionContext.from_process_type_env to
return CC_WORKER instead of API_PUMA_MAIN, which only registers a subset
of Prometheus metrics. Subsequent prometheus_updater_spec tests then got
nil from registry.get() and failed with "undefined method 'set' for nil".

Add around blocks to save/restore PROCESS_TYPE in all three specs.