Skip to content

fix(ai-proxy-multi): guard construct_upstream in healthcheck timers#13592

Open
shreemaan-abhishek wants to merge 2 commits into
apache:masterfrom
shreemaan-abhishek:fix/healthcheck-construct-upstream-guard
Open

fix(ai-proxy-multi): guard construct_upstream in healthcheck timers#13592
shreemaan-abhishek wants to merge 2 commits into
apache:masterfrom
shreemaan-abhishek:fix/healthcheck-construct-upstream-guard

Conversation

@shreemaan-abhishek

@shreemaan-abhishek shreemaan-abhishek commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Description

ai-proxy-multi's construct_upstream returns nil, err on its normal failure path (unresolvable endpoint, missing host/port, nil instance) and can also raise a Lua error. The two healthcheck-manager timers called it without a pcall or a nil check:

upstream = plugin.construct_upstream(upstream_constructor_config)
upstream.resource_key = resource_path

When construct_upstream returned nil (or threw), the following upstream.resource_key = ... / upstream._nodes_ver access raised attempt to index a nil value inside the timer callback that iterates all healthcheck resources. A single malformed ai-proxy-multi instance could therefore abort checker creation/refresh for every resource in that worker, not just the offending one.

This wraps both call sites in pcall. On failure the timer logs and skips the resource (creation path) or keeps the existing checker (refresh path) instead of crashing.

Tests

Adds t/plugin/ai-proxy-multi-construct-upstream-panic.t covering both timer paths: a construct_upstream override that raises a Lua error during checker creation and during the working-pool check. Proxying stays unaffected and the guarded message is logged instead of the worker timer dying.

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change (N/A: internal robustness fix, no user-facing behavior change)
  • I have verified that this change is backward compatible

construct_upstream returns nil, err on its normal failure path (unresolvable
endpoint, missing host/port, nil instance) and can raise a Lua error. Both
healthcheck timers called it without a pcall or nil check, so one malformed
instance made upstream nil and the subsequent upstream.resource_key /
upstream._nodes_ver access aborted checker creation/refresh for every resource
in that worker, not just the bad one.

Wrap both call sites in pcall; on failure log and skip the resource (creation)
or keep the existing checker (refresh) instead of crashing the timer.
@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. bug Something isn't working labels Jun 22, 2026

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot couldn't run its full agentic review because no GitHub Actions runner was available. Make sure your repository has a runner available to run Copilot's review, or add a copilot-setup-steps.yml file specifying one with the runs-on attribute. See the docs for more details.

Improves robustness of ai-proxy-multi healthcheck manager timers by preventing timer callbacks from crashing when construct_upstream returns nil, err or throws, and introduces regression tests that simulate panics.

Changes:

  • Wrap construct_upstream calls in both healthcheck-manager timers with pcall and add failure handling/logging.
  • Prevent nil-indexing in timer callbacks by skipping resource creation or handling refresh failures more safely.
  • Add tests covering panic scenarios during checker creation and working-pool checks.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
apisix/healthcheck_manager.lua Adds pcall guards and conditional logic around dynamic upstream construction in healthcheck timers.
t/plugin/ai-proxy-multi-construct-upstream-panic.t Adds regression tests that force construct_upstream to throw to ensure timers don’t die and proxying continues.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +257 to +271
ok, upstream, err = pcall(plugin.construct_upstream, upstream_constructor_config)
if not ok or not upstream then
err = err or upstream
upstream = nil
local err_msg = "[checking checker] unable to construct upstream for plugin: "
.. plugin_name .. ", resource path: " .. resource_path
.. ", json path: " .. json_path .. ", error: " .. err
if not ok then
core.log.error(err_msg)
else
core.log.warn(err_msg)
end
else
upstream.resource_key = resource_path
end
Comment on lines +275 to 283
if upstream then
local current_ver = upstream_utils.version(res_conf.modifiedIndex,
upstream._nodes_ver)
core.log.info("checking working pool for resource: ", resource_path,
" current version: ", current_ver, " item version: ", item.version)
if item.version == current_ver then
need_destroy = false
end
end
Comment on lines +75 to +78
plugin.construct_upstream = function(instance)
local panic_check
panic_check.cnt = 1
end

__DATA__

=== TEST 1: panic in construct_upstream during checker creation must not crash the timer
…afe and non-destructive

- guard nil error value before concatenation in timer_working_pool_check
- retain existing checker on transient construct_upstream failure
- cover nil-return failure paths in both timers
@dosubot dosubot Bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jun 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working size:XL This PR changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants