
Conversation

Revolyssup
Contributor

@Revolyssup Revolyssup commented Jul 14, 2025

Description

The current implementation of the health checker has a few problems.

  1. After a checker is created, it is stored directly on the object table it belongs to as a checker attribute. That associated object is frequently rebuilt by actions such as DNS resolution, service discovery, and configuration file merging, so correctly updating the associated object while keeping the stored checker in sync is error-prone and relatively complex.
  2. The parent field adds complexity across the source code and has to be maintained carefully, even though it is only used for health checks.
  3. The lifecycles of the checker and the upstream object are not strongly consistent.

Solution

  1. The upstream-to-checker relationship has been changed from a strong binding to an index reference, keyed by the resource_path and resource_version fields. The mapping is managed by a separate healthcheck_manager module.
  2. A timer asynchronously creates checkers from a waiting pool. Requests no longer create health checkers directly, so the health checker lifecycle is decoupled from request handling (see the sketch below).
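
A minimal sketch of the waiting-pool idea in illustrative Lua. The pool names follow the description above, but new_checker is a placeholder and the code is not the actual module API:

    local pairs = pairs

    local waiting_pool = {}   -- resource_path -> resource_version, recorded on the request path
    local working_pool = {}   -- resource_path -> { version = ..., checker = ... }

    -- Request path: only record that this resource needs a checker; never create one here.
    local function register(resource_path, resource_version)
        waiting_pool[resource_path] = resource_version
    end

    -- Timer path (roughly ngx.timer.every(1, ...)): create or refresh checkers asynchronously.
    -- new_checker stands in for building a lua-resty-healthcheck checker.
    local function create_checkers(new_checker)
        for path, version in pairs(waiting_pool) do
            local current = working_pool[path]
            if not current or current.version ~= version then
                working_pool[path] = { version = version, checker = new_checker(path, version) }
            end
            waiting_pool[path] = nil
        end
    end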

NOTE: Many tests have been modified to either add a sleep or send an extra request before the actual request, for the following reasons:

  1. Since the checker is now created asynchronously, a wait of at least 1 second is needed to confirm it was created.
  2. The first request only puts the checker on the waiting pool; even with active health checks, the first request never runs the health check itself. The timer picks the entry up and creates the checker later.

Benchmark results confirming no anomalies

Below are the results of manual testing done to verify that there are no anomalies in CPU and memory usage.
Relevant ENV variables:
worker_process: 1
ADMIN_KEY="HaVUyrUtJtpuDuRzHipGziJxdbxOkeqm"
APISIX_ADMIN_URL="http://127.0.0.1:9180"
APISIX_DATA_PLANE="http://127.0.0.1:9080"
BACKEND_PORT=8080 # A go server is run locally for upstream nodes
UPSTREAM_NAME="stress_upstream"
ROUTE_NAME="stress_route"
N_NODES=200
1.1 Create resources with 50% of nodes failing

    NODES="["
    for i in $(seq 1 $((N_NODES/2))); do
        NODES+="{\"host\":\"127.0.0.1\",\"port\":$BACKEND_PORT,\"weight\":1},"
    done
    for i in $(seq 1 $((N_NODES/2))); do
        NODES+="{\"host\":\"127.0.0.1\",\"port\":1,\"weight\":1},"
    done
    NODES="${NODES%,}]"

    # Create upstream
    curl -s -X PUT "$APISIX_ADMIN_URL/apisix/admin/upstreams/1" \
        -H "X-API-KEY: $ADMIN_KEY" \
        -d "{
            \"name\": \"$UPSTREAM_NAME\",
            \"type\": \"roundrobin\",
            \"nodes\": $NODES,
            \"retries\": 2
        }"

    # Create route
    curl -s -X PUT "$APISIX_ADMIN_URL/apisix/admin/routes/1" \
        -H "X-API-KEY: $ADMIN_KEY" \
        -d "{
            \"name\": \"$ROUTE_NAME\",
            \"uri\": \"/*\",
            \"upstream_id\": \"1\"
        }"
    
    echo "Created upstream with $N_NODES nodes (50% failing)"

1.2 Run wrk and pidstat to calculate CPU usage - Baseline case
Wrk results:


 wrk -c 100 -t 5 -d 60s -R 900 http://localhost:9080
Running 1m test @ http://localhost:9080
  5 threads and 100 connections
  Thread calibration: mean lat.: 2.114ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 2.135ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 2.140ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 2.122ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 2.147ms, rate sampling interval: 10ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     2.10ms  735.04us  13.99ms   70.86%
    Req/Sec   189.76     70.81   444.00     69.94%
  54005 requests in 1.00m, 22.71MB read
  Non-2xx or 3xx responses: 54005
Requests/sec:    900.04
Transfer/sec:    387.62KB

CPU and Memory usage results
 pidstat -ur -p $pid 2 30 #sample 30 times - every 2 seconds
 
 #Output after sampling for 60 seconds
 
Average:      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
Average:     1000   1089374   17.15   13.73    0.00    0.10   30.88     -  openresty

Average:      UID       PID  minflt/s  majflt/s     VSZ     RSS   %MEM  Command
Average:     1000   1089374      5.00      0.00  498812   77144   0.48  openresty

RESULT:
The average CPU usage is 30.88% and memory usage is 0.48%.

Enable health checks on the upstream:

    health_config='{
        "active": {
            "type": "http",
            "http_path": "/",
            "healthy": {"interval": 1, "successes": 1},
            "unhealthy": {"interval": 1, "http_failures": 1}
        }
    }'
    curl -s -X PATCH "$APISIX_ADMIN_URL/apisix/admin/upstreams/1" \
        -H "X-API-KEY: $ADMIN_KEY" \
        -d "{\"checks\": $health_config}"
    echo "Enabled health checks"

Test with health checks enabled + routes updated in the background for 24 hrs

Wrk results:

 wrk -c 100 -t 5 -d 86400s -R 900 http://localhost:9080

Running 1440m test @ http://localhost:9080
  5 threads and 100 connections
  Thread calibration: mean lat.: 1.322ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 1.314ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 1.302ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 1.307ms, rate sampling interval: 10ms
  Thread calibration: mean lat.: 1.294ms, rate sampling interval: 10ms
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     9.01ms  562.23ms   1.00m    99.97%
    Req/Sec   188.85     68.87     2.33k    73.45%
  77336041 requests in 1440.00m, 13.33GB read
  Socket errors: connect 0, read 0, write 0, timeout 23542
  Non-2xx or 3xx responses: 15186209
Requests/sec:    895.09
Transfer/sec:    161.72KB
CPU and Memory usage results
 pidstat -ur -p $pid 3600 24 #sample 24 times - every 3600 seconds
Linux 6.14.4-arch1-2 (ashish-82yl)      24/07/25        _x86_64_        (16 CPU)

05:16:54 PM IST   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
06:16:54 PM IST  1000    142261   10.77    5.80    0.00    0.10   16.57     1  openresty

05:16:54 PM IST   UID       PID  minflt/s  majflt/s     VSZ     RSS   %MEM  Command
06:16:54 PM IST  1000    142261      9.35      0.00  409232   42680   0.27  openresty

...

04:16:54 PM IST   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
05:16:54 PM IST  1000    142261    3.71    1.81    0.00    0.02    5.51    14  openresty

04:16:54 PM IST   UID       PID  minflt/s  majflt/s     VSZ     RSS   %MEM  Command
05:16:54 PM IST  1000    142261      0.26      0.00  415664   50452   0.31  openresty


...

Average:      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
Average:     1000    142261    7.10    3.86    0.00    0.11   10.96     -  openresty

Average:      UID       PID  minflt/s  majflt/s     VSZ     RSS   %MEM  Command
Average:     1000    142261      7.71      0.00  411107   44214   0.27  openresty

RESULT:
After the 1st hour, CPU usage drops to 16.57% while memory usage stays at 0.27%.
Towards the end, CPU usage drops to 5.51% while memory usage rises only to 0.31%.
The average memory usage was 0.27%, and the total increase over 24 hrs was 0.04 percentage points.

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

@Revolyssup Revolyssup marked this pull request as draft July 14, 2025 15:18
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request labels Jul 14, 2025
@Revolyssup Revolyssup marked this pull request as ready for review July 16, 2025 12:15
@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jul 16, 2025
@Revolyssup Revolyssup marked this pull request as draft July 16, 2025 20:27
@membphis membphis self-requested a review July 23, 2025 01:31
@nic-6443 nic-6443 requested a review from bzp2010 July 23, 2025 01:31
@@ -27,7 +27,7 @@ local get_last_failure = balancer.get_last_failure
local set_timeouts = balancer.set_timeouts
local ngx_now = ngx.now
local str_byte = string.byte

local healthcheck_manager = require("apisix.healthcheck_manager")
Member

[screenshot]

@@ -75,7 +75,8 @@ local function fetch_health_nodes(upstream, checker)
local port = upstream.checks and upstream.checks.active and upstream.checks.active.port
local up_nodes = core.table.new(0, #nodes)
for _, node in ipairs(nodes) do
local ok, err = checker:get_target_status(node.host, port or node.port, host)
local ok, err = healthcheck_manager.fetch_node_status(checker,
Member

we can cache healthcheck_manager.fetch_node_status, a short local function

Contributor Author
@Revolyssup Revolyssup Jul 23, 2025

do you mean using the lrucache with checker as key?
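
For reference, a minimal sketch of the simpler reading of that suggestion, localizing the module function at load time; the argument order is assumed from the diff above and may not match the final code:

    local healthcheck_manager = require("apisix.healthcheck_manager")
    -- cache the function in a local so the hot path avoids repeated table lookups
    local fetch_node_status = healthcheck_manager.fetch_node_status

    -- later, inside fetch_health_nodes on the hot path
    local ok, err = fetch_node_status(checker, node.host, port or node.port, host)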

local events = require("apisix.events")
local tab_clone = core.table.clone
local timer_every = ngx.timer.every
local _M = {
Member

it is not safe to export working_pool and waiting_pool.

We will never allow the "working_pool" or "waiting_pool" to be destroyed.

they can only be used in this Lua module
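
A small sketch of the module-local alternative being asked for; the accessor name is illustrative only:

    -- keep the pools as upvalues of this module instead of fields on the exported table,
    -- so no other module can replace or destroy them
    local working_pool = {}
    local waiting_pool = {}

    local _M = {}

    -- expose only the narrow operations this module chooses to allow
    function _M.fetch_checker(resource_path)
        local entry = working_pool[resource_path]
        return entry and entry.checker
    end

    return _M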

@Revolyssup Revolyssup requested a review from nic-chen July 23, 2025 07:30
@membphis
Member

it looks good to me now, but we should wait a while longer for more comments; it is a big change

BTW, the PR title should be refactor ^_^

Member
@membphis membphis left a comment

LGTM

@Revolyssup Revolyssup changed the title feat: add healthcheck manager to decouple upstream refactor: add healthcheck manager to decouple upstream Jul 24, 2025
@nic-chen
Member

Most of our tests are for scenarios where the config_provider is etcd. Could you add health check tests for scenarios where the config_provider is yaml or json?

@Revolyssup
Contributor Author

Most of our tests are for scenarios where the config_provider is etcd. Could you add health check tests for scenarios where the config_provider is yaml or json?

Many of the existing tests here already use the yaml config_provider. See https://github.com/apache/apisix/pull/12426/files#diff-d89c41bb4b1cc7c97936090edd39c05e1fcbc35c94039a16dbb392515d4f38d3R34

nic-6443
nic-6443 previously approved these changes Jul 30, 2025
membphis
membphis previously approved these changes Jul 31, 2025
Member
@membphis membphis left a comment

LGTM, only some minor issues

-- if a checker exists then delete it before creating a new one
local existing_checker = _M.working_pool[resource_path]
if existing_checker then
    existing_checker.checker:delayed_clear(10)
Member
@membphis membphis Jul 31, 2025

pls remove this magic number
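
One possible way to do that; the constant name is illustrative, and the value 10 comes from the snippet above:

    -- seconds to keep the superseded checker alive before clearing it
    local CHECKER_CLEAR_DELAY = 10

    existing_checker.checker:delayed_clear(CHECKER_CLEAR_DELAY)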

Member
@membphis membphis left a comment

LGTM ^_^

@@ -20,6 +20,7 @@ local plugin = require("apisix.plugin")
local get_routes = require("apisix.router").http_routes
local get_services = require("apisix.http.service").services
local upstream_mod = require("apisix.upstream")
local healthcheck_manager = require("apisix.healthcheck_manager")
Contributor

It might make sense to put it in the same directory as the batch processor manager. The root directory isn't the right place for it.

@@ -66,14 +67,13 @@ function _M.schema()
return 200, schema
end


Contributor

Just a reminder, please try to avoid pointless formatting changes; this noise affects review efficiency.

local timer_create_checker_running = false
local timer_working_pool_check_running = false
timer_every(1, function ()
if not exiting() then
Contributor

Why not use the premature parameter on the timer callback instead of exiting()?
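
For reference, a minimal sketch of the premature-based pattern being suggested (illustrative only):

    -- ngx.timer.every passes premature as the first argument to the callback; it is
    -- true when the worker is shutting down, so the callback can return early without
    -- calling ngx.worker.exiting().
    timer_every(1, function (premature)
        if premature then
            return
        end
        -- create checkers from the waiting pool here
    end)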

end
end)
timer_every(1, function ()
if not exiting() then
Contributor

ditto

local healthcheck_shdict_name = "upstream-healthcheck"
local is_http = ngx.config.subsystem == "http"
if not is_http then
    healthcheck_shdict_name = healthcheck_shdict_name .. "-" .. ngx.config.subsystem
Contributor

Once again, is a separate shdict for stream a must? Can't it share data with http?

Contributor Author

removed

Comment on lines +17 to +29
local require = require
local ipairs = ipairs
local pcall = pcall
local exiting = ngx.worker.exiting
local pairs = pairs
local tostring = tostring
local core = require("apisix.core")
local config_local = require("apisix.core.config_local")
local healthcheck
local events = require("apisix.events")
local tab_clone = core.table.clone
local timer_every = ngx.timer.every
local string_sub = string.sub
Contributor

Please align their equals signs.

local xxx   = x
local x     = x
local xxxxx = x

Comment on lines +451 to +453
value.resource_key = parent and parent.key
value.resource_version = ((parent and parent.modifiedIndex) or value.modifiedIndex)
value.resource_id = ((parent and parent.value.id) or value.id)
Contributor

I'm concerned that this creates another special case, where I can get upstream metadata via res.resource_id/res.resource_version but can't do the same with the route.
This may reduce readability and maintainability, since developers need to understand why upstream.resource_id is available but route.resource_id and consumer.resource_id are not.

Contributor Author

We already have this special-case problem with maintaining .parent everywhere. In my view, .parent is much less maintainable and is spread across many different places in the code.

@Revolyssup Revolyssup merged commit 09f6e36 into apache:master Aug 7, 2025
50 checks passed
@Revolyssup Revolyssup deleted the revolyssup/refactor-healthcheck branch August 7, 2025 13:20