refactor: add healthcheck manager to decouple upstream #12426
Conversation
apisix/balancer.lua
Outdated
@@ -27,7 +27,7 @@ local get_last_failure = balancer.get_last_failure
local set_timeouts = balancer.set_timeouts
local ngx_now = ngx.now
local str_byte = string.byte

local healthcheck_manager = require("apisix.healthcheck_manager")
@@ -75,7 +75,8 @@ local function fetch_health_nodes(upstream, checker)
local port = upstream.checks and upstream.checks.active and upstream.checks.active.port
local up_nodes = core.table.new(0, #nodes)
for _, node in ipairs(nodes) do
local ok, err = checker:get_target_status(node.host, port or node.port, host)
local ok, err = healthcheck_manager.fetch_node_status(checker,
We can cache healthcheck_manager.fetch_node_status in a short local function.
Do you mean using the lrucache with checker as the key?
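A minimal sketch of the caching being suggested — no lrucache needed, just a plain local holding the function reference, hoisted once at module load time. The trailing arguments are assumed from the original get_target_status call, since the diff above truncates the new call:

```lua
-- sketch: hoist the function reference into a module-level local
-- to avoid a table lookup on every request
local healthcheck_manager = require("apisix.healthcheck_manager")
local fetch_node_status = healthcheck_manager.fetch_node_status

-- later, in the hot path of fetch_health_nodes
-- (argument list assumed to mirror the old get_target_status call):
local ok, err = fetch_node_status(checker, node.host, port or node.port, host)
```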
apisix/healthcheck_manager.lua
Outdated
local events = require("apisix.events")
local tab_clone = core.table.clone
local timer_every = ngx.timer.every
local _M = {
It is not safe to export working_pool and waiting_pool. We must never allow the "working_pool" or "waiting_pool" to be destroyed from outside; they should only be used inside this Lua module.
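A sketch of the encapsulation being asked for, with the pools kept as module-local upvalues and only functions exported. The accessor name fetch_checker is hypothetical, not taken from the PR:

```lua
-- sketch: keep the pools private to this module instead of
-- exporting them on _M
local working_pool = {}
local waiting_pool = {}

local _M = {}

-- hypothetical accessor; callers never touch the tables directly,
-- so they cannot replace or destroy them
function _M.fetch_checker(resource_path)
    local item = working_pool[resource_path]
    return item and item.checker
end

return _M
```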
It looks good to me now, but since it is a big change we have to wait a while for more comments. BTW, the PR title should be
LGTM
Most of our tests cover scenarios where the config_provider is etcd. Could you add health check tests for scenarios where the config_provider is yaml or json?
A lot of the existing tests here already use the yaml config_provider. See https://github.com/apache/apisix/pull/12426/files#diff-d89c41bb4b1cc7c97936090edd39c05e1fcbc35c94039a16dbb392515d4f38d3R34
LGTM, only some minor issues
apisix/healthcheck_manager.lua
Outdated
-- if a checker exists then delete it before creating a new one
local existing_checker = _M.working_pool[resource_path]
if existing_checker then
    existing_checker.checker:delayed_clear(10)
Please remove this magic number.
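For example, the 10 could become a named constant; the constant name below is illustrative, not from the PR:

```lua
-- seconds to keep a replaced checker alive before clearing it
local DELAYED_CLEAR_TIME = 10

-- ... at the call site:
existing_checker.checker:delayed_clear(DELAYED_CLEAR_TIME)
```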
LGTM ^_^
@@ -20,6 +20,7 @@ local plugin = require("apisix.plugin")
local get_routes = require("apisix.router").http_routes
local get_services = require("apisix.http.service").services
local upstream_mod = require("apisix.upstream")
local healthcheck_manager = require("apisix.healthcheck_manager")
It might make sense to put it in the same directory as the batch processor manager. The root directory isn't the right place for it.
@@ -66,14 +67,13 @@ function _M.schema()
    return 200, schema
end
Just a reminder, please try to avoid pointless formatting changes, this noise will affect review efficiency.
local timer_create_checker_running = false
local timer_working_pool_check_running = false
timer_every(1, function ()
    if not exiting() then
Why not use the premature parameter of the timer callback instead of exiting()?
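A sketch of what this suggests: ngx.timer.every passes a premature flag as the first argument to its callback when the worker is shutting down, which removes the need for an explicit exiting() check:

```lua
-- sketch: use the premature flag provided by ngx.timer.every
timer_every(1, function (premature)
    if premature then
        -- the worker is shutting down; skip the work
        return
    end
    -- ... periodic work here ...
end)
```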
    end
end)
timer_every(1, function ()
    if not exiting() then
ditto
local healthcheck_shdict_name = "upstream-healthcheck"
local is_http = ngx.config.subsystem == "http"
if not is_http then
    healthcheck_shdict_name = healthcheck_shdict_name .. "-" .. ngx.config.subsystem
Once again, is a separate shdict for stream a must? Can't it share data with http?
removed
local require = require
local ipairs = ipairs
local pcall = pcall
local exiting = ngx.worker.exiting
local pairs = pairs
local tostring = tostring
local core = require("apisix.core")
local config_local = require("apisix.core.config_local")
local healthcheck
local events = require("apisix.events")
local tab_clone = core.table.clone
local timer_every = ngx.timer.every
local string_sub = string.sub
Please align their equals signs:
local xxx   = x
local x     = x
local xxxxx = x
value.resource_key = parent and parent.key
value.resource_version = ((parent and parent.modifiedIndex) or value.modifiedIndex)
value.resource_id = ((parent and parent.value.id) or value.id)
I'm concerned this creates yet another special case: I can get upstream metadata via res.resource_id/res.resource_version but can't do the same with a route. This may reduce readability and maintainability, since developers need to understand why upstream.resource_id is available but route.resource_id and consumer.resource_id are not.
We already have this special-case problem with maintaining .parent everywhere. In my view, .parent is much less maintainable and is spread across different places in the code.
Description
The current implementation of the health checker has a few problems:
- The parent field adds complexity across the source code and needs to be handled with care. This field is only used for health checks.
Solution
NOTE: A lot of tests have been modified to either add a sleep or an extra request before the actual request, for the following reasons:
Benchmark results confirming no anomalies
Below are the results of manual testing done to verify that there are no variations in CPU and memory usage.
Relevant ENV variables:
worker_process: 1
ADMIN_KEY="HaVUyrUtJtpuDuRzHipGziJxdbxOkeqm"
APISIX_ADMIN_URL="http://127.0.0.1:9180"
APISIX_DATA_PLANE="http://127.0.0.1:9080"
BACKEND_PORT=8080 # A go server is run locally for upstream nodes
UPSTREAM_NAME="stress_upstream"
ROUTE_NAME="stress_route"
N_NODES=200
1.1 Create resources with 50% of nodes failing
1.2 Run wrk and pidstat to calculate CPU usage - Baseline case
Wrk results:
wrk
RESULT:
The average CPU% is 30.88% and MEM% is 0.48%.
Test with health checks enabled + updating routes in the background for 24 hrs
Wrk results:
RESULT:
After the 1st hour the CPU usage drops to 16.57%, while memory usage remains at 0.27%.
Towards the end, the CPU usage drops to 5.51%, while memory usage increases to just 0.31%.
The average memory usage was 0.27%, and the increase in memory usage over 24 hrs was 0.04%.
Checklist