[BUG] Docker Compose --wait not always honoring healthy healthcheck for container that crashed, then became healthy #12424
Compose polls the Engine API to check whether a container has reached the "healthy" state. But if it detects a container crash, I would not expect it to silently ignore the crash and let the container restart. IMHO the bug you describe should have the opposite fix: Compose should always detect that a container crashed, then at least warn the user or stop.
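To make the behavior argued for above concrete, here is a minimal Python sketch of a polling loop that treats a crash as a distinct outcome instead of waiting through it. It is not Compose's actual implementation. The function name `wait_for_healthy`, the `inspect` callback, and the returned strings are hypothetical. The only facts taken from Docker are the state values: `State.Status` can be `restarting`, `exited`, or `dead`, and `State.Health.Status` can be `starting`, `healthy`, or `unhealthy`.

```python
import time
from typing import Callable, Tuple

# Container states that this sketch treats as "the container crashed".
# Assumption: under a restart policy, a crashed container shows up as "restarting".
CRASH_STATES = ("restarting", "exited", "dead")


def wait_for_healthy(
    inspect: Callable[[], Tuple[str, str]],
    timeout: float = 60.0,
    interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the container is healthy, has crashed, or the timeout expires.

    `inspect` is a hypothetical callback returning (state_status, health_status).
    Returns "healthy", "crashed", or "timeout".
    """
    deadline = clock() + timeout
    while True:
        status, health = inspect()
        if status in CRASH_STATES:
            # Report the crash instead of silently waiting for a restart.
            return "crashed"
        if health == "healthy":
            return "healthy"
        if clock() >= deadline:
            return "timeout"
        sleep(interval)
```

A caller could print a warning or abort when the result is `"crashed"`, which matches the "at least warn the user or stop" suggestion.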
haydentherapper
added a commit
to haydentherapper/rekor
that referenced
this issue
May 6, 2025
As noted in docker/compose#12424, compose --wait doesn't seem to honor healthchecks with restart:always, when the server crashes and restarts a few times and eventually becomes healthy. This was happening with Rekor:
* MySQL was not yet healthy because the healthcheck wasn't working as expected. docker-library/mysql#930 (comment) suggested using 127.0.0.1 instead of localhost
* trillian-log-server was not yet healthy even when MySQL reported as healthy, causing trillian-log-server to crash and restart a few times. There was no healthcheck for either Trillian service because the image we're using is based on Distroless, which has no curl/wget.
* rekor-server tried to start up with an unhealthy trillian-log-server, and crashed. The healthcheck reported as unhealthy, and even though the server eventually became healthy because of the restart:always policy, the healthcheck reported the startup as unhealthy.
This change adds healthchecks to trillian-log-server and log-signer by pulling the binaries out of the images and putting them into Debian 12 containers that include curl, so we can curl the /healthz endpoint. This also fixes the MySQL healthcheck as noted above.
Now, docker compose up --wait properly waits for a healthy MySQL before starting trillian-log-server, and a healthy Trillian before starting Rekor.
This also bumps the Redis image to latest, since Redis 6.x is quite old.
Also fix minor Dockerfile linting errors.
Signed-off-by: Hayden B <8418760+haydentherapper@users.noreply.github.com>
haydentherapper
added a commit
to sigstore/rekor
that referenced
this issue
May 14, 2025
…hy (#2473)
As noted in docker/compose#12424, compose --wait doesn't seem to honor healthchecks with restart:always, when the server crashes and restarts a few times and eventually becomes healthy. This was happening with Rekor:
* MySQL was not yet healthy because the healthcheck wasn't working as expected. docker-library/mysql#930 (comment) suggested using 127.0.0.1 instead of localhost
* trillian-log-server was not yet healthy even when MySQL reported as healthy, causing trillian-log-server to crash and restart a few times. There was no healthcheck for either Trillian service because the image we're using is based on Distroless, which has no curl/wget.
* rekor-server tried to start up with an unhealthy trillian-log-server, and crashed. The healthcheck reported as unhealthy, and even though the server eventually became healthy because of the restart:always policy, the healthcheck reported the startup as unhealthy.
This change adds healthchecks to trillian-log-server and log-signer by pulling the binaries out of the images and putting them into Debian 12 containers that include curl, so we can curl the /healthz endpoint. This also fixes the MySQL healthcheck as noted above.
Now, docker compose up --wait properly waits for a healthy MySQL before starting trillian-log-server, and a healthy Trillian before starting Rekor.
Also fix minor Dockerfile linting errors.
Signed-off-by: Hayden B <8418760+haydentherapper@users.noreply.github.com>
Description
Occasionally, when a container is unhealthy on start, restarts, and then becomes healthy, docker compose up -d --wait fails with an unhealthy error message. This happens when docker compose up -d --wait is run in parallel and the services use the restart: unless-stopped policy. Note that it happens intermittently, not every time.
I would hope that even if the container is unhealthy and crashes on start, --wait will account for this, as the container eventually becomes healthy after restarting itself, provided that happens within the timeout period.
Steps To Reproduce
I have 3 config files like so:
docker-compose-redis:
docker-compose-kafka:
docker-compose-relay:
When I run
Relay sometimes fails to come up with the --wait flag, even if the Docker status is technically healthy.
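One way to cross-check the health status Docker itself reports, and to wait through restarts until the container turns healthy, is a small polling script. The sketch below is a hypothetical workaround, not something Compose provides. The function names `read_health` and `wait_until_healthy` and the container name `relay` are assumptions. It relies only on `docker inspect --format '{{.State.Health.Status}}'`, which prints `starting`, `healthy`, or `unhealthy` for containers that define a healthcheck.

```python
import subprocess
import time
from typing import Callable


def read_health(container: str) -> str:
    """Return the health status reported by `docker inspect`, or "unknown" on error."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Health.Status}}", container],
        capture_output=True,
        text=True,
    )
    # A non-zero exit code covers a missing container or a missing healthcheck.
    return result.stdout.strip() if result.returncode == 0 else "unknown"


def wait_until_healthy(
    container: str,
    timeout: float = 120.0,
    interval: float = 1.0,
    read: Callable[[str], str] = read_health,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Keep polling through unhealthy or restarting phases until healthy or timed out."""
    deadline = clock() + timeout
    while clock() < deadline:
        if read(container) == "healthy":
            return True
        sleep(interval)
    return False


if __name__ == "__main__":
    # "relay" is an assumed container name; substitute the real one.
    print("healthy" if wait_until_healthy("relay") else "not healthy within timeout")
```

Unlike a check that stops at the first unhealthy report, this loop only gives up when the timeout expires, which is the behavior hoped for from --wait in this report.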
Logs:
Compose Version
Docker Environment
Anything else?
Let me know if there is anything else I can add to help out when reproducing the issue. The contents of the relay configs can be found here:
https://github.com/getsentry/relay/tree/fe3f09fd3accd2361887dd678dbe034f25139fce/devservices/config