
[BUG] Docker Compose --wait not always honoring healthy healthcheck for container that crashed, then became healthy #12424

Open
hubertdeng123 opened this issue Dec 30, 2024 · 1 comment

@hubertdeng123

Description

Occasionally, when a container is unhealthy on start, restarts, and then becomes healthy, docker compose up -d --wait fails with an "unhealthy" error message. This happens when several docker compose up -d --wait commands are run in parallel and the service uses the restart: unless-stopped policy. Note that it is intermittent, not deterministic.

I would expect --wait to tolerate a container that is unhealthy or crashes on start, since the container eventually becomes healthy after restarting itself, as long as that happens within the timeout period.
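
For example (illustrative only, not part of my actual setup): with an explicit --wait-timeout I would expect a zero exit code as long as the container settles to healthy within that window, even if it crashed and restarted along the way:

# Illustrative command using the relay service defined below
docker compose -p relay -f docker-compose-relay.yml up relay -d --wait --wait-timeout 120
echo "exit code: $?"   # expected: 0 once the container becomes healthy within the window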

Steps To Reproduce

I have 3 config files like so:

docker-compose-redis.yml:

services:
  redis:
    image: ghcr.io/getsentry/image-mirror-library-redis:5.0-alpine
    healthcheck:
      test: redis-cli ping | grep PONG
      interval: 5s
      timeout: 5s
      retries: 3
    command:
      [
        'redis-server',
        '--appendonly',
        'yes',
        '--save',
        '60',
        '20',
        '--auto-aof-rewrite-percentage',
        '100',
        '--auto-aof-rewrite-min-size',
        '64mb',
      ]
    ports:
      - 127.0.0.1:6379:6379
    volumes:
      - redis-data:/data
    networks:
      - devservices
    extra_hosts:
      - host.docker.internal:host-gateway # Allow host.docker.internal to resolve to the host machine
    labels:
      - orchestrator=devservices
    restart: unless-stopped
networks:
  devservices:
    external: true
volumes:
  redis-data:

docker-compose-kafka.yml:

services:
  kafka:
    image: ghcr.io/getsentry/image-mirror-confluentinc-cp-kafka:7.5.0
    healthcheck:
      test: kafka-topics --bootstrap-server 127.0.0.1:9092 --list
      interval: 5s
      timeout: 5s
      retries: 3
    environment:
      # https://docs.confluent.io/platform/current/installation/docker/config-reference.html#cp-kakfa-example
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1001@127.0.0.1:29093
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_NODE_ID: 1001
      CLUSTER_ID: MkU3OEVBNTcwNTJENDM2Qk
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,INTERNAL://0.0.0.0:9093,EXTERNAL://0.0.0.0:9092,CONTROLLER://0.0.0.0:29093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://127.0.0.1:29092,INTERNAL://kafka:9093,EXTERNAL://127.0.0.1:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT,CONTROLLER:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_OFFSETS_TOPIC_NUM_PARTITIONS: 1
      KAFKA_LOG_RETENTION_HOURS: 24
      KAFKA_MESSAGE_MAX_BYTES: 50000000 # 50MB or bust
      KAFKA_MAX_REQUEST_SIZE: 50000000 # 50MB on requests apparently too
      CONFLUENT_SUPPORT_METRICS_ENABLE: false
      KAFKA_LOG4J_LOGGERS: kafka.cluster=WARN,kafka.controller=WARN,kafka.coordinator=WARN,kafka.log=WARN,kafka.server=WARN,state.change.logger=WARN
      KAFKA_LOG4J_ROOT_LOGLEVEL: WARN
      KAFKA_TOOLS_LOG4J_LOGLEVEL: WARN
    ulimits:
      nofile:
        soft: 4096
        hard: 4096
    ports:
      - 127.0.0.1:9092:9092
      - 127.0.0.1:9093:9093
    volumes:
      - kafka-data:/var/lib/kafka/data
    networks:
      - devservices
    extra_hosts:
      - host.docker.internal:host-gateway # Allow host.docker.internal to resolve to the host machine
    labels:
      - orchestrator=devservices
    restart: unless-stopped
networks:
  devservices:
    external: true
volumes:
  kafka-data:

docker-compose-relay.yml:

services:
  relay:
    image: us-central1-docker.pkg.dev/sentryio/relay/relay:nightly
    ports:
      - 127.0.0.1:7899:7899
    command: [run, --config, /etc/relay]
    healthcheck:
      test: curl -f http://127.0.0.1:7899/api/relay/healthcheck/live/
      interval: 5s
      timeout: 5s
      retries: 3
    volumes:
      - ./config/relay.yml:/etc/relay/config.yml
      - ./config/devservices-credentials.json:/etc/relay/credentials.json
    extra_hosts:
      - host.docker.internal:host-gateway
    networks:
      - devservices
    labels:
      - orchestrator=devservices
    restart: unless-stopped
networks:
  devservices:
    external: true
volumes:
  kafka-data:
  redis-data:

When I run

# Start up commands in parallel
docker compose -p redis -f docker-compose-redis.yml up redis -d --wait > redis_up.log 2>&1 &
redis_pid=$!
docker compose -p kafka -f docker-compose-kafka.yml up kafka -d --wait > kafka_up.log 2>&1 &
kafka_pid=$!
docker compose -p relay -f docker-compose-relay.yml up relay -d --wait > relay_up.log 2>&1 &
relay_pid=$!

# Wait for all up commands to complete
wait $kafka_pid $redis_pid $relay_pid

Relay sometimes fails to come up with the --wait flag, even though the container's Docker health status ends up healthy.

Logs:

Container relay-relay-1  Creating
 Container relay-relay-1  Created
 Container relay-relay-1  Starting
 Container relay-relay-1  Started
 Container relay-relay-1  Waiting
container relay-relay-1 is unhealthy
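
Inspecting the container right after the failure shows it is (or soon becomes) healthy, so as a stopgap I poll the health status myself instead of trusting the --wait exit code. A rough sketch (container name taken from the logs above):

# Check the health status the engine actually reports
docker inspect -f '{{.State.Health.Status}}' relay-relay-1

# Workaround sketch: in a script, poll until healthy or a deadline passes
deadline=$((SECONDS + 120))
until [ "$(docker inspect -f '{{.State.Health.Status}}' relay-relay-1)" = "healthy" ]; do
  if [ "$SECONDS" -ge "$deadline" ]; then
    echo "relay-relay-1 did not become healthy in time" >&2
    exit 1
  fi
  sleep 2
done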

Compose Version

2.29.7

Docker Environment

Client:
 Version:    27.2.0
 Context:    colima

Anything else?

Let me know if there is anything else I can add to help reproduce the issue. The contents of the relay configs can be found here:
https://github.com/getsentry/relay/tree/fe3f09fd3accd2361887dd678dbe034f25139fce/devservices/config
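
If it helps, the crash and health transitions can also be observed from the engine's event stream while reproducing; an illustrative command, using the container name from the logs above:

# Watch lifecycle and health events for the relay container
docker events \
  --filter 'container=relay-relay-1' \
  --filter 'event=die' \
  --filter 'event=restart' \
  --filter 'event=health_status'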

@ndeloof
Contributor

ndeloof commented Jan 10, 2025

Compose polls the engine API to check that the container reaches the "healthy" state. But if it detects a container crash, I would not expect it to silently ignore that and let the container restart. IMHO the bug you describe should have the opposite fix: Compose should always detect that a container crashed and then at least warn the user or stop.

haydentherapper added a commit to sigstore/rekor that referenced this issue May 14, 2025
…hy (#2473)

As noted in docker/compose#12424, compose
--wait doesn't seem to honor healthchecks with restart:always, when the
server crashes and restarts a few times and eventually becomes healthy.
This was happening with Rekor:

* MySQL was not yet healthy because the healthcheck wasn't working as
  expected. docker-library/mysql#930 (comment)
  suggested using 127.0.0.1 instead of localhost
* trillian-log-server was not yet healthy even when MySQL reported as
  healthy, causing trillian-log-server to crash and restart a few times.
  There was no healthcheck for either Trillian service because the image
  we're using is based on Distroless, which has no curl/wget.
* rekor-server tried to start up with an unhealthy trillian-log-server,
  and crashed. The healthcheck reported as unhealthy, and even though
  the server eventually became healthy because of the restart:always
  policy, the healthcheck reported the startup as unhealthy.

This change adds healthchecks to trillian-log-server and log-signer by
pulling the binaries out of the images and putting them into Debian
12 containers that include curl, so we can curl the /healthz endpoint.
This also fixes the MySQL healthcheck as noted above. Now, docker
compose up --wait properly waits for a healthy MySQL before starting
trillian-log-server, and a healthy Trillian before starting Rekor.

Also fix minor Dockerfile linting errors.

Signed-off-by: Hayden B <8418760+haydentherapper@users.noreply.github.com>