
[BUG] Docker Compose --wait not always honoring healthy healthcheck for container that crashed, then became healthy #12424

Open
hubertdeng123 opened this issue Dec 30, 2024 · 1 comment

@hubertdeng123

Description

Occasionally, when a container is unhealthy on start, restarts, and then becomes healthy, docker compose up -d --wait fails with an "unhealthy" error message. This happens when several docker compose up -d --wait commands are run in parallel and the service uses the restart: unless-stopped policy. Note that it is intermittent, not deterministic.

I would expect --wait to tolerate a container that is unhealthy or crashes on start, since the container eventually becomes healthy after restarting itself, as long as that happens within the timeout period.
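
For example (illustrative only, not part of my actual setup): with an explicit --wait-timeout I would expect a zero exit code as long as the container settles to healthy within that window, even if it crashed and restarted along the way:

# Illustrative command using the relay service defined below
docker compose -p relay -f docker-compose-relay.yml up relay -d --wait --wait-timeout 120
echo "exit code: $?"   # expected: 0 once the container becomes healthy within the window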

Steps To Reproduce

I have 3 config files like so:

docker-compose-redis.yml:

services:
  redis:
    image: ghcr.io/getsentry/image-mirror-library-redis:5.0-alpine
    healthcheck:
      test: redis-cli ping | grep PONG
      interval: 5s
      timeout: 5s
      retries: 3
    command:
      [
        'redis-server',
        '--appendonly',
        'yes',
        '--save',
        '60',
        '20',
        '--auto-aof-rewrite-percentage',
        '100',
        '--auto-aof-rewrite-min-size',
        '64mb',
      ]
    ports:
      - 127.0.0.1:6379:6379
    volumes:
      - redis-data:/data
    networks:
      - devservices
    extra_hosts:
      - host.docker.internal:host-gateway # Allow host.docker.internal to resolve to the host machine
    labels:
      - orchestrator=devservices
    restart: unless-stopped
networks:
  devservices:
    external: true
volumes:
  redis-data:

docker-compose-kafka.yml:

services:
  kafka:
    image: ghcr.io/getsentry/image-mirror-confluentinc-cp-kafka:7.5.0
    healthcheck:
      test: kafka-topics --bootstrap-server 127.0.0.1:9092 --list
      interval: 5s
      timeout: 5s
      retries: 3
    environment:
      # https://docs.confluent.io/platform/current/installation/docker/config-reference.html#cp-kakfa-example
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1001@127.0.0.1:29093
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_NODE_ID: 1001
      CLUSTER_ID: MkU3OEVBNTcwNTJENDM2Qk
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:29092,INTERNAL://0.0.0.0:9093,EXTERNAL://0.0.0.0:9092,CONTROLLER://0.0.0.0:29093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://127.0.0.1:29092,INTERNAL://kafka:9093,EXTERNAL://127.0.0.1:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT,CONTROLLER:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_OFFSETS_TOPIC_NUM_PARTITIONS: 1
      KAFKA_LOG_RETENTION_HOURS: 24
      KAFKA_MESSAGE_MAX_BYTES: 50000000 # 50MB or bust
      KAFKA_MAX_REQUEST_SIZE: 50000000 # 50MB on requests apparently too
      CONFLUENT_SUPPORT_METRICS_ENABLE: false
      KAFKA_LOG4J_LOGGERS: kafka.cluster=WARN,kafka.controller=WARN,kafka.coordinator=WARN,kafka.log=WARN,kafka.server=WARN,state.change.logger=WARN
      KAFKA_LOG4J_ROOT_LOGLEVEL: WARN
      KAFKA_TOOLS_LOG4J_LOGLEVEL: WARN
    ulimits:
      nofile:
        soft: 4096
        hard: 4096
    ports:
      - 127.0.0.1:9092:9092
      - 127.0.0.1:9093:9093
    volumes:
      - kafka-data:/var/lib/kafka/data
    networks:
      - devservices
    extra_hosts:
      - host.docker.internal:host-gateway # Allow host.docker.internal to resolve to the host machine
    labels:
      - orchestrator=devservices
    restart: unless-stopped
networks:
  devservices:
    external: true
volumes:
  kafka-data:

docker-compose-relay.yml:

services:
  relay:
    image: us-central1-docker.pkg.dev/sentryio/relay/relay:nightly
    ports:
      - 127.0.0.1:7899:7899
    command: [run, --config, /etc/relay]
    healthcheck:
      test: curl -f http://127.0.0.1:7899/api/relay/healthcheck/live/
      interval: 5s
      timeout: 5s
      retries: 3
    volumes:
      - ./config/relay.yml:/etc/relay/config.yml
      - ./config/devservices-credentials.json:/etc/relay/credentials.json
    extra_hosts:
      - host.docker.internal:host-gateway
    networks:
      - devservices
    labels:
      - orchestrator=devservices
    restart: unless-stopped
networks:
  devservices:
    external: true
volumes:
  kafka-data:
  redis-data:

When I run

# Start up commands in parallel
docker compose -p redis -f docker-compose-redis.yml up redis -d --wait > redis_up.log 2>&1 &
redis_pid=$!
docker compose -p kafka -f docker-compose-kafka.yml up kafka -d --wait > kafka_up.log 2>&1 &
kafka_pid=$!
docker compose -p relay -f docker-compose-relay.yml up relay -d --wait > relay_up.log 2>&1 &
relay_pid=$!

# Wait for all up commands to complete
wait $kafka_pid $redis_pid $relay_pid

Relay sometimes fails to come up with the --wait flag, even though the container's Docker health status ends up healthy.

Logs:

Container relay-relay-1  Creating
 Container relay-relay-1  Created
 Container relay-relay-1  Starting
 Container relay-relay-1  Started
 Container relay-relay-1  Waiting
container relay-relay-1 is unhealthy
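
Inspecting the container right after the failure shows it is (or soon becomes) healthy, so as a stopgap I poll the health status myself instead of trusting the --wait exit code. A rough sketch (container name taken from the logs above):

# Check the health status the engine actually reports
docker inspect -f '{{.State.Health.Status}}' relay-relay-1

# Workaround sketch: in a script, poll until healthy or a deadline passes
deadline=$((SECONDS + 120))
until [ "$(docker inspect -f '{{.State.Health.Status}}' relay-relay-1)" = "healthy" ]; do
  if [ "$SECONDS" -ge "$deadline" ]; then
    echo "relay-relay-1 did not become healthy in time" >&2
    exit 1
  fi
  sleep 2
done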

Compose Version

2.29.7

Docker Environment

Client:
 Version:    27.2.0
 Context:    colima

Anything else?

Let me know if there is anything else I can add to help reproduce the issue. The contents of the relay configs can be found here:
https://github.com/getsentry/relay/tree/fe3f09fd3accd2361887dd678dbe034f25139fce/devservices/config
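
If it helps, the crash and health transitions can also be observed from the engine's event stream while reproducing; an illustrative command, using the container name from the logs above:

# Watch lifecycle and health events for the relay container
docker events \
  --filter 'container=relay-relay-1' \
  --filter 'event=die' \
  --filter 'event=restart' \
  --filter 'event=health_status'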

@ndeloof
Contributor

ndeloof commented Jan 10, 2025

Compose polls the engine API to check that the container reaches the "healthy" state. But if it detects a container crash, I would not expect it to silently ignore that and let the container restart. IMHO the bug you describe should have the opposite fix: Compose should always detect that a container crashed and then at least warn the user or stop.

haydentherapper added a commit to sigstore/rekor that referenced this issue May 14, 2025
…hy (#2473)

As noted in docker/compose#12424, compose
--wait doesn't seem to honor healthchecks with restart:always, when the
server crashes and restarts a few times and eventually becomes healthy.
This was happening with Rekor:

* MySQL was not yet healthy because the healthcheck wasn't working as
  expected. docker-library/mysql#930 (comment)
  suggested using 127.0.0.1 instead of localhost
* trillian-log-server was not yet healthy even when MySQL reported as
  healthy, causing trillian-log-server to crash and restart a few times.
  There was no healthcheck for either Trillian service because the image
  we're using is based on Distroless, which has no curl/wget.
* rekor-server tried to start up with an unhealthy trillian-log-server,
  and crashed. The healthcheck reported as unhealthy, and even though
  the server eventually became healthy because of the restart:always
  policy, the healthcheck reported the startup as unhealthy.

This change adds healthchecks to trillian-log-server and log-signer by
pulling the binaries out of the images and putting them into Debian
12 containers that include curl, so we can curl the /healthz endpoint.
This also fixes the MySQL healthcheck as noted above. Now, docker
compose up --wait properly waits for a healthy MySQL before starting
trillian-log-server, and a healthy Trillian before starting Rekor.

Also fix minor Dockerfile linting errors.

Signed-off-by: Hayden B <8418760+haydentherapper@users.noreply.github.com>