
fcostaoliveira
Collaborator

This PR introduces connection error tracking and a configurable auto-reconnect mechanism to memtier_benchmark. The goal is to make benchmarks more resilient when running against clusters where TLS errors, timeouts, or dropped connections can otherwise abort runs prematurely.

Key Features

  • Connection error tracking

    • Tracks and reports total connection errors per client group.
    • Adds connection error metrics in both text and JSON outputs (Connection Errors and Connection Errors/sec).
  • New options for reconnection

    • --reconnect-on-error → enable automatic reconnection when errors occur.
    • --max-reconnect-attempts=N → limit reconnection retries (default: 3).
    • --reconnect-backoff-factor=F → exponential backoff multiplier (default: 2.0).
    • --connection-timeout=SECS → timeout before considering a connection attempt failed (default: 10s).
  • Client behavior

    • On BEV_EVENT_ERROR, BEV_EVENT_EOF, or timeout → update error stats and, if enabled, attempt reconnection.
    • Retries use exponential backoff; the backoff resets after a successful connect.
    • Connection error counts are included in live progress reports:
      [RUN #1  45%  900 secs] 45 threads 200 conns 12 conn errors ...
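The retry behavior described above can be sketched as a small model. This is an illustration only, not the memtier_benchmark implementation: the `ReconnectPolicy` class and its method names are hypothetical, and `base_delay` (the delay before the first retry) is an assumption, since the PR description does not state it. The defaults for attempts and factor mirror the documented option defaults.

```python
from dataclasses import dataclass


@dataclass
class ReconnectPolicy:
    """Hypothetical model of the reconnect options; not memtier_benchmark code."""
    max_attempts: int = 3        # mirrors the --max-reconnect-attempts default
    backoff_factor: float = 2.0  # mirrors the --reconnect-backoff-factor default
    base_delay: float = 1.0      # ASSUMPTION: first retry delay in seconds (not stated in the PR)
    attempts: int = 0            # retries made since the last successful connect

    def next_delay(self):
        """Return the delay before the next retry, or None once retries are exhausted."""
        if self.attempts >= self.max_attempts:
            return None
        delay = self.base_delay * (self.backoff_factor ** self.attempts)
        self.attempts += 1
        return delay

    def on_connected(self):
        """A successful connect resets the backoff to its initial state."""
        self.attempts = 0


policy = ReconnectPolicy()
delays = [policy.next_delay() for _ in range(4)]
print(delays)  # [1.0, 2.0, 4.0, None]
```

Under these assumptions, three retries are spaced 1, 2, and 4 seconds apart, and the fourth call reports that retries are exhausted. Calling `on_connected()` restarts the schedule from the base delay.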
      

Motivation

Large-scale TLS benchmarks (especially with pipelines and high concurrency) often hit intermittent "EOF" or "unexpected EOF while reading" errors. Previously, these failures could skew results or abort runs. With this change:

  • We can observe connection reliability explicitly via metrics.
  • Benchmarks can optionally auto-recover from transient errors.

Open Questions / Discussion

  • Should --reconnect-on-error be enabled by default in CI/benchmarks, or stay opt-in?
  • Current design tracks errors globally, not per operation. Would per-op error tagging be useful?
  • Backoff strategy: currently exponential. Should we consider jitter/randomization?
  • JSON schema: the metrics are added at the top level. Is this the right place, or should they live under a "connections" block?
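To make the jitter question concrete, here is a minimal sketch of one common variant, often called "full jitter", in which each delay is drawn uniformly between zero and the exponential ceiling. The function name, the `base_delay` parameter, and the choice of full jitter are assumptions for discussion; nothing here is taken from the PR's code.

```python
import random


def jittered_delay(attempt, base_delay=1.0, factor=2.0, rng=random):
    """Full jitter: pick uniformly between 0 and the exponential ceiling.

    base_delay is an ASSUMPTION (the PR does not state an initial delay).
    """
    ceiling = base_delay * (factor ** attempt)
    return rng.uniform(0.0, ceiling)


rng = random.Random(42)  # seeded only so the example is repeatable
samples = [jittered_delay(n, rng=rng) for n in range(3)]
for n, delay in enumerate(samples):
    print(f"attempt {n}: wait {delay:.3f}s (ceiling {1.0 * 2.0 ** n:.1f}s)")
```

Randomizing the delay would spread out reconnect attempts from many clients that failed at the same moment, at the cost of less predictable retry timing in benchmark logs.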

Next Steps

  • Gather feedback on CLI/JSON interface.
  • Validate behavior under different cluster conditions (timeouts, TLS resets, network hiccups).
  • Decide on defaults (opt-in vs opt-out).
  • Add tests for reconnection logic and error accounting.

@kamran-redis

Any reason --reconnect-on-error should not be enabled by default other than backward compatibility?
