WIP/Open for Feedback: Connection Error Tracking & Auto-Reconnect #320

fcostaoliveira · 2025-09-24T16:25:17Z

This PR introduces connection error tracking and a configurable auto-reconnect mechanism to memtier_benchmark. The goal is to make benchmarks more resilient when running against clusters where TLS errors, timeouts, or dropped connections can otherwise abort runs prematurely.

Key Features

Connection error tracking
- Tracks and reports total connection errors per client group.
- Adds connection error metrics to both text and JSON outputs (Connection Errors and Connection Errors/sec).

New options for reconnection
- --reconnect-on-error → enable automatic reconnection when errors occur.
- --max-reconnect-attempts=N → limit reconnection retries (default: 3).
- --reconnect-backoff-factor=F → exponential backoff multiplier (default: 2.0).
- --connection-timeout=SECS → timeout before considering a connection attempt failed (default: 10s).

Client behavior
- On BEV_EVENT_ERROR, BEV_EVENT_EOF, or timeout → update error stats and attempt reconnection (if enabled).
- Retries use exponential backoff; the backoff resets after a successful connect.
- Connection error counts are included in live progress reports:
```
[RUN #1  45%  900 secs] 45 threads 200 conns 12 conn errors ...
```
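
To make the retry behavior above concrete, here is a minimal Python sketch of an exponential backoff schedule with a reset on successful connect. It is an illustration only, not the code in this PR: the `ReconnectPolicy` class, its method names, and the 1-second base delay are assumptions (the PR description does not state the initial delay). Only the defaults of 3 attempts and a 2.0 multiplier come from the option list.

```python
from dataclasses import dataclass


@dataclass
class ReconnectPolicy:
    """Hypothetical model of the retry settings described above."""
    max_attempts: int = 3          # mirrors the --max-reconnect-attempts default
    backoff_factor: float = 2.0    # mirrors the --reconnect-backoff-factor default
    base_delay_secs: float = 1.0   # ASSUMPTION: initial delay is not stated in the PR
    attempts: int = 0              # failed attempts since the last successful connect

    def next_delay(self):
        """Return the delay before the next attempt, or None when retries are exhausted."""
        if self.attempts >= self.max_attempts:
            return None
        delay = self.base_delay_secs * (self.backoff_factor ** self.attempts)
        self.attempts += 1
        return delay

    def on_connected(self):
        """Reset the backoff after a successful connect."""
        self.attempts = 0


policy = ReconnectPolicy()
delays = [policy.next_delay() for _ in range(4)]
print(delays)  # [1.0, 2.0, 4.0, None]
```

Under these assumptions, a connection that keeps failing waits 1s, 2s, and 4s, then gives up; calling `on_connected()` after a successful connect restarts the schedule from the base delay.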

Motivation

Large-scale TLS benchmarks (especially with pipelining and high concurrency) often hit intermittent "EOF" or "unexpected EOF while reading" errors. Previously, these failures could skew results or abort runs. With this change:

- Connection reliability can be observed explicitly via metrics.
- Benchmarks can optionally auto-recover from transient errors.

Open Questions / Discussion

- Should --reconnect-on-error be enabled by default in CI/benchmarks, or stay opt-in?
- The current design tracks errors globally, not per operation. Would per-op error tagging be useful?
- Backoff strategy: currently exponential. Should we consider jitter/randomization?
- JSON schema: metrics are added at the top level. Is this the right place, or should they live under a "connections" block?
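
For the jitter question, one common option is "full jitter": instead of sleeping for the exact exponential delay, each retry sleeps for a random time between zero and that delay, so many clients that fail together do not all reconnect at the same instant. The sketch below is in Python and is only an illustration of that idea; the function name `jittered_delay` and the 1-second base delay are assumptions, not part of this PR.

```python
import random


def jittered_delay(attempt, base_delay_secs=1.0, backoff_factor=2.0, rng=random):
    """Pick a delay uniformly in [0, base * factor**attempt] ("full jitter").

    `attempt` is zero-based; `base_delay_secs` is an assumed value.
    """
    ceiling = base_delay_secs * (backoff_factor ** attempt)
    return rng.uniform(0.0, ceiling)


rng = random.Random(42)  # seeded only so the example is repeatable
for attempt in range(3):
    # The upper bound doubles each attempt (1s, 2s, 4s with the defaults above).
    print(attempt, round(jittered_delay(attempt, rng=rng), 3))
```

The trade-off is that individual retries may fire sooner than with plain exponential backoff, while the aggregate reconnect load on the server is spread out over time.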

Next Steps

- Gather feedback on CLI/JSON interface.
- Validate behavior under different cluster conditions (timeouts, TLS resets, network hiccups).
- Decide on defaults (opt-in vs opt-out).
- Add tests for reconnection logic and error accounting.

kamran-redis · 2025-09-26T13:01:25Z

Any reason --reconnect-on-error should not be enabled by default other than backward compatibility?

fcostaoliveira added 2 commits September 24, 2025 16:25

wip on control c

07fec17

WIP: Connection Error Tracking & Auto-Reconnect in Memtier Benchmark

4630388

fcostaoliveira requested a review from paulorsousa September 24, 2025 16:25

fcostaoliveira added 9 commits October 2, 2025 13:07

live --csv update and better error logging

1681142

added jitter on conn creation/reconnect

8aacd8f

set and get percentiles on per second latencies

d7cccbf

Added Totals to csv

d5a4cca

Merge branch 'control.c' into error.reconnect

cb1f00c

include timestamp in csv file

1740e42

Added try/catch mechanism

665760a

Ignore SIGPIPE to prevent exit when writing to closed TLS sockets

06c6d1e

not asserting on wrong condition on rate-limiting

f632395
