Inconsistent Envoy Gateway replica state #7115

@NeonSludge

Description:
It appears that Envoy Gateway replicas can sometimes diverge in their observed state of managed resources. As a result, Envoy Proxy fleet members receive inconsistent configuration that seemingly depends on which Envoy Gateway replica each proxy is connected to. The issue appears when an Envoy Proxy reconnects to the Envoy Gateway service for any reason (pod restarts, connection resets due to network issues, etc.). If the proxy happens to connect to an EG replica that is in an inconsistent state, it may apply unexpected changes: removing or adding a cluster, using an out-of-date upstream address, etc.

We're using Envoy Gateway in an environment where it basically acts as an automatic port forwarder for JupyterHub single-user servers. Each JupyterHub user gets a personal Gateway with a varying number of listeners and some TCPRoutes associated with it. All of these are automatically created when the user's JupyterHub server is spawned and removed when the server stops. This means that there is active Gateway API resource churn at times. The GatewayClass that is being used by all of these Gateways has the mergeGateways option enabled.
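For illustration, the per-user resource set described above could be generated roughly like this. This is only a sketch: the GatewayClass name, namespace, resource naming scheme and backend Service name are all hypothetical and do not come from our actual setup. It emits one Gateway with one TCP listener per port, plus one TCPRoute per listener.

```python
import json

# Hypothetical names, used for illustration only.
GATEWAY_CLASS = "merged-eg"
NAMESPACE = "jupyterhub"


def build_user_resources(user, ports):
    """Return one Gateway plus one TCPRoute per port for a single user."""
    gateway = {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "Gateway",
        "metadata": {"name": f"gw-{user}", "namespace": NAMESPACE},
        "spec": {
            "gatewayClassName": GATEWAY_CLASS,
            # One TCP listener per forwarded port.
            "listeners": [
                {"name": f"tcp-{port}", "protocol": "TCP", "port": port}
                for port in ports
            ],
        },
    }
    routes = [
        {
            "apiVersion": "gateway.networking.k8s.io/v1alpha2",
            "kind": "TCPRoute",
            "metadata": {"name": f"route-{user}-{port}", "namespace": NAMESPACE},
            "spec": {
                # Attach the route to the matching listener of the user's Gateway.
                "parentRefs": [{"name": f"gw-{user}", "sectionName": f"tcp-{port}"}],
                # Hypothetical backend Service for the single-user server.
                "rules": [{"backendRefs": [{"name": f"jupyter-{user}", "port": port}]}],
            },
        }
        for port in ports
    ]
    return [gateway] + routes


def as_list_manifest(resources):
    """Wrap resources in a v1 List so they can be applied in one call."""
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": resources}, indent=2)
```

The output of `as_list_manifest(build_user_resources("alice", [9001, 9002]))` can be piped to `kubectl apply -f -`. Since the Gateways are merged, listener ports have to be unique across all users.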

We've tried significantly lowering the cacheSyncPeriod but it had no effect. Performing a rolling restart of the Envoy Gateway deployment always fixes the issue, though.

Repro steps:

  1. Install Envoy Gateway with several replicas (2-3 should work) and leader election enabled.
  2. Create a GatewayClass and configure it to use the merged gateways mode (mergeGateways: true).
  3. Create some resource churn: start randomly creating, modifying and deleting Gateways and TCPRoutes.
  4. Monitor the envoy_cluster_manager_active_clusters metric exported by Envoy Proxies. It should reliably start showing different values for different Proxies after the issue has been successfully reproduced.
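The check in step 4 can also be scripted. Below is a minimal sketch that assumes you have already fetched the Prometheus text output from each proxy (for example from the Envoy admin `/stats/prometheus` endpoint) and just need to compare the values. The function names and the proxy names in the usage note are made up for this example.

```python
import re
from collections import defaultdict
from typing import Optional

METRIC = "envoy_cluster_manager_active_clusters"

# Matches "<metric> <value>" or "<metric>{labels} <value>" at line start.
SAMPLE = re.compile(r"^" + re.escape(METRIC) + r"(?:\{[^}]*\})?\s+(\S+)")


def active_clusters(metrics_text: str) -> Optional[float]:
    """Extract the gauge value from Prometheus text exposition output."""
    for line in metrics_text.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        match = SAMPLE.match(line)
        if match:
            return float(match.group(1))
    return None


def find_divergence(samples: dict) -> dict:
    """Group proxies by metric value; return {} when all proxies agree."""
    groups = defaultdict(list)
    for proxy, value in samples.items():
        groups[value].append(proxy)
    return dict(groups) if len(groups) > 1 else {}
```

For example, `find_divergence({"proxy-a": 12, "proxy-b": 11, "proxy-c": 12})` returns two groups, which is what we see once the issue has been reproduced. A short-lived difference right after a resource change is expected, so it makes sense to only alert when the divergence persists.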

Environment:

  • A self-hosted Kubernetes 1.30.14 cluster
  • Envoy Gateway 1.5.1
  • XDSNameSchemeV2 enabled
  • Backend and EnvoyPatchPolicy extension APIs enabled in the config, not currently used

Logs:
The issue always happens after an Envoy Proxy reconnects to Envoy Gateway.
The reconnection itself is indicated by log entries like this:

[2025-09-29 20:53:21.028][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:188] DeltaAggregatedResources gRPC config stream to xds_cluster closed: 13, upstream_reset_after_response_started{connection_termination}

We have not observed any other warnings or errors at these moments.
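To correlate reconnects with the moments the metric diverges, the stream-closed entries can be pulled out of the proxy logs. A small sketch follows; the regex only relies on the message part of the entry above, not on the log prefix, and the function name is hypothetical.

```python
import re
from typing import Optional

# Matches the message part of the entry, e.g.
# "DeltaAggregatedResources gRPC config stream to xds_cluster closed: 13, <detail>"
STREAM_CLOSED = re.compile(
    r"(?P<stream>\w+) gRPC config stream to (?P<cluster>\S+) closed: "
    r"(?P<code>\d+), (?P<detail>.*)$"
)


def parse_stream_closed(line: str) -> Optional[dict]:
    """Return the parsed fields of a stream-closed log entry, or None."""
    match = STREAM_CLOSED.search(line)
    if match is None:
        return None
    return {
        "stream": match.group("stream"),
        "cluster": match.group("cluster"),
        "code": int(match.group("code")),  # gRPC status code; 13 is INTERNAL
        "detail": match.group("detail"),
    }
```

Feeding each proxy log line through `parse_stream_closed` and keeping the non-`None` results gives a list of reconnect events that can be lined up against the metric.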

Labels: kind/bug
