A Refined Vision
We need to know what we're aiming at, and users need to know what they can expect from using RIG in their infrastructure. Given new requirements or change requests, it should be easy to decide whether or not they match the goals of the project.
The subject of this issue is to formulate this vision, but also to track the progress of related tasks (spoiler: we're not quite there yet).
The tasks are defined inline in the following vision statement, which will become a document on the website eventually. Feedback welcome.
- Update README with the vision statement
- Update repository tagline with the vision statement
- Update the website with the text below
Reactive Interaction Gateway a.k.a. RIG
RIG enables event-driven frontends in a secure, reliable and scalable way.
What does that mean, exactly? Well, let's break it down.
Event-driven frontends
Frontends react to what happens on the backend. Upon receiving business events, they can change their state, adapt their UI or use HTTP requests to fetch new data.
This is not just Server-Sent Events versus polling for data; it also decouples the frontends from the backends, as they describe what kind of events they're interested in, but they don't need to know where and when to fetch them.
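That decoupling can be sketched with a minimal, in-memory event router (illustrative only; the names and the API are made up and are not RIG's): frontends register the event types they care about, and the gateway delivers matching events without the frontends knowing where or when they were produced.

```python
# Minimal sketch of type-based event subscriptions (illustrative only;
# not RIG's actual API). Frontends register interest by event type and
# receive matching events without knowing the producers.
from collections import defaultdict


class Gateway:
    def __init__(self):
        self.subscribers = defaultdict(list)  # event type -> callbacks

    def subscribe(self, event_type, callback):
        self.subscribers[event_type].append(callback)

    def publish(self, event_type, payload):
        # Deliver to every subscriber of this type; unknown types go nowhere.
        for cb in self.subscribers.get(event_type, []):
            cb(payload)


received = []
gw = Gateway()
gw.subscribe("account.credited", received.append)
gw.publish("account.credited", {"amount": 100})  # delivered to the frontend
gw.publish("account.debited", {"amount": 50})    # no subscriber, discarded
```

The frontend never names a backend service or an endpoint; it only names the event type it wants.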
Secure
RIG employs process supervisors, which makes it very unlikely that a node fails as a whole. For example, a fault in the Kafka connection might cause the Kafka subsystem to fail, but the rest of the system would be unaffected. The respective supervisor would restart the Kafka process automatically; as soon as the connection is re-established, the node is fully operational again.
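The supervision idea can be sketched as follows (a toy illustration; BEAM/OTP supervisors are far more sophisticated, and `FlakyKafkaConnection` is a made-up stand-in): a failing child is restarted until it recovers, while the rest of the node keeps running.

```python
# Toy sketch of the supervision idea (illustrative; BEAM/OTP supervisors
# are far more capable). A child that crashes is simply restarted.
class FlakyKafkaConnection:
    """Made-up stand-in: fails twice, then connects."""

    def __init__(self):
        self.attempts = 0

    def run(self):
        self.attempts += 1
        if self.attempts < 3:
            raise ConnectionError("broker unreachable")
        return "connected"


def supervise(child, max_restarts=5):
    for _ in range(max_restarts):
        try:
            return child.run()
        except ConnectionError:
            continue  # restart the child; other subsystems are unaffected
    raise RuntimeError("restart limit exceeded")


status = supervise(FlakyKafkaConnection())  # succeeds on the third attempt
```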
Running on the BEAM (i.e., the Erlang runtime machine), there is tooling available for introspection, code updates and instrumentation (e.g., tracing) at runtime, in production.
- Describe, and link to, the tooling mentioned here.
Verbose logging and a Prometheus metrics endpoint also help with monitoring a node in production.
- Describe how to change log levels (using env vars and at runtime).
To limit a cluster's resource consumption, an administrator can configure a number of soft limits:
- limit on the size of the JWT blacklist,
- limit on the number of connections per jti (#258),
- limit on the number of subscriptions per jti (#256),
- limit on the rate of new connections (#257),
- limit on the size of the event buffer (#236, #253).
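As an illustration of how one such soft limit could work, here is a token-bucket sketch for capping the rate of new connections (a generic technique; RIG's actual limiter may be implemented differently, and the names here are made up):

```python
# Generic token-bucket rate limiter (illustrative; not RIG's implementation).
# Tokens refill at a fixed rate; each new connection consumes one token.
import time


class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # soft limit reached: reject the connection attempt


bucket = TokenBucket(rate=10, capacity=3)
results = [bucket.allow() for _ in range(5)]  # rapid burst of 5 attempts
```

With a burst capacity of 3, the first three rapid attempts pass and the rest are rejected until tokens refill.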
Plugins are sandboxed, as they are either
- external (see AUTHORIZATION_CHECK and SUBMISSION_CHECK), or
- run in a WebAssembly sandbox environment.
Reliable
Multiple RIG nodes form a self-healing cluster, as described in the following.
Scenario 1: A RIG node goes offline and leaves the cluster
After a node has failed, the frontends that were connected to that node reconnect to another node. They will be notified upon reconnect whether their subscriptions are still available and whether events have been lost (that depends on whether a frontend's session was hosted on the failed node). If subscriptions need to be set up again or missed data has to be requested, the frontend can act accordingly.
Any Kafka or Kinesis partitions the node had handled before its failure are reassigned to other nodes in the cluster automatically. As a consequence of how Kafka/Kinesis consumer groups work, messages may be delivered more than once.
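The reconnect behavior described above can be sketched with a bounded event buffer (illustrative; the names are made up and RIG's actual mechanism may differ): a client that reconnects with the id of the last event it saw either gets the missed events replayed, or learns that events were lost because the buffer has already evicted them.

```python
# Sketch of replaying missed events after a reconnect, based on a bounded
# buffer and a last-event-id (illustrative; not RIG's implementation).
from collections import deque


class EventBuffer:
    def __init__(self, size):
        self.events = deque(maxlen=size)  # (id, payload); oldest evicted first

    def append(self, event_id, payload):
        self.events.append((event_id, payload))

    def replay_since(self, last_event_id):
        """Events newer than last_event_id, or None if they were evicted."""
        ids = [eid for eid, _ in self.events]
        if last_event_id not in ids:
            return None  # gap detected: the client has lost events
        idx = ids.index(last_event_id)
        return list(self.events)[idx + 1:]


buf = EventBuffer(size=3)
for i in range(1, 6):          # ids 1..5; the buffer keeps only 3, 4, 5
    buf.append(i, {"seq": i})

missed = buf.replay_since(4)   # client last saw id 4 -> gets id 5 replayed
lost = buf.replay_since(1)     # id 1 was evicted -> events have been lost
```

A client that gets `None` back knows it must refetch its state via a regular request instead of relying on the event stream.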
Scenario 2: A RIG node comes back online and (re-)joins the cluster
When a node (re-)joins the cluster, it replicates the subscriptions table as well as the JWT blacklist from other nodes. It might get assigned Kafka or Kinesis partitions and starts to accept frontend connections.
Scalable
RIG is designed to be scalable with respect to two key metrics: the number of connected frontends and the number of backend events it can process without generating a noticeable* delay in transmission.
*For our tests we assume 1 second as the highest tolerable delay between RIG's consumption of a message from Kafka and the reception of that message on the frontend via an established SSE connection.
Scalability with respect to the number of connected frontends
For handling large numbers of frontend connections per node we rely on the light-weight processes ("actors") that BEAM, the Erlang runtime, provides. Each of these processes has its own lifecycle (and even its own garbage collector) and is scheduled by BEAM across all available CPU cores. They are also memory-efficient and designed to be created and destroyed very quickly. These properties make them ideal for handling the long-lived TCP connections in RIG, providing the isolation and runtime efficiency those connections need.
- Benchmarks
Scalability with respect to the number of processed backend events
Designed as the event-handling backbone for "everything UI" in a microservice architecture, RIG must be able to process large amounts of data. The trick: RIG immediately discards events that no users are subscribed to.
For example, in a banking system, the backend might generate a million events per second that need to be consumed by RIG. At the same time, only a few thousand users are online at any point in time, which allows RIG to drop most of the messages right at the consuming node.
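That filtering step can be sketched as follows (illustrative names only; RIG's actual routing is more involved): an event whose type matches no online user's subscriptions is discarded immediately at the consuming node, so only the small subscribed fraction is processed further.

```python
# Sketch of discarding events with no subscribers right at the consuming
# node (illustrative; not RIG's implementation).
subscriptions = {"user-42": {"payment.settled"}}  # online users -> event types


def interested_users(event_type):
    return [u for u, types in subscriptions.items() if event_type in types]


def consume(event_type, payload):
    targets = interested_users(event_type)
    if not targets:
        return None  # discard immediately: nobody is listening
    return {"deliver_to": targets, "payload": payload}


dropped = consume("audit.trail", {"id": 1})        # no subscribers -> dropped
delivered = consume("payment.settled", {"id": 2})  # one subscriber -> routed
```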
- Benchmarks