
Commit 861addd

DEVOPS-2708-added-monitoring-and-alerting-guide (#61)
1 parent f5aaef2 commit 861addd

2 files changed: 138 additions and 1 deletion

README.md

Lines changed: 6 additions & 1 deletion
@@ -100,10 +100,15 @@ Select your preferred cloud provider for deployment:
 - **[Amazon EKS](docs/installation/cloud/eks.md)**
 - **[Google Kubernetes Engine (GKE)](docs/installation/cloud/gke.md)**
 - **[OpenShift](docs/installation/cloud/openshift.md)**
+
+#### **12. Setup Monitoring and Alerting system**
+
+Review the [Monitoring and Alerting guide](docs/monitoring_and_alerting_guide.md).
+
 ## Known Issues and Limitations
 
 ## Release Notes
 
 Check the [versions mapping documentation](docs/installation/versions_mapping.md) for version compatibility.
 Check the [release strategy](docs/release_strategy.md) for the meaning of major minor patch.
-See the [changelog](CHANGELOG.md) for a detailed history of changes and improvements in each release.
+See the [changelog](CHANGELOG.md) for a detailed history of changes and improvements in each release.

docs/monitoring_and_alerting_guide.md

Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
# Monitoring and Alerting Guide for Lightrun Application

This guide lists the key metrics to monitor when the Lightrun application is deployed on Kubernetes with a Helm chart. It identifies the most critical metrics; customers are responsible for implementing the monitoring with their preferred tools and for setting up appropriate alerts.

## 1. Bird's-eye View of System Health

**1.1 Nodes Status**
Monitor the overall health and performance of the nodes in the Kubernetes cluster. Keep an eye on:

- CPU usage
- Memory utilization
- Disk space consumption

**1.2 Application URL Availability**
Ensure the main application URL is reachable and returns a successful HTTP status code (200).
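
The availability check in 1.2 can be sketched as a small probe. This is a minimal illustration using only the Python standard library, not a replacement for a monitoring tool; the function name `probe` and the injectable `opener` parameter are assumptions made for this example.

```python
import urllib.error
import urllib.request


def probe(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return (is_up, status) for a single GET request against `url`.

    `opener` is injectable (hypothetical parameter) so the check can be
    exercised without network access.
    """
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the status code is still useful.
        return False, err.code
    except OSError:
        # DNS failure, refused connection, or timeout: no status available.
        return False, None
    # The guide treats only HTTP 200 as "available".
    return status == 200, status
```

Calling `probe("https://lightrun.example.com")` (a placeholder hostname) on a schedule and alerting when `is_up` is `False` for several consecutive runs gives a basic external availability signal.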

## 2. Kubernetes Cluster Monitoring

**2.1 Deployment Status**
Monitor the status of deployments to ensure the desired number of pod replicas is running and available.

**2.2 Pod Restart Count**
Track the number of restarts per pod. Frequent restarts may indicate instability, errors, or crashes.
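
Restart counters reported by Kubernetes are cumulative, so "frequent restarts" is usually evaluated as the increase over a time window. The sketch below assumes you already collect `(timestamp, restart_count)` samples per pod; the function name and the sample layout are hypothetical.

```python
def restarts_in_window(samples, window_seconds, now):
    """Return how many restarts occurred in the last `window_seconds`.

    `samples` is a list of (unix_timestamp, cumulative_restart_count)
    tuples sorted by timestamp (assumed layout for this example).
    """
    recent = [
        count
        for ts, count in samples
        if now - window_seconds <= ts <= now
    ]
    if len(recent) < 2:
        # Not enough data points in the window to measure an increase.
        return 0
    # Clamp at zero in case the counter reset (for example, pod recreated).
    return max(0, recent[-1] - recent[0])


samples = [(0, 3), (60, 3), (120, 4), (180, 6)]
print(restarts_in_window(samples, window_seconds=120, now=180))  # 3
```

An alert rule such as "more than one restart in the window" then reduces to `restarts_in_window(...) > 1`.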

**2.3 Kubernetes Events (Warning/Error)**
Monitor Kubernetes events to detect warnings or errors that may affect the cluster's stability:

- Track events with type `Warning` for potential issues like unhealthy containers or resource constraints.
- Kubernetes currently defines only the `Normal` and `Warning` event types, so critical failures also surface as `Warning` events. Prioritize them by reason (for example `FailedScheduling`, `BackOff`, or `FailedMount`) to find the ones requiring immediate attention.

## 3. Resource Usage Monitoring

**3.1 Pod CPU Usage**
Track CPU usage across all pods. Sustained high CPU usage can signal bottlenecks or misconfigured resource limits.

**3.2 Pod Memory Resources**
Monitor memory usage across all pods to prevent out-of-memory issues and ensure optimal performance.

**3.3 Disk Space Consumption**
Monitor disk space consumption on nodes and persistent volumes to prevent storage exhaustion.

## 4. Network Monitoring

**4.1 Incoming Network Traffic (RX)**
Monitor incoming network data volume per pod to identify spikes or anomalies.

**4.2 Outgoing Network Traffic (TX)**
Track outgoing network data volume per pod to detect unexpected surges or drops.

## 5. Ingress Monitoring

**5.1 Active Connections**
Monitor total active connections to ensure the application can handle incoming traffic.

**5.2 Requests per Second**
Track request rates to ensure they stay within acceptable limits and do not cause performance issues.

**5.3 Rate-limited Requests (HTTP 429)**
Monitor requests that exceed rate limits. A high number of 429 responses may indicate misconfiguration or excessive load.

**5.4 Application Response Time**
Measure the response time of the main application endpoint to identify latency issues.

## 6. Lightrun Backend Application Metrics

The Lightrun backend exposes custom Prometheus metrics via the `/management/prometheus` endpoint. The following metrics can be monitored:

**6.1 Total Connected Agents per Runtime (`connected_agents_per_runtime`)**
Track the number of agents connected per runtime to ensure runtimes are reporting data correctly.

**6.2 JVM Memory Utilization (`jvm_memory_used_bytes`)**
Monitor JVM memory usage, in particular the heap, to prevent out-of-memory errors.
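
To illustrate how the two metrics above might be read, the following sketch parses a Prometheus text-format payload such as the one served by `/management/prometheus`. The sample payload is invented, and the label names (`runtime`, `area`, `id`) are assumptions: `area="heap"` follows common Micrometer conventions and may differ in your deployment. In practice a Prometheus server scrapes the endpoint for you; this only shows the shape of the data.

```python
import re

# Matches: metric_name{optional="labels"} value
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Invented payload for illustration; label names are assumptions.
EXAMPLE = """\
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="G1 Eden Space"} 1.0e8
jvm_memory_used_bytes{area="heap",id="G1 Old Gen"} 2.5e8
jvm_memory_used_bytes{area="nonheap",id="Metaspace"} 9.0e7
# TYPE connected_agents_per_runtime gauge
connected_agents_per_runtime{runtime="java"} 12
connected_agents_per_runtime{runtime="python"} 3
"""


def parse_metrics(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            yield name, dict(_LABEL.findall(raw_labels or "")), float(value)


def summarize(text):
    """Return (total heap bytes used, agents per runtime)."""
    heap_bytes = 0.0
    agents = {}
    for name, labels, value in parse_metrics(text):
        if name == "jvm_memory_used_bytes" and labels.get("area") == "heap":
            heap_bytes += value
        elif name == "connected_agents_per_runtime":
            agents[labels.get("runtime", "unknown")] = value
    return heap_bytes, agents


print(summarize(EXAMPLE))
```

Summing the heap pools gives a single figure to compare against the configured maximum heap size, and a runtime that drops to zero connected agents stands out immediately.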

## 7. MySQL Database Monitoring

**7.1 CPU Utilization**
Monitor database CPU usage to catch saturation before it degrades query performance.

**7.2 RAM Utilization**
Track memory usage to avoid performance bottlenecks.

**7.3 Active DB Connections**
Monitor database connections to prevent overwhelming the database with simultaneous requests.

**7.4 File System Usage**
Track disk space usage on the database volume to prevent storage exhaustion.

**7.5 Network Throughput**
Monitor network traffic to and from MySQL.

**7.6 Query Latency**
Monitor query execution times to identify performance bottlenecks:

- Track the average query execution time across all operations
- Identify the slowest queries by type or table
- Track the frequency of long-running queries (> 500 ms), which may indicate blocking issues
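
The query latency checks above come down to simple arithmetic over recorded durations. A minimal sketch, assuming durations are already available in milliseconds (for example from a slow query log or an exporter); the function name and the returned keys are hypothetical:

```python
def slow_query_stats(durations_ms, threshold_ms=500.0):
    """Summarize query durations: average time and share of slow queries."""
    count = len(durations_ms)
    if count == 0:
        return {"count": 0, "average_ms": 0.0, "slow_count": 0, "slow_ratio": 0.0}
    # A query counts as slow when it exceeds the threshold (500 ms in the guide).
    slow_count = sum(1 for d in durations_ms if d > threshold_ms)
    return {
        "count": count,
        "average_ms": sum(durations_ms) / count,
        "slow_count": slow_count,
        "slow_ratio": slow_count / count,
    }


print(slow_query_stats([10, 20, 600, 1200, 30]))
```

Tracking `slow_ratio` over time is often more informative than the average alone, because a few very slow queries can hide behind a healthy-looking mean.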

## 8. Redis Cache Monitoring

**8.1 CPU Utilization**
Observe CPU usage to ensure efficient processing of cache operations.

**8.2 RAM Utilization**
Track memory usage to avoid performance bottlenecks.

**8.3 Query Latency**
Monitor command execution times for cache operations:

- Track the average response time for common commands (GET, SET, etc.)
- Monitor latency percentiles (p50, p95, p99) for different operation types
- Identify the slowest commands, which may indicate memory pressure or disk I/O issues
- Track the frequency of delayed responses (> 100 ms) during high-load periods
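
For readers unfamiliar with latency percentiles, the sketch below computes p50, p95, and p99 with the nearest-rank method. This is one of several common percentile definitions, and monitoring backends may interpolate differently; the function name is an assumption for this example.

```python
import math


def percentile(samples, p):
    """Return the p-th percentile of `samples` using the nearest-rank method."""
    if not samples:
        raise ValueError("percentile() requires at least one sample")
    ordered = sorted(samples)
    # Nearest rank: smallest value with at least p% of samples at or below it.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = list(range(1, 101))  # synthetic data: 1 ms .. 100 ms
for p in (50, 95, 99):
    print(f"p{p} = {percentile(latencies_ms, p)} ms")
```

The higher percentiles show tail latency, which the average hides: a p99 far above the p50 means a small share of commands is much slower than the rest.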

## 9. Backend HTTP Statistics

**9.1 HTTP Hits by Status Code**
Monitor the distribution of HTTP status codes returned by backend requests.
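
As an illustration of what a status code distribution looks like, the sketch below groups codes by class and derives the 5xx and 429 shares. It assumes the status codes are already available as a list of integers (for example parsed from access logs); the function name and result keys are hypothetical.

```python
from collections import Counter


def status_summary(status_codes):
    """Group HTTP status codes by class and compute 5xx and 429 ratios."""
    total = len(status_codes)
    by_class = Counter(f"{code // 100}xx" for code in status_codes)
    rate_limited = sum(1 for code in status_codes if code == 429)
    return {
        "by_class": dict(by_class),
        "error_5xx_ratio": by_class["5xx"] / total if total else 0.0,
        "rate_limited_ratio": rate_limited / total if total else 0.0,
    }


codes = [200] * 90 + [429] * 4 + [500] * 5 + [503]
print(status_summary(codes))
```

Ratios are usually a better alerting input than raw counts, since they stay comparable as traffic volume changes.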

# Alerting Recommendations

Set up alerts based on the following conditions to enable proactive issue resolution:

- **Pod restart detected** - Any pod restarting more than once within a short period.
- **High CPU/Memory usage** - Sustained usage above 85% of capacity.
- **High rate of 5xx errors** - Indicative of backend failures or instability.
- **High number of rate-limited requests (429)** - May suggest malicious traffic or misconfigured clients.
- **Low disk space** - Trigger alerts when available disk space drops below a safe threshold.
- **Long application load time** - Detect latency issues impacting end-user experience.
- **Abnormal database connection counts** - May indicate overloading.
- **Network throughput anomalies** - Unexpected surges or drops in traffic could signal issues.
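
"Sustained" in the conditions above is best made explicit, otherwise a single spike triggers an alert. The sketch below fires only when the most recent samples are all above the threshold. The 85% value comes from this guide; the number of consecutive samples is an assumption you should tune to your scrape interval.

```python
def sustained_above(samples, threshold=0.85, min_consecutive=5):
    """Return True if the last `min_consecutive` samples all exceed `threshold`.

    `samples` are utilization ratios between 0 and 1, oldest first.
    `min_consecutive` is a hypothetical tuning parameter.
    """
    if len(samples) < min_consecutive:
        # Not enough history to call the condition sustained.
        return False
    return all(value > threshold for value in samples[-min_consecutive:])


cpu = [0.50, 0.90, 0.91, 0.93, 0.92, 0.95]
print(sustained_above(cpu))  # True: the last five samples exceed 0.85
```

Most alerting tools offer an equivalent built-in, such as a "for" duration on a rule, which is preferable to custom code where available.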

By monitoring the above metrics and setting up alerts accordingly, you can proactively manage your Lightrun application's performance and reliability in a Kubernetes environment.
