# Monitoring and Alerting Guide for Lightrun Application

This guide highlights the key metrics that should be monitored for the Lightrun application when deployed on Kubernetes using a Helm chart. While it identifies the most critical metrics, it is the responsibility of customers to implement the monitoring using their preferred tools and to set up appropriate alerting mechanisms.

## 1. Bird's-eye View of System Health

**1.1 Nodes Status**
Monitor the overall health and performance of nodes within the Kubernetes cluster. Keep an eye on:

- CPU usage
- Memory utilization
- Disk space consumption

**1.2 Application URL Availability**
Ensure the main application URL is accessible and returns a successful HTTP status code (200).
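
A minimal availability probe can be written with the Python standard library alone; this is an illustrative sketch, not part of the Lightrun product, and the URL you pass in would be your own deployment's endpoint. Returning the elapsed time as well lets the same probe feed the response-time checks later in this guide.

```python
import time
import urllib.request
import urllib.error


def probe(url: str, timeout: float = 5.0) -> tuple[int, float]:
    """Return (HTTP status code, elapsed seconds) for a GET request.

    Network-level failures are reported as status 0 so the caller
    can alert on unreachability as well as on bad status codes.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, time.monotonic() - start
    except urllib.error.HTTPError as exc:
        return exc.code, time.monotonic() - start
    except (urllib.error.URLError, OSError):
        return 0, time.monotonic() - start


def is_healthy(status: int) -> bool:
    """The guide treats HTTP 200 as the healthy response."""
    return status == 200
```

Run `probe(...)` from outside the cluster if you want it to exercise the same path your users take, including ingress.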

## 2. Kubernetes Cluster Monitoring

**2.1 Deployment Status**
Monitor the status of deployments to ensure the desired number of pods are running and replicating correctly.

**2.2 Pod Restart Count**
Track the number of restarts per pod. Frequent restarts may indicate instability, errors, or crashes.
| 24 | + |
| 25 | +**2.3 Kubernetes Events (Warning/Error)** |
| 26 | +Monitor Kubernetes events to detect warnings or errors that may affect the cluster's stability: |
| 27 | + |
| 28 | +- Track events with type "Warning" for potential issues like unhealthy containers or resource constraints. |
| 29 | +- Monitor events with type "Error" for critical failures requiring immediate attention. |
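
As one way to implement this, the filter below works on event objects shaped like the `items` of `kubectl get events -o json` (each carrying `type` and `reason` fields). The set of critical reasons is an illustrative starting point, not an exhaustive list.

```python
# Reasons that usually warrant immediate attention; extend for your cluster.
CRITICAL_REASONS = {"Failed", "FailedScheduling", "BackOff", "Evicted", "Unhealthy"}


def alertable_events(events: list[dict]) -> list[dict]:
    """Keep Warning-type events and flag the ones with critical reasons.

    Each event dict is assumed to match the shape of items from
    `kubectl get events -o json`.
    """
    warnings = [e for e in events if e.get("type") == "Warning"]
    for event in warnings:
        event["critical"] = event.get("reason") in CRITICAL_REASONS
    return warnings
```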

## 3. Resource Usage Monitoring

**3.1 Pod CPU Usage**
Track CPU usage across all pods. Sustained high CPU usage can signal bottlenecks or misconfigured resource limits.

**3.2 Pod Memory Resources**
Monitor memory usage across all pods to prevent out-of-memory issues and ensure optimal performance.

**3.3 Disk Space Consumption**
Monitor disk space consumption on nodes and persistent volumes to prevent storage exhaustion.
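
For all three resources, it is *sustained* usage above a limit that should page, not a momentary spike. One way to encode that, assuming you sample utilization as a fraction of the pod's resource limit, is a consecutive-run check:

```python
def sustained_above(samples: list[float], threshold: float, min_run: int) -> bool:
    """True if `samples` stay above `threshold` for `min_run` consecutive points.

    samples: utilization as a fraction of the resource limit (0.0-1.0+),
    ordered oldest to newest. Works the same for CPU, memory, or disk.
    """
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_run:
            return True
    return False
```

With one sample per minute, `sustained_above(samples, 0.85, 5)` would mean "above 85% of the limit for five straight minutes"; the 85% figure matches the alerting recommendations at the end of this guide.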

## 4. Network Monitoring

**4.1 Incoming Network Traffic (RX)**
Monitor incoming network data volume per pod to identify spikes or anomalies.

**4.2 Outgoing Network Traffic (TX)**
Track outgoing network data volume per pod to maintain proper network operation.
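
A common way to define a "spike" is a sample far outside the recent mean. The sketch below uses a simple z-score over a trailing window of per-interval byte counts and applies equally to RX and TX; the threshold of three standard deviations is a conventional starting point, not a Lightrun-specific value.

```python
import statistics


def is_spike(history: list[float], current: float,
             z_threshold: float = 3.0) -> bool:
    """Flag `current` as a spike if it sits more than `z_threshold`
    standard deviations above the mean of `history` (recent
    per-interval byte counts for one pod).
    """
    if len(history) < 2:
        return False  # not enough data to judge
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current > mean  # perfectly flat history: any increase stands out
    return (current - mean) / stdev > z_threshold
```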

## 5. Ingress Monitoring

**5.1 Active Connections**
Monitor total active connections to ensure the application is able to handle incoming traffic effectively.

**5.2 Requests per Second**
Track request rates to ensure they are within acceptable limits and not causing performance issues.

**5.3 Rate-limited Requests (HTTP 429)**
Monitor requests that exceed rate limits. A high number of 429 responses may indicate misconfiguration or excessive load.
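
The *share* of 429 responses is usually easier to alert on than a raw count, since it stays meaningful as traffic grows. Assuming a mapping of status code to request count per scrape interval:

```python
def rate_limited_share(status_counts: dict[int, int]) -> float:
    """Fraction of requests answered with HTTP 429 in one interval.

    status_counts maps HTTP status code -> number of responses.
    An empty interval is reported as 0.0 rather than an error.
    """
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    return status_counts.get(429, 0) / total
```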

**5.4 Application Response Time**
Measure the response time of the main application endpoint to identify latency issues.

## 6. Lightrun Backend Application Metrics

The Lightrun backend exposes custom Prometheus metrics via the `/management/prometheus` endpoint. The following metrics can be monitored:

**6.1 Total Connected Agents per Runtime (`connected_agents_per_runtime`)**
Track the number of agents connected per runtime to ensure runtimes are reporting data correctly.

**6.2 JVM Memory Utilization (`jvm_memory_used_bytes`)**
Monitor JVM heap usage to prevent out-of-memory errors.
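
If you want to sanity-check the endpoint without a full Prometheus stack, the text exposition format is simple enough to read directly. This sketch pulls the sample lines for a given metric name; labels are kept as a raw string, and optional trailing timestamps are not handled, so treat it as illustrative rather than a full parser.

```python
def parse_metric(exposition: str, metric: str) -> list[tuple[str, float]]:
    """Extract (labels, value) pairs for `metric` from Prometheus text format.

    Skips # HELP / # TYPE comment lines. Labels are returned as the raw
    '{...}' string (empty for unlabeled samples). Samples with trailing
    timestamps are not supported by this sketch.
    """
    out = []
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated token; labels may contain spaces.
        name_and_labels, _, value = line.rpartition(" ")
        if name_and_labels == metric or name_and_labels.startswith(metric + "{"):
            out.append((name_and_labels[len(metric):], float(value)))
    return out
```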

## 7. MySQL Database Monitoring

**7.1 CPU Utilization**
Monitor database CPU usage to ensure the database runs efficiently without performance issues.

**7.2 RAM Utilization**
Track memory usage to avoid performance bottlenecks.

**7.3 Active DB Connections**
Monitor database connections to prevent overwhelming the database with simultaneous requests.

**7.4 File System Usage**
Track disk space usage on the database volume to prevent storage exhaustion.

**7.5 Network Throughput**
Monitor network traffic to and from MySQL.

**7.6 Query Latency**
Monitor query execution times to identify performance bottlenecks:

- Track the average query execution time across all operations
- Identify the slowest queries by type or table
- Track the frequency of long-running queries (>500 ms) that may indicate blocking issues
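
The bookkeeping above can be sketched as a small summary over `(query_name, duration_ms)` samples, however you collect them (slow query log, performance schema, or a proxy). The 500 ms threshold comes from the list above; everything else is illustrative.

```python
def query_latency_summary(samples: list[tuple[str, float]],
                          slow_ms: float = 500.0) -> dict:
    """Summarize (query_name, duration_ms) samples.

    Returns the average duration, the slowest query seen, and how many
    samples exceeded the slow threshold (500 ms per this guide).
    """
    if not samples:
        return {"avg_ms": 0.0, "slowest": None, "slow_count": 0}
    durations = [d for _, d in samples]
    slowest_name, _ = max(samples, key=lambda s: s[1])
    return {
        "avg_ms": sum(durations) / len(durations),
        "slowest": slowest_name,
        "slow_count": sum(1 for d in durations if d > slow_ms),
    }
```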

## 8. Redis Cache Monitoring

**8.1 CPU Utilization**
Observe CPU usage to ensure efficient processing of cache operations.

**8.2 RAM Utilization**
Track memory usage to avoid performance bottlenecks.

**8.3 Query Latency**
Monitor command execution times for cache operations:

- Track the average response time for common commands (GET, SET, etc.)
- Monitor latency percentiles (p50, p95, p99) for different operation types
- Identify the slowest commands, which may indicate memory pressure or disk I/O issues
- Track the frequency of delayed responses (>100 ms) during high-load periods
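
Whichever tool reports these numbers, it helps to compute percentiles consistently. Below is a nearest-rank sketch over a list of latencies in milliseconds; the p50/p95/p99 split and the 100 ms threshold follow the list above, and the nearest-rank method is one common convention among several.

```python
import math


def percentile(latencies_ms: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least
    pct% of all samples at or below it."""
    if not latencies_ms:
        raise ValueError("no samples")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_report(latencies_ms: list[float], slow_ms: float = 100.0) -> dict:
    """p50/p95/p99 plus the count of delayed responses (>100 ms)."""
    return {
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
        "slow_count": sum(1 for v in latencies_ms if v > slow_ms),
    }
```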

## 9. Backend HTTP Statistics

**9.1 HTTP Hits by Status Code**
Monitor the distribution of HTTP status codes returned by backend requests.
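
Grouping individual codes into classes makes the distribution easier to reason about, since a rising 5xx share is the usual alert trigger. Assuming a code-to-count histogram per interval:

```python
def by_status_class(status_counts: dict[int, int]) -> dict[str, int]:
    """Collapse a status-code histogram into 2xx/3xx/4xx/5xx classes."""
    classes: dict[str, int] = {}
    for code, count in status_counts.items():
        key = f"{code // 100}xx"
        classes[key] = classes.get(key, 0) + count
    return classes


def error_share(status_counts: dict[int, int]) -> float:
    """Fraction of all responses that were 5xx (0.0 for an empty interval)."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    return by_status_class(status_counts).get("5xx", 0) / total
```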

# Alerting Recommendations

Set up alerts based on the following conditions to enable proactive issue resolution:

- **Pod restart detected** - Any pod restarting more than once within a short period.
- **High CPU/Memory usage** - Sustained usage above 85% of capacity.
- **High rate of 5xx errors** - Indicative of backend failures or instability.
- **High number of rate-limited requests (429)** - May suggest malicious traffic or misconfigured clients.
- **Low disk space** - Trigger alerts when available disk space drops below a safe threshold.
- **Long application load time** - Detect latency issues impacting end-user experience.
- **Abnormal database connection counts** - May indicate overloading.
- **Network throughput anomalies** - Unexpected surges or drops in traffic could signal issues.
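
The fixed-threshold items in this list reduce to a small rules table, whatever alerting tool you use. The sketch below evaluates a metrics snapshot against such a table; only the 85% CPU/memory figure comes from the recommendations above, the other thresholds are placeholders you should tune for your environment.

```python
# Placeholder thresholds -- only the 0.85 CPU/memory figure is from this
# guide; tune the rest for your deployment.
ALERT_RULES = {
    "cpu_utilization": lambda v: v > 0.85,
    "memory_utilization": lambda v: v > 0.85,
    "disk_free_fraction": lambda v: v < 0.10,
    "http_5xx_share": lambda v: v > 0.05,
    "http_429_per_min": lambda v: v > 100,
}


def evaluate(snapshot: dict[str, float]) -> list[str]:
    """Return the names of rules that fire for this metrics snapshot.

    Metrics absent from the snapshot are simply skipped.
    """
    return [name for name, fires in ALERT_RULES.items()
            if name in snapshot and fires(snapshot[name])]
```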

By monitoring the above metrics and setting up alerts accordingly, you can proactively manage your Lightrun application's performance and reliability in a Kubernetes environment.