| 1 | +# EP-001: Redfish Power Monitoring Support |
| 2 | + |
| 3 | +- **Status**: Draft |
| 4 | +- **Author**: Sunil Thaha |
| 5 | +- **Created**: 2025-08-14 |
| 6 | + |
| 7 | +## Summary |
| 8 | + |
| 9 | +Add Redfish BMC power monitoring to Kepler for platform-level power consumption data, |
| 10 | +complementing existing RAPL CPU monitoring to provide comprehensive server power |
| 11 | +visibility. |
| 12 | + |
| 13 | +## Problem |
| 14 | + |
| 15 | +Kepler currently measures only CPU power via Intel RAPL, missing: |
| 16 | + |
| 17 | +- Platform power (PSU, cooling, storage, network) |
| 18 | +- Multi-vendor support (AMD, ARM systems) |
| 19 | +- BMC integration capabilities already present in data centers |
| 20 | + |
| 21 | +## Goals |
| 22 | + |
| 23 | +- Add Redfish BMC power monitoring capability |
| 24 | +- Support Kubernetes, bare metal, and standalone deployments |
| 25 | +- Integrate with existing Kepler architecture |
| 26 | +- Maintain security best practices |
| 27 | + |
| 28 | +## Non-Goals |
| 29 | + |
| 30 | +- Replace RAPL monitoring (complementary) |
| 31 | +- Support non-Redfish protocols (IPMI) initially |
| 32 | +- Implement power control features |
| 33 | +- Advanced resilience patterns in v1 |
| 34 | + |
| 35 | +## Solution |
| 36 | + |
| 37 | +Add a platform service layer that collects BMC power data via Redfish and exposes it |
| 38 | +through Prometheus collectors, separately from CPU power attribution. |
| 39 | + |
| 40 | +```mermaid |
| 41 | +C4Container |
| 42 | + title Container Diagram - Kepler Power Monitoring |
| 43 | + |
| 44 | + Person(user, "User", "Prometheus/Grafana") |
| 45 | + |
| 46 | + Container_Boundary(kepler, "Kepler") { |
| 47 | + Component(rapl, "RAPL Reader", "Go", "CPU power") |
| 48 | + Component(redfish, "Redfish Client", "Go", "Platform power") |
| 49 | + Component(monitor, "Power Monitor", "Go", "Attribution") |
| 50 | + Component(platform, "Platform Collector", "Go", "Platform metrics") |
| 51 | + Component(exporter, "Prometheus Exporter", "Go", "Metrics endpoint") |
| 52 | + } |
| 53 | + |
| 54 | + System_Ext(bmc, "BMC", "Redfish API") |
| 55 | + System_Ext(kernel, "Linux", "RAPL sysfs") |
| 56 | + |
| 57 | + Rel(rapl, kernel, "Reads") |
| 58 | + Rel(redfish, bmc, "Queries") |
| 59 | + Rel(monitor, rapl, "Uses") |
| 60 | + Rel(platform, redfish, "Uses") |
| 61 | + Rel(exporter, monitor, "Collects") |
| 62 | + Rel(exporter, platform, "Collects") |
| 63 | + Rel(user, exporter, "Scrapes") |
| 64 | +``` |
| 65 | + |
| 66 | +## Node Identification |
| 67 | + |
| 68 | +Each node is identified via the `--platform.node-id` flag or the `platform.nodeID` config |
| 69 | +option; the value must match a node key in the BMC configuration file. |
| 70 | + |
| 71 | +```mermaid |
| 72 | +flowchart LR |
| 73 | + A[Node Start] --> B{Node ID?} |
| 74 | + B -->|Yes| C[Load BMC Config] |
| 75 | + B -->|No| D[RAPL Only] |
| 76 | + C --> E{BMC Found?} |
| 77 | + E -->|Yes| F[Connect & Monitor] |
| 78 | + E -->|No| D |
| 79 | +``` |
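The lookup in the flowchart above can be sketched in Go as follows. This is a minimal illustration, not the proposed implementation: the `BMCConfig` and `BMCDetail` types, the `ResolveBMC` function, and the `ErrRAPLOnly` sentinel are hypothetical names, and the struct layout is an assumption modeled on the `nodes`/`bmcs` layout of the BMC configuration file.

```go
package main

import (
	"errors"
	"fmt"
)

// BMCDetail is a hypothetical mirror of one entry under `bmcs:`.
type BMCDetail struct {
	Endpoint string
	Username string
	Password string
	Insecure bool
}

// BMCConfig is a hypothetical mirror of the BMC configuration file.
type BMCConfig struct {
	Nodes map[string]string    // node ID -> BMC ID
	BMCs  map[string]BMCDetail // BMC ID -> connection details
}

// ErrRAPLOnly signals that Redfish monitoring is skipped for this node.
var ErrRAPLOnly = errors.New("no BMC for node, continuing with RAPL only")

// ResolveBMC follows the flowchart: a missing node ID, an unmapped node,
// or an unknown BMC all fall back to RAPL-only operation.
func ResolveBMC(cfg BMCConfig, nodeID string) (BMCDetail, error) {
	if nodeID == "" {
		return BMCDetail{}, ErrRAPLOnly
	}
	bmcID, ok := cfg.Nodes[nodeID]
	if !ok {
		return BMCDetail{}, fmt.Errorf("node %q: %w", nodeID, ErrRAPLOnly)
	}
	bmc, ok := cfg.BMCs[bmcID]
	if !ok {
		return BMCDetail{}, fmt.Errorf("bmc %q: %w", bmcID, ErrRAPLOnly)
	}
	return bmc, nil
}

func main() {
	cfg := BMCConfig{
		Nodes: map[string]string{"worker-1": "bmc-1"},
		BMCs:  map[string]BMCDetail{"bmc-1": {Endpoint: "https://192.168.1.100"}},
	}
	bmc, err := ResolveBMC(cfg, "worker-1")
	fmt.Println(bmc.Endpoint, err)
}
```

Returning a wrapped sentinel error lets the caller distinguish "no BMC, use RAPL only" from other failures with `errors.Is`.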
| 80 | + |
| 81 | +## Implementation |
| 82 | + |
| 83 | +### Package Structure |
| 84 | + |
| 85 | +```mermaid |
| 86 | +graph TD |
| 87 | + subgraph "internal/" |
| 88 | + A[platform/redfish/<br/>service.go<br/>config.go<br/>client.go] |
| 89 | + B[exporter/prometheus/<br/>collector/<br/>platform_collector.go] |
| 90 | + end |
| 91 | + A --> B |
| 92 | +``` |
| 93 | + |
| 94 | +### Service Interfaces |
| 95 | + |
| 96 | +Implements standard Kepler patterns: |
| 97 | + |
| 98 | +- `service.Initializer`: Configuration and connection setup |
| 99 | +- `service.Runner`: Periodic power collection with context |
| 100 | +- `service.Shutdowner`: Clean resource release |
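As a rough sketch of how a service could satisfy these three lifecycle roles, the example below defines local stand-in interfaces and a stub service. The interface definitions are assumptions that only mirror the names listed above (Kepler's actual `service` package may declare them differently), and `RedfishService`, `PowerReader`, and the 10-second default interval are hypothetical.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Local stand-ins for the lifecycle interfaces (assumed shapes).
type Initializer interface{ Init() error }
type Runner interface{ Run(ctx context.Context) error }
type Shutdowner interface{ Shutdown() error }

// PowerReader is a hypothetical abstraction over the Redfish client.
type PowerReader func(ctx context.Context) (watts float64, err error)

// RedfishService is a stub that polls a PowerReader on a fixed interval.
type RedfishService struct {
	Read      PowerReader
	Interval  time.Duration
	LastWatts float64
	Samples   int
}

// Compile-time checks that the stub implements all three roles.
var (
	_ Initializer = (*RedfishService)(nil)
	_ Runner      = (*RedfishService)(nil)
	_ Shutdowner  = (*RedfishService)(nil)
)

// Init validates configuration and applies an assumed default interval.
func (s *RedfishService) Init() error {
	if s.Read == nil {
		return errors.New("redfish: no power reader configured")
	}
	if s.Interval <= 0 {
		s.Interval = 10 * time.Second // hypothetical default
	}
	return nil
}

// Run collects power periodically until the context is cancelled.
func (s *RedfishService) Run(ctx context.Context) error {
	ticker := time.NewTicker(s.Interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return nil
		case <-ticker.C:
			// Failed reads are skipped so the loop keeps running.
			if w, err := s.Read(ctx); err == nil {
				s.LastWatts = w
				s.Samples++
			}
		}
	}
}

// Shutdown releases the reader; a real client would also log out here.
func (s *RedfishService) Shutdown() error {
	s.Read = nil
	return nil
}

// runDemo drives the full lifecycle with a fake reader for about 100 ms.
func runDemo() (samples int, watts float64) {
	svc := &RedfishService{
		Read:     func(context.Context) (float64, error) { return 450.5, nil },
		Interval: 5 * time.Millisecond,
	}
	if err := svc.Init(); err != nil {
		return 0, 0
	}
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	_ = svc.Run(ctx)
	_ = svc.Shutdown()
	return svc.Samples, svc.LastWatts
}

func main() {
	n, w := runDemo()
	fmt.Printf("collected %d samples, last reading %.1f W\n", n, w)
}
```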
| 101 | + |
| 102 | +### Configuration |
| 103 | + |
| 104 | +**Kepler Config Structure:** |
| 105 | + |
| 106 | +```go |
| 107 | +type Platform struct { |
| 108 | +    NodeID  string  `yaml:"nodeID"` |
| 109 | +    Redfish Redfish `yaml:"redfish"` |
| 110 | +} |
| 111 | + |
| 112 | +type Redfish struct { |
| 113 | +    Enabled    *bool  `yaml:"enabled"` |
| 114 | +    ConfigFile string `yaml:"configFile"` |
| 115 | +} |
| 116 | +``` |
| 117 | + |
| 118 | +**CLI Flags:** |
| 119 | + |
| 120 | +```bash |
| 121 | +--platform.node-id=worker-1 |
| 122 | +--platform.redfish.enabled=true |
| 123 | +--platform.redfish.config=/etc/kepler/redfish.yaml |
| 124 | +``` |
| 125 | + |
| 126 | +**BMC Configuration (`/etc/kepler/redfish.yaml`):** |
| 127 | + |
| 128 | +```yaml |
| 129 | +nodes: |
| 130 | +  worker-1: bmc-1 |
| 131 | +  worker-2: bmc-2 |
| 132 | + |
| 133 | +bmcs: |
| 134 | +  bmc-1: |
| 135 | +    endpoint: "https://192.168.1.100" |
| 136 | +    username: "admin" |
| 137 | +    password: "secret" |
| 138 | +    insecure: true  # Skip TLS certificate verification |
| 139 | + |
| 140 | +  bmc-2: |
| 141 | +    endpoint: "https://192.168.1.101" |
| 142 | +    username: "admin" |
| 143 | +    password: "secret456" |
| 144 | +    insecure: false  # Verify TLS certificates |
| 145 | + |
| 146 | +  control-bmc: |
| 147 | +    endpoint: "https://192.168.1.102" |
| 148 | +    username: "root" |
| 149 | +    password: "admin123" |
| 150 | +    insecure: false |
| 151 | +``` |
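To make the effect of the `insecure` field concrete, here is a minimal standard-library sketch of an HTTP client whose TLS verification follows that flag. The proposal plans to use Gofish for the actual connection, so `NewHTTPClient` and `SkipsTLSVerification` are hypothetical helpers that only illustrate what the flag amounts to; the TLS 1.2 minimum is likewise an assumption.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

// NewHTTPClient builds a client for one BMC entry. When insecure is true,
// certificate verification is skipped; otherwise certificates are verified.
func NewHTTPClient(insecure bool) *http.Client {
	return &http.Client{
		Timeout: 30 * time.Second, // matches the 30-second request timeout
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{
				InsecureSkipVerify: insecure,
				MinVersion:         tls.VersionTLS12, // assumed floor
			},
		},
	}
}

// SkipsTLSVerification reports whether the client ignores certificate errors.
func SkipsTLSVerification(c *http.Client) bool {
	t, ok := c.Transport.(*http.Transport)
	return ok && t.TLSClientConfig != nil && t.TLSClientConfig.InsecureSkipVerify
}

func main() {
	fmt.Println("bmc-1 skips verification:", SkipsTLSVerification(NewHTTPClient(true)))
	fmt.Println("bmc-2 skips verification:", SkipsTLSVerification(NewHTTPClient(false)))
}
```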
| 152 | + |
| 153 | +## Metrics |
| 154 | + |
| 155 | +```prometheus |
| 156 | +# New platform metrics |
| 157 | +kepler_node_platform_watts{source="redfish",node_name="worker-1"} 450.5 |
| 158 | +kepler_node_platform_joules_total{source="redfish",node_name="worker-1"} 123456.789 |
| 159 | + |
| 160 | +# Existing CPU metrics unchanged |
| 161 | +kepler_node_cpu_watts{zone="package",node_name="worker-1"} 125.2 |
| 162 | +``` |
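The proposal does not state how `kepler_node_platform_joules_total` is derived. One plausible approach, shown here purely as an assumption, is to integrate each power reading over the time since the previous reading (energy in joules = watts × seconds). `EnergyCounter` is a hypothetical type.

```go
package main

import "fmt"

// EnergyCounter accumulates energy from periodic power readings.
// This is an assumed derivation, not the documented behaviour.
type EnergyCounter struct {
	joules float64
}

// Add integrates one reading over the elapsed interval and returns the
// running total. Invalid inputs are ignored so the counter never decreases.
func (c *EnergyCounter) Add(watts, seconds float64) float64 {
	if watts < 0 || seconds <= 0 {
		return c.joules
	}
	c.joules += watts * seconds
	return c.joules
}

func main() {
	var c EnergyCounter
	c.Add(450.5, 5)                                   // 450.5 W for 5 s
	fmt.Printf("%.1f J\n", c.Add(450.5, 5))           // another 5 s at 450.5 W
}
```

With two 5-second intervals at 450.5 W, the counter reaches 4505 J.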
| 163 | + |
| 164 | +## Error Handling |
| 165 | + |
| 166 | +- Connection failures: Log and continue with RAPL-only |
| 167 | +- Authentication errors: Retry once, then disable Redfish for that node |
| 168 | +- Timeouts: 30-second context timeout for BMC requests |
| 169 | +- Graceful degradation when BMCs unavailable |
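The rules above could be combined roughly as in the following sketch. `Collect`, `FetchFunc`, and `ErrAuth` are hypothetical names, and treating every non-authentication error as "log and continue" is an assumption about how the bullets fit together.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrAuth is a hypothetical sentinel for BMC authentication failures.
var ErrAuth = errors.New("redfish: authentication failed")

// BMCTimeout is the per-request timeout named in the list above.
const BMCTimeout = 30 * time.Second

// FetchFunc is a hypothetical stand-in for one Redfish power query.
type FetchFunc func(ctx context.Context) (watts float64, err error)

// Collect applies the timeout to each attempt. Authentication errors are
// retried once; if the retry also fails, disable is true. Any other error
// (connection failure, timeout) is returned for logging without disabling.
func Collect(parent context.Context, timeout time.Duration, fetch FetchFunc) (watts float64, disable bool, err error) {
	for attempt := 0; attempt < 2; attempt++ {
		ctx, cancel := context.WithTimeout(parent, timeout)
		watts, err = fetch(ctx)
		cancel()
		if err == nil {
			return watts, false, nil
		}
		if !errors.Is(err, ErrAuth) {
			return 0, false, err
		}
	}
	return 0, true, err
}

// runDemo exercises the auth-retry path and the timeout path with fakes.
func runDemo() (authCalls int, disabled bool, timedOut bool) {
	unauthorized := func(context.Context) (float64, error) {
		authCalls++
		return 0, ErrAuth
	}
	_, disabled, _ = Collect(context.Background(), BMCTimeout, unauthorized)

	// A fake BMC that never answers; a short timeout keeps the demo fast.
	hangs := func(ctx context.Context) (float64, error) {
		<-ctx.Done()
		return 0, ctx.Err()
	}
	_, _, err := Collect(context.Background(), 10*time.Millisecond, hangs)
	timedOut = errors.Is(err, context.DeadlineExceeded)
	return authCalls, disabled, timedOut
}

func main() {
	calls, disabled, timedOut := runDemo()
	fmt.Println("auth attempts:", calls, "disabled:", disabled, "timed out:", timedOut)
}
```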
| 170 | + |
| 171 | +## Security |
| 172 | + |
| 173 | +- Credentials in Kubernetes secrets or secure files (mode 0600) |
| 174 | +- No credential logging |
| 175 | +- Require explicit opt-in via configuration |
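A startup check for the "secure files (mode 0600)" requirement might look like the sketch below. `ValidateCredentialFileMode` and `CheckCredentialFile` are hypothetical helpers, and rejecting any group or other permission bits (rather than requiring exactly 0600) is an assumption.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
)

// ValidateCredentialFileMode rejects files that grant any access to
// group or others, so 0600 and stricter modes such as 0400 pass.
func ValidateCredentialFileMode(mode fs.FileMode) error {
	if perm := mode.Perm(); perm&0o077 != 0 {
		return fmt.Errorf("credential file mode %04o is too permissive, want 0600", uint32(perm))
	}
	return nil
}

// CheckCredentialFile stats the file and validates its permission bits.
func CheckCredentialFile(path string) error {
	info, err := os.Stat(path)
	if err != nil {
		return err
	}
	return ValidateCredentialFileMode(info.Mode())
}

func main() {
	if err := CheckCredentialFile("/etc/kepler/redfish.yaml"); err != nil {
		fmt.Println("refusing to load BMC config:", err)
	}
}
```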
| 176 | + |
| 177 | +## Implementation Phases |
| 178 | + |
| 179 | +1. **Foundation**: Dependencies, service structure, config parsing |
| 180 | +2. **Core**: Gofish integration, power collection, service interface |
| 181 | +3. **Metrics**: Platform collector, Prometheus registration |
| 182 | +4. **Testing**: Unit, integration, multi-vendor validation |
| 183 | +5. **Release**: Documentation, migration guides |
| 184 | + |
| 185 | +## Testing Strategy |
| 186 | + |
| 187 | +- Unit tests with mocked Redfish responses |
| 188 | +- Integration tests with Redfish simulator |
| 189 | +- Performance impact validation (target: <2% overhead compared to baseline Kepler) |
| 190 | + |
| 191 | +## Migration |
| 192 | + |
| 193 | +- **Backward Compatible**: No breaking changes, opt-in feature |
| 194 | +- **Phased Rollout**: Test subset before full deployment |
| 195 | +- **Rollback**: Disable via config flag; Kepler continues with RAPL-only monitoring |
| 196 | + |
| 197 | +## Risks and Mitigations |
| 198 | + |
| 199 | +| Risk | Mitigation | |
| 200 | +|------|------------| |
| 201 | +| BMC connectivity | Retry logic, graceful degradation | |
| 202 | +| Vendor compatibility | Multi-vendor testing | |
| 203 | +| Performance impact | <2% overhead validation | |
| 204 | +| Security | Secure credential handling, TLS default | |
| 205 | + |
| 206 | +## Success Metrics |
| 207 | + |
| 208 | +- Platform metrics available on 95%+ of nodes with BMCs |
| 209 | +- <2% CPU/memory overhead |
| 210 | +- <5% collection failure rate |
| 211 | +- Successful deployment by ops teams |
| 212 | + |
| 213 | +## Future Enhancements |
| 214 | + |
| 215 | +- Circuit breaker patterns |
| 216 | +- Exponential backoff strategies |
| 217 | +- External secret integration |
| 218 | +- Chassis sub-component power zones |
| 219 | + |
| 220 | +## Open Questions |
| 221 | + |
| 222 | +1. Multi-chassis server handling? |
| 223 | +2. Sub-component power exposure (PSU, fans)? |