Commit 8f0e1da

docs: add enhancement proposal for Redfish power measurement
Add enhancement proposal document outlining the design and implementation approach for integrating Redfish power measurement capabilities into Kepler.

Signed-off-by: Sunil Thaha <sthaha@redhat.com>
1 parent 2b56585 commit 8f0e1da

2 files changed: +224 -0 lines changed

Lines changed: 223 additions & 0 deletions
@@ -0,0 +1,223 @@
# EP-001: Redfish Power Monitoring Support

- **Status**: Draft
- **Author**: Sunil Thaha
- **Created**: 2025-08-14

## Summary

Add Redfish BMC power monitoring to Kepler to collect platform-level power
consumption data, complementing the existing RAPL CPU monitoring and providing
comprehensive server power visibility.

## Problem

Kepler currently measures only CPU power via Intel RAPL, missing:

- Platform power (PSU, cooling, storage, network)
- Multi-vendor support (AMD, ARM systems)
- BMC integration capabilities already present in data centers

## Goals

- Add Redfish BMC power monitoring capability
- Support Kubernetes, bare metal, and standalone deployments
- Integrate with existing Kepler architecture
- Maintain security best practices

## Non-Goals

- Replace RAPL monitoring (Redfish is complementary)
- Support non-Redfish protocols (IPMI) initially
- Implement power control features
- Advanced resilience patterns in v1

## Solution

Add a platform service layer that collects BMC power data via Redfish and exposes
it through Prometheus collectors, separately from CPU power attribution.

```mermaid
C4Container
    title Container Diagram - Kepler Power Monitoring

    Person(user, "User", "Prometheus/Grafana")

    Container_Boundary(kepler, "Kepler") {
        Component(rapl, "RAPL Reader", "Go", "CPU power")
        Component(redfish, "Redfish Client", "Go", "Platform power")
        Component(monitor, "Power Monitor", "Go", "Attribution")
        Component(platform, "Platform Collector", "Go", "Platform metrics")
        Component(exporter, "Prometheus Exporter", "Go", "Metrics endpoint")
    }

    System_Ext(bmc, "BMC", "Redfish API")
    System_Ext(kernel, "Linux", "RAPL sysfs")

    Rel(rapl, kernel, "Reads")
    Rel(redfish, bmc, "Queries")
    Rel(monitor, rapl, "Uses")
    Rel(platform, redfish, "Uses")
    Rel(exporter, monitor, "Collects")
    Rel(exporter, platform, "Collects")
    Rel(user, exporter, "Scrapes")
```

## Node Identification

Nodes are identified via the `--platform.node-id` flag or the `platform.nodeID`
config key; the value must match a node identifier in the BMC configuration file.

```mermaid
flowchart LR
    A[Node Start] --> B{Node ID?}
    B -->|Yes| C[Load BMC Config]
    B -->|No| D[RAPL Only]
    C --> E{BMC Found?}
    E -->|Yes| F[Connect & Monitor]
    E -->|No| D
```

## Implementation

### Package Structure

```mermaid
graph TD
    subgraph "internal/"
        A[platform/redfish/<br/>service.go<br/>config.go<br/>client.go]
        B[exporter/prometheus/<br/>collector/<br/>platform_collector.go]
    end
    A --> B
```

### Service Interfaces

The Redfish service implements the standard Kepler service interfaces:

- `service.Initializer`: Configuration and connection setup
- `service.Runner`: Periodic power collection with context
- `service.Shutdowner`: Clean resource release

### Configuration

**Kepler Config Structure:**

```go
type Platform struct {
	NodeID  string  `yaml:"nodeID"`
	Redfish Redfish `yaml:"redfish"`
}

type Redfish struct {
	Enabled    *bool  `yaml:"enabled"`
	ConfigFile string `yaml:"configFile"`
}
```

**CLI Flags:**

```bash
--platform.node-id=worker-1
--platform.redfish.enabled=true
--platform.redfish.config=/etc/kepler/redfish.yaml
```
125+
126+
**BMC Configuration (`/etc/kepler/redfish.yaml`):**
127+
128+
```yaml
129+
nodes:
130+
worker-1: bmc-1
131+
worker-2: bmc-2
132+
133+
bmcs:
134+
bmc-1:
135+
endpoint: "https://192.168.1.100"
136+
username: "admin"
137+
password: "secret"
138+
insecure: true # TLS verification
139+
140+
bmc-2:
141+
endpoint: "https://192.168.1.101"
142+
username: "admin"
143+
password: "secret456"
144+
insecure: false # Verify TLS certificates
145+
146+
control-bmc:
147+
endpoint: "https://192.168.1.102"
148+
username: "root"
149+
password: "admin123"
150+
insecure: false
151+
```
152+
153+
## Metrics
154+
155+
```prometheus
156+
# New platform metrics
157+
kepler_node_platform_watts{source="redfish",node_name="worker-1"} 450.5
158+
kepler_node_platform_joules_total{source="redfish",node_name="worker-1"} 123456.789
159+
160+
# Existing CPU metrics unchanged
161+
kepler_node_cpu_watts{zone="package",node_name="worker-1"} 125.2
162+
```
163+
164+
## Error Handling
165+
166+
- Connection failures: Log and continue with RAPL-only
167+
- Authentication errors: Retry once, then disable for node
168+
- Timeouts: 30-second context timeout for BMC requests
169+
- Graceful degradation when BMCs unavailable
170+
171+
## Security
172+
173+
- Credentials in Kubernetes secrets or secure files (mode 0600)
174+
- No credential logging
175+
- Require explicit opt-in via configuration
176+
177+
## Implementation Phases
178+
179+
1. **Foundation**: Dependencies, service structure, config parsing
180+
2. **Core**: Gofish integration, power collection, service interface
181+
3. **Metrics**: Platform collector, Prometheus registration
182+
4. **Testing**: Unit, integration, multi-vendor validation
183+
5. **Release**: Documentation, migration guides
184+
185+
## Testing Strategy
186+
187+
- Unit tests with mocked Redfish responses
188+
- Integration tests with Redfish simulator
189+
- Performance impact validation (<2% overhead target compared to base kepler)
190+
191+
## Migration
192+
193+
- **Backward Compatible**: No breaking changes, opt-in feature
194+
- **Phased Rollout**: Test subset before full deployment
195+
- **Rollback**: Disable via config flag, continues with RAPL-only
196+
197+
## Risks and Mitigations
198+
199+
| Risk | Mitigation |
200+
|------|------------|
201+
| BMC connectivity | Retry logic, graceful degradation |
202+
| Vendor compatibility | Multi-vendor testing |
203+
| Performance impact | <2% overhead validation |
204+
| Security | Secure credential handling, TLS default |
205+
206+
## Success Metrics
207+
208+
- Platform metrics available on 95%+ nodes with BMCs
209+
- <2% CPU/memory overhead
210+
- <5% collection failure rate
211+
- Successful deployment by ops teams
212+
213+
## Future Enhancements
214+
215+
- Circuit breaker patterns
216+
- Exponential backoff strategies
217+
- External secret integration
218+
- Chassis sub-component power zones
219+
220+
## Open Questions
221+
222+
1. Multi-chassis server handling?
223+
2. Sub-component power exposure (PSU, fans)?

docs/developer/proposal/index.md

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@ This directory contains Enhancement Proposals (EPs) for major features and chang
| ID | Title | Status | Author | Created |
|----|-------|--------|--------|---------|
| [EP-000](EP_TEMPLATE.md) | Enhancement Proposal Template | Accepted |Sunil Thaha | 2025-01-18 |
+| [EP-001](EP_001-redfish-support.md) | Redfish Power Monitoring Support | Draft | Sunil Thaha | 2025-01-21 |

## Proposal Status