Commit f52528c

docs: add enhancement proposal for Redfish power measurement
Add enhancement proposal document outlining the design and implementation approach for integrating Redfish power measurement capabilities into Kepler.

Signed-off-by: Sunil Thaha <sthaha@redhat.com>
1 parent 547331d commit f52528c

2 files changed

+368
-0
lines changed

Lines changed: 367 additions & 0 deletions
@@ -0,0 +1,367 @@
1+
# EP-001: Redfish Power Monitoring Support
2+
3+
**Status**: Draft
4+
**Author**: Sunil Thaha
5+
**Created**: 2025-01-18
6+
**Last Updated**: 2025-01-18
7+
8+
## Summary
9+
10+
This proposal adds Redfish BMC power monitoring support to Kepler, enabling collection
11+
of platform-level power consumption data from server BMCs. This complements existing
12+
RAPL CPU power monitoring and provides comprehensive server power visibility.
13+
14+
## Problem Statement
15+
16+
Currently, Kepler only measures CPU power consumption using Intel RAPL sensors. This
17+
provides incomplete power visibility as it doesn't account for:
18+
19+
- **Platform Power**: Overall system power including PSU efficiency, cooling, storage,
20+
network interfaces, and other system components
21+
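- **Multi-vendor Support**: RAPL originated on Intel processors; recent AMD CPUs expose a compatible interface, but coverage varies by generation and ARM systems generally offer no equivalent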
22+
- **BMC Integration**: Modern data centers use BMCs for server management but Kepler
23+
can't leverage these existing power monitoring capabilities
24+
- **Kubernetes Environments**: In containerized environments, understanding total node
25+
power consumption (not just CPU) is critical for resource allocation and cost attribution
26+
27+
### Current Limitations
28+
29+
1. **Incomplete Power Attribution**: Workloads are attributed only CPU power, missing
30+
significant power consumption from other components
31+
2. **Platform Blindness**: No visibility into overall server power consumption trends
32+
3. **Limited Hardware Support**: RAPL availability varies across processor generations and vendors
33+
4. **Manual Power Management**: No integration with existing BMC-based power monitoring infrastructure
34+
35+
## Goals
36+
37+
- **Primary**: Add Redfish BMC power monitoring capability to Kepler
38+
- **Multi-Environment Support**: Support Kubernetes, OpenStack bare metal, and standalone deployments
39+
- **Seamless Integration**: Integrate with existing Kepler architecture and service patterns
40+
- **Standard Metrics**: Provide platform power metrics via Prometheus following Kepler conventions
41+
- **Security**: Maintain security best practices for credential management
42+
- **Scalability**: Support multi-machine configurations efficiently
43+
44+
## Non-Goals
45+
46+
- Replace existing RAPL CPU power monitoring (complementary, not replacement)
47+
- Support non-Redfish BMC protocols (IPMI, proprietary APIs) in initial implementation
48+
- Implement power capping or control features (monitoring only)
49+
- Provide historical power data storage beyond Prometheus metrics
50+
- Support for edge devices or embedded systems without BMCs
51+
52+
## Requirements
53+
54+
### Functional Requirements
55+
56+
- Use `github.com/stmcginnis/gofish` library for Redfish client functionality
57+
- Support node-specific BMC configuration lookup in multi-node environments
58+
- Implement standard Kepler service interfaces (Initializer, Runner, Shutdowner)
59+
- Generate `kepler_node_platform_watts{source="redfish"}` and
60+
`kepler_node_platform_joules_total{source="redfish"}` metrics
61+
- Follow Kepler's configuration patterns and coding conventions
62+
- Support both secure (TLS verified) and insecure BMC connections
63+
64+
### Non-Functional Requirements
65+
66+
- **Performance**: Minimal impact on Kepler's CPU and memory footprint
67+
- **Reliability**: Graceful handling of BMC connection failures
68+
- **Security**: Secure credential storage and transmission
69+
- **Maintainability**: Clean code following Go idioms and Kepler patterns
70+
- **Testability**: Comprehensive unit and integration test coverage
71+
72+
## Proposed Solution
73+
74+
### High-Level Architecture
75+
76+
Add a new platform service layer to Kepler that collects power data from BMCs via
77+
Redfish and exposes it directly through Prometheus collectors, separate from CPU
78+
power attribution handled by PowerMonitor.
79+
80+
```text
┌─────────────────┐     ┌──────────────────┐
│   CPU Power     │     │  Platform Power  │
│   (RAPL)        │     │  (Redfish)       │
│                 │     │                  │
└─────────────────┘     └──────────────────┘
         │                        │
         ▼                        ▼
┌─────────────────┐     ┌──────────────────┐
│ Power Monitor   │     │ Platform         │
│ (Attribution)   │     │ Collector        │
└─────────────────┘     └──────────────────┘
         │                        │
         └──────────┬─────────────┘
                    ▼
           ┌──────────────────┐
           │ Prometheus       │
           │ Exporter         │
           └──────────────────┘
```
100+
101+
### Node Identification Strategy
102+
103+
The solution uses a flexible node identification approach that works across different environments:
104+
105+
1. **CLI Override**: `--platform.node-id=my-node` (highest priority)
106+
2. **Kubernetes Node Name**: Uses existing `cfg.Kube.Node` from `--kube.node-name`
107+
3. **Hostname Fallback**: Uses `os.Hostname()` as last resort
108+
109+
This creates a single BMC configuration file that maps node identifiers to BMC configurations,
110+
eliminating the need for environment-specific configuration management.
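
The priority order above can be sketched as a small Go function. This is a minimal illustration, not the proposal's implementation: `resolveNodeID` and `defaultNodeID` are hypothetical names, and injecting the hostname lookup as a function is an assumption made so the fallback can be exercised in tests.

```go
package main

import (
	"errors"
	"os"
	"strings"
)

// resolveNodeID picks the node identifier in priority order:
// CLI override, then Kubernetes node name, then hostname.
// The hostname lookup is injected so the fallback can be tested.
// (Hypothetical helper; not taken from the Kepler code base.)
func resolveNodeID(cliOverride, kubeNode string, hostname func() (string, error)) (string, error) {
	if id := strings.TrimSpace(cliOverride); id != "" {
		return id, nil
	}
	if id := strings.TrimSpace(kubeNode); id != "" {
		return id, nil
	}
	host, err := hostname()
	if err != nil {
		return "", err
	}
	if host = strings.TrimSpace(host); host == "" {
		return "", errors.New("unable to determine node identifier: hostname is empty")
	}
	return host, nil
}

// defaultNodeID wires in os.Hostname as the last-resort lookup.
func defaultNodeID(cliOverride, kubeNode string) (string, error) {
	return resolveNodeID(cliOverride, kubeNode, os.Hostname)
}
```

Whatever identifier wins is then used as the key into the `nodes` map of the BMC configuration file, so the same file can be shipped to every machine.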
111+
112+
## Detailed Design
113+
114+
### Package Structure
115+
116+
```text
internal/
├── platform/
│   └── redfish/
│       ├── service.go      # Main service implementation
│       ├── config.go       # Configuration parsing and validation
│       ├── power_reader.go # Power data collection logic
│       ├── client.go       # Gofish client wrapper
│       └── service_test.go # Unit tests
└── exporter/prometheus/collector/
    └── platform_collector.go # Platform power metrics collector
```
128+
129+
### Service Interfaces
130+
131+
The Redfish service implements standard Kepler service interfaces:
132+
133+
- **`service.Initializer`**: Load configuration, resolve BMC, establish connection
134+
- **`service.Runner`**: Periodic power data collection with context cancellation
135+
- **`service.Shutdowner`**: Clean connection closure and resource cleanup
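
A rough sketch of how one type could satisfy all three roles follows. The interface definitions here are local stand-ins whose method signatures are assumed, not copied from Kepler's `service` package, and `PowerReader`, `staticReader`, and `RedfishService` are hypothetical names used only for illustration.

```go
package main

import (
	"context"
	"errors"
	"sync"
	"time"
)

// Local stand-ins for Kepler's service interfaces (signatures assumed).
type Initializer interface{ Init() error }
type Runner interface{ Run(ctx context.Context) error }
type Shutdowner interface{ Shutdown() error }

// PowerReader abstracts the BMC query (hypothetical).
type PowerReader interface {
	ReadWatts() (float64, error)
	Close() error
}

// staticReader is a trivial PowerReader for demos and tests.
type staticReader float64

func (r staticReader) ReadWatts() (float64, error) { return float64(r), nil }
func (r staticReader) Close() error                { return nil }

// RedfishService polls a PowerReader and caches the latest sample.
type RedfishService struct {
	reader   PowerReader
	interval time.Duration

	mu     sync.RWMutex
	latest float64
}

// Compile-time checks that the service fills all three roles.
var (
	_ Initializer = (*RedfishService)(nil)
	_ Runner      = (*RedfishService)(nil)
	_ Shutdowner  = (*RedfishService)(nil)
)

// Init validates that the service is usable before Run is called.
func (s *RedfishService) Init() error {
	if s.reader == nil {
		return errors.New("redfish: no power reader configured")
	}
	if s.interval <= 0 {
		return errors.New("redfish: poll interval must be positive")
	}
	return nil
}

// collect takes one sample and stores it on success.
func (s *RedfishService) collect() error {
	w, err := s.reader.ReadWatts()
	if err != nil {
		return err
	}
	s.mu.Lock()
	s.latest = w
	s.mu.Unlock()
	return nil
}

// Run samples immediately, then once per interval until ctx is cancelled.
func (s *RedfishService) Run(ctx context.Context) error {
	ticker := time.NewTicker(s.interval)
	defer ticker.Stop()
	for {
		_ = s.collect() // on error, keep serving the last good sample
		select {
		case <-ctx.Done():
			return nil
		case <-ticker.C:
		}
	}
}

// Latest returns the most recent successful reading in watts.
func (s *RedfishService) Latest() float64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.latest
}

// Shutdown releases the underlying BMC connection.
func (s *RedfishService) Shutdown() error { return s.reader.Close() }
```

Keeping the BMC query behind a narrow interface is one way to make the unit tests listed in the testing strategy independent of a real Redfish endpoint.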
136+
137+
## Configuration
138+
139+
### Kepler Configuration
140+
141+
Platform configuration integrates with existing Kepler config structure:
142+
143+
```go
type Platform struct {
	NodeID  string  `yaml:"nodeID"` // High-level node identifier
	Redfish Redfish `yaml:"redfish"`
}

type Redfish struct {
	Enabled    *bool  `yaml:"enabled"`
	ConfigFile string `yaml:"configFile"`
}
```
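
The `Enabled` field is a `*bool` rather than a `bool`, which presumably lets the configuration loader tell "not set" apart from an explicit `false`. A small accessor, sketched below, keeps that distinction out of calling code. `IsEnabled` is a hypothetical helper, and treating an unset value as disabled is an assumption based on the opt-in behavior this proposal describes.

```go
package main

// Redfish mirrors the struct from the proposal.
type Redfish struct {
	Enabled    *bool  `yaml:"enabled"`
	ConfigFile string `yaml:"configFile"`
}

// IsEnabled reports whether Redfish monitoring was explicitly turned on.
// A nil pointer (flag and YAML key both absent) counts as disabled.
// (Hypothetical helper; the default is an assumption.)
func (r Redfish) IsEnabled() bool {
	return r.Enabled != nil && *r.Enabled
}
```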
154+
155+
**CLI Flags:**
156+
157+
```bash
--platform.node-id=worker-node-1                    # Node identifier override
--platform.redfish.enabled=true                     # Enable Redfish monitoring
--platform.redfish.config=/etc/kepler/redfish.yaml  # BMC configuration file
```
162+
163+
### BMC Configuration File
164+
165+
Single configuration file maps nodes to BMCs (`/etc/kepler/redfish.yaml`):
166+
167+
```yaml
# Node identifier to BMC ID mapping
nodes:
  worker-node-1: bmc-1
  worker-node-2: bmc-2
  control-plane-1: control-bmc

# BMC connection details
bmcs:
  bmc-1:
    endpoint: "https://192.168.1.100"
    username: "admin"
    password: "secret123"
    insecure: true # Skip TLS verification

  bmc-2:
    endpoint: "https://192.168.1.101"
    username: "admin"
    password: "secret456"
    insecure: false # Verify TLS certificates

  control-bmc:
    endpoint: "https://192.168.1.102"
    username: "root"
    password: "admin123"
    insecure: false
```
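
The two-level lookup (node identifier to BMC ID, then BMC ID to connection details) might look like the sketch below. The type and method names are hypothetical, and YAML decoding is left out to keep the example free of third-party imports; the struct tags only indicate how the fields would line up with the file above.

```go
package main

import "fmt"

// BMCDetail holds the connection settings for one BMC (hypothetical type).
type BMCDetail struct {
	Endpoint string `yaml:"endpoint"`
	Username string `yaml:"username"`
	Password string `yaml:"password"`
	Insecure bool   `yaml:"insecure"`
}

// BMCConfig mirrors the top-level layout of the configuration file.
type BMCConfig struct {
	Nodes map[string]string    `yaml:"nodes"`
	BMCs  map[string]BMCDetail `yaml:"bmcs"`
}

// BMCForNode resolves a node identifier to its BMC connection details.
// Error messages name only the node and BMC IDs, never credentials.
func (c *BMCConfig) BMCForNode(nodeID string) (BMCDetail, error) {
	bmcID, ok := c.Nodes[nodeID]
	if !ok {
		return BMCDetail{}, fmt.Errorf("no BMC mapping for node %q", nodeID)
	}
	detail, ok := c.BMCs[bmcID]
	if !ok {
		return BMCDetail{}, fmt.Errorf("node %q maps to unknown BMC %q", nodeID, bmcID)
	}
	if detail.Endpoint == "" {
		return BMCDetail{}, fmt.Errorf("BMC %q has no endpoint", bmcID)
	}
	return detail, nil
}
```

Separating the node map from the BMC details means several node identifiers can point at the same BMC entry without repeating credentials.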
194+
195+
### Security Considerations
196+
197+
- Store BMC credentials in Kubernetes Secrets or in files readable only by the Kepler user (mode `0600`)
198+
- Never log credentials or include in error messages
199+
- Support both secure (TLS verified) and insecure connections
200+
- Implement proper session management and connection timeouts
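
One way to enforce the file-permission guidance at startup is to reject configuration files that grant any access to group or others. The helpers below are a sketch under that assumption; the function names are hypothetical and nothing in the proposal mandates this exact check.

```go
package main

import (
	"fmt"
	"os"
)

// checkCredentialFileMode rejects modes that grant any group or other access.
// (Hypothetical helper illustrating the 0600 recommendation.)
func checkCredentialFileMode(mode os.FileMode) error {
	if perm := mode.Perm(); perm&0o077 != 0 {
		return fmt.Errorf("credential file mode %04o is too permissive; expected 0600 or stricter", uint32(perm))
	}
	return nil
}

// checkCredentialFile stats the file at path and applies the mode check.
// The error includes the path but never the file's contents.
func checkCredentialFile(path string) error {
	info, err := os.Stat(path)
	if err != nil {
		return fmt.Errorf("cannot stat credential file %q: %w", path, err)
	}
	return checkCredentialFileMode(info.Mode())
}
```

A check like this would need to be relaxed or skipped for Kubernetes Secret volumes, whose projected files commonly carry a broader default mode unless `defaultMode` is set on the volume.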
201+
202+
## Deployment Examples
203+
204+
### Kubernetes Environment
205+
206+
Standard DaemonSet deployment with BMC configuration from secrets:
207+
208+
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kepler
spec:
  template:
    spec:
      containers:
        - name: kepler
          args:
            - --kube.enable=true
            - --kube.node-name=$(NODE_NAME)
            - --platform.redfish.enabled=true
            - --platform.redfish.config=/etc/kepler/redfish.yaml
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - name: redfish-config
              mountPath: /etc/kepler
              readOnly: true
      volumes:
        - name: redfish-config
          secret:
            secretName: redfish-config
```
237+
238+
### Standalone Deployment
239+
240+
```bash
# Create BMC configuration
cat > /etc/kepler/redfish.yaml <<EOF
nodes:
  $(hostname): local-bmc
bmcs:
  local-bmc:
    endpoint: "https://192.168.1.100"
    username: "admin"
    password: "secret123"
    insecure: true
EOF

# Run Kepler with Redfish support
./kepler --platform.redfish.enabled=true
```
256+
257+
## Testing Strategy
258+
259+
### Test Coverage
260+
261+
- **Unit Tests**: Service lifecycle, BMC resolution, configuration parsing
262+
- **Integration Tests**: End-to-end with Redfish simulator/emulator
263+
- **Vendor Testing**: Validation with Dell iDRAC, HPE iLO, Lenovo XCC
264+
- **Performance Testing**: Impact on Kepler resource consumption
265+
- **Security Testing**: Credential handling and TLS configuration
266+
267+
### Test Infrastructure
268+
269+
- Mock Redfish responses for unit testing
270+
- Redfish simulator for integration testing
271+
- Kubernetes test environments for DaemonSet validation
272+
273+
## Migration and Compatibility
274+
275+
### Backward Compatibility
276+
277+
- **No Breaking Changes**: Existing RAPL functionality remains unchanged
278+
- **Opt-in Feature**: Redfish support is disabled by default
279+
- **Configuration Isolation**: Platform configuration is separate from existing settings
280+
281+
### Migration Path
282+
283+
1. **Phase 1**: Deploy new Kepler version with Redfish support disabled
284+
2. **Phase 2**: Create BMC configuration files and secrets
285+
3. **Phase 3**: Enable Redfish monitoring on subset of nodes for testing
286+
4. **Phase 4**: Roll out to all nodes after validation
287+
288+
### Rollback Strategy
289+
290+
- Disable Redfish monitoring via configuration flag
291+
- Remove BMC configuration files
292+
- Kepler continues operating with RAPL-only monitoring
293+
294+
## Metrics Output
295+
296+
Platform power metrics complement existing CPU power metrics:
297+
298+
```prometheus
# New platform power metrics
kepler_node_platform_watts{source="redfish",node_name="worker-1"} 450.5
kepler_node_platform_joules_total{source="redfish",node_name="worker-1"} 123456.789

# Existing CPU power metrics (unchanged)
kepler_node_cpu_watts{zone="package",node_name="worker-1"} 125.2
kepler_node_cpu_joules_total{zone="package",node_name="worker-1"} 89234.567
```
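
The proposal does not say how `kepler_node_platform_joules_total` is derived. If the BMC only reports instantaneous watts, one plausible approach is to integrate successive readings over time. The sketch below uses the trapezoidal rule; `EnergyCounter` is a hypothetical type and this derivation is an assumption, not the documented design.

```go
package main

// EnergyCounter accumulates energy in joules from periodic power readings.
// (Hypothetical type; assumes the BMC exposes watts but no energy counter.)
type EnergyCounter struct {
	joules    float64
	lastWatts float64
	primed    bool
}

// Observe records a reading taken elapsedSeconds after the previous one
// and returns the running total. The first reading only primes the counter,
// since there is no earlier sample to integrate against.
func (e *EnergyCounter) Observe(watts, elapsedSeconds float64) float64 {
	if e.primed && elapsedSeconds > 0 {
		// Trapezoidal rule: average of the two readings times the interval.
		e.joules += (e.lastWatts + watts) / 2 * elapsedSeconds
	}
	e.lastWatts, e.primed = watts, true
	return e.joules
}
```

For example, a steady 100 W held for 2 seconds adds 200 J, and a ramp from 100 W to 300 W over the next second adds a further 200 J.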
307+
308+
## Implementation Plan
309+
310+
### Phase 1: Foundation
311+
312+
- Add Redfish dependencies and basic service structure
313+
- Implement configuration parsing and validation
314+
- Create BMC resolution logic with node identifier fallback
315+
316+
### Phase 2: Core Functionality
317+
318+
- Implement Redfish client integration using gofish library
319+
- Add power data collection from BMC endpoints
320+
- Create service interface for platform power access
321+
322+
### Phase 3: Metrics and Export
323+
324+
- Create platform power Prometheus collector that directly queries Redfish service
325+
- Add platform metrics to exporter registration
326+
- Validate metrics format and output
327+
328+
### Phase 4: Testing and Validation
329+
330+
- Comprehensive unit and integration testing
331+
- Multi-vendor BMC validation (Dell, HPE, Lenovo)
332+
- Kubernetes deployment testing
333+
334+
### Phase 5: Documentation and Release
335+
336+
- User documentation and deployment guides
337+
- Security best practices documentation
338+
- Migration guide for existing deployments
339+
340+
## Risks and Mitigations
341+
342+
### Technical Risks
343+
344+
- **BMC Connectivity Issues**: Mitigate with robust retry logic and circuit breaker patterns
345+
- **Vendor Compatibility**: Address through comprehensive testing with major BMC vendors
346+
- **Performance Impact**: Validate minimal resource overhead through performance testing
347+
- **Security Concerns**: Implement secure credential handling and TLS by default
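
The circuit breaker mentioned for BMC connectivity could be as simple as the poll-count-based sketch below: after a run of consecutive failures it skips a fixed number of polls, then lets one probe through. `Breaker`, `NewBreaker`, and both sentinel errors are hypothetical names, and counting polls instead of wall-clock time is a simplification chosen for a periodic collector. The type is not safe for concurrent use.

```go
package main

import "errors"

// ErrCircuitOpen is returned while the breaker is skipping requests.
var ErrCircuitOpen = errors.New("circuit open: skipping BMC request")

// ErrBMCUnavailable is an illustrative error a reader might return.
var ErrBMCUnavailable = errors.New("BMC unavailable")

// Breaker opens after threshold consecutive failures and skips the next
// cooldown calls before allowing a single probe. (Hypothetical type.)
type Breaker struct {
	threshold int
	cooldown  int
	failures  int
	skipsLeft int
}

// NewBreaker returns a closed breaker with the given limits.
func NewBreaker(threshold, cooldown int) *Breaker {
	return &Breaker{threshold: threshold, cooldown: cooldown}
}

// Do runs call unless the circuit is open. A successful call closes the
// circuit; a failed probe after the cooldown reopens it immediately.
func (b *Breaker) Do(call func() error) error {
	if b.skipsLeft > 0 {
		b.skipsLeft--
		return ErrCircuitOpen
	}
	if err := call(); err != nil {
		b.failures++
		if b.failures >= b.threshold {
			b.skipsLeft = b.cooldown
		}
		return err
	}
	b.failures = 0
	return nil
}
```

While the circuit is open, the collector can keep exporting RAPL metrics and simply omit the platform series, which matches the graceful-degradation goal listed under operational risks.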
348+
349+
### Operational Risks
350+
351+
- **Configuration Complexity**: Mitigate with clear documentation and examples
352+
- **Deployment Dependencies**: Provide fallback to RAPL-only operation when BMC unavailable
353+
- **Monitoring Gaps**: Ensure graceful degradation when platform power unavailable
354+
355+
## Success Metrics
356+
357+
- **Functional**: Platform power metrics available on 95%+ of nodes with BMCs
358+
- **Performance**: <2% overhead on Kepler CPU/memory usage
359+
- **Reliability**: <1% data collection failure rate under normal conditions
360+
- **Adoption**: Documentation enables successful deployment by operations teams
361+
362+
## Open Questions
363+
364+
1. **Multi-chassis Support**: How should Kepler handle servers that report power through multiple chassis or multiple power supplies?
365+
2. **Power Zones**: Should we expose chassis sub-component power (PSU, fans, storage)?
366+
367+
These questions will be addressed during implementation based on user feedback and technical constraints discovered during development.

docs/developer/proposal/index.md

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@ This directory contains Enhancement Proposals (EPs) for major features and chang
 | ID | Title | Status | Author | Created |
 |----|-------|--------|--------|---------|
 | [EP-000](EP_TEMPLATE.md) | Enhancement Proposal Template | Accepted |Sunil Thaha | 2025-01-18 |
+| [EP-001](EP_001-redfish-support.md) | Redfish Power Monitoring Support | Draft | Sunil Thaha | 2025-01-21 |

 ## Proposal Status