| 1 | +# EP-001: Redfish Power Monitoring Support |
| 2 | + |
| 3 | +**Status**: Draft |
| 4 | +**Author**: Sunil Thaha |
| 5 | +**Created**: 2025-01-18 |
| 6 | +**Last Updated**: 2025-01-18 |
| 7 | + |
| 8 | +## Summary |
| 9 | + |
| 10 | +This proposal adds Redfish BMC power monitoring support to Kepler, enabling collection |
| 11 | +of platform-level power consumption data from server BMCs. This complements existing |
| 12 | +RAPL CPU power monitoring and provides comprehensive server power visibility. |
| 13 | + |
| 14 | +## Problem Statement |
| 15 | + |
| 16 | +Currently, Kepler only measures CPU power consumption using Intel RAPL sensors. This |
| 17 | +provides incomplete power visibility as it doesn't account for: |
| 18 | + |
| 19 | +- **Platform Power**: Overall system power including PSU efficiency, cooling, storage, |
| 20 | + network interfaces, and other system components |
| 21 | +- **Multi-vendor Support**: RAPL is limited to x86 processors (Intel and recent AMD) and is not available on ARM systems |
| 22 | +- **BMC Integration**: Modern data centers use BMCs for server management but Kepler |
| 23 | + can't leverage these existing power monitoring capabilities |
| 24 | +- **Kubernetes Environments**: In containerized environments, understanding total node |
| 25 | + power consumption (not just CPU) is critical for resource allocation and cost attribution |
| 26 | + |
| 27 | +### Current Limitations |
| 28 | + |
| 29 | +1. **Incomplete Power Attribution**: Workloads are attributed only CPU power, missing |
| 30 | + significant power consumption from other components |
| 31 | +2. **Platform Blindness**: No visibility into overall server power consumption trends |
| 32 | +3. **Limited Hardware Support**: RAPL availability varies across processor generations and vendors |
| 33 | +4. **Manual Power Management**: No integration with existing BMC-based power monitoring infrastructure |
| 34 | + |
| 35 | +## Goals |
| 36 | + |
| 37 | +- **Primary**: Add Redfish BMC power monitoring capability to Kepler |
| 38 | +- **Multi-Environment Support**: Support Kubernetes, OpenStack bare metal, and standalone deployments |
| 39 | +- **Seamless Integration**: Integrate with existing Kepler architecture and service patterns |
| 40 | +- **Standard Metrics**: Provide platform power metrics via Prometheus following Kepler conventions |
| 41 | +- **Security**: Maintain security best practices for credential management |
| 42 | +- **Scalability**: Support multi-machine configurations efficiently |
| 43 | + |
| 44 | +## Non-Goals |
| 45 | + |
| 46 | +- Replace existing RAPL CPU power monitoring (complementary, not replacement) |
| 47 | +- Support non-Redfish BMC protocols (IPMI, proprietary APIs) in initial implementation |
| 48 | +- Implement power capping or control features (monitoring only) |
| 49 | +- Provide historical power data storage beyond Prometheus metrics |
| 50 | +- Support for edge devices or embedded systems without BMCs |
| 51 | + |
| 52 | +## Requirements |
| 53 | + |
| 54 | +### Functional Requirements |
| 55 | + |
| 56 | +- Use `github.com/stmcginnis/gofish` library for Redfish client functionality |
| 57 | +- Support node-specific BMC configuration lookup in multi-node environments |
| 58 | +- Implement standard Kepler service interfaces (Initializer, Runner, Shutdowner) |
| 59 | +- Generate `kepler_node_platform_watts{source="redfish"}` and |
| 60 | + `kepler_node_platform_joules_total{source="redfish"}` metrics |
| 61 | +- Follow Kepler's configuration patterns and coding conventions |
| 62 | +- Support both secure (TLS verified) and insecure BMC connections |
| 63 | + |
| 64 | +### Non-Functional Requirements |
| 65 | + |
| 66 | +- **Performance**: Minimal impact on Kepler's CPU and memory footprint |
| 67 | +- **Reliability**: Graceful handling of BMC connection failures |
| 68 | +- **Security**: Secure credential storage and transmission |
| 69 | +- **Maintainability**: Clean code following Go idioms and Kepler patterns |
| 70 | +- **Testability**: Comprehensive unit and integration test coverage |
| 71 | + |
| 72 | +## Proposed Solution |
| 73 | + |
| 74 | +### High-Level Architecture |
| 75 | + |
| 76 | +Add a new platform service layer to Kepler that collects power data from BMCs via |
| 77 | +Redfish and exposes it directly through Prometheus collectors, separate from CPU |
| 78 | +power attribution handled by PowerMonitor. |
| 79 | + |
| 80 | +```text |
| 81 | +┌─────────────────┐ ┌──────────────────┐ |
| 82 | +│ CPU Power │ │ Platform Power │ |
| 83 | +│ (RAPL) │ │ (Redfish) │ |
| 84 | +│ │ │ │ |
| 85 | +└─────────────────┘ └──────────────────┘ |
| 86 | + │ │ |
| 87 | + ▼ ▼ |
| 88 | +┌─────────────────┐ ┌──────────────────┐ |
| 89 | +│ Power Monitor │ │ Platform │ |
| 90 | +│ (Attribution) │ │ Collector │ |
| 91 | +└─────────────────┘ └──────────────────┘ |
| 92 | + │ │ |
| 93 | + └──────────┬─────────────┘ |
| 94 | + ▼ |
| 95 | + ┌──────────────────┐ |
| 96 | + │ Prometheus │ |
| 97 | + │ Exporter │ |
| 98 | + └──────────────────┘ |
| 99 | +``` |
| 100 | + |
| 101 | +### Node Identification Strategy |
| 102 | + |
| 103 | +The solution uses a flexible node identification approach that works across different environments: |
| 104 | + |
| 105 | +1. **CLI Override**: `--platform.node-id=my-node` (highest priority) |
| 106 | +2. **Kubernetes Node Name**: Uses existing `cfg.Kube.Node` from `--kube.node-name` |
| 107 | +3. **Hostname Fallback**: Uses `os.Hostname()` as last resort |
| 108 | + |
| 109 | +This allows a single BMC configuration file to map node identifiers to BMC configurations, |
| 110 | +eliminating the need for environment-specific configuration management. |
| 111 | + |
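The priority order above could be sketched roughly as follows. This is a minimal illustration, not the proposed implementation: `resolveNodeID` and `defaultHostname` are hypothetical names, and the hostname lookup is passed in as a function only so the fallback can be exercised without depending on the real host.

```go
package main

import (
	"errors"
	"os"
	"strings"
)

// resolveNodeID returns the node identifier using the priority order from
// the proposal: CLI override, then Kubernetes node name, then hostname.
// The function name and signature are assumptions made for this sketch.
func resolveNodeID(cliNodeID, kubeNodeName string, hostname func() (string, error)) (string, error) {
	// Explicit sources win, in order, when they are non-empty.
	for _, candidate := range []string{cliNodeID, kubeNodeName} {
		if id := strings.TrimSpace(candidate); id != "" {
			return id, nil
		}
	}
	// Last resort: ask the operating system.
	host, err := hostname()
	if err != nil {
		return "", err
	}
	if host = strings.TrimSpace(host); host == "" {
		return "", errors.New("no node identifier available")
	}
	return host, nil
}

// defaultHostname is the hostname source a real caller would pass in.
var defaultHostname = os.Hostname
```

A caller would pass the value of `--platform.node-id`, `cfg.Kube.Node`, and `defaultHostname`, then use the result as the key into the `nodes` map of the BMC configuration file.
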
| 112 | +## Detailed Design |
| 113 | + |
| 114 | +### Package Structure |
| 115 | + |
| 116 | +```text |
| 117 | +internal/ |
| 118 | +├── platform/ |
| 119 | +│ └── redfish/ |
| 120 | +│ ├── service.go # Main service implementation |
| 121 | +│ ├── config.go # Configuration parsing and validation |
| 122 | +│ ├── power_reader.go # Power data collection logic |
| 123 | +│ ├── client.go # Gofish client wrapper |
| 124 | +│ └── service_test.go # Unit tests |
| 125 | +└── exporter/prometheus/collector/ |
| 126 | + └── platform_collector.go # Platform power metrics collector |
| 127 | +``` |
| 128 | + |
| 129 | +### Service Interfaces |
| 130 | + |
| 131 | +The Redfish service implements standard Kepler service interfaces: |
| 132 | + |
| 133 | +- **`service.Initializer`**: Load configuration, resolve BMC, establish connection |
| 134 | +- **`service.Runner`**: Periodic power data collection with context cancellation |
| 135 | +- **`service.Shutdowner`**: Clean connection closure and resource cleanup |
| 136 | + |
| 137 | +## Configuration |
| 138 | + |
| 139 | +### Kepler Configuration |
| 140 | + |
| 141 | +Platform configuration integrates with existing Kepler config structure: |
| 142 | + |
| 143 | +```go |
| 144 | +type Platform struct { |
| 145 | + NodeID string `yaml:"nodeID"` // High-level node identifier |
| 146 | + Redfish Redfish `yaml:"redfish"` |
| 147 | +} |
| 148 | + |
| 149 | +type Redfish struct { |
| 150 | + Enabled *bool `yaml:"enabled"` |
| 151 | + ConfigFile string `yaml:"configFile"` |
| 152 | +} |
| 153 | +``` |
| 154 | + |
| 155 | +**CLI Flags:** |
| 156 | + |
| 157 | +```bash |
| 158 | +--platform.node-id=worker-node-1 # Node identifier override |
| 159 | +--platform.redfish.enabled=true # Enable Redfish monitoring |
| 160 | +--platform.redfish.config=/etc/kepler/redfish.yaml # BMC configuration file |
| 161 | +``` |
| 162 | + |
| 163 | +### BMC Configuration File |
| 164 | + |
| 165 | +A single configuration file maps nodes to BMCs (`/etc/kepler/redfish.yaml`): |
| 166 | + |
| 167 | +```yaml |
| 168 | +# Node identifier to BMC ID mapping |
| 169 | +nodes: |
| 170 | + worker-node-1: bmc-1 |
| 171 | + worker-node-2: bmc-2 |
| 172 | + control-plane-1: control-bmc |
| 173 | + |
| 174 | +# BMC connection details |
| 175 | +bmcs: |
| 176 | + bmc-1: |
| 177 | + endpoint: "https://192.168.1.100" |
| 178 | + username: "admin" |
| 179 | + password: "secret123" |
| 180 | + insecure: true # Skip TLS verification |
| 181 | + |
| 182 | + bmc-2: |
| 183 | + endpoint: "https://192.168.1.101" |
| 184 | + username: "admin" |
| 185 | + password: "secret456" |
| 186 | + insecure: false # Verify TLS certificates |
| 187 | + |
| 188 | + control-bmc: |
| 189 | + endpoint: "https://192.168.1.102" |
| 190 | + username: "root" |
| 191 | + password: "admin123" |
| 192 | + insecure: false |
| 193 | +``` |
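The two-level lookup implied by this file (node identifier to BMC ID, then BMC ID to connection details) might look like the sketch below. The type and method names (`BMCConfig`, `BMCDetail`, `BMCForNode`) are hypothetical, and YAML decoding is left out so the example depends only on the standard library.

```go
package main

import (
	"fmt"
	"net/url"
)

// BMCDetail mirrors one entry under `bmcs` in the configuration file.
type BMCDetail struct {
	Endpoint string
	Username string
	Password string
	Insecure bool
}

// BMCConfig mirrors the top-level structure of the configuration file.
type BMCConfig struct {
	Nodes map[string]string    // node identifier -> BMC ID
	BMCs  map[string]BMCDetail // BMC ID -> connection details
}

// BMCForNode resolves a node identifier to its BMC connection details.
// Error messages name the node and BMC ID but never include credentials.
func (c *BMCConfig) BMCForNode(nodeID string) (BMCDetail, error) {
	bmcID, ok := c.Nodes[nodeID]
	if !ok {
		return BMCDetail{}, fmt.Errorf("no BMC mapping for node %q", nodeID)
	}
	bmc, ok := c.BMCs[bmcID]
	if !ok {
		return BMCDetail{}, fmt.Errorf("node %q references unknown BMC %q", nodeID, bmcID)
	}
	// Require an absolute URL such as https://192.168.1.100.
	u, err := url.Parse(bmc.Endpoint)
	if err != nil || u.Scheme == "" || u.Host == "" {
		return BMCDetail{}, fmt.Errorf("BMC %q has an invalid endpoint", bmcID)
	}
	return bmc, nil
}
```

Validating the mapping at startup, rather than on first collection, would surface a dangling `nodes` entry before the service begins polling.
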
| 194 | + |
| 195 | +### Security Considerations |
| 196 | + |
| 197 | +- Store BMC credentials in Kubernetes secrets or secure files (mode `0600`) |
| 198 | +- Never log credentials or include them in error messages |
| 199 | +- Support both secure (TLS verified) and insecure connections |
| 200 | +- Implement proper session management and connection timeouts |
| 201 | + |
| 202 | +## Deployment Examples |
| 203 | + |
| 204 | +### Kubernetes Environment |
| 205 | + |
| 206 | +Standard DaemonSet deployment with BMC configuration from secrets: |
| 207 | + |
| 208 | +```yaml |
| 209 | +apiVersion: apps/v1 |
| 210 | +kind: DaemonSet |
| 211 | +metadata: |
| 212 | + name: kepler |
| 213 | +spec: |
| 214 | + template: |
| 215 | + spec: |
| 216 | + containers: |
| 217 | + - name: kepler |
| 218 | + args: |
| 219 | + - --kube.enable=true |
| 220 | + - --kube.node-name=$(NODE_NAME) |
| 221 | + - --platform.redfish.enabled=true |
| 222 | + - --platform.redfish.config=/etc/kepler/redfish.yaml |
| 223 | + env: |
| 224 | + - name: NODE_NAME |
| 225 | + valueFrom: |
| 226 | + fieldRef: |
| 227 | + fieldPath: spec.nodeName |
| 228 | + volumeMounts: |
| 229 | + - name: redfish-config |
| 230 | + mountPath: /etc/kepler |
| 231 | + readOnly: true |
| 232 | + volumes: |
| 233 | + - name: redfish-config |
| 234 | + secret: |
| 235 | + secretName: redfish-config |
| 236 | +``` |
| 237 | + |
| 238 | +### Standalone Deployment |
| 239 | + |
| 240 | +```bash |
| 241 | +# Create BMC configuration |
| 242 | +cat > /etc/kepler/redfish.yaml <<EOF |
| 243 | +nodes: |
| 244 | + $(hostname): local-bmc |
| 245 | +bmcs: |
| 246 | + local-bmc: |
| 247 | + endpoint: "https://192.168.1.100" |
| 248 | + username: "admin" |
| 249 | + password: "secret123" |
| 250 | + insecure: true |
| 251 | +EOF |
| 252 | + |
| 253 | +# Run Kepler with Redfish support |
| 254 | +./kepler --platform.redfish.enabled=true --platform.redfish.config=/etc/kepler/redfish.yaml |
| 255 | +``` |
| 256 | + |
| 257 | +## Testing Strategy |
| 258 | + |
| 259 | +### Test Coverage |
| 260 | + |
| 261 | +- **Unit Tests**: Service lifecycle, BMC resolution, configuration parsing |
| 262 | +- **Integration Tests**: End-to-end with Redfish simulator/emulator |
| 263 | +- **Vendor Testing**: Validation with Dell iDRAC, HPE iLO, Lenovo XCC |
| 264 | +- **Performance Testing**: Impact on Kepler resource consumption |
| 265 | +- **Security Testing**: Credential handling and TLS configuration |
| 266 | + |
| 267 | +### Test Infrastructure |
| 268 | + |
| 269 | +- Mock Redfish responses for unit testing |
| 270 | +- Redfish simulator for integration testing |
| 271 | +- Kubernetes test environments for DaemonSet validation |
| 272 | + |
| 273 | +## Migration and Compatibility |
| 274 | + |
| 275 | +### Backward Compatibility |
| 276 | + |
| 277 | +- **No Breaking Changes**: Existing RAPL functionality remains unchanged |
| 278 | +- **Opt-in Feature**: Redfish support is disabled by default |
| 279 | +- **Configuration Isolation**: Platform configuration is separate from existing settings |
| 280 | + |
| 281 | +### Migration Path |
| 282 | + |
| 283 | +1. **Phase 1**: Deploy new Kepler version with Redfish support disabled |
| 284 | +2. **Phase 2**: Create BMC configuration files and secrets |
| 285 | +3. **Phase 3**: Enable Redfish monitoring on subset of nodes for testing |
| 286 | +4. **Phase 4**: Roll out to all nodes after validation |
| 287 | + |
| 288 | +### Rollback Strategy |
| 289 | + |
| 290 | +- Disable Redfish monitoring via configuration flag |
| 291 | +- Remove BMC configuration files |
| 292 | +- Kepler continues operating with RAPL-only monitoring |
| 293 | + |
| 294 | +## Metrics Output |
| 295 | + |
| 296 | +Platform power metrics complement existing CPU power metrics: |
| 297 | + |
| 298 | +```prometheus |
| 299 | +# New platform power metrics |
| 300 | +kepler_node_platform_watts{source="redfish",node_name="worker-1"} 450.5 |
| 301 | +kepler_node_platform_joules_total{source="redfish",node_name="worker-1"} 123456.789 |
| 302 | + |
| 303 | +# Existing CPU power metrics (unchanged) |
| 304 | +kepler_node_cpu_watts{zone="package",node_name="worker-1"} 125.2 |
| 305 | +kepler_node_cpu_joules_total{zone="package",node_name="worker-1"} 89234.567 |
| 306 | +``` |
| 307 | + |
| 308 | +## Implementation Plan |
| 309 | + |
| 310 | +### Phase 1: Foundation |
| 311 | + |
| 312 | +- Add Redfish dependencies and basic service structure |
| 313 | +- Implement configuration parsing and validation |
| 314 | +- Create BMC resolution logic with node identifier fallback |
| 315 | + |
| 316 | +### Phase 2: Core Functionality |
| 317 | + |
| 318 | +- Implement Redfish client integration using gofish library |
| 319 | +- Add power data collection from BMC endpoints |
| 320 | +- Create service interface for platform power access |
| 321 | + |
| 322 | +### Phase 3: Metrics and Export |
| 323 | + |
| 324 | +- Create platform power Prometheus collector that directly queries Redfish service |
| 325 | +- Add platform metrics to exporter registration |
| 326 | +- Validate metrics format and output |
| 327 | + |
| 328 | +### Phase 4: Testing and Validation |
| 329 | + |
| 330 | +- Comprehensive unit and integration testing |
| 331 | +- Multi-vendor BMC validation (Dell, HPE, Lenovo) |
| 332 | +- Kubernetes deployment testing |
| 333 | + |
| 334 | +### Phase 5: Documentation and Release |
| 335 | + |
| 336 | +- User documentation and deployment guides |
| 337 | +- Security best practices documentation |
| 338 | +- Migration guide for existing deployments |
| 339 | + |
| 340 | +## Risks and Mitigations |
| 341 | + |
| 342 | +### Technical Risks |
| 343 | + |
| 344 | +- **BMC Connectivity Issues**: Mitigate with robust retry logic and circuit breaker patterns |
| 345 | +- **Vendor Compatibility**: Address through comprehensive testing with major BMC vendors |
| 346 | +- **Performance Impact**: Validate minimal resource overhead through performance testing |
| 347 | +- **Security Concerns**: Implement secure credential handling and TLS by default |
| 348 | + |
| 349 | +### Operational Risks |
| 350 | + |
| 351 | +- **Configuration Complexity**: Mitigate with clear documentation and examples |
| 352 | +- **Deployment Dependencies**: Provide fallback to RAPL-only operation when BMC unavailable |
| 353 | +- **Monitoring Gaps**: Ensure graceful degradation when platform power unavailable |
| 354 | + |
| 355 | +## Success Metrics |
| 356 | + |
| 357 | +- **Functional**: Platform power metrics available on 95%+ of nodes with BMCs |
| 358 | +- **Performance**: <2% overhead on Kepler CPU/memory usage |
| 359 | +- **Reliability**: <1% data collection failure rate under normal conditions |
| 360 | +- **Adoption**: Documentation enables successful deployment by operations teams |
| 361 | + |
| 362 | +## Open Questions |
| 363 | + |
| 364 | +1. **Multi-chassis Support**: How should Kepler handle servers that expose multiple chassis or multiple power supplies? |
| 365 | +2. **Power Zones**: Should we expose chassis sub-component power (PSU, fans, storage)? |
| 366 | + |
| 367 | +These questions will be addressed during implementation based on user feedback and technical constraints discovered during development. |