CA DRA: handle partitionable devices (KEP-4815) #8053

@towca

Which component are you using?:

/area cluster-autoscaler
/area core-autoscaler
/wg device-management

Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:

KEP-4815 adds support for partitionable devices to DRA. This means that the Devices exposed in ResourceSlices might "overlap", and allocating one Device might make other Devices unallocatable. The feature is behind a separate feature gate and went to alpha in 1.33.
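To make the "overlap" concrete, here is a minimal sketch in Go. It assumes a simplified model in which each partition advertised as a Device draws from a pool of counters shared with the other partitions of the same physical device. The `Counters` and `Device` types and the `canAllocate` and `allocate` helpers are hypothetical names made up for illustration. They are not the real DRA API types or the scheduler plugin's logic.

```go
package main

// Counters is a hypothetical stand-in for the capacity shared by all
// partitions of one physical device (e.g. memory slices of a GPU).
type Counters map[string]int64

// Device is a hypothetical, simplified partition: it only records how much
// of each shared counter it would consume when allocated.
type Device struct {
	Name     string
	Consumes Counters
}

// canAllocate reports whether dev still fits into the remaining counters.
func canAllocate(remaining Counters, dev Device) bool {
	for name, need := range dev.Consumes {
		if remaining[name] < need {
			return false
		}
	}
	return true
}

// allocate subtracts dev's consumption from remaining and reports success.
// It leaves remaining unchanged when dev does not fit.
func allocate(remaining Counters, dev Device) bool {
	if !canAllocate(remaining, dev) {
		return false
	}
	for name, need := range dev.Consumes {
		remaining[name] -= need
	}
	return true
}
```

In this model, a GPU with 8 memory slices could be advertised both as a whole-GPU Device consuming 8 slices and as half-GPU Devices consuming 4 slices each. Once one half is allocated, only 4 slices remain, so the whole-GPU Device can no longer be allocated even though it is still listed.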

Describe the solution you'd like.:

Cluster Autoscaler should be able to handle most of KEP-4815 out of the box, since all the additional partition-aware logic will be added to the DRA scheduler plugin that CA delegates to. However, there are some things we'll have to tackle:

  • The only part that won't work out of the box is calculating utilization for scale-down. The current logic assumes that all devices within a resource pool are identical, which isn't the case for partitioned devices. Issue #7781 ("CA DRA: review calculating Node utilization for DRA resources") tracks designing how to calculate utilization for DRA in general, which should cover partitionable devices. However, solving the full problem will require a KEP and might take some time. In the meantime, we might want to adapt the current logic to be partition-aware, if that's feasible.
  • We need to add integration tests for partitionable devices to static_autoscaler_dra_test.go.
  • We need to test autoscaling partitionable devices in a real cluster.
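
As a rough illustration of the utilization problem from the first bullet, the Go sketch below compares a count-based calculation (every device treated as identical) with one possible partition-aware alternative based on shared counters. All types and function names here are hypothetical and simplified. This is neither the current Cluster Autoscaler code nor the design that #7781 is expected to produce.

```go
package main

// Counters is a hypothetical stand-in for capacity shared by overlapping
// devices in one pool.
type Counters map[string]int64

// Device is a hypothetical, simplified device record.
type Device struct {
	Name      string
	Consumes  Counters
	Allocated bool
}

// countBasedUtilization treats every device as identical and returns
// allocated devices divided by total devices.
func countBasedUtilization(devices []Device) float64 {
	if len(devices) == 0 {
		return 0
	}
	allocated := 0
	for _, d := range devices {
		if d.Allocated {
			allocated++
		}
	}
	return float64(allocated) / float64(len(devices))
}

// counterBasedUtilization sums what the allocated devices consume and
// returns the highest used/capacity ratio across all counters.
func counterBasedUtilization(capacity Counters, devices []Device) float64 {
	used := Counters{}
	for _, d := range devices {
		if !d.Allocated {
			continue
		}
		for name, amount := range d.Consumes {
			used[name] += amount
		}
	}
	highest := 0.0
	for name, total := range capacity {
		if total <= 0 {
			continue // skip counters without capacity to avoid dividing by zero
		}
		if ratio := float64(used[name]) / float64(total); ratio > highest {
			highest = ratio
		}
	}
	return highest
}
```

Consider a pool that advertises one whole-GPU Device (8 memory slices) and two half-GPU Devices (4 slices each) on top of 8 physical slices. With only the whole-GPU Device allocated, the count-based result is 1/3, while the counter-based result is 1.0 because no capacity is left. Taking the maximum over counters is just one assumption about how to aggregate; other choices are possible.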
