Skip to content

Conversation

wenxuan0923
Copy link
Contributor

@wenxuan0923 wenxuan0923 commented Sep 30, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

Added Azure configuration environment variables to support autoscaling for managed (hosted-on-behalf-of) systempools. These systempools consist of mixed-SKU VM sizes and are hosted within an internal AKS tenant rather than the cluster’s subscription. Each SKU is registered with Cluster Autoscaler (CAS) as a distinct NodeGroup.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

azure: Add support for hosted-on-behalf-of systempool autoscaling

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area area/cluster-autoscaler area/provider/azure Issues or PRs related to azure provider needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 30, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @wenxuan0923. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Sep 30, 2025
@jackfrancis
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 30, 2025
// It can override the default public ARM endpoint for VMs pool scale operations.
ARMBaseURLForAPClient string `json:"armBaseURLForAPClient" yaml:"armBaseURLForAPClient"`

// Managed system pool configuration for automatic cluster.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we rename these to HoboSubscriptionID, HoboResourceGroup, and HoboResourceProxyURL to distinguish from the existing understanding of "managed" in the vanilla AKS sense?

(same comment for all the env vars, etc)

Copy link
Contributor Author

@wenxuan0923 wenxuan0923 Sep 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jackfrancis, I initially named them HoboXXX, but @tallaxes felt that it’s too specific to AKS internals so may not be ideal for open-source usage. That’s why I went with the ManagedXXX naming. I'm open to suggestions @tallaxes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree "Managed" could be confusing here. I was proposing something like "AlternativeResourceGroup" or "ResourceGroupOverride" to indicate the effect without constraining it to a single purpose (like Hobo). This still feels better to me than Hobo, but if we feel Hobo is clear, and we can't imagine other possible uses for these - I am ok with Hobo

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about just Hosted* for short?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like Hosted! I have updated the PR, let me know how you feel about it

}
}

// A proxy service is required to access resources for the managed system pool within automatic clusters.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

micro nit: let's move this up one block, before we check for the ExtendedLocation config, so that all of our "override from default azClientConfig property settings" code is organized together

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated!

header.Set("Target-Count", fmt.Sprintf("%d", count))
updateCtx = policy.WithHTTPHeader(updateCtx, header)
}
header := make(http.Header)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this change we appear to be setting HTTP headers that include the Target-Count key/value pair, and hoisting that into the context object that we send to the Azure API, for both self-hosted and AKS-managed. Whereas now we're only doing this in the AKS-managed case.

Why are we making this change to include this behavior for both cases going forward?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is because Hobo systempool will only have manual scale profile, a simple if-else check is no longer sufficient to distinguish between self-hosted and managed CAS.

I just realized I could add an explicit check for HOBO to make the logic clearer, so I’ve updated the code accordingly. Let me know if that makes sense. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, thanks!

@jackfrancis
Copy link
Contributor

/lgtm
/approve
/hold

for @tallaxes to have a look

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 1, 2025
@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Oct 1, 2025
@jackfrancis
Copy link
Contributor

/release-note-edit

azure: Add support for hosted-on-behalf-of systempool autoscaling

@wenxuan0923
Copy link
Contributor Author

@tallaxes please take a look when you get a chance! thank you!

// hosted CAS will be using Autoscale scale profile
// HostedSystem will be using manual scale profile
// Both of them need to set the Target-Count and SKU headers
if len(versionedAP.Properties.VirtualMachinesProfile.Scale.Autoscale) > 0 ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the new states that make us stop using else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is because Hobo systempool will only have manual scale profile, but its scaling request will be processed by NPS, so a simple if-else check on the scale profile type is no longer sufficient to distinguish between self-hosted and managed CAS.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is my understanding below correct?

Before:

  • Self-hosted: have manual scale profile: go to the if only
  • Managed: don't have manual scale profile: go the else only

Now:

  • Self-hosted: have manual scale profile: go to the if only
  • Managed: don't have manual scale profile AND have auto scale profile: go the second if only
  • HOBO: have manual scale profile and HostedSystem mode: go to both
    • If not introducing the new if, it would go to the if only, while we want it to go to the else as well due to it being managed


func newAzureCache(client *azClient, cacheTTL time.Duration, config Config) (*azureCache, error) {
nodeResourceGroup := config.ResourceGroup
if config.HostedResourceGroup != "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to in azure_config.go, do you mind adding a comment on the purpose of this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments added

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 6, 2025
@comtalyst
Copy link
Contributor

/label tide/merge-method-squash

@k8s-ci-robot k8s-ci-robot added the tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. label Oct 6, 2025
@comtalyst
Copy link
Contributor

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 6, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: comtalyst, jackfrancis, wenxuan0923

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jackfrancis
Copy link
Contributor

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 6, 2025
@k8s-ci-robot k8s-ci-robot merged commit 2944dcb into kubernetes:master Oct 6, 2025
8 checks passed
wenxuan0923 added a commit to wenxuan0923/autoscaler that referenced this pull request Oct 7, 2025
wenxuan0923 added a commit to wenxuan0923/autoscaler that referenced this pull request Oct 7, 2025
wenxuan0923 added a commit to wenxuan0923/autoscaler that referenced this pull request Oct 7, 2025
wenxuan0923 added a commit to wenxuan0923/autoscaler that referenced this pull request Oct 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cluster-autoscaler area/provider/azure Issues or PRs related to azure provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants