@kad (Contributor) commented Sep 15, 2025

This change introduces a more generic `CPUAffinity` property of `Process` to specify desired CPU affinities for the create, start, and exec operations.

As originally discussed in PR#1253, the existing implementation covers only the exec use case; setting affinity for OCI hooks and the initial container process as well would benefit a wider set of workloads.


Signed-off-by: Alexander Kanevskiy <alexander.kanevskiy@intel.com>
ranges. For example, `0-3,7` represents CPUs 0, 1, 2, 3, and 7. If omitted or empty for a particular [lifecycle](runtime.md#lifecycle) stage,
the runtime SHOULD NOT change the process's CPU affinity, and the affinity is determined by the Linux kernel.
The following properties are available:
* **`runtimeCreate`** (string, OPTIONAL) is a list of CPUs the runtime parent process should run on during the runtime creation stage,
Member:

shouldn't this just be configured by the process that runs the OCI runtime so that more of the startup runs on the correct CPUs?

Member:

For better or worse, this might be non-trivial to do if someone is using a tool that wraps an OCI runtime and doesn't provide a knob for it (basically the same argument as runtime hooks).

I think matching the lifecycle stages (which also match the hook names), as done here, is probably the simplest approach.

kad (Contributor, Author):

Theoretically it is possible, e.g. via systemd, to restrict the container runtime to run on a subset of CPUs, but this change gives the granularity to split CPU affinity between containerd/cri-o itself and per-container hook execution. In other words, it avoids affecting the higher-level runtime's CPU quota with operations that might be local to a subset of containers.
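For reference, the systemd-level restriction mentioned above looks roughly like the following drop-in. The unit name and path are illustrative, not from this PR; see `CPUAffinity=` in systemd.exec(5):

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/containerd.service.d/affinity.conf
[Service]
# Pin the whole container manager (and, by inheritance, everything it forks)
# to CPUs 0-3.
CPUAffinity=0-3
```

This pins the manager wholesale; the spec-level knob proposed in this PR is finer-grained, letting hook execution run on different CPUs than the manager itself.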

Regarding names: currently they are in the form "whereWhat", but OCI hooks are named "whatWhere". If we feel that the OCI hook naming is better, I can rename the proposed fields in this PR for more consistency with the hooks.

Member:

I think having the same names as the hooks would be preferable if only for consistency.

* **`containerCreate`** (string, OPTIONAL) is a list of CPUs the process should run on during the container creation stage,
after entering the container's namespaces but before the user process or any of [`createContainer` hooks](#createContainer-hooks) are executed.
* **`containerStart`** (string, OPTIONAL) is a list of CPUs the process should run on during the container start stage,
inside the container's namespace during `start` operation. The affinity should be applied before executing [`startContainer` hooks](#startContainer-hooks).
Member:

why do we need to differentiate between these two?

Member:

You could argue that the runtime should have a particular set of affinities which then should be adjusted to be more or less restrictive as late as possible to avoid latency on those cores.

I suspect most people wouldn't set both at the same time but the distinction does make some sense. Now, whether this should be done before or after applying seccomp rules is a different question...

kad (Contributor, Author):

> why do we need to differentiate between these two?

The difference is that `runc create` and `runc start` might be executed as different PIDs, and those PIDs would by default inherit CPU affinity from their parent process(es). Having the above in the spec makes sure that OCI hooks are executed (and their quota accounted) on the correct "infrastructure" CPUs.

@giuseppe (Member):

@kolyshkin PTAL

@cyphar (Member) commented Sep 15, 2025

Given issues like golang/sys#259, I think it might be nice to have a way to indicate "all CPUs" as a way to reset affinity (but without doing 0-1024, which is ~300x slower than the memset approach). In runc we implicitly do this now, but users might want to reset affinity for other stages explicitly.

@kad (Contributor, Author) commented Sep 15, 2025

> Given issues like golang/sys#259, I think it might be nice to have a way to indicate "all CPUs" as a way to reset affinity (but without doing 0-1024, which is ~300x slower than the memset approach). In runc we implicitly do this now, but users might want to reset affinity for other stages explicitly.

Theoretically, the higher-level runtime that generates the OCI spec might fill it with the right value based on detected system information, e.g. using the result of sched_getaffinity(2) of the parent process.

@haircommander:

fyi @bitoku

@@ -94,8 +94,12 @@ type Process struct {
SelinuxLabel string `json:"selinuxLabel,omitempty" platform:"linux"`
// IOPriority contains the I/O priority settings for the cgroup.
IOPriority *LinuxIOPriority `json:"ioPriority,omitempty" platform:"linux"`
// ExecCPUAffinity is Deprecated.
Contributor:

If you want various linters to warn about using deprecated functionality (and keep other linters happy that insist a description starts with the identifier), the deprecation notice should be added like this:

// ExecCPUAffinity specifies CPU affinity for exec processes.
//
// Deprecated: use [CPUAffinity] instead.

(I.e. after the description, on a separate paragraph, etc.)

@cyphar (Member) commented Sep 18, 2025

> @kad: theoretically, the higher-level runtime that generates the OCI spec might fill it with the right value based on detected system information, e.g. using the result of sched_getaffinity(2) of the parent process

My experience is that this rarely happens -- usually higher-level runtimes either hide new knobs like this (requiring you to patch config.json through experimental or unsupported hacks) or they transparently forward the values to the lower-level runtime without adding new functionality. Even if they do implement it, there is no guarantee that the behaviour or syntax will be standardised between runtimes, which leads to more problems than it solves.

Given that we have seen practical issues with container runtimes being spawned with suboptimal CPU affinity values, I would suggest that having an "all" or "max" special value would be a good idea.

Also you do not need to detect anything with sched_getaffinity(2) for this -- you just need to memset(&cpuset, 0xFF, sizeof(cpuset)) to reset the affinity to the maximum possible value. In fact, you don't want to use sched_getaffinity(2) or Go's runtime.NumCPU because they give you the current affinity, which is precisely the value you don't want.
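The memset trick translates directly to Go. A minimal Linux-only sketch using the raw syscall (an illustration of the technique, not runc's actual code; the kernel intersects the mask with the CPUs that exist, so an oversized all-ones mask is harmless):

```go
package main

import (
	"fmt"
	"syscall"
	"unsafe"
)

// resetAffinity resets the calling thread's CPU affinity to every possible
// CPU by passing an all-ones mask -- the moral equivalent of C's
// memset(&cpuset, 0xFF, sizeof(cpuset)). No need to query the CPU count
// first: bits for CPUs that do not exist are simply ignored by the kernel.
func resetAffinity() error {
	var mask [128]byte // room for 1024 CPUs
	for i := range mask {
		mask[i] = 0xFF
	}
	// sched_setaffinity(pid=0, sizeof(mask), &mask); pid 0 means "this
	// thread". In a real program, pin the goroutine to its OS thread with
	// runtime.LockOSThread before calling this.
	_, _, errno := syscall.RawSyscall(syscall.SYS_SCHED_SETAFFINITY,
		0, uintptr(len(mask)), uintptr(unsafe.Pointer(&mask[0])))
	if errno != 0 {
		return errno
	}
	return nil
}

func main() {
	fmt.Println("reset:", resetAffinity())
}
```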

will be affinitized to the specified list of CPUs.
* **`containerCreate`** (string, OPTIONAL) is a list of CPUs the process should run on during the container creation stage,
after entering the container's namespaces but before the user process or any of [`createContainer` hooks](#createContainer-hooks) are executed.
* **`containerStart`** (string, OPTIONAL) is a list of CPUs the process should be shoud run on during the container start stage,
Contributor:

Suggested change
* **`containerStart`** (string, OPTIONAL) is a list of CPUs the process should be shoud run on during the container start stage,
* **`containerStart`** (string, OPTIONAL) is a list of CPUs the process should be run on during the container start stage,

"CPUAffinity": {
    "runtimeCreate": "7",
    "runtimeExec": "7",
    "final": "0-3,7"
Contributor:

For completeness, should we also add containerStart to this example?

@cyphar (Member) left a comment:


Also, when it comes to formatting -- in the OCI specs we do not hard-wrap lines at N columns. Each complete sentence should be one line (this makes diffs easier to read).

* **`CPUAffinity`** (object, OPTIONAL) specifies the CPU affinity used to execute processes in the runtime and the container.
All the properties in this setting expect as an argument a string that contains a comma-separated list, with dashes to represent
ranges. For example, `0-3,7` represents CPUs 0, 1, 2, 3, and 7. If omitted or empty for a particular [lifecycle](runtime.md#lifecycle) stage,
the runtime SHOULD NOT change the process's CPU affinity, and the affinity is determined by the Linux kernel.
Member:

I think it would be very useful to have an `all` special value which indicates that the CPU affinity should be reset to include as many CPUs as possible.


I think "max" makes that clearer semantically.

Member:


I'm okay with either, but I guess you could say that "max" implies that it might not be all.
