Dmabuf IPC support for amdgpu plugin (v2) #2613

fdavid-amd · 2025-03-10T19:50:40Z

This patch set, in combination with the accompanying kernel patch set, is the
second draft of work to support amdgpu dmabuf IPC. Since the first post, the
patches have been split to make them more readable and made
backwards-compatible.

criu/files.c

criu/cr-dump.c

criu/servicefd.c

github-actions · 2025-04-29T00:09:16Z

A friendly reminder that this PR had no activity for 30 days.

avagin · 2025-05-19T06:29:58Z

@fdavid-amd how is it going? Let me know if you need help with this pr.

rst0git · 2025-06-10T15:03:57Z

plugins/amdgpu/criu-amdgpu.proto

+
 message criu_render_node {
 	required uint32 gpu_id = 1;
+	required uint32 id = 2;


It looks like these changes break compatibility - a newer version of CRIU would not be able to restore a checkpoint created with a previous version.

That is true - I wasn't aware that was a type of backwards-compatibility we committed to.

plugins/amdgpu/amdgpu_plugin_util.c

criu/include/criu-plugin.h

amdgpu represents allocated device memory as a memory mapping of the device file. This is a non-standard VMA that must be handled by the plugin, not the normal VMA code. Ignore all VMAs on device files. Signed-off-by: David Francis <David.Francis@amd.com>

amdgpu dmabuf CRIU requires the ability of the amdgpu plugin to retry. Change files_ext.c to read a response of 1 from a plugin restore function to mean retry. Signed-off-by: David Francis <David.Francis@amd.com>

fdavid-amd · 2025-07-08T18:45:37Z

@avagin
Sorry for leaving this untouched for so long, I was revising the kernel side of the changes. Could you take a look at this again?

fdavid-amd · 2025-07-15T18:35:42Z

Some of the build failures I don't understand - they might be an issue with this patch set or they might be an issue with the CI. There's a lot of 'Error: ipv6: address already assigned.', which could possibly be a result of the way that I'm allocating the plugin socket but seems unlikely.
The linter errors I do understand - they're flagging real grammar and syntax errors in the amdgpu-drm.h and drm.h files - but the point of those files is that they're exact copies of the kernel header files of the same names, so I don't want to change them.

fdavid-amd · 2025-07-15T18:51:16Z

@avagin sorry for the repeat ping, but there is something of a rush on my part to get kernel patch https://lists.freedesktop.org/archives/dri-devel/2025-July/514922.html merged in the next week or so (it adds the CHANGE_HANDLE ioctl). As part of merging that patch, I need to demonstrate that it has written and reasonably stable userspace code that uses it (which would be these CRIU patches). It would be a big help to me if you could say if the general approach taken by these patches is acceptable or unacceptable.
Thanks,
David

avagin · 2025-07-16T06:03:35Z

It would be a big help to me if you could say if the general approach taken by these patches is acceptable or unacceptable.

BTW we can merge the current version to the criu-dev branch and do all other adjustments on top of it.

fdavid-amd · 2025-07-16T14:08:22Z

This code is currently not functional because it relies on ioctls that are not merged into any kernel branch; is is still okay to merge?

rst0git · 2025-08-08T13:11:02Z

This code is currently not functional because it relies on ioctls that are not merged into any kernel branch; is is still okay to merge?

The way we usually handle this problem in CRIU is by checking if the required kernel functionality is available and disabling features that are not supported. However, it might be better to wait until the ioctls are merged.

For amdgpu plugin to call the new amdgpu drm CRIU ioctls, it needs the amdgpu drm header file, copied from the kernel's includes. Signed-off-by: David Francis <David.Francis@amd.com>

The amdgpu plugin usually calls drm ioctls through the libdrm wrappers. However, amdgpu restore requires dealing with dmabufs and gem handles directly, which means drm ioctls must be called directly. Add the drm.h header (from the kernel's uapi). Signed-off-by: David Francis <David.Francis@amd.com>

Buffer objects held by the amdgpu drm driver are checkpointed with the new BO_INFO and MAPPING_INFO ioctls/ioctl options. Handling is in amdgpu_plugin_drm.h Handling of imported buffer objects may require dmabuf fds to be transferred between processes. These occur over fdstore, with the handle-fstore id relationships kept in shread memory. There is a new plugin callback: RESTORE_INIT to create the shared memory. During checkpoint, track shared buffer objects, so that buffer objects that are shared across processes can be identified. During restore, track which buffer objects have been restored. Retry restore of a drm file if a buffer object is imported and the original has not been exported yet. Skip buffer objects that have already been completed or cannot be completed in the current restore. So drm code can use sdma_copy_bo, that function no longer requires kfd bo structs Update the protobuf messages with new amdgpu drm information. Signed-off-by: David Francis <David.Francis@amd.com>

The amdgpu plugin was counting how many files were checkpointed to determine when it should close the device files. The number of device files is not consistent; a process may have multiple copies of the drm device files open. Instead of doing this counting, add a new callback after all files are checkpointed, so plugins can clean up their resources at an appropriate time. Signed-off-by: David Francis <David.Francis@amd.com>

fdavid-amd · 2025-08-26T14:01:10Z

Some of these compilation tests seem to be failing at a install phase; is that something my code is doing or is it a problem with the test?

fdavid-amd · 2025-09-02T14:10:28Z

@avagin There's now an additional patch in this patch set (Support for Open dmabuf handles) to handle the case where a dmabuf fd is open during checkpoint. It involves some changes to core CRIU, what do you think of the approach?

Add new ioctl DRM_IOCTL_AMDGPU_GEM_LIST_HANDLES. This ioctl returns a list of bos with their handles, sizes, and flags and domains. This ioctl is meant to be used during CRIU checkpoint and provide information needed to reconstruct the bos in CRIU restore. Userspace for this and the next change can be found at checkpoint-restore/criu#2613 Signed-off-by: David Francis <David.Francis@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Add new GEM_OP_IOCTL option GET_MAPPING_INFO, which returns a list of mappings associated with a given bo, along with their positions and offsets. Userspace for this and the previous change can be found at: checkpoint-restore/criu#2613 Signed-off-by: David Francis <David.Francis@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

fdavid-amd · 2025-09-11T13:51:01Z

The kernel code this relies in is now merged (into Alex Deucher's topic branch, not an official kernel release), and the CRIU code does test for the presence of the kernel changes and fallback to the old way if they're not present.

…p.org/agd5f/linux into drm-next amd-drm-next-6.18-2025-09-09: amdgpu: - Add CRIU support for gem objects - SI UVD fix - SI DPM fixes - Misc code cleanups - RAS updates - GPUVM debugfs fixes - Cyan Skillfish updates - UserQ updates - OEM i2c fix - SMU 13.0.x updates - DPCD probe quirk fix - Make vbios build number available in sysfs - HDCP updates - Brightness curve fixes - eDP updates - Vblank fixes - DCN 3.5 PG fix - PBN calcution fix amdkfd: - Add CRIU support for gem objects - Flexible array fix - P2P topology fix - APU memlimit fixes - Misc code cleanups UAPI: - Add CRIU support for gem objects Proposed userspace: checkpoint-restore/criu#2613 radeon: - Use dev_warn_once() in CS parsers Signed-off-by: Dave Airlie <airlied@redhat.com> From: Alex Deucher <alexander.deucher@amd.com> Link: https://lore.kernel.org/r/20250909161928.942785-1-alexander.deucher@amd.com

fdavid-amd · 2025-09-12T13:10:33Z

It would be a big help to me if you could say if the general approach taken by these patches is acceptable or unacceptable.

BTW we can merge the current version to the criu-dev branch and do all other adjustments on top of it.

Given that the ioctls have been merged into at least one kernel branch, could this now happen?
I would appreciate if someone could review (Support for open dmabuf handles) since it makes changes to core CRIU.

rst0git · 2025-09-13T07:47:45Z

@fdavid-amd Thank you for the update!

I would appreciate if someone could review (Support for open dmabuf handles) since it makes changes to core CRIU.

Would it be possible to add a Signed-off-by line in the commit message?

rst0git · 2025-09-13T07:58:57Z

criu/files.c

-int do_dump_gen_file(struct fd_parms *p, int lfd, const struct fdtype_ops *ops, FdinfoEntry *e)
+/* Use "force" to override cache */
+int do_dump_gen_file(struct fd_parms *p, int lfd,
+					const struct fdtype_ops *ops, FdinfoEntry *e, bool force)


Would it be possible to provide more information about why do we need to "override cache"? For example, can you share with us an example where CRIU needs to handle open dmabuf file descriptors. We usually implement tests to very that such functionality works as expected.

CRIU has always dealt with dependencies between restoring files, but dmabuf fd adds dependencies between dumping files. A dmabuf fd can't be dumped until we've dumped the device file it was exported from. For this reason this patch set adds a retry system to file dumping. Files that are being retried should have the appropriate dump call made even if there is no new file descriptor for them.

I have implemented tests for internal testing, which can be found here. The DmabufDelayTest in this branch pauses while a dmabuf fd is active to trigger this code.
https://github.com/fdavid-amd/ROCT-Thunk-Interface-CRIU#

avagin · 2025-09-19T18:19:55Z

 criu/files-ext.c                      |   6 +++---
 criu/files.c                          |  60 ++++++++++++++++++++++++++++++++++++++++++++++--------------
 criu/include/files.h                  |   6 ++++--
 criu/sockets.c                        |   2 +-
 plugins/amdgpu/Makefile               |   2 +-
 plugins/amdgpu/amdgpu_plugin.c        |  32 ++++++++++++++++++++++----------
 plugins/amdgpu/amdgpu_plugin_dmabuf.c | 158 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 plugins/amdgpu/amdgpu_plugin_dmabuf.h |  13 +++++++++++++
 plugins/amdgpu/amdgpu_plugin_drm.c    |   7 ++++++-
 plugins/amdgpu/amdgpu_plugin_util.c   |  48 +++++++++++++++++++++++++++++++++++++++++-------
 plugins/amdgpu/amdgpu_plugin_util.h   |   8 +++++---
 plugins/amdgpu/criu-amdgpu.proto      |   4 ++++
 12 files changed, 304 insertions(+), 42 deletions(-)

I think you need to split this patch. Changes in the criu core part should be made in separate patches with clear descriptions why they are required.

avagin · 2025-09-19T18:25:57Z

criu/files.c

 			if (ret)
 				break;
+		}
+		/* Dmabuf FDs cannot be dumped unless we have first dumped the device file that exported them.


Could the proper device file already be closed? Could it be opened in another process?

I was under the impression that CRIU didn't close the files it was dumping with until later in the process.
The earlier patches in this pr allow the amdgpu plugin to find the dmabuf fd even if it was exported by another process.

avagin · 2025-09-19T18:28:55Z

criu/files.c

-			ret = dump_one_file(item->pid, dfds->fds[i + off], lfds[i], opts + i, ctl, &e, dfds);
+			ret = dump_one_file(item->pid, dfds->fds[i + off], lfds[i], opts + i, ctl, &e, dfds, false);
+			if (ret == -EAGAIN) {
+				retry_indices[retry_count++] = i;


I believe this addition is performed a few lines later

fdavid-amd · 2025-09-22T19:18:21Z

Patch has been split. The approach to dump retries does propagate these "force" variable all over the place, which is not the prettiest, but I can't think of a better way of doing it.

dmabuf fd files cannot be dumped until the device file they were exported from is dumped. This introduces the concept of dependencies bewteen dumping files. During checkpoint, track files that fail to dump, then retry them. A new flag "is_retry" is needed so these dump can occur on files that already received a file id. Signed-off-by: David Francis <David.Francis@amd.com>

amdgpu libraries that use dmabuf fd to share GPU memory between processes close the dmabuf fds immediately after using them. However, it is possible that checkpoint of a process catches one of the dmabuf fds open. In that case, the amdgpu plugin needs to handle it. The checkpoint of the dmabuf fd does require the device file it was exported from to have already been dumped To identify which device this dmabuf fd was exprted from, attempt to import it on each device, then record the dmabuf handle it imports as. This handle can be used to restore it. Signed-off-by: David Francis <David.Francis@amd.com>

github-advanced-security bot found potential problems Mar 10, 2025

View reviewed changes

criu/files.c Fixed Show fixed Hide fixed

avagin self-requested a review March 14, 2025 21:56

rst0git mentioned this pull request Mar 17, 2025

Proposal for support of dmabuf with CRIU #2581

Closed

fdavid-amd force-pushed the dmabuf-post branch from 50b59d4 to 5a23fce Compare March 28, 2025 14:36

avagin reviewed Mar 29, 2025

View reviewed changes

criu/files.c Outdated Show resolved Hide resolved

avagin reviewed Mar 29, 2025

View reviewed changes

criu/cr-dump.c Outdated Show resolved Hide resolved

avagin reviewed Mar 29, 2025

View reviewed changes

criu/servicefd.c Outdated Show resolved Hide resolved

avagin reviewed Mar 29, 2025

View reviewed changes

criu/servicefd.c Outdated Show resolved Hide resolved

github-actions bot added the stale-pr label Apr 29, 2025

avagin added no-auto-close Don't auto-close as a stale issue and removed stale-pr labels May 19, 2025

fdavid-amd force-pushed the dmabuf-post branch from 5a23fce to 88dfc49 Compare June 9, 2025 18:42

rst0git reviewed Jun 10, 2025

View reviewed changes

fdavid-amd force-pushed the dmabuf-post branch from 88dfc49 to a14b516 Compare June 17, 2025 13:47

dayatsin-amd suggested changes Jun 17, 2025

View reviewed changes

plugins/amdgpu/amdgpu_plugin_util.c Outdated Show resolved Hide resolved

plugins/amdgpu/amdgpu_plugin_util.c Outdated Show resolved Hide resolved

criu/include/criu-plugin.h Outdated Show resolved Hide resolved

fdavid-amd force-pushed the dmabuf-post branch 4 times, most recently from b057779 to 456978a Compare June 18, 2025 18:12

fdavid-amd added 2 commits June 19, 2025 15:31

files-ext: Allow plugin files to retry

bfeb5f2

amdgpu dmabuf CRIU requires the ability of the amdgpu plugin to retry. Change files_ext.c to read a response of 1 from a plugin restore function to mean retry. Signed-off-by: David Francis <David.Francis@amd.com>

fdavid-amd force-pushed the dmabuf-post branch 3 times, most recently from 45f796c to 878a313 Compare July 8, 2025 14:05

fdavid-amd force-pushed the dmabuf-post branch from 878a313 to a8acafe Compare July 15, 2025 18:38

fdavid-amd force-pushed the dmabuf-post branch 2 times, most recently from 3bace7e to 84fb396 Compare August 7, 2025 13:59

fdavid-amd force-pushed the dmabuf-post branch from 84fb396 to 1dc9e9a Compare August 15, 2025 14:00

fdavid-amd added 4 commits August 26, 2025 09:24

plugin/amdgpu: Add amdgpu drm header

1768541

For amdgpu plugin to call the new amdgpu drm CRIU ioctls, it needs the amdgpu drm header file, copied from the kernel's includes. Signed-off-by: David Francis <David.Francis@amd.com>

fdavid-amd force-pushed the dmabuf-post branch from 1dc9e9a to e90ffcd Compare August 26, 2025 13:43

fdavid-amd force-pushed the dmabuf-post branch from e90ffcd to a155b84 Compare September 3, 2025 18:38

rst0git reviewed Sep 13, 2025

View reviewed changes

fdavid-amd force-pushed the dmabuf-post branch from a155b84 to 1a5f191 Compare September 19, 2025 13:14

avagin reviewed Sep 19, 2025

View reviewed changes

fdavid-amd force-pushed the dmabuf-post branch from 1a5f191 to 87059a8 Compare September 22, 2025 19:17

fdavid-amd added 2 commits September 29, 2025 10:01

fdavid-amd force-pushed the dmabuf-post branch from 87059a8 to 3d40e7a Compare September 29, 2025 14:02

Dmabuf IPC support for amdgpu plugin (v2) #2613

Are you sure you want to change the base?

Dmabuf IPC support for amdgpu plugin (v2) #2613

Uh oh!

Conversation

fdavid-amd commented Mar 10, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Apr 29, 2025

Uh oh!

avagin commented May 19, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

fdavid-amd commented Jul 8, 2025

Uh oh!

fdavid-amd commented Jul 15, 2025

Uh oh!

fdavid-amd commented Jul 15, 2025

Uh oh!

avagin commented Jul 16, 2025

Uh oh!

fdavid-amd commented Jul 16, 2025

Uh oh!

rst0git commented Aug 8, 2025

Uh oh!

fdavid-amd commented Aug 26, 2025

Uh oh!

fdavid-amd commented Sep 2, 2025

Uh oh!

fdavid-amd commented Sep 11, 2025

Uh oh!

fdavid-amd commented Sep 12, 2025

Uh oh!

rst0git commented Sep 13, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

avagin commented Sep 19, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

fdavid-amd commented Sep 22, 2025

Uh oh!

Uh oh!