
qcow2 support instead of raw #986

Open
iwikus opened this issue Mar 27, 2025 · 11 comments
Labels
not-a-bug Not a bug

Comments

@iwikus

iwikus commented Mar 27, 2025

Using a raw block device as TPM state is blocking snapshots. Is it possible to add a qcow2 disk image as storage for the TPM state?
See https://bugzilla.proxmox.com/show_bug.cgi?id=4693

@stefanberger
Owner

stefanberger commented Mar 27, 2025

Using a raw block device as TPM state is blocking snapshots. Is it possible to add a qcow2 disk image as storage for the TPM state?

You cannot use a qcow2 image.
PS: Let me qualify. You cannot use a qcow2 image as a swtpm storage backend. However, when a snapshot occurs, the current state of the TPM goes into the qcow2 image. This happens right here: https://github.com/qemu/qemu/blob/master/backends/tpm/tpm_emulator.c#L912-L928

@iwikus
Author

iwikus commented Mar 27, 2025

However, when a snapshot occurs, the current state of the TPM goes into the qcow2 image

Into which image? I don't get it from that code.

@stefanberger
Owner

stefanberger commented Mar 27, 2025

However, when a snapshot occurs, the current state of the TPM goes into the qcow2 image

Into which image? I don't get it from that code.

Into the qcow2 image of the VM. All other QEMU devices' state is also stored there.

@iwikus
Author

iwikus commented Mar 27, 2025

A VM can use multiple qcow2 images. Is it using the first one? Or where is that defined? For example, PVE uses a qcow2 disk for the UEFI state when UEFI is in use. Can that disk be used?

@stefanberger
Owner

A VM can use multiple qcow2 images. Is it using the first one? Or where is that defined? For example, PVE uses a qcow2 disk for the UEFI state when UEFI is in use. Can that disk be used?

I don't know what the 'first' one is. It's the one that QEMU writes all the devices' state into -- maybe it's the first one passed on the command line, but I don't know. I doubt it's the UEFI qcow2 disk.

@iwikus
Author

iwikus commented Mar 27, 2025

Thank you, I was able to find this:
A VM snapshot is made of a VM state info (its size is shown in info snapshots) and a snapshot of every writable disk image. The VM state info is stored in the first qcow2 non removable and writable block device.
https://qemu-project.gitlab.io/qemu/system/images.html

@Fabian-Gruenbichler

as usual in the Qemu context, snapshot is a rather overloaded term.

PVE (which is a management stack similar to libvirtd, but with different features) supports two kinds of snapshots for VMs:

  • snapshots with state (similar to stock Qemu, but supports dumping the guest state into arbitrary, standalone volumes)
  • snapshots without state (guest state is not saved at all; the state of the VM is akin to what you'd get if you'd pulled the power plug at that point in time)

in both cases, we also need to snapshot the volumes backing the VM's disks on the storage level, including "special" volumes such as the one containing EFIvars and the one containing the TPM backing file. this is the part where swtpm only supporting directories or raw files becomes an issue - if its backing file is stored on a storage that supports snapshots of raw-formatted volumes (such as zvols for ZFS, rbd block devices for Ceph, a thin LV for LVM-thin) this works. but if it is stored as a plain raw file on a directory-based storage without snapshot capabilities, we cannot snapshot the VM anymore (but we could if the volume were qcow2-formatted, since that format supports snapshots as part of the file format).

there are some possible solutions for this:

  • allow swtpm to (optionally) use any block backend Qemu supports, which would unlock things like Qcow2 support for directory-based storages, librbd-based communication with Ceph clusters (kernel-based mapping has its own limitations), and similar things, by integrating it further into Qemu
  • add a wrapping layer between the actual backing file and what swtpm talks to (e.g., start qemu-storage-daemon talking to the qcow2-file/.., and let it expose an NBD block device for swtpm)
  • fall back to simply copying the tpmstate file if snapshots are not supported, since it is tiny anyway (this one would be solely on the management stack side)

obviously, all of these come with downsides/complexity ;) would you be open to exploring letting swtpm talk to a vhost-user-blk "device"? that might be a simple way of letting swtpm talk to arbitrary volumes supported by qemu (via qemu-storage-daemon). alternatively, qemu-storage-daemon's fuse or NBD export types could probably work transparently for swtpm?
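
for illustration, a rough sketch of the wrapping-layer idea with an NBD export (all paths, node names and image names below are made up; the swtpm side assumes a swtpm version with the linear "file" backend, i.e. --tpmstate backend-uri=...):

```sh
# sketch only: paths and node names are placeholders, not an existing setup.
# qemu-storage-daemon opens the qcow2 image and exports it over NBD on a UNIX socket.
qemu-storage-daemon \
  --blockdev driver=file,node-name=tpmfile,filename=/path/to/tpmstate.qcow2 \
  --blockdev driver=qcow2,node-name=tpmstate,file=tpmfile \
  --nbd-server addr.type=unix,addr.path=/run/tpmstate-nbd.sock \
  --export type=nbd,id=tpm-export,node-name=tpmstate,writable=on

# the export then has to be attached as a kernel block device before swtpm
# can use it, e.g. (if your nbd-client supports UNIX sockets):
#   nbd-client -unix /run/tpmstate-nbd.sock /dev/nbd0
#   swtpm socket --tpm2 --tpmstate backend-uri=file:///dev/nbd0 ...
# a simpler variant without the storage daemon is to attach the qcow2 file
# directly with qemu-nbd:
#   qemu-nbd --format=qcow2 --connect=/dev/nbd0 /path/to/tpmstate.qcow2
```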

@stefanberger
Owner

as usual in the Qemu context, snapshot is a rather overloaded term.

PVE (which is a management stack similar to libvirtd, but with different features) supports two kinds of snapshots for VMs:

* snapshots with state (similar to stock Qemu, but supports dumping the guest state into arbitrary, standalone volumes)

Similar but not the same? How do you extract the state of all other devices from QEMU? Or is it QEMU that stores the devices' state into the volumes?

* snapshots without state (guest state is not saved at all; the state of the VM is akin to what you'd get if you'd pulled the power plug at that point in time)

What does "the state of the VM is akin to what you'd get if you'd have pulled the power plug at that point in time" mean for resuming a snapshot? I don't understand from this how or when you store the state of devices.

in both cases, we also need to snapshot the volumes backing the VM's disks on the storage level, including "special" volumes such as the one containing EFIvars and the one containing the TPM backing file. this is the part where swtpm only supporting directories or raw files becomes an issue - if its backing file is stored on a storage that

How do you handle the states of all the other devices in this case? Is QEMU involved at all?

supports snapshots of raw-formatted volumes (such as zvols for ZFS, rbd block devices for Ceph, a thin LV for LVM-thin) this works. but if it is stored as a plain raw file on a directory-based storage without snapshot capabilities, we cannot snapshot the VM anymore (but could if the volume were qcow2-formatted, since that supports snapshots as part of the file format).

there are some possible solutions for this:

* allow swtpm to (optionally) use any block backend Qemu supports, which would unlock things like Qcow2 support for directory-based storages, librbd-based communication with Ceph clusters (kernel-based mapping has its own limitations), and similar things, by integrating it further into Qemu

* add a wrapping layer between the actual backing file and what swtpm talks to (e.g., start qemu-storage-daemon talking to the qcow2-file/.., and let it expose an NBD block device for swtpm)

* fall back to simply copying the tpmstate file if snapshots are not supported, since it is tiny anyway (this one would be solely on the management stack side)

obviously, all of these come with downsides/complexity ;) would you be open to exploring letting swtpm talk to a vhost-user-blk "device"? that might be a simple way of letting swtpm talk to arbitrary volumes supported by qemu (via qemu-storage-daemon). alternatively, qemu-storage-daemon's fuse or NBD export types could probably work transparently for swtpm?

swtpm writes plain files using the file backend into any filesystem, including a FUSE file system. I would certainly prefer a solution on the POSIX open/close/read/write level where swtpm doesn't need to integrate with all kinds of storage formats.
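
For reference, the usual plain-file setup looks roughly like this (paths are placeholders); swtpm only needs a directory it can open/read/write files in, which is why any mounted filesystem, including FUSE, works:

```sh
# Placeholder paths; swtpm keeps its state as plain files in the given directory.
mkdir -p /var/lib/swtpm/vm1
swtpm socket --tpm2 \
  --tpmstate dir=/var/lib/swtpm/vm1 \
  --ctrl type=unixio,path=/var/lib/swtpm/vm1/swtpm-sock &

# QEMU talks to swtpm over the control socket (TPM 2.0 with a TIS device shown;
# other VM options elided).
qemu-system-x86_64 ... \
  -chardev socket,id=chrtpm,path=/var/lib/swtpm/vm1/swtpm-sock \
  -tpmdev emulator,id=tpm0,chardev=chrtpm \
  -device tpm-tis,tpmdev=tpm0
```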

@Fabian-Gruenbichler

Fabian-Gruenbichler commented Mar 28, 2025

stock qemu uses snapshots by combining the qcow2 snapshot feature with the state-saving feature into a single interface (QMP's snapshot-save and friends).
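
roughly like this on the QMP side (a sketch with made-up node names and socket path; snapshot-save needs QEMU >= 6.0):

```sh
# made-up QMP socket path and node names; "disk0" must refer to a qcow2 node,
# since it receives both the internal disk snapshot and the saved VM state.
echo '{"execute": "qmp_capabilities"}
{"execute": "snapshot-save", "arguments": {"job-id": "snapjob0", "tag": "snap0", "vmstate": "disk0", "devices": ["disk0"]}}' \
  | socat - UNIX-CONNECT:/run/qemu/vm1-qmp.sock
```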

PVE instead splits snapshots into two components:

  • storage level snapshot of the VM's disks (not involving Qemu, other than guest filesystem freezing or suspending the VM to ensure consistency)
  • optional saving of the guest state into a separate volume (this is achieved by a custom patch introducing new savevm-start and similar QMP commands; it uses the same mechanism that stock Qemu's snapshot or live-migration features use)

when a snapshot without state is created, just the virtual disk contents are snapshotted on the storage layer. if you roll back to such a snapshot, the resulting state of the VM will be as if you'd pulled the (virtual) power plug of the VM at the time the snapshot was taken.

when a snapshot with state is created, the virtual disks' contents are snapshotted on the storage layer and a snapshot of the guest state is saved using the mechanism described above (with some extra steps to make them line up consistency-wise). if you roll back to such a snapshot, the first boot of the VM will load the state volume and resume execution, just like stock qemu would load it from a qcow2 volume.

this gives us more flexibility for snapshotting, as we are not limited to using storages that support qcow2, and it allows users to choose which kind of snapshot semantics they want. the downside is that if a volume is on a storage not supporting snapshots, no snapshots can be taken.

I am not sure how it works with stock Qemu, if I do the following sequence of events:

  • start VM
  • setup vTPM (these changes are persisted to the TPM state file)
  • snapshot VM (this saves the runtime state into the state part of some qcow2 file)
  • change vTPM contents from guest (these changes are persisted to the TPM state file)
  • rollback to snapshot (loads runtime state from qcow2, but on-disk TPM state file has different state??)

swtpm writes plain files using the file backend into any filesystem, including a FUSE file system. I would certainly prefer a solution on the POSIX open/close/read/write level where swtpm doesn't need to integrate with all kinds of storage formats.

this is understandable - AFAICT there is no (public) library interface anyway for re-using the Qemu block layer; we integrate our binaries that need this into our Qemu build for this reason. I guess we will play around with the storage daemon approach or implement juggling multiple raw state volumes in our management stack.

@stefanberger
Owner

stock qemu uses snapshots by combining the qcow2 snapshot feature with the state-saving feature into a single interface (QMP's snapshot-save and friends).

PVE instead splits snapshots into two components:

* storage level snapshot of the VM's disks (not involving Qemu, other than guest filesystem freezing or suspending the VM to ensure consistency)

* optional saving of the guest state into a separate volume (this is achieved by [a custom patch](https://git.proxmox.com/?p=pve-qemu.git;a=blob;f=debian/patches/pve/0017-PVE-add-savevm-async-for-background-state-snapshots.patch;h=e558da66aea5763f872c5d475370e25168d02ddc;hb=HEAD) introducing new `savevm-start` and similar QMP commands; it uses the same mechanism that stock Qemu's snapshot or live-migration features use)

when a snapshot without state is created, just the virtual disk contents are snapshotted on the storage layer. if you roll back to such a snapshot, the resulting state of the VM will be as if you'd pulled the (virtual) power plug of the VM at the time the snapshot was taken.

I see what you mean. Swtpm does not have support for this. QEMU's swtpm support keeps the hardware interface state (CRB or TIS) and the swtpm state together with all other devices' state -- this supports QEMU-style snapshotting, including VM save/restore. In your case you would just want the swtpm state to be saved somewhere.

when a snapshot with state is created, the virtual disks' contents are snapshotted on the storage layer and a snapshot of the guest state is saved using the mechanism described above (with some extra steps to make them line up consistency-wise). if you roll back to such a snapshot, the first boot of the VM will load the state volume and resume execution, just like stock qemu would load it from a qcow2 volume.

this gives us more flexibility for snapshotting, as we are not limited to using storages that support qcow2, and it allows users to choose which kind of snapshot semantics they want. the downside is that if a volume is on a storage not supporting snapshots, no snapshots can be taken.

I am not sure how it works with stock Qemu, if I do the following sequence of events:

* start VM

* setup vTPM (these changes are persisted to the TPM state file)

swtpm_setup runs and creates the initial state; if it weren't run, swtpm itself would create the initial state.

* snapshot VM (this saves the runtime state into the state part of some qcow2 file)

The state of the vTPM is pulled out of swtpm on .pre_save: https://github.com/qemu/qemu/blob/master/backends/tpm/tpm_emulator.c#L785-L809

* change vTPM contents from guest (these changes are persisted to the TPM state file)

correct

* rollback to snapshot (loads runtime state from qcow2, but on-disk TPM state file has different state??)

The state of the vTPM is pushed back into swtpm on .post_load before VM resumes: https://github.com/qemu/qemu/blob/master/backends/tpm/tpm_emulator.c#L883-L910
The pushing back of state into swtpm will have it overwrite any TPM state file. It will also restore the volatile (non-permanent) state of the vTPM.

swtpm writes plain files using the file backend into any filesystem, including a FUSE file system. I would certainly prefer a solution on the POSIX open/close/read/write level where swtpm doesn't need to integrate with all kinds of storage formats.

this is understandable - AFAICT there is no (public) library interface anyway for re-using the Qemu block layer; we integrate our binaries that need this into our Qemu build for this reason. I guess we will play around with the storage daemon approach or implement juggling multiple raw state volumes in our management stack.

@Fabian-Gruenbichler

The state of the vTPM is pushed back into swtpm on .post_load before VM resumes: https://github.com/qemu/qemu/blob/master/backends/tpm/tpm_emulator.c#L883-L910
The pushing back of state into swtpm will have it overwrite any TPM state file. It will also restore the volatile (non-permanent) state of the vTPM.

that makes a lot of sense :) so we could effectively also drop snapshotting of the state volume for snapshots with RAM/state, if we don't care about losing the option of downgrading such a snapshot to a storage-only one (e.g., if the state is corrupt). but since we want to support both anyway, we will explore the FUSE variant as plan A!
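
(for the record, a rough sketch of what that FUSE variant could look like; the image path and mountpoint are made up, and the swtpm side again assumes the linear file backend:)

```sh
# made-up paths. the FUSE export makes the qcow2-backed disk appear as a plain
# file at an existing regular file used as mountpoint, which swtpm can then
# open like any other file.
touch /run/tpmstate-vm1.raw
qemu-storage-daemon \
  --blockdev driver=file,node-name=tpmfile,filename=/path/to/tpmstate.qcow2 \
  --blockdev driver=qcow2,node-name=tpmstate,file=tpmfile \
  --export type=fuse,id=tpm-fuse,node-name=tpmstate,mountpoint=/run/tpmstate-vm1.raw,writable=on

# then, e.g.:
#   swtpm socket --tpm2 --tpmstate backend-uri=file:///run/tpmstate-vm1.raw ...
```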

stefanberger added the not-a-bug label on Apr 28, 2025