
Conversation

@robn robn (Member) commented Sep 10, 2025

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

I had a surprising crash in a client project and traced it to an A-B-A problem in vdev_disk, where the vdev could be closed and the private data, including an rwlock, destroyed while the lock was still held.

IO is issued asynchronously. If the kernel deschedules the initiating task, and the IO is then scheduled and completed, and completion goes on to call vdev_disk_close() and close the vdev before the initiating task runs again, then the vdev_disk_t in vdev_tsd will have been destroyed, leaving the caller's vd pointer invalid.

I have not been able to trigger this in stock OpenZFS, but I can easily in my client project. Since I hope to upstream that work soon, fixing this becomes important.

Description

Take the write lock in vdev_disk_close(). If any read locks are outstanding, the closing task will sleep, giving the issuing tasks an opportunity to release their read locks. Once they have cleared, vdev_disk_close() will acquire the write lock and can safely tear down and NULL out vdev_tsd.

I don't think it's possible for this to deadlock, because the read lock should always be taken in an "issue" task, while the completions will redispatch the zio to an "interrupt" task.
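
As a rough illustration, here is a minimal sketch of the shape of the change, not the exact diff; the block device release and error paths are elided, and the SPL rwlock primitives are used as in the existing code:

    static void
    vdev_disk_close(vdev_t *v)
    {
            vdev_disk_t *vd = v->vdev_tsd;

            if (vd == NULL)
                    return;

            /*
             * Sleep until every in-flight vdev_disk_io_start() has
             * dropped its read lock, then hold the write lock across
             * the teardown.
             */
            rw_enter(&vd->vd_lock, RW_WRITER);

            /* ... release the underlying block device ... */

            rw_exit(&vd->vd_lock);
            rw_destroy(&vd->vd_lock);
            kmem_free(vd, sizeof (vdev_disk_t));
            v->vdev_tsd = NULL;
    }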

How Has This Been Tested?

ZTS run completed. In my own code, this fixes my crash.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@amotin amotin (Member) left a comment

I may be wrong, but shouldn't the config lock prevent a vdev close while I/Os are running? Considering that vdev_disk_io_start() reads vdev_tsd before taking vd_lock, I am not sure this patch completely closes whatever race is there, unless I misunderstand it.

@behlendorf behlendorf added the "Status: Code Review Needed" (Ready for review and testing) label Sep 10, 2025
@robn robn (Member, Author) commented Sep 11, 2025

Sorry, I didn't really explain it at all; I shouldn't post PRs when sleepy :)

I believe it looks roughly like this:

ISSUE taskq thread:

  zio_vdev_io_start(zio)
    ...
    vd->vdev_ops->vdev_op_io_start(zio)
    vdev_disk_io_start(zio)
      vd = zio->io_vd->vdev_tsd
      [vd NULL check: it _can_ be NULL if closed before we get here]
      rw_enter(&vd->vd_lock, RW_READER)
      ...
      vdev_disk_io_<somekind>(zio)
        ...
        bio->bi_end_io = vdev_disk_io_<somekind>_completion;
        bio->bi_private = zio;
        ...
        vdev_submit_bio(bio)

[kernel preempts ISSUE thread with vd_lock held]

[IO completes; kernel calls bi_end_io in interrupt context]

  vdev_disk_io_<somekind>_completion(bio)
    zio = bio->bi_private;
    bio_put(bio);
    zio_interrupt(zio);

INTR taskq thread:

  zio_vdev_io_done(zio)
  [IO pipeline proceeds]
  zio_done(zio)

[Time passes. IO tree completes. Someone somewhere calls vdev_close() or vdev_reopen()]

  vdev_(close|reopen)(v)
    ...
    v->vdev_ops->vdev_op_close(v)
    vdev_disk_close(v)
      vd = v->vdev_tsd
      ...
      rw_destroy(&vd->vd_lock)  [ASSERT! vd->vd_lock still held by sleeping ISSUE thread]
      v->vdev_tsd = NULL
      ...

[Time passes. kernel wakes ISSUE thread]

  [return to vdev_disk_io_start(zio)]
  [vd still points to old tsd structure]
  rw_exit(&vd->vd_lock)         [oh no]

With #17718, the assert in rw_destroy() fires. Without it, or on a production build, we end up trying to release a lock through a stale pointer.

Now, I am not sure exactly what circumstances would cause the ISSUE thread to sleep for that long. In my case it happened reliably in zpool_create_008_pos, running zpool create -f testpool1 loop1 spare loop2, while testing a patch series that emits a FLUSH at the end of every IO "tree" (my white whale project). The crash made it pretty easy to see where things were going wrong, and printing the pointers on either side of the call to vdev_disk_io_flush() proved it:

[    4.084620] NOTICE: 979 flush v ffff8d31d26e4000 vd ffff8d31d1933400 tsd ffff8d31d1933400 guid 16856970401076039102

[ vdev_disk_io_flush(...) ]

[    4.085465] NOTICE: 979 flushed v ffff8d31d26e4000 vd ffff8d31d1933400 tsd 0000000000000000 guid 18446744072641211488 err 0
[    4.085533] VERIFY(RW_READ_HELD(&vd->vd_lock)) failed
[    4.085605] PANIC at vdev_disk.c:1164:vdev_disk_io_start()

With #17718 in place, the crash changed to:

[    3.788436] VERIFY(!(RW_LOCK_HELD(&vd->vd_lock))) failed
[    3.788521] PANIC at vdev_disk.c:478:vdev_disk_close()

I don't know if there's any particular reason for the (relatively) long sleep; possibly something to do with the small number of flush taskq threads, or just the workload difference.

But looking closer, it seems this isn't just my code: all IO types theoretically have this problem. We even already sort of knew the shape of it; see the comment at the bottom of vbio_submit().

So anyway, I thought about reloading vd from v->vdev_tsd, but then I saw the NULL check: the tsd might already have been torn down by then, and even worse, the zio itself might already be dead. So there's no safe way to get back to the lock from the issuing side.

And that means ensuring the lock survives until we're done with it, and the simplest way to do that seemed to be to take the write lock in vdev_disk_close(). It will block until all this other stuff returns, and only then NULL out vdev_tsd, so any future IO on that vdev shouldn't see the stale pointer. And there shouldn't be any incoming IO at that point anyway because of the config locks, I think. If there is, that seems like something to fix elsewhere, but I'll wait until I see it because my head already hurts :)
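
For reference, a simplified sketch of the issue-side window this protects, based on my reading of the trace above (the bio construction and submission are elided, and the ENXIO handling reflects the existing NULL check as I understand it):

    static void
    vdev_disk_io_start(zio_t *zio)
    {
            vdev_t *v = zio->io_vd;
            vdev_disk_t *vd = v->vdev_tsd;

            /* Already closed: nothing left to take a read lock on. */
            if (vd == NULL) {
                    zio->io_error = SET_ERROR(ENXIO);
                    zio_interrupt(zio);
                    return;
            }

            rw_enter(&vd->vd_lock, RW_READER);

            /*
             * Build and submit the bio. The zio may complete, and the
             * vdev may be asked to close, before submission returns, so
             * vd must remain valid until the rw_exit() below.
             */
            /* ... */

            rw_exit(&vd->vd_lock);
    }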

Many IO operations are submitted to the kernel asynchronously, so the zio can complete, and follow-up actions run, before the submission call returns. If one of those follow-up actions closes the disk (eg during pool create/import), the initiator may be left holding a lock on the disk at destruction.

Instead, take the write lock before finishing up and decoupling the disk state from the vdev proper. The close will then block until all IO is submitted and the locks are released.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
@robn robn force-pushed the vdev-disk-close-lock branch from 3d97062 to 21451df on September 11, 2025 07:36
@amotin amotin (Member) left a comment

OK, I guess that makes sense.

@amotin amotin (Member) commented Sep 11, 2025

I haven't looked deeply at the code, but as an alternative thought: do we really need to hold the lock while requests are issued, or can we rely on some other means to be sure the device is still there? FreeBSD's vdev_geom_io_start() happily has g_io_request(bp, cp); as its last call and does not use any locks there.

@behlendorf behlendorf added the "Status: Accepted" (Ready to integrate: reviewed, tested) label and removed the "Status: Code Review Needed" (Ready for review and testing) label Sep 12, 2025
@behlendorf behlendorf merged commit f319ff3 into openzfs:master Sep 15, 2025
22 of 25 checks passed
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Sep 15, 2025
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes openzfs#17719