
Conversation

@robn robn (Member) commented Sep 10, 2025

[Sponsors: Klara, Inc., Wasabi Technology, Inc.]

Motivation and Context

I had a surprising crash in a client project and traced it to an A-B-A problem in vdev_disk, where the vdev could be closed and the private data, including an rwlock, destroyed while the lock was still held.

IO is issued asynchronously. If the kernel deschedules the initiating task, and the IO is then scheduled and completed, and completion goes on to call vdev_disk_close() and close the vdev before the initiating task runs again, then the vdev_disk_t in vdev_tsd will have been destroyed, leaving the caller's vd pointer invalid.

I have not been able to trigger this in stock OpenZFS, but I can easily in my client project. Since I hope to upstream that work soon, fixing this becomes important.

Description

Take the write lock in vdev_disk_close(). If any read locks are outstanding, the closing task will sleep, giving the issuing tasks an opportunity to release their read locks. Once they have cleared, vdev_disk_close() will acquire the write lock and can safely tear down and NULL out vdev_tsd.

I don't think it's possible for this to deadlock, because the read lock should always be taken in an "issue" task, while the completions will redispatch the zio to an "interrupt" task.
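
As a rough illustration, here is a minimal sketch of the shape of the change, not the exact diff; the block device release and error paths are elided, and the SPL rwlock primitives are used as in the existing code:

    static void
    vdev_disk_close(vdev_t *v)
    {
            vdev_disk_t *vd = v->vdev_tsd;

            if (vd == NULL)
                    return;

            /*
             * Sleep until every in-flight vdev_disk_io_start() has
             * dropped its read lock, then hold the write lock across
             * the teardown.
             */
            rw_enter(&vd->vd_lock, RW_WRITER);

            /* ... release the underlying block device ... */

            rw_exit(&vd->vd_lock);
            rw_destroy(&vd->vd_lock);
            kmem_free(vd, sizeof (vdev_disk_t));
            v->vdev_tsd = NULL;
    }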

How Has This Been Tested?

ZTS run completed. In my own code, this fixes my crash.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)


@amotin amotin (Member) left a comment

I may be wrong, but shouldn't the config lock prevent a vdev close while I/Os are running? Considering that vdev_disk_io_start() reads vdev_tsd before taking vd_lock, I am not sure this patch completely closes whatever race is there, unless I misunderstand it.

@behlendorf behlendorf added the "Status: Code Review Needed" (Ready for review and testing) label Sep 10, 2025
@robn robn (Member, Author) commented Sep 11, 2025

Sorry, I didn't really explain it at all; I shouldn't post PRs when sleepy :)

I believe it looks roughly like this:

ISSUE taskq thread:

  zio_vdev_io_start(zio)
    ...
    vd->vdev_ops->vdev_op_io_start(zio)
    vdev_disk_io_start(zio)
      vd = zio->io_vd->vdev_tsd
      [vd NULL check: it _can_ be NULL if closed before we get here]
      rw_enter(&vd->vd_lock, RW_READER)
      ...
      vdev_disk_io_<somekind>(zio)
        ...
        bio->bi_end_io = vdev_disk_io_<somekind>_completion;
        bio->bi_private = zio;
        ...
        vdev_submit_bio(bio)

[kernel preempts ISSUE thread with vd_lock held]

[IO completes; kernel calls bi_end_io in interrupt context]

  vdev_disk_io_<somekind>_completion(bio)
    zio = bio->bi_private;
    bio_put(bio);
    zio_interrupt(zio);

INTR taskq thread:

  zio_vdev_io_done(zio)
  [IO pipeline proceeds]
  zio_done(zio)

[Time passes. IO tree completes. Someone somewhere calls vdev_close() or vdev_reopen()]

  vdev_(close|reopen)(v)
    ...
    v->vdev_ops->vdev_op_close(v)
    vdev_disk_close(v)
      vd = v->vdev_tsd
      ...
      rw_destroy(&vd->vd_lock)  [ASSERT! vd->vd_lock still held by sleeping ISSUE thread]
      v->vdev_tsd = NULL
      ...

[Time passes. kernel wakes ISSUE thread]

  [return to vdev_disk_io_start(zio)]
  [vd still points to old tsd structure]
  rw_exit(&vd->vd_lock)         [oh no]

With #17718, the assert in rw_destroy() fires. Without it, or on a production build, we end up trying to release a lock through a stale pointer.

Now, I am not sure exactly what circumstances would cause the ISSUE thread to sleep for that long. In my case it happened reliably in zpool_create_008_pos, running zpool create -f testpool1 loop1 spare loop2, while testing a patch series that emits a FLUSH at the end of every IO "tree" (my white whale project). The crash made it pretty easy to see where things were going wrong, and printing the pointers on either side of the call to vdev_disk_io_flush() proved it:

[    4.084620] NOTICE: 979 flush v ffff8d31d26e4000 vd ffff8d31d1933400 tsd ffff8d31d1933400 guid 16856970401076039102

[ vdev_disk_io_flush(...) ]

[    4.085465] NOTICE: 979 flushed v ffff8d31d26e4000 vd ffff8d31d1933400 tsd 0000000000000000 guid 18446744072641211488 err 0
[    4.085533] VERIFY(RW_READ_HELD(&vd->vd_lock)) failed
[    4.085605] PANIC at vdev_disk.c:1164:vdev_disk_io_start()

With #17718 in place, the crash changed to:

[    3.788436] VERIFY(!(RW_LOCK_HELD(&vd->vd_lock))) failed
[    3.788521] PANIC at vdev_disk.c:478:vdev_disk_close()

I don't know if there's any particular reason for the (relatively) long sleep; possibly something to do with the small number of flush taskq threads, or just the workload difference.

But looking closer, it seems this isn't just my code: all IO types theoretically have this problem. We even already sort of knew the shape of it; see the comment at the bottom of vbio_submit().

So anyway, I thought about reloading vd from v->vdev_tsd, but then I saw the NULL check: the tsd might already have been torn down by then, and even worse, the zio itself might already be dead. So there's no safe way to get back to the lock from the issuing side.

And that means ensuring the lock survives until we're done with it, and the simplest way to do that seemed to be to take the write lock in vdev_disk_close(). It will block until all this other stuff returns, and only then NULL out vdev_tsd, so any future IO on that vdev shouldn't see the stale pointer. And there shouldn't be any incoming IO at that point anyway because of the config locks, I think. If there is, that seems like something to fix elsewhere, but I'll wait until I see it because my head already hurts :)
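
For reference, a simplified sketch of the issue-side window this protects, based on my reading of the trace above (the bio construction and submission are elided, and the ENXIO handling reflects the existing NULL check as I understand it):

    static void
    vdev_disk_io_start(zio_t *zio)
    {
            vdev_t *v = zio->io_vd;
            vdev_disk_t *vd = v->vdev_tsd;

            /* Already closed: nothing left to take a read lock on. */
            if (vd == NULL) {
                    zio->io_error = SET_ERROR(ENXIO);
                    zio_interrupt(zio);
                    return;
            }

            rw_enter(&vd->vd_lock, RW_READER);

            /*
             * Build and submit the bio. The zio may complete, and the
             * vdev may be asked to close, before submission returns, so
             * vd must remain valid until the rw_exit() below.
             */
            /* ... */

            rw_exit(&vd->vd_lock);
    }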

Many IO operations are submitted to the kernel asynchronously, so the zio can complete, and follow-up actions run, before the submission call returns. If one of those follow-up actions closes the disk (eg during pool create/import), the initiator may be left holding a lock on the disk at destruction.

Instead, take the write lock before finishing up and decoupling the disk state from the vdev proper. The close will then block until all IO is submitted and the locks are released.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
@robn robn force-pushed the vdev-disk-close-lock branch from 3d97062 to 21451df on September 11, 2025 07:36
@amotin amotin (Member) left a comment

OK, I guess that makes sense.

@amotin amotin (Member) commented Sep 11, 2025

I haven't looked deeply at the code, but as an alternative thought: do we really need to hold the lock while requests are issued, or can we rely on some other means to be sure the device is still there? FreeBSD's vdev_geom_io_start() happily has g_io_request(bp, cp); as its last call and does not use any locks there.

@behlendorf behlendorf added the "Status: Accepted" (Ready to integrate: reviewed, tested) label and removed the "Status: Code Review Needed" (Ready for review and testing) label Sep 12, 2025
@behlendorf behlendorf merged commit f319ff3 into openzfs:master Sep 15, 2025
22 of 25 checks passed
behlendorf pushed a commit to behlendorf/zfs that referenced this pull request Sep 15, 2025
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <alexander.motin@TrueNAS.com>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Closes openzfs#17719