Skip to content

Conversation

@bkirwi
Copy link
Contributor

@bkirwi bkirwi commented Nov 17, 2025

https://github.com/MaterializeInc/database-issues/issues/9729 is an example of a recurring class of issue:

  • When Materialize shuts down (for good or ill), the two Tokio runtimes are shut down.
  • Shutting-down Tokio runtimes cancel tasks.
  • Our code is generally unprepared for tasks to fail. We abort on panic outside of unit tests; and our Tokio wrappers don't allow both aborting and joining on the result of a task. The only case where joining on a task might return a join-error in production is during this disordered shutdown procedure.
  • mz_ore had a utility for this: JoinHandle::wait_and_assert_finished. This code takes advantage of the fact that the only reason for cancellation is runtime shutdown, and just hang out and wait for itself to be shut down as well. We've been adding this method to join points whack-a-mole as new errors crop up.

This patch just makes the wait_and_assert_finished behaviour the overall default for our JoinHandle wrapper. This should remove a class of bug and eliminates a bunch of sketch error handling across the codebase.

Motivation

Tired of bugs like https://github.com/MaterializeInc/database-issues/issues/9729!

Tips for reviewer

One case where join errors are still meaningful: catching panics in async tests. In those cases we just unwrap the join handle to expose underlying Tokio handle, which seems fine.

@bkirwi bkirwi force-pushed the perfect-handle branch 7 times, most recently from a8d3e24 to 917d47b Compare November 17, 2025 22:51
@bkirwi bkirwi changed the title [ore] Perfect handle [ore] Infallible mz_ore::task::JoinHandle Nov 18, 2025
Tasks can fail in theory for two ways:
- Panic! However, we disable panics in production.
  (And for tests, we percolate the panic upward.)
- Task is explicitly aborted. This is statically ruled out by the
  wrapper.
- Runtime is shutting down. This is not interesting - it means the
  process is shutting down - and there's no value to percolating errors
  in that case. Here, we just remain pending indefinitely until the
  runtime dies.
@bkirwi bkirwi marked this pull request as ready for review November 18, 2025 18:47
@bkirwi bkirwi requested review from a team and aljoscha as code owners November 18, 2025 18:47
@def-
Copy link
Contributor

def- commented Nov 18, 2025

Thansk for triggering nightly, please ignore the orchestratord failure, it's already fixed on main!

Copy link
Contributor

@teskje teskje left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the simplification at the call sites and the reasoning seems sound to me. The only thing that's not great is that in non-abort-on-panic contexts you now need to remember to call into_tokio_handle or risk hanging forever when a task panics. As you say, we should have these contexts only in tests and hanging in a test is better than panicking in production.

let (handle, _idx, remaining) = future::select_all(handles).await;
handles = remaining;
let (id, global_id, res) = handle.expect("must join");
let (id, global_id, res) = handle;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to the changes you made, so feel free to ignore but: Calling this variable handle is misleading, since it is not a handle but the result of awaiting on one.

AbortOnDropHandle(self)
}

/// Checks if the task associated with this [`JoinHandle`] has finished.a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Checks if the task associated with this [`JoinHandle`] has finished.a
/// Checks if the task associated with this [`JoinHandle`] has finished.

(Also not your fault!)

Comment on lines +152 to +153
/// Checks if the task associated with this [`JoinHandle`] has finished.a
pub fn into_tokio_handle(self) -> TokioJoinHandle<T> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment got confused here!

self.0.abort();
let _ = self.await;
}
// Note: adding an `abort(&self)` method here is incorrect, please see `unpack_join_result`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unpack_join_result doesn't exist anymore I think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants