-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Record and Replay Support for Component Model #11284
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
* Integrated `Recorder` and `Replayer` traits * Supported validation in configs and CLI
No replay support is included yet, and the recording still doesn't cover all cases surrounding overriden lowering implementations
Todos: * Fix some interfaces for Recorder/Replayer and Stores * Include feature gating
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initial review -- this is looking really good overall!
My main high-level thought around the heart of the record/replay (the interaction with the component hostcall logic and lifting/lowering): there are a lot of fine-grained conditionals, and it's somewhat hard to trace, even though there are only three modes (running + recording, replaying, or just running). It feels like somehow we should be able to hoist the conditionals a bit more and then have straight-line sequences for each of those modes.
A few other misc thoughts:
- We'll want to make sure we either plug all soundness holes or document them for the initial landing. We already know about component builtins (and their memory effects); we should ask around to ensure we catch any others.
- For core-wasm recording, I suspect we'll want to validate that there are no exported memories or tables.
- We should validate that the module is exactly the same on replay as on record -- I know you're working on adding hashing for this.
- Some comments below about
std
dependence -- we should ensure the core can work, serializing into memory, in a no-std build. - We'll want at least @alexcrichton to give this a review as well, maybe after we try some of the refactors mentioned here.
Thanks again!
crates/cli-flags/src/lib.rs
Outdated
/// enforced during replay by default (NaN canonicalization, deterministic relaxed SIMD) | ||
#[arg(short = 'P', long = "replay", value_name = "KEY[=VAL[,..]]")] | ||
#[serde(skip)] | ||
replay_raw: Vec<opt::CommaSeparated<Replay>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
High-level thoughts on command-line options:
- It makes sense to me that record is a general option that the user should be able to set on any Wasmtime subcommand that executes WebAssembly; this includes
wasmtime run
,wasmtime serve
, and maybe others. (Curious: have you tried it on an HTTP server component?) - However, replay is a specific and separate thing: for example, if I am replaying an HTTP server execution, I should not be spawning an actual HTTP server on the host side. Basically, the idea of replay is that it is replacing the "real" host with a trace, so it should be properly seen (I think) as a separate top-level driver.
With that in mind, what do you think about turning replay into a command, rather than an option?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to what @cfallin said about replay being a command, not an option.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replay in practice operates very similar to the run
command and shares a lot of the top level command line options with it as well.
The replay command as a result would be just a minor super-set of the run
setup. It sort of makes sense to do one of two things:
- Have a top-level
replay
command, that just adds additional replay options, and callsrun
under the hood with it - Add replay options perhaps just for the
run
command (as opposed to general options)
Leaning towards the former since maybe in the future replay
could support different modes of operation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replay in practice operates very similar to the
run
command and shares a lot of the top level command line options with it as well.
Shouldn't all the top-level options pretty much be fixed and effectively identical to whatever they were when the trace was recorded? When does it make sense to specify different options than were used when recording? If the answer is "it doesn't make sense" then I don't think we want to force users to manually provide the same options from before in the replay command.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that's true for most options. I think the ones that might be usable are perhaps just profiling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah great point, very cool to profile an execution after the fact and ignore I/O and the external world.
I think we can probably add CLI options one at a time as they make sense to the replay command, and have the default1 be "whatever was configured during the recording".
Footnotes
-
I think this is implicit/automatic in that the trace will reflect whatever configuration settings were in effect during recording? At least that is true for anything that affects execution. ↩
…ay lowering spurious event handling
Happy to help out! (and agreed I'd like to once-over at some point) How about scheduling a call when y'all are ready with the 3 of us? That'd probably be best to draw attention to any various areas and for me to ask some questoins in a high-bandwidth way before going off to review on my own. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super excited for this!
I think we should adjust the #[cfg(...)]
options / cargo features a little bit. Instead of having a cfg
feature for just type validation, but always including the general record-replay code in the build, I think we should have a #[cfg(feature = "rr")]
cargo feature for controlling whether we are building in the ability to do any kind of record-replay at all. (We don't want to force the smallest embedded builds of Wasmtime to include record-replay infrastructure, for example.) And then, when record-replay support was included at compile time, we can have runtime boolean knobs for whether to do type validation and divergence-checking when replaying.
Finally, in order to make it so that the whole core runtime isn't littered with #[cfg(feature = "rr")]
, we can do something similar to what we do with #[cfg(feature = "gc")]
and the wasmtime::runtime::gc
submodule:
// crates/wasmtime/src/runtime/rr.rs
#[cfg(feature = "rr")]
mod enabled;
#[cfg(feature = "rr")]
pub use enabled::*;
#[cfg(not(feature = "rr"))]
mod disabled;
#[cfg(not(feature = "rr"))]
pub use disabled::*;
The wasmtime::runtime::rr::enabled
submodule would have the actual record-replay implementation, and the wasmtime::runtime::rr::disabled
submodule would have a stubbed out version of the same public API but without constructors and with every public type being a newtype of wasmtime::runtime::uninhabited::Uninhabited
. This pattern lets the core runtime usage of record-replay APIs look the same regardless whether we actually build in the code to do record-replay.
The other thing I think we need before this can land is some kind of testing or fuzzing story. At minimum, we should (1) make it so that the fuzzer can turn on recording during our existing fuzzing, and (2) we should have a smoke test that records some kind of Wasm program and then replays it back again and asserts that we get the same result again. As a follow up, I think we should also generalize that smoke test into a new fuzz target where we take an arbitrary Wasm module generated by the fuzzer, record its execution, and then replay that execution and assert that we get the same results (we can mostly rely on enabling the internal type validation and divergence checks for these assertions).
Let me know if any of this isn't clear or if I've overlooked something.
Thanks!
crates/cli-flags/src/lib.rs
Outdated
/// enforced during replay by default (NaN canonicalization, deterministic relaxed SIMD) | ||
#[arg(short = 'P', long = "replay", value_name = "KEY[=VAL[,..]]")] | ||
#[serde(skip)] | ||
replay_raw: Vec<opt::CommaSeparated<Replay>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to what @cfallin said about replay being a command, not an option.
// Uninitialized data is assumed and serialized, so hence | ||
// may contain some undefined values | ||
unsafe { self.assume_init() }.to_valraw_bytes() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is sound. Instead, I think we'll want something like this:
union SerializedValRaw {
bytes: ValRawBytes,
val: ValRaw,
}
impl SerializedValRaw {
pub fn new(val: ValRaw) -> Self {
// Zero-initialize `self.bytes` to ensure that there are
// no undefined bytes in `self` and we don't ever try to
// read and serialize undefined data.
let ret = Self {
// SAFETY: it is safe to zero-initialize `u8` arrays.
bytes: unsafe { mem::zeroed() },
};
ret.val = val;
ret
}
pub fn get_val(&self) -> ValRaw {
// SAFETY: `self` is always initialized in `new` such
// that there is a valid `ValRaw` inside.
unsafe { self.val }
}
pub fn get_bytes(&self) -> ValRawBytes {
// SAFETY: We take care to ensure that `self` has no
// undefined bytes in the constructor, so accessing the
// raw bytes representation of `ValRaw` is safe.
unsafe { self.bytes }
}
}
That'd be great! FWIW, I'm on PTO next week and the week after; please feel free to talk with Arjun directly before then if you both want, or I can join after Aug 11... |
@alexcrichton @fitzgen I'll take a pass through the comments early next week, and perhaps we can find a time later next week that works |
Sounds good! Feel free to ping me on Zulip when ready |
Brief
This PR is intended to support deterministic record and replay (RR) of Wasm components intrinsically in Wasmtime, and received an initial round of discussion in the Wasmtime bi-weekly meeting on 07/17
Motivation
RR is a very useful primitive for improving the debugging story of Wasm in Wasmtime. Bugs that are often encountered in modules during deployment can be deterministically reproduced. In particular, it provides the foundation for the following (to name a few):
Scope
This initial PR provides the base primitives for recording and replay events. It supports RR at all import function boundaries and lowering rules for component types. The RR event infrastructure is intended to be easily extensible to new event types as new use-cases emerge.
Primary Goals
Non-Goals (Subject to Discussion)
Initial Performance Numbers
Some initial runs on compression libraries like
zstd
show a 4-5% overhead on recording logic, excluding the disk I/O. This seems reasonable at the moment and likely doesn't need further optimization unless there are explicit use-cases.Minor Todo
The following (minor) additions will be made in the coming days prior to potential merging:
RecordBuffer
andReplayBuffer
Questions for Maintainers