Concurrent immix #1355

Draft · wants to merge 21 commits into base: master

Conversation

@tianleq (Collaborator) commented Jul 28, 2025

Draft PR for concurrent non-moving immix

@@ -16,4 +16,4 @@ pub const BLOCK_ONLY: bool = false;

/// Mark lines when scanning objects.
/// Otherwise, do it at mark time.
pub const MARK_LINE_AT_SCAN_TIME: bool = true;
pub const MARK_LINE_AT_SCAN_TIME: bool = false;
Collaborator Author:

Concurrent marking no longer has a scan-object work packet, so for simplicity I just changed this value.

@@ -51,3 +53,9 @@ pub mod vm;
pub use crate::plan::{
AllocationSemantics, BarrierSelector, Mutator, MutatorContext, ObjectQueue, Plan,
};

static NUM_CONCURRENT_TRACING_PACKETS: AtomicUsize = AtomicUsize::new(0);
Collaborator:

I think this should be moved into GlobalState. Whoever picks up this PR can do that refactoring. I think we should reduce our reliance on statics in general.

Member:

Right. concurrent_marking_packets_drained is only accessed by the plan, and ConcurrentTraceObjects (which has a reference to the plan). We can move this to the plan (or make it a method for trait ConcurrentPlan).
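
A minimal sketch of what moving the counter into the plan could look like, assuming a ConcurrentPlan trait is introduced (all names below are illustrative, not part of this PR):

use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical trait; mmtk-core does not define this yet.
pub trait ConcurrentPlan {
    /// Counter of in-flight concurrent tracing packets, owned by the plan
    /// instead of a crate-level static.
    fn concurrent_tracing_packets(&self) -> &AtomicUsize;

    fn increment_concurrent_tracing_packets(&self) {
        self.concurrent_tracing_packets().fetch_add(1, Ordering::SeqCst);
    }

    fn decrement_concurrent_tracing_packets(&self) {
        self.concurrent_tracing_packets().fetch_sub(1, Ordering::SeqCst);
    }

    /// True when all concurrent marking packets have been drained.
    fn concurrent_marking_packets_drained(&self) -> bool {
        self.concurrent_tracing_packets().load(Ordering::SeqCst) == 0
    }
}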

@@ -524,6 +528,7 @@ pub struct CommonSpace<VM: VMBinding> {
/// This field equals to needs_log_bit in the plan constraints.
// TODO: This should be a constant for performance.
pub needs_log_bit: bool,
pub needs_satb: bool,
Collaborator Author:

I added this because needs_log_bit has a generational semantic associated with it.

pub fn prepare(&mut self, full_heap: bool) {
    if full_heap {
        debug_assert!(self.treadmill.is_from_space_empty());
        // debug_assert!(self.treadmill.is_from_space_empty());
Collaborator Author:

I had to comment out this assertion because a full stop-the-world GC might be triggered during concurrent marking, and the assertion would fire in that case.

@@ -337,6 +349,9 @@ pub trait Plan: 'static + HasSpaces + Sync + Downcast {
space.verify_side_metadata_sanity(&mut side_metadata_sanity_checker);
})
}

fn gc_pause_start(&self, _scheduler: &GCWorkScheduler<Self::VM>) {}
fn gc_pause_end(&self) {}
Member:

Why do we need a separate gc_pause_end? It seems to be called at the same location as end_of_gc. Was the method added before we had end_of_gc?

gc_pause_start is called before MMTk asks mutators to stop. That looks like a reasonable place for hooks.

Collaborator Author:

Strictly speaking, the semantics of gc_pause_end and end_of_gc are different. The GC is not done yet when the initial pause finishes.

Member:

Those two hooks are called at the same location. I think this is just our old assumption that GC == pause/STW.

Collaborator Author:

Yes, they are similar at the moment. Ideally, we should treat them differently

Member:

I renamed gc_pause_start to notify_mutators_paused. After mutators are stopped, we notify both the plan and the scheduler.

        mmtk.get_plan().notify_mutators_paused(&mmtk.scheduler);
        mmtk.scheduler.notify_mutators_paused(mmtk);

I merged gc_pause_end with end_of_gc. I will rename end_of_gc in separate pull requests to address #1362.

@@ -191,11 +199,24 @@ impl<VM: VMBinding> GCWork<VM> for ReleaseCollector {
///
/// TODO: Smaller work granularity
#[derive(Default)]
pub struct StopMutators<C: GCWorkContext>(PhantomData<C>);
pub struct StopMutators<C: GCWorkContext> {
pause: Pause,
Member:

Pause is defined for concurrent plans, and is leaking outside the plan.

Collaborator Author:

Or we can change this pause to a boolean flag indicating whether stack scanning is needed. The final pause needs no stack scanning; it only needs to bring all mutators to a safepoint.

Collaborator:

The Pause abstraction is good though. You can imagine an incremental GC using the same plumbing. An incremental GC is not a concurrent GC. Making it into a boolean might lose semantics.

Collaborator:

I think the Pause is relevant to non-concurrent GC, too. Semantically, it describes why MMTk is stopping (pausing) mutators. For a STW GC, it is only "Full", while concurrent GCs have more reasons to pause mutators, including InitialMark and FinalMark.

But I do think the term "Full" may need to be changed. See the other comment of mine.

Member:

I think the Pause is relevant to non-concurrent GC, too.

It depends on what you meant by Pause. The current Pause is defined as an enum of [Full, InitialMark, FinalMark]. This is clearly specific to the plan.

We can have a type to describe Pause, but its definition should be plan specific.

In this case, the pause is used by StopMutators to decide whether to schedule mutator root scanning or not. What we could have is: when the plan schedules StopMutators for the final mark, it sets a flag to StopMutators to ask the work packet to skip mutator root scanning.
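
A hypothetical sketch of that alternative, with the plan-specific Pause replaced by a plain flag on the work packet (the field and constructor names are made up for illustration):

use std::marker::PhantomData;

pub struct StopMutators<C: GCWorkContext> {
    /// Set by the plan when it schedules this packet for a pause that does not
    /// need mutator root scanning (e.g. a final-mark pause).
    skip_mutator_root_scanning: bool,
    phantom: PhantomData<C>,
}

impl<C: GCWorkContext> StopMutators<C> {
    pub fn new() -> Self {
        Self { skip_mutator_root_scanning: false, phantom: PhantomData }
    }

    /// Used by a concurrent plan for the final mark: mutators are only brought
    /// to a safepoint, and their roots are not scanned again.
    pub fn new_without_root_scanning() -> Self {
        Self { skip_mutator_root_scanning: true, phantom: PhantomData }
    }
}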

@@ -308,7 +354,7 @@ impl<VM: VMBinding> LargeObjectSpace<VM> {
#[cfg(feature = "vo_bit")]
crate::util::metadata::vo_bit::unset_vo_bit(object);
// Clear log bits for dead objects to prevent a new nursery object having the unlog bit set
if self.common.needs_log_bit {
if self.common.needs_log_bit || self.common.needs_satb {
Member:

We should just check needs_log_bit. Tianle mentioned that there are assumptions in the code base that the log bit is tied to generational plans. We should fix that first.

Collaborator Author:

Yes, this needs_satb is a quick and dirty hack

@@ -49,6 +49,8 @@ pub struct GlobalState {
pub(crate) malloc_bytes: AtomicUsize,
/// This stores the live bytes and the used bytes (by pages) for each space in last GC. This counter is only updated in the GC release phase.
pub(crate) live_bytes_in_last_gc: AtomicRefCell<HashMap<&'static str, LiveBytesStats>>,
pub(crate) concurrent_marking_active: AtomicBool,
Member:

These two fields are maintained here so all policies can access them.

  • concurrent_marking_active basically means that concurrent marking is going on, and policies should allocate objects as if they were marked. I think we should add a flag such as allocate_as_live per space, and set the flag for the spaces (see the sketch after this list).
  • concurrent_marking_threshold is more like a global counter of allocated pages. We could maintain such a counter regardless of whether the plan is concurrent or not.
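
A minimal sketch of the per-space flag idea (the name allocate_as_live and the surrounding struct are illustrative only):

use std::sync::atomic::{AtomicBool, Ordering};

/// Illustrative stand-in for per-space state such as CommonSpace.
pub struct SpaceState {
    /// When set, the space treats freshly allocated objects as already marked/live,
    /// which is what concurrent marking needs between InitialMark and FinalMark.
    allocate_as_live: AtomicBool,
}

impl SpaceState {
    /// Flipped by the plan at the InitialMark / FinalMark pauses.
    pub fn set_allocate_as_live(&self, active: bool) {
        self.allocate_as_live.store(active, Ordering::SeqCst);
    }

    /// Checked in allocation slow paths (and initialize_object_metadata).
    pub fn should_allocate_as_live(&self) -> bool {
        self.allocate_as_live.load(Ordering::SeqCst)
    }
}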

Collaborator:

Each policy already has an initialize_object_metadata function that takes in a parameter alloc: bool. The parameter is currently unused. But this is fine-grained, and usually spaces would want to bulk set this metadata instead of per-object.

Member:

I think alloc means something different. It was initially from here:
https://github.com/JikesRVM/JikesRVM/blob/5072f19761115d987b6ee162f49a03522d36c697/MMTk/src/org/mmtk/policy/ExplicitLargeObjectSpace.java#L108-L109

We also misused it here

mmtk-core/src/mmtk.rs

Lines 594 to 597 in d93262b

self.get_plan()
    .base()
    .vm_space
    .initialize_object_metadata(object, false)

return true;
}
let threshold = self.get_total_pages() >> 1;
let concurrent_marking_threshold = self
Member:

This is comparing accumulated allocated pages with total pages. I guess that means we only want to do concurrent marking if we have allocated enough, otherwise we just do a normal full heap GC.

Collaborator Author:

This is a quick and dirty hack, and a naive heuristic: trigger concurrent marking once half of the heap has been allocated. I did not even differentiate between clean blocks and reusable blocks; the counter is always increased by Block::PAGES regardless of the block type.
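
For concreteness, the heuristic described above amounts to something like the following (a sketch with assumed names, not the exact code in this PR):

/// Trigger concurrent marking once the pages allocated since the last GC
/// exceed half of the total heap pages. Clean and reusable blocks are
/// counted the same way, so the counter is only an approximation.
fn should_trigger_concurrent_marking(allocated_pages_since_gc: usize, total_pages: usize) -> bool {
    let threshold = total_pages >> 1; // 50% of the heap
    allocated_pages_since_gc >= threshold
}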

@@ -45,6 +45,7 @@ pub struct PlanConstraints {
/// `MutatorConfig::prepare_func`). Those plans can set this to `false` so that the
/// `PrepareMutator` work packets will not be created at all.
pub needs_prepare_mutator: bool,
pub needs_satb: bool,
Member:

Maybe we don't need this. We can always check barrier == BarrierSelector::SATB.

Collaborator Author (@tianleq, Jul 29, 2025):

If we fix the needs_log_bit, then we no longer need this

impl<VM: VMBinding> SlotIterator<VM> {
    pub fn iterate(
        o: ObjectReference,
        // should_discover_references: bool,
Collaborator Author:

Yes, I copied this part from lxr; it has something to do with OpenJDK's reference processing and class unloading. At this stage we do not need them, but lxr does.

@@ -265,6 +265,21 @@ impl<VM: VMBinding> ImmixAllocator<VM> {
// Update the hole-searching cursor to None.
Some(end_line)
};
// mark objects if concurrent marking is active
Member:

This is the same as the code in LargeObjectSpace::initialize_object_metadata -- it marks freshly allocated objects as marked/live. We may want to introduce a flag for policies, such as allocate_as_live.

Collaborator Author:

Yes, we may introduce another alloc semantic. I did introduce alloc public in Iso

@@ -162,4 +162,7 @@ pub trait Collection<VM: VMBinding> {
fn create_gc_trigger() -> Box<dyn GCTriggerPolicy<VM>> {
unimplemented!()
}

/// Inform the VM of concurrent marking status
fn set_concurrent_marking_state(_active: bool) {}
Member:

The binding uses this to know whether SATB is active and to decide whether to call the 'soft reference load barrier'. This flag could be stored as a global flag for SATB barriers, and the binding could access the barrier. In that case, we may not need this function.

Collaborator Author:

The problem is that the barrier is an opaque pointer from the binding's perspective. If we want to store this flag in the barrier and let the binding access it, then we need to make sure all Rust barrier structs have the same size, and expose that to the binding. I do not think it is worthwhile to do so.

Member (@qinsoon, Jul 29, 2025):

Why is it an opaque pointer? It is a dyn Barrier fat pointer, and allows Downcast. We can always access the barrier from the binding-side Rust code and store an active flag in the same static for the C++ to use. E.g. at the end of a pause, check if SATB is active. If so, set the static variable and enable the load reference barrier.

But anyway, the proposal above is based on an assumption that we do not want the function fn set_concurrent_marking_state(_active: bool). If it is okay to keep the function, then we don't need the change.
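
A sketch of the binding-side alternative being discussed here, assuming the binding keeps its own static flag that the C++ load-reference barrier reads (names are illustrative):

use std::sync::atomic::{AtomicBool, Ordering};

/// Binding-side flag mirroring whether SATB (concurrent marking) is active.
/// The C++ barrier fast path can read this symbol directly.
#[no_mangle]
pub static CONCURRENT_MARKING_ACTIVE: AtomicBool = AtomicBool::new(false);

/// Called from binding-side Rust code at the end of an InitialMark or FinalMark
/// pause, e.g. after inspecting the mutator's barrier (which allows downcasting).
pub fn publish_satb_state(active: bool) {
    CONCURRENT_MARKING_ACTIVE.store(active, Ordering::SeqCst);
}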

@wks (Collaborator) commented Jul 29, 2025

What will happen if a mutator wants to fork() while concurrent marking is in progress? There are subtle interactions between preparing for forking (MMTK::prepare_to_fork), the concurrently running marking task (including the dirty mark bits, etc.), and the WorkerGoal which is currently only designed for triggering GC and triggering "prepare-to-fork".

We may temporarily tell the VM binding that MMTk doesn't currently support forking when using concurrent GC. We may fix it later. One simple solution is postponing the forking until the current GC finishes.

The status quo is that only CRuby and Android need forking. But CRuby will not support concurrent GC in the short term.

@k-sareen (Collaborator):

What will happen if a mutator wants to fork() while concurrent marking is in progress?

You can't let this happen. Either the binding needs to ensure that the mutator waits while concurrent marking is active, or you don't let a concurrent GC happen before forking (ART's method of dealing with this).

@wks (Collaborator) commented Jul 29, 2025

You can't let this happen. Either the binding needs to ensure that the mutator waits while concurrent marking is active, or you don't let a concurrent GC happen before forking (ART's method of dealing with this).

Agreed. Fortunately, the current API doc for prepare_to_fork says:

    /// This function sends an asynchronous message to GC threads and returns immediately, but it
    /// is only safe for the VM to call `fork()` after the underlying **native threads** of the GC
    /// threads have exited.  After calling this function, the VM should wait for their underlying
    /// native threads to exit in VM-specific manner before calling `fork()`.

So a well-behaving VM binding shall wait for all the GC worker threads (which are created by the binding via VMCollection::spawn_gc_thread anyway) to exit before calling fork(). That's VM-specific, but not hard. Extending this API to support concurrent GC should not require the VM binding to rewrite this part.

@k-sareen (Collaborator):

Mentioning it before I forget: we need to change the name of the GC counter to say it counts pauses, or have a separate counter that counts pauses.

Comment on lines 33 to 40
if mmtk.is_user_triggered_collection() || is_emergency {
    // user triggered collection is always stop-the-world
    mmtk.get_plan().schedule_collection(worker.scheduler());
} else {
    // Let the plan to schedule collection work
    mmtk.get_plan()
        .schedule_concurrent_collection(worker.scheduler());
}
Collaborator:

I think this decision should be made by the plan. We are hard-coding the policy that "if it is user-triggered collection or emergency collection, then schedule a full-heap GC, otherwise schedule a concurrent GC". But this only makes sense for the ConcurrentImmix plan. For existing plans, there is no concurrent collection. And other concurrent plans to be implemented in the future may do concurrent collections in user-triggered and emergency collections, too.

I suggest we

  1. Remove Plan::schedule_concurrent_collection
  2. Push this if-else down to ConcurrentImmix::schedule_collection. All the needed information (whether it is emergency GC or user-triggered GC) can be looked up from mmtk.state.

If Plan::schedule_collection needs a &mut GCWorker argument (which can reach &MMTK and then reach the state), we should add it. Most GC-related methods of trait Plan (including prepare, release, etc.) have either &mut GCWorker or VMWorkerThread as parameter.
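
A rough sketch of that suggestion; the helper methods on the plan are hypothetical, and the flags would be looked up from mmtk.state:

// Sketch of what ConcurrentImmix::schedule_collection could do: the plan, not
// the generic GC-trigger work packet, decides which kind of collection to run.
fn schedule_collection_sketch<VM: VMBinding>(
    plan: &ConcurrentImmix<VM>,
    scheduler: &GCWorkScheduler<VM>,
    user_triggered: bool, // from mmtk.state
    emergency: bool,      // from mmtk.state
) {
    if user_triggered || emergency {
        // User-triggered and emergency collections stay stop-the-world.
        plan.schedule_full_stw_collection(scheduler); // hypothetical helper
    } else {
        plan.schedule_concurrent_collection(scheduler);
    }
}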

Collaborator Author:

Yes, this part should be in the concurrent plan trait, similar to what we have for the generational plan.

#[repr(u8)]
#[derive(Debug, PartialEq, Eq, Copy, Clone, NoUninit)]
pub enum Pause {
Full = 1,
Collaborator:

Do we have a better name than Full? I think here it means doing a full GC in one STW (as opposed to doing part of a GC, such as initial mark, final mark, or an incremental part of a GC). But it may be confused with full-heap GC (as opposed to nursery GC).

Collaborator Author:

Actually, I have the same concern, but I could not find a better term, so I simply left Full here as a placeholder.


fn load_reference(&mut self, _o: ObjectReference) {}

fn object_reference_clone_pre(&mut self, _obj: ObjectReference) {}
Collaborator:

This method only has a blank implementation in BarrierSemantics::object_reference_clone_pre, and SATBBarrier::object_reference_clone_pre calls that. What is this method intended for?

Comment on lines 203 to 224
                    let worker = self.worker();
                    let mmtk = self.mmtk();
                    let w = ConcurrentTraceObjects::new(root_objects.clone(), mmtk);

                    match pause {
                        Pause::InitialMark => worker.scheduler().postpone(w),
                        _ => unreachable!(),
                    }

                    root_objects.clear();
                }
            }
        }
        if !root_objects.is_empty() {
            let worker = self.worker();
            let w = ConcurrentTraceObjects::new(root_objects.clone(), self.mmtk());

            match pause {
                Pause::InitialMark => worker.scheduler().postpone(w),
                _ => unreachable!(),
            }
        }
Collaborator:

We may extract a lambda closure to reduce code duplication.
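
For example, the duplicated postpone logic could be hoisted into a small helper or closure, roughly like this (a sketch; the surrounding method and types are assumed from the hunk above):

// Shared by both flush points: wrap the buffered roots in a tracing packet
// and postpone it until after the InitialMark pause.
fn postpone_roots<VM: VMBinding>(
    worker: &GCWorker<VM>,
    mmtk: &'static MMTK<VM>,
    pause: Pause,
    objects: Vec<ObjectReference>,
) {
    let packet = ConcurrentTraceObjects::new(objects, mmtk);
    match pause {
        Pause::InitialMark => worker.scheduler().postpone(packet),
        _ => unreachable!(),
    }
}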

Comment on lines +155 to +175
struct SlotIteratorImpl<VM: VMBinding, F: FnMut(VM::VMSlot)> {
    f: F,
    // should_discover_references: bool,
    // should_claim_clds: bool,
    // should_follow_clds: bool,
    _p: PhantomData<VM>,
}

impl<VM: VMBinding, F: FnMut(VM::VMSlot)> SlotVisitor<VM::VMSlot> for SlotIteratorImpl<VM, F> {
    fn visit_slot(&mut self, slot: VM::VMSlot) {
        (self.f)(slot);
    }
}

pub struct SlotIterator<VM: VMBinding> {
    _p: PhantomData<VM>,
}

impl<VM: VMBinding> SlotIterator<VM> {
    pub fn iterate(
        o: ObjectReference,
Collaborator:

We may not really need SlotIterator or SlotIteratorImpl. Note that in mmtk-core/src/vm/scanning.rs we already have SlotVisitor and impl<SL: Slot, F: FnMut(SL)> SlotVisitor<SL> for F {. That's enough for us to pass a lambda (&mut |slot| { ... }) to Scanning::scan_object and use it as the slot_visitor argument.

I think the lxr branch introduced this struct to do some kind of filtering, and multiplexed multiple methods of Scanning, including the newly introduced Scanning::scan_object_with_klass.

However it is still nice to add a method ObjectReference::iterate_fields. It's convenient.

Comment on lines +704 to +706
pub fn iterate_fields<VM: VMBinding, F: FnMut(VM::VMSlot)>(self, f: F) {
    SlotIterator::<VM>::iterate(self, f)
}
Collaborator:

Since FnMut(VM::VMSlot) already implements trait SlotVisitor, we can directly pass it to scan_object.

Suggested change
-pub fn iterate_fields<VM: VMBinding, F: FnMut(VM::VMSlot)>(self, f: F) {
-    SlotIterator::<VM>::iterate(self, f)
-}
+pub fn iterate_fields<VM: VMBinding, F: FnMut(VM::VMSlot)>(self, mut f: F) {
+    <VM::VMScanning as Scanning<VM>>::scan_object(
+        VMWorkerThread(VMThread::UNINITIALIZED),
+        self,
+        &mut f,
+    )
+}

However, this is still imperfect.

  1. Not all VMs support scan_object. CRuby is one notable exception. If we only intend to visit the value of all (non-null) fields without updating (forwarding) any of them, I suggest we introduce another method in the Scanning trait, such as Scanning::enumerate_children. Otherwise, consider invoking Scanning::support_slot_enqueuing and scan_object_and_trace_edges, too.
  2. This method doesn't have tls. For CRuby, I am implementing object scanning by setting a thread-local variable (which is accessed via the tls) as a call-back, calling a C function to visit fields, and the C function will call the call-back to let the Rust code visit the field value. But because this method is called from the barrier, it may not always provide a VMWorkerThread. I suggest the newly introduced method Scanning::enumerate_children to take a VMThread instead of VMWorkerThread as argument.

In summary, I am also suggesting we introduce a method

pub trait Scanning<VM: VMBinding> {
    /// Visit the children of an object.
    ///
    /// This method may be called during mutator time, and is required by concurrent GC. Currently,
    /// we don't support concurrent copying GC, so this method can assume no objects are moved by GC
    /// while this method is running.
    fn enumerate_children(
        tls: VMThread,
        object: ObjectReference,
        child_visitor: &mut impl FnMut(ObjectReference),
    );
}

And if we make that change, then ObjectReference::iterate_fields will call Scanning::enumerate_children instead.

Collaborator @wks left a comment:

The current "WorkerGoal" mechanism should be able to handle the case where mutators trigger another GC between InitialMark and FinalMark. We can remove WorkPacketStage::Initial and WorkPacketStage::ConcurrentSentinel, and use GCWorkScheduler::on_last_parked to transition the GC state from InitialMark to the concurrent marking to FinalMark and finally finish the GC. See inline comments for more details.

notify = true;
}
if notify {
self.wakeup_all_concurrent_workers();
Collaborator:

This wakes up other workers. However, the current GC worker considers itself the "last parked worker". So after the on_gc_finished function returns, it will clear the current "goal" (goals.on_current_goal_completed()) and try to handle another "goal" (i.e. a GC request) in self.respond_to_requests(worker, goals). Unless the mutator triggers another GC immediately, the current GC worker thread will find that there are no requests, and park itself (return LastParkedResult::ParkSelf;). If we set MMTK_THREADS=1, it will never wake up again.

So we need to make some changes to GCWorkScheduler::on_last_parked so that if this is the end of the InitialMark STW, the on_last_parked method should return LastParkedResult::WakeAll so that all GC workers wake up again, including the current GC worker thread itself.

And we shouldn't clear the current "worker goal" if it is the end of InitialMark. The GC workers should still be working towards the "GC" goal. But it needs to be a bit stateful, knowing whether it is (1) during a "full GC", (2) during initial mark, (3) during final mark, or (4) between initial mark and final mark (we may just call it "concurrent tracing"). As long as the GC workers are still working on a goal (regardless of concrete state), it will not accept other GC requests or requests for forking, which automatically solves the forking problem.

And when concurrent tracing finishes (all the postponed work packets and their children are drained), the last parked worker will reach on_last_parked again. This time, it can immediately start scheduling the FinalMark stage (or should we call it a "super stage" to distinguish it from a work bucket stage, or just call it an STW?). When FinalMark finishes, the GC really finishes. GC workers can now accept requests for another GC or requests for forking.
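
One way to make the goal stateful, as described above, is to track a small phase enum while the "GC" goal is active (purely illustrative):

/// Sub-state of the "GC" worker goal for a concurrent plan (sketch only).
/// While any of these phases is active, the scheduler keeps the GC goal and
/// refuses new GC requests or requests for forking.
enum GcPhase {
    /// The whole GC is done in a single stop-the-world pause.
    FullStw,
    /// The InitialMark pause is in progress.
    InitialMark,
    /// Between InitialMark and FinalMark: workers drain concurrent tracing packets.
    ConcurrentTracing,
    /// The FinalMark pause is in progress; when it ends, the GC goal completes.
    FinalMark,
}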

@@ -250,3 +260,82 @@ impl<S: BarrierSemantics> Barrier<S::VM> for ObjectBarrier<S> {
}
}
}

pub struct SATBBarrier<S: BarrierSemantics> {
Member:

This seems to be a pre-write ObjectBarrier. The current ObjectBarrier only implements the post-write functions, and this implementation implements the pre-write functions.

@@ -29,6 +31,12 @@ pub struct GCWorkScheduler<VM: VMBinding> {
pub(crate) worker_monitor: Arc<WorkerMonitor>,
/// How to assign the affinity of each GC thread. Specified by the user.
affinity: AffinityKind,

pub(super) postponed_concurrent_work:
Member:

The work that gets 'postponed' is actual concurrent work packets. The implementation uses these queues to temporarily store the packets, to avoid directly pushing them to the Unconstrained bucket and getting that work done in the current pause. Only after the pause is done does the scheduler transfer the work from the queue back to Unconstrained.

Why not use a separate bucket for concurrent work? The bucket is not opened during pause, and gets opened when the pause is done.

@@ -59,6 +60,8 @@ pub struct WorkBucket<VM: VMBinding> {
/// recursively, such as ephemerons and Java-style SoftReference and finalizers. Sentinels
/// can be used repeatedly to discover and process more such objects.
sentinel: Mutex<Option<Box<dyn GCWork<VM>>>>,
in_concurrent: AtomicBool,
Member:

This field seems never used.

@wks (Collaborator) commented Jul 30, 2025

It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects (actually the mark bits of all words in a line are set). This is OK. But what if we need VO bits? The default strategy for maintaining VO bits is copying from the mark bits. Since all words of the line are marked, the VO bits will also be all ones if we blindly copy VO bits from the mark bits. But because post_alloc already sets the VO bits, we just need to keep their VO bits as is.

To properly support VO bits, we need to do two things:

  1. Reset the BumpPointer of ImmixAllocator so that after the InitialMark, all mutators start allocation from empty lines (i.e. make sure a line never contains both objects allocated before InitialMark and objects allocated after InitialMark).
  2. In Block::sweep,
    • if a line only contains objects allocated after InitialMark, we keep the VO bits as is;
    • if it only contains live objects before InitialMark, we copy the VO bits over from mark bits;
    • otherwise the line must be empty. We clear its VO bits.

There are multiple ways to know if a line only contains new objects or old objects.

  • If a line is marked, but the mark bits of a line are all zero, it must be a line that only contains objects allocated after InitialMark. This doesn't need extra metadata. (Update: This is not true. If a mutator adds an edge from a live old object to the new object, the new object will still be reachable.)
  • If a line is marked, and the mark bits are a superset of the VO bits, it must be a line that only contains objects allocated after InitialMark. If a block only contains objects from before the InitialMark, the mark bits must be a subset of the VO bits. Only all-ones is a superset of all possible VO-bit patterns. If the mark bits and VO bits are identical (every word is an individual object, and they are either all live or all new), it doesn't matter whether we copy the mark bits or retain the VO bits. Currently each Line is 256 bytes, corresponding to 32 bits of mark bits or VO bits, so it should be easy and efficient to do this as a 32-bit bit operation (see the sketch after this list).
  • We introduce another one-bit-per-line metadata to record if it contains objects allocated after InitialMark.
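
A sketch of the superset test from the second bullet, assuming a line's mark bits and VO bits each fit in one u32 word (as stated above for 256-byte lines); the function name is made up:

/// Decide the new VO bits for a marked line during sweeping (sketch only).
/// `mark_bits` and `vo_bits` are the 32 metadata bits covering one 256-byte line.
fn updated_vo_bits_for_marked_line(mark_bits: u32, vo_bits: u32) -> u32 {
    if (vo_bits & !mark_bits) == 0 {
        // The mark bits are a superset of the VO bits: the line holds only
        // objects allocated after InitialMark (or the two patterns are
        // identical), so keep the VO bits set by post_alloc.
        vo_bits
    } else {
        // Otherwise the line holds only objects that were live before
        // InitialMark: copy the VO bits over from the mark bits.
        mark_bits
    }
}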

@tianleq (Collaborator Author) commented Jul 30, 2025

"It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects." This is not true. Both lines and every word within that line are marked. So if one blindly copy mark bits to vo bits, then vo bit will be 1 even if that address is not an object

@qinsoon (Member) commented Jul 30, 2025

It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects. This is OK. But what if we need VO bits?

I think we can just mark lines, and also mark each individual object. The bulk set only works for side mark bits anyway. Let's not get things entangled and complicated.

@tianleq (Collaborator Author) commented Jul 30, 2025

It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects. This is OK. But what if we need VO bits?

I think we can just mark lines, and also mark each individual object. The bulk set only works for side mark bits anyway. Let's not get things entangled and complicated.

The problem is that we do not want to do the check in the fast path. If we want to mark each individual object, then in the allocation fast path we need to check whether concurrent marking is active and then set the mark bit, while the current bulk-setting approach only sets the mark bits in the slow path.

@wks (Collaborator) commented Jul 30, 2025

"It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects." This is not true. Both lines and every word within that line are marked. So if one blindly copy mark bits to vo bits, then vo bit will be 1 even if that address is not an object

I think we can just mark lines, and also mark each individual object. The bulk set only works for side mark bits anyway. Let's not get things entangled and complicated.

Actually I think for correctness, we must not set mark bits of individual objects when allocating. Suppose there are objects A and B when GC is triggered, and root -> ... -> A -> B. During concurrent marking, a mutator allocated C, and changed the object graph to root -> ... -> A -> C -> B. Then a GC worker visits A for the first time. If C already has the mark bit, GC will not enqueue C, and will not enqueue B, either. It will consider B as dead.

I forgot the SATB barrier. It will remember B when we remove the edge A -> B.

@tianleq (Collaborator Author) commented Jul 30, 2025

"It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects." This is not true. Both lines and every word within that line are marked. So if one blindly copy mark bits to vo bits, then vo bit will be 1 even if that address is not an object

I think we can just mark lines, and also mark each individual object. The bulk set only works for side mark bits anyway. Let's not get things entangled and complicated.

Actually I think for correctness, we must not set mark bits of individual objects when allocating. Suppose there are objects A and B when GC is triggered, and root -> ... -> A -> B. During concurrent marking, a mutator allocated C, and changed the object graph to root -> ... -> A -> C -> B. Then a GC worker visits A for the first time. If C already has the mark bit, GC will not enqueue C, and will not enqueue B, either. It will consider B as dead.

In your case, B will be captured by the SATB barrier, so it will not be considered dead. There is no need to scan newly allocated objects, because any children of those objects must have been alive in the snapshot and are thus guaranteed to be traced.

@qinsoon (Member) commented Jul 30, 2025

It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects. This is OK. But what if we need VO bits?

I think we can just mark lines, and also mark each individual object. The bulk set only works for side mark bits anyway. Let's not get things entangled and complicated.

The problem is that we do not want to do the check in the fast path. If we want to mark each individual object, then in the allocation fast-path, we need to check if concurrent marking is active and then set the mark bit. While the current bulk setting way only set the mark bit in the slow-path

The mark bit could be in the header, and we have to set it per object if it is in the header. We can differentiate header mark bit and side mark bit, and deal with each differently.

But bulk-setting mark bits is still a bit hacky -- and this is why we would have issues with VO bits. The VO-bit strategy copies from mark bits, assuming mark bits are only set for individual objects.

@k-sareen (Collaborator) commented Aug 1, 2025

The solution for the VO-bit is to not use the CopyFromMarkBits for the ConcurrentImmix plan. You would always need to do the ClearAndReconstruct strategy. @tianleq is right in that this is too complex to do in the fast-path/JIT compiler. Hence why it's better to do it in bulk in the slow-path when grabbing new pages for a space.

@qinsoon (Member) commented Aug 1, 2025

We would need every policy to check if it should allocate objects as live while initializing object metadata. That would be a boolean flag that can be set by concurrent Immix. This also means those policies will always have this condition, no matter whether they are used in concurrent Immix or not.
One way I can think of to avoid this is to add a const (generic const) to each space type, something like MAY_ALLOCATE_AS_LIVE. Concurrent Immix initializes spaces with MAY_ALLOCATE_AS_LIVE = true, and other (non-concurrent) plans initialize it as MAY_ALLOCATE_AS_LIVE = false. This can turn a conditional check into a constant. But this sounds annoying and complicates the code, so I don't quite like this approach.
An alternative is that we keep the condition check in MMTk core, and argue that a binding should have its JIT generate a more efficient post_alloc function that omits this condition check for non-concurrent plans. So it is okay to have such a condition check in MMTk core.
@wks What do you think?

I think doing a condition check should be good enough. If the VM binding really calls into MMTk core to do alloc or post_alloc, the real bottleneck will be the virtual dispatch over space. If the binding cares about performance, it should always implement fast paths on the binding side, even if the VM is mostly interpreted like CRuby.

The problem is still how the fast path can do this. Different policies may have different requirements. For example, the immix space not only needs to mark the object, it also needs to mark lines.

The JIT generated fastpath mimics what MMTk core does. If a space has this check in the initialize_object_metadata which appears in the post alloc, the JIT should do what MMTk core does. If a space doesn't have anything in initialize_object_metadata, then JIT doesn't need to do anything.

@qinsoon (Member) commented Aug 1, 2025

@tianleq is right in that this is too complex to do in the fast-path/JIT compiler.

Why is it complex? It is trivial.

if current plan may allocate object as live:
  emit post_alloc with a check
else
  emit normal post_alloc

@tianleq (Collaborator Author) commented Aug 1, 2025

@tianleq is right in that this is too complex to do in the fast-path/JIT compiler.

Why is it complex? It is trivial.

if current plan may allocate object as live:
  emit post_alloc with a check
else
  emit normal post_alloc

Because the check itself can be complicated. In concurrent marking Immix, if the object is in the immix space, lines also need to be marked. Writing this whole thing in the fast path is non-trivial and error-prone.

@k-sareen (Collaborator) commented Aug 1, 2025

If your suggestion is that we should do a function call in the fast-path, then that's not a fast-path anymore. If your suggestion is to write the JIT-compiler instructions in the fast-path which set the metadata like line mark state, mark-bit, etc. then this is entirely policy dependent. For example, you would need to have separate JIT-generated instructions for an ImmixSpace object and a LOS object. Not to mention that the metadata address for local metadata is hard to calculate.

The whole idea of the fast-path is that we don't have to do extra checks per-object allocation and the object already has the correct state set when allocated.

@wks (Collaborator) commented Aug 1, 2025

The solution for the VO-bit is to not use the CopyFromMarkBits for the ConcurrentImmix plan. You would always need to do the ClearAndReconstruct strategy. @tianleq is right in that this is too complex to do in the fast-path/JIT compiler. Hence why it's better to do it in bulk in the slow-path when grabbing new pages for a space.

I already have a solution for the CopyFromMarkBits strategy (I updated #1355 (comment)). In simple terms, we leave the VO bits as is if the mark bits of a line are a superset of the VO bits of the same line; otherwise copy the mark bits.

@wks (Collaborator) commented Aug 1, 2025

I think what Tianle is currently doing, i.e. setting both side (per-object) mark bits and line mark bits to 1 in bulk when finding the next contiguous lines, is good enough.

Currently the ImmixSpace does not support in-header mark bits. If we let ImmixSpace support header mark bits, we can eagerly set the line mark bits, but set per-object mark bits in post_alloc (in Space::initialize_object_metadata or just Mutator::post_alloc if all spaces use the same mark bit metadata). Then we no longer need to mark the line in fast paths.

@qinsoon (Member) commented Aug 1, 2025

@tianleq

Because the check itself can be complicated.

I meant a simple check like what you do for SATB barrier for reference loading: https://github.com/tianleq/mmtk-openjdk/blob/5aa00cd32344bda65d35bba42f9df6b2757991ae/openjdk/barriers/mmtkSATBBarrier.cpp#L57
How is that complicated?

In concurrent marking immix, if the object is in immix space, lines also need to be marked. Writing this whole thing in fast-path is non-trivial and error-prone

Marking lines is not in post_alloc, and is not implemented as a fastpath by the binding. What made you think I implied implementing line marking in fast path?

@qinsoon (Member) commented Aug 1, 2025

@k-sareen

If your suggestion is that we should do a function call in the fast-path, then that's not a fast-path anymore.

I didn't suggest that. The pseudocode is clearly the compiler's code, not the generated code. At JIT time, the JIT checks whether a plan may allocate objects as marked; if so, the JIT needs to emit code that checks whether we should currently allocate objects as live or not.

If your suggestion is to write the JIT-compiler instructions in the fast-path which set the metadata like line mark state, mark-bit, etc. then this is entirely policy dependent.

First, I didn't suggest that.
Second, the allocation fastpath (including post alloc) is entirely policy dependent. It is and it always has been.

For example, you would need to have separate JIT-generated instructions for an ImmixSpace object and a LOS object. Not to mention that the metadata address for local metadata is hard to calculate.

Yes. It is not something new.
A JIT generates code as the fastpath to mimic what MMTk core does. This is the design, this is always the design.

The whole idea of the fast-path is that we don't have to do extra checks per-object allocation and the object already has the correct state set when allocated.

No. If a binding uses header mark bits, and needs to set the mark bits when SATB is active and not set the mark bits when SATB is not active, there is no other option. It is intrinsic to the algorithm.

@k-sareen (Collaborator) commented Aug 1, 2025

Yes. It is not something new. A JIT generates code as the fastpath to mimic what MMTk core does. This is the design, this is always the design.

I'm saying that this is too complicated to implement in the JIT. The control flow is non-trivial.

Marking lines is not in post_alloc, and is not implemented as a fastpath by the binding. What made you think I implied implementing line marking in fast path?

But lines need to be marked as no one will mark the line otherwise, i.e. we would have to change the post_alloc function to mark lines based on is_concurrent_marking_active.

@qinsoon (Member) commented Aug 1, 2025

Marking lines is not in post_alloc, and is not implemented as a fastpath by the binding. What made you think I implied implementing line marking in fast path?

But lines need to be marked as no one will mark the line otherwise, i.e. we would have to change the post_alloc function to mark lines based on is_concurrent_marking_active.

Why? We mark lines in the slowpath, we don't need to do that in the fastpath.

We avoid using the term "reference" to refer to weak reference objects
as in JVM.  We explicitly mention "weak reference" instead.

Also added comments to clarify the purpose of that barrier function.
@k-sareen (Collaborator) commented Aug 1, 2025

Sure. But then why not do this for the mark bit as well? It avoids the overhead of figuring out the metadata address, etc., and amortizes the cost of setting the metadata. Everything we have in OpenJDK so far is for global metadata (unlog bit, alloc bit, etc.). No post_alloc has been inlined, and neither has any use of local metadata. I think this discussion is better continued in the next meeting.

Comment on lines +62 to +63
// If we disable moving in Immix, this is a non-moving plan.
moves_objects: false,
Collaborator:

Since this plan falls back to STW GC, there is in principle nothing preventing us from doing defragmentation during STW GC. This means this plan can still move objects.

If it is easy to enable defragmentation, we should do it so that we won't blindly need 4x-7x min heap size. But if it is difficult or buggy, we can leave this version as fully non-moving, and enable moving in the future.

Collaborator Author:

It might be relatively easy, as we just need to make sure the concurrent GC does not do defrag. But I have never tried this, so it may have bugs.

Collaborator (@wks, Aug 1, 2025):

I tried locally and set PlanConstraints::moves_objects to true, and it works. But when I run a benchmark like lusearch, it only triggers one single STW GC at the beginning of the execution, and it is a defrag GC. Then it only triggers concurrent GCs and never executes an STW GC again until the benchmark finishes. If I reduce the heap size, it will thrash, triggering concurrent GC again and again, and the program will never make progress. I guess there is something wrong with the GC trigger that makes the GC thrash.

I am not reducing it too small. For example, I ran lusearch at 80MB, and ConcurrentImmix thrashes when the progress is about 60%. In theory, it should be able to finish the benchmark by running only STW GC using plain Immix, in which case 80MB is too generous.

Collaborator:

lusearch is a bad example since it's highly generational so it's quite possible that entire blocks get freed up even in a concurrent non-moving GC. Maybe try some other benchmark?

Collaborator Author (@tianleq, Aug 5, 2025):

The concurrent GC triggering is naive: it blindly counts how many pages have been allocated, and once 50% of the pages are allocated, it starts a concurrent GC. Reusable blocks are treated the same as clean blocks, so the counter is imprecise.

@qinsoon (Member) commented Aug 1, 2025

Sure. But then why not do this for the mark-bit as well? It avoids the overhead of figuring out the metadata address etc. etc. + amortizes the cost of setting the metadata. Everything we have in OpenJDK so far is for global metadata (unlog bit, alloc bit, etc.). No post_alloc has been inlined, and neither has any use of local metadata. I think this discussion is better continued in the next meeting.

See #1355 (comment) and the few replies after it.

Clippy on the stable version of Rust seems to erroneously consider
`std::mem::swap(&mut a, &mut lg)` as swapping with a temporary, where
`lg` is a lock guard.  We may fix the warning by writing `&mut *lg`, but
the current use cases can all be expressed more concisely using
`std::mem::replace` and `std::mem::take`.

See: https://rust-lang.github.io/rust-clippy/master/index.html#swap_with_temporary
Collaborator @wks left a comment:

I think the RwLock on BucketQueue::queue is a hack to allow replacing the queue. But adding a lock would hurt the performance of other plans. Since we are sure the Unconstrained bucket must be empty when we schedule concurrent work, we can simply add work packets in a loop to achieve that, and that only requires a shared reference (&queue).

I have made a revision for that: https://github.com/wks/mmtk-core/commit/9ac0d8234905deea5fd22d731f49ba37f7bded48

But I didn't push it into this PR. There is an alternative solution which @qinsoon mentioned, i.e. making a dedicated bucket for concurrent work packets. I'll leave it to @qinsoon to decide whether to cherry-pick my change, or add a new dedicated bucket.

@@ -7,34 +7,35 @@ use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

struct BucketQueue<VM: VMBinding> {
queue: Injector<Box<dyn GCWork<VM>>>,
// queue: Injector<Box<dyn GCWork<VM>>>,
queue: std::sync::RwLock<Injector<Box<dyn GCWork<VM>>>>,
Collaborator:

This RwLock may affect the performance of other plans.

I think the sole purpose of adding the RwLock is making it possible to swap the queue of the Unconstrained bucket with another Injector instance (the postponed queues). If we are sure the Unconstrained bucket must be empty, we can simply call WorkBucket::bulk_add but without notifying GC workers.

@@ -29,6 +31,12 @@ pub struct GCWorkScheduler<VM: VMBinding> {
pub(crate) worker_monitor: Arc<WorkerMonitor>,
/// How to assign the affinity of each GC thread. Specified by the user.
affinity: AffinityKind,

pub(super) postponed_concurrent_work:
spin::RwLock<crossbeam::deque::Injector<Box<dyn GCWork<VM>>>>,
Collaborator:

This is using spin::RwLock, but I don't see a particular reason. We may use plain std::sync::RwLock instead.

Collaborator:

spin::RwLock might have better performance, especially on non-x86 architectures. It's best if we could benchmark it before we make a decision.

@k-sareen (Collaborator) commented Aug 5, 2025

Like we discussed in Monday's meeting, the VO-bit strategy actually depends on the plan as well, not just the VM. I think this should not be a const function since it could depend on the selected plan:

/// Select a strategy for the VM. It is a `const` function so it always returns the same strategy
/// for a given VM.
const fn strategy<VM: VMBinding>() -> VOBitUpdateStrategy {
    // CopyFromMarkBits performs better than ClearAndReconstruct, and it also allows using
    // VO bits during tracing. We use it as the default strategy.
    // TODO: Revisit this choice in the future if non-trivial changes are made and the performance
    // characterestics may change for the strategies.
    match VM::VMObjectModel::LOCAL_MARK_BIT_SPEC.as_spec() {
        // Note that currently ImmixSpace doesn't support in-header mark bits,
        // but the DummyVM for testing declares mark bits to be "in header" as a place holder
        // because it never runs GC.
        MetadataSpec::InHeader(_) => VOBitUpdateStrategy::ClearAndReconstruct,
        MetadataSpec::OnSide(_) => VOBitUpdateStrategy::CopyFromMarkBits,
    }
}

@@ -59,6 +60,8 @@ pub struct WorkBucket<VM: VMBinding> {
/// recursively, such as ephemerons and Java-style SoftReference and finalizers. Sentinels
/// can be used repeatedly to discover and process more such objects.
sentinel: Mutex<Option<Box<dyn GCWork<VM>>>>,
in_concurrent: AtomicBool,
disable: AtomicBool,
Member:

This is also never used.
