Concurrent immix #1355

Draft · wants to merge 21 commits into base: master

Conversation

@tianleq (Collaborator) commented Jul 28, 2025

Draft PR for concurrent non-moving immix

@@ -16,4 +16,4 @@ pub const BLOCK_ONLY: bool = false;

/// Mark lines when scanning objects.
/// Otherwise, do it at mark time.
pub const MARK_LINE_AT_SCAN_TIME: bool = true;
pub const MARK_LINE_AT_SCAN_TIME: bool = false;
Collaborator Author:

Concurrent marking no longer has a scan-object work packet, so for simplicity I just changed this value.

@@ -51,3 +53,9 @@ pub mod vm;
pub use crate::plan::{
AllocationSemantics, BarrierSelector, Mutator, MutatorContext, ObjectQueue, Plan,
};

static NUM_CONCURRENT_TRACING_PACKETS: AtomicUsize = AtomicUsize::new(0);
Collaborator:

I think this should be moved into GlobalState. Whoever picks up this PR can do that refactoring. I think we should reduce our reliance on statics in general.

Member:

Right. concurrent_marking_packets_drained is only accessed by the plan, and ConcurrentTraceObjects (which has a reference to the plan). We can move this to the plan (or make it a method for trait ConcurrentPlan).
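
A minimal sketch of what moving the counter into the plan could look like, assuming a ConcurrentPlan trait is introduced (all names below are illustrative, not part of this PR):

use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical trait; mmtk-core does not define this yet.
pub trait ConcurrentPlan {
    /// Counter of in-flight concurrent tracing packets, owned by the plan
    /// instead of a crate-level static.
    fn concurrent_tracing_packets(&self) -> &AtomicUsize;

    fn increment_concurrent_tracing_packets(&self) {
        self.concurrent_tracing_packets().fetch_add(1, Ordering::SeqCst);
    }

    fn decrement_concurrent_tracing_packets(&self) {
        self.concurrent_tracing_packets().fetch_sub(1, Ordering::SeqCst);
    }

    /// True when all concurrent marking packets have been drained.
    fn concurrent_marking_packets_drained(&self) -> bool {
        self.concurrent_tracing_packets().load(Ordering::SeqCst) == 0
    }
}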

@@ -524,6 +528,7 @@ pub struct CommonSpace<VM: VMBinding> {
/// This field equals to needs_log_bit in the plan constraints.
// TODO: This should be a constant for performance.
pub needs_log_bit: bool,
pub needs_satb: bool,
Collaborator Author:

I added this because needs_log_bit has a generational semantic associated with it.

pub fn prepare(&mut self, full_heap: bool) {
    if full_heap {
        debug_assert!(self.treadmill.is_from_space_empty());
        // debug_assert!(self.treadmill.is_from_space_empty());
Collaborator Author:

I had to comment out this assertion because a full stop-the-world GC might be triggered during concurrent marking, and the assertion would fire in that case.

@@ -337,6 +349,9 @@ pub trait Plan: 'static + HasSpaces + Sync + Downcast {
space.verify_side_metadata_sanity(&mut side_metadata_sanity_checker);
})
}

fn gc_pause_start(&self, _scheduler: &GCWorkScheduler<Self::VM>) {}
fn gc_pause_end(&self) {}
Member:

Why do we need a separate gc_pause_end? It seems to be called at the same location as end_of_gc. Was the method added before we had end_of_gc?

gc_pause_start is called before MMTk asks mutators to stop. That looks like a reasonable place for hooks.

Collaborator Author:

Strictly speaking, the semantics of gc_pause_end and end_of_gc are different. The GC is not done yet when the initial pause finishes.

Member:

Those two hooks are called at the same location. I think this is just our old assumption that GC == pause/STW.

Collaborator Author:

Yes, they are similar at the moment. Ideally, we should treat them differently

Member:

I renamed gc_pause_start to notify_mutators_paused. After mutators are stopped, we notify both the plan and the scheduler.

        mmtk.get_plan().notify_mutators_paused(&mmtk.scheduler);
        mmtk.scheduler.notify_mutators_paused(mmtk);

I merged gc_pause_end with end_of_gc. I will rename end_of_gc in separate pull requests to address #1362.

@@ -191,11 +199,24 @@ impl<VM: VMBinding> GCWork<VM> for ReleaseCollector {
///
/// TODO: Smaller work granularity
#[derive(Default)]
pub struct StopMutators<C: GCWorkContext>(PhantomData<C>);
pub struct StopMutators<C: GCWorkContext> {
pause: Pause,
Member:

Pause is defined for concurrent plans, and is leaking outside the plan.

Collaborator Author:

Or we can change this pause to a boolean flag indicating whether stack scanning is needed. The final pause needs no stack scanning; it only needs to bring all mutators to a safepoint.

Collaborator:

The Pause abstraction is good though. You can imagine an incremental GC using the same plumbing. An incremental GC is not a concurrent GC. Making it into a boolean might lose semantics.

Collaborator:

I think the Pause is relevant to non-concurrent GC, too. Semantically, it describes why MMTk is stopping (pausing) mutators. For a STW GC, it is only "Full", while concurrent GCs have more reasons to pause mutators, including InitialMark and FinalMark.

But I do think the term "Full" may need to be changed. See the other comment of mine.

Member:

I think the Pause is relevant to non-concurrent GC, too.

It depends on what you meant by Pause. The current Pause is defined as an enum of [Full, InitialMark, FinalMark]. This is clearly specific to the plan.

We can have a type to describe Pause, but its definition should be plan specific.

In this case, the pause is used by StopMutators to decide whether to schedule mutator root scanning or not. What we could have is: when the plan schedules StopMutators for the final mark, it sets a flag to StopMutators to ask the work packet to skip mutator root scanning.
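
A hypothetical sketch of that alternative, with the plan-specific Pause replaced by a plain flag on the work packet (the field and constructor names are made up for illustration):

use std::marker::PhantomData;

pub struct StopMutators<C: GCWorkContext> {
    /// Set by the plan when it schedules this packet for a pause that does not
    /// need mutator root scanning (e.g. a final-mark pause).
    skip_mutator_root_scanning: bool,
    phantom: PhantomData<C>,
}

impl<C: GCWorkContext> StopMutators<C> {
    pub fn new() -> Self {
        Self { skip_mutator_root_scanning: false, phantom: PhantomData }
    }

    /// Used by a concurrent plan for the final mark: mutators are only brought
    /// to a safepoint, and their roots are not scanned again.
    pub fn new_without_root_scanning() -> Self {
        Self { skip_mutator_root_scanning: true, phantom: PhantomData }
    }
}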

@@ -308,7 +354,7 @@ impl<VM: VMBinding> LargeObjectSpace<VM> {
#[cfg(feature = "vo_bit")]
crate::util::metadata::vo_bit::unset_vo_bit(object);
// Clear log bits for dead objects to prevent a new nursery object having the unlog bit set
if self.common.needs_log_bit {
if self.common.needs_log_bit || self.common.needs_satb {
Member:

We should just check needs_log_bit. Tianle mentioned that there are assumptions in the code base that the log bit is tied to generational plans. We should fix that first.

Collaborator Author:

Yes, this needs_satb is a quick and dirty hack

@@ -49,6 +49,8 @@ pub struct GlobalState {
pub(crate) malloc_bytes: AtomicUsize,
/// This stores the live bytes and the used bytes (by pages) for each space in last GC. This counter is only updated in the GC release phase.
pub(crate) live_bytes_in_last_gc: AtomicRefCell<HashMap<&'static str, LiveBytesStats>>,
pub(crate) concurrent_marking_active: AtomicBool,
Member:

These two fields are maintained here so all policies can access them.

  • concurrent_marking_active basically means that concurrent marking is going on, and policies should allocate objects as if they were marked. I think we should add a flag such as allocate_as_live per space, and set the flag for the spaces (see the sketch after this list).
  • concurrent_marking_threshold is more like a global counter of allocated pages. We could maintain such a counter regardless of whether the plan is concurrent or not.
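
A minimal sketch of the per-space flag idea (the name allocate_as_live and the surrounding struct are illustrative only):

use std::sync::atomic::{AtomicBool, Ordering};

/// Illustrative stand-in for per-space state such as CommonSpace.
pub struct SpaceState {
    /// When set, the space treats freshly allocated objects as already marked/live,
    /// which is what concurrent marking needs between InitialMark and FinalMark.
    allocate_as_live: AtomicBool,
}

impl SpaceState {
    /// Flipped by the plan at the InitialMark / FinalMark pauses.
    pub fn set_allocate_as_live(&self, active: bool) {
        self.allocate_as_live.store(active, Ordering::SeqCst);
    }

    /// Checked in allocation slow paths (and initialize_object_metadata).
    pub fn should_allocate_as_live(&self) -> bool {
        self.allocate_as_live.load(Ordering::SeqCst)
    }
}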

Collaborator:

Each policy already has an initialize_object_metadata function that takes in a parameter alloc: bool. The parameter is currently unused. But this is fine-grained, and usually spaces would want to bulk set this metadata instead of per-object.

Member:

I think alloc means something different. It was initially from here:
https://github.com/JikesRVM/JikesRVM/blob/5072f19761115d987b6ee162f49a03522d36c697/MMTk/src/org/mmtk/policy/ExplicitLargeObjectSpace.java#L108-L109

We also misused it here

mmtk-core/src/mmtk.rs

Lines 594 to 597 in d93262b

self.get_plan()
    .base()
    .vm_space
    .initialize_object_metadata(object, false)

return true;
}
let threshold = self.get_total_pages() >> 1;
let concurrent_marking_threshold = self
Member:

This is comparing accumulated allocated pages with total pages. I guess that means we only want to do concurrent marking if we have allocated enough, otherwise we just do a normal full heap GC.

Collaborator Author:

This is a quick and dirty hack, and a naive heuristic: trigger concurrent marking once half of the heap has been allocated. I did not even differentiate between clean blocks and reusable blocks; the counter is always increased by Block::PAGES regardless of the block type.
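
For concreteness, the heuristic described above amounts to something like the following (a sketch with assumed names, not the exact code in this PR):

/// Trigger concurrent marking once the pages allocated since the last GC
/// exceed half of the total heap pages. Clean and reusable blocks are
/// counted the same way, so the counter is only an approximation.
fn should_trigger_concurrent_marking(allocated_pages_since_gc: usize, total_pages: usize) -> bool {
    let threshold = total_pages >> 1; // 50% of the heap
    allocated_pages_since_gc >= threshold
}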

@@ -45,6 +45,7 @@ pub struct PlanConstraints {
/// `MutatorConfig::prepare_func`). Those plans can set this to `false` so that the
/// `PrepareMutator` work packets will not be created at all.
pub needs_prepare_mutator: bool,
pub needs_satb: bool,
Member:

Maybe we don't need this. We can always check barrier == BarrierSelector::SATB.

Collaborator Author (@tianleq, Jul 29, 2025):

If we fix the needs_log_bit, then we no longer need this

impl<VM: VMBinding> SlotIterator<VM> {
    pub fn iterate(
        o: ObjectReference,
        // should_discover_references: bool,
Collaborator Author:

Yes, I copied this part from lxr; it has something to do with OpenJDK's reference processing and class unloading. At this stage we do not need them, but lxr does.

@@ -265,6 +265,21 @@ impl<VM: VMBinding> ImmixAllocator<VM> {
// Update the hole-searching cursor to None.
Some(end_line)
};
// mark objects if concurrent marking is active
Member:

This is the same as the code in LargeObjectSpace::initialize_object_metadata -- it marks freshly allocated objects as marked/live. We may want to introduce a flag for policies, such as allocate_as_live.

Collaborator Author:

Yes, we may introduce another alloc semantic. I did introduce alloc public in Iso

@@ -162,4 +162,7 @@ pub trait Collection<VM: VMBinding> {
fn create_gc_trigger() -> Box<dyn GCTriggerPolicy<VM>> {
unimplemented!()
}

/// Inform the VM of concurrent marking status
fn set_concurrent_marking_state(_active: bool) {}
Member:

The binding uses this to know whether SATB is active and to decide whether to call the 'soft reference load barrier'. This flag could be stored as a global flag for SATB barriers, and the binding could access the barrier. In that case, we may not need this function.

Collaborator Author:

The problem is that the barrier is an opaque pointer from the binding's perspective. If we want to store this flag in the barrier and let the binding access it, then we need to make sure all Rust barrier structs have the same size, and expose that to the binding. I do not think it is worthwhile to do so.

Member (@qinsoon, Jul 29, 2025):

Why is it an opaque pointer? It is a dyn Barrier fat pointer, and allows Downcast. We can always access the barrier from the binding-side Rust code and store an active flag in the same static for the C++ to use. E.g. at the end of a pause, check if SATB is active. If so, set the static variable and enable the load reference barrier.

But anyway, the proposal above is based on an assumption that we do not want the function fn set_concurrent_marking_state(_active: bool). If it is okay to keep the function, then we don't need the change.
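
A sketch of the binding-side alternative being discussed here, assuming the binding keeps its own static flag that the C++ load-reference barrier reads (names are illustrative):

use std::sync::atomic::{AtomicBool, Ordering};

/// Binding-side flag mirroring whether SATB (concurrent marking) is active.
/// The C++ barrier fast path can read this symbol directly.
#[no_mangle]
pub static CONCURRENT_MARKING_ACTIVE: AtomicBool = AtomicBool::new(false);

/// Called from binding-side Rust code at the end of an InitialMark or FinalMark
/// pause, e.g. after inspecting the mutator's barrier (which allows downcasting).
pub fn publish_satb_state(active: bool) {
    CONCURRENT_MARKING_ACTIVE.store(active, Ordering::SeqCst);
}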

@wks (Collaborator) commented Jul 29, 2025

What will happen if a mutator wants to fork() while concurrent marking is in progress? There are subtle interactions between preparing for forking (MMTK::prepare_to_fork), the concurrently running marking task (including the dirty mark bits, etc.), and the WorkerGoal which is currently only designed for triggering GC and triggering "prepare-to-fork".

We may temporarily tell the VM binding that MMTk doesn't currently support forking when using concurrent GC. We may fix it later. One simple solution is postponing the forking until the current GC finishes.

The status quo is that only CRuby and Android need forking. But CRuby will not support concurrent GC in the short term.

@k-sareen (Collaborator):

What will happen if a mutator wants to fork() while concurrent marking is in progress?

You can't let this happen. Either the binding needs to ensure that the mutator waits while concurrent marking is active, or you don't let a concurrent GC happen before forking (ART's method of dealing with this).

@wks (Collaborator) commented Jul 29, 2025

You can't let this happen. Either the binding needs to ensure that the mutator waits while concurrent marking is active, or you don't let a concurrent GC happen before forking (ART's method of dealing with this).

Agreed. Fortunately, the current API doc for prepare_to_fork says:

    /// This function sends an asynchronous message to GC threads and returns immediately, but it
    /// is only safe for the VM to call `fork()` after the underlying **native threads** of the GC
    /// threads have exited.  After calling this function, the VM should wait for their underlying
    /// native threads to exit in VM-specific manner before calling `fork()`.

So a well-behaving VM binding shall wait for all the GC worker threads (which are created by the binding via VMCollection::spawn_gc_thread anyway) to exit before calling fork(). That's VM-specific, but not hard. Extending this API to support concurrent GC should not require the VM binding to rewrite this part.

@k-sareen (Collaborator):

Mentioning it before I forget: we need to change the name of the GC counter to say it counts pauses, or have a separate counter that counts pauses.

Comment on lines 33 to 40
if mmtk.is_user_triggered_collection() || is_emergency {
    // user triggered collection is always stop-the-world
    mmtk.get_plan().schedule_collection(worker.scheduler());
} else {
    // Let the plan to schedule collection work
    mmtk.get_plan()
        .schedule_concurrent_collection(worker.scheduler());
}
Collaborator:

I think this decision should be made by the plan. We are hard-coding the policy that "if it is user-triggered collection or emergency collection, then schedule a full-heap GC, otherwise schedule a concurrent GC". But this only makes sense for the ConcurrentImmix plan. For existing plans, there is no concurrent collection. And other concurrent plans to be implemented in the future may do concurrent collections in user-triggered and emergency collections, too.

I suggest we

  1. Remove Plan::schedule_concurrent_collection
  2. Push this if-else down to ConcurrentImmix::schedule_collection. All the needed information (whether it is emergency GC or user-triggered GC) can be looked up from mmtk.state.

If Plan::schedule_collection needs a &mut GCWorker argument (which can reach &MMTK and then reach the state), we should add it. Most GC-related methods of trait Plan (including prepare, release, etc.) have either &mut GCWorker or VMWorkerThread as parameter.
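
A rough sketch of that suggestion; the helper methods on the plan are hypothetical, and the flags would be looked up from mmtk.state:

// Sketch of what ConcurrentImmix::schedule_collection could do: the plan, not
// the generic GC-trigger work packet, decides which kind of collection to run.
fn schedule_collection_sketch<VM: VMBinding>(
    plan: &ConcurrentImmix<VM>,
    scheduler: &GCWorkScheduler<VM>,
    user_triggered: bool, // from mmtk.state
    emergency: bool,      // from mmtk.state
) {
    if user_triggered || emergency {
        // User-triggered and emergency collections stay stop-the-world.
        plan.schedule_full_stw_collection(scheduler); // hypothetical helper
    } else {
        plan.schedule_concurrent_collection(scheduler);
    }
}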

Collaborator Author:

Yes, this part should be in the concurrent plan trait, similar to what we have for the generational plan.

#[repr(u8)]
#[derive(Debug, PartialEq, Eq, Copy, Clone, NoUninit)]
pub enum Pause {
Full = 1,
Collaborator:

Do we have a better name than Full? I think here it means doing a full GC in one STW (as opposed to doing part of a GC, such as initial mark, final mark, or an incremental part of a GC). But it may be confused with full-heap GC (as opposed to nursery GC).

Collaborator Author:

Actually, I have the same concern, but I could not find a better term, so I simply left Full here as a placeholder.


fn load_reference(&mut self, _o: ObjectReference) {}

fn object_reference_clone_pre(&mut self, _obj: ObjectReference) {}
Collaborator:

This method only has a blank implementation in BarrierSemantics::object_reference_clone_pre, and SATBBarrier::object_reference_clone_pre calls that. What is this method intended for?

Comment on lines 203 to 224
                    let worker = self.worker();
                    let mmtk = self.mmtk();
                    let w = ConcurrentTraceObjects::new(root_objects.clone(), mmtk);

                    match pause {
                        Pause::InitialMark => worker.scheduler().postpone(w),
                        _ => unreachable!(),
                    }

                    root_objects.clear();
                }
            }
        }
        if !root_objects.is_empty() {
            let worker = self.worker();
            let w = ConcurrentTraceObjects::new(root_objects.clone(), self.mmtk());

            match pause {
                Pause::InitialMark => worker.scheduler().postpone(w),
                _ => unreachable!(),
            }
        }
Collaborator:

We may extract a lambda closure to reduce code duplication.
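
For example, the duplicated postpone logic could be hoisted into a small helper or closure, roughly like this (a sketch; the surrounding method and types are assumed from the hunk above):

// Shared by both flush points: wrap the buffered roots in a tracing packet
// and postpone it until after the InitialMark pause.
fn postpone_roots<VM: VMBinding>(
    worker: &GCWorker<VM>,
    mmtk: &'static MMTK<VM>,
    pause: Pause,
    objects: Vec<ObjectReference>,
) {
    let packet = ConcurrentTraceObjects::new(objects, mmtk);
    match pause {
        Pause::InitialMark => worker.scheduler().postpone(packet),
        _ => unreachable!(),
    }
}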

Comment on lines +155 to +175
struct SlotIteratorImpl<VM: VMBinding, F: FnMut(VM::VMSlot)> {
    f: F,
    // should_discover_references: bool,
    // should_claim_clds: bool,
    // should_follow_clds: bool,
    _p: PhantomData<VM>,
}

impl<VM: VMBinding, F: FnMut(VM::VMSlot)> SlotVisitor<VM::VMSlot> for SlotIteratorImpl<VM, F> {
    fn visit_slot(&mut self, slot: VM::VMSlot) {
        (self.f)(slot);
    }
}

pub struct SlotIterator<VM: VMBinding> {
    _p: PhantomData<VM>,
}

impl<VM: VMBinding> SlotIterator<VM> {
    pub fn iterate(
        o: ObjectReference,
Collaborator:

We may not really need SlotIterator or SlotIteratorImpl. Note that in mmtk-core/src/vm/scanning.rs we already have SlotVisitor and impl<SL: Slot, F: FnMut(SL)> SlotVisitor<SL> for F {. That's enough for us to pass a lambda (&mut |slot| { ... }) to Scanning::scan_object and use it as the slot_visitor argument.

I think the lxr branch introduced this struct to do some kind of filtering, and multiplexed multiple methods of Scanning, including the newly introduced Scanning::scan_object_with_klass.

However it is still nice to add a method ObjectReference::iterate_fields. It's convenient.

Comment on lines +704 to +706
pub fn iterate_fields<VM: VMBinding, F: FnMut(VM::VMSlot)>(self, f: F) {
    SlotIterator::<VM>::iterate(self, f)
}
Collaborator:

Since FnMut(VM::VMSlot) already implements trait SlotVisitor, we can directly pass it to scan_object.

Suggested change
-pub fn iterate_fields<VM: VMBinding, F: FnMut(VM::VMSlot)>(self, f: F) {
-    SlotIterator::<VM>::iterate(self, f)
-}
+pub fn iterate_fields<VM: VMBinding, F: FnMut(VM::VMSlot)>(self, mut f: F) {
+    <VM::VMScanning as Scanning<VM>>::scan_object(
+        VMWorkerThread(VMThread::UNINITIALIZED),
+        self,
+        &mut f,
+    )
+}

However, this is still imperfect.

  1. Not all VMs support scan_object. CRuby is one notable exception. If we only intend to visit the value of all (non-null) fields without updating (forwarding) any of them, I suggest we introduce another method in the Scanning trait, such as Scanning::enumerate_children. Otherwise, consider invoking Scanning::support_slot_enqueuing and scan_object_and_trace_edges, too.
  2. This method doesn't have tls. For CRuby, I am implementing object scanning by setting a thread-local variable (which is accessed via the tls) as a call-back, calling a C function to visit fields, and the C function will call the call-back to let the Rust code visit the field value. But because this method is called from the barrier, it may not always provide a VMWorkerThread. I suggest the newly introduced method Scanning::enumerate_children to take a VMThread instead of VMWorkerThread as argument.

In summary, I am also suggesting we introduce a method

pub trait Scanning<VM: VMBinding> {
    /// Visit the children of an object.
    ///
    /// This method may be called during mutator time, and is required by concurrent GC. Currently,
    /// we don't support concurrent copying GC, so this method can assume no objects are moved by GC
    /// while this method is running.
    fn enumerate_children(
        tls: VMThread,
        object: ObjectReference,
        child_visitor: &mut impl FnMut(ObjectReference),
    );
}

And if we make that change, then ObjectReference::iterate_fields will call Scanning::enumerate_children instead.

Collaborator @wks left a comment:

The current "WorkerGoal" mechanism should be able to handle the case where mutators trigger another GC between InitialMark and FinalMark. We can remove WorkPacketStage::Initial and WorkPacketStage::ConcurrentSentinel, and use GCWorkScheduler::on_last_parked to transition the GC state from InitialMark to the concurrent marking to FinalMark and finally finish the GC. See inline comments for more details.

notify = true;
}
if notify {
self.wakeup_all_concurrent_workers();
Collaborator:

This wakes up other workers. However, the current GC worker considers itself the "last parked worker". So after the on_gc_finished function returns, it will clear the current "goal" (goals.on_current_goal_completed()) and try to handle another "goal" (i.e. a GC request) in self.respond_to_requests(worker, goals). Unless the mutator triggers another GC immediately, the current GC worker thread will find that there are no requests, and park itself (return LastParkedResult::ParkSelf;). If we set MMTK_THREADS=1, it will never wake up again.

So we need to make some changes to GCWorkScheduler::on_last_parked so that if this is the end of the InitialMark STW, the on_last_parked method should return LastParkedResult::WakeAll so that all GC workers wake up again, including the current GC worker thread itself.

And we shouldn't clear the current "worker goal" if it is the end of InitialMark. The GC workers should still be working towards the "GC" goal. But it needs to be a bit stateful, knowing whether it is (1) during a "full GC", (2) during initial mark, (3) during final mark, or (4) between initial mark and final mark (we may just call it "concurrent tracing"). As long as the GC workers are still working on a goal (regardless of concrete state), it will not accept other GC requests or requests for forking, which automatically solves the forking problem.

And when concurrent tracing finishes (all the postponed work packets and their children are drained), the last parked worker will reach on_last_parked again. This time, it can immediately start scheduling the FinalMark stage (or should we call it a "super stage" to distinguish it from a work bucket stage, or just call it an STW?). When FinalMark finishes, the GC really finishes. GC workers can now accept requests for another GC or requests for forking.
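
One way to make the goal stateful, as described above, is to track a small phase enum while the "GC" goal is active (purely illustrative):

/// Sub-state of the "GC" worker goal for a concurrent plan (sketch only).
/// While any of these phases is active, the scheduler keeps the GC goal and
/// refuses new GC requests or requests for forking.
enum GcPhase {
    /// The whole GC is done in a single stop-the-world pause.
    FullStw,
    /// The InitialMark pause is in progress.
    InitialMark,
    /// Between InitialMark and FinalMark: workers drain concurrent tracing packets.
    ConcurrentTracing,
    /// The FinalMark pause is in progress; when it ends, the GC goal completes.
    FinalMark,
}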

@@ -250,3 +260,82 @@ impl<S: BarrierSemantics> Barrier<S::VM> for ObjectBarrier<S> {
}
}
}

pub struct SATBBarrier<S: BarrierSemantics> {
Member:

This seems to be a pre-write ObjectBarrier. The current ObjectBarrier only implements the post-write functions, and this implementation implements the pre-write functions.

@@ -29,6 +31,12 @@ pub struct GCWorkScheduler<VM: VMBinding> {
pub(crate) worker_monitor: Arc<WorkerMonitor>,
/// How to assign the affinity of each GC thread. Specified by the user.
affinity: AffinityKind,

pub(super) postponed_concurrent_work:
Member:

The work that gets 'postponed' is actual concurrent work packets. The implementation uses these queues to temporarily store the packets, to avoid directly pushing them to the Unconstrained bucket and getting that work done in the current pause. Only after the pause is done does the scheduler transfer the work from the queue back to Unconstrained.

Why not use a separate bucket for concurrent work? The bucket is not opened during pause, and gets opened when the pause is done.

@@ -59,6 +60,8 @@ pub struct WorkBucket<VM: VMBinding> {
/// recursively, such as ephemerons and Java-style SoftReference and finalizers. Sentinels
/// can be used repeatedly to discover and process more such objects.
sentinel: Mutex<Option<Box<dyn GCWork<VM>>>>,
in_concurrent: AtomicBool,
Member:

This field seems never used.

@wks (Collaborator) commented Jul 30, 2025

It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects (actually the mark bits of all words in a line are set). This is OK. But what if we need VO bits? The default strategy for maintaining VO bits is copying from the mark bits. Since all words of the line are marked, the VO bits will also be all ones if we blindly copy VO bits from the mark bits. But because post_alloc already sets the VO bits, we just need to keep their VO bits as is.

To properly support VO bits, we need to do two things:

  1. Reset the BumpPointer of ImmixAllocator so that after the InitialMark, all mutators start allocation from empty lines (i.e. make sure a line never contains both objects allocated before InitialMark and objects allocated after InitialMark).
  2. In Block::sweep,
    • if a line only contains objects allocated after InitialMark, we keep the VO bits as is;
    • if it only contains live objects before InitialMark, we copy the VO bits over from mark bits;
    • otherwise the line must be empty. We clear its VO bits.

There are multiple ways to know if a line only contains new objects or old objects.

  • If a line is marked, but the mark bits of a line are all zero, it must be a line that only contains objects allocated after InitialMark. This doesn't need extra metadata. (Update: This is not true. If a mutator adds an edge from a live old object to the new object, the new object will still be reachable.)
  • If a line is marked, and the mark bits are a superset of the VO bits, it must be a line that only contains objects allocated after InitialMark. If a block only contains objects from before the InitialMark, the mark bits must be a subset of the VO bits. Only all-ones is a superset of all possible VO-bit patterns. If the mark bits and VO bits are identical (every word is an individual object, and they are either all live or all new), it doesn't matter whether we copy the mark bits or retain the VO bits. Currently each Line is 256 bytes, corresponding to 32 bits of mark bits or VO bits, so it should be easy and efficient to do this as a 32-bit bit operation (see the sketch after this list).
  • We introduce another one-bit-per-line metadata to record if it contains objects allocated after InitialMark.
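
A sketch of the superset test from the second bullet, assuming a line's mark bits and VO bits each fit in one u32 word (as stated above for 256-byte lines); the function name is made up:

/// Decide the new VO bits for a marked line during sweeping (sketch only).
/// `mark_bits` and `vo_bits` are the 32 metadata bits covering one 256-byte line.
fn updated_vo_bits_for_marked_line(mark_bits: u32, vo_bits: u32) -> u32 {
    if (vo_bits & !mark_bits) == 0 {
        // The mark bits are a superset of the VO bits: the line holds only
        // objects allocated after InitialMark (or the two patterns are
        // identical), so keep the VO bits set by post_alloc.
        vo_bits
    } else {
        // Otherwise the line holds only objects that were live before
        // InitialMark: copy the VO bits over from the mark bits.
        mark_bits
    }
}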

@tianleq (Collaborator Author) commented Jul 30, 2025

"It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects." This is not true. Both lines and every word within that line are marked. So if one blindly copy mark bits to vo bits, then vo bit will be 1 even if that address is not an object

@qinsoon (Member) commented Jul 30, 2025

It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects. This is OK. But what if we need VO bits?

I think we can just mark lines, and also mark each individual object. The bulk set only works for side mark bits anyway. Let's not get things entangled and complicated.

@tianleq (Collaborator Author) commented Jul 30, 2025

It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects. This is OK. But what if we need VO bits?

I think we can just mark lines, and also mark each individual object. The bulk set only works for side mark bits anyway. Let's not get things entangled and complicated.

The problem is that we do not want to do the check in the fast path. If we want to mark each individual object, then in the allocation fast path we need to check whether concurrent marking is active and then set the mark bit, while the current bulk-setting approach only sets the mark bits in the slow path.

@wks (Collaborator) commented Jul 30, 2025

"It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects." This is not true. Both lines and every word within that line are marked. So if one blindly copy mark bits to vo bits, then vo bit will be 1 even if that address is not an object

I think we can just mark lines, and also mark each individual object. The bulk set only works for side mark bits anyway. Let's not get things entangled and complicated.

Actually I think for correctness, we must not set mark bits of individual objects when allocating. Suppose there are objects A and B when GC is triggered, and root -> ... -> A -> B. During concurrent marking, a mutator allocated C, and changed the object graph to root -> ... -> A -> C -> B. Then a GC worker visits A for the first time. If C already has the mark bit, GC will not enqueue C, and will not enqueue B, either. It will consider B as dead.

I forgot the SATB barrier. It will remember B when we remove the edge A -> B.

@tianleq (Collaborator Author) commented Jul 30, 2025

"It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects." This is not true. Both lines and every word within that line are marked. So if one blindly copy mark bits to vo bits, then vo bit will be 1 even if that address is not an object

I think we can just mark lines, and also mark each individual object. The bulk set only works for side mark bits anyway. Let's not get things entangled and complicated.

Actually I think for correctness, we must not set mark bits of individual objects when allocating. Suppose there are objects A and B when GC is triggered, and root -> ... -> A -> B. During concurrent marking, a mutator allocated C, and changed the object graph to root -> ... -> A -> C -> B. Then a GC worker visits A for the first time. If C already has the mark bit, GC will not enqueue C, and will not enqueue B, either. It will consider B as dead.

In your case, B will be captured by the SATB barrier, so it will not be considered dead. There is no need to scan newly allocated objects, because any children of those objects must have been alive in the snapshot and are thus guaranteed to be traced.

@qinsoon (Member) commented Jul 30, 2025

It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects. This is OK. But what if we need VO bits?

I think we can just mark lines, and also mark each individual object. The bulk set only works for side mark bits anyway. Let's not get things entangled and complicated.

The problem is that we do not want to do the check in the fast path. If we want to mark each individual object, then in the allocation fast-path, we need to check if concurrent marking is active and then set the mark bit. While the current bulk setting way only set the mark bit in the slow-path

The mark bit could be in the header, and we have to set it per object if it is in the header. We can differentiate header mark bit and side mark bit, and deal with each differently.

But bulk-setting mark bits is still a bit hacky -- and this is why we would have issues with VO bits. The VO-bit strategy copies from mark bits, assuming mark bits are only set for individual objects.

@k-sareen (Collaborator) commented Aug 1, 2025

The solution for the VO-bit is to not use the CopyFromMarkBits for the ConcurrentImmix plan. You would always need to do the ClearAndReconstruct strategy. @tianleq is right in that this is too complex to do in the fast-path/JIT compiler. Hence why it's better to do it in bulk in the slow-path when grabbing new pages for a space.

@qinsoon (Member) commented Aug 1, 2025

We would need every policy to check if it should allocate objects as live while initializing object metadata. That would be a boolean flag that can be set by concurrent Immix. This also means those policies will always have this condition, no matter whether they are used in concurrent Immix or not.
One way I can think of to avoid this is to add a const (generic const) to each space type, something like MAY_ALLOCATE_AS_LIVE. Concurrent Immix initializes spaces with MAY_ALLOCATE_AS_LIVE = true, and other (non-concurrent) plans initialize it as MAY_ALLOCATE_AS_LIVE = false. This can turn a conditional check into a constant. But this sounds annoying and complicates the code, so I don't quite like this approach.
An alternative is that we keep the condition check in MMTk core, and argue that a binding should have its JIT generate a more efficient post_alloc function that omits this condition check for non-concurrent plans. So it is okay to have such a condition check in MMTk core.
@wks What do you think?

I think doing a condition check should be good enough. If the VM binding really calls into MMTk core to do alloc or post_alloc, the real bottleneck will be the virtual dispatch over space. If the binding cares about performance, it should always implement fast paths on the binding side, even if the VM is mostly interpreted like CRuby.

The problem is still how the fast path can do this. Different policies may have different requirements. For example, the immix space not only needs to mark the object, it also needs to mark lines.

The JIT generated fastpath mimics what MMTk core does. If a space has this check in the initialize_object_metadata which appears in the post alloc, the JIT should do what MMTk core does. If a space doesn't have anything in initialize_object_metadata, then JIT doesn't need to do anything.

@qinsoon (Member) commented Aug 1, 2025

@tianleq is right in that this is too complex to do in the fast-path/JIT compiler.

Why is it complex? It is trivial.

if current plan may allocate object as live:
  emit post_alloc with a check
else
  emit normal post_alloc

@tianleq (Collaborator Author) commented Aug 1, 2025

@tianleq is right in that this is too complex to do in the fast-path/JIT compiler.

Why is it complex? It is trivial.

if current plan may allocate object as live:
  emit post_alloc with a check
else
  emit normal post_alloc

Because the check itself can be complicated. In concurrent marking Immix, if the object is in the immix space, lines also need to be marked. Writing this whole thing in the fast path is non-trivial and error-prone.

@k-sareen (Collaborator) commented Aug 1, 2025

If your suggestion is that we should do a function call in the fast-path, then that's not a fast-path anymore. If your suggestion is to write the JIT-compiler instructions in the fast-path which set the metadata like line mark state, mark-bit, etc. then this is entirely policy dependent. For example, you would need to have separate JIT-generated instructions for an ImmixSpace object and a LOS object. Not to mention that the metadata address for local metadata is hard to calculate.

The whole idea of the fast-path is that we don't have to do extra checks per-object allocation and the object already has the correct state set when allocated.

@wks (Collaborator) commented Aug 1, 2025

The solution for the VO-bit is to not use the CopyFromMarkBits for the ConcurrentImmix plan. You would always need to do the ClearAndReconstruct strategy. @tianleq is right in that this is too complex to do in the fast-path/JIT compiler. Hence why it's better to do it in bulk in the slow-path when grabbing new pages for a space.

I already have a solution for the CopyFromMarkBits strategy (I updated #1355 (comment)). In simple terms, we leave the VO bits as is if the mark bits of a line are a superset of the VO bits of the same line; otherwise copy the mark bits.

@wks (Collaborator) commented Aug 1, 2025

I think what Tianle is currently doing, i.e. setting both side (per-object) mark bits and line mark bits to 1 in bulk when finding the next contiguous lines, is good enough.

Currently the ImmixSpace does not support in-header mark bits. If we let ImmixSpace support header mark bits, we can eagerly set the line mark bits, but set per-object mark bits in post_alloc (in Space::initialize_object_metadata or just Mutator::post_alloc if all spaces use the same mark bit metadata). Then we no longer need to mark the line in fast paths.

@qinsoon (Member) commented Aug 1, 2025

@tianleq

Because the check itself can be complicated.

I meant a simple check like what you do for SATB barrier for reference loading: https://github.com/tianleq/mmtk-openjdk/blob/5aa00cd32344bda65d35bba42f9df6b2757991ae/openjdk/barriers/mmtkSATBBarrier.cpp#L57
How is that complicated?

In concurrent marking immix, if the object is in immix space, lines also need to be marked. Writing this whole thing in fast-path is non-trivial and error-prone

Marking lines is not in post_alloc, and is not implemented as a fastpath by the binding. What made you think I implied implementing line marking in fast path?

@qinsoon (Member) commented Aug 1, 2025

@k-sareen

If your suggestion is that we should do a function call in the fast-path, then that's not a fast-path anymore.

I didn't suggest that. The pseudocode is clearly the compiler's code, not the generated code. At JIT time, the JIT checks whether a plan may allocate objects as marked; if so, the JIT needs to emit code that checks whether we should currently allocate objects as live or not.

If your suggestion is to write the JIT-compiler instructions in the fast-path which set the metadata like line mark state, mark-bit, etc. then this is entirely policy dependent.

First, I didn't suggest that.
Second, the allocation fastpath (including post alloc) is entirely policy dependent. It is and it always has been.

For example, you would need to have separate JIT-generated instructions for an ImmixSpace object and a LOS object. Not to mention that the metadata address for local metadata is hard to calculate.

Yes. It is not something new.
A JIT generates code as the fastpath to mimic what MMTk core does. This is the design, this is always the design.

The whole idea of the fast-path is that we don't have to do extra checks per-object allocation and the object already has the correct state set when allocated.

No. If a binding uses header mark bits, and needs to set the mark bits when SATB is active and not set the mark bits when SATB is not active, there is no other option. It is intrinsic to the algorithm.

@k-sareen (Collaborator) commented Aug 1, 2025

Yes. It is not something new. A JIT generates code as the fastpath to mimic what MMTk core does. This is the design, this is always the design.

I'm saying that this is too complicated to implement in the JIT. The control flow is non-trivial.

Marking lines is not in post_alloc, and is not implemented as a fastpath by the binding. What made you think I implied implementing line marking in fast path?

But lines need to be marked as no one will mark the line otherwise, i.e. we would have to change the post_alloc function to mark lines based on is_concurrent_marking_active.

@qinsoon (Member) commented Aug 1, 2025

Marking lines is not in post_alloc, and is not implemented as a fastpath by the binding. What made you think I implied implementing line marking in fast path?

But lines need to be marked as no one will mark the line otherwise, i.e. we would have to change the post_alloc function to mark lines based on is_concurrent_marking_active.

Why? We mark lines in the slowpath, we don't need to do that in the fastpath.

We avoid using the term "reference" to refer to weak reference objects
as in JVM.  We explicitly mention "weak reference" instead.

Also added comments to clarify the purpose of that barrier function.
@k-sareen (Collaborator) commented Aug 1, 2025

Sure. But then why not do this for the mark bit as well? It avoids the overhead of figuring out the metadata address, etc., and amortizes the cost of setting the metadata. Everything we have in OpenJDK so far is for global metadata (unlog bit, alloc bit, etc.). No post_alloc has been inlined, and neither has any use of local metadata. I think this discussion is better continued in the next meeting.

Comment on lines +62 to +63
// If we disable moving in Immix, this is a non-moving plan.
moves_objects: false,
Collaborator:

Since this plan falls back to STW GC, there is in principle nothing preventing us from doing defragmentation during STW GC. This means this plan can still move objects.

If it is easy to enable defragmentation, we should do it so that we won't blindly need 4x-7x min heap size. But if it is difficult or buggy, we can leave this version as fully non-moving, and enable moving in the future.

Collaborator Author:

It might be relatively easy, as we just need to make sure the concurrent GC does not do defrag. But I have never tried this, so it may have bugs.

Collaborator (@wks, Aug 1, 2025):

I tried locally and set PlanConstraints::moves_objects to true, and it works. But when I run a benchmark like lusearch, it only triggers one single STW GC at the beginning of the execution, and it is a defrag GC. Then it only triggers concurrent GCs and never executes an STW GC again until the benchmark finishes. If I reduce the heap size, it will thrash, triggering concurrent GC again and again, and the program will never make progress. I guess there is something wrong with the GC trigger that makes the GC thrash.

I am not reducing it too small. For example, I ran lusearch at 80MB, and ConcurrentImmix thrashes when the progress is about 60%. In theory, it should be able to finish the benchmark by running only STW GC using plain Immix, in which case 80MB is too generous.

Collaborator:

lusearch is a bad example since it's highly generational so it's quite possible that entire blocks get freed up even in a concurrent non-moving GC. Maybe try some other benchmark?

Collaborator Author (@tianleq, Aug 5, 2025):

The concurrent GC triggering is naive: it blindly counts how many pages have been allocated, and once 50% of the pages are allocated, it starts a concurrent GC. Reusable blocks are treated the same as clean blocks, so the counter is imprecise.

@qinsoon (Member) commented Aug 1, 2025

Sure. But then why not do this for the mark-bit as well? It avoids the overhead of figuring out the metadata address etc. etc. + amortizes the cost of setting the metadata. Everything we have in OpenJDK so far is for global metadata (unlog bit, alloc bit, etc.). No post_alloc has been inlined, and neither has any use of local metadata. I think this discussion is better continued in the next meeting.

See #1355 (comment) and the few replies after it.

Clippy on the stable version of Rust seems to erroneously consider
`std::mem::swap(&mut a, &mut lg)` as swapping with a temporary, where
`lg` is a lock guard.  We may fix the warning by writing `&mut *lg`, but
the current use cases can all be expressed more concisely using
`std::mem::replace` and `std::mem::take`.

See: https://rust-lang.github.io/rust-clippy/master/index.html#swap_with_temporary
Collaborator @wks left a comment:

I think the RwLock on BucketQueue::queue is a hack to allow replacing the queue. But adding a lock would hurt the performance of other plans. Since we are sure the Unconstrained bucket must be empty when we schedule concurrent work, we can simply add work packets in a loop to achieve that, and that only requires a shared reference (&queue).

I have made a revision for that: https://github.com/wks/mmtk-core/commit/9ac0d8234905deea5fd22d731f49ba37f7bded48

But I didn't push it into this PR. There is an alternative solution which @qinsoon mentioned, i.e. making a dedicated bucket for concurrent work packets. I'll leave it to @qinsoon to decide whether to cherry-pick my change, or add a new dedicated bucket.

@@ -7,34 +7,35 @@ use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

struct BucketQueue<VM: VMBinding> {
queue: Injector<Box<dyn GCWork<VM>>>,
// queue: Injector<Box<dyn GCWork<VM>>>,
queue: std::sync::RwLock<Injector<Box<dyn GCWork<VM>>>>,
Collaborator:

This RwLock may affect the performance of other plans.

I think the sole purpose of adding the RwLock is making it possible to swap the queue of the Unconstrained bucket with another Injector instance (the postponed queues). If we are sure the Unconstrained bucket must be empty, we can simply call WorkBucket::bulk_add but without notifying GC workers.

@@ -29,6 +31,12 @@ pub struct GCWorkScheduler<VM: VMBinding> {
pub(crate) worker_monitor: Arc<WorkerMonitor>,
/// How to assign the affinity of each GC thread. Specified by the user.
affinity: AffinityKind,

pub(super) postponed_concurrent_work:
spin::RwLock<crossbeam::deque::Injector<Box<dyn GCWork<VM>>>>,
Collaborator:

This is using spin::RwLock, but I don't see a particular reason. We may use plain std::sync::RwLock instead.

Collaborator:

spin::RwLock might have better performance, especially on non-x86 architectures. It's best if we could benchmark it before we make a decision.

@k-sareen (Collaborator) commented Aug 5, 2025

Like we discussed in Monday's meeting, the VO-bit strategy actually depends on the plan as well, not just the VM. I think this should not be a const function since it could depend on the selected plan:

/// Select a strategy for the VM. It is a `const` function so it always returns the same strategy
/// for a given VM.
const fn strategy<VM: VMBinding>() -> VOBitUpdateStrategy {
    // CopyFromMarkBits performs better than ClearAndReconstruct, and it also allows using
    // VO bits during tracing. We use it as the default strategy.
    // TODO: Revisit this choice in the future if non-trivial changes are made and the performance
    // characterestics may change for the strategies.
    match VM::VMObjectModel::LOCAL_MARK_BIT_SPEC.as_spec() {
        // Note that currently ImmixSpace doesn't support in-header mark bits,
        // but the DummyVM for testing declares mark bits to be "in header" as a place holder
        // because it never runs GC.
        MetadataSpec::InHeader(_) => VOBitUpdateStrategy::ClearAndReconstruct,
        MetadataSpec::OnSide(_) => VOBitUpdateStrategy::CopyFromMarkBits,
    }
}

@@ -59,6 +60,8 @@ pub struct WorkBucket<VM: VMBinding> {
/// recursively, such as ephemerons and Java-style SoftReference and finalizers. Sentinels
/// can be used repeatedly to discover and process more such objects.
sentinel: Mutex<Option<Box<dyn GCWork<VM>>>>,
in_concurrent: AtomicBool,
disable: AtomicBool,
Member:

This is also never used.
