Concurrent immix #1355
base: master

Conversation
```diff
@@ -16,4 +16,4 @@ pub const BLOCK_ONLY: bool = false;

 /// Mark lines when scanning objects.
 /// Otherwise, do it at mark time.
-pub const MARK_LINE_AT_SCAN_TIME: bool = true;
+pub const MARK_LINE_AT_SCAN_TIME: bool = false;
```
Concurrent marking no longer has a scan-object packet, so for simplicity I just changed this value.
```diff
@@ -51,3 +53,9 @@ pub mod vm;
 pub use crate::plan::{
     AllocationSemantics, BarrierSelector, Mutator, MutatorContext, ObjectQueue, Plan,
 };
+
+static NUM_CONCURRENT_TRACING_PACKETS: AtomicUsize = AtomicUsize::new(0);
```
I think this should be moved into GlobalState. Whoever picks up this PR can do that refactoring. I think we should reduce our reliance on statics in general.

Right. concurrent_marking_packets_drained is only accessed by the plan and by ConcurrentTraceObjects (which has a reference to the plan). We can move this to the plan (or make it a method on a ConcurrentPlan trait).
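A minimal sketch of that refactoring, with hypothetical names (ConcurrentPlan, concurrent_tracing_packets), assuming the counter simply tracks in-flight concurrent tracing packets:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical trait sketch: the counter lives in the plan instead of a
/// crate-level static.
pub trait ConcurrentPlan {
    /// Number of concurrent tracing packets that are still in flight.
    fn concurrent_tracing_packets(&self) -> &AtomicUsize;

    fn concurrent_marking_packets_drained(&self) -> bool {
        self.concurrent_tracing_packets().load(Ordering::SeqCst) == 0
    }
}

/// A plan that owns the counter as a normal field.
pub struct ConcurrentImmixSketch {
    num_concurrent_tracing_packets: AtomicUsize,
}

impl ConcurrentPlan for ConcurrentImmixSketch {
    fn concurrent_tracing_packets(&self) -> &AtomicUsize {
        &self.num_concurrent_tracing_packets
    }
}
```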
```diff
@@ -524,6 +528,7 @@ pub struct CommonSpace<VM: VMBinding> {
     /// This field equals to needs_log_bit in the plan constraints.
     // TODO: This should be a constant for performance.
     pub needs_log_bit: bool,
+    pub needs_satb: bool,
```
I added this because needs_log_bit has generational semantics associated with it.
```diff
 pub fn prepare(&mut self, full_heap: bool) {
     if full_heap {
-        debug_assert!(self.treadmill.is_from_space_empty());
+        // debug_assert!(self.treadmill.is_from_space_empty());
```
I had to comment out this assertion: a full stop-the-world GC might be triggered during concurrent marking, and the assertion would fire in that case.
src/plan/global.rs (outdated)
```diff
@@ -337,6 +349,9 @@ pub trait Plan: 'static + HasSpaces + Sync + Downcast {
             space.verify_side_metadata_sanity(&mut side_metadata_sanity_checker);
         })
     }
+
+    fn gc_pause_start(&self, _scheduler: &GCWorkScheduler<Self::VM>) {}
+    fn gc_pause_end(&self) {}
```
Why do we need a separate gc_pause_end? It seems to be called at the same location as end_of_gc. Was the method added before we had end_of_gc?

gc_pause_start is called before MMTk asks mutators to stop. That looks like a reasonable place for hooks.
Strictly speaking, the semantics of gc_pause_end and end_of_gc are different. GC is not done yet when the initial pause finishes.
Those two hooks are called at the same location. I think this is just our old assumption that GC == pause/STW.
Yes, they are similar at the moment. Ideally, we should treat them differently.
I renamed gc_pause_start to notify_mutators_paused. After mutators are stopped, we notify both the plan and the scheduler.

```rust
mmtk.get_plan().notify_mutators_paused(&mmtk.scheduler);
mmtk.scheduler.notify_mutators_paused(mmtk);
```

I merged gc_pause_end with end_of_gc. I will rename end_of_gc in a separate pull request to address #1362.
```diff
@@ -191,11 +199,24 @@ impl<VM: VMBinding> GCWork<VM> for ReleaseCollector {
 ///
 /// TODO: Smaller work granularity
 #[derive(Default)]
-pub struct StopMutators<C: GCWorkContext>(PhantomData<C>);
+pub struct StopMutators<C: GCWorkContext> {
+    pause: Pause,
```
Pause is defined for concurrent plans, and is leaking outside the plan.
Or we can change this pause to a boolean flag indicating whether stack scanning is needed. For the final pause, no stack scanning is needed; it only needs to bring all mutators to a safepoint.
The Pause abstraction is good though. You can imagine an incremental GC using the same plumbing, and an incremental GC is not a concurrent GC. Making it a boolean might lose semantics.
I think the Pause is relevant to non-concurrent GC, too. Semantically, it describes why MMTk is stopping (pausing) mutators. For a STW GC, it is only "Full", while concurrent GCs have more reasons to pause mutators, including InitialMark and FinalMark.

But I do think the term "Full" may need to be changed. See the other comment of mine.
> I think the Pause is relevant to non-concurrent GC, too.

It depends on what you mean by Pause. The current Pause is defined as an enum of [Full, InitialMark, FinalMark]. This is clearly specific to the plan. We can have a type to describe Pause, but its definition should be plan-specific.

In this case, the pause is used by StopMutators to decide whether to schedule mutator root scanning or not. What we could have is: when the plan schedules StopMutators for the final mark, it sets a flag on StopMutators to ask the work packet to skip mutator root scanning (see the sketch below).
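A minimal sketch of that alternative, with hypothetical field and method names, where the work packet carries only the information it needs rather than the plan-specific Pause:

```rust
use std::marker::PhantomData;

/// Hypothetical sketch: `StopMutators` only records whether mutator root
/// scanning should happen in this pause.
pub struct StopMutatorsSketch<C> {
    scan_mutator_roots: bool,
    phantom: PhantomData<C>,
}

impl<C> StopMutatorsSketch<C> {
    /// The plan passes `false` when scheduling the final-mark pause, which
    /// only needs to bring mutators to a safepoint.
    pub fn new(scan_mutator_roots: bool) -> Self {
        Self {
            scan_mutator_roots,
            phantom: PhantomData,
        }
    }

    pub fn should_scan_mutator_roots(&self) -> bool {
        self.scan_mutator_roots
    }
}
```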
```diff
@@ -308,7 +354,7 @@ impl<VM: VMBinding> LargeObjectSpace<VM> {
     #[cfg(feature = "vo_bit")]
     crate::util::metadata::vo_bit::unset_vo_bit(object);
     // Clear log bits for dead objects to prevent a new nursery object having the unlog bit set
-    if self.common.needs_log_bit {
+    if self.common.needs_log_bit || self.common.needs_satb {
```
We should just check needs_log_bit. Tianle mentioned that there are assumptions in the code base that the log bit is related to generational GC. We should fix that first.
Yes, this needs_satb is a quick and dirty hack.
src/global_state.rs (outdated)
```diff
@@ -49,6 +49,8 @@ pub struct GlobalState {
     pub(crate) malloc_bytes: AtomicUsize,
     /// This stores the live bytes and the used bytes (by pages) for each space in last GC. This counter is only updated in the GC release phase.
     pub(crate) live_bytes_in_last_gc: AtomicRefCell<HashMap<&'static str, LiveBytesStats>>,
+    pub(crate) concurrent_marking_active: AtomicBool,
```
These two fields are maintained here so all policies can access them.

concurrent_marking_active basically means concurrent marking is going on and policies should allocate objects as if they were marked. I think we should add a flag such as allocate_as_live per space, and set the flag for the spaces (a sketch follows below). concurrent_marking_threshold is more like a global counter for allocated pages. We could maintain such a counter regardless of whether the plan is concurrent or not.
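A minimal sketch of the suggested per-space flag (the name allocate_as_live is taken from the comment above; everything else is hypothetical):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical per-space flag: while it is set, allocators in this space
/// treat freshly allocated objects as already marked/live.
pub struct AllocateAsLive {
    active: AtomicBool,
}

impl AllocateAsLive {
    pub const fn new() -> Self {
        Self {
            active: AtomicBool::new(false),
        }
    }

    /// The plan sets this when an InitialMark pause finishes and clears it
    /// when the concurrent cycle completes.
    pub fn set(&self, active: bool) {
        self.active.store(active, Ordering::SeqCst);
    }

    /// Checked in the allocation slow path before bulk-setting mark bits.
    pub fn is_active(&self) -> bool {
        self.active.load(Ordering::SeqCst)
    }
}
```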
Each policy already has an initialize_object_metadata function that takes a parameter alloc: bool. The parameter is currently unused. But this is fine-grained, and usually spaces would want to bulk-set this metadata instead of setting it per object.
I think alloc means something different. It was initially from here:
https://github.com/JikesRVM/JikesRVM/blob/5072f19761115d987b6ee162f49a03522d36c697/MMTk/src/org/mmtk/policy/ExplicitLargeObjectSpace.java#L108-L109

We also misused it here (lines 594 to 597 in d93262b):

```rust
self.get_plan()
    .base()
    .vm_space
    .initialize_object_metadata(object, false)
```
```rust
        return true;
    }
    let threshold = self.get_total_pages() >> 1;
    let concurrent_marking_threshold = self
```
This is comparing accumulated allocated pages with total pages. I guess that means we only want to do concurrent marking if we have allocated enough, otherwise we just do a normal full heap GC.
This is a quick and dirty hack. It is a naive heuristic, triggering concurrent marking when half of the heap has been allocated. I did not even differentiate between clean blocks and reusable blocks; the counter always increases by Block::PAGES regardless of the block type. A sketch of the heuristic is below.
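A self-contained sketch of that heuristic (constants and names are illustrative; Block::PAGES is stood in by a local constant):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Pages allocated since the last GC (clean and reusable blocks are counted
/// the same way, so this over-approximates fresh allocation).
static ALLOCATED_PAGES: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for `Block::PAGES`.
const BLOCK_PAGES: usize = 8;

/// Called from the block-acquisition slow path.
fn on_block_acquired() {
    ALLOCATED_PAGES.fetch_add(BLOCK_PAGES, Ordering::Relaxed);
}

/// Trigger concurrent marking once half of the heap has been allocated.
fn should_start_concurrent_marking(total_pages: usize) -> bool {
    ALLOCATED_PAGES.load(Ordering::Relaxed) >= total_pages >> 1
}
```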
```diff
@@ -45,6 +45,7 @@ pub struct PlanConstraints {
     /// `MutatorConfig::prepare_func`). Those plans can set this to `false` so that the
     /// `PrepareMutator` work packets will not be created at all.
     pub needs_prepare_mutator: bool,
+    pub needs_satb: bool,
```
Maybe we don't need this. We can always check barrier == BarrierSelector::SATB.
If we fix needs_log_bit, then we no longer need this.
```rust
impl<VM: VMBinding> SlotIterator<VM> {
    pub fn iterate(
        o: ObjectReference,
        // should_discover_references: bool,
```
I am not sure why we need these fields, and why we need an optional klass. Those are very OpenJDK-specific. Just for future reference, the original code is from:
- https://github.com/wenyuzhao/mmtk-core/blob/8355c0ad11ec0b86f8f93ddee0abd6597f638cc3/src/plan/tracing.rs#L210
- https://github.com/wenyuzhao/mmtk-openjdk/blob/46546ed41d28dd101eb57dca5422b73148f6ee06/mmtk/src/scanning.rs#L61
- https://github.com/wenyuzhao/mmtk-openjdk/blob/46546ed41d28dd101eb57dca5422b73148f6ee06/mmtk/src/object_scanning.rs#L270
Yes, I copied this part from lxr; it has something to do with OpenJDK's reference processing and class unloading. At this stage we do not need them, but lxr does.
```diff
@@ -265,6 +265,21 @@ impl<VM: VMBinding> ImmixAllocator<VM> {
         // Update the hole-searching cursor to None.
         Some(end_line)
     };
+    // mark objects if concurrent marking is active
```
This is the same as the code in LargeObjectSpace::initialize_object_metadata -- marking freshly allocated objects as marked/live. We may want to introduce a flag for policies, such as allocate_as_live.
Yes, we may introduce another alloc semantic. I did introduce alloc as public in Iso.
```diff
@@ -162,4 +162,7 @@ pub trait Collection<VM: VMBinding> {
     fn create_gc_trigger() -> Box<dyn GCTriggerPolicy<VM>> {
         unimplemented!()
     }
+
+    /// Inform the VM of concurrent marking status
+    fn set_concurrent_marking_state(_active: bool) {}
```
The binding uses this to know whether SATB is active and to decide whether to call the 'soft reference load barrier'. This flag could be stored as a global flag for SATB barriers, and the binding could access the barrier. In that case, we may not need this function.
The problem is that the barrier is an opaque pointer from the binding's perspective. If we want to store this flag in the barrier and let the binding access it, then we need to make sure all Rust barrier structs have the same size and expose that to the binding. I do not think it is worthwhile to do so.
Why is it an opaque pointer? It is a dyn Barrier fat pointer, and allows Downcast. We can always access the barrier from the binding-side Rust code and store an active flag in a static for the C++ code to use. E.g. at the end of a pause, check if SATB is active; if so, set the static variable and enable the load reference barrier.

But anyway, the proposal above is based on the assumption that we do not want the function fn set_concurrent_marking_state(_active: bool). If it is okay to keep the function, then we don't need this change.
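A minimal sketch of the static-flag alternative (names hypothetical): the binding-side Rust code flips the flag at the end of a pause, and the C++ fast path reads it to decide whether the load reference barrier is enabled.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Exposed with an unmangled symbol so the C++ barrier fast path can read it
/// directly (an `AtomicBool` has the same layout as a byte).
#[no_mangle]
pub static CONCURRENT_MARKING_ACTIVE: AtomicBool = AtomicBool::new(false);

/// Called from the binding-side Rust code at the end of an InitialMark or
/// FinalMark pause.
pub fn set_concurrent_marking_active(active: bool) {
    CONCURRENT_MARKING_ACTIVE.store(active, Ordering::SeqCst);
}
```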
> What will happen if a mutator wants to fork while concurrent marking is active?

We may temporarily tell the VM binding that MMTk doesn't currently support forking when using concurrent GC. We may fix it later. One simple solution is postponing the forking until the current GC finishes. The status quo is that only CRuby and Android need forking, but CRuby will not support concurrent GC in the short term.
You can't let this happen. Either the binding needs to ensure that the mutator waits while concurrent marking is active, or you don't let a concurrent GC happen before forking (ART's method of dealing with this).
Agreed. Fortunately, the relevant API doc already says:

```rust
/// This function sends an asynchronous message to GC threads and returns immediately, but it
/// is only safe for the VM to call `fork()` after the underlying **native threads** of the GC
/// threads have exited. After calling this function, the VM should wait for their underlying
/// native threads to exit in VM-specific manner before calling `fork()`.
```

So a well-behaving VM binding shall wait for all the GC worker threads (which are created by the binding) to exit before calling fork().
Mentioning it before I forget: we need to change the name for the …
src/scheduler/gc_work.rs (outdated)
```rust
if mmtk.is_user_triggered_collection() || is_emergency {
    // user triggered collection is always stop-the-world
    mmtk.get_plan().schedule_collection(worker.scheduler());
} else {
    // Let the plan to schedule collection work
    mmtk.get_plan()
        .schedule_concurrent_collection(worker.scheduler());
}
```
I think this decision should be made by the plan. We are hard-coding the policy that "if it is a user-triggered or emergency collection, then schedule a full-heap GC, otherwise schedule a concurrent GC". But this only makes sense for the ConcurrentImmix plan. For existing plans, there is no concurrent collection. And other concurrent plans to be implemented in the future may do concurrent collections in user-triggered and emergency collections, too.

I suggest we:
- Remove Plan::schedule_concurrent_collection
- Push this if-else down to ConcurrentImmix::schedule_collection. All the needed information (whether it is an emergency GC or a user-triggered GC) can be looked up from mmtk.state.

If Plan::schedule_collection needs a &mut GCWorker argument (which can reach &MMTK and then reach the state), we should add it. Most GC-related methods of trait Plan (including prepare, release, etc.) have either &mut GCWorker or VMWorkerThread as a parameter.
Yes, this part should be in the concurrent plan trait, similar to what we have for the generational plans.
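A self-contained sketch of pushing that decision into the plan, as suggested above. The closures stand in for the real scheduling methods, and whether the GC is user-triggered or an emergency is assumed to be readable from the global state:

```rust
/// Hypothetical shape of `ConcurrentImmix::schedule_collection`: the plan
/// itself decides between a stop-the-world full-heap GC and a concurrent GC.
fn schedule_collection_sketch(
    user_triggered: bool,
    emergency: bool,
    schedule_stw_full_heap: impl FnOnce(),
    schedule_concurrent: impl FnOnce(),
) {
    if user_triggered || emergency {
        // User-triggered and emergency collections stay stop-the-world.
        schedule_stw_full_heap();
    } else {
        schedule_concurrent();
    }
}
```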
```rust
#[repr(u8)]
#[derive(Debug, PartialEq, Eq, Copy, Clone, NoUninit)]
pub enum Pause {
    Full = 1,
```
Do we have a better name than Full? I think here it means doing a full GC in one STW (as opposed to doing part of a GC, such as initial mark, final mark, or an incremental part of a GC). But it may be confused with full-heap GC (as opposed to nursery GC).
Actually, I have the same concern, but I could not find a better term, so I simply left Full here as a placeholder.
```rust
fn load_reference(&mut self, _o: ObjectReference) {}

fn object_reference_clone_pre(&mut self, _obj: ObjectReference) {}
```
This method only has a blank implementation in BarrierSemantics::object_reference_clone_pre, and SATBBarrier::object_reference_clone_pre calls that. What is this method intended for?
```rust
                let worker = self.worker();
                let mmtk = self.mmtk();
                let w = ConcurrentTraceObjects::new(root_objects.clone(), mmtk);

                match pause {
                    Pause::InitialMark => worker.scheduler().postpone(w),
                    _ => unreachable!(),
                }

                root_objects.clear();
            }
        }
    }
    if !root_objects.is_empty() {
        let worker = self.worker();
        let w = ConcurrentTraceObjects::new(root_objects.clone(), self.mmtk());

        match pause {
            Pause::InitialMark => worker.scheduler().postpone(w),
            _ => unreachable!(),
        }
    }
```
We may extract a lambda closure to reduce code duplication.
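A self-contained sketch of that deduplication under simplified types: the "flush the accumulated roots as one packet" logic becomes a closure shared by the in-loop flush and the final flush. The `postpone` parameter stands in for creating a ConcurrentTraceObjects packet and calling `scheduler().postpone(w)`.

```rust
/// Generic sketch: drain `roots` into batches, flushing each full batch (and
/// the final partial batch) through the same closure.
fn flush_roots_in_batches<T>(
    roots: Vec<T>,
    batch_size: usize,
    mut postpone: impl FnMut(Vec<T>),
) {
    let mut buffer: Vec<T> = Vec::with_capacity(batch_size);
    // The shared flush logic, written once.
    let mut flush = |buffer: &mut Vec<T>| {
        if !buffer.is_empty() {
            postpone(std::mem::take(buffer));
        }
    };
    for root in roots {
        buffer.push(root);
        if buffer.len() >= batch_size {
            flush(&mut buffer);
        }
    }
    flush(&mut buffer);
}
```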
```rust
struct SlotIteratorImpl<VM: VMBinding, F: FnMut(VM::VMSlot)> {
    f: F,
    // should_discover_references: bool,
    // should_claim_clds: bool,
    // should_follow_clds: bool,
    _p: PhantomData<VM>,
}

impl<VM: VMBinding, F: FnMut(VM::VMSlot)> SlotVisitor<VM::VMSlot> for SlotIteratorImpl<VM, F> {
    fn visit_slot(&mut self, slot: VM::VMSlot) {
        (self.f)(slot);
    }
}

pub struct SlotIterator<VM: VMBinding> {
    _p: PhantomData<VM>,
}

impl<VM: VMBinding> SlotIterator<VM> {
    pub fn iterate(
        o: ObjectReference,
```
We may not really need SlotIterator or SlotIteratorImpl. Note that in mmtk-core/src/vm/scanning.rs we already have SlotVisitor and impl<SL: Slot, F: FnMut(SL)> SlotVisitor<SL> for F. That's enough for us to pass a lambda (&mut |slot| { ... }) to Scanning::scan_object and use it as the slot_visitor argument.

I think the lxr branch introduced this struct to do some kind of filtering, and multiplexed multiple methods of Scanning, including the newly introduced Scanning::scan_object_with_klass.

However, it is still nice to add a method ObjectReference::iterate_fields. It's convenient.
```rust
pub fn iterate_fields<VM: VMBinding, F: FnMut(VM::VMSlot)>(self, f: F) {
    SlotIterator::<VM>::iterate(self, f)
}
```
Since FnMut(VM::VMSlot) already implements trait SlotVisitor, we can directly pass it to scan_object. Suggested change:

```diff
-pub fn iterate_fields<VM: VMBinding, F: FnMut(VM::VMSlot)>(self, f: F) {
-    SlotIterator::<VM>::iterate(self, f)
-}
+pub fn iterate_fields<VM: VMBinding, F: FnMut(VM::VMSlot)>(self, mut f: F) {
+    <VM::VMScanning as Scanning<VM>>::scan_object(
+        VMWorkerThread(VMThread::UNINITIALIZED),
+        self,
+        &mut f,
+    )
+}
```
However, this is still imperfect.
- Not all VMs support scan_object. CRuby is one notable exception. If we only intend to visit the value of all (non-null) fields without updating (forwarding) any of them, I suggest we introduce another method in the Scanning trait, such as Scanning::enumerate_children. Otherwise, consider invoking Scanning::support_slot_enqueuing and scan_object_and_trace_edges, too.
- This method doesn't have a tls. For CRuby, I am implementing object scanning by setting a thread-local variable (which is accessed via the tls) as a call-back, calling a C function to visit fields, and the C function will call the call-back to let the Rust code visit the field value. But because this method is called from the barrier, it may not always provide a VMWorkerThread. I suggest the newly introduced method Scanning::enumerate_children take a VMThread instead of a VMWorkerThread as argument.

In summary, I am also suggesting we introduce a method:
```rust
pub trait Scanning<VM: VMBinding> {
    /// Visit the children of an object.
    ///
    /// This method may be called during mutator time, and is required by concurrent GC. Currently,
    /// we don't support concurrent copying GC, so this method can assume no objects are moved by GC
    /// while this method is running.
    fn enumerate_children(
        tls: VMThread,
        object: ObjectReference,
        child_visitor: &mut impl FnMut(ObjectReference),
    );
}
```
And if we make that change, then ObjectReference::iterate_fields will call Scanning::enumerate_children instead.
The current "WorkerGoal" mechanism should be able to handle the case where mutators trigger another GC between InitialMark and FinalMark. We can remove WorkPacketStage::Initial and WorkPacketStage::ConcurrentSentinel, and use GCWorkScheduler::on_last_parked to transition the GC state from InitialMark to concurrent marking to FinalMark, and finally finish the GC. See inline comments for more details.
src/scheduler/scheduler.rs (outdated)
```rust
    notify = true;
}
if notify {
    self.wakeup_all_concurrent_workers();
```
This wakes up other workers. However, the current GC worker considers itself the "last parked worker". So after the on_gc_finished function returns, it will clear the current "goal" (goals.on_current_goal_completed()) and try to handle another "goal" (i.e. a GC request) in self.respond_to_requests(worker, goals). Unless the mutator triggers another GC immediately, the current GC worker thread will find that there are no requests, and park itself (return LastParkedResult::ParkSelf;). If we set MMTK_THREADS=1, it will never wake up again.

So we need to make some changes to GCWorkScheduler::on_last_parked so that if this is the end of the InitialMark STW, the on_last_parked method returns LastParkedResult::WakeAll, so that all GC workers wake up again, including the current GC worker thread itself.

And we shouldn't clear the current "worker goal" if it is the end of InitialMark. The GC workers should still be working towards the "GC" goal. But it needs to be a bit stateful, knowing whether it is (1) during a "full GC", (2) during initial mark, (3) during final mark, or (4) between initial mark and final mark (we may just call it "concurrent tracing"). As long as the GC workers are still working on a goal (regardless of the concrete state), they will not accept other GC requests or requests for forking, which automatically solves the forking problem.

And if concurrent tracing finishes (all the postponed work packets and their children are drained), the last parked worker will reach on_last_parked again. This time, it can immediately start scheduling the FinalMark stage (or should we call it a "super stage" to distinguish it from a work bucket stage, or just call it an STW). When FinalMark finishes, the GC really finishes. GC workers can now accept requests for another GC or requests for forking.
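A self-contained sketch of the state transitions described above. The types are simplified stand-ins for the scheduler's goal/state machinery; ParkSelf and WakeAll mirror the existing LastParkedResult variants.

```rust
/// Where the in-progress GC currently is (simplified stand-in for the
/// scheduler's goal/state machinery).
enum GcPhase {
    InitialMarkStw,
    ConcurrentTracing,
    FinalMarkStw,
}

/// Simplified stand-in for `LastParkedResult`.
enum LastParkedAction {
    ParkSelf,
    WakeAll,
}

/// Decision taken by the last parked worker while the "GC" goal stays open.
fn on_last_parked_sketch(phase: GcPhase, tracing_drained: bool) -> (GcPhase, LastParkedAction) {
    match phase {
        // End of the InitialMark pause: keep the goal, release the postponed
        // packets, and wake every worker (including this one) so that
        // MMTK_THREADS=1 still makes progress.
        GcPhase::InitialMarkStw => (GcPhase::ConcurrentTracing, LastParkedAction::WakeAll),
        // All postponed packets and their children are drained: schedule the
        // FinalMark pause.
        GcPhase::ConcurrentTracing if tracing_drained => {
            (GcPhase::FinalMarkStw, LastParkedAction::WakeAll)
        }
        // Concurrent tracing is still in progress: park and wait for packets.
        GcPhase::ConcurrentTracing => (GcPhase::ConcurrentTracing, LastParkedAction::ParkSelf),
        // FinalMark finished: the GC is done; the goal can be cleared and new
        // GC or fork requests can be accepted.
        GcPhase::FinalMarkStw => (GcPhase::FinalMarkStw, LastParkedAction::ParkSelf),
    }
}
```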
```diff
@@ -250,3 +260,82 @@ impl<S: BarrierSemantics> Barrier<S::VM> for ObjectBarrier<S> {
         }
     }
 }
+
+pub struct SATBBarrier<S: BarrierSemantics> {
```
This seems to be a pre-write ObjectBarrier. The current ObjectBarrier only implements the post-write functions, and this implementation implements the pre-write functions.
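A minimal, self-contained sketch of what a pre-write (SATB/deletion) barrier records, in contrast to the post-write ObjectBarrier. Types are simplified; in the real code the old value would be an ObjectReference read from the slot being overwritten.

```rust
/// Deletion-barrier sketch: before a reference field is overwritten, the old
/// target is recorded so concurrent marking still traces the snapshot.
fn object_reference_write_pre<T>(
    old_target: Option<T>,
    marking_active: bool,
    satb_buffer: &mut Vec<T>,
) {
    if !marking_active {
        return;
    }
    if let Some(old) = old_target {
        // The object about to lose this incoming edge stays reachable from
        // the snapshot-at-the-beginning's point of view.
        satb_buffer.push(old);
    }
}
```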
```diff
@@ -29,6 +31,12 @@ pub struct GCWorkScheduler<VM: VMBinding> {
     pub(crate) worker_monitor: Arc<WorkerMonitor>,
     /// How to assign the affinity of each GC thread. Specified by the user.
     affinity: AffinityKind,
+
+    pub(super) postponed_concurrent_work:
```
The work that gets 'postponed' is actual concurrent work packets. The implementation uses these queues to temporarily store the packets, to avoid directly pushing them to the Unconstrained bucket and getting that work done in the current pause. Only after the pause is done does the scheduler transfer the work from the queue back to Unconstrained.

Why not use a separate bucket for concurrent work? The bucket would not be opened during a pause, and would be opened when the pause is done.
```diff
@@ -59,6 +60,8 @@ pub struct WorkBucket<VM: VMBinding> {
     /// recursively, such as ephemerons and Java-style SoftReference and finalizers. Sentinels
     /// can be used repeatedly to discover and process more such objects.
     sentinel: Mutex<Option<Box<dyn GCWork<VM>>>>,
+    in_concurrent: AtomicBool,
```
This field seems never used.
It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects. To properly support VO bits, we need to do two things: […]

There are multiple ways to know if a line only contains new objects or old objects. […]
"It looks like we keep newly allocated objects alive by marking the lines, but not marking those objects." This is not true. Both lines and every word within that line are marked. So if one blindly copy mark bits to vo bits, then vo bit will be 1 even if that address is not an object |
I think we can just mark lines, and also mark each individual object. The bulk set only works for side mark bits anyway. Let's not get things entangled and complicated.
The problem is that we do not want to do the check in the fast path. If we want to mark each individual object, then in the allocation fast path we need to check whether concurrent marking is active and then set the mark bit, whereas the current bulk-setting approach only sets the mark bits in the slow path.
I forgot the SATB barrier. It will remember B when we remove the edge.
In your case, B will be captured by the SATB barrier, so it will not be considered dead. There is no need to scan those newly allocated objects, because any children of those newly allocated objects must have been alive in the snapshot and thus are guaranteed to be traced.
The mark bit could be in the header, and we have to set it per object if it is in the header. We can differentiate the header mark bit and the side mark bit, and deal with each differently. But bulk-setting mark bits is still a bit hacky -- and this is why we would have issues with VO bits. VO bits are copied from mark bits, assuming mark bits are only set for individual objects.
The solution for the VO-bit is to not use the …
The JIT-generated fastpath mimics what mmtk-core does. If a space has this check in initialize_object_metadata, which is called from post_alloc, the JIT should do what mmtk-core does. If a space doesn't have anything in initialize_object_metadata, then the JIT doesn't need to do anything.
Why is it complex? It is trivial: […]
Because the check itself can be complicated. In concurrent-marking Immix, if the object is in the Immix space, lines also need to be marked. Writing all of that in the fast path is non-trivial and error-prone.
If your suggestion is that we should do a function call in the fast path, then that's not a fast path anymore. If your suggestion is to write JIT-compiled instructions in the fast path that set metadata like the line mark state, the mark bit, etc., then this is entirely policy-dependent. For example, you would need separate JIT-generated instructions for an ImmixSpace object and a LOS object, not to mention that the metadata address for local metadata is hard to calculate. The whole idea of the fast path is that we don't have to do extra per-object checks at allocation, and the object already has the correct state set when allocated.
I already have a solution for the …
I think what Tianle is currently doing, i.e. setting both side (per-object) mark bits and line mark bits to 1 in bulk when finding the next contiguous lines, is good enough. Currently ImmixSpace does not support in-header mark bits. If we let ImmixSpace support header mark bits, we can eagerly set the line mark bits, but set per-object mark bits in …
I meant a simple check like what you do for the SATB barrier for reference loading: https://github.com/tianleq/mmtk-openjdk/blob/5aa00cd32344bda65d35bba42f9df6b2757991ae/openjdk/barriers/mmtkSATBBarrier.cpp#L57

Marking lines is not done in post_alloc, and is not implemented as a fastpath by the binding. What made you think I implied implementing line marking in the fast path?
I didn't suggest that. The pseudo code clearly is the compiler's code, not the generated code. At JIT time, the JIT checks whether a plan may allocate objects as marked; if so, the JIT needs to emit code that checks whether we should currently allocate objects as live or not.

First, I didn't suggest that.

Yes. It is not something new.

No. If a binding uses header mark bits, and needs to set the mark bits when SATB is active and not set them when SATB is not active, there is no other option. It is intrinsic to the algorithm.
I'm saying that this is too complicated to implement in the JIT. The control flow is non-trivial.

But lines need to be marked, as no one will mark the line otherwise, i.e. we would have to change the …
Why? We mark lines in the slow path; we don't need to do that in the fast path.
We avoid using the term "reference" to refer to weak reference objects as in JVM. We explicitly mention "weak reference" instead. Also added comments to clarify the purpose of that barrier function.
Sure. But then why not do this for the mark bit as well? It avoids the overhead of figuring out the metadata address, etc., and amortizes the cost of setting the metadata. Everything we have in OpenJDK so far is for global metadata (unlog bit, alloc bit, etc.). No …
```rust
// If we disable moving in Immix, this is a non-moving plan.
moves_objects: false,
```
Since this plan falls back to STW GC, there is in principle nothing preventing us from doing defragmentation during the STW GC. This means this plan can still move objects.

If it is easy to enable defragmentation, we should do it so that we won't blindly need 4x-7x the min heap size. But if it is difficult or buggy, we can leave this version as fully non-moving, and enable moving in the future.
It might be relatively easy, as we just need to make sure a concurrent GC does not do defrag. But I have never tried this, so it may have bugs.
I tried locally and set PlanConstraints::moves_objects to true, and it works. But when I run a benchmark like lusearch, it only triggers one single STW GC at the beginning of the execution, and it is a defrag GC. After that it only triggers concurrent GCs and never executes a STW GC again until the benchmark finishes. If I reduce the heap size, it will thrash, triggering concurrent GC again and again, and the program never makes progress. I guess there is something wrong with the GC trigger that makes the GC thrash.

I am not reducing it too much. For example, I ran lusearch at 80MB, and ConcurrentImmix thrashes when the progress is about 60%. In theory, it should be able to finish the benchmark by running only STW GCs using plain Immix, in which case 80MB is too generous.
lusearch is a bad example, since it's highly generational, so it's quite possible that entire blocks get freed up even in a concurrent non-moving GC. Maybe try some other benchmark?
The concurrent GC trigger is naive: it blindly counts how many pages have been allocated, and once 50% of the pages have been allocated it starts a concurrent GC. Reusable blocks are treated the same as clean blocks, so the counter is imprecise.
See #1355 (comment) and the few replies after it.
Clippy on the stable version of Rust seems to erroneously consider `std::mem::swap(&mut a, &mut lg)` as swapping with a temporary, where `lg` is a lock guard. We may fix the warning by writing `&mut *lg`, but the current use cases can all be expressed more concisely using `std::mem::replace` and `std::mem::take`. See: https://rust-lang.github.io/rust-clippy/master/index.html#swap_with_temporary
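A small sketch of the `std::mem::take` form (the queue type is illustrative):

```rust
use std::sync::Mutex;

/// Drain a queue guarded by a mutex without `std::mem::swap(&mut local, &mut guard)`,
/// which Clippy's `swap_with_temporary` lint flags when one operand is a lock guard.
fn drain_postponed<T>(queue: &Mutex<Vec<T>>) -> Vec<T> {
    let mut guard = queue.lock().unwrap();
    // Moves the contents out and leaves `Vec::default()` (an empty vector).
    std::mem::take(&mut *guard)
}
```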
I think the RwLock on BucketQueue::queue is a hack to allow replacing the queue. But adding a lock would hurt the performance of other plans. Since we are sure the Unconstrained bucket must be empty when we schedule concurrent work, we can simply add the work packets in a loop to achieve that, and that only requires a shared reference (&queue).

I have made a revision for that: https://github.com/wks/mmtk-core/commit/9ac0d8234905deea5fd22d731f49ba37f7bded48

But I didn't push it into this PR. There is an alternative solution which @qinsoon mentioned, i.e. making a dedicated bucket for concurrent work packets. I'll leave it to @qinsoon to decide whether to cherry-pick my change or add a new dedicated bucket.
```diff
@@ -7,34 +7,35 @@ use std::sync::atomic::{AtomicBool, Ordering};
 use std::sync::{Arc, Mutex};

 struct BucketQueue<VM: VMBinding> {
-    queue: Injector<Box<dyn GCWork<VM>>>,
+    // queue: Injector<Box<dyn GCWork<VM>>>,
+    queue: std::sync::RwLock<Injector<Box<dyn GCWork<VM>>>>,
```
This RwLock may affect the performance of other plans.

I think the sole purpose of adding the RwLock is to make it possible to swap the queue of the Unconstrained bucket with another Injector instance (the postponed queues). If we are sure the Unconstrained bucket must be empty, we can simply call WorkBucket::bulk_add but without notifying GC workers.
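A small sketch of that alternative: `crossbeam::deque::Injector::push` only needs a shared reference, so the postponed packets can be transferred into the (known-empty) bucket queue without any lock or queue swapping, with worker notification handled separately.

```rust
use crossbeam::deque::Injector;

/// Move postponed packets into the bucket's queue through `&Injector`,
/// without swapping queues behind a `RwLock`.
fn transfer_postponed<T>(postponed: &mut Vec<T>, bucket_queue: &Injector<T>) {
    for packet in postponed.drain(..) {
        bucket_queue.push(packet);
    }
}
```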
```diff
@@ -29,6 +31,12 @@ pub struct GCWorkScheduler<VM: VMBinding> {
     pub(crate) worker_monitor: Arc<WorkerMonitor>,
     /// How to assign the affinity of each GC thread. Specified by the user.
     affinity: AffinityKind,
+
+    pub(super) postponed_concurrent_work:
+        spin::RwLock<crossbeam::deque::Injector<Box<dyn GCWork<VM>>>>,
```
This is using spin::RwLock, but I don't see a particular reason. We may use plain std::sync::RwLock instead.
spin::RwLock might have better performance, especially on non-x86 architectures. It's best if we could benchmark it before we make a decision.
Like we discussed in Monday's meeting, the VO-bit strategy actually depends on the plan as well, not just the VM. I think this should not be a … (see mmtk-core/src/util/metadata/vo_bit/helper.rs, lines 81 to 95 in 7d798ad).
```diff
@@ -59,6 +60,8 @@ pub struct WorkBucket<VM: VMBinding> {
     /// recursively, such as ephemerons and Java-style SoftReference and finalizers. Sentinels
     /// can be used repeatedly to discover and process more such objects.
     sentinel: Mutex<Option<Box<dyn GCWork<VM>>>>,
     in_concurrent: AtomicBool,
+    disable: AtomicBool,
```
This is also never used.
Draft PR for concurrent non-moving immix