GSoC'25: Mesa-Frames: Stats & Event Driven Data Collection with Streamed Storage #151
-
Hey @Ben-geo, first of all, congratulations again on your acceptance to GSoC, and really great work on the proposal! Below is a comprehensive roadmap to help structure collaboration on the Mesa-Frames DataCollector. We can approach the development in well-defined phases: starting with a high-level architecture (which your diagrams already capture very well), then an abstract API interface that mirrors Mesa's conventions, and finally diving into the implementation for the MVP and examples.

## 1 Architecture

Your proposed architecture already lays down a solid and intuitive structure: Model → DataCollector → Storage backend. One important shift for Mesa-Frames is to use Polars LazyFrames as the main internal representation, rather than standard DataFrames. This enables deferred computation, better performance with large datasets, and compatibility with Polars-native operations. Here are some guiding principles and architectural decisions to keep in mind:
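To make the Model → DataCollector → Storage flow concrete, here is a minimal pure-Python sketch of the deferred-collection idea. Every class and method name below is an illustrative stand-in (the real implementation would hold Polars LazyFrames, not lists of dicts):

```python
class ToyModel:
    """Stand-in model: agents are plain dicts with a 'wealth' field."""
    def __init__(self):
        self.time = 0
        self.agents = [{"wealth": w} for w in (1, 2, 3)]


class ToyDataCollector:
    """Deferred collection: cheap snapshots now, aggregation only on demand."""
    def __init__(self, model_reporters):
        self.model_reporters = model_reporters
        self._frames = []  # per-step snapshots, reduced lazily

    def collect(self, model):
        # store a lightweight snapshot; no reporter runs yet
        self._frames.append((model.time, [dict(a) for a in model.agents]))

    def get_model_vars(self):
        # aggregation is deferred until the user asks for results
        return [
            {"step": t, **{name: fn(agents) for name, fn in self.model_reporters.items()}}
            for t, agents in self._frames
        ]


model = ToyModel()
dc = ToyDataCollector({"total_wealth": lambda agents: sum(a["wealth"] for a in agents)})
for _ in range(3):
    dc.collect(model)
    model.time += 1
    for a in model.agents:
        a["wealth"] += 1

print(dc.get_model_vars()[0])  # {'step': 0, 'total_wealth': 6}
```

The point mirrored here is the deferral: `collect` only records a snapshot, while all reporter computation happens once at the end, which is what LazyFrames give for free at much larger scale.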
## 2 API

```python
from mesa_frames import DataCollector, every_n_steps

# minimal example
dc = DataCollector(
    model_reporters={
        "total_wealth": lambda m: m.agents["wealth"].sum()
    },
    agent_reporters={
        "wealth": "wealth"
    },
    # Frames-specific knobs(?)
    stats={  # optional, see §3
        "wealth": ["mean", "max", "count:nz"],
        "gini": lambda lf: gini_expr(lf["wealth"])
    },
    trigger=every_n_steps(10),
    storage="parquet:./runs/exp42/*.parquet",
)
```

### 2.2 Trigger Helpers

We provide two primary ways to trigger data collection:
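For example, the two styles could be thin callables over the model, something like this sketch (`FakeModel` and the helper bodies are assumptions for illustration, not the actual Mesa-Frames API):

```python
def every_n_steps(n):
    # hypothetical time-based helper, mirroring trigger=every_n_steps(10)
    return lambda model: model.time % n == 0


def low_sheep(model):
    # hypothetical predicate trigger: fire only when a condition holds
    return model.sheep_count < 10


class FakeModel:
    def __init__(self, time, sheep_count):
        self.time = time
        self.sheep_count = sheep_count


assert every_n_steps(10)(FakeModel(time=20, sheep_count=50)) is True
assert every_n_steps(10)(FakeModel(time=7, sheep_count=50)) is False
assert low_sheep(FakeModel(time=0, sheep_count=3)) is True
```

Keeping both styles as plain callables means the DataCollector only ever needs to ask `trigger(model)` each step, regardless of which helper produced the callable.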
However, we are deferring the implementation of more complex event-driven collection for now. Mesa-Frames currently doesn't support events natively, and Mesa-core's event hook design is still evolving. We can revisit this in a later phase if there's time.

### 2.3 Stats Configuration

Users can specify computed statistics either via named presets or custom callables, working with Polars expressions (`pl.Expr`).
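A tiny sketch of how the two spec styles could be resolved internally; the preset table and the `resolve_stat` name are hypothetical, and the real version would map names to Polars expressions rather than Python reducers:

```python
# hypothetical preset table: stat name -> reducer over a column's values
PRESETS = {
    "mean": lambda xs: sum(xs) / len(xs),
    "max": max,
    "count:nz": lambda xs: sum(1 for x in xs if x != 0),
}


def resolve_stat(spec):
    # accept either a preset name or a user-supplied callable
    return PRESETS[spec] if isinstance(spec, str) else spec


assert resolve_stat("count:nz")([0, 2, 0, 5]) == 2
assert resolve_stat(max)([1, 9]) == 9
```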
This approach ensures that computations are composable, lazy, and expressive. It also keeps the API clean while letting advanced users write powerful metrics.

### 2.4 Storage URI Schema

We support a unified URI scheme to specify where data should go:
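Concretely, a URI like `parquet:./runs/exp42/*.parquet` (from the example above) would split into a backend name and a backend-specific path; this parser and the `memory:` backend are assumptions about how the scheme could work:

```python
def parse_storage_uri(uri):
    # "backend:path" -> (backend, path); an in-memory backend needs no path
    backend, _, path = uri.partition(":")
    return backend, path


assert parse_storage_uri("parquet:./runs/exp42/*.parquet") == ("parquet", "./runs/exp42/*.parquet")
assert parse_storage_uri("memory:") == ("memory", "")
```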
## 3 Stats Layer (Built on Polars Expr)

We define a small catalog of commonly used statistics, such as `mean`, `max`, `min`, and `count` variants.
We also want to allow user-extensible statistics. Developers can register their own custom functions that return a valid `pl.Expr`:

```python
def gini_expr(s):
    # Gini via the rank formulation: G = 2*sum(rank_i * x_i) / (n * sum(x)) - (n + 1)/n.
    # Uses expression-level count() instead of len(), and weights each value
    # by its rank (summing ranks alone is the constant n(n+1)/2, so the
    # result would not depend on the distribution).
    n = s.count()
    return (2 * (s.rank() * s).sum() / (n * s.sum())) - (n + 1) / n


dc.register_stat("gini", gini_expr)
```

## 4 Implementation Guidelines

### 4.1 Collect-all algorithm (single scan)
```python
lf = (
    model.agents.lazy()  # this will be lazy by default in the future
    .select([*required_cols, *exprs])
    .with_columns(step=pl.lit(model.time))
)
dc._frames.append(lf)
```

The DataCollector could have a background loop (thread or asyncio) that batches N steps or T seconds.

## 5 Event-Driven Collection (future work)

Mesa core currently does not expose explicit event hooks, and Mesa-Frames lacks an internal event mechanism or lifecycle-based pub/sub system. Introducing full event-driven data collection would require infrastructure that does not yet exist in either project. To keep Phase 1 focused and feasible, we will postpone event-based collection mechanisms, such as tracking custom agent-level transitions or listening to internal simulation milestones, until a later stage. For now, we'll rely on periodic or conditional predicate-based collection (e.g., `every_n_steps`).

## 6 Deliverables
Stretch goals may include event-driven collection or integration with dashboards for real-time insight. These can be explored if time allows, but are not required for successful completion of the core project.
-
## Overview
This proposal outlines my plan to significantly enhance Mesa-Frames’ data collection capabilities during Google Summer of Code 2025. The focus is on developing a flexible, efficient, and scalable framework tailored for advanced researchers working with large-scale agent-based simulations.
The core enhancements include:
- Users can choose exactly which statistics (`mean`, `max`, `min`, `count`, etc.) they want to collect, reducing memory usage and computational overhead.
- Predicate-based triggers (`lambda model: model.sheep_count < 10`) or time-based triggers (`every_n_steps=50`) ensure researchers capture only important insights while avoiding unnecessary storage.

Each of these ideas, along with its motivation and potential benefits, is explored in greater depth in my proposal. I encourage you to take a look: proposal
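A toy, pure-Python illustration of both enhancements working together; every name and number here is illustrative (the real API is the Mesa-Frames DataCollector discussed in this thread):

```python
def every_n_steps(n):
    # time-based trigger, mirroring every_n_steps=50
    return lambda step: step % n == 0


requested = ["mean", "max"]  # only the stats the researcher asked for
reducers = {"mean": lambda xs: sum(xs) / len(xs), "max": max, "min": min}

trigger = every_n_steps(50)
collected = []
for step in range(101):
    wealth = [step, step + 2, step + 4]  # fake per-agent column
    if trigger(step):
        # reduce immediately: two numbers stored per sampled step,
        # instead of a full row per agent per step
        collected.append({name: reducers[name](wealth) for name in requested})

print(collected[0])  # {'mean': 2.0, 'max': 4}
```

Only three reduced rows survive a 101-step run here, which is the memory/storage saving the proposal is after.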
Code draft from proposal: