GSoC'25: Mesa-Frames: Stats & Event Driven Data Collection with Streamed Storage #151
-
Hey @Ben-geo, first of all, congratulations again on your acceptance to GSoC, and really great work on the proposal! Below is a comprehensive roadmap to help structure collaboration on the Mesa-Frames DataCollector. We can approach the development in well-defined phases: starting with a high-level architecture (which your diagrams already capture very well), then an abstract API interface that mirrors Mesa's conventions, and finally diving into the implementation for the MVP and examples.

## 1 Architecture

Your proposed architecture already lays down a solid and intuitive structure: Model → DataCollector → Storage backend. One important shift for Mesa-Frames is to use Polars LazyFrames as the main internal representation, rather than standard DataFrames. This enables deferred computation, better performance with large datasets, and compatibility with Polars-native operations. Here are some guiding principles and architectural decisions to keep in mind:
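To make the Model → DataCollector → Storage flow concrete, here is a minimal pure-Python sketch of the deferred-collection idea. Every class and method name below is an illustrative stand-in (the real implementation would hold Polars LazyFrames, not lists of dicts):

```python
class ToyModel:
    """Stand-in model: agents are plain dicts with a 'wealth' field."""
    def __init__(self):
        self.time = 0
        self.agents = [{"wealth": w} for w in (1, 2, 3)]


class ToyDataCollector:
    """Deferred collection: cheap snapshots now, aggregation only on demand."""
    def __init__(self, model_reporters):
        self.model_reporters = model_reporters
        self._frames = []  # per-step snapshots, reduced lazily

    def collect(self, model):
        # store a lightweight snapshot; no reporter runs yet
        self._frames.append((model.time, [dict(a) for a in model.agents]))

    def get_model_vars(self):
        # aggregation is deferred until the user asks for results
        return [
            {"step": t, **{name: fn(agents) for name, fn in self.model_reporters.items()}}
            for t, agents in self._frames
        ]


model = ToyModel()
dc = ToyDataCollector({"total_wealth": lambda agents: sum(a["wealth"] for a in agents)})
for _ in range(3):
    dc.collect(model)
    model.time += 1
    for a in model.agents:
        a["wealth"] += 1

print(dc.get_model_vars()[0])  # {'step': 0, 'total_wealth': 6}
```

The point mirrored here is the deferral: `collect` only records a snapshot, while all reporter computation happens once at the end, which is what LazyFrames give for free at much larger scale.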
## 2 API

```python
from mesa_frames import DataCollector, every_n_steps

# minimal example
dc = DataCollector(
    model_reporters={
        "total_wealth": lambda m: m.agents["wealth"].sum()
    },
    agent_reporters={
        "wealth": "wealth"
    },
    # Frames-specific knobs(?)
    stats={  # optional, see §3
        "wealth": ["mean", "max", "count:nz"],
        "gini": lambda lf: gini_expr(lf["wealth"])
    },
    trigger=every_n_steps(10),
    storage="parquet:./runs/exp42/*.parquet",
)
```

### 2.2 Trigger Helpers

We provide two primary ways to trigger data collection:
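For example, the two styles could be thin callables over the model, something like this sketch (`FakeModel` and the helper bodies are assumptions for illustration, not the actual Mesa-Frames API):

```python
def every_n_steps(n):
    # hypothetical time-based helper, mirroring trigger=every_n_steps(10)
    return lambda model: model.time % n == 0


def low_sheep(model):
    # hypothetical predicate trigger: fire only when a condition holds
    return model.sheep_count < 10


class FakeModel:
    def __init__(self, time, sheep_count):
        self.time = time
        self.sheep_count = sheep_count


assert every_n_steps(10)(FakeModel(time=20, sheep_count=50)) is True
assert every_n_steps(10)(FakeModel(time=7, sheep_count=50)) is False
assert low_sheep(FakeModel(time=0, sheep_count=3)) is True
```

Keeping both styles as plain callables means the DataCollector only ever needs to ask `trigger(model)` each step, regardless of which helper produced the callable.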
However, we are deferring the implementation of more complex event-driven collection for now. Mesa-Frames currently doesn't support events natively, and Mesa-core's event hook design is still evolving. We can revisit this in a later phase if there's time.

### 2.3 Stats Configuration

Users can specify computed statistics either via named presets or custom callables, working with Polars expressions (`pl.Expr`).
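A tiny sketch of how the two spec styles could be resolved internally; the preset table and the `resolve_stat` name are hypothetical, and the real version would map names to Polars expressions rather than Python reducers:

```python
# hypothetical preset table: stat name -> reducer over a column's values
PRESETS = {
    "mean": lambda xs: sum(xs) / len(xs),
    "max": max,
    "count:nz": lambda xs: sum(1 for x in xs if x != 0),
}


def resolve_stat(spec):
    # accept either a preset name or a user-supplied callable
    return PRESETS[spec] if isinstance(spec, str) else spec


assert resolve_stat("count:nz")([0, 2, 0, 5]) == 2
assert resolve_stat(max)([1, 9]) == 9
```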
This approach ensures that computations are composable, lazy, and expressive. It also keeps the API clean while letting advanced users write powerful metrics.

### 2.4 Storage URI Schema

We support a unified URI scheme to specify where data should go:
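Concretely, a URI like `parquet:./runs/exp42/*.parquet` (from the example above) would split into a backend name and a backend-specific path; this parser and the `memory:` backend are assumptions about how the scheme could work:

```python
def parse_storage_uri(uri):
    # "backend:path" -> (backend, path); an in-memory backend needs no path
    backend, _, path = uri.partition(":")
    return backend, path


assert parse_storage_uri("parquet:./runs/exp42/*.parquet") == ("parquet", "./runs/exp42/*.parquet")
assert parse_storage_uri("memory:") == ("memory", "")
```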
## 3 Stats Layer (Built on Polars Expr)

We define a small catalog of commonly used statistics, such as `mean`, `max`, `min`, and `count` variants.
We also want to allow user-extensible statistics. Developers can register their own custom functions that return a valid `pl.Expr`:

```python
def gini_expr(s):
    # Gini via the rank formulation: G = 2*sum(rank_i * x_i) / (n * sum(x)) - (n + 1)/n.
    # Uses expression-level count() instead of len(), and weights each value
    # by its rank (summing ranks alone is the constant n(n+1)/2, so the
    # result would not depend on the distribution).
    n = s.count()
    return (2 * (s.rank() * s).sum() / (n * s.sum())) - (n + 1) / n


dc.register_stat("gini", gini_expr)
```

## 4 Implementation Guidelines

### 4.1 Collect-all algorithm (single scan)
```python
lf = (
    model.agents.lazy()  # this will be lazy by default in the future
    .select([*required_cols, *exprs])
    .with_columns(step=pl.lit(model.time))
)
dc._frames.append(lf)
```

The DataCollector could have a background loop (thread or asyncio) that batches N steps or T seconds.

## 5 Event-Driven Collection (future work)

Mesa core currently does not expose explicit event hooks, and Mesa-Frames lacks an internal event mechanism or lifecycle-based pub/sub system. Introducing full event-driven data collection would require infrastructure that does not yet exist in either project. To keep Phase 1 focused and feasible, we will postpone event-based collection mechanisms, such as tracking custom agent-level transitions or listening to internal simulation milestones, until a later stage. For now, we'll rely on periodic or conditional predicate-based collection (e.g., `every_n_steps`).

## 6 Deliverables
Stretch goals may include event-driven collection or integration with dashboards for real-time insight. These can be explored if time allows, but are not required for successful completion of the core project.
-
## Overview
This proposal outlines my plan to significantly enhance Mesa-Frames’ data collection capabilities during Google Summer of Code 2025. The focus is on developing a flexible, efficient, and scalable framework tailored for advanced researchers working with large-scale agent-based simulations.
The core enhancements include:
- Users can choose exactly which statistics (`mean`, `max`, `min`, `count`, etc.) they want to collect, reducing memory usage and computational overhead.
- Predicate-based triggers (`lambda model: model.sheep_count < 10`) or time-based triggers (`every_n_steps=50`) ensure researchers capture only important insights while avoiding unnecessary storage.

Each of these ideas, along with its motivation and potential benefits, is explored in greater depth in my proposal. I encourage you to take a look: proposal
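A toy, pure-Python illustration of both enhancements working together; every name and number here is illustrative (the real API is the Mesa-Frames DataCollector discussed in this thread):

```python
def every_n_steps(n):
    # time-based trigger, mirroring every_n_steps=50
    return lambda step: step % n == 0


requested = ["mean", "max"]  # only the stats the researcher asked for
reducers = {"mean": lambda xs: sum(xs) / len(xs), "max": max, "min": min}

trigger = every_n_steps(50)
collected = []
for step in range(101):
    wealth = [step, step + 2, step + 4]  # fake per-agent column
    if trigger(step):
        # reduce immediately: two numbers stored per sampled step,
        # instead of a full row per agent per step
        collected.append({name: reducers[name](wealth) for name in requested})

print(collected[0])  # {'mean': 2.0, 'max': 4}
```

Only three reduced rows survive a 101-step run here, which is the memory/storage saving the proposal is after.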
Code draft from proposal: