Batched inference CEBRA & padding at the `Solver` level #168

CeliaBenquet · 2024-08-23T12:01:47Z

fix https://github.com/AdaptiveMotorControlLab/CEBRA-dev/pull/746
fix #199

This PR adds the following features:

Inference (using CEBRA.transform() or solver.transform()) can be performed in batch, allowing inference on larger datasets or with larger models in a memory-efficient way (fix https://github.com/AdaptiveMotorControlLab/CEBRA-dev/issues/624).
All sklearn-level functionalities regarding input padding and transform() are now in the PyTorch API directly, making the PyTorch API more easily usable for developers (fix https://github.com/AdaptiveMotorControlLab/CEBRA-dev/issues/637, fix https://github.com/AdaptiveMotorControlLab/CEBRA-dev/pull/594).
Corresponding tests for batched transform with both APIs.
Some extra tests on the xCEBRA integration (using the transform method).

Example Usage of the new PyTorch API:

    import numpy as np
    import cebra.datasets
    import torch

    if torch.cuda.is_available():
        device = "cuda"
    else:
        device = "cpu"

    neural_data = cebra.load_data(file="neural_data.npz", key="neural")

    discrete_label = cebra.load_data(
        file="auxiliary_behavior_data.h5", key="auxiliary_variables", columns=["discrete"],
    )

    # 1. Define a CEBRA-ready dataset
    input_data = cebra.data.TensorDataset(
        torch.from_numpy(neural_data).type(torch.FloatTensor),
        discrete=torch.from_numpy(np.array(discrete_label[:, 0])).type(torch.LongTensor),
    ).to(device)

    # 2. Define a CEBRA model
    neural_model = cebra.models.init(
        name="offset10-model",
        num_neurons=input_data.input_dimension,
        num_units=32,
        num_output=2,
    ).to(device)

    input_data.configure_for(neural_model)

    # 3. Define the Loss Function Criterion and Optimizer
    crit = cebra.models.criterions.LearnableCosineInfoNCE(
        temperature=1,
    ).to(device)

    opt = torch.optim.Adam(
        list(neural_model.parameters()) + list(crit.parameters()),
        lr=0.001,
        weight_decay=0,
    )

    # 4. Initialize the CEBRA model
    solver = cebra.solver.init(
        name="single-session",
        model=neural_model,
        criterion=crit,
        optimizer=opt,
        tqdm_on=True,
    ).to(device)

    # 5. Define Data Loader
    loader = cebra.data.single_session.DiscreteDataLoader(
        dataset=input_data, num_steps=10, batch_size=200, prior="uniform"
    )

    # 6. Fit Model
    solver.fit(loader=loader)

    # 7. Transform Embedding
    x_train_emb = solver.transform(
        torch.from_numpy(neural_data).type(torch.FloatTensor).to(device),
        pad_before_transform=True,
        batch_size=512).to(device)

    # 8. Plot Embedding
    cebra.plot_embedding(
        x_train_emb.cpu(),
        discrete_label[:,0],
        markersize=10,
    )

all is similar to previous implementation but the inference part, which doesn't require to handle the padding of the input before passing it to the model.

…ional models in _transform

…est accordingly

MMathisLab · 2025-04-23T16:50:10Z

=========================== short test summary info ============================
ERROR tests/test_benchmark.py - NameError: name 'get_datapath' is not defined
ERROR tests/test_benchmark.py - NameError: name 'get_datapath' is not defined
ERROR tests/test_benchmark.py - NameError: name 'get_datapath' is not defined
ERROR tests/test_data_helper.py - NameError: name 'get_datapath' is not defined
ERROR tests/test_data_helper.py - NameError: name 'get_datapath' is not defined
ERROR tests/test_data_helper.py - NameError: name 'get_datapath' is not defined
ERROR tests/test_datasets.py - NameError: name 'get_datapath' is not defined
ERROR tests/test_datasets.py - NameError: name 'get_datapath' is not defined
ERROR tests/test_datasets.py - NameError: name 'get_datapath' is not defined
ERROR tests/test_distributions.py - NameError: name 'get_datapath' is not defined
ERROR tests/test_distributions.py - NameError: name 'get_datapath' is not defined
ERROR tests/test_distributions.py - NameError: name 'get_datapath' is not defined
ERROR tests/test_integration_train.py - NameError: name 'get_datapath' is not defined
ERROR tests/test_integration_train.py - NameError: name 'get_datapath' is not defined
ERROR tests/test_integration_train.py - NameError: name 'get_datapath' is not defined
ERROR tests/test_solver.py - NameError: name 'get_datapath' is not defined
ERROR tests/test_solver.py - NameError: name 'get_datapath' is not defined
ERROR tests/test_solver.py - NameError: name 'get_datapath' is not defined
ERROR cebra/config.py - NameError: name 'get_datapath' is not defined
ERROR cebra/config.py - NameError: name 'get_datapath' is not defined
!!!!!!!!!!!!!!!!!!! Interrupted: 20 errors during collection !!!!!!!!!!!!!!!!!!!

stes

some early comments; apologies if i have asked some of these before

stes · 2025-04-24T09:15:30Z

tests/test_solver.py

+@pytest.mark.parametrize(
+    "data_name, loader_initfunc, model_architecture, solver_initfunc",
+    multi_session_tests)
+def test_multi_session(data_name, loader_initfunc, model_architecture,
+                       solver_initfunc):
+    data = cebra.datasets.init(data_name)
+    loader = _get_loader(data, loader_initfunc)


why the changes here? i.e. did anything change that would cause the "old" multi session tests to break?

I restablished _get_loader as it was but added a return value as I need the dataset to configure it with the model.

Else,

I added the model_architecture as offset1-model is a special case for padding at transform.

I added the configure_for(model) as now this is handled in the solver.

I added some tests on the transform (was not done at all before), similar to the sklearn tests but at the pytorch level.

stes · 2025-04-24T09:17:50Z

tests/test_solver_batched.py

+single_session_tests_select_model = []
+single_session_hybrid_tests_select_model = []
+for model_name in ["offset1-model", "offset10-model"]:
+    for session_id in [None, 0, 5]:
+        for args in [
+            ("demo-discrete", model_name, session_id,
+             cebra.data.DiscreteDataLoader),
+            ("demo-continuous", model_name, session_id,
+             cebra.data.ContinuousDataLoader),
+            ("demo-mixed", model_name, session_id, cebra.data.MixedDataLoader),
+        ]:
+            single_session_tests_select_model.append(
+                (*args, cebra.solver.SingleSessionSolver))
+            single_session_hybrid_tests_select_model.append(
+                (*args, cebra.solver.SingleSessionHybridSolver))
+
+multi_session_tests_select_model = []
+for model_name in ["offset10-model"]:
+    for session_id in [None, 0, 1, 5, 2, 6, 4]:
+        for args in [("demo-continuous-multisession", model_name, session_id,
+                      cebra.data.ContinuousMultiSessionDataLoader)]:
+            multi_session_tests_select_model.append(
+                (*args, cebra.solver.MultiSessionSolver))


can you wrap the for loops here (quite complex) in functions, and only do the assingment on the global level?

I proposed something lmk if that's what you meant :)

tests/test_solver_batched.py

cebra/data/base.py

cebra/solver/single_session.py

cebra/solver/multiobjective.py

MMathisLab · 2025-04-24T09:54:25Z

doc error is: /home/runner/work/CEBRA/CEBRA/cebra/data/single_session.py:docstring of cebra.data.single_session.SingleSessionDataset.configure_for:3: WARNING: py:attr reference target not found: cebra_data.Dataset.offset /home/runner/work/CEBRA/CEBRA/cebra/data/multi_session.py:docstring of cebra.data.multi_session.MultiSessionDataset.configure_for:3: WARNING: py:attr reference target not found: cebra_data.Dataset.offset

MMathisLab · 2025-04-24T15:17:56Z

@CeliaBenquet not sure I see your edits post review; did you push them?

stes

Left some initial comments; broader discussion is a bit on the api design in the solver/base class --- lets discuss offline.

cebra/data/base.py

cebra/data/datasets.py

cebra/data/multi_session.py

cebra/integrations/sklearn/cebra.py

cebra/integrations/sklearn/utils.py

cebra/solver/base.py

cebra/solver/single_session.py

tests/test_solver.py

Co-authored-by: Steffen Schneider <steffen@bethgelab.org>

stes

Ok, review got a bit longer again; I realized I missed a few things on the last review. High level comments:

I made some comments in solver which could be fine; I think some arguments were moved from the sklearn class to the solver class, but the motivation for that is not entirely clear. Mostly needs one round of discussion so we can settle on a good API design for these. Specifically, what is the usecase for storing these variables now in the solver, where are they called?
the new transform function adds a lot of duplicated code that should be unified; again, could be first discussed

cebra/solver/base.py

stes · 2025-05-01T15:38:47Z

cebra/solver/base.py

+        if hasattr(self, "n_features"):
+            state_dict["n_features"] = self.n_features


Why is this an attribute of the solver, vs. being returned directly from the model? For sklearn it makes sense to fix this, but for the solver this could also simply be a property to be returned from the model? Where is this used?

E.g. what would happen for an xCEBRA solver, where you have not a single feature dim, but multiple

for the multisession case that's already the case and that's a list.

num_features cannot be a property I think, because that can be defined only based on the inputs provided to the fit(), and later if we adapt the solver, it needs to be reset. This is used to be saved with the solver as it's needed when reloading it + to be called in the sklearn + to see if the solver is fitted when calling transform().

for xcebra that's just similar to the original sklearn one but at a lower level, so yes we need to think about it but we would have had to in any case.

cebra/solver/base.py

cebra/solver/single_session.py

stes · 2025-05-01T16:03:17Z

cebra/solver/single_session.py

@@ -127,12 +317,27 @@ def _inference(self, batch):

 @register("single-session-hybrid")
 @dataclasses.dataclass
-class SingleSessionHybridSolver(abc_.MultiobjectiveSolver):
+class SingleSessionHybridSolver(abc_.MultiobjectiveSolver, SingleSessionSolver):


This does not work, I think. Both inherit from Solver base, this might have some weird effects; what was the motivation though?

it's the same transform() method as well as all the checks methods etc. so that's to avoid a lot of duplicate code.

I thought so as well but all tests pass, and they don't have redefined methods in common. Else happy to hear your suggestion to avoid duplication.

What are the methods that would be duplicated, could you list here, @CeliaBenquet ? If this is the issue, the proper way is that we write a Mixin functionality that puts the otherwise duplicated functions with same functionality in a new class.

duplicate (because single session mode)

parameters

_set_fitted_params

_check_is_inputs_valid

_check_is_session_id_valid

differences:

_get_model

_inference

note that this is similar for the multi session mode and multiobjective and auxiliary variable option in single session mode.

notes, but not final

class BaseSolverMixin(abc.ABC): # all abstract parameters _set_fitted_params _check_is_inputs_valid _check_is_session_id_valid class SingleSessionMixin(BaseSolverMixin) ... class MultiSessionMixin(BaseSolverMixin) ...

stes · 2025-05-01T16:04:34Z

tests/test_integration_xcebra.py

@@ -1,5 +1,7 @@
 import pickle

+import _utils_deprecated


multiobjective is tested, single objective is not

single objective solver transform was not tested before that PR + not used in the CEBRA() class, so we test the CEBRA transform (single objective) and the multiobjective one, as the solver transform was more similar to the new structure (padding, etc).

tests/test_solver.py

stes · 2025-05-08T10:26:19Z

cebra/solver/multiobjective.py

@@ -209,6 +210,13 @@ def __post_init__(self):
            renormalize=self.renormalize,
        )

+    def parameters(self, session_id: Optional[int] = None):
+        """Iterate over all parameters."""
+        super().parameters(session_id=session_id)


is this an error? this does not do anything besides checking, right? shouldnt the params be also returend?

gonlairo and others added 30 commits August 23, 2024 13:54

first proposal for batching in tranform method

283de06

first running version of padding with batched inference

202e379

start tests

1f1989d

add pad_before_transform to fit function and add support for convolut…

8665660

…ional models in _transform

remove print statements

8d5b114

first passing test

32c5ecd

add support for hybrid models

9928f63

rewrite transform in sklearn API

be5630a

baseline version of a torch.Datset

1300b20

move batching logic outside solver

bc6af24

move functionality to base file in solver and separate in functions

ec377b9

add test_select_model for single session

6f9ca98

add checks and test for _process_batch

fbe7eb4

add test_select_model for multisession

463b0f8

make self.num_sessions compatible with single session training

5219171

improve test_batched_transform_singlesession

f9bd1a6

make it work with small batches

e23a7ef

make test with multisession work

19c3f87

change to torch padding

87bebac

add argument to sklearn api

f0303e0

add torch padding to _transform

8c8be85

convert to torch if numpy array as inputs

59df402

add distinction between pad with data and pad with zeros and modify t…

1aadc8b

…est accordingly

differentiate between data padding and zero padding

bc8ee25

remove float16

5e7a14c

change argument position

928d882

clean test

07bac1c

clean test

0823b54

Fix warning

9fe3af3

Improve modularity remove duplicate code and todos

b417a23

MMathisLab and others added 7 commits April 23, 2025 19:59

Merge branch 'main' into batched-inference-and-padding

1453885

Fix gaussian mixture dataset import

acd2111

Fix all tests but xcebra tests

217a8a7

Fix pytorch API usage example

a1218aa

Make xCEBRA compatible with the batched inference & padding in solver

64d1db8

Add some tests on transform() with xCEBRA

9875a38

Add some docstrings and typings and clean unnecessary changes

65fc455

CeliaBenquet requested a review from MMathisLab April 24, 2025 09:01

stes reviewed Apr 24, 2025

View reviewed changes

MMathisLab closed this Apr 24, 2025

MMathisLab reopened this Apr 24, 2025

Implement review comments

1d0c498

CeliaBenquet requested a review from stes April 24, 2025 16:06

Fix sklearn test

4a25899

stes reviewed Apr 25, 2025

View reviewed changes

CeliaBenquet and others added 5 commits April 25, 2025 19:51

Add name in NOTE

0d56e44

Co-authored-by: Steffen Schneider <steffen@bethgelab.org>

Implement reviews on tests and typing

c5dc011

Fix import errors

c9fa5c8

Add select_model to aux solvers

4632c04

Fix docs error

22e3c47

MMathisLab approved these changes May 1, 2025

View reviewed changes

stes requested changes May 1, 2025

View reviewed changes

CeliaBenquet added 3 commits May 2, 2025 11:50

Add tests on the private functions in base solver

2fcfb7f

Update tests and duplicate code based on review

66fc6aa

Merge branch 'main' into batched-inference-and-padding

4d68110

stes reviewed May 8, 2025

View reviewed changes

MMathisLab requested a review from stes May 12, 2025 09:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Batched inference CEBRA & padding at the `Solver` level #168

Batched inference CEBRA & padding at the `Solver` level #168

CeliaBenquet commented Aug 23, 2024 •

edited

Loading

MMathisLab commented Apr 23, 2025

stes left a comment

stes Apr 24, 2025

CeliaBenquet Apr 24, 2025 •

edited

Loading

stes Apr 24, 2025

CeliaBenquet Apr 24, 2025

MMathisLab commented Apr 24, 2025

MMathisLab commented Apr 24, 2025

stes left a comment

stes left a comment

stes May 1, 2025

CeliaBenquet May 2, 2025 •

edited

Loading

stes May 1, 2025

CeliaBenquet May 2, 2025

stes May 6, 2025

CeliaBenquet May 12, 2025

stes May 12, 2025

stes May 1, 2025

CeliaBenquet May 12, 2025

stes May 8, 2025

		if hasattr(self, "n_features"):
		state_dict["n_features"] = self.n_features

Batched inference CEBRA & padding at the Solver level #168

Are you sure you want to change the base?

Batched inference CEBRA & padding at the Solver level #168

Conversation

CeliaBenquet commented Aug 23, 2024 • edited Loading

MMathisLab commented Apr 23, 2025

stes left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

CeliaBenquet Apr 24, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MMathisLab commented Apr 24, 2025

MMathisLab commented Apr 24, 2025

stes left a comment

Choose a reason for hiding this comment

stes left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

CeliaBenquet May 2, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Batched inference CEBRA & padding at the `Solver` level #168

Batched inference CEBRA & padding at the `Solver` level #168

CeliaBenquet commented Aug 23, 2024 •

edited

Loading

CeliaBenquet Apr 24, 2025 •

edited

Loading

CeliaBenquet May 2, 2025 •

edited

Loading