
Failed to run in mk mode when batch_size is greater than 1 #2


Open
JackeyHua-SJTU opened this issue Jun 3, 2025 · 0 comments
Labels: bugs, help needed

Issue Description

I can't run the benchmark code in mk mode when batch_size is greater than 1. The model I use is Llama-3.2-1B-Instruct with batch size 2; all other parameters of ScriptConfig are set to their default values.
Take the following instruction as an example.

python megakernels/scripts/generate.py mode=mk prompt="tell me a funny joke about cookies" ntok=100 batch_size=2          

The traceback is as follows.

Traceback (most recent call last):
  File "/root/Megakernels/megakernels/scripts/generate.py", line 211, in <module>
    pydra.run(main)
  File "/venv/lib/python3.12/site-packages/pydra/cli.py", line 146, in run
    return _apply_overrides_and_call(fn, first_arg_type, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/pydra/cli.py", line 118, in _apply_overrides_and_call
    return fn(config)
           ^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/Megakernels/megakernels/scripts/generate.py", line 174, in main
    gen.generate(output_tokens, prompt_len, config.ntok - 1)
  File "/root/Megakernels/megakernels/generators.py", line 165, in generate
    output_ids = self.run(input_ids, pos_id=pos_id)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Megakernels/megakernels/generators.py", line 132, in run
    self.schedule.globs.hidden_states[:] = hiddens.squeeze(1)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
RuntimeError: expand(CUDABFloat16Type{[2, 2048]}, size=[2048]): the number of sizes provided (1) must be greater or equal to the number of dimensions in the tensor (2)

Potential cause

I think the problem lies in the shape of BaseGlobals.hidden_states. It is initialized in the make_global() function of Megakernels/megakernels/demos/latency/scheduler.py.

hidden_states=make_buffer(config.hidden_size)

So hidden_states has only one dimension, of size config.hidden_size, which is a model-dependent constant; call it hidden_size. But if the batch size is greater than 1, say n, then in the run function of MK_Generator, input_ids has shape (n, 1), so hiddens has shape (n, 1, hidden_size). After squeeze(1) it still has shape (n, hidden_size), which cannot be broadcast into self.schedule.globs.hidden_states (shape (hidden_size,)).
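The shape mismatch can be reproduced in isolation. Here is a minimal sketch using NumPy as a stand-in for the PyTorch tensors, with hidden_size shrunk to 8 for readability (it is 2048 in the traceback above):

```python
import numpy as np

hidden_size = 8  # stand-in for config.hidden_size (2048 in the traceback)

# 1-D buffer, analogous to make_buffer(config.hidden_size)
hidden_states = np.zeros(hidden_size, dtype=np.float32)

# batch_size=1: hiddens has shape (1, 1, hidden_size); squeeze(1) gives
# (1, hidden_size), whose leading singleton dim broadcasts into the 1-D buffer
hidden_states[:] = np.ones((1, 1, hidden_size)).squeeze(1)  # OK

# batch_size=2: squeeze(1) gives (2, hidden_size), which cannot be broadcast
# into a (hidden_size,) buffer -- mirroring the RuntimeError above
try:
    hidden_states[:] = np.ones((2, 1, hidden_size)).squeeze(1)
except ValueError as e:
    print("broadcast error:", e)
```

If this diagnosis is right, one plausible direction for a fix is to allocate the buffer with a leading batch dimension, i.e. shape (batch_size, hidden_size), so the assignment broadcasts for any n; I haven't checked whether the rest of the scheduler assumes a 1-D buffer, though.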

Environment

  • GPU: H800
  • OS: Linux x86_64
  • CUDA: 12.8
  • Python: 3.12