
Failed to run in mk mode when batch_size is greater than 1 #2


Open
JackeyHua-SJTU opened this issue Jun 3, 2025 · 0 comments
Labels: bugs, help needed

Issue Description

I can't run the benchmark code in mk mode when batch_size is greater than 1. The model I use is Llama-3.2-1B-Instruct with batch size 2; all other parameters of ScriptConfig are set to their default values.
Take the following instruction as an example.

python megakernels/scripts/generate.py mode=mk prompt="tell me a funny joke about cookies" ntok=100 batch_size=2          

The traceback is as follows.

Traceback (most recent call last):
  File "/root/Megakernels/megakernels/scripts/generate.py", line 211, in <module>
    pydra.run(main)
  File "/venv/lib/python3.12/site-packages/pydra/cli.py", line 146, in run
    return _apply_overrides_and_call(fn, first_arg_type, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/pydra/cli.py", line 118, in _apply_overrides_and_call
    return fn(config)
           ^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/Megakernels/megakernels/scripts/generate.py", line 174, in main
    gen.generate(output_tokens, prompt_len, config.ntok - 1)
  File "/root/Megakernels/megakernels/generators.py", line 165, in generate
    output_ids = self.run(input_ids, pos_id=pos_id)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Megakernels/megakernels/generators.py", line 132, in run
    self.schedule.globs.hidden_states[:] = hiddens.squeeze(1)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
RuntimeError: expand(CUDABFloat16Type{[2, 2048]}, size=[2048]): the number of sizes provided (1) must be greater or equal to the number of dimensions in the tensor (2)

Potential cause

I think the problem lies in the shape of BaseGlobals.hidden_states. It is initialized in the make_global() function of Megakernels/megakernels/demos/latency/scheduler.py.

hidden_states=make_buffer(config.hidden_size)

So hidden_states has only one dimension, of size config.hidden_size, which is a model-dependent constant; call it hidden_size. But if the batch size is greater than 1, say n, then in the run function of MK_Generator, input_ids has shape (n, 1), so hiddens has shape (n, 1, hidden_size). After squeeze(1) it still has shape (n, hidden_size), which cannot be broadcast into self.schedule.globs.hidden_states (shape (hidden_size,)).
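The shape mismatch can be reproduced in isolation. Here is a minimal sketch using NumPy as a stand-in for the PyTorch tensors, with hidden_size shrunk to 8 for readability (it is 2048 in the traceback above):

```python
import numpy as np

hidden_size = 8  # stand-in for config.hidden_size (2048 in the traceback)

# 1-D buffer, analogous to make_buffer(config.hidden_size)
hidden_states = np.zeros(hidden_size, dtype=np.float32)

# batch_size=1: hiddens has shape (1, 1, hidden_size); squeeze(1) gives
# (1, hidden_size), whose leading singleton dim broadcasts into the 1-D buffer
hidden_states[:] = np.ones((1, 1, hidden_size)).squeeze(1)  # OK

# batch_size=2: squeeze(1) gives (2, hidden_size), which cannot be broadcast
# into a (hidden_size,) buffer -- mirroring the RuntimeError above
try:
    hidden_states[:] = np.ones((2, 1, hidden_size)).squeeze(1)
except ValueError as e:
    print("broadcast error:", e)
```

If this diagnosis is right, one plausible direction for a fix is to allocate the buffer with a leading batch dimension, i.e. shape (batch_size, hidden_size), so the assignment broadcasts for any n; I haven't checked whether the rest of the scheduler assumes a 1-D buffer, though.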

Environment

  • GPU: H800
  • OS: Linux x86_64
  • CUDA: 12.8
  • Python: 3.12