
[work in progress] Support of generate #261


Draft · wants to merge 4 commits into main

Conversation

@bigximik (Contributor) commented May 12, 2025

✨ Description

Adds support for generate and extends support for forward. Cache, past_key_values, labels, attention outputs, and inputs_embeds are not handled; position_ids are ignored and reconstructed from the attention mask.
Currently, only data-parallel generation is supported.

Closes #217
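
A minimal sketch of the intended call pattern (the wrapper class name, checkpoint path, and config-update key are illustrative assumptions, not the exact API in this diff):

```python
# Hypothetical usage sketch; the class name, checkpoint path, and update keys
# are placeholders and may not match the exact API added in this PR.
import torch

# model = HuggingfaceGPTModelForCausalLM.from_pretrained(
#     "/path/to/fast_llm_checkpoint",
#     {("base_model", "transformer", "use_flash_attention"): True},  # tuple-key updates
# )

input_ids = torch.tensor([[1, 2, 3, 4]])
attention_mask = torch.ones_like(input_ids)

# position_ids are ignored and rebuilt from the attention mask, conceptually:
position_ids = (attention_mask.cumsum(-1) - 1).clamp(min=0)

# Greedy, data-parallel generation; cache/past_key_values, labels, attention
# outputs, and inputs_embeds are not handled.
# output_ids = model.generate(
#     input_ids,
#     attention_mask=attention_mask,
#     max_new_tokens=16,
#     do_sample=False,
# )
```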

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

  1. Adds generate support (greedy decoding, currently data-parallel only).
  2. Extends forward support; cache, past_key_values, labels, attention outputs, and inputs_embeds are not handled, and position_ids are reconstructed from the attention mask.

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

  • 🐋 I have updated the Docker configuration or dependencies, if applicable.
  • 🔄 I have ensured compatibility with the existing setup after dependency changes.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

If there is any impact on performance, describe it and provide benchmark results, if applicable:


🗒️ Additional Notes

Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.

@bigximik (Contributor, Author)

Next steps are to create tests and add docs.

@bigximik (Contributor, Author)

There are 3 open questions:

  1. updates parameter in from_pretrained
    Should the keys be tuples, or should we enable support for dot notation inside from_pretrained (if possible)? If not, should we wrap it in a simple dot notation parser?

  2. BatchConfig support
    Do we want to support the full BatchConfig, including sequential_micro_batches? Or is it sufficient as it is now, where only micro_batch_size is passed and the full batch size is computed as data_parallel_groups * micro_batch_size?

  3. Factory methods
    Should we have dedicated from_trainer and from_evaluator methods, or is the current unified from_model interface sufficient?

@oleksost (Contributor) commented May 12, 2025

As a side note, to support downstream evals during training, we do not necessarily need a .generate function, do we?
I.e. we only need it if we want to support generation tasks (which usually take longer). If we only want multiple-choice/likelihood-based evals, we might not need it at all, which could make the implementation easier for some models like SSMs and maybe even transformers, since there is potentially no need to care about KV/state caching etc. (Afaiu the lm_eval harness does not use .generate for likelihood-based evals?)

@jlamypoirier (Collaborator)

  1. updates parameter in from_pretrained
    Should the keys be tuples, or should we enable support for dot notation inside from_pretrained (if possible)? If not, should we wrap it in a simple dot notation parser?

Let's keep the existing format and not use a new rule. cls.config_class.from_dict can deal with updates.

  2. BatchConfig support
    Do we want to support the full BatchConfig, including sequential_micro_batches? Or is it sufficient as it is now, where only micro_batch_size is passed and the full batch size is computed as data_parallel_groups * micro_batch_size?

The micro_batch_size is useless by itself. It's only used for pipeline parallelism and throughput computation, both of which need the full batch config. It's OK to just raise an error for features we don't want.

  3. Factory methods
    Should we have dedicated from_trainer and from_evaluator methods, or is the current unified from_model interface sufficient?

I don't really see the point of from_model; it seems like an unnecessary wrapper. from_trainer would be more useful, since it would do actual work in gathering the necessary information from the trainer.

@bigximik (Contributor, Author)

@oleksost This is a split from the larger sandbox PR for full lm_eval integration (#222).
You're right — lm_eval does not use generate for some tasks, but it does for others, which we may need to support either during training or as a separate model evaluation step.

@bigximik (Contributor, Author)

Let's keep the existing format and not use a new rule. cls.config_class.from_dict can deal with updates.

I just added feature parity with FastLLMModel.from_pretrained, so that I can pass updates to from_dict, which is called inside it.

However, I meant a different thing here — Fast-LLM's runnable supports dot notation to apply updates from the command line:
https://github.com/ServiceNow/Fast-LLM/blob/denis/generate_final/fast_llm/engine/config_utils/runnable.py#L183
The from_dict method on config does not support this, right?

That said, @tscholak is okay with using tuples as keys in from_pretrained, so let's leave it as implemented for now.

@bigximik (Contributor, Author) commented May 12, 2025

The micro_batch_size is useless by itself. It's only used for pipeline parallelism and throughput computation, both of which need the full batch config. It's ok to just raise an error for features we don't want.

We still need to support different global batch sizes for the same micro_batch_size, which is used as the batch size per GPU.

For example, consider a training scenario with:

  • sequence_length = 8K
  • global_batch_size = 32
  • micro_batch_size = 1 (per GPU)
  • Model = 8B
  • GPUs = 16

Without model parallelism, this configuration requires sequential_micro_batches = 2: the 8B model leaves room for only micro_batch_size = 1 per GPU, so reaching a global batch size of 32 on 16 GPUs takes two sequential micro-batches.

However, when performing evaluation during training, we cannot use the same BatchConfig, as we do not yet support sequential_micro_batches > 1 in inference_runner. But we need to keep micro_batch_size = 1 to avoid OOM on the GPUs. So we can't accept the full training BatchConfig as-is. Instead, we need to extract just the micro_batch_size and multiply it by the number of GPUs to derive the global batch size for evaluation.
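
To make the arithmetic explicit, here is a sketch of the numbers above (illustrative only, not code from this PR):

```python
# Worked numbers for the scenario above.
micro_batch_size = 1          # per-GPU batch size, limited by memory for the 8B model
data_parallel_gpus = 16
sequential_micro_batches = 2  # training only

# Training: the global batch size combines all three factors.
train_global_batch = micro_batch_size * sequential_micro_batches * data_parallel_gpus
assert train_global_batch == 32

# Evaluation during training: inference_runner does not yet support
# sequential_micro_batches > 1, so the evaluation batch size is derived from
# micro_batch_size and the number of data-parallel GPUs alone.
eval_global_batch = micro_batch_size * data_parallel_gpus
assert eval_global_batch == 16
```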

Additionally, sequence_length will differ: for generate or forward it will not be at the training maximum, and for evaluation tasks it varies. This means it doesn't make much sense to carry it over from training; and if we eventually want to compute FLOPs accurately for evaluation, we'll need to consider the actual tensor sizes in each forward pass, but that's outside the scope of this PR.

@bigximik (Contributor, Author)

I don't really see the point of from_model, seems like an unnecessary wrapper. from_trainer would be more useful since it would do actual work in gathering the necessary information from the trainer.

Yeah, you're right — from_model just passes arguments one-to-one to the constructor, so I'll remove it.
That said, I also don't see much value in adding from_trainer right now, since it's just a matter of passing a few additional parameters manually.

@jlamypoirier (Collaborator)

However, I meant a different thing here — Fast-LLM's runnable supports dot notation to apply updates from the command line: https://github.com/ServiceNow/Fast-LLM/blob/denis/generate_final/fast_llm/engine/config_utils/runnable.py#L183 The from_dict method on config does not support this, right?

The tuple and dot formats are basically the same, it's just parsed vs unparsed. I don't think there is much use for the unparsed version outside of a runnable, because dicts are better in-code.
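
For illustration, the two forms might look like this (the nested field path is made up; only the key format matters):

```python
# Illustrative only: the field path is invented, and the exact
# from_pretrained / from_dict signatures may differ.

# Parsed form (tuple keys), as used in this PR:
updates_parsed = {
    ("base_model", "transformer", "use_flash_attention"): False,
}

# Unparsed form (dot notation), as accepted by Fast-LLM's runnable on the
# command line; converting it to the parsed form is a one-liner:
updates_unparsed = {
    "base_model.transformer.use_flash_attention": False,
}


def parse_dot_keys(updates: dict) -> dict:
    """Convert dot-notation keys into tuple keys."""
    return {tuple(key.split(".")): value for key, value in updates.items()}


assert parse_dot_keys(updates_unparsed) == updates_parsed
```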

However, when performing evaluation during training, we cannot use the same BatchConfig, as we do not yet support sequential_micro_batches > 1 in inference_runner. But we need to keep micro_batch_size = 1 to avoid OOM on the GPUs. So we can't accept the full training BatchConfig as-is. Instead, we need to extract just the micro_batch_size and multiply it by the number of GPUs to derive the global batch size for evaluation.

Additionally, sequence_length will differ: for generate or forward it will not be at the training maximum, and for evaluation tasks it varies. This means it doesn't make much sense to carry it over from training; and if we eventually want to compute FLOPs accurately for evaluation, we'll need to consider the actual tensor sizes in each forward pass, but that's outside the scope of this PR.

Again, forward and generate don't care about the batch config. They take a single, already preprocessed micro-batch and run it no matter the content. The BatchConfig we have in the inference runner is just a placeholder because the schedule runner expects one.

Note that sequential micro-batches are irrelevant for inference; they're just separate batches for all practical purposes.

Successfully merging this pull request may close these issues.

Implement .generate (greedy decoding only)
3 participants