
[work in progress] Support of generate #261


Draft · wants to merge 4 commits into main

Conversation

@bigximik (Contributor) commented May 12, 2025

✨ Description

Adds support for generate and extends support for forward. Cache, past_key_values, labels, attention outputs, and inputs_embeds are not handled; position_ids are ignored and reconstructed from the attention mask.
Currently, only data-parallel generation is supported.

Closes #217
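
A minimal sketch of the intended call pattern (the wrapper class name, checkpoint path, and config-update key are illustrative assumptions, not the exact API in this diff):

```python
# Hypothetical usage sketch; the class name, checkpoint path, and update keys
# are placeholders and may not match the exact API added in this PR.
import torch

# model = HuggingfaceGPTModelForCausalLM.from_pretrained(
#     "/path/to/fast_llm_checkpoint",
#     {("base_model", "transformer", "use_flash_attention"): True},  # tuple-key updates
# )

input_ids = torch.tensor([[1, 2, 3, 4]])
attention_mask = torch.ones_like(input_ids)

# position_ids are ignored and rebuilt from the attention mask, conceptually:
position_ids = (attention_mask.cumsum(-1) - 1).clamp(min=0)

# Greedy, data-parallel generation; cache/past_key_values, labels, attention
# outputs, and inputs_embeds are not handled.
# output_ids = model.generate(
#     input_ids,
#     attention_mask=attention_mask,
#     max_new_tokens=16,
#     do_sample=False,
# )
```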

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

  1. Adds generate support (greedy decoding, currently data-parallel only).
  2. Extends forward support; cache, past_key_values, labels, attention outputs, and inputs_embeds are not handled, and position_ids are reconstructed from the attention mask.

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

  • 📜 I have read and followed the contributing guidelines.
  • 🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
  • 🎉 The functionality is complete, and I have tested the changes.
  • 📝 I have updated the documentation if needed.
  • ⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
  • 🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

  • 🐋 I have updated the Docker configuration or dependencies, if applicable.
  • 🔄 I have ensured compatibility with the existing setup after dependency changes.

Testing

  • 🧪 I have added or updated tests to cover my changes.
  • ✔️ New and existing tests pass locally with my changes.
  • 🚦 I have tested these changes on GPUs and verified training stability.
  • 🏋️ I have tested the changes on realistic training workloads, if applicable.

Performance Impact

  • 📊 I have run benchmarks where applicable to evaluate the performance impact.
  • ✅ The benchmarks show no performance regression.
  • 🚀 The benchmarks indicate a potential performance improvement.
  • ⚠️ The benchmarks indicate a potential performance degradation.
  • 📈 I have provided benchmark results and detailed any performance impact below, if applicable.

📊 Performance Impact Details

If there is any impact on performance, describe it and provide benchmark results, if applicable:


🗒️ Additional Notes

Include any additional context, information, or considerations here, such as known issues, follow-up tasks, or backward compatibility concerns.

@bigximik (Contributor, Author)

Next steps are to create tests and add docs.

@bigximik (Contributor, Author)

There are 3 open questions:

  1. updates parameter in from_pretrained
    Should the keys be tuples, or should we enable support for dot notation inside from_pretrained (if possible)? If not, should we wrap it in a simple dot notation parser?

  2. BatchConfig support
    Do we want to support the full BatchConfig, including sequential_micro_batches? Or is it sufficient as it is now, where only micro_batch_size is passed and the full batch size is computed as data_parallel_groups * micro_batch_size?

  3. Factory methods
    Should we have dedicated from_trainer and from_evaluator methods, or is the current unified from_model interface sufficient?

@oleksost (Contributor) commented May 12, 2025

As a side note, to support downstream evals during training, we do not necessarily need a .generate function, do we?
I.e. we only need it if we want to support generation tasks (which usually take longer). If we only want multiple-choice/likelihood-based evals, we might not need it at all, which could make the implementation easier for some models like SSMs and maybe even transformers, since there is potentially no need to care about KV/state caching etc. (Afaiu the lm_eval harness does not use .generate for likelihood-based evals?)

@jlamypoirier (Collaborator)

  1. updates parameter in from_pretrained
    Should the keys be tuples, or should we enable support for dot notation inside from_pretrained (if possible)? If not, should we wrap it in a simple dot notation parser?

Let's keep the existing format and not use a new rule. cls.config_class.from_dict can deal with updates.

  2. BatchConfig support
    Do we want to support the full BatchConfig, including sequential_micro_batches? Or is it sufficient as it is now, where only micro_batch_size is passed and the full batch size is computed as data_parallel_groups * micro_batch_size?

The micro_batch_size is useless by itself. It's only used for pipeline parallelism and throughput computation, both of which need the full batch config. It's OK to just raise an error for features we don't want.

  3. Factory methods
    Should we have dedicated from_trainer and from_evaluator methods, or is the current unified from_model interface sufficient?

I don't really see the point of from_model; it seems like an unnecessary wrapper. from_trainer would be more useful, since it would do actual work in gathering the necessary information from the trainer.

@bigximik (Contributor, Author)

@oleksost This is a split from the larger sandbox PR for full lm_eval integration (#222).
You're right — lm_eval does not use generate for some tasks, but it does for others, which we may need to support either during training or as a separate model evaluation step.

@bigximik (Contributor, Author)

Let's keep the existing format and not use a new rule. cls.config_class.from_dict can deal with updates.

I just added feature parity with FastLLMModel.from_pretrained, so that I can pass updates to from_dict, which is called inside it.

However, I meant a different thing here — Fast-LLM's runnable supports dot notation to apply updates from the command line:
https://github.com/ServiceNow/Fast-LLM/blob/denis/generate_final/fast_llm/engine/config_utils/runnable.py#L183
The from_dict method on config does not support this, right?

That said, @tscholak is okay with using tuples as keys in from_pretrained, so let's leave it as implemented for now.

@bigximik (Contributor, Author) commented May 12, 2025

The micro_batch_size is useless by itself. It's only used for pipeline parallelism and throughput computation, both of which need the full batch config. It's ok to just raise an error for features we don't want.

We still need to support different global batch sizes for the same micro_batch_size, which is used as the batch size per GPU.

For example, consider a training scenario with:

  • sequence_length = 8K
  • global_batch_size = 32
  • micro_batch_size = 1 (per GPU)
  • Model = 8B
  • GPUs = 16

Without model parallelism, this configuration requires sequential_micro_batches = 2: the 8B model leaves room for only micro_batch_size = 1 per GPU, so reaching a global batch size of 32 on 16 GPUs takes two sequential micro-batches.

However, when performing evaluation during training, we cannot use the same BatchConfig, as we do not yet support sequential_micro_batches > 1 in inference_runner. But we need to keep micro_batch_size = 1 to avoid OOM on the GPUs. So we can't accept the full training BatchConfig as-is. Instead, we need to extract just the micro_batch_size and multiply it by the number of GPUs to derive the global batch size for evaluation.
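
To make the arithmetic explicit, here is a sketch of the numbers above (illustrative only, not code from this PR):

```python
# Worked numbers for the scenario above.
micro_batch_size = 1          # per-GPU batch size, limited by memory for the 8B model
data_parallel_gpus = 16
sequential_micro_batches = 2  # training only

# Training: the global batch size combines all three factors.
train_global_batch = micro_batch_size * sequential_micro_batches * data_parallel_gpus
assert train_global_batch == 32

# Evaluation during training: inference_runner does not yet support
# sequential_micro_batches > 1, so the evaluation batch size is derived from
# micro_batch_size and the number of data-parallel GPUs alone.
eval_global_batch = micro_batch_size * data_parallel_gpus
assert eval_global_batch == 16
```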

Additionally, sequence_length will differ: for generate or forward it will not be at the training maximum, and for evaluation tasks it varies. This means it doesn't make much sense to carry it over from training; and if we eventually want to compute FLOPs accurately for evaluation, we'll need to consider the actual tensor sizes in each forward pass, but that's outside the scope of this PR.

@bigximik (Contributor, Author)

I don't really see the point of from_model, seems like an unnecessary wrapper. from_trainer would be more useful since it would do actual work in gathering the necessary information from the trainer.

Yeah, you're right — from_model just passes arguments one-to-one to the constructor, so I'll remove it.
That said, I also don't see much value in adding from_trainer right now, since it's just a matter of passing a few additional parameters manually.

@jlamypoirier (Collaborator)

However, I meant a different thing here — Fast-LLM's runnable supports dot notation to apply updates from the command line: https://github.com/ServiceNow/Fast-LLM/blob/denis/generate_final/fast_llm/engine/config_utils/runnable.py#L183 The from_dict method on config does not support this, right?

The tuple and dot formats are basically the same, it's just parsed vs unparsed. I don't think there is much use for the unparsed version outside of a runnable, because dicts are better in-code.
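
For illustration, the two forms might look like this (the nested field path is made up; only the key format matters):

```python
# Illustrative only: the field path is invented, and the exact
# from_pretrained / from_dict signatures may differ.

# Parsed form (tuple keys), as used in this PR:
updates_parsed = {
    ("base_model", "transformer", "use_flash_attention"): False,
}

# Unparsed form (dot notation), as accepted by Fast-LLM's runnable on the
# command line; converting it to the parsed form is a one-liner:
updates_unparsed = {
    "base_model.transformer.use_flash_attention": False,
}


def parse_dot_keys(updates: dict) -> dict:
    """Convert dot-notation keys into tuple keys."""
    return {tuple(key.split(".")): value for key, value in updates.items()}


assert parse_dot_keys(updates_unparsed) == updates_parsed
```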

However, when performing evaluation during training, we cannot use the same BatchConfig, as we do not yet support sequential_micro_batches > 1 in inference_runner. But we need to keep micro_batch_size = 1 to avoid OOM on the GPUs. So we can't accept the full training BatchConfig as-is. Instead, we need to extract just the micro_batch_size and multiply it by the number of GPUs to derive the global batch size for evaluation.

Additionally, sequence_length will differ: for generate or forward it will not be at the training maximum, and for evaluation tasks it varies. This means it doesn't make much sense to carry it over from training; and if we eventually want to compute FLOPs accurately for evaluation, we'll need to consider the actual tensor sizes in each forward pass, but that's outside the scope of this PR.

Again, forward and generate don't care about the batch config. They take a single, already preprocessed micro-batch and run it no matter the content. The BatchConfig we have in the inference runner is just a placeholder because the schedule runner expects one.

Note that sequential micro-batches are irrelevant for inference; they're just separate batches for all practical purposes.

Successfully merging this pull request may close these issues.

Implement .generate (greedy decoding only)
3 participants