Conversation

@sglucas sglucas commented Jul 23, 2025

Closes #6367

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs
  • I have installed pre-commit: pip install pre-commit && pre-commit install

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge

e.g. fixed #1234, closed #1234, resolved #1234

📝 What does this PR do?

Summarize your work here.
If you have any plots/diagrams/screenshots/tables, please attach them here.

Add more training models (LLaMA3, Qwen3) and RLHF algorithms (REINFORCE++, RLOO).
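As context for the RLOO algorithm this PR adds, the standard leave-one-out baseline can be sketched as follows (a minimal sketch of the textbook formulation, not the PR's actual code; shapes are assumptions):

```python
import torch

def rloo_advantages(reward: torch.Tensor) -> torch.Tensor:
    """Leave-one-out advantages: each sample's baseline is the mean reward
    of the *other* generations for the same prompt (standard RLOO).

    reward: [minibatch_size, num_generations]
    """
    n = reward.size(1)
    # Mean of the other n-1 samples: (sum - r_i) / (n - 1)
    baseline = (reward.sum(dim=1, keepdim=True) - reward) / (n - 1)
    return reward - baseline
```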

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests.
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

@sglucas sglucas requested a review from a team as a code owner July 23, 2025 02:04
@sglucas sglucas changed the base branch from main to grpo-latest July 23, 2025 02:05
# [minibatch_size x num_of_generation]
loss_mask = torch.ones(action_mask.size(0), device=action_mask.device).bool()

Contributor
It may be better to move the common calculations outside of the if statements for conciseness.

# [minibatch_size x num_generations]
advantages = (reward - reward_mean).unsqueeze(dim=-1)

advantages_mean = advantages.mean(dim=0)
Contributor
Isn't advantages_mean always 0, since the advantages are already zero-centered in the previous step?
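The reviewer's point can be checked with a toy example (shapes are assumed from the diff; this is an illustration, not the PR's code):

```python
import torch

# Toy rewards for one group of generations
reward = torch.tensor([1.0, 2.0, 3.0, 4.0])
reward_mean = reward.mean()
advantages = (reward - reward_mean).unsqueeze(dim=-1)  # already zero-centered

# The mean over the group is 0 by construction, so subtracting
# advantages_mean again is a no-op.
advantages_mean = advantages.mean(dim=0)
```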

advantages_std = advantages.std(dim=0)

advantages = (advantages - advantages_mean) / (advantages_std + 1e-4)

Contributor
Maybe consider double-checking the REINFORCE++ baseline advantage calculation. In REINFORCE++, each sample's advantage is calculated by subtracting the mean reward of all generations in the global batch, not the per-prompt mean.

Member
For REINFORCE++, we should calculate the normalized advantage using the batch-level mean and std.
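A minimal sketch of the batch-level normalization the reviewers describe (function name and shapes are illustrative assumptions, not the PR's actual code):

```python
import torch

def normalize_advantages_batch_level(reward: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Zero-center rewards with the global-batch mean, then divide by the
    global-batch std, as suggested for REINFORCE++ (rather than per-prompt stats).

    reward: flat [minibatch_size * num_generations] reward tensor.
    """
    advantages = reward - reward.mean()              # subtract batch-level mean
    advantages = advantages / (reward.std() + eps)   # divide by batch-level std
    return advantages.unsqueeze(dim=-1)              # [N, 1] for per-token broadcast
```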

@@ -0,0 +1,2 @@
4.51.0: qwen2.5 + grpo, qwen3 + grpo, cannot: llama2, llama3.2
4.47.0:
Contributor
Remove the test log file.

Member
Please remove this file.

@@ -227,13 +227,13 @@
os.environ["TOKENIZERS_PARALLELISM"] = "false" # Disable tokenizers parallelism to avoid deadlock

inference_model_config = dict(path=args.model)
train_model_config = dict(path=args.model, use_flash_attention_2=True, use_cache=False)
train_model_config = dict(path=args.model, use_flash_attention_2=False, use_cache=False)
Contributor
Why is flash attention not supported?

generate_config = dict(top_k=args.top_k, top_p=args.top_p, temperature=args.temperature)

if args.backend == "transformers":
inference_model_config.update(
dict(
use_flash_attention_2=True,
use_flash_attention_2=False,
Contributor
same here

Contributor
Probably also consider forcing num_generations to 1 for REINFORCE++.
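One way to enforce the suggestion above is a small guard at argument-parsing time (a sketch; the argument names are illustrative assumptions, not the PR's actual CLI flags):

```python
def validate_num_generations(algo: str, num_generations: int) -> int:
    """REINFORCE++ uses a batch-level baseline rather than a per-prompt group
    baseline, so multiple generations per prompt are unnecessary; force 1."""
    if algo.lower() == "reinforce++" and num_generations != 1:
        print(f"Warning: forcing num_generations from {num_generations} to 1 for REINFORCE++")
        return 1
    return num_generations
```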

@TongLi3701 TongLi3701 changed the base branch from grpo-latest to main August 21, 2025 06:55
@TongLi3701 TongLi3701 changed the base branch from main to grpo-latest August 21, 2025 06:55
Member

@TongLi3701 TongLi3701 left a comment
Thanks, we left some comments.

@sglucas sglucas closed this by deleting the head repository Aug 25, 2025
Linked issue: [FEATURE]: Add more training models and RLHF algorithms