5269 commits
e82a8e7
ADLR/megatron-lm!3216 - [dist ckpt] Reduce memory usage for common st…
ananthsub Jul 28, 2025
afc749e
Merge branch 'avoid-rank0-validation-memory' into 'main'
ko3n1g Jul 28, 2025
018b1a9
ADLR/megatron-lm!3690 - fix the unpausing algorithm to unpause reques…
sidsingh-nvidia Jul 29, 2025
2af9630
Merge branch 'siddharth/fix-unpause-alg' into 'main'
shanmugamr1992 Jul 29, 2025
a77a883
ADLR/megatron-lm!3667 - fix: fix get_transformer_layer_offset
ZhiyuLi-Nvidia Jul 29, 2025
26c5430
Merge branch 'zhiyul/fix_get_transformer_layer_offset' into 'main'
jaredcasper Jul 29, 2025
e38ba9e
ADLR/megatron-lm!3659 - feat(optmizer offloading): Support optimizer …
yuzhongw-nvidia Jul 29, 2025
abbde02
Merge branch 'fix_optim_offload' into 'main'
jaredcasper Jul 29, 2025
a00ad16
ADLR/megatron-lm!3342 - Reduce memory usage after saving fp8 checkpoi…
BestJuly Jul 29, 2025
7af09be
Merge branch 'lit/reduce_memory_after_ckpt_save' into 'main'
jaredcasper Jul 29, 2025
720c8b4
ADLR/megatron-lm!3576 - perf(MLA): MLA down proj switch back to TELinear
yuzhongw-nvidia Jul 29, 2025
2fbd646
Merge branch 'mla_down_proj_telinear' into 'main'
ko3n1g Jul 29, 2025
fad83fc
ADLR/megatron-lm!3642 - Fix DCP OOM during SwiGLU load
mikolajblaz Jul 30, 2025
2a97a16
Merge branch 'mblaz/mlp-glu-oom-fix' into 'main'
ko3n1g Jul 30, 2025
8e9d56d
ADLR/megatron-lm!3455 - Fix reshardable checkpoint format by removing…
skierat Jul 30, 2025
3967229
Merge branch 'skierat/avoid_prefix' into 'main'
ko3n1g Jul 30, 2025
733b682
ADLR/megatron-lm!3573 - Clean RoPE fusion
yaox12 Jul 31, 2025
472ecf1
Merge branch 'xiny/clean_rope' into 'main'
ko3n1g Jul 31, 2025
f7de536
ADLR/megatron-lm!3698 - Fix: Update OneLogger Instrumentation Points …
PytLab Jul 31, 2025
8ee323a
Merge branch 'main' into 'main'
jaredcasper Jul 31, 2025
84cc522
ADLR/megatron-lm!3699 - Fix Issue with EP and TP
wdykas Aug 1, 2025
84cf979
Merge branch 'fix-ep-tp' into 'main'
ko3n1g Aug 1, 2025
e7c55de
ADLR/megatron-lm!3473 - Changes to enable full iteration CUDA graph
vasunvidia Aug 1, 2025
ed1001d
Merge branch 'vrengasamy/full_cuda_graph2' into 'main'
ko3n1g Aug 1, 2025
b0ccc17
ADLR/megatron-lm!3752 - docs: fix typo in custom_fsdp.md
shjwudp Aug 1, 2025
af962d8
Merge branch 'jianbinc/fix_cfsdp_doc_typo' into 'main'
ko3n1g Aug 1, 2025
ae1c882
ADLR/megatron-lm!3470 - feat(MoE): Support Expert Parallel A2A Overla…
Aug 1, 2025
0c6c117
Merge branch 'hongbinl/support_pp_1' into 'main'
ko3n1g Aug 1, 2025
1f6a0a3
ADLR/megatron-lm!3762 - CI Fix: fix FT failure for A2A Overlap
Aug 2, 2025
7f7439f
Merge branch 'hongbinl/fix_ft_for_a2a_overlap' into 'main'
ko3n1g Aug 2, 2025
bbb4c5f
chore: Version bump
Aug 4, 2025
482f87e
ADLR/megatron-lm!3767 - chore: Upgrade dependencies (2025-08-04)
Aug 4, 2025
dee9e78
Merge branch 'ci-bot/build/upgrade-dependencies-2025-08-04' into 'main'
ko3n1g Aug 4, 2025
bea7d5e
ADLR/megatron-lm!3737 - Abstract _forward_impl for linears to a membe…
cjluo-nv Aug 4, 2025
a58d514
Merge branch 'chenjiel/linear_forward_impl' into 'main'
ericharper Aug 4, 2025
ed24391
ADLR/megatron-lm!3748 - build: Add mla-flash
ko3n1g Aug 5, 2025
c606697
Merge branch 'ko3n1g/build/mla-flash' into 'main'
ko3n1g Aug 5, 2025
11a2c15
ADLR/megatron-lm!3763 - build: Readd jet-client
ko3n1g Aug 5, 2025
75e2efd
Merge branch 'ko3n1g/ci/re-add-jet-client' into 'main'
ko3n1g Aug 5, 2025
d8023b6
ADLR/megatron-lm!3778 - ci: Unpin packaging requirement
chtruong814 Aug 5, 2025
14e2a53
Merge branch 'chtruong/packaging' into 'main'
ko3n1g Aug 5, 2025
51ba755
ADLR/megatron-lm!3705 - Update pytorch to 25.06 and TE 2.6
thomasdhc Aug 6, 2025
71328b8
Merge branch '25.09-dep-update' into 'main'
ko3n1g Aug 6, 2025
21b93a6
ADLR/megatron-lm!3681 - build: Bump NVRx and modelopt for 25.09
ko3n1g Aug 6, 2025
cc227f7
Merge branch 'ko3n1g/build/test-te-2.6' into 'main'
ko3n1g Aug 6, 2025
e46ec94
ci(hotfix): Update _run_training.sh
ko3n1g Aug 7, 2025
f26e958
ADLR/megatron-lm!3789 - ci: Add required reviewer
chtruong814 Aug 7, 2025
d514c6f
Merge branch 'chtruong/add-final-reviewer' into 'main'
ko3n1g Aug 7, 2025
9667275
ADLR/megatron-lm!3320 - Singleton local shards for checkpointing
mikolajblaz Aug 8, 2025
1062399
Merge branch 'mblaz/singleton-local-shards' into 'main'
deepakn94 Aug 8, 2025
8e2de56
ADLR/megatron-lm!3668 - fix(mtp logging): Correctly accumulate MTP lo…
BestJuly Aug 9, 2025
a9dcf98
Merge branch 'lit/fix_mtp_logging' into 'main'
chtruong814 Aug 9, 2025
5df6dbe
ADLR/megatron-lm!3471 - perf(MoE): Add fused weighted squared ReLU
yanring Aug 9, 2025
3a6f5a9
Merge branch 'zijiey/fused_weighted_squared_relu' into 'main'
ericharper Aug 9, 2025
e01cd66
ADLR/megatron-lm!3798 - Clean up Zarr test cases as part of future ba…
skierat Aug 9, 2025
259885e
Merge branch 'skierat/rm-zarr-tests' into 'main'
ko3n1g Aug 9, 2025
f43bc6c
ADLR/megatron-lm!3671 - perf(distopt): Cache shard buffer
kunlunl Aug 10, 2025
10e3163
Merge branch 'shard_buffer_fix' into 'main'
deepakn94 Aug 10, 2025
10493df
ADLR/megatron-lm!3731 - Support Llama4 HF checkpoint to MLM checkpoint
yueshen2016 Aug 10, 2025
d338252
Merge branch 'yueshen/support_llama4_hf_mlm_import' into 'main'
deepakn94 Aug 10, 2025
3da9032
chore: Version bump
Aug 11, 2025
08abeed
ADLR/megatron-lm!3330 - feat(MoE): support CP and recompute for MTP
shifangx Aug 11, 2025
650ab87
Merge branch 'shifang/mtp_cp' into 'main'
chtruong814 Aug 11, 2025
eea7c08
ADLR/megatron-lm!3806 - Disallow expandable segments for cudagraphs e…
mathemakitten Aug 13, 2025
94a3711
Merge branch 'helenn-ban-expandable-segments' into 'main'
jaredcasper Aug 13, 2025
13fd57a
ADLR/megatron-lm!3710 - MXFP8 DP AG overlap enablement
youngeunkwon0405 Aug 13, 2025
410222b
Merge branch 'mxfp8-dp-ag-overlap-mr' into 'main'
jaredcasper Aug 13, 2025
2f1027d
ci(hotfix): Disable broken tests
ko3n1g Aug 12, 2025
a5f0057
ci(hotfix): Catch WaitTimeExceeded
ko3n1g Aug 14, 2025
cde92b2
ADLR/megatron-lm!3406 - Update README
sbhavani Aug 14, 2025
2dd030e
Merge branch 'update-readme' into 'main'
ko3n1g Aug 14, 2025
d87bfd1
ADLR/megatron-lm!3808 - Move FullCudaGraphWrapper implementation to M…
vasunvidia Aug 15, 2025
91e2ee5
Merge branch 'vrengasamy/full_cuda_graph_core' into 'main'
chtruong814 Aug 15, 2025
2b6b46b
ADLR/megatron-lm!3631 - Fixes and updates for external cudagraph
buptzyb Aug 15, 2025
9545270
Merge branch 'robinz/external_cudagraph_update' into 'main'
chtruong814 Aug 15, 2025
16ad771
ADLR/megatron-lm!3799 - build: Bump TE
ko3n1g Aug 15, 2025
0d33682
Merge branch 'ko3n1g/build/te-2.6' into 'main'
ko3n1g Aug 15, 2025
82e5ff6
ADLR/megatron-lm!3606 - Debug distributed checkpoint for Transformer …
timmoon10 Aug 16, 2025
d1a8777
Merge branch 'tmoon/te-op-fuser-debug-checkpoint' into 'main'
ko3n1g Aug 16, 2025
7704169
ADLR/megatron-lm!3812 - Add argument to control collnet enablement
youngeunkwon0405 Aug 16, 2025
4819438
Merge branch 'ibsharp-knob' into 'main'
deepakn94 Aug 16, 2025
46eb0a3
ADLR/megatron-lm!3569 - Dynamic Backend Inference MLA
wdykas Aug 16, 2025
29a0607
Merge branch 'mla-flash' into 'main'
ko3n1g Aug 16, 2025
b6a7f40
ADLR/megatron-lm!3422 - Adding support for multiple validation sets
Aug 16, 2025
2e88416
Merge branch 'bnorick/multi-validation' into 'main'
ko3n1g Aug 16, 2025
1b07529
ADLR/megatron-lm!3718 - Fix log prob calculation, pipeline parallelis…
santhnm2 Aug 16, 2025
7a31b35
Merge branch 'dynamic_logprobs_fix' into 'main'
ko3n1g Aug 16, 2025
c86819f
ADLR/megatron-lm!3746 - Fix cuda graph with first/last layer bf16 by …
guyueh1 Aug 16, 2025
a100a3c
Merge branch 'fix_cuda_graph_with_first_last_layer_bf16' into 'main'
chtruong814 Aug 16, 2025
d7444d0
ADLR/megatron-lm!3827 - chore: Upgrade dependencies (2025-08-15)
Aug 16, 2025
2106bd6
Merge branch 'ci-bot/build/upgrade-dependencies-2025-08-15' into 'main'
ko3n1g Aug 16, 2025
254ef23
ADLR/megatron-lm!3829 - ci: Add copy-pr-bot
ko3n1g Aug 16, 2025
c769b67
Merge branch 'ko3n1g/ci/copy-pr-bot' into 'main'
ko3n1g Aug 16, 2025
69b65e0
ci(hotfix): Restart on `malloc(): unaligned tcache chunk detected`
ko3n1g Aug 17, 2025
6b62015
ADLR/megatron-lm!3378 - M4 p2p communication, schedules and finalize …
yashaswikarnati Aug 17, 2025
de512dc
Merge branch 'yash/p2p_class' into 'main'
ko3n1g Aug 17, 2025
c08d89b
ADLR/megatron-lm!3809 - feat(moe): Add MoE router fusion
Victarry Aug 18, 2025
d93743a
Merge branch 'denliu/router_fusoin' into 'main'
ko3n1g Aug 18, 2025
79d04be
chore: Version bump
Aug 18, 2025
d3df238
ADLR/megatron-lm!3814 - Apex.contrib.nccl_allocator migration
youngeunkwon0405 Aug 18, 2025
66a1dfc
Merge branch 'remove-apex-nccl-allocator' into 'main'
ko3n1g Aug 18, 2025
8e11c52
chore: Version bump 0.15.0rc0
ko3n1g Aug 18, 2025
551b734
ci: Fix segfaults (maybe)
ko3n1g Aug 17, 2025
f778f7b
ci: DEV tests from A100 to H100 cluster
ko3n1g Aug 15, 2025
781e765
ADLR/megatron-lm!3465 - perf(MoE): Support recomputation for FP8 laye…
hxbai Aug 20, 2025
6850cc6
Merge branch 'hongxiaob/save_original_input' into 'main'
ko3n1g Aug 20, 2025
8653fad
ADLR/megatron-lm!3757 - ZMQ based communication of requests during pa…
sidsingh-nvidia Aug 20, 2025
c47cf0a
Merge branch 'dynamic-inference-parallelism' into 'main'
ko3n1g Aug 20, 2025
8d9dbed
ADLR/megatron-lm!3815 - Add is_cg_capturable flag to CrossEntropyLoss…
katec846 Aug 22, 2025
4cd81e8
Merge branch 'add_is_cg_capturable_flag' into 'main'
jaredcasper Aug 22, 2025
af28b5a
ADLR/megatron-lm!3443 - [FSDP] Decouple Custom FSDP to make it indepe…
shjwudp Aug 22, 2025
4dd2f2b
Merge branch 'nvfsdp_convergence' into 'main'
ko3n1g Aug 22, 2025
237080b
ADLR/megatron-lm!3842 - This fixes the bug where not using full_valid…
Aug 22, 2025
4db3c78
Merge branch 'fix_full_validation_bug' into 'main'
jaredcasper Aug 22, 2025
7139518
ADLR/megatron-lm!3824 - Fix cuda graph when VPP is used
guyueh1 Aug 22, 2025
c6aab54
Merge branch 'fix_cuda_graph_with_vpp' into 'main'
chtruong814 Aug 22, 2025
eb0c03e
ADLR/megatron-lm!3834 - chore: Upgrade dependencies (2025-08-18)
Aug 22, 2025
09ca1d2
Merge branch 'ci-bot/build/upgrade-dependencies-2025-08-18' into 'main'
ko3n1g Aug 22, 2025
d6d094a
ADLR/megatron-lm!3600 - fix sync save utility
aartibasant Aug 22, 2025
bf341cb
Merge branch 'abasant/fix_sync_save_utility_v2' into 'main'
ananthsub Aug 22, 2025
7683298
ADLR/megatron-lm!3719 - Ability to abort persistent CP worker
aartibasant Aug 22, 2025
0913599
Merge branch 'abasant/ability_to_abort_persistent_worker' into 'main'
ananthsub Aug 22, 2025
4d5dc62
ADLR/megatron-lm!3432 - Enable VLM FP8
trintamaki Aug 22, 2025
d9270aa
Merge branch 'trintamaki/vlm_fp8' into 'main'
jaredcasper Aug 22, 2025
2e7e438
ADLR/megatron-lm!2452 - Make it an option to use TE activation functi…
guyueh1 Aug 23, 2025
6299ea7
Merge branch 'te_activation_func' into 'main'
ko3n1g Aug 23, 2025
0532e92
ADLR/megatron-lm!3678 - moonshotai/Kimi-K2-Instruct HF import, PTQ, a…
ChenhanYu Aug 23, 2025
d775d4a
Merge branch 'chenhany/kimi_k2' into 'main'
jaredcasper Aug 23, 2025
724580d
ADLR/megatron-lm!3861 - Remove FP8 calibration script
mathemakitten Aug 23, 2025
bebc0e4
Merge branch 'helenn-delete-calibration' into 'main'
jaredcasper Aug 23, 2025
ed1eaa9
ADLR/megatron-lm!3828 - Fix NEMO unittest where the weight is provide…
ChenhanYu Aug 23, 2025
312f300
Merge branch 'chenjiel/fix_nemo' into 'main'
jaredcasper Aug 23, 2025
c40a446
ADLR/megatron-lm!3864 - add wandb_entity
Aug 24, 2025
f6a675a
Merge branch 'entity' into 'main'
ericharper Aug 24, 2025
3d19693
ADLR/megatron-lm!3532 - Implement new optimizer checkpoint formats fo…
mikolajblaz Aug 24, 2025
3d784cb
Merge branch 'mblaz/dp_zero_model_space' into 'main'
ko3n1g Aug 24, 2025
1c29678
ADLR/megatron-lm!3876 - build: Bump packaging
ko3n1g Aug 24, 2025
f364164
Merge branch 'ko3n1g/ci/packaging' into 'main'
ko3n1g Aug 24, 2025
c7fd91a
ci(hotfix): Increase n_nondeterminism_attemps
ko3n1g Aug 24, 2025
4840669
chore: Version bump
Aug 25, 2025
f8f6e9b
ci(hotfix): Restart on zmq error
ko3n1g Aug 25, 2025
188435a
ci(hotfix): Increase non-determinism attempts
ko3n1g Aug 25, 2025
66c12ce
ADLR/megatron-lm!3887 - chore: Upgrade dependencies (2025-08-25)
Aug 26, 2025
77eaa9a
Merge branch 'ci-bot/build/upgrade-dependencies-2025-08-25' into 'main'
ko3n1g Aug 26, 2025
875ad2a
ADLR/megatron-lm!3875 - ci: Auto-publish megatron-fsdp
ko3n1g Aug 26, 2025
dcf7d36
Merge branch 'ko3n1g/ci/push-megatron-fsdp' into 'main'
ko3n1g Aug 26, 2025
a6c6250
ADLR/megatron-lm!3646 - [1/4] Merge Megatron-RL into LM
tdene Aug 26, 2025
16c0d28
Merge branch 'tdene/push-to-upstream-mr1' into 'main'
jaredcasper Aug 26, 2025
d6526b1
ADLR/megatron-lm!3884 - Fix unsetting NCCL_COLLNET_ENABLE in initiali…
chtruong814 Aug 26, 2025
8db4323
Merge branch 'chtruong/fix-unset-nccl-collnet' into 'main'
chtruong814 Aug 26, 2025
d7bf5aa
ADLR/megatron-lm!3074 - feat(MoE): Support Expert Parallel A2A Overla…
Aug 26, 2025
4b30ec5
Merge branch 'hongbinl/1f1b_overlap_new' into 'main'
ko3n1g Aug 26, 2025
799cee0
ADLR/megatron-lm!3782 - Move cuda graph capture to core
buptzyb Aug 26, 2025
b7a6f90
Merge branch 'robinz/cudagraph_core' into 'main'
ko3n1g Aug 26, 2025
37ee3d1
ADLR/megatron-lm!3764 - tests: Auto-validate weekly tests
ko3n1g Aug 26, 2025
d6301fb
Merge branch 'ko3n1g/tests/thresholds-weekly' into 'main'
ko3n1g Aug 26, 2025
5b2cb28
ci: No integration tests on merge-trains
ko3n1g Aug 27, 2025
7b8bbf2
ci: Allow interrupt on main
ko3n1g Aug 28, 2025
8efa2a0
ci(hotfix): Non-determinism only on EXIT_CODE=0
ko3n1g Aug 28, 2025
6740f5e
ADLR/megatron-lm!3885 - Add proper teardowns for cudagraphs tests to …
mathemakitten Aug 28, 2025
028f079
Merge branch 'helenn-patch-legacy-cudagraph-tests' into 'main'
ko3n1g Aug 28, 2025
4f6ab63
ADLR/megatron-lm!3897 - build: Loosen transformers pin
ko3n1g Aug 28, 2025
7aad147
Merge branch 'ko3n1g/build/transformers-pin' into 'main'
ko3n1g Aug 28, 2025
7ceafd9
ADLR/megatron-lm!3899 - fix: correct way to pass pipeline_rank in tes…
ZhiyuLi-Nvidia Aug 28, 2025
bdad881
Merge branch 'zhiyul/fix_vpp_ci_test' into 'main'
ko3n1g Aug 28, 2025
fcbde8a
ADLR/megatron-lm!3900 - Fix providers refactoring.
yobibyte Aug 28, 2025
d1e4fc6
Merge branch 'vitalyk/fix-textgen' into 'main'
deepakn94 Aug 28, 2025
b74396f
ADLR/megatron-lm!3877 - Fix FSDP distributed parameter weight shapes …
cspades Aug 28, 2025
1d0995d
Merge branch 'cye/fix-fsdp-dist-shape' into 'main'
ko3n1g Aug 28, 2025
500333c
ADLR/megatron-lm!3917 - chore: Add RL review group
ko3n1g Aug 29, 2025
17cb145
Merge branch 'ko3n1g/chore/add-rl-group' into 'main'
ko3n1g Aug 29, 2025
3ec579a
ADLR/megatron-lm!3843 - build: Upgrade TE to 2.7
ko3n1g Aug 29, 2025
1deafac
Merge branch 'ko3n1g/build/te-2.7' into 'main'
ko3n1g Aug 29, 2025
0e3d8ec
ci(hotfix): Restart attempts
ko3n1g Aug 29, 2025
c2527ba
ci(hotfix): Golden values
ko3n1g Aug 29, 2025
fa4d12c
ci(hotfix): yq path
ko3n1g Aug 29, 2025
256c855
ADLR/megatron-lm!3688 - Non-decode CUDA graphs for the dynamic infere…
sidsingh-nvidia Aug 30, 2025
d0d8a5c
Merge branch 'siddharth/non-decode-cg' into 'main'
ko3n1g Aug 30, 2025
180ebf0
ADLR/megatron-lm!3914 - Update README - Latest News
sbhavani Aug 30, 2025
d7ed78d
Merge branch 'main' into 'main'
ko3n1g Aug 30, 2025
28925b8
ADLR/megatron-lm!3928 - ci: Py312 wheels
ko3n1g Aug 30, 2025
a6237d0
Merge branch 'ko3n1g/ci/py312-support' into 'main'
ko3n1g Aug 30, 2025
19db79c
Revert "ADLR/megatron-lm!3688 - Non-decode CUDA graphs for the dynami…
ko3n1g Aug 30, 2025
6f178d2
ADLR/megatron-lm!3741 - Freeze GC around create_cudagraphs().
lmcafee-nvidia Aug 31, 2025
5c05330
Merge branch 'lmcafee/cuda-graph-gc' into 'main'
lmcafee-nvidia Aug 31, 2025
223533f
ADLR/megatron-lm!3555 - Dynamic inference functional tests | ShareGPT
lmcafee-nvidia Aug 31, 2025
f0edd1c
Merge branch 'lmcafee/dyn-inf-functional-tests' into 'main'
lmcafee-nvidia Aug 31, 2025
3271c42
chore: Version bump
Sep 1, 2025
726bc3f
ci(hotfix): Fix `Run tests` label
ko3n1g Sep 1, 2025
2c69af5
ADLR/megatron-lm!3933 - chore: Upgrade dependencies (2025-09-01)
Sep 1, 2025
f4fa7d6
Merge branch 'ci-bot/build/upgrade-dependencies-2025-09-01' into 'main'
ko3n1g Sep 1, 2025
39247c2
ADLR/megatron-lm!3936 - fix typos in megatron/training/arguments.py
sbhavani Sep 1, 2025
5cdac7d
Merge branch 'docs/fix-typos' into 'main'
ko3n1g Sep 1, 2025
6462307
ci(hotfix): Less false-positive restarts
ko3n1g Sep 2, 2025
4025494
chore: Version bump
Sep 2, 2025
d8714d9
chore: Version bump
ko3n1g Sep 2, 2025
84111cc
ADLR/megatron-lm!3942 - ci: Fix release tag
ko3n1g Sep 3, 2025
40af198
Merge branch 'ko3n1g/ci/fix-release' into 'main'
ko3n1g Sep 3, 2025
5cc85f3
ADLR/megatron-lm!3695 - Added double buffering to be configurable
sanandaraj5597 Sep 3, 2025
96f1e01
Merge branch 'optional_double_buffering' into 'main'
ko3n1g Sep 3, 2025
3075197
ADLR/megatron-lm!3869 - Implement fused MLP as subclass of unfused MLP
timmoon10 Sep 3, 2025
122324c
Merge branch 'tmoon/fused-mlp-subclass' into 'main'
ko3n1g Sep 3, 2025
972553d
ADLR/megatron-lm!3902 - Fix: use_decoupled_grad was set to the wrong …
yaoyu-33 Sep 3, 2025
e0b9fbb
Merge branch 'yuya/dist_fused_adam_fix' into 'main'
ko3n1g Sep 3, 2025
a8fee91
ADLR/megatron-lm!3784 - Use isolated RNG for sampling
santhnm2 Sep 3, 2025
0118b97
Merge branch 'pp_eval_debugging' into 'main'
ko3n1g Sep 3, 2025
708f565
ADLR/megatron-lm!3846 - ckpt loading safe strategy
dimapihtar Sep 3, 2025
86660c1
Merge branch 'ckpt_safe_load' into 'main'
ko3n1g Sep 3, 2025
a15a6d4
ADLR/megatron-lm!3893 - chore: update MSC docs and bump version
shunjiad Sep 3, 2025
e674a29
Merge branch 'fix-msc-docs' into 'main'
ko3n1g Sep 3, 2025
1cf07f3
Revert "ADLR/megatron-lm!3784 - Use isolated RNG for sampling"
ko3n1g Sep 3, 2025
1e9e94c
ADLR/megatron-lm!3881 - Megatron-FSDP NCCL symmetric register support
youngeunkwon0405 Sep 3, 2025
2c7e98a
Merge branch 'nvfsdp-symmetric' into 'main'
ericharper Sep 3, 2025
50502b9
ADLR/megatron-lm!3383 - Add KV duplication for TensorRTLLM export to …
meatybobby Sep 3, 2025
9cbd0d7
Merge branch 'bobchen/kv_repeat' into 'main'
ko3n1g Sep 3, 2025
7b49ca7
ADLR/megatron-lm!3652 - Remove `create_cudagraphs` + unify cudagraphs…
mathemakitten Sep 3, 2025
02a1dd0
Merge branch 'helenn-unify-graph-creation-and-recording' into 'main'
ko3n1g Sep 3, 2025
b6517c6
ADLR/megatron-lm!2997 - add new tokenizer system
dimapihtar Sep 4, 2025
cca55cc
Merge branch 'tokenizers' into 'main'
ko3n1g Sep 4, 2025
76622ed
ADLR/megatron-lm!3892 - M4: Merge `model_comm_pgs`, `grad_comm_pgs` a…
yaoyu-33 Sep 4, 2025
0889ce9
Merge branch 'yuya/m4_merge_into_pg_collection' into 'main'
ko3n1g Sep 4, 2025
1b432dd
fix(hotfix): Update golden values
ko3n1g Sep 4, 2025
49be51a
Revert "fix(hotfix): Update golden values"
ko3n1g Sep 4, 2025
3a51832
Revert "ADLR/megatron-lm!2997 - add new tokenizer system"
ko3n1g Sep 4, 2025
ca9797e
Revert "ADLR/megatron-lm!3892 - M4: Merge `model_comm_pgs`, `grad_com…
ko3n1g Sep 4, 2025
3e3c3c9
ci(hotfix): Retry on non-determinism
ko3n1g Sep 4, 2025
9be09c9
ci(hotfix): Disable mimo test
ko3n1g Sep 4, 2025
3a9a060
Replay of: !2997 - add new tokenizer system
ko3n1g Sep 4, 2025
ebc82b1
ADLR/megatron-lm!3953 - ci: Use Team tag for review reminders
ko3n1g Sep 4, 2025
63e8566
Merge branch 'ko3n1g/ci/team-reviews' into 'main'
ko3n1g Sep 4, 2025
0269b8a
ADLR/megatron-lm!3935 - docs: Add Installation Guide
ko3n1g Sep 4, 2025
dd7de1a
Merge branch 'ko3n1g/docs/installation-guide' into 'main'
ko3n1g Sep 4, 2025
f573042
ADLR/megatron-lm!3957 - ci: Auto-validate convergence tests
ko3n1g Sep 5, 2025
e000263
Merge branch 'ko3n1g/ci/auto-validate-release-tests-2' into 'main'
ko3n1g Sep 5, 2025
72d2354
ADLR/megatron-lm!3318 - feat(moe) Add support for global aux loss
Victarry Sep 5, 2025
e58d908
Merge branch 'denliu/global_aux_loss' into 'main'
chtruong814 Sep 5, 2025
641bf8b
ADLR/megatron-lm!3960 - Dynamic inference example | Remove shorten_st…
lmcafee-nvidia Sep 5, 2025
3616ce5
Merge branch 'lmcafee/dyn-inf-remove-shorten-str' into 'main'
lmcafee-nvidia Sep 5, 2025
ba3e244
ADLR/megatron-lm!3924 - Fix typos in tensor parallel layer documentation
sbhavani Sep 5, 2025
07b22a0
Merge branch 'docs/improve-tp-code-comments' into 'main'
deepakn94 Sep 5, 2025
c7a3aa4
ADLR/megatron-lm!3946 - fixing reverted MR 3688
sidsingh-nvidia Sep 6, 2025
230e0cb
Merge branch 'siddharth/non-decode-cg' into 'main'
chtruong814 Sep 6, 2025
a99f647
ADLR/megatron-lm!3943 - FP4 utils for nvfp4 recipe
Sep 6, 2025
020abf0
Merge branch 'fp4_recipe_revamp' into 'main'
chtruong814 Sep 6, 2025
696977f
ADLR/megatron-lm!3964 - Fix megatron-fsdp logging and remove extra co…
BoxiangW Sep 6, 2025
2108cd8
Merge branch 'boxiangw/megatron-fsdp-fix' into 'main'
chtruong814 Sep 6, 2025
85a8340
ADLR/megatron-lm!3867 - nvidia/NVIDIA-Nemotron-Nano-9B-v2-Base NVFP4 …
ChenhanYu Sep 6, 2025
55c1433
Merge branch 'chenhany/nmh-nano-9b-quantize-support' into 'main'
chtruong814 Sep 6, 2025
3230510
ADLR/megatron-lm!3880 - Fix is_first_microbatch not correctly set wit…
Sep 7, 2025
dc68d29
Merge branch 'w_cache_fix' into 'main'
chtruong814 Sep 7, 2025
5 changes: 0 additions & 5 deletions .coveragerc

This file was deleted.

4 changes: 4 additions & 0 deletions .flake8
@@ -0,0 +1,4 @@
[flake8]
max-line-length = 100
extend-ignore = E203,E501,F401,E402,E714
per-file-ignores = __init__.py:F401
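The `.flake8` file above is plain INI, so its effect can be inspected with the standard library. The following is a minimal sketch, not part of the diff: the helper name `summarize_flake8` and the constant `FLAKE8_TEXT` are hypothetical, and the parsing here only approximates how flake8 itself reads its options.

```python
import configparser

# The four lines added to .flake8 in this diff, reproduced verbatim.
FLAKE8_TEXT = """\
[flake8]
max-line-length = 100
extend-ignore = E203,E501,F401,E402,E714
per-file-ignores = __init__.py:F401
"""


def summarize_flake8(text: str) -> dict:
    """Parse a flake8 INI snippet into plain Python values (illustrative helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["flake8"]
    return {
        # Maximum allowed line length, as an integer.
        "max_line_length": section.getint("max-line-length"),
        # Comma-separated error codes that are ignored everywhere.
        "extend_ignore": [code.strip() for code in section["extend-ignore"].split(",")],
        # Whitespace-separated "pattern:codes" pairs ignored per file.
        "per_file_ignores": dict(
            entry.split(":", 1) for entry in section["per-file-ignores"].split()
        ),
    }


summary = summarize_flake8(FLAKE8_TEXT)
print(summary)
```

Note that `F401` (unused import) appears both in the global `extend-ignore` list and in the `__init__.py` entry, so the per-file rule is redundant as long as the global ignore stays in place.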
32 changes: 32 additions & 0 deletions .github/ISSUE_TEMPLATE/bug.md
@@ -0,0 +1,32 @@
---
name: BUG
about: Report a bug that needs attention
title: "[BUG]"
labels: ''
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior. The easier it is to reproduce the faster it will get maintainer attention.

**Expected behavior**
A clear and concise description of what you expected to happen.

**Stack trace/logs**
If applicable, add the stack trace or logs from the time of the error.

**Environment (please complete the following information):**
- Megatron-LM commit ID
- PyTorch version
- CUDA version
- NCCL version

**Proposed fix**
If you have a proposal for how to fix the issue state it here or link to a PR.

**Additional context**
Add any other context about the problem here.
23 changes: 23 additions & 0 deletions .github/ISSUE_TEMPLATE/enhancement.md
@@ -0,0 +1,23 @@
---
name: ENHANCEMENT
about: Suggest an idea to improve this project
title: "[ENHANCEMENT]"
labels: ''
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Proposed implementation**
If you have a proposed implementation for the feature state it here or link to a PR.

**Additional context**
Add any other context or screenshots about the feature request here.
12 changes: 12 additions & 0 deletions .github/ISSUE_TEMPLATE/question.md
@@ -0,0 +1,12 @@
---
name: QUESTION
about: Ask a question about Megatron-LM that is not a bug, regression or enhancement
request
title: "[QUESTION]"
labels: ''
assignees: ''

---

**Your question**
Ask a clear and concise question about Megatron-LM.
39 changes: 39 additions & 0 deletions .github/ISSUE_TEMPLATE/regression.md
@@ -0,0 +1,39 @@
---
name: REGRESSION
about: Report a regression in speed or accuracy due to a Megatron-LM update
title: "[REGRESSION]"
labels: ''
assignees: ''

---

**Describe the regression**
A clear and concise description of what the regression is.

**To Reproduce**
Steps to reproduce the behavior. The easier it is to reproduce the faster it will get maintainer attention.

**Previous performance**
What speed or accuracy did you previously see.

**New performance**
What speed or accuracy do you see after the update.

**Stack trace/logs**
If applicable, add the stack trace or logs related to the regression.

**Environment (please complete the following information):**
- Previous Megatron-LM commit ID
- New Megatron-LM commit ID
- Previous PyTorch version
- New PyTorch version
- Previous CUDA version
- New CUDA version
- Previous NCCL version
- New NCCL version

**Proposed fix**
If you have a proposal for how to fix the issue state it here or link to a PR.

**Additional context**
Add any other context about the problem here.
3 changes: 3 additions & 0 deletions .github/copy-pr-bot.yaml
@@ -0,0 +1,3 @@
enabled: true
auto_sync_draft: false
auto_sync_ready: true
8 changes: 8 additions & 0 deletions .github/workflows/close-inactive-issue-pr.yml
@@ -0,0 +1,8 @@
name: Stale-Close-Inactive-Issues-PRs
on:
  schedule:
    - cron: "30 1 * * *"

jobs:
  close-issues:
    uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_close_inactive_issue_pr.yml@v0.44.0
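The `cron` value in the workflow above uses the five-field POSIX cron layout, and GitHub Actions evaluates schedules in UTC, so `"30 1 * * *"` should fire once a day at 01:30 UTC. A small sketch that labels the fields follows; the helper `label_cron` is hypothetical and does not validate ranges or expand wildcards.

```python
# Field order of a five-field cron expression.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def label_cron(expr: str) -> dict:
    """Map each field of a five-field cron expression to its name."""
    parts = expr.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


schedule = label_cron("30 1 * * *")
print(schedule)
```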
13 changes: 13 additions & 0 deletions .github/workflows/community-bot.yml
@@ -0,0 +1,13 @@
name: Community Bot

on:
  issues:
    types: [opened, edited, reopened, closed, deleted]
  issue_comment:
    types: [created, edited, deleted]

jobs:
  community-bot:
    uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_community_bot.yml@v0.49.1
    secrets:
      GH_TOKEN: ${{ secrets.PAT }}
10 changes: 10 additions & 0 deletions .gitignore
@@ -6,3 +6,13 @@ build
*~
slurm*
logs
.vscode
local/
.gitmodules
wandb/
onelogger.log
onelogger.err
.venv
runs/
/test_cases/
**/dist/