feat(mma): add fp16@fp16->fp32 mma and unit tests #101
feat(mma): add half-precision MMAs for automotive devices and training
Add FP16 variants of the matrix multiply operations, benefiting non-Hopper
devices such as NVIDIA Orin (sm_87) and Ada Lovelace (sm_89) automotive/edge
parts. Backward passes during training can also benefit, since these variants
provide higher precision than BF16 when it is needed.
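For a sense of the precision gap: FP16 keeps 10 mantissa bits where BF16 keeps only 7. A minimal host-side illustration (not part of this PR) comparing how the two formats round-trip the same value:

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_bf16.h>

int main() {
    float x = 1.1f;
    // FP16 keeps 10 mantissa bits, BF16 only 7, so FP16 round-trips
    // values noticeably closer to the original.
    printf("fp16: %.7f\n", __half2float(__float2half(x)));          // 1.0996094
    printf("bf16: %.7f\n", __bfloat162float(__float2bfloat16(x)));  // 1.1015625
    return 0;
}
```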
Key changes:
- Wrap the `mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32` PTX instruction (sketched below).
- Add `mma_AB`, `mma_ABt`, `mma_AtB`, `mma_AtBt` interfaces for `rt_base<half,...>` register tiles.
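For reference, a minimal inline-PTX sketch of the wrapped instruction; the function name and signature here are illustrative, not the PR's actual code (requires sm_80 or newer):

```cuda
#include <cstdint>

// Per-thread fragment sizes for m16n8k16 with f16 inputs / f32 accumulators:
// A = 4 x .b32 registers (8 halves), B = 2 x .b32 (4 halves), C/D = 4 x .f32.
__device__ inline void hmma_m16n8k16_f32(float (&d)[4],
                                         const uint32_t (&a)[4],
                                         const uint32_t (&b)[2],
                                         const float (&c)[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0, %1, %2, %3}, "     // D: f32 accumulator out
        "{%4, %5, %6, %7}, "     // A: packed f16x2
        "{%8, %9}, "             // B: packed f16x2
        "{%10, %11, %12, %13};"  // C: f32 accumulator in
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```

Products are formed from FP16 inputs but accumulated in FP32, which is what distinguishes these variants from the pure-FP16 `f16.f16.f16.f16` form of the instruction.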
Tested on NVIDIA A100, Ada, and H100 platforms.
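A hedged sketch of how a kernel might call the new tile interfaces; the `rt_hf`/`rt_fl` aliases, the header name, and the exact `mma_AB(d, a, b, c)` call shape are assumptions inferred from the names in this PR, not verified against the library:

```cuda
#include "kittens.cuh"  // header name assumed
using namespace kittens;

// Hypothetical kernel fragment; tile types and call shape are assumptions.
__global__ void example_kernel(/* tile sources elided */) {
    rt_hf<1, 1> a, b;    // fp16 register tiles (B laid out per .row.col)
    rt_fl<1, 1> c, d;    // fp32 accumulator and destination tiles
    // ... load a, b, and c from global/shared memory ...
    mma_AB(d, a, b, c);  // d = a @ b + c: fp16 products, fp32 accumulation
}
```

The ABt/AtB/AtBt variants would follow the same shape with the corresponding operand transposes.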