@alvoron alvoron commented Jun 16, 2025

Description

Please include a summary of the change. Please also include relevant motivation and context. See contribution guidelines for more details. If the change fixes an issue not documented in the project's Github issue tracker, please document all steps necessary to reproduce it.

Fixes # (github issue)

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

Performance improvements

  • Have you submitted performance data that demonstrates performance improvements?

New features

  • Have you published an RFC for the new feature?
  • Was the RFC approved?
  • Have you added relevant tests?

Bug fixes

  • Have you included information on how to reproduce the issue (either in a github issue or in this PR)?
  • Have you added relevant regression tests?

RFC PR

  • Does RFC document follow the template?
  • Have you added a link to the rendered document?

dzarukin and others added 30 commits March 24, 2025 12:24
Temporary "const char *" objects can disappear while getting to the
parser internals. Moving strings to parse into a permanent container
solves the problem.
src1 with different data types cannot be fused.
And refactored sdpa primitive integration for better compilation performance.
Currently the new kernel only supports floating point sdpa.
This allows oneDNN to build successfully with GCC 7.x
Signed-off-by: Zhang fei <zhangfei@iscas.ac.cn>
The Level Zero query currently returns the wrong result for device
support of atomics. This commit reverts to using the OpenCL query until
the issue is fixed in Level Zero.
dmitrygo and others added 29 commits April 28, 2025 17:01
3.5 squash list:

[FORK][FIX] Corrected brgemm rd_step for bf16 compressed weights
3.5 squash list:

[Fork][Fix] Fix avx2 bf16 reorder
[FORK][FEATURE] Enable avx2 jit reorder for bf16 data type
* fix matmul decompress test case

Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>

* save tmp

Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>

* [FORK][FIX] IP weights compression: scalar scale
[FORK][FEATURE] InnerProduct primitive: squashed weight decompression

Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>

* [FORK][FIX] IP weights compression: max bcast blocking computation
[FORK][FEATURE] InnerProduct primitive: squashed weight decompression

* fix compile issue

Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>

* fix crash issue

Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>

* try to fix compare issue

Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>

* continue to fix some accuracy issues

Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>

* fix f4_e2m1

Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>

* continue to fix f4e2m1

Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>

* fix conflict on smoke_FC_(2|3)D_I8_sparse

Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>

* clean debug and unused code

Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>

* revert this change, should affect test case

Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>

---------

Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>
Co-authored-by: dmitrygo <dmitry.gorokhov@intel.com>
Co-authored-by: Xuxin, Zeng <xuxin.zeng@intel.com>
[FORK][FEATURE] InnerProduct primitive: squashed weight decompression
- allocate aux accums regs on stack
- precompute grouped src sums
- optimize pointer arithmetic
- reduce aux vecs count required for the microkernel
Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>
after migration to 3.8, the default value of runtime_scale_t is undef
instead of f32

Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>
}

-inline float load_float_value(data_type_t dt, const void *ptr, dim_t idx) {
+FORCE_INLINE float load_float_value(data_type_t dt, const void *ptr, dim_t idx) {

I assume this was done to improve performance. A better alternative is probably to change the loading method in the kernel or implementation of interest. The switch below mostly kills any benefit; this function was not designed to be performant in any way.

Feel free to resolve; this is just a general observation.
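
One way to read the suggestion of changing the loading method in the kernel is to dispatch on the data type once, outside the hot loop, instead of once per element. The sketch below is only an illustration under that assumption and is not oneDNN code: the `data_type` enum and the functions `sum_generic`, `sum_typed`, and `sum_dispatched` are hypothetical, and the simplified `load_float_value` here covers only four types.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified stand-in for the library's data type enum.
enum class data_type { f32, s32, s8, u8 };

// Generic per-element loader: the switch runs on every call, whether or
// not the function itself is inlined.
inline float load_float_value(data_type dt, const void *ptr, std::size_t idx) {
    switch (dt) {
        case data_type::f32: return static_cast<const float *>(ptr)[idx];
        case data_type::s32:
            return static_cast<float>(
                    static_cast<const std::int32_t *>(ptr)[idx]);
        case data_type::s8:
            return static_cast<float>(
                    static_cast<const std::int8_t *>(ptr)[idx]);
        case data_type::u8:
            return static_cast<float>(
                    static_cast<const std::uint8_t *>(ptr)[idx]);
    }
    return 0.0f; // unreachable for the enumerators above
}

// Hot loop built on the generic loader: one type dispatch per element.
inline float sum_generic(data_type dt, const void *ptr, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        acc += load_float_value(dt, ptr, i);
    return acc;
}

// Typed hot loop: the element type is fixed at compile time, so the loop
// body contains no dispatch at all.
template <typename T>
float sum_typed(const T *ptr, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        acc += static_cast<float>(ptr[i]);
    return acc;
}

// The type dispatch happens once per call instead of once per element.
inline float sum_dispatched(data_type dt, const void *ptr, std::size_t n) {
    switch (dt) {
        case data_type::f32:
            return sum_typed(static_cast<const float *>(ptr), n);
        case data_type::s32:
            return sum_typed(static_cast<const std::int32_t *>(ptr), n);
        case data_type::s8:
            return sum_typed(static_cast<const std::int8_t *>(ptr), n);
        case data_type::u8:
            return sum_typed(static_cast<const std::uint8_t *>(ptr), n);
    }
    return 0.0f; // unreachable for the enumerators above
}
```

Both `sum_generic` and `sum_dispatched` return the same value for the same input; the difference is only where the switch sits relative to the loop. Whether this helps in a given kernel would need to be confirmed with measurements.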
