[DRAFT]Dnn38 arm #285

alvoron · 2025-06-16T18:12:32Z

Description

Please include a summary of the change. Please also include relevant motivation and context. See contribution guidelines for more details. If the change fixes an issue not documented in the project's Github issue tracker, please document all steps necessary to reproduce it.

Fixes # (github issue)

Checklist

General

Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
Have you formatted the code using clang-format?

Performance improvements

Have you submitted performance data that demonstrates performance improvements?

New features

Have you published an RFC for the new feature?
Was the RFC approved?
Have you added relevant tests?

Bug fixes

Have you included information on how to reproduce the issue (either in a github issue or in this PR)?
Have you added relevant regression tests?

RFC PR

Does RFC document follow the template?
Have you added a link to the rendered document?

* fix matmul decompress test case
  Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>
* save tmp
  Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>
* [FORK][FIX] IP weights compression: scalar scale
  [FORK][FEATURE] InnerProduct primitive: squashed weight decompression
  Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>
* [FORK][FIX] IP weights compression: max bcast blocking computation
  [FORK][FEATURE] InnerProduct primitive: squashed weight decompression
* fix compile issue
  Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>
* fix crash issue
  Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>
* try to fix compare issue
  Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>
* continue fix some accuracy issue
  Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>
* fix f4_e2m1
  Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>
* continue to fix f4e2m1
  Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>
* fix conflict on smoke_FC_(2|3)D_I8_sparse
  Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>
* clean debug and unused code
  Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>
* revert this change, should affect test case
  Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>

---------

Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>
Co-authored-by: dmitrygo <dmitry.gorokhov@intel.com>

dzarukin · 2025-07-10T17:10:49Z

src/cpu/ref_io_helper.hpp

 }

-inline float load_float_value(data_type_t dt, const void *ptr, dim_t idx) {
+FORCE_INLINE float load_float_value(data_type_t dt, const void *ptr, dim_t idx) {


I would assume this is done for performance improvement. Probably a better alternative is to change the loading method in the kernel/implementation of interest. The switch below is mostly the killer of any benefits. The function was not designed to be performant in any way.

Feel free to resolve, it's just a general observation.
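
The alternative suggested in the comment above (changing the loading method in the kernel rather than forcing the generic helper inline) could look roughly like the sketch below. It is not oneDNN code: `dtype`, `load_as_float`, `sum_per_element`, `sum_typed` and `sum_dispatch_once` are hypothetical names, and the reduced type list is an assumption made for brevity. The point is only that the data-type switch can be taken once, outside the hot loop, instead of once per element; whether this pays off in a given kernel would need to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, reduced set of data types (illustration only).
enum class dtype { f32, s32, s8, u8 };

// Generic per-element loader: convenient, but the switch is evaluated on
// every call, whether or not the function is inlined.
inline float load_as_float(dtype dt, const void *ptr, std::size_t idx) {
    switch (dt) {
        case dtype::f32: return static_cast<const float *>(ptr)[idx];
        case dtype::s32:
            return static_cast<float>(
                    static_cast<const std::int32_t *>(ptr)[idx]);
        case dtype::s8:
            return static_cast<float>(
                    static_cast<const std::int8_t *>(ptr)[idx]);
        case dtype::u8:
            return static_cast<float>(
                    static_cast<const std::uint8_t *>(ptr)[idx]);
    }
    return 0.f;
}

// Baseline: the data-type switch sits inside the loop body.
float sum_per_element(dtype dt, const void *ptr, std::size_t n) {
    float acc = 0.f;
    for (std::size_t i = 0; i < n; ++i)
        acc += load_as_float(dt, ptr, i);
    return acc;
}

// Typed inner loop: no branching on the data type per element.
template <typename T>
float sum_typed(const T *src, std::size_t n) {
    float acc = 0.f;
    for (std::size_t i = 0; i < n; ++i)
        acc += static_cast<float>(src[i]);
    return acc;
}

// Alternative: dispatch on the data type once, then run a typed loop.
float sum_dispatch_once(dtype dt, const void *ptr, std::size_t n) {
    switch (dt) {
        case dtype::f32:
            return sum_typed(static_cast<const float *>(ptr), n);
        case dtype::s32:
            return sum_typed(static_cast<const std::int32_t *>(ptr), n);
        case dtype::s8:
            return sum_typed(static_cast<const std::int8_t *>(ptr), n);
        case dtype::u8:
            return sum_typed(static_cast<const std::uint8_t *>(ptr), n);
    }
    return 0.f;
}
```

Both variants compute the same result; only the placement of the switch differs.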

dzarukin and others added 30 commits March 24, 2025 12:24

cpu: x64: jit_reorder: add verbose messages

ce085ff

benchdnn: self: replace temporary "const char *" with "std::string"

4832ccc

Temporary "const char *" objects can disappear while getting to the parser internals. Moving strings to parse into a permanent container solves the problem.
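
The lifetime problem described in this commit message can be illustrated with a small sketch. It is not the benchdnn code: `parser_t`, `make_arg` and `parse_with_owned_storage` are hypothetical names, and the parser is assumed to keep raw pointers for later use. The sketch shows the general pattern of moving the strings into a container that outlives the parsing step.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for a parser that stores raw pointers and reads
// them later, after the call that registered them has returned.
struct parser_t {
    std::vector<const char *> pending;

    void add(const char *s) { pending.push_back(s); }

    std::size_t total_length() const {
        std::size_t n = 0;
        for (const char *s : pending)
            n += std::strlen(s);
        return n;
    }
};

// Builds an argument string such as "--mb=16" (hypothetical helper).
std::string make_arg(const std::string &key, int value) {
    return "--" + key + "=" + std::to_string(value);
}

// Unsafe pattern, shown only as a comment: the temporary std::string is
// destroyed at the end of the full expression, so the stored pointer
// dangles by the time the parser reads it.
//
//     parser.add(make_arg("mb", 16).c_str());

// Safer pattern: keep the strings in a container that lives at least as
// long as the parser uses the pointers.
std::size_t parse_with_owned_storage() {
    std::vector<std::string> storage;
    storage.push_back(make_arg("mb", 16));
    storage.push_back(make_arg("dt", 32));

    // Take pointers only after the container is fully populated: growing
    // the vector may move the strings and invalidate earlier c_str() values.
    parser_t parser;
    for (const std::string &s : storage)
        parser.add(s.c_str());

    return parser.total_length();
}
```

Here `storage` plays the role of the permanent container: every pointer handed to the parser stays valid until `storage` goes out of scope.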

cpu: x64: fixed memory leak in jit_uni_ncsp convolution impl

67c3042

ngen: update PVC WAR bug workaround

607a318

benchdnn: inputs: graph: fix test cases related to int8/f8 add

caa770a

src1 with different data types cannot be fused.

Yobodovs/amx blocking heuristics fixes (uxlfoundation#2938)

5d7ed69

xe: ocl: fix gemm_with_po verbose dispatch message

7d418f5

ngen: downstream nGEN

931cc27

cpu: aarch64: add brgemm bwd data support for block size 8 and 16

068b775

graph: backend: dnnl: introduce internal dnnl_sdpa op

6bc6597

And refactored sdpa primitive integration for better compilation performance. Currently the new kernel only supports floating point sdpa.

build: removed -fcf-protection build option for old GCC

71f1837

This allows oneDNN to build successfully with GCC 7.x

benchdnn: add per test case timer

114cd34

benchdnn: add message for not found files

1a2f7c3

benchdnn: add summary execute timer

6ffa939

benchdnn: add summary create_pd and create_prim timers

5f6951b

riscv64: update intrinsics

4107fb9

Signed-off-by: Zhang fei <zhangfei@iscas.ac.cn>

riscv64: fix clang-format error

32f71c2

Signed-off-by: Zhang fei <zhangfei@iscas.ac.cn>

riscv64: fix clang-format error

37b4972

Signed-off-by: Zhang fei <zhangfei@iscas.ac.cn>

riscv64: fix clang-format error

190bf99

Signed-off-by: Zhang fei <zhangfei@iscas.ac.cn>

riscv64: update cmake

f8c3ab0

Signed-off-by: Zhang fei <zhangfei@iscas.ac.cn>

riscv64: update cmake

5de25f3

Signed-off-by: Zhang fei <zhangfei@iscas.ac.cn>

gpu: intel: sycl: workaround failing atomics support

2a35904

Level Zero query currently returns wrong result wrt support for atomics by the device. This commit reverts to using ocl query until the issue is fixed in level zero.

gpu: intel: sycl: level zero query fixup

3e40d66

xe: jit: gemm: reduce grf consumption for fp4 strategies

201b16a

common, xe: sycl: improve logging when OpenCL install is missing

5652be3

gpu: intel: ocl: add s32 support for binary

5171c2a

gpu: intel: ocl: enable s32 for binary primitive

e1c08d7

tests: benchdnn: gpu: enable s32 dt in binary

4e44b16

xe: ocl: prevent double -> float literal conversion

6a7f559

xe: ocl: prevent double -> float literal conversion fix

4d61ba4

dmitrygo and others added 29 commits April 28, 2025 17:01

[FORK][FEATURE] Support (f32,bf16,f32) inner-product

3d9f274

3.5 squash list: [FORK][FIX] Corrected brgemm rd_step for bf16 compressed weights

[FORK][FEATURE] Enable avx2 jit reorder for bf16 data type

3523e9e

3.5 squash list: [Fork][Fix] Fix avx2 bf16 reorder

[FORK][FEATURE] IP weights compression: mxfp4 (wei=f4e2m1, scales=f8e…

36a18db

…8m0) support

[X64] Fixed need_mask_register for eltwise injectors

51a48c6

[FORK][FIX] Fixed debug assert in jit_io_helper_t

f2b408c

[FORK][FEATURE] Enable avx2 jit reorder for bf16 data type

temp fix

4f0f1f3

cpu: x64: matmul: correct LDD when M == 1 (#2)

40d9d5c

Co-authored-by: Xuxin, Zeng <xuxin.zeng@intel.com>

cpu: x64: guard macro definitions to avoid potential Wundef hits

e4d97f3

[FORK][FIX] Fix missing 'map' include introduced by xbyak debug logic

54e4db8

Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>

[FORK][FIX] IP weights compression: max bcast blocking computation

5d91309

[FORK][FEATURE] InnerProduct primitive: squashed weight decompression

[FORK][FEATURE] DQ IP: performance enhansments

2f9e73e

- allocate aux accums regs on stack
- precompute grouped src sums
- optimize pointer arithmetic
- reduce aux vecs count required for the microkernel

[FORK][FiX] fix IP compress test case after migration v3.8 on avx2

0f9b7fb

Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>

[FORK][FIX] fix args checking issue

cd0b8d8

Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>

[FORK][FIX] add missing override

9ad2016

Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>

[FORK][Fix] Fix condition compilation

fb38e19

Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>

[FORK][FIX] fix LLM FP16 Failed on avx512 and avx2

38c7c03

after migration to 3.8, the default value of runtime_scale_t is undef instead of f32

Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>

[FORK][FIX] fix riscv cmake issue

2577b22

Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>

[FORK][FIX] fix crash of convolution 1x1 int8 model on SPR (#9)

902df2e

Signed-off-by: HU Yuan2 <yuan2.hu@intel.com>

[ARM] Hide x64 dependent implementation under macro

a64d9a0

[ARM] ARM 32bits support for oneDNN

c764191

[ARM] Added ARM32 ACL kernels calls

d8054b4

[MERGE THIS INTO ANOTHER COMMIT] brgemm_matmul_matrix_B_reorder_t fix

328e942

[FORK][FEATURE][ARM] Enable f16 ACL post-op

ed34759

[FORK][ARM][FIX] Fix ACL configuration and skip failed tests

301e94f

[ARM] New heuristic for winograd and gemm (ACL)

1a918af

[ARM][FORK] Resolve float32_t type on 32-bit platforms

634c2db

[ARM][FORK][FIX] Set CMAKE_CXX_STANDARD to 20 on Android

7eb6272

[ARM][FORK][FIX] Use FORCE_INLINE for load_float_value

9b9b876


49 participants
