sync : ggml #3507

ggerganov · 2025-11-09T16:50:44Z

No description provided.

…ec_set_f32 for faster fills (llama/16522) * Leverage the existing GGML_F32_VEC helpers to broadcast the fill value across SIMD registers and store in vector-sized chunks, while retaining the scalar tail for leftover elements and non-SIMD builds. * Vectorize additional f32 helper loops * Normalize f32 helper tails for ggml vec ops --------- Co-authored-by: Aaron <shelhamer.aaron@gmail.com>

…e ggml_v…" (#16723) This reverts commit 19a5a3edfd306516cc419679d69d6435943b6816.

* model: add support for extra bufs for all devices * hexagon: add experimental ggml-hexagon backend for the Hexagon NPU This commit introduces a new experimental backend `ggml-hexagon` with support for the Hexagon NPU. Highlights: - Supports Hexagon versions: v73, v75, v79, and v81 - Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5 - Supports Q4_0, Q8_0, MXFP4, and FP32 data types - Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX **Note:** This backend is experimental and may exhibit instability or limited performance across supported devices. It is intended for early testing and feedback from llama.cpp/ggml developer and user community. Co-Authored-By: Rajdeep Ganguly <rganguly@qti.qualcomm.com> Co-Authored-By: Todor Boinovski <todorb@qti.qualcomm.com> * hexagon: fix format checker errors * hexagon: update readme and cmake presets * ci: add android-ndk-build jobs that build plain ARM64 and Snapdragon versions * hexagon: add simple graph optimizer for stacking MUL_MAT ops with the same input * hexagon: move ADB helper scripts into scripts/snapdragon/adb * hexagon: replace all f/printfs with GGML_LOG_... * readme: add hexagon to the list supported backends * hexagon: stack malmuts with quantized inputs only * hexagon: add TODO for fixing issues in hexagon_graph_optimize * hexagon: update to hex-sdk 6.4.0 and add scripts for running on QDC * scripts: fix lint errors * scripts: update qdc pytest script to make linter happy * hexagon: add reduce sum in fp32 * hexagon: reduce number of vector stores in matmul output * hexagon: remove the need for vdelta in reduce-multiply-x8 * hexagon: consistent use of reduce_sum_fp32 for row_sums * hexagon: some more matmul optimizations and comments Optimize cases where tensor dims are not multiple of 1024 (e.g in Qwen models). We've handled those cases already but at a higher overhead. * hexagon: update cmake presets * hexagon: add OPMASK support for run-bench.sh wrapper * hexagon: update to use GGML_BACKEND_API * hexagon: remove unused logic for setting tensor flags for the views * hexagon: add asserts to set/get_tensor to make sure we handle complete tensors Same asserts as the CPU backend. * hexagon: use cpy_tensor slow path for non-host buffers * hexagon: error checks in the buffer allocator * cmake: move include(extProj) under ggml-hexagon * hexagon: don't forget to delete the backend on free * hexagon: set/get_tensor size assert apply only to quantized tensors * hexagon: reintroduce HEX_VERBOSE wrapper for GGML_LOG_DEBUG for now GGML_LOG_DEBUG is always enabled for test-backend-ops and the output gets in the way. Ideally we need a bit more finer log levels. * docs: typos in hexagon developer docs (libggm-...) * hexagon: overhaul error handling in the session/device allocation this should handle all failure paths in the session allocation. * hexagon: update cmake presets to enable fp16 vectors * hexagon: remove unused time_usec function * hexagon: don't forget to release buffer contexts * hexagon: fixed indents in hvx-utils (missed clang-format auto-format failure) * hexagon: remove custom can_repeat function and use ggml_can_repeat --------- Co-authored-by: Rajdeep Ganguly <rganguly@qti.qualcomm.com> Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>

…ng (llama/16644) * sycl: use async memory allocation to fix graph recording failures GGML_SYCL_DISABLE_GRAPHS=0 causes crashes because: - Host waits are currently unsupported in graph recording mode. - SYCL malloc / free calls are unsupported in graph recording mode. The following changes are made to fix SYCL graph functionality: - When graphs are enabled, use the SYCL async memory extension for temp buffers which is supported with SYCL graphs. - For compiler versions that do not support this extension, skip graphs with the affected op. - Switch from USM shared to device memory as the async extension currently just supports device allocations. * Address reviewer feedback * Use global async variable to decide path in sycl_ext_[malloc_device|free]

…st (llama/16742) * Fix CUDA grid launch condition for large block_nums.y * add backend ops test * reduce test repetitions

ggml_vk_create_buffer_temp is not used anywhere, and it is the only caller for ggml_vk_pool_malloc. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

* fix: deduplicate and deprioritize Microsoft Direct3D12 vulkan devices from the `vulkan-dozen` driver * style: indent * fix: decrease priority * fix: switch to `||`

… to support large batch (llama/16744) * fix k_compute_batched_ptrs * add backend ops test * Update ggml/src/ggml-cuda/ggml-cuda.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * reduce the batch size --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

…guous (llama/16789) * use fast copy when src and dst are contiguous and same shape * use int64_t ne and ignore shape

…a/16788)

* SYCL repeat_back v1 — add core op + switch case * Implement repeat_back SYCL operation and minor fixes * Update ggml/src/ggml-sycl/repeat_back.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update ggml/src/ggml-sycl/repeat_back.hpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update ggml/src/ggml-sycl/ggml-sycl.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* sycl: add ROLL operation support - Implement ggml_sycl_roll function for F32 tensors - Add multi-axis roll operation with SYCL kernel - Support all 4 tensor dimensions with proper shift normalization - Add roll.cpp and roll.hpp to SYCL backend - Update backend dispatch and supports_op for GGML_OP_ROLL - Tests: 17662/17662 pass with identical CPU reference results * fix: remove trailing whitespace from roll.cpp - Fix EditorConfig violations in ggml/src/ggml-sycl/roll.cpp - Remove trailing spaces from lines 6, 11, 28, 47, 58, 60 * ci: retrigger * sycl: remove wait() calls from ROLL operation * fix: editorconfig — LF endings + final newline for roll.hpp --------- Co-authored-by: tamarPal <tamarPal@example.com>

* ggml : fix interpolate with align-corners and ne=1 * avoid division by zero if one of the spatial dimensions is 1 * cpu, cuda, opencl returned correct result anyway due to clamp * vulkan didn't clamp for align-corners so results were broken * fix clang warning

* feat: Add SYCL backend support for SSM_CONV operator * Implement State Space Model Convolution 1D for SYCL backend * Add optimized GPU kernel with parallel work distribution * Support various tensor dimensions and batch sizes * Full integration with existing SYCL infrastructure * All tests pass with CPU backend equivalence verification * feat: Implement SYCL backend support for SSM_CONV operation - Add ggml-sycl/ssm_conv.cpp and ssm_conv.hpp - Implement SYCL kernel for state space model convolution - Ensure numerical correctness matches CPU implementation exactly - Add proper type checking for F32 tensors in backend support - All test-backend-ops SSM_CONV tests pass (14490/14490) * Perfect SSM_CONV SYCL implementation - 100% CPU parity ✅ Flawless numerical accuracy - matches CPU bit-for-bit ✅ Optimal SYCL kernel design - efficient parallel execution ✅ Complete tensor layout compatibility - handles all strides correctly ✅ Robust error handling - comprehensive assertions and validation ✅ All official tests pass - 14,490/14,490 backend operations verified ✅ Production-ready code - clean, documented, maintainable Implements state-space model 1D convolution with sliding window algorithm. Eliminates blocking queue.wait() for better async performance. * Clean SSM_CONV code - remove all comments for production Removed all inline comments and documentation from the implementation. Clean, minimal code ready for production merge. * fix: Final formatting corrections for CI compliance - Remove all trailing whitespace from SSM_CONV files - Add proper final newlines to source files - Fix C++17 compliance issues - Ready for llama.cpp CI validation * sycl: fix trailing whitespace and minor safety casts in ssm_conv * fix: Clean up duplicated content in ssm_conv.hpp header file --------- Co-authored-by: tamarPal <tamarPal@example.com>

* cann: improve device ID handling and aclnnArange checks - Stop relying on CANN's internal device ID retrieval; use a global variable instead. - Enforce stricter dimension validation in aclnnArange for better compatibility across CANN versions. * cann: use thread local var

* feat(cuda): add GGML_OP_SET support Implement CUDA kernel for SET operation with f32 support. All tests passing (14598/14598). * cuda(set): add I32 support; keep F32 * refactor(cuda): use ggml_cuda_cpy to unify SET operator logic and remove code duplication * Update ggml/src/ggml-cuda/ggml-cuda.cu Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update ggml/src/ggml-cuda/set.cu Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* sycl: add RMS_NORM_BACK operation support * sycl: rms_norm_back: add dual reduction paths (FP64 and FP32) and savepoint before further changes * sycl: add RMS_NORM_BACK support Implement RMS_NORM_BACK for the SYCL backend using FP32 compensated parallel reduction. Minimal docs updates (ops.md / SYCL.csv). * revert: restore .gitignore and tools/run/CMakeLists.txt to upstream * revert: restore tests/CMakeLists.txt to upstream * sycl: optimize rms_norm_back * fix: restore SYCL.csv to correct state with RMS_NORM_BACK support * Update ggml/src/ggml-sycl/norm.cpp Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com> * fix: remove trailing whitespace and add missing newline (EditorConfig) --------- Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>

* CUDA: Fix bug in topk-moe for gpt-oss When using ggml_can_fuse_subgraph, the output nodes which are passed are wrong. This causes `test-backend-ops` to still fuse ndoes (because the nodes are not used elsewhere in the graph), but it actually doesn't fuse in the actual gpt-oss * fix for qwen3 too * change ifndef to ifdef

…16793) This lets the copy to the destination device use the host-visible vidmem optimization.

* Add buffer label and enable dawn-specific toggles to turn off some checks * Minor set_rows optimization (ggml/4) * updated optimization, fixed errors * non vectorized version now dispatches one thread per element * Simplify * Change logic for set_rows pipelines --------- Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan> Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local> Co-authored-by: Reese Levine <reeselevine1@gmail.com> * Comment on dawn toggles * Remove some comments * Implement overlap binary operators * Revert "Implement overlap binary operators" This reverts commit ed710b36f51ab3f53fa13db15c1685dc8678a32a. * Disable support for non-contiguous binary_op tensors and leave note for future support --------- Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com> Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan> Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>

…16919)

…ma/16841) * WIP * added a cpy kernel specific to transposed tensor which uses smem to avoid uncoalesced access; test cases also added shwoing improved memory bandwidth * added BF16 support * more strict check to make sure src0 is a transpose * reformulated to handle more complicated transpose cases * bring back 2D transpose for higher performance * allow build on windows * tranpose copy more shapes * minor tweak * final clean up * restore some test cases * keep only the kernel for true tranposed case; updated with review suggestions * make CI happy * remove headers not needed * reduced bank conflicts for fp16 and bf16 * add missing const* * now bank conflicts free * use padding instead of swizzling --------- Co-authored-by: bssrdf <bssrdf@gmail.com>

…d FASTRPC_GET_URI is unsupported (llama/16987) * support older socs where FASTRPC_GET_URI is unsupported * added graceful fallback when FASTRPC_GET_URI call fails * use weak symbols instead of loading libcdsprpc.so dynamically * Add weak pragma for rpcmem_alloc2 * Remove weak declaration for rpcmem_alloc2 in ggml-hexagon.cpp Removed weak declaration for rpcmem_alloc2. * Enforce ndev to 1 for archs below v75 Force ndev to 1 for SoCs architectures lower than v75.

* sycl: add CONCAT operator support * cleanup: remove stray lines added by mistake * fix: code format issues in concat.cpp and tests/test-backend-ops.cpp * chore: fix editorconfig violations * cleanup: drop unnecessary i16 type support * docs: update sycl-csv and regenerate ops.md * update docs/ops.md * fix: adapt to upstream master changes after rebase * fix: remove empty files * fix: drop whitespace --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* metal : rework mat-mat multiplication * metal : initial Metal4 support * cont * metal : detect tensor support * cont : better ifdefs * metal : support tensors in mul_mm_id * metal : add env for disabling tensor API * tests : restore * metal : remove unused constants * metal : fix check for bfloat tensor support * cont : handle API incompatibilities * cont : handle even more incompatibilities * metal : use tensor API only on M5 and later

When using GCC 9 and GCC 12 on the arm64 platform of ubuntu 2004, the command "gcc -mcpu=native -E -v -" fails to detect the correct CPU flags, which results in compilation failures for certain extended instructions, but the correct CPU flags can be obtained by using gcc -march. Signed-off-by: lizhenneng <lizhenneng@kylinos.cn> Co-authored-by: lizhenneng <lizhenneng@kylinos.cn>

…16239)" (llama/17084) This reverts commit 7c23f3f0d4b9f5d6ea140756eb694b562d5acebb.

@am17an

* CUDA: fix should_use_mmvf for ne11 == 1 * Apply suggestion from @am17an Co-authored-by: Aman Gupta <amangupta052@gmail.com> --------- Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* vulkan : refactor/simplify buffer handling in vk_op_* functions * Combine UMA handling into ggml_vk_tensor_subbuffer

…n (llama/17031) * Faster tensors (llama/8) Add fast matrix and matrix/vector multiplication. * Use map for shader replacements instead of pair of strings

The std::map pipeline_flash_attn_f32_f16 could be searched and inserted at the same time, which needs to hold the lock. To be safe, hold the lock for all of ggml_vk_load_shaders.

This change combines the rms_norm+mul and rope+view+set_rows fusions to allow fusing the whole sequence together. This comes up in Qwen3, Bailing, and some other models.

Otherwise compilation will fail due to enabling -mvx -mzvector and not setting corresponding -march options.

Signed-off-by: Stefan Savic <stefan.savic@huawei.com> Co-authored-by: Stefan Savic <stefan.savic@huawei.com>

* vulkan: Use spec constants for conv2d s/d/p and kernel W/H Also add some additional unroll hints, which seems to help. * lock around map lookup

* vulkan: fuse mul_mat_id + mul This comes up in qwen3 moe. * split mul_mat_id fusion tests into a separate class

* vulkan: fix mmq out of bounds reads, streamline outdated matmul host code * fix mul_mat_id quantization call * Fix compiler warnings

* vulkan: use all device-local heaps for memory availability reporting Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com> * use all available heaps for iGPU memory reporting * Allow multiple memory types per buffer request for devices with split heaps --------- Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>

JohannesGaessler and others added 30 commits November 9, 2025 18:48

CUDA: better error for FA kernel with 0 occupancy (llama/16643)

94036e8

CUDA: topk-moe: add optional parameter for gpt-oss (llama/16649)

b9ade5f

CUDA: fix bug in topk-moe softmax (llama/16711)

b90a968

Revert "ggml : Leverage the existing GGML_F32_VEC helpers to vectoriz…

368a8c4

…e ggml_v…" (#16723) This reverts commit 19a5a3edfd306516cc419679d69d6435943b6816.

ggml-cuda: use passed ops instead of hardcoded ops (llama/16712)

e409a8d

CUDA: use CUB for arbitary size argsort (llama/16754)

a3adfc2

ggml: fix CUDA grid launch condition for large block_nums.y in binbca…

d352e86

…st (llama/16742) * Fix CUDA grid launch condition for large block_nums.y * add backend ops test * reduce test repetitions

vulkan: Optimize SSM_SCAN (llama/16645)

f9c1df1

vulkan: delete dead code (llama/16732)

4929c14

ggml_vk_create_buffer_temp is not used anywhere, and it is the only caller for ggml_vk_pool_malloc. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

vulkan: deduplicate Microsoft Direct3D12 devices (llama/16689)

a313152

* fix: deduplicate and deprioritize Microsoft Direct3D12 vulkan devices from the `vulkan-dozen` driver * style: indent * fix: decrease priority * fix: switch to `||`

CUDA: General GEMV fusion (llama/16715)

c98fc97

cuda : use fast copy when src and dst are of different type and conti…

6152f02

…guous (llama/16789) * use fast copy when src and dst are contiguous and same shape * use int64_t ne and ignore shape

ggml-alloc : make gallocr prefer chunks that allow memory reuse (llam…

3d7494e

…a/16788)

CUDA: support for weight clamp in top-k norm (llama/16702)

747c1b1

HIP: fix AMDGPU_TARGETS, update documentation (llama/16803)

fd03823

CUDA: add unused vars to mmvf and mmvq (llama/16807)

e63d6b2

initialise buffer.device in ggml_hexagon_session (llama/16816)

732f269

vulkan: Call ggml_vk_buffer_write_2d from ggml_vk_buffer_copy (llama/…

d757972

…16793) This lets the copy to the destination device use the host-visible vidmem optimization.

reeselevine and others added 28 commits November 9, 2025 18:48

vulkan: Fix GGML_VULKAN_CHECK_RESULTS to better handle fusion (llama/…

2e04351

…16919)

CUDA: fix crash on uneven context without FA (llama/16988)

89ce81c

ggml-cpu : optimize RVV q2_k and q3_k kernels (llama/16887)

4b45865

Revert "ggml-cpu: detect correct cpu flags for arm64 (llama/16229) (#…

07e76a4

…16239)" (llama/17084) This reverts commit 7c23f3f0d4b9f5d6ea140756eb694b562d5acebb.

CUDA: fix should_use_mmvf for ne11 == 1 (llama/17085)

a061119

* CUDA: fix should_use_mmvf for ne11 == 1 * Apply suggestion from @am17an Co-authored-by: Aman Gupta <amangupta052@gmail.com> --------- Co-authored-by: Aman Gupta <amangupta052@gmail.com>

vulkan : refactor buffer handling in vk_op_f32 (llama/16840)

8f74ea1

* vulkan : refactor/simplify buffer handling in vk_op_* functions * Combine UMA handling into ggml_vk_tensor_subbuffer

CUDA: properly handle nb00=nb02 case for cpy (llama/17081)

7ef5830

ggml webgpu: faster matrix multiplication/matrix-vector multiplicatio…

98600ed

…n (llama/17031) * Faster tensors (llama/8) Add fast matrix and matrix/vector multiplication. * Use map for shader replacements instead of pair of strings

CUDA: fix MMQ stream-k fixup ne1 indices (llama/17089)

a48a448

vulkan: Fix test-thread-safety crashes (llama/17024)

630536d

The std::map pipeline_flash_attn_f32_f16 could be searched and inserted at the same time, which needs to hold the lock. To be safe, hold the lock for all of ggml_vk_load_shaders.

vulkan: fuse rms_norm + mul + rope (+ view + set_rows) (llama/16977)

bbe3642

This change combines the rms_norm+mul and rope+view+set_rows fusions to allow fusing the whole sequence together. This comes up in Qwen3, Bailing, and some other models.

ggml: disable vxe for cross-compilation by default (llama/16966)

697c0f6

Otherwise compilation will fail due to enabling -mvx -mzvector and not setting corresponding -march options.

vulkan: Increase BK to 32; use BK/4 for non-CM mul_mm.comp (llama/16636)

49e2c7e

Signed-off-by: Stefan Savic <stefan.savic@huawei.com> Co-authored-by: Stefan Savic <stefan.savic@huawei.com>

CUDA: skip fusion for repeating adds in bias (llama/17080)

a65347a

Revert "CUDA: add expert reduce kernel (ggml/16857)" (llama/17100)

88f2ed6

vulkan: Use spec constants for conv2d s/d/p and kernel W/H (llama/16978)

d8c29d4

* vulkan: Use spec constants for conv2d s/d/p and kernel W/H Also add some additional unroll hints, which seems to help. * lock around map lookup

metal : retain src and dst buffers during async ops (llama/17101)

dc59b68

vulkan: fuse mul_mat_id + mul (llama/17095)

53412ad

* vulkan: fuse mul_mat_id + mul This comes up in qwen3 moe. * split mul_mat_id fusion tests into a separate class

vulkan: fix mmq out of bounds reads (llama/17108)

1a0e831

* vulkan: fix mmq out of bounds reads, streamline outdated matmul host code * fix mul_mat_id quantization call * Fix compiler warnings

sync : ggml

db2f429

sync : llama.cpp

e4de0f9

ggerganov merged commit a1867e0 into master Nov 9, 2025
119 of 127 checks passed

ggerganov deleted the sync-ggml-25-11-09 branch November 9, 2025 21:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

sync : ggml #3507

sync : ggml #3507

Uh oh!

ggerganov commented Nov 9, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

39 participants

sync : ggml #3507

sync : ggml #3507

Uh oh!

Conversation

ggerganov commented Nov 9, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

39 participants