Merged
Commits
92 commits
94036e8
CUDA: better error for FA kernel with 0 occupancy (llama/16643)
JohannesGaessler Oct 21, 2025
b9ade5f
CUDA: topk-moe: add optional parameter for gpt-oss (llama/16649)
am17an Oct 21, 2025
b90a968
CUDA: fix bug in topk-moe softmax (llama/16711)
am17an Oct 22, 2025
8398a24
ggml : Leverage the existing GGML_F32_VEC helpers to vectorize ggml_v…
sirus20x6 Oct 22, 2025
368a8c4
Revert "ggml : Leverage the existing GGML_F32_VEC helpers to vectoriz…
slaren Oct 22, 2025
f92f4d2
Add experimental ggml-hexagon backend for the Hexagon NPU (llama/16547)
max-krasnyansky Oct 22, 2025
1018c46
sycl: use async memory allocation to fix crashes during graph recordi…
mmichel11 Oct 23, 2025
e409a8d
ggml-cuda: use passed ops instead of hardcoded ops (llama/16712)
am17an Oct 23, 2025
a3adfc2
CUDA: use CUB for arbitary size argsort (llama/16754)
am17an Oct 24, 2025
d352e86
ggml: fix CUDA grid launch condition for large block_nums.y in binbca…
leejet Oct 24, 2025
f9c1df1
vulkan: Optimize SSM_SCAN (llama/16645)
jeffbolznv Oct 25, 2025
4929c14
vulkan: delete dead code (llama/16732)
giuseppe Oct 25, 2025
a313152
vulkan: deduplicate Microsoft Direct3D12 devices (llama/16689)
giladgd Oct 26, 2025
c98fc97
CUDA: General GEMV fusion (llama/16715)
am17an Oct 26, 2025
5a0df69
ggml: fix cuda kernel launch configuration for k_compute_batched_ptrs…
leejet Oct 26, 2025
6152f02
cuda : use fast copy when src and dst are of different type and conti…
CISC Oct 26, 2025
3d7494e
ggml-alloc : make gallocr prefer chunks that allow memory reuse (llam…
Acly Oct 26, 2025
747c1b1
CUDA: support for weight clamp in top-k norm (llama/16702)
am17an Oct 27, 2025
c7097aa
sycl: add REPEAT_BACK operation support (llama/16734)
shani-f Oct 27, 2025
8e85c51
sycl: add ROLL operation support (llama/16665)
tamarPal Oct 27, 2025
fd03823
HIP: fix AMDGPU_TARGETS, update documentation (llama/16803)
JohannesGaessler Oct 27, 2025
5fb8ca9
ggml : fix interpolate with align-corners and ne=1 (llama/16700)
Acly Oct 27, 2025
89cc926
sycl: add SSM_CONV operation support (llama/16800)
tamarPal Oct 28, 2025
e63d6b2
CUDA: add unused vars to mmvf and mmvq (llama/16807)
am17an Oct 28, 2025
26bd99e
CANN: Improve device ID handling and aclnnArange checks (llama/16752)
noemotiovon Oct 28, 2025
732f269
initialise buffer.device in ggml_hexagon_session (llama/16816)
l3utterfly Oct 28, 2025
91d0e34
cuda: add SET operation support (llama/16804)
YaelGitAccount Oct 28, 2025
728c4e6
sycl: add RMS_NORM_BACK operation support (llama/16808)
YaelLogic Oct 29, 2025
a247ba3
CUDA: Fix bug in topk-moe for gpt-oss (llama/16821)
am17an Oct 29, 2025
d757972
vulkan: Call ggml_vk_buffer_write_2d from ggml_vk_buffer_copy (llama/…
jeffbolznv Oct 29, 2025
a70a475
CUDA: use fastdiv in set-rows (llama/16834)
am17an Oct 29, 2025
e0c36d0
Hexagon Op queue & dispatch optimizations (llama/16820)
max-krasnyansky Oct 29, 2025
781429f
Vulkan MMQ Integer Dot Refactor and K-Quant support (llama/16536)
0cc4m Oct 29, 2025
361e044
vulkan: Update topk_moe fusion to handle gpt's late softmax (llama/16…
jeffbolznv Oct 29, 2025
b02accd
vulkan: Fuse rope+set_rows (llama/16769)
jeffbolznv Oct 29, 2025
637ac11
Hide latency of bias and gate-loading (llama/16847)
ORippler Oct 30, 2025
4af2278
vulkan: Handle argsort with a large number of rows (llama/16851)
jeffbolznv Oct 30, 2025
27371bc
cuda : fix argsort with 64k+ rows (llama/16849)
CISC Oct 30, 2025
8a496a7
cpu: introduce chunking for flash attention (llama/16829)
max-krasnyansky Oct 30, 2025
4e86c56
model: add support for qwen3vl series (llama/16780)
JJJYmmm Oct 30, 2025
664eeb7
cpu: introduce chunking for repack matmuls and enable matmul-id chunk…
max-krasnyansky Oct 30, 2025
a59be7c
opencl: fix boundary handling for mul_mm (llama/16875)
lhez Oct 30, 2025
ae7d9cf
ggml-hexagon: respect input size when getting/setting tensor data (ll…
l3utterfly Oct 31, 2025
136051d
vulkan: fix shmem overrun in mmq id shader (llama/16873)
0cc4m Oct 31, 2025
12af7b9
vulkan: Fix crash when FP16 mul_mat accumulation is not supported (ll…
rillomas Oct 31, 2025
03ec080
vulkan: disable spirv-opt for rope shaders (llama/16872)
jeffbolznv Oct 31, 2025
757c037
CUDA: add expert reduce kernel (llama/16857)
am17an Oct 31, 2025
18c5436
ggml : fix conv2d_dw SVE path (ggml/1380)
ggerganov Nov 4, 2025
6f0b1f5
CUDA: Volta tensor core support for MMF (llama/16843)
JohannesGaessler Oct 31, 2025
6de5011
CUDA: Remove unneded bias/gate dims in fused mmvq (llama/16858)
ORippler Nov 1, 2025
062b513
vulkan: fuse mul_mat+add and mul_mat_id+add_id (llama/16868)
jeffbolznv Nov 1, 2025
5dbcb94
vulkan: Fix multi_add invalid descriptor usage (llama/16899)
jeffbolznv Nov 1, 2025
6344439
ggml: add s390x cpu-feats (llama/16774)
taronaeo Nov 2, 2025
d5d10bf
CUDA: add FLOOR, CEIL, ROUND, TRUNC unary ops (llama/16917)
mnehete32 Nov 2, 2025
6754829
clip : use FA (llama/16837)
ggerganov Nov 2, 2025
6b8a6ca
SYCL: optimized repeat_back kernel (3× fewer asm instructions, 2× fas…
shani-f Nov 3, 2025
3f58bf6
ggml : LoongArch fixes (llama/16958)
MQ-mengqing Nov 3, 2025
2b763ac
ggml: CUDA: add head size 72 for flash-attn (llama/16962)
theo77186 Nov 3, 2025
2f5851f
opencl: support imrope (llama/16914)
lhez Nov 3, 2025
b01c500
CUDA: avoid mul + bias fusion when doing fusion (llama/16935)
am17an Nov 4, 2025
c8b77a8
Fix garbled output with REPACK at high thread counts (llama/16956)
NoahOksuz Nov 4, 2025
a211e6b
ggml-cpu : bicubic interpolation (llama/16891)
Acly Nov 4, 2025
fb8529c
vulkan: remove the need for the dryrun (llama/16826)
jeffbolznv Nov 4, 2025
23f526c
refactor: replace sprintf with snprintf for safer string handling in …
chraac Nov 4, 2025
ff5ab7c
ggml webgpu: minor set rows optimization (llama/16810)
reeselevine Nov 9, 2025
2e04351
vulkan: Fix GGML_VULKAN_CHECK_RESULTS to better handle fusion (llama/…
jeffbolznv Nov 5, 2025
96eee7e
improve CUDA cpy memory bandwidth when copying transposed tensor (lla…
bssrdf Nov 5, 2025
78f4363
ggml-hexagon: graceful fallback for older socs where rpcmem_alloc2 an…
l3utterfly Nov 6, 2025
07f35f6
sycl: add CONCAT operator support (llama/16047)
ye-NX Nov 6, 2025
73287b6
metal : initial Metal4 tensor API support (llama/16634)
ggerganov Nov 6, 2025
89ce81c
CUDA: fix crash on uneven context without FA (llama/16988)
JohannesGaessler Nov 6, 2025
4b45865
ggml-cpu : optimize RVV q2_k and q3_k kernels (llama/16887)
xctan Nov 6, 2025
3b6f5f3
ggml-cpu: detect correct cpu flags for arm64 (ggml/16229) (llama/16239)
lizhenneng Nov 7, 2025
07e76a4
Revert "ggml-cpu: detect correct cpu flags for arm64 (llama/16229) (#…
angt Nov 7, 2025
a061119
CUDA: fix should_use_mmvf for ne11 == 1 (llama/17085)
JohannesGaessler Nov 7, 2025
8f74ea1
vulkan : refactor buffer handling in vk_op_f32 (llama/16840)
Acly Nov 7, 2025
7ef5830
CUDA: properly handle nb00=nb02 case for cpy (llama/17081)
bssrdf Nov 7, 2025
98600ed
ggml webgpu: faster matrix multiplication/matrix-vector multiplicatio…
reeselevine Nov 8, 2025
a48a448
CUDA: fix MMQ stream-k fixup ne1 indices (llama/17089)
JohannesGaessler Nov 8, 2025
630536d
vulkan: Fix test-thread-safety crashes (llama/17024)
jeffbolznv Nov 8, 2025
bbe3642
vulkan: fuse rms_norm + mul + rope (+ view + set_rows) (llama/16977)
jeffbolznv Nov 8, 2025
697c0f6
ggml: disable vxe for cross-compilation by default (llama/16966)
AlekseiNikiforovIBM Nov 8, 2025
49e2c7e
vulkan: Increase BK to 32; use BK/4 for non-CM mul_mm.comp (llama/16636)
SavicStefan Nov 8, 2025
a65347a
CUDA: skip fusion for repeating adds in bias (llama/17080)
am17an Nov 8, 2025
88f2ed6
Revert "CUDA: add expert reduce kernel (ggml/16857)" (llama/17100)
am17an Nov 8, 2025
d8c29d4
vulkan: Use spec constants for conv2d s/d/p and kernel W/H (llama/16978)
jeffbolznv Nov 8, 2025
dc59b68
metal : retain src and dst buffers during async ops (llama/17101)
ggerganov Nov 9, 2025
53412ad
vulkan: fuse mul_mat_id + mul (llama/17095)
jeffbolznv Nov 9, 2025
1a0e831
vulkan: fix mmq out of bounds reads (llama/17108)
0cc4m Nov 9, 2025
d7fc9e2
vulkan: iGPU memory reporting fix (llama/17110)
0cc4m Nov 9, 2025
db2f429
sync : ggml
ggerganov Nov 9, 2025
e4de0f9
sync : llama.cpp
ggerganov Nov 9, 2025
5 changes: 4 additions & 1 deletion examples/talk-llama/CMakeLists.txt
@@ -2,6 +2,8 @@ if (WHISPER_SDL2)
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

file(GLOB SRC_MODELS models/*.cpp)

set(TARGET whisper-talk-llama)
add_executable(${TARGET} talk-llama.cpp
llama.cpp
@@ -29,7 +31,8 @@ if (WHISPER_SDL2)
llama-sampling.cpp
llama-vocab.cpp
unicode.cpp
unicode-data.cpp)
unicode-data.cpp
${SRC_MODELS})
target_include_directories(${TARGET} PRIVATE ${SDL2_INCLUDE_DIRS})

target_link_libraries(${TARGET} PRIVATE common common-sdl whisper ${SDL2_LIBRARIES} ${CMAKE_THREAD_LIBS_INIT})
108 changes: 108 additions & 0 deletions examples/talk-llama/llama-arch.cpp
@@ -32,6 +32,8 @@ static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
{ LLM_ARCH_QWEN2VL, "qwen2vl" },
{ LLM_ARCH_QWEN3, "qwen3" },
{ LLM_ARCH_QWEN3MOE, "qwen3moe" },
{ LLM_ARCH_QWEN3VL, "qwen3vl" },
{ LLM_ARCH_QWEN3VLMOE, "qwen3vlmoe" },
{ LLM_ARCH_PHI2, "phi2" },
{ LLM_ARCH_PHI3, "phi3" },
{ LLM_ARCH_PHIMOE, "phimoe" },
@@ -103,6 +105,9 @@ static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
{ LLM_ARCH_SEED_OSS, "seed_oss" },
{ LLM_ARCH_GROVEMOE, "grovemoe" },
{ LLM_ARCH_APERTUS, "apertus" },
{ LLM_ARCH_MINIMAX_M2, "minimax-m2" },
{ LLM_ARCH_COGVLM, "cogvlm" },
{ LLM_ARCH_PANGU_EMBED, "pangu-embedded" },
{ LLM_ARCH_UNKNOWN, "(unknown)" },
};

@@ -145,6 +150,7 @@ static const std::map<llm_kv, const char *> LLM_KV_NAMES = {
{ LLM_KV_EXPERTS_PER_GROUP, "%s.experts_per_group" },
{ LLM_KV_MOE_EVERY_N_LAYERS, "%s.moe_every_n_layers" },
{ LLM_KV_NEXTN_PREDICT_LAYERS, "%s.nextn_predict_layers" },
{ LLM_KV_NUM_DEEPSTACK_LAYERS, "%s.n_deepstack_layers" },
{ LLM_KV_POOLING_TYPE, "%s.pooling_type" },
{ LLM_KV_LOGIT_SCALE, "%s.logit_scale" },
{ LLM_KV_DECODER_START_TOKEN_ID, "%s.decoder_start_token_id" },
@@ -779,6 +785,45 @@ static const std::map<llm_arch, std::map<llm_tensor, const char *>> LLM_TENSOR_N
{ LLM_TENSOR_FFN_UP_EXPS, "blk.%d.ffn_up_exps" },
},
},
{
LLM_ARCH_QWEN3VL,
{
{ LLM_TENSOR_TOKEN_EMBD, "token_embd" },
{ LLM_TENSOR_OUTPUT_NORM, "output_norm" },
{ LLM_TENSOR_OUTPUT, "output" },
{ LLM_TENSOR_ATTN_NORM, "blk.%d.attn_norm" },
{ LLM_TENSOR_ATTN_Q, "blk.%d.attn_q" },
{ LLM_TENSOR_ATTN_Q_NORM, "blk.%d.attn_q_norm" },
{ LLM_TENSOR_ATTN_K, "blk.%d.attn_k" },
{ LLM_TENSOR_ATTN_K_NORM, "blk.%d.attn_k_norm" },
{ LLM_TENSOR_ATTN_V, "blk.%d.attn_v" },
{ LLM_TENSOR_ATTN_OUT, "blk.%d.attn_output" },
{ LLM_TENSOR_FFN_NORM, "blk.%d.ffn_norm" },
{ LLM_TENSOR_FFN_GATE, "blk.%d.ffn_gate" },
{ LLM_TENSOR_FFN_DOWN, "blk.%d.ffn_down" },
{ LLM_TENSOR_FFN_UP, "blk.%d.ffn_up" },
},
},
{
LLM_ARCH_QWEN3VLMOE,
{
{ LLM_TENSOR_TOKEN_EMBD, "token_embd" },
{ LLM_TENSOR_OUTPUT_NORM, "output_norm" },
{ LLM_TENSOR_OUTPUT, "output" },
{ LLM_TENSOR_ATTN_NORM, "blk.%d.attn_norm" },
{ LLM_TENSOR_ATTN_Q, "blk.%d.attn_q" },
{ LLM_TENSOR_ATTN_Q_NORM, "blk.%d.attn_q_norm" },
{ LLM_TENSOR_ATTN_K, "blk.%d.attn_k" },
{ LLM_TENSOR_ATTN_K_NORM, "blk.%d.attn_k_norm" },
{ LLM_TENSOR_ATTN_V, "blk.%d.attn_v" },
{ LLM_TENSOR_ATTN_OUT, "blk.%d.attn_output" },
{ LLM_TENSOR_FFN_NORM, "blk.%d.ffn_norm" },
{ LLM_TENSOR_FFN_GATE_INP, "blk.%d.ffn_gate_inp" },
{ LLM_TENSOR_FFN_GATE_EXPS, "blk.%d.ffn_gate_exps" },
{ LLM_TENSOR_FFN_DOWN_EXPS, "blk.%d.ffn_down_exps" },
{ LLM_TENSOR_FFN_UP_EXPS, "blk.%d.ffn_up_exps" },
},
},
{
LLM_ARCH_PHI2,
{
@@ -2312,6 +2357,64 @@ static const std::map<llm_arch, std::map<llm_tensor, const char *>> LLM_TENSOR_N
{ LLM_TENSOR_FFN_UP_CHEXPS, "blk.%d.ffn_up_chexps" },
},
},
{
LLM_ARCH_MINIMAX_M2,
{
{ LLM_TENSOR_TOKEN_EMBD, "token_embd" },
{ LLM_TENSOR_OUTPUT_NORM, "output_norm" },
{ LLM_TENSOR_OUTPUT, "output" },
{ LLM_TENSOR_ATTN_NORM, "blk.%d.attn_norm" },
{ LLM_TENSOR_ATTN_Q, "blk.%d.attn_q" },
{ LLM_TENSOR_ATTN_K, "blk.%d.attn_k" },
{ LLM_TENSOR_ATTN_V, "blk.%d.attn_v" },
{ LLM_TENSOR_ATTN_OUT, "blk.%d.attn_output" },
{ LLM_TENSOR_ATTN_Q_NORM, "blk.%d.attn_q_norm" },
{ LLM_TENSOR_ATTN_K_NORM, "blk.%d.attn_k_norm" },
{ LLM_TENSOR_FFN_NORM, "blk.%d.ffn_norm" },
{ LLM_TENSOR_FFN_GATE_INP, "blk.%d.ffn_gate_inp" },
{ LLM_TENSOR_FFN_GATE_EXPS, "blk.%d.ffn_gate_exps" },
{ LLM_TENSOR_FFN_DOWN_EXPS, "blk.%d.ffn_down_exps" },
{ LLM_TENSOR_FFN_UP_EXPS, "blk.%d.ffn_up_exps" },
{ LLM_TENSOR_FFN_EXP_PROBS_B, "blk.%d.exp_probs_b" },
},
},
{
LLM_ARCH_PANGU_EMBED,
{
{ LLM_TENSOR_TOKEN_EMBD, "token_embd" },
{ LLM_TENSOR_OUTPUT_NORM, "output_norm" },
{ LLM_TENSOR_OUTPUT, "output" },
{ LLM_TENSOR_ATTN_NORM, "blk.%d.attn_norm" },
{ LLM_TENSOR_ATTN_Q, "blk.%d.attn_q" },
{ LLM_TENSOR_ATTN_K, "blk.%d.attn_k" },
{ LLM_TENSOR_ATTN_V, "blk.%d.attn_v" },
{ LLM_TENSOR_ATTN_OUT, "blk.%d.attn_output" },
{ LLM_TENSOR_FFN_NORM, "blk.%d.ffn_norm" },
{ LLM_TENSOR_FFN_GATE, "blk.%d.ffn_gate" },
{ LLM_TENSOR_FFN_DOWN, "blk.%d.ffn_down" },
{ LLM_TENSOR_FFN_UP, "blk.%d.ffn_up" },
},
},
{
LLM_ARCH_COGVLM,
{
{ LLM_TENSOR_TOKEN_EMBD, "token_embd" },
{ LLM_TENSOR_OUTPUT_NORM, "output_norm" },
{ LLM_TENSOR_OUTPUT, "output" },
{ LLM_TENSOR_ATTN_NORM, "blk.%d.attn_norm" },
{ LLM_TENSOR_ATTN_QKV, "blk.%d.attn_qkv" },
{ LLM_TENSOR_ATTN_OUT, "blk.%d.attn_output" },
{ LLM_TENSOR_FFN_NORM, "blk.%d.ffn_norm" },
{ LLM_TENSOR_FFN_GATE, "blk.%d.ffn_gate" },
{ LLM_TENSOR_FFN_DOWN, "blk.%d.ffn_down" },
{ LLM_TENSOR_FFN_UP, "blk.%d.ffn_up" },
{ LLM_TENSOR_VISEXP_ATTN_QKV, "blk.%d.vis_attn_qkv" },
{ LLM_TENSOR_VISEXP_ATTN_OUT, "blk.%d.vis_attn_output" },
{ LLM_TENSOR_VISEXP_FFN_GATE, "blk.%d.vis_gate" },
{ LLM_TENSOR_VISEXP_FFN_DOWN, "blk.%d.vis_down" },
{ LLM_TENSOR_VISEXP_FFN_UP, "blk.%d.vis_up" },
},
},
{
LLM_ARCH_UNKNOWN,
{
@@ -2488,6 +2591,11 @@ static const std::map<llm_tensor, llm_tensor_info> LLM_TENSOR_INFOS = {
{LLM_TENSOR_SHORTCONV_CONV, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_SSM_CONV}},
{LLM_TENSOR_SHORTCONV_INPROJ, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
{LLM_TENSOR_SHORTCONV_OUTPROJ, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
{LLM_TENSOR_VISEXP_ATTN_QKV, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
{LLM_TENSOR_VISEXP_ATTN_OUT, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
{LLM_TENSOR_VISEXP_FFN_GATE, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
{LLM_TENSOR_VISEXP_FFN_DOWN, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
{LLM_TENSOR_VISEXP_FFN_UP, {LLM_TENSOR_LAYER_REPEATING, GGML_OP_MUL_MAT}},
// NextN/MTP tensors are currently ignored (reserved for future MTP support)
// These tensors only exist in the last layer(s) and are treated as output tensors
{LLM_TENSOR_NEXTN_EH_PROJ, {LLM_TENSOR_LAYER_OUTPUT, GGML_OP_MUL_MAT}},
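For reference, each per-layer entry added to the tensor-name tables above is a printf-style pattern whose "%d" placeholder is filled with the block index when a concrete tensor name is resolved. Below is a minimal, self-contained sketch of that expansion; format_tensor_name is a hypothetical helper written for illustration, not a function from llama.cpp.

#include <cstdio>
#include <string>

// Expand a per-layer tensor-name pattern such as "blk.%d.vis_attn_qkv"
// into the name used for a specific block index.
static std::string format_tensor_name(const char * pattern, int block) {
    char buf[128];
    std::snprintf(buf, sizeof(buf), pattern, block);
    return buf;
}

int main() {
    // e.g. the COGVLM vision-expert and MoE tensors registered in this sync
    std::printf("%s\n", format_tensor_name("blk.%d.vis_attn_qkv", 7).c_str());  // blk.7.vis_attn_qkv
    std::printf("%s\n", format_tensor_name("blk.%d.ffn_gate_exps", 7).c_str()); // blk.7.ffn_gate_exps
    return 0;
}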
11 changes: 11 additions & 0 deletions examples/talk-llama/llama-arch.h
@@ -36,6 +36,8 @@ enum llm_arch {
LLM_ARCH_QWEN2VL,
LLM_ARCH_QWEN3,
LLM_ARCH_QWEN3MOE,
LLM_ARCH_QWEN3VL,
LLM_ARCH_QWEN3VLMOE,
LLM_ARCH_PHI2,
LLM_ARCH_PHI3,
LLM_ARCH_PHIMOE,
@@ -107,6 +109,9 @@ enum llm_arch {
LLM_ARCH_SEED_OSS,
LLM_ARCH_GROVEMOE,
LLM_ARCH_APERTUS,
LLM_ARCH_MINIMAX_M2,
LLM_ARCH_COGVLM,
LLM_ARCH_PANGU_EMBED,
LLM_ARCH_UNKNOWN,
};

@@ -149,6 +154,7 @@ enum llm_kv {
LLM_KV_EXPERTS_PER_GROUP,
LLM_KV_MOE_EVERY_N_LAYERS,
LLM_KV_NEXTN_PREDICT_LAYERS,
LLM_KV_NUM_DEEPSTACK_LAYERS,
LLM_KV_POOLING_TYPE,
LLM_KV_LOGIT_SCALE,
LLM_KV_DECODER_START_TOKEN_ID,
@@ -455,6 +461,11 @@ enum llm_tensor {
LLM_TENSOR_SHORTCONV_CONV,
LLM_TENSOR_SHORTCONV_INPROJ,
LLM_TENSOR_SHORTCONV_OUTPROJ,
LLM_TENSOR_VISEXP_ATTN_QKV,
LLM_TENSOR_VISEXP_ATTN_OUT,
LLM_TENSOR_VISEXP_FFN_GATE,
LLM_TENSOR_VISEXP_FFN_DOWN,
LLM_TENSOR_VISEXP_FFN_UP,
LLM_TENSOR_NEXTN_EH_PROJ,
LLM_TENSOR_NEXTN_EMBED_TOKENS,
LLM_TENSOR_NEXTN_ENORM,
94 changes: 63 additions & 31 deletions examples/talk-llama/llama-batch.cpp
@@ -215,6 +215,7 @@ bool llama_batch_allocr::init(
/*.n_seq_tokens =*/ (uint32_t) 1,
/*.n_seqs =*/ (uint32_t) batch.n_tokens,
/*.n_seqs_unq =*/ (uint32_t) this->seq_id_unq.size(),
/*.n_pos =*/ n_pos_per_embd,
/*.token =*/ batch.token,
/*.embd =*/ batch.embd,
/*.pos =*/ batch.pos,
@@ -251,46 +252,72 @@ bool llama_batch_allocr::init(
// consistency checks
//

for (uint32_t s = 0; s < n_seq_max; ++s) {
if (seq_pos[s].empty()) {
continue;
if (n_pos_per_embd > 1) {
// M-RoPE case: allow position to "jump" forward only (non-continuous positions are allowed)
for (uint32_t s = 0; s < n_seq_max; ++s) {
if (seq_pos[s].empty()) {
continue;
}

const llama_pos p0 = memory ? memory->seq_pos_max(s) : -1;

if (batch.token) {
if (p0 >= 0 && p0 >= seq_pos_min(s)) {
LLAMA_LOG_ERROR(
"%s: the tokens of sequence %d in the input batch have inconsistent sequence positions:\n"
" - the last position stored in the memory module of the context (i.e. the KV cache) for sequence %d is X = %d\n"
" - the tokens for sequence %d in the input batch have a starting position of Y = %d\n"
" for M-RoPE, it is required that the position satisfies: X < Y\n",
__func__, s, s, p0, s, seq_pos_min(s));

return false;
}
} else {
// embedding inputs can have overlapping positions
if (p0 >= 0 && p0 > seq_pos_min(s)) {
LLAMA_LOG_ERROR(
"%s: the tokens of sequence %d in the input batch have inconsistent sequence positions:\n"
" - the last position stored in the memory module of the context (i.e. the KV cache) for sequence %d is X = %d\n"
" - the tokens for sequence %d in the input batch have a starting position of Y = %d\n"
" for M-RoPE, it is required that the position satisfies: X <= Y\n",
__func__, s, s, p0, s, seq_pos_min(s));

return false;
}
}
}
} else {
for (uint32_t s = 0; s < n_seq_max; ++s) {
if (seq_pos[s].empty()) {
continue;
}

const llama_pos p0 = memory ? memory->seq_pos_max(s) : -1;
const llama_pos p0 = memory ? memory->seq_pos_max(s) : -1;

if (p0 >= 0) {
bool ok = true;
if (p0 >= 0) {
bool ok = true;

if (batch.token) {
if (seq_pos_min(s) != p0 + 1) {
ok = false;
}
} else {
assert(batch.embd);

// for embeddings (typically used as vision input), we allow them to have repeating positions
// ref: https://github.com/ggml-org/llama.cpp/issues/13694#issuecomment-2983871762
if (seq_pos_min(s) != p0 && seq_pos_min(s) != p0 + 1) {
ok = false;
if (!ok) {
LLAMA_LOG_ERROR(
"%s: the tokens of sequence %d in the input batch have inconsistent sequence positions:\n"
" - the last position stored in the memory module of the context (i.e. the KV cache) for sequence %d is X = %d\n"
" - the tokens for sequence %d in the input batch have a starting position of Y = %d\n"
" it is required that the sequence positions remain consecutive: Y = X + 1\n",
__func__, s, s, p0, s, seq_pos_min(s));

return false;
}
}

if (!ok) {
LLAMA_LOG_ERROR(
"%s: the tokens of sequence %d in the input batch have inconsistent sequence positions:\n"
" - the last position stored in the memory module of the context (i.e. the KV cache) for sequence %d is X = %d\n"
" - the tokens for sequence %d in the input batch have a starting position of Y = %d\n"
" it is required that the sequence positions remain consecutive: Y = X + 1\n",
__func__, s, s, p0, s, seq_pos_min(s));

if (seq_pos_max(s) - seq_pos_min(s) + 1 > (int) seq_pos[s].size()) {
LLAMA_LOG_ERROR("%s: sequence %d positions are not continuous\n", __func__, s);
return false;
}
}

if (seq_pos_max(s) - seq_pos_min(s) + 1 > (int) seq_pos[s].size()) {
LLAMA_LOG_ERROR("%s: sequence %d positions are not continuous\n", __func__, s);
return false;
}
}

if (memory) {
@@ -389,6 +416,7 @@ llama_ubatch llama_batch_allocr::ubatch_reserve(uint32_t n_seq_tokens, uint32_t
/*.n_seq_tokens =*/ n_seq_tokens,
/*.n_seqs =*/ n_seqs,
/*.n_seqs_unq =*/ n_seqs,
/*.n_pos =*/ n_pos_per_embd,

/*.token =*/ udata->token.data(),
/*.embd =*/ nullptr,
@@ -655,10 +683,8 @@ llama_ubatch llama_batch_allocr::ubatch_add(const std::vector<int32_t> & idxs, u

auto udata = std::make_shared<llama_ubatch::data_t>();

const int32_t n_pos_cur = batch.embd ? n_pos_per_embd : 1;

const int64_t n_embd_all = batch.embd ? (int64_t) n_tokens*n_embd : 0;
const int64_t n_pos_all = (int64_t) n_tokens*n_pos_cur;
const int64_t n_pos_all = (int64_t) n_tokens*n_pos_per_embd;

udata->token .resize(n_tokens);
udata->embd .resize(n_embd_all);
@@ -680,8 +706,13 @@ llama_ubatch llama_batch_allocr::ubatch_add(const std::vector<int32_t> & idxs, u
memcpy(udata->embd.data() + i*n_embd, batch.embd + (int64_t) idxs[i]*n_embd, n_embd*sizeof(float));
}

for (int j = 0; j < n_pos_cur; ++j) {
udata->pos[j*n_tokens + i] = batch.pos[j*batch.n_tokens + idxs[i]];
for (size_t j = 0; j < (size_t)n_pos_per_embd; ++j) {
// if we are using M-RoPE
// if the current batch is text, we need to broadcast the same position across all RoPE sections
// otherwise, the input batch is image embeddings, we copy the positions as-is
// if we are not using M-RoPE, there is only one position per token (this loop runs only once)
size_t src_off = batch.token ? 0 : j*batch.n_tokens;
udata->pos[j*n_tokens + i] = batch.pos[src_off + idxs[i]];
}

udata->n_seq_id[i] = batch.n_seq_id[idxs[i]];
@@ -710,6 +741,7 @@ llama_ubatch llama_batch_allocr::ubatch_add(const std::vector<int32_t> & idxs, u
/*.n_seq_tokens =*/ n_tokens/n_seqs,
/*.n_seqs =*/ n_seqs,
/*.n_seqs_unq =*/ (uint32_t) udata->seq_id_unq.size(),
/*.n_pos =*/ n_pos_per_embd,

/*.token =*/ batch.token ? udata->token.data() : nullptr,
/*.embd =*/ batch.embd ? udata->embd.data() : nullptr,
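The key behavioral change in ubatch_add above is the position copy: with M-RoPE (n_pos_per_embd > 1), a text token carries a single sequential position that is broadcast into every RoPE section, while image embeddings already provide one position per section and are copied section by section. The following standalone sketch mirrors that indexing under simplified types; copy_positions is a hypothetical helper for illustration, not the library function.

#include <cstdint>
#include <cstdio>
#include <vector>

using llama_pos = int32_t;

// dst is laid out as n_pos blocks of n_tokens entries:
// dst[j*n_tokens + i] holds position component j of selected token i.
static void copy_positions(std::vector<llama_pos> & dst,
                           const llama_pos * src,           // [n_batch] for text, [n_pos*n_batch] for embeddings
                           const std::vector<int32_t> & idxs,
                           int64_t n_batch, bool is_text, uint32_t n_pos) {
    const size_t n_tokens = idxs.size();
    dst.resize((size_t) n_pos*n_tokens);
    for (size_t i = 0; i < n_tokens; ++i) {
        for (uint32_t j = 0; j < n_pos; ++j) {
            // text: broadcast the same source position into every section j
            // embeddings: copy section j of the source positions as-is
            const size_t src_off = is_text ? 0 : (size_t) j*n_batch;
            dst[(size_t) j*n_tokens + i] = src[src_off + idxs[i]];
        }
    }
}

int main() {
    // three text tokens at positions 10, 11, 12 with n_pos = 4 (seq, y, x, other)
    const llama_pos pos[] = { 10, 11, 12 };
    std::vector<llama_pos> out;
    copy_positions(out, pos, { 0, 1, 2 }, /*n_batch=*/3, /*is_text=*/true, /*n_pos=*/4);
    for (size_t j = 0; j < 4; ++j) {
        std::printf("section %zu: %d %d %d\n", j, out[j*3 + 0], out[j*3 + 1], out[j*3 + 2]);
    }
    return 0;
}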
13 changes: 12 additions & 1 deletion examples/talk-llama/llama-batch.h
@@ -17,6 +17,16 @@ struct llama_ubatch {
return b_equal_seqs != 0;
}

// typical for M-RoPE cases:
// 0 - sequential position of the tokens/embeddings in the sequence
// 1 - y position in the image
// 2 - x position in the image
// 3 - other
bool is_pos_2d() const {
// TODO @ngxson : we may need to check for model arch when more models use >1 positions
return n_pos >= 3;
}

uint32_t b_equal_seqs; // note: this is a boolean, but we use an int32_t for alignment
// otherwise address sanitizer complains
// TODO: whole_seqs for embeddings?
Expand All @@ -25,6 +35,7 @@ struct llama_ubatch {
uint32_t n_seq_tokens; // tokens per sequence set
uint32_t n_seqs; // sequence sets in the ubatch
uint32_t n_seqs_unq; // unique sequence ids in the ubatch
uint32_t n_pos; // number of position inputs for each token/embedding

// seq_id_unq: unique sequence ids in the ubatch
// seq_idx: indices of the unique sequence ids in the ubatch in [0, n_seqs_unq)
@@ -33,7 +44,7 @@ struct llama_ubatch {
// // size | idx | val
llama_token * token; // [n_tokens] | i | id, token
float * embd; // [n_embd, n_tokens] | i | embd
llama_pos * pos; // [n_tokens] | i | pos
llama_pos * pos; // [n_tokens*n_pos] | i | pos
int32_t * n_seq_id; // [n_tokens] | i | -
llama_seq_id ** seq_id; // [n_tokens] | s | s0, s1, seq_id
llama_seq_id * seq_id_unq; // [n_seqs_unq] | s | seq_id
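To make the new n_pos field concrete: pos is flattened as n_pos consecutive blocks of n_tokens entries, so component j of token i sits at pos[j*n_tokens + i], and is_pos_2d() simply reports whether per-token image coordinates (y, x) are present. The struct below is a reduced sketch for illustration only, not the actual llama_ubatch definition.

#include <cstdint>
#include <cstdio>

using llama_pos = int32_t;

// Reduced view of the fields relevant to position handling.
struct ubatch_pos_view {
    uint32_t    n_tokens;
    uint32_t    n_pos;   // 1 normally; >= 3 with M-RoPE (seq, y, x, ...)
    llama_pos * pos;     // [n_tokens*n_pos], section-major

    bool is_pos_2d() const { return n_pos >= 3; }

    // position component j (0 = sequential, 1 = y, 2 = x, ...) of token i
    llama_pos at(uint32_t i, uint32_t j) const { return pos[(size_t) j*n_tokens + i]; }
};

int main() {
    llama_pos data[] = { 0, 1,   5, 5,   2, 3,   0, 0 }; // 2 tokens, 4 sections
    ubatch_pos_view v = { /*n_tokens=*/2, /*n_pos=*/4, data };
    std::printf("2d=%d token0: seq=%d y=%d x=%d\n", v.is_pos_2d(), v.at(0, 0), v.at(0, 1), v.at(0, 2));
    return 0;
}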