Commit f2b9452

fix: reranking models limited to 512 tokens in llama.cpp backend (#6344)
Fix reranking models being limited to 512 tokens input in the llama.cpp backend.

Signed-off-by: JonGames <18472148+jongames@users.noreply.github.com>
1 parent 585da99 commit f2b9452

File tree

1 file changed: +1 −0 lines changed

backend/cpp/llama-cpp/grpc-server.cpp

Lines changed: 1 addition & 0 deletions
@@ -231,6 +231,7 @@ static void params_parse(const backend::ModelOptions* request,
     params.cpuparams.n_threads = request->threads();
     params.n_gpu_layers = request->ngpulayers();
     params.n_batch = request->nbatch();
+    params.n_ubatch = request->nbatch(); // fixes issue with reranking models being limited to 512 tokens (the default n_ubatch size); allows for setting the maximum input amount of tokens thereby avoiding this error "input is too large to process. increase the physical batch size"
     // Set params.n_parallel by environment variable (LLAMA_PARALLEL), defaults to 1
     //params.n_parallel = 1;
     const char *env_parallel = std::getenv("LLAMACPP_PARALLEL");
