How should the client handle thinking blocks? #15333
Replies: 1 comment 1 reply

Comment: Most jinja chat templates handle stripping thinking tags from the past messages. However, if you use one of the reasoning formats I mentioned above, they are already removed.

Reply: I don't believe your assessment that reasoning-format is always ignored in streaming mode.
I have a few questions about the correct handling of reasoning blocks and I really need some feedback. I'm upgrading a simple Python client frontend to support reasoning models. At the moment I'm using streaming mode on v1/chat/completions in llama-server.
Can llama-server, in SSE streaming mode, mark tokens in a way that differentiates thinking tokens from normal tokens? This would spare the frontend from implementing flimsy parsing of thinking delimiters on a per-model basis.
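For illustration, here is a minimal sketch of the kind of client I have in mind, assuming a server build whose SSE deltas carried a separate reasoning_content field next to content (the URL and model name are placeholders):

```python
import json
import requests

# Sketch only: assumes the server's streaming deltas expose reasoning
# text in a separate "reasoning_content" field; if the build only emits
# raw <think> tags inline, this field never appears.
URL = "http://localhost:8080/v1/chat/completions"  # placeholder address

payload = {
    "model": "local",  # placeholder model name
    "stream": True,
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
}

with requests.post(URL, json=payload, stream=True) as resp:
    for raw in resp.iter_lines():
        if not raw or not raw.startswith(b"data: "):
            continue
        data = raw[len(b"data: "):]
        if data == b"[DONE]":
            break
        delta = json.loads(data)["choices"][0].get("delta", {})
        if delta.get("reasoning_content"):
            print("[thinking]", delta["reasoning_content"], end="", flush=True)
        if delta.get("content"):
            print(delta["content"], end="", flush=True)
```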
Does llama-server remove past thinking blocks automatically during processing, or should the frontend take care to remove them before sending? According to this, models are typically trained to expect thinking blocks from past messages to be removed.
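If the frontend does have to do this itself, it would need something like the following sketch, which assumes DeepSeek-style `<think>...</think>` delimiters (other model families use different, proprietary ones, which is exactly the problem):

```python
import re

# Assumes DeepSeek-style delimiters and plain string content fields;
# the delimiter string varies per model family.
THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_thinking(messages):
    """Return a copy of the history with assistant thinking blocks removed."""
    cleaned = []
    for msg in messages:
        if msg["role"] == "assistant" and isinstance(msg.get("content"), str):
            msg = {**msg, "content": THINK_RE.sub("", msg["content"])}
        cleaned.append(msg)
    return cleaned
```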
The HTTP API has the parameter "thinking_forced_open". Is there also a "thinking_forced_off"? "reasoning-budget" is only available as a CLI argument. The coexistence of these parameters would abstract the switching logic for hybrid models, so the frontend would not need to support textual switches like "/no_think" on a per-model basis, or prefill whatever proprietary delimiter string a model expects. Also, some models default to non-thinking mode with the option to enable thinking, while others default to thinking mode with the option to disable it; a uniform pair of parameters would iron out these inconsistencies as well.
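The closest per-request equivalent I know of today is the template kwarg below, which only works for models whose Jinja template honors it (Qwen3-style hybrids, for example), so treat it as an assumption about template support rather than a general mechanism:

```python
# Per-request toggle via chat_template_kwargs; whether it has any effect
# depends entirely on the model's chat template supporting the kwarg.
payload = {
    "model": "local",  # placeholder model name
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    "chat_template_kwargs": {"enable_thinking": False},
}
```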
I found a few other parameters in the docs related to managing thinking blocks, but none of them were of much use:

CLI llama-server arguments:
- `--reasoning-format`: always set to "none" by default for streaming mode, so it is ignored;
- `--reasoning-budget` set to 0: fails with "Assistant response prefill is incompatible with enable_thinking";
- `--chat_template_kwargs "{"enable_thinking": true/false}"`: only a few models support this, and it also fails with "Assistant response prefill is incompatible with enable_thinking".

HTTP API:
- `reasoning_format`: I suspect this has the same behavior as the CLI argument;
- `chat_template_kwargs: {"enable_thinking": true/false}`: only a few models support this.
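For completeness, this is how I am probing the HTTP-level reasoning_format parameter in non-streaming mode; my expectation (which I cannot confirm for streaming) is that parsed reasoning comes back in message.reasoning_content instead of inline tags:

```python
import requests

# Non-streaming probe of the HTTP-level reasoning_format parameter;
# with a build that parses reasoning, the thinking text is expected in
# message.reasoning_content rather than inline <think> tags.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # placeholder address
    json={
        "model": "local",  # placeholder model name
        "reasoning_format": "deepseek",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    },
).json()

msg = resp["choices"][0]["message"]
print("reasoning:", msg.get("reasoning_content"))
print("answer:", msg.get("content"))
```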