How should the client handle thinking blocks? #15333
Replies: 1 comment 1 reply

Comment: Most jinja chat templates handle stripping thinking tags from the past messages. However, if you use one of the reasoning formats I mentioned above, they are already removed.

Reply: I don't believe your assessment that reasoning-format is always ignored in streaming mode.
I have a few questions about the correct handling of reasoning blocks and I really need some feedback. I'm upgrading a simple Python client frontend to support reasoning models. At the moment I'm using streaming mode on v1/chat/completions in llama-server.
Can llama-server, in SSE streaming mode, mark tokens in a way that differentiates thinking tokens from normal tokens? This would spare the frontend from implementing flimsy parsing of thinking delimiters on a per-model basis.
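For illustration, here is a minimal sketch of the kind of client I have in mind, assuming a server build whose SSE deltas carried a separate reasoning_content field next to content (the URL and model name are placeholders):

```python
import json
import requests

# Sketch only: assumes the server's streaming deltas expose reasoning
# text in a separate "reasoning_content" field; if the build only emits
# raw <think> tags inline, this field never appears.
URL = "http://localhost:8080/v1/chat/completions"  # placeholder address

payload = {
    "model": "local",  # placeholder model name
    "stream": True,
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
}

with requests.post(URL, json=payload, stream=True) as resp:
    for raw in resp.iter_lines():
        if not raw or not raw.startswith(b"data: "):
            continue
        data = raw[len(b"data: "):]
        if data == b"[DONE]":
            break
        delta = json.loads(data)["choices"][0].get("delta", {})
        if delta.get("reasoning_content"):
            print("[thinking]", delta["reasoning_content"], end="", flush=True)
        if delta.get("content"):
            print(delta["content"], end="", flush=True)
```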
Does llama-server remove past thinking blocks automatically during processing, or should the frontend take care to remove them before sending? According to this, models are typically trained to expect thinking blocks from past messages to be removed.
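If the frontend does have to do this itself, it would need something like the following sketch, which assumes DeepSeek-style `<think>...</think>` delimiters (other model families use different, proprietary ones, which is exactly the problem):

```python
import re

# Assumes DeepSeek-style delimiters and plain string content fields;
# the delimiter string varies per model family.
THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_thinking(messages):
    """Return a copy of the history with assistant thinking blocks removed."""
    cleaned = []
    for msg in messages:
        if msg["role"] == "assistant" and isinstance(msg.get("content"), str):
            msg = {**msg, "content": THINK_RE.sub("", msg["content"])}
        cleaned.append(msg)
    return cleaned
```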
The HTTP API has the parameter "thinking_forced_open". Is there also a "thinking_forced_off"? "reasoning-budget" is only available as a CLI argument. The coexistence of these parameters would abstract the switching logic for hybrid models, so the frontend would not need to support textual switches like "/no_think" on a per-model basis, or prefill whatever proprietary delimiter string a model expects. Also, some models default to non-thinking mode with the option to enable thinking, while others default to thinking mode with the option to disable it; a uniform pair of parameters would iron out these inconsistencies as well.
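The closest per-request equivalent I know of today is the template kwarg below, which only works for models whose Jinja template honors it (Qwen3-style hybrids, for example), so treat it as an assumption about template support rather than a general mechanism:

```python
# Per-request toggle via chat_template_kwargs; whether it has any effect
# depends entirely on the model's chat template supporting the kwarg.
payload = {
    "model": "local",  # placeholder model name
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    "chat_template_kwargs": {"enable_thinking": False},
}
```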
I found a few other parameters in the docs related to managing thinking blocks, but none of them were of much use:

CLI llama-server arguments:
- `--reasoning-format`: always set to "none" by default for streaming mode, so it is ignored;
- `--reasoning-budget` set to 0: fails with "Assistant response prefill is incompatible with enable_thinking";
- `--chat_template_kwargs "{"enable_thinking": true/false}"`: only a few models support this, and it also fails with "Assistant response prefill is incompatible with enable_thinking".

HTTP API:
- `reasoning_format`: I suspect this has the same behavior as the CLI argument;
- `chat_template_kwargs: {"enable_thinking": true/false}`: only a few models support this.
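For completeness, this is how I am probing the HTTP-level reasoning_format parameter in non-streaming mode; my expectation (which I cannot confirm for streaming) is that parsed reasoning comes back in message.reasoning_content instead of inline tags:

```python
import requests

# Non-streaming probe of the HTTP-level reasoning_format parameter;
# with a build that parses reasoning, the thinking text is expected in
# message.reasoning_content rather than inline <think> tags.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # placeholder address
    json={
        "model": "local",  # placeholder model name
        "reasoning_format": "deepseek",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    },
).json()

msg = resp["choices"][0]["message"]
print("reasoning:", msg.get("reasoning_content"))
print("answer:", msg.get("content"))
```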