Replies: 2 comments 2 replies
-
You're correct that the way conversation history is formatted in the prompt can encourage the LLM to continue generating additional, fictional dialogue turns. This is a common issue when the prompt ends with something like:
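As an illustration (an assumed shape, not the exact `ContextChatEngine` template), the tail of such a prompt typically renders each turn as `Role: text` and ends on a bare role cue:

```python
# Hypothetical tail of a flattened history prompt. The role labels and the
# trailing bare "Assistant:" cue are assumptions about the general pattern,
# not the exact template used by the engine.
prompt_tail = (
    "User: What is the capital of France?\n"
    "Assistant: The capital of France is Paris.\n"
    "User: And its population?\n"
    "Assistant:"
)
```

Ending on an open `Assistant:` cue is what invites the model to keep writing turns for both roles.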
The model often interprets this as an invitation to continue the conversation pattern. Here's what I found regarding your questions:

1. **Modifying the history prompt template:** You can modify the history prompt template in
2. **Marking the last user query differently:** There is no built-in feature in
   This can help signal to the model where the conversation history ends and where it should stop generating.
3. **End-of-sequence (EOS) token:** There is no direct support in
4. **Alternative approaches**
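A minimal sketch of the boundary-marker idea from point 2 (the marker text and helper name are my own choices, not a LlamaIndex API; any distinctive, consistent marker works):

```python
# Hypothetical sketch: insert an explicit boundary between past turns and the
# live query so the model can tell where the history ends. The marker string
# is an assumption, not a required value.
END_OF_HISTORY = "### END OF HISTORY ###"

def mark_last_query(history_block: str, last_query: str) -> str:
    """Append an explicit boundary, then the current query and an answer cue."""
    return (
        f"{history_block}\n"
        f"{END_OF_HISTORY}\n"
        f"Current question: {last_query}\n"
        f"Answer only this question, then stop.\n"
        f"Answer:"
    )

prompt = mark_last_query("User: Hi\nAssistant: Hello!", "What is 2 + 2?")
```

The same marker can then double as a stop sequence, so any attempt by the model to start a new "turn" is cut off.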
**Summary:** If you need an example of a custom prompt template or further guidance on implementation, let me know!

To continue talking to Dosu, mention @dosu-bot.
-
Issue resolved by explicitly setting stop sequences for the model to prevent it from generating additional rounds of conversation.
This forces the model to stop generating output as soon as it encounters one of those sequences.
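To illustrate the cutoff behavior, here is a pure-Python sketch of what a stop sequence does (llama.cpp applies this during generation; with llama-cpp-python you would pass `stop=[...]` to the model call instead, and the sequences below are example choices, not required values):

```python
# Sketch of stop-sequence semantics: output is truncated at the earliest
# occurrence of any stop string, so fabricated follow-up turns are dropped.
def apply_stop_sequences(text: str, stops: list[str]) -> str:
    """Truncate text at the earliest stop sequence, if any occurs."""
    cut = len(text)
    for s in stops:
        i = text.find(s)
        if i != -1:
            cut = min(cut, i)
    return text[:cut]

raw = "Paris is the capital.\nUser: And Germany?\nAssistant: Berlin."
trimmed = apply_stop_sequences(raw, ["User:", "###"])  # only the real answer survives
```

Choosing the same role labels that appear in the history format (e.g. `"User:"`) is what makes this effective: the first fabricated turn trips the stop.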
-
I'm using the `ContextChatEngine` in my implementation to support conversation history. Since switching to this engine, I've noticed that the LLM (using llama.cpp with Mistral/Gemma models) sometimes generates multiple rounds of user/assistant dialogue by itself in the response. It creates fictional back-and-forth conversations instead of just answering the current prompt.

I suspect this behavior might be caused by how the conversation history is structured in the final prompt. From what I see, it's formatted like:
This may be encouraging the model to continue the pattern and generate additional, made-up turns.
Is there any way to prevent this behavior?
Any advice or alternative approaches would be helpful.