Add parsing #2772

pavel-esir · 2025-09-25T13:03:18Z

Description

Ticket:

Fixes #(issue)

Checklist:

Tests have been updated or added to cover the new code
This patch fully addresses the ticket.
I have made corresponding changes to the documentation

pavel-esir · 2025-09-25T13:06:37Z

src/cpp/include/openvino/genai/parsers.hpp

+namespace genai {
+
+
+using ParsedMessage = std::map<std::string, std::string>;


Suggested change

using ParsedMessage = std::map<std::string, std::string>;

// TODO: will be converted to JSONLike object

using ParsedMessage = std::map<std::string, std::string>;

Copilot

Pull Request Overview

This PR adds parsing functionality to OpenVINO GenAI, implementing both base parsers for batch processing and incremental parsers for streaming scenarios. The changes introduce parser classes for structured output processing, specifically reasoning content and tool calling, with support for both C++ and Python APIs.

Introduces new parser base classes (ParserBase, IncrementalParserBase) with concrete implementations (DeepSeekR1ReasoningParser, Llama32PythonicParser, BaseReasoningParser)
Adds TextParserStreamer class that extends TextStreamer with parsing capabilities for incremental text processing
Integrates parsers into the generation pipeline through GenerationConfig and provides comprehensive test coverage

Reviewed Changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 10 comments.

Show a summary per file

File	Description
`tests/python_tests/test_parsers.py`	Python test cases for parser functionality
`tests/cpp/parser.cpp`	C++ unit tests for various parser implementations
`tests/cpp/CMakeLists.txt`	Build configuration update to include nlohmann_json dependency
`src/python/py_streamers.cpp`	Python bindings for TextParserStreamer class
`src/python/py_parsers.cpp`	Python bindings for parser base classes
`src/python/py_openvino_genai.cpp`	Integration of parser module initialization
`src/python/py_generation_config.cpp`	Adds parsers field to GenerationConfig Python bindings
`src/python/openvino_genai/py_openvino_genai.pyi`	Type definitions for parser classes and updated exports
`src/python/openvino_genai/__init__.py`	Python module exports for parser classes
`src/cpp/src/text_streamer.cpp`	Implementation of TextParserStreamer class
`src/cpp/src/parsers.hpp`	Internal header with parser implementation details
`src/cpp/src/parsers.cpp`	Core parser implementations and registration logic
`src/cpp/src/llm/pipeline.cpp`	Integration of parsers into LLM generation pipeline
`src/cpp/src/generation_config.cpp`	Adds parsers field support to GenerationConfig
`src/cpp/include/openvino/genai/text_streamer.hpp`	Public header for TextParserStreamer
`src/cpp/include/openvino/genai/parsers.hpp`	Public header defining parser interfaces
`src/cpp/include/openvino/genai/llm_pipeline.hpp`	Adds parsed results field to DecodedResults
`src/cpp/include/openvino/genai/generation_config.hpp`	Adds parsers field to GenerationConfig
`samples/cpp/text_generation/parsed_output_sample.cpp`	Sample demonstrating parser usage

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

Copilot · 2025-09-25T13:07:14Z

tests/python_tests/test_parsers.py

+    extended = stream_string[:]
+    extended.append("")
+
+    for parser in parsers:
+        for (prev_subword, subword) in zip(extended, stream_string):


The zip operation creates pairs where prev_subword comes from extended (which has an extra empty string at the end) and subword comes from the original stream_string. This creates incorrect pairing - the last element of extended will be paired with the last element of stream_string, not providing the intended previous/current relationship.

Suggested change

extended = stream_string[:]

extended.append("")

for parser in parsers:

for (prev_subword, subword) in zip(extended, stream_string):

prev_subwords = [""] + stream_string[:-1]

for parser in parsers:

for (prev_subword, subword) in zip(prev_subwords, stream_string):

Copilot · 2025-09-25T13:07:14Z

tests/python_tests/test_parsers.py

+    extended = stream_string[:]
+    extended.append("")
+
+    for parser in parsers:
+        for (prev_subword, subword) in zip(extended, stream_string):


The zip operation creates pairs where prev_subword comes from extended (which has an extra empty string at the end) and subword comes from the original stream_string. This creates incorrect pairing - the last element of extended will be paired with the last element of stream_string, not providing the intended previous/current relationship.

Suggested change

extended = stream_string[:]

extended.append("")

for parser in parsers:

for (prev_subword, subword) in zip(extended, stream_string):

# Pair each subword with its previous subword (first element has no previous)

prev_subwords = [""] + stream_string[:-1]

for parser in parsers:

for (prev_subword, subword) in zip(prev_subwords, stream_string):

Copilot · 2025-09-25T13:07:15Z

src/python/py_streamers.cpp

+        // .def("write",
+        //      py::overload_cast<ParsedMessage&>(&TextParserStreamer::write),
+        //      py::arg("message"),
+        //      "Write is called with a ParsedMessage. Returns StreamingStatus.")


Commented-out code should be removed rather than left in the source. If this functionality is needed later, version control can provide the history.

Suggested change

// .def("write",

// py::overload_cast<ParsedMessage&>(&TextParserStreamer::write),

// py::arg("message"),

// "Write is called with a ParsedMessage. Returns StreamingStatus.")

Copilot · 2025-09-25T13:07:15Z

src/cpp/src/parsers.cpp

+#include <vector>
+#include <cctype>
+#include <stdexcept>
+#include <bits/stdc++.h>


The header <bits/stdc++.h> is non-standard and GCC-specific. Use individual standard headers instead (e.g., <algorithm>, <iostream>, etc.) for better portability.

Suggested change

#include <bits/stdc++.h>

Copilot · 2025-09-25T13:07:15Z

src/cpp/src/parsers.cpp

+// static initializer to register available buildin parsers
+static bool register_backends() {
+    IncrementalParserBase::registered_parsers[DeepSeekR1ReasoningParser::name()] = std::make_shared<DeepSeekR1ReasoningParser>();
+    IncrementalParserBase::registered_parsers[DeepSeekR1ReasoningParser::name()] = std::make_shared<DeepSeekR1ReasoningParser>();


Line 125 is a duplicate of line 124, registering the same parser twice under the same name. The second registration overwrites the first and should likely be removed or changed to register a different parser variant.

Suggested change

IncrementalParserBase::registered_parsers[DeepSeekR1ReasoningParser::name()] = std::make_shared<DeepSeekR1ReasoningParser>();

Copilot · 2025-09-25T13:07:16Z

src/cpp/src/text_streamer.cpp

+        // if (parser->is_active()) {
+            m_parsed_message = parser->parse(m_parsed_message, m_text_buffer, message);
+        // }
+        // m_parsed_message["content"] += message;


Commented-out code should be removed. If the is_active() check or content concatenation is needed, it should be implemented properly rather than left as commented code.

Suggested change

// m_parsed_message["content"] += message;

// Concatenate message to "content" field in m_parsed_message

if (m_parsed_message.find("content") != m_parsed_message.end() && m_parsed_message["content"].index() == 1) { // assuming std::string is at index 1

m_parsed_message["content"] = std::get<std::string>(m_parsed_message["content"]) + message;

} else {

m_parsed_message["content"] = message;

}

Copilot · 2025-09-25T13:07:16Z

src/cpp/src/llm/pipeline.cpp

+    if (!parsers.empty()) {
+        for (size_t i = 0; i < res.texts.size(); ++i) {
+            auto& message = res.texts[i];
+            ParsedMessage msg;


The parser's parse method expects a ParsedMessage& as input but is being called with msg which should contain the message content. The input should be initialized with the text content: ParsedMessage input; input[\"content\"] = message; msg = parser->parse(input);

Suggested change

ParsedMessage msg;

ParsedMessage input;

input["content"] = message;

ParsedMessage msg = input;

Copilot · 2025-09-25T13:07:17Z

src/python/openvino_genai/py_openvino_genai.pyi

    def write(self, token: typing.SupportsInt | collections.abc.Sequence[typing.SupportsInt]) -> StreamingStatus:
        ...


[nitpick] The end() method was removed from the TextStreamer class definition but this appears to be an intentional API change. However, verify that all dependent code has been updated to handle the absence of this method.

Copilot · 2025-09-25T13:07:17Z

samples/cpp/text_generation/parsed_output_sample.cpp

+
+#include "openvino/genai/llm_pipeline.hpp"
+#include "openvino/genai/parsers.hpp"
+#include "openvino/genai/text_streamer.hpp"


The ParsingState type is used in a using declaration but is not defined in any of the included headers. This will result in a compilation error.

Suggested change

#include "openvino/genai/text_streamer.hpp"

#include "openvino/genai/text_streamer.hpp"

#include "openvino/genai/parsing_state.hpp"

Copilot · 2025-09-25T13:07:17Z

samples/cpp/text_generation/parsed_output_sample.cpp

+public:
+    CurrentStreamer(const ov::genai::Tokenizer& tokenizer)
+        : ov::genai::TextParserStreamer(tokenizer) {}
+    ov::genai::StreamingStatus write(const ov::genai::ParsedMessage& message) {


The method signature uses const ov::genai::ParsedMessage& message but the base class method TextParserStreamer::write expects ParsedMessage& message (non-const reference). This signature mismatch will prevent proper virtual function override.

Suggested change

ov::genai::StreamingStatus write(const ov::genai::ParsedMessage& message) {

ov::genai::StreamingStatus write(ov::genai::ParsedMessage& message) {

src/cpp/include/openvino/genai/parsers.hpp

pavel-esir · 2025-09-25T13:09:16Z

src/cpp/include/openvino/genai/parsers.hpp

+class Llama32PythonicParser : public ParserBase {
+// Does not modify original content, only extracts and adds tool calls
+public:
+    Llama32PythonicParser(bool keep_original_content = true) : m_keep_original_content(keep_original_content) {}


Check what default is in VLLM

src/cpp/include/openvino/genai/parsers.hpp

pavel-esir · 2025-09-25T13:20:53Z

src/cpp/src/llm/pipeline.cpp

+        ParsedMessage msg;
+        for (auto& parser: incremental_parsers) {
+            // Previous is and empty message because we populate message with the full generated text.
+            msg = parser->parse(msg, "", message);


Check the state of incremental parser and reset if necessary

src/cpp/src/llm/pipeline.cpp

Wovchena · 2025-09-26T10:46:05Z

src/cpp/include/openvino/genai/parsers.hpp

+
+class DeepSeekR1ReasoningParser : public IncrementalParserBase {
+private:
+    bool m_starts_with_thinking = true;


Private members in public API are clumsy. Everyone can see them but can't do anything with them. Make them protected. Alternatively you can only keep string way of requesting such parsers, removing the class from public headers. If someone needs it, they can still copy the implementation and adjust introducing a new parser

src/cpp/src/parsers.hpp

src/cpp/include/openvino/genai/parsers.hpp

Copilot AI review requested due to automatic review settings September 25, 2025 13:03

pavel-esir marked this pull request as draft September 25, 2025 13:03

pavel-esir commented Sep 25, 2025

View reviewed changes

Copilot AI reviewed Sep 25, 2025

View reviewed changes

pavel-esir commented Sep 25, 2025

View reviewed changes

src/cpp/include/openvino/genai/parsers.hpp Show resolved Hide resolved

pavel-esir commented Sep 25, 2025

View reviewed changes

src/cpp/include/openvino/genai/parsers.hpp Outdated Show resolved Hide resolved

pavel-esir commented Sep 25, 2025

View reviewed changes

src/cpp/src/llm/pipeline.cpp Outdated Show resolved Hide resolved

Wovchena reviewed Sep 26, 2025

View reviewed changes

pavel-esir added 4 commits September 29, 2025 20:35

tests are green

83d86b3

python api added

debedc1

refactored

aed2588

add parsing at the end of generate()

18615be

pavel-esir force-pushed the add_parsing branch from 450c7a8 to b15fc5f Compare September 29, 2025 23:01

github-actions bot added category: continuous batching Continuous batching category: GHA CI based on Github actions labels Sep 29, 2025

pavel-esir force-pushed the add_parsing branch from b15fc5f to 290c821 Compare September 29, 2025 23:09

hide map with predefined initialized parsers; add TODOs

290c821

MaximProshin added the Code Freeze label Oct 10, 2025

		namespace genai {


		using ParsedMessage = std::map<std::string, std::string>;

	using ParsedMessage = std::map<std::string, std::string>;
	// TODO: will be converted to JSONLike object
	using ParsedMessage = std::map<std::string, std::string>;

-        // m_parsed_message["content"] += message;
+        // Concatenate message to "content" field in m_parsed_message
+        if (m_parsed_message.find("content") != m_parsed_message.end() && m_parsed_message["content"].index() == 1) { // assuming std::string is at index 1
+            m_parsed_message["content"] = std::get<std::string>(m_parsed_message["content"]) + message;
+        } else {
+            m_parsed_message["content"] = message;
+        }

-            ParsedMessage msg;
+            ParsedMessage input;
+            input["content"] = message;
+            ParsedMessage msg = input;

		def write(self, token: typing.SupportsInt \| collections.abc.Sequence[typing.SupportsInt]) -> StreamingStatus:
		...

	#include "openvino/genai/text_streamer.hpp"
	#include "openvino/genai/text_streamer.hpp"
	#include "openvino/genai/parsing_state.hpp"

	ov::genai::StreamingStatus write(const ov::genai::ParsedMessage& message) {
	ov::genai::StreamingStatus write(ov::genai::ParsedMessage& message) {

Add parsing #2772

Are you sure you want to change the base?

Add parsing #2772

Conversation

pavel-esir commented Sep 25, 2025

Description

Checklist:

Uh oh!

pavel-esir Sep 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Copilot AI Sep 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 25, 2025

Choose a reason for hiding this comment

Uh oh!

Copilot AI Sep 25, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pavel-esir Sep 25, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pavel-esir Sep 25, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Wovchena Sep 26, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants