
Conversation

songbell

Eagle 3 validation mode for multinomial sampling
To be merged after #2740.

Signed-off-by: fishbell <bell.song@intel.com>
@Copilot review requested due to automatic review settings (September 30, 2025 03:25)
@github-actions bot added labels: category: llm_bench, category: continuous batching, category: LLM, category: sampling, category: speculative decoding, category: LoRA, category: cmake / build, category: CPP API, no-match-files (September 30, 2025)
Copilot AI (Contributor) left a comment


Pull Request Overview

This PR enables validation mode for Eagle 3 speculative decoding, specifically adding support for multinomial sampling. The change integrates EAGLE 3 model capabilities into the continuous batching pipeline with specialized handling for hidden state management and token validation.

  • Adds eagle3_mode flag to llm_bench for enabling EAGLE 3 speculative decoding
  • Implements comprehensive EAGLE 3 support with model transformations, hidden state extraction, and specialized sampling
  • Extends tokenization to support EAGLE 3 mode with conditional special token handling
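
For orientation, a minimal sketch of how a speculative decoding pipeline is constructed with the GenAI C++ API follows. The ov::genai::draft_model property and the num_assistant_tokens / do_sample fields are the existing public surface; the EAGLE 3 specific wiring added by this PR (hidden state sharing between draft and main model) happens inside the pipeline and is not visible here. Model paths are placeholders.

#include <iostream>
#include <string>

#include "openvino/genai/llm_pipeline.hpp"

int main() {
    // Placeholder paths to the main model and the EAGLE 3 draft model.
    std::string main_model = "/path/to/main_model";
    std::string draft_model = "/path/to/eagle3_draft_model";

    // Passing the draft model as a property turns on speculative decoding.
    ov::genai::LLMPipeline pipe(main_model, "CPU",
                                ov::genai::draft_model(draft_model, "CPU"));

    ov::genai::GenerationConfig config;
    config.max_new_tokens = 64;
    config.num_assistant_tokens = 5;  // draft tokens proposed per step
    config.do_sample = true;          // multinomial sampling, the case this PR validates
    config.temperature = 0.7f;

    std::string result = pipe.generate("Why is the sky blue?", config);
    std::cout << result << std::endl;
    return 0;
}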

Reviewed Changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 6 comments.

Summary per file:

  • tools/llm_bench/benchmark.py: Adds command-line flag for EAGLE 3 mode
  • tools/llm_bench/task/text_generation.py: Implements conditional tokenization for EAGLE 3 mode
  • tools/continuous_batching/accuracy/continuous_batching_eagle_decoding.cpp: Adds new EAGLE decoding accuracy testing tool
  • tools/continuous_batching/accuracy/CMakeLists.txt: Builds the new EAGLE decoding test executable
  • src/cpp/src/speculative_decoding/: Core EAGLE 3 implementation with model transforms and pipeline logic
  • src/cpp/src/continuous_batching/: Pipeline integration and model runner enhancements for EAGLE mode
  • src/cpp/src/safe_tensor_wrapper.*: Utility for parsing SafeTensors-format files
  • src/cpp/include/openvino/genai/: Header updates for EAGLE mode support
Comments suppressed due to low confidence (1)

src/cpp/src/speculative_decoding/speculative_decoding_impl.cpp:1

  • The magic number 0.9f should be defined as a named constant to improve code readability and maintainability.
// Copyright (C) 2023-2025 Intel Corporation
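
A minimal illustration of the suggested fix; the constant name is hypothetical, since the hunk does not show what the 0.9f threshold controls:

// Hypothetical name: pick one that reflects what the 0.9f value actually gates.
constexpr float DEFAULT_ACCEPTANCE_THRESHOLD = 0.9f;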


tokenization_start = time.perf_counter()
input_data = tokenizer.encode(input_text_list)
if args.get("eagle3_mode"):
    # eagle3 needs to disable special tokens to ensure compress rate

Copilot AI Sep 30, 2025


The comment mentions 'compress rate' but doesn't explain what compression is being referred to or why disabling special tokens improves it. Consider adding a more detailed comment explaining the relationship between EAGLE 3 mode and special token handling.

Suggested change
# eagle3 needs to disable special tokens to ensure compress rate
# In EAGLE 3 mode, the model uses a custom tokenization and compression scheme that achieves a higher compression rate
# (i.e., fewer tokens per input) when special tokens (such as BOS/EOS) are omitted. Disabling special tokens here ensures
# that the input is tokenized in a way that is compatible with EAGLE 3's expectations and maximizes compression efficiency.
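
On the C++ side, the GenAI tokenizer exposes the same switch through the add_special_tokens property; a minimal sketch, with a placeholder model path:

#include "openvino/genai/tokenizer.hpp"

int main() {
    ov::genai::Tokenizer tokenizer("/path/to/model_dir");  // placeholder
    // Per the review suggestion above, EAGLE 3 expects inputs tokenized
    // without special tokens such as BOS/EOS.
    ov::genai::TokenizedInputs inputs =
        tokenizer.encode("prompt text", ov::genai::add_special_tokens(false));
    return 0;
}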



std::vector<ov::genai::GenerationConfig> get_spec_decoding_generation_config_examples() {

// sampling param for speulative decoding

Copilot AI Sep 30, 2025


Typo in comment: 'speulative' should be 'speculative'.

Suggested change
// sampling param for speulative decoding
// sampling param for speculative decoding


print_cb_generation_result(generation_result);
break;
case ov::genai::GenerationStatus::IGNORED:
std::cout << "Request was ignored due to lack of memory." <<std::endl;

Copilot AI Sep 30, 2025


Missing space before '<<std::endl;'. Should be consistent with spacing pattern used elsewhere in the file.

Suggested change
std::cout << "Request was ignored due to lack of memory." <<std::endl;
std::cout << "Request was ignored due to lack of memory." << std::endl;


break;
case ov::genai::GenerationStatus::STOP:
case ov::genai::GenerationStatus::CANCEL:
std::cout << "Request was aborted." <<std::endl;

Copilot AI Sep 30, 2025


Missing space before '<<std::endl;'. Should be consistent with spacing pattern used elsewhere in the file.

Suggested change
std::cout << "Request was aborted." <<std::endl;
std::cout << "Request was aborted." << std::endl;



if (!m_new_results.empty()) {
model->add_results(m_new_results);
std::cout << "EagleModelTransform - Added last hidden output " << std::endl;

Copilot AI Sep 30, 2025


Using std::cout for logging in production code is not recommended. Consider using a proper logging framework or at least make these debug-only outputs that can be conditionally compiled out.
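
One conventional fix, sketched with a hypothetical compile-time switch (the project may already have its own logging facility):

#include <iostream>

// EAGLE_DEBUG_LOGGING is a hypothetical build flag; release builds compile
// the statement away entirely.
#ifdef EAGLE_DEBUG_LOGGING
#    define EAGLE_LOG(msg) (std::cerr << "[eagle] " << msg << std::endl)
#else
#    define EAGLE_LOG(msg) ((void)0)
#endif

// At the call site flagged above:
// EAGLE_LOG("EagleModelTransform - added last hidden output");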


Comment on lines +183 to +184
// m_initial_hidden_states.clear();
//auto key = std::make_pair(request_id, seq_grouped_id);

Copilot AI Sep 30, 2025


Commented-out code should be removed rather than left in the codebase. If this is for future use, consider adding a TODO comment explaining the purpose.

Suggested change
// m_initial_hidden_states.clear();
//auto key = std::make_pair(request_id, seq_grouped_id);
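
If the lines are kept for future work instead of deleted, a TODO of this shape would address the review; the stated rationale is an assumption:

// TODO(songbell): restore per-(request_id, seq_grouped_id) keying of
// m_initial_hidden_states if per-sequence hidden state reuse is needed.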


@songbell marked this pull request as draft (September 30, 2025 03:26)
Signed-off-by: fishbell <bell.song@intel.com>