
Conversation

songbell

Eagle 3 validation mode for multinomial sampling
To be merged after #2740.

Signed-off-by: fishbell <bell.song@intel.com>
@Copilot review requested due to automatic review settings (September 30, 2025 03:25)
@github-actions bot added labels: category: llm_bench, category: continuous batching, category: LLM, category: sampling, category: speculative decoding, category: LoRA, category: cmake / build, category: CPP API, no-match-files (September 30, 2025)
Copilot AI (Contributor) left a comment


Pull Request Overview

This PR enables validation mode for Eagle 3 speculative decoding, specifically adding support for multinomial sampling. The change integrates EAGLE 3 model capabilities into the continuous batching pipeline with specialized handling for hidden state management and token validation.

  • Adds eagle3_mode flag to llm_bench for enabling EAGLE 3 speculative decoding
  • Implements comprehensive EAGLE 3 support with model transformations, hidden state extraction, and specialized sampling
  • Extends tokenization to support EAGLE 3 mode with conditional special token handling
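
For orientation, a minimal sketch of how a speculative decoding pipeline is constructed with the GenAI C++ API follows. The ov::genai::draft_model property and the num_assistant_tokens / do_sample fields are the existing public surface; the EAGLE 3 specific wiring added by this PR (hidden state sharing between draft and main model) happens inside the pipeline and is not visible here. Model paths are placeholders.

#include <iostream>
#include <string>

#include "openvino/genai/llm_pipeline.hpp"

int main() {
    // Placeholder paths to the main model and the EAGLE 3 draft model.
    std::string main_model = "/path/to/main_model";
    std::string draft_model = "/path/to/eagle3_draft_model";

    // Passing the draft model as a property turns on speculative decoding.
    ov::genai::LLMPipeline pipe(main_model, "CPU",
                                ov::genai::draft_model(draft_model, "CPU"));

    ov::genai::GenerationConfig config;
    config.max_new_tokens = 64;
    config.num_assistant_tokens = 5;  // draft tokens proposed per step
    config.do_sample = true;          // multinomial sampling, the case this PR validates
    config.temperature = 0.7f;

    std::string result = pipe.generate("Why is the sky blue?", config);
    std::cout << result << std::endl;
    return 0;
}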

Reviewed Changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 6 comments.

Summary per file:

  • tools/llm_bench/benchmark.py: Adds command-line flag for EAGLE 3 mode
  • tools/llm_bench/task/text_generation.py: Implements conditional tokenization for EAGLE 3 mode
  • tools/continuous_batching/accuracy/continuous_batching_eagle_decoding.cpp: Adds new EAGLE decoding accuracy testing tool
  • tools/continuous_batching/accuracy/CMakeLists.txt: Builds the new EAGLE decoding test executable
  • src/cpp/src/speculative_decoding/: Core EAGLE 3 implementation with model transforms and pipeline logic
  • src/cpp/src/continuous_batching/: Pipeline integration and model runner enhancements for EAGLE mode
  • src/cpp/src/safe_tensor_wrapper.*: Utility for parsing SafeTensors-format files
  • src/cpp/include/openvino/genai/: Header updates for EAGLE mode support
Comments suppressed due to low confidence (1)

src/cpp/src/speculative_decoding/speculative_decoding_impl.cpp:1

  • The magic number 0.9f should be defined as a named constant to improve code readability and maintainability.
// Copyright (C) 2023-2025 Intel Corporation
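
A minimal illustration of the suggested fix; the constant name is hypothetical, since the hunk does not show what the 0.9f threshold controls:

// Hypothetical name: pick one that reflects what the 0.9f value actually gates.
constexpr float DEFAULT_ACCEPTANCE_THRESHOLD = 0.9f;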


tokenization_start = time.perf_counter()
input_data = tokenizer.encode(input_text_list)
if args.get("eagle3_mode"):
    # eagle3 needs to disable special tokens to ensure compress rate

Copilot AI Sep 30, 2025


The comment mentions 'compress rate' but doesn't explain what compression is being referred to or why disabling special tokens improves it. Consider adding a more detailed comment explaining the relationship between EAGLE 3 mode and special token handling.

Suggested change
# eagle3 needs to disable special tokens to ensure compress rate
# In EAGLE 3 mode, the model uses a custom tokenization and compression scheme that achieves a higher compression rate
# (i.e., fewer tokens per input) when special tokens (such as BOS/EOS) are omitted. Disabling special tokens here ensures
# that the input is tokenized in a way that is compatible with EAGLE 3's expectations and maximizes compression efficiency.
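
On the C++ side, the GenAI tokenizer exposes the same switch through the add_special_tokens property; a minimal sketch, with a placeholder model path:

#include "openvino/genai/tokenizer.hpp"

int main() {
    ov::genai::Tokenizer tokenizer("/path/to/model_dir");  // placeholder
    // Per the review suggestion above, EAGLE 3 expects inputs tokenized
    // without special tokens such as BOS/EOS.
    ov::genai::TokenizedInputs inputs =
        tokenizer.encode("prompt text", ov::genai::add_special_tokens(false));
    return 0;
}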



std::vector<ov::genai::GenerationConfig> get_spec_decoding_generation_config_examples() {

// sampling param for speulative decoding

Copilot AI Sep 30, 2025


Typo in comment: 'speulative' should be 'speculative'.

Suggested change
// sampling param for speulative decoding
// sampling param for speculative decoding


print_cb_generation_result(generation_result);
break;
case ov::genai::GenerationStatus::IGNORED:
std::cout << "Request was ignored due to lack of memory." <<std::endl;

Copilot AI Sep 30, 2025


Missing space before '<<std::endl;'. Should be consistent with spacing pattern used elsewhere in the file.

Suggested change
std::cout << "Request was ignored due to lack of memory." <<std::endl;
std::cout << "Request was ignored due to lack of memory." << std::endl;


break;
case ov::genai::GenerationStatus::STOP:
case ov::genai::GenerationStatus::CANCEL:
std::cout << "Request was aborted." <<std::endl;

Copilot AI Sep 30, 2025


Missing space before '<<std::endl;'. Should be consistent with spacing pattern used elsewhere in the file.

Suggested change
std::cout << "Request was aborted." <<std::endl;
std::cout << "Request was aborted." << std::endl;



if (!m_new_results.empty()) {
model->add_results(m_new_results);
std::cout << "EagleModelTransform - Added last hidden output " << std::endl;

Copilot AI Sep 30, 2025


Using std::cout for logging in production code is not recommended. Consider using a proper logging framework or at least make these debug-only outputs that can be conditionally compiled out.
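
One conventional fix, sketched with a hypothetical compile-time switch (the project may already have its own logging facility):

#include <iostream>

// EAGLE_DEBUG_LOGGING is a hypothetical build flag; release builds compile
// the statement away entirely.
#ifdef EAGLE_DEBUG_LOGGING
#    define EAGLE_LOG(msg) (std::cerr << "[eagle] " << msg << std::endl)
#else
#    define EAGLE_LOG(msg) ((void)0)
#endif

// At the call site flagged above:
// EAGLE_LOG("EagleModelTransform - added last hidden output");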


Comment on lines +183 to +184
// m_initial_hidden_states.clear();
//auto key = std::make_pair(request_id, seq_grouped_id);

Copilot AI Sep 30, 2025


Commented-out code should be removed rather than left in the codebase. If this is for future use, consider adding a TODO comment explaining the purpose.

Suggested change
// m_initial_hidden_states.clear();
//auto key = std::make_pair(request_id, seq_grouped_id);
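
If the lines are kept for future work instead of deleted, a TODO of this shape would address the review; the stated rationale is an assumption:

// TODO(songbell): restore per-(request_id, seq_grouped_id) keying of
// m_initial_hidden_states if per-sequence hidden state reuse is needed.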


@songbell marked this pull request as draft (September 30, 2025 03:26)
Signed-off-by: fishbell <bell.song@intel.com>