Skip to content

Conversation

LysandreJik
Copy link
Member

Needs this to be merged first: #41444

Tests and docs need to be added before undraft

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Base automatically changed from streaming-requests-cb to main October 10, 2025 08:24
@LysandreJik LysandreJik marked this pull request as ready for review October 14, 2025 08:27
@LysandreJik
Copy link
Member Author

Test need a bit of a refactor, which will land once we merge @Wauplin's PR here: #41487

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am missing a lot of context, but let's make sure funcs are not nested if possible, and func names are helpful + document why we need themn

from huggingface_hub import model_info
from huggingface_hub.constants import HF_HUB_OFFLINE
from openai.types.chat.chat_completion import Choice
from starlette.responses import StreamingResponse
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDK what starlette is, prob a new deps?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(also do we really need it?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the import from starlette to the one from FastAPI -> the FastAPI one is a re-export of the starlette one, but it's more coherent from an import shielding perspective

return stream_chat_completion(generation_streamer, request_id)
if req.get("stream"):

def sse(_generator):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a very helpful func name

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, replaced it by map so that it's cleaner

load_and_register_attn_kernel(applicable_attn_implementation)
# log that we used kernel fallback if successful
if attn_implementation.startswith("flash_attention"):
if attn_implementation.startswith("flash_"):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should check "flash_" in attn_implementation" because if we fallback / use a kernel, it starts with kernel_community/....

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea!

@LysandreJik
Copy link
Member Author

I cleaned it up a bit @ArthurZucker; IMO we might want to take a look at simplifying the stream/non-stream path in the future, depending on how the tool handling is done (still needs to be implemented for non-streaming generate and everything CB).

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

@LysandreJik LysandreJik merged commit 13a35a5 into main Oct 15, 2025
26 checks passed
@LysandreJik LysandreJik deleted the serving-lighteval branch October 15, 2025 07:37
i3hz pushed a commit to i3hz/transformers that referenced this pull request Oct 15, 2025
* Enable non-streaming in transformers serve

Remove typos

Remove typos

Remove typos

* Fix tests

* Arthur review
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants