Enable non-streaming mode in transformers serve
#41446
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I am missing a lot of context, but let's make sure functions are not nested if possible, and that function names are helpful; also, document why we need them.
src/transformers/commands/serving.py (Outdated)
from huggingface_hub import model_info
from huggingface_hub.constants import HF_HUB_OFFLINE
from openai.types.chat.chat_completion import Choice
from starlette.responses import StreamingResponse
IDK what starlette is, prob a new dependency?
(also do we really need it?)
I moved the import from starlette to the one from FastAPI; the FastAPI one is a re-export of the starlette one, but it's more coherent from an import-shielding perspective.
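For context, the "re-export" point can be illustrated with a minimal sketch using hypothetical stand-in modules (so it runs without FastAPI or Starlette installed): the framework simply rebinds the vendor's class, so swapping the import source yields the very same object.

```python
import types

# Hypothetical stand-ins: "vendor" plays the role of starlette,
# "framework" the role of FastAPI re-exporting the vendor's class.
vendor = types.ModuleType("vendor")

class StreamingResponse:
    """Defined once, in the low-level package."""

vendor.StreamingResponse = StreamingResponse

framework = types.ModuleType("framework")
framework.StreamingResponse = vendor.StreamingResponse  # the re-export

# Importing via the framework yields the identical class object, so the
# import move changes nothing at runtime — only which dependency the
# importing code names directly.
assert framework.StreamingResponse is vendor.StreamingResponse
```

This is why the swap is purely an import-hygiene change: callers depend on one top-level package instead of reaching into its transitive dependency.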
src/transformers/commands/serving.py (Outdated)
return stream_chat_completion(generation_streamer, request_id)
if req.get("stream"):

def sse(_generator):
not a very helpful func name
Correct, replaced it with map so that it's cleaner.
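A minimal sketch of the idea (hypothetical payloads and helper name, not the PR's actual code): instead of a nested `sse` function, a top-level formatter applied with `map` turns each generation chunk into a Server-Sent Events frame.

```python
import json

def to_sse(payload: dict) -> str:
    # One Server-Sent Events frame: "data: <json>" followed by a blank line.
    return f"data: {json.dumps(payload)}\n\n"

# A generator of completion chunks (hypothetical), wrapped with map
# rather than a nested helper function.
chunks = ({"token": t} for t in ["Hello", " world"])
events = map(to_sse, chunks)

assert next(events) == 'data: {"token": "Hello"}\n\n'
```

`map` keeps the lazy, one-chunk-at-a-time behavior a streaming response needs, without introducing a nested closure.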
src/transformers/modeling_utils.py (Outdated)
load_and_register_attn_kernel(applicable_attn_implementation)
# log that we used kernel fallback if successful
- if attn_implementation.startswith("flash_attention"):
+ if attn_implementation.startswith("flash_"):
we should check `"flash_" in attn_implementation` instead, because if we fall back to / use a kernel, the name starts with kernel_community/....
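A small illustration of why the substring test is needed (the fallback name below is hypothetical, just shaped like the `kernel_community/...` prefix mentioned above):

```python
# A plain implementation name vs. a kernel-fallback name (hypothetical).
plain = "flash_attention_2"
fallback = "kernel_community/flash_attn_example"

assert "flash_" in plain                   # substring test matches both
assert "flash_" in fallback
assert not fallback.startswith("flash_")   # startswith misses the fallback
```

With `startswith`, the "used kernel fallback" log line would never fire for kernel-prefixed names; the `in` check catches both spellings.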
Good idea!
I cleaned it up a bit @ArthurZucker; IMO we might want to take a look at simplifying the stream/non-stream path in the future, depending on how the tool handling is done (still needs to be implemented for non-streaming generate and everything CB).
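For readers without the diff open, the stream/non-stream split this PR enables can be sketched as follows (hypothetical function and return values, not the actual serving.py code): the request's `stream` flag selects between an SSE streaming response and a single JSON completion.

```python
def handle_chat_completion(req: dict) -> str:
    """Hypothetical dispatch mirroring a chat-completions endpoint."""
    if req.get("stream"):
        # Streaming path: wrap the token generator in an SSE response.
        return "streaming-response"
    # Non-streaming path (new in this PR): run generation to completion
    # and return one JSON body.
    return "json-response"

assert handle_chat_completion({"stream": True}) == "streaming-response"
assert handle_chat_completion({}) == "json-response"
```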
thanks!
* Enable non-streaming in transformers serve
* Remove typos
* Fix tests
* Arthur review
Needs this to be merged first: #41444
Tests and docs need to be added before undrafting.