diff --git a/docs/pull_hf_models.md b/docs/pull_hf_models.md
index 09b84ed822..16dbe5b2a7 100644
--- a/docs/pull_hf_models.md
+++ b/docs/pull_hf_models.md
@@ -1,19 +1,20 @@
 # OVMS Pull mode {#ovms_docs_pull}
-This documents describes how to leverage OpenVINO Model Server (OVMS) pull feature to automate deployment configuration with Generative AI models. When pulling from [OpenVINO organization](https://huggingface.co/OpenVINO) from HF no additional steps are required. However, when pulling models [outside of the OpenVINO](https://github.com/openvinotoolkit/model_server/blob/main/docs/pull_optimum_cli.md) organization you have to install additional python dependencies when using baremetal execution so that optimum-cli is available for ovms executable or build the OVMS python container for docker deployments. In summary you have 2 options:
+This document describes how to leverage the OpenVINO Model Server (OVMS) pull feature to automate deployment configuration with Generative AI models. When pulling models already in OpenVINO IR format from [Hugging Face Hub](https://huggingface.co/), no additional steps are required. However, when pulling models in PyTorch format, you have to install additional Python dependencies for baremetal execution so that optimum-cli is available to the ovms executable, or rely on the docker image `openvino/model_server:latest-py`. In summary, you have two options:
 
-- pulling preconfigured models in IR format from OpenVINO organization
-- pulling models with automatic conversion and quantization (requires optimum-cli). Include additional consideration like longer time for deployment and pulling model data (original model) from HF, model memory for conversion, diskspace - described [here](https://github.com/openvinotoolkit/model_server/blob/main/docs/pull_optimum_cli.md)
+- pulling pre-configured models in IR format (described below)
+- pulling models with automatic conversion and quantization via optimum-cli, described on the [pulling with conversion](https://github.com/openvinotoolkit/model_server/blob/main/docs/pull_optimum_cli.md) page
 
-### Pulling the models
+> **Note:** Models in IR format must be exported using `optimum-cli`, including tokenizer and detokenizer files also in IR format, if applicable. If they are missing, the tokenizer and detokenizer should be added using the `convert_tokenizer --with-detokenizer` tool.
 
-There is a special mode to make OVMS pull the model from Hugging Face before starting the service:
+## Pulling pre-configured models
+
+There is a special OVMS mode to pull the model from Hugging Face before starting the service. It is triggered by the `--source_model` parameter. In addition, the `--pull` parameter restricts the run to pulling alone: the application quits after the model is downloaded. Without the `--pull` option, the model will be deployed and the server started.
 
 ::::{tab-set}
 
 :::{tab-item} With Docker
 :sync: docker
 
 **Required:** Docker Engine installed
-
 ```text
 docker run $(id -u):$(id -g) --rm -v <model_repository_path>:/models:rw openvino/model_server:latest --pull --source_model <source_model> --model_repository_path /models --model_name <model_name> --target_device <target_device> --task <task> [TASK_SPECIFIC_PARAMETERS]
 ```
@@ -55,9 +56,8 @@ ovms --pull --source_model "OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov" --model_r
 
 ::::
 
-It will prepare all needed configuration files to support LLMS with OVMS in the model repository. Check [parameters page](./parameters.md) for detailed descriptions of configuration options and parameter usage.
+Check the [parameters page](./parameters.md) for detailed descriptions of configuration options and parameter usage.
 
-In case you want to setup model and start server in one step follow instructions on [this page](./starting_server.md).
+In case you want to set up the model and start the server in one step, follow the [instructions](./starting_server.md).
 
-*Note:*
-When using pull mode you need both read and write access rights to models repository.
+> **Note:** When using pull mode, you need both read and write access rights to the models repository.
diff --git a/docs/pull_optimum_cli.md b/docs/pull_optimum_cli.md
index 8ae83a4ac4..2c606b345c 100644
--- a/docs/pull_optimum_cli.md
+++ b/docs/pull_optimum_cli.md
@@ -1,40 +1,16 @@
 # OVMS Pull mode with optimum cli {#ovms_docs_pull_optimum}
-This documents describes how to leverage OpenVINO Model Server (OVMS) pull feature to automate deployment configuration with Generative AI models when pulling outside of [OpenVINO organization](https://huggingface.co/OpenVINO) from HF.
+This document describes how to leverage the OpenVINO Model Server (OVMS) pull feature to automate deployment of generative models from HF with conversion and quantization at runtime.
 
-You have to use docker image with optimum-cli or install additional python dependencies to the baremetal package. Follow the steps described below.
+You have to use the docker image with optimum-cli, `openvino/model_server:latest-py`, or install additional Python dependencies in the baremetal package. Follow the steps described below.
 
-Pulling models with automatic conversion and quantization (requires optimum-cli). Include additional consideration like longer time for deployment and pulling model data (original model) from HF, model memory for conversion, diskspace.
+> **Note:** This procedure might increase memory usage during the model conversion and requires downloading the original model. Expect memory usage at least at the level of the original model size during the conversion.
 
-Note: Pulling the models from HuggingFace Hub can automate conversion and compression. It might however increase memory usage during the model conversion and requires downloading the original model.
-
-## OVMS building and installation for optimum-cli integration
-### Build python docker image
-```bash
-git clone https://github.com/openvinotoolkit/model_server.git
-cd model_server
-make python_image
-```
-
-Example pull command with optimum model cache directory sharing and setting HF_TOKEN environment variable for model download authentication.
-
-```bash
-docker run -e HF_TOKEN=hf_YOURTOKEN -e HF_HOME=/hf_home/cache --user $(id -u):$(id -g) --group-add=$(id -g) -v /opt/home/user/.cache/huggingface/:/hf_home/cache -v $(pwd)/models:/models:rw openvino/model_server:py --pull --model_repository_path /models --source_model meta-llama/Meta-Llama-3-8B-Instruct
-```
-
-### Install optimum-cli
-Install python on your baremetal system from `https://www.python.org/downloads/` and run the commands:
-```console
-pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt
-```
-
-or use the python binary from the ovms_windows_python_on.zip or ovms.tar.gz package - see [deployment instructions](deploying_server_baremetal.md) for details.
 
+## Add optimum-cli to OVMS installation on Windows
 ```bat
 curl -L https://github.com/openvinotoolkit/model_server/releases/download/v2025.3/ovms_windows_python_on.zip -o ovms.zip
 tar -xf ovms.zip
-```
-```bat
 ovms\setupvars.bat
 ovms\python\python -m pip install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt
 ```
@@ -50,14 +26,12 @@ Using `--pull` parameter, we can use OVMS to download the model, quantize and co
 **Required:** Docker Engine installed
 
 ```text
-docker run $(id -u):$(id -g) --rm -v <model_repository_path>:/models:rw openvino/model_server:py --pull --source_model <source_model> --model_repository_path /models --model_name <model_name> --target_device <target_device> --task <task> [TASK_SPECIFIC_PARAMETERS]
+docker run $(id -u):$(id -g) --rm -v <model_repository_path>:/models:rw openvino/model_server:latest-py --pull --source_model <source_model> --model_repository_path /models --model_name <model_name> --target_device <target_device> --task <task> [TASK_SPECIFIC_PARAMETERS]
 ```
 :::
 
 :::{tab-item} On Baremetal Host
 :sync: baremetal
-**Required:** OpenVINO Model Server package - see [deployment instructions](./deploying_server_baremetal.md) for details.
-
 ```text
 ovms --pull --source_model <source_model> --model_repository_path <model_repository_path> --model_name <model_name> --target_device <target_device> --task <task> [TASK_SPECIFIC_PARAMETERS]
 ```
@@ -67,7 +41,7 @@ ovms --pull --source_model <source_model> --model_repository_path <model_reposi
 **Required:** Docker Engine installed
 
 ```text
-docker run $(id -u):$(id -g) --rm -v <model_repository_path>:/models:rw openvino/model_server:py --pull --source_model "Qwen/Qwen3-8B" --model_repository_path /models --model_name Qwen3-8B --task text_generation
+docker run $(id -u):$(id -g) --rm -v <model_repository_path>:/models:rw openvino/model_server:latest-py --pull --source_model "Qwen/Qwen3-8B" --model_repository_path /models --model_name Qwen3-8B --task text_generation --weight-format int8
 ```
 
 :::
@@ -84,14 +58,26 @@ docker run $(id -u):$(id -g) --rm -v <model_repository_path>:/models:rw openvino
 **Required:** OpenVINO Model Server package - see [deployment instructions](./deploying_server_baremetal.md) for details.
 
 ```bat
-ovms --pull --source_model "Qwen/Qwen3-8B" --model_repository_path /models --model_name Qwen3-8B --task text_generation
+ovms --pull --source_model "Qwen/Qwen3-8B" --model_repository_path /models --model_name Qwen3-8B --task text_generation --weight-format int8
 ```
 
 :::
 
 ::::
 
-It will prepare all needed configuration files to support LLMS with OVMS in the model repository. Check [parameters page](./parameters.md) for detailed descriptions of configuration options and parameter usage.
+Check the [parameters page](./parameters.md) for detailed descriptions of configuration options and parameter usage.
+
+
+## Additional considerations
+
+When using pull mode, you need both read and write access rights to the local models repository.
+You also need read permissions for the source model on Hugging Face Hub. Pass the access token via an environment variable, just like with the hf_cli application.
+
+You can mount the Hugging Face cache to avoid downloading the original model again in case it was pulled earlier.
+
+Below is an example pull command with the optimum model cache directory shared and the HF_TOKEN environment variable set for model download authentication:
+
+```bash
+docker run -e HF_TOKEN=hf_YOURTOKEN -e HF_HOME=/hf_home/cache --user $(id -u):$(id -g) --group-add=$(id -g) -v /opt/home/user/.cache/huggingface/:/hf_home/cache -v $(pwd)/models:/models:rw openvino/model_server:latest-py --pull --model_repository_path /models --source_model meta-llama/Meta-Llama-3-8B-Instruct
+```
 
-*Note:*
-When using pull mode you need both read and write access rights to models repository.
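For readers following the conversion flow above, here is a minimal bare-metal sketch that combines the pieces described in this file. It assumes `ovms` and the optimum-cli Python dependencies are already installed as shown, and that `HF_TOKEN` and `HF_HOME` are honored the same way as in the docker example; the paths and token value are illustrative only.

```bash
# Sketch only: assumes ovms plus the optimum-cli requirements are installed (see above),
# and that HF_TOKEN / HF_HOME behave the same way as in the docker example.
export HF_TOKEN=hf_YOURTOKEN              # access token for gated models (placeholder value)
export HF_HOME=$HOME/.cache/huggingface   # reuse an existing Hugging Face cache if present

# Download Qwen/Qwen3-8B, convert it to IR with int8 weights, and exit after pulling (--pull).
ovms --pull \
  --source_model "Qwen/Qwen3-8B" \
  --model_repository_path ./models \
  --model_name Qwen3-8B \
  --task text_generation \
  --weight-format int8
```

Dropping `--pull` from the same command would deploy the converted model and start the server immediately, as noted in pull_hf_models.md.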
diff --git a/src/cli_parser.cpp b/src/cli_parser.cpp
index 0e435bae61..26b27f97fe 100644
--- a/src/cli_parser.cpp
+++ b/src/cli_parser.cpp
@@ -590,8 +590,13 @@ void CLIParser::prepareGraph(ServerSettingsImpl& serverSettings, HFSettingsImpl&
     hfSettings.overwriteModels = result->operator[]("overwrite_models").as<bool>();
     if (result->count("source_model")) {
         hfSettings.sourceModel = result->operator[]("source_model").as<std::string>();
-        // TODO: Currently we use git clone only for OpenVINO, we will change this method of detection to parsing model files
-        if (isOptimumCliDownload(serverSettings.hfSettings.sourceModel, hfSettings.ggufFilename)) {
+        // Cloning the repository is allowed only for OpenVINO models determined by the name pattern
+        // Other models will be downloaded and converted using optimum-cli or just downloaded as GGUF
+        std::string lowerSourceModel = toLower(hfSettings.sourceModel);
+        if (lowerSourceModel.find("openvino") == std::string::npos &&
+            lowerSourceModel.find("-ov") == std::string::npos &&
+            lowerSourceModel.find("_ov") == std::string::npos &&
+            (hfSettings.ggufFilename == std::nullopt)) {
             hfSettings.downloadType = OPTIMUM_CLI_DOWNLOAD;
         }
     }
@@ -600,7 +605,7 @@ void CLIParser::prepareGraph(ServerSettingsImpl& serverSettings, HFSettingsImpl&
         throw std::logic_error("--weight-format parameter unsupported for Openvino huggingface organization models.");
     }
     if (result->count("extra_quantization_params") && hfSettings.downloadType == GIT_CLONE_DOWNLOAD) {
-        throw std::logic_error("--extra_quantization_params parameter unsupported for Openvino huggingface organization models.");
+        throw std::logic_error("--extra_quantization_params parameter unsupported for OpenVINO models.");
     }
 
     if (result->count("weight-format"))
diff --git a/src/test/ovmsconfig_test.cpp b/src/test/ovmsconfig_test.cpp
index 42000eeae5..ff50bd6a32 100644
--- a/src/test/ovmsconfig_test.cpp
+++ b/src/test/ovmsconfig_test.cpp
@@ -1169,7 +1169,7 @@ TEST(OvmsExportHfSettingsTest, positiveDefault) {
 }
 
 TEST(OvmsExportHfSettingsTest, allChanged) {
-    std::string modelName = "NonOpenVINO/Phi-3-mini-FastDraft-50M-int8-ov";
+    std::string modelName = "Unknown/Phi-3-mini-FastDraft-50M-int8";
     std::string downloadPath = "test/repository";
     char* n_argv[] = {
         (char*)"ovms",
@@ -1205,7 +1205,7 @@ TEST(OvmsExportHfSettingsTest, allChangedPullAndStart) {
-    std::string modelName = "NonOpenVINO/Phi-3-mini-FastDraft-50M-int8-ov";
+    std::string modelName = "Unknown/Phi-3-mini-FastDraft-50M-int8";
     std::string downloadPath = "test/repository";
     char* n_argv[] = {
         (char*)"ovms",
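To connect the cli_parser.cpp change back to the documentation, here is a hedged illustration of how the new name-pattern rule is expected to route the two commands that already appear in the docs above. The routing comments reflect a reading of the added condition (names containing "openvino", "-ov" or "_ov", case-insensitive, or GGUF pulls keep the plain download path), not verified server output, and the repository paths are illustrative.

```bash
# Name matches the OpenVINO pattern ("openvino", "-ov") -> repository pulled as ready IR,
# no optimum-cli conversion involved (GIT_CLONE_DOWNLOAD path).
ovms --pull --source_model "OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov" \
     --model_repository_path ./models --model_name Phi-3-mini-FastDraft-50M-int8-ov \
     --task text_generation

# No OpenVINO pattern in the name and no GGUF file requested -> downloaded and converted
# with optimum-cli (OPTIMUM_CLI_DOWNLOAD path), so --weight-format is accepted here.
ovms --pull --source_model "Qwen/Qwen3-8B" \
     --model_repository_path ./models --model_name Qwen3-8B \
     --task text_generation --weight-format int8
```

This also matches the updated tests, where a source model without the OpenVINO markers ("Unknown/Phi-3-mini-FastDraft-50M-int8") is expected to take the optimum-cli path.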