20 changes: 10 additions & 10 deletions docs/pull_hf_models.md
@@ -1,19 +1,20 @@
# OVMS Pull mode {#ovms_docs_pull}

This documents describes how to leverage OpenVINO Model Server (OVMS) pull feature to automate deployment configuration with Generative AI models. When pulling from [OpenVINO organization](https://huggingface.co/OpenVINO) from HF no additional steps are required. However, when pulling models [outside of the OpenVINO](https://github.com/openvinotoolkit/model_server/blob/main/docs/pull_optimum_cli.md) organization you have to install additional python dependencies when using baremetal execution so that optimum-cli is available for ovms executable or build the OVMS python container for docker deployments. In summary you have 2 options:
This document describes how to leverage the OpenVINO Model Server (OVMS) pull feature to automate deployment configuration with Generative AI models. When pulling models from [Hugging Face Hub](https://huggingface.co/) that are already in OpenVINO IR format, no additional steps are required. However, when pulling models in PyTorch format, you have to either install additional Python dependencies for baremetal execution, so that `optimum-cli` is available to the `ovms` executable, or rely on the Docker image `openvino/model_server:latest-py`. In summary, you have two options:

- pulling preconfigured models in IR format from OpenVINO organization
- pulling models with automatic conversion and quantization (requires optimum-cli). Include additional consideration like longer time for deployment and pulling model data (original model) from HF, model memory for conversion, diskspace - described [here](https://github.com/openvinotoolkit/model_server/blob/main/docs/pull_optimum_cli.md)
- pulling pre-configured models in IR format (described below)
- pulling models with automatic conversion and quantization via optimum-cli, described on the [pulling with conversion](https://github.com/openvinotoolkit/model_server/blob/main/docs/pull_optimum_cli.md) page

### Pulling the models
> **Note:** Models in IR format must be exported with `optimum-cli`, including the tokenizer and detokenizer files in IR format where applicable. If they are missing, the tokenizer and detokenizer can be added with the `convert_tokenizer --with-detokenizer` tool.
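As a rough sketch of how such an export could be prepared manually (the model name and output directory are placeholders, and the exact flags of both tools may differ between releases):

```bash
# Export the model to OpenVINO IR with optimum-cli (placeholder model and output path)
optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B-Instruct --weight-format int8 ./models/Meta-Llama-3-8B-Instruct
# Add the tokenizer and detokenizer in IR format in case the export did not include them
convert_tokenizer meta-llama/Meta-Llama-3-8B-Instruct --with-detokenizer -o ./models/Meta-Llama-3-8B-Instruct
```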

There is a special mode to make OVMS pull the model from Hugging Face before starting the service:
## Pulling pre-configured models

There is a special OVMS mode to pull the model from Hugging Face before starting the service. It is triggered by the `--source_model` parameter. The `--pull` parameter limits the run to pulling alone: the application quits after the model is downloaded. Without the `--pull` option, the model is deployed and the server is started.

::::{tab-set}
:::{tab-item} With Docker
:sync: docker
**Required:** Docker Engine installed

```text
docker run --user $(id -u):$(id -g) --rm -v <model_repository_path>:/models:rw openvino/model_server:latest --pull --source_model <model_name_in_HF> --model_repository_path /models --model_name <external_model_name> --target_device <DEVICE> --task <task> [TASK_SPECIFIC_PARAMETERS]
```
@@ -55,9 +56,8 @@ ovms --pull --source_model "OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov" --model_r
::::


It will prepare all needed configuration files to support LLMS with OVMS in the model repository. Check [parameters page](./parameters.md) for detailed descriptions of configuration options and parameter usage.
Check the [parameters page](./parameters.md) for detailed descriptions of configuration options and parameter usage.

In case you want to setup model and start server in one step follow instructions on [this page](./starting_server.md).
If you want to set up the model and start the server in one step, follow the [instructions](./starting_server.md).
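As a minimal sketch of that one-step flow (the model, name, and port below are illustrative only; see the linked page for the full set of serving parameters):

```bash
# Without --pull the model is pulled if needed, then deployed and served
ovms --source_model "OpenVINO/Phi-3-mini-FastDraft-50M-int8-ov" --model_repository_path /models --model_name Phi-3-mini-FastDraft-50M-int8-ov --task text_generation --rest_port 8000
```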

*Note:*
When using pull mode you need both read and write access rights to models repository.
> **Note:** When using pull mode, you need both read and write access rights to the model repository.
60 changes: 23 additions & 37 deletions docs/pull_optimum_cli.md
@@ -1,40 +1,16 @@
# OVMS Pull mode with optimum cli {#ovms_docs_pull_optimum}

This documents describes how to leverage OpenVINO Model Server (OVMS) pull feature to automate deployment configuration with Generative AI models when pulling outside of [OpenVINO organization](https://huggingface.co/OpenVINO) from HF.
This document describes how to leverage the OpenVINO Model Server (OVMS) pull feature to automate deployment of generative models from Hugging Face with conversion and quantization at runtime.

You have to use docker image with optimum-cli or install additional python dependencies to the baremetal package. Follow the steps described below.
You have to either use the Docker image with optimum-cli, `openvino/model_server:latest-py`, or install additional Python dependencies into the baremetal package. Follow the steps described below.

Pulling models with automatic conversion and quantization (requires optimum-cli). Include additional consideration like longer time for deployment and pulling model data (original model) from HF, model memory for conversion, diskspace.
> **Note:** This procedure might increase memory usage during the model conversion and requires downloading the original model. Expect memory usage at least equal to the original model size during the conversion.

Note: Pulling the models from HuggingFace Hub can automate conversion and compression. It might however increase memory usage during the model conversion and requires downloading the original model.

## OVMS building and installation for optimum-cli integration
### Build python docker image
```bash
git clone https://github.com/openvinotoolkit/model_server.git
cd model_server
make python_image
```

Example pull command with optimum model cache directory sharing and setting HF_TOKEN environment variable for model download authentication.

```bash
docker run -e HF_TOKEN=hf_YOURTOKEN -e HF_HOME=/hf_home/cache --user $(id -u):$(id -g) --group-add=$(id -g) -v /opt/home/user/.cache/huggingface/:/hf_home/cache -v $(pwd)/models:/models:rw openvino/model_server:py --pull --model_repository_path /models --source_model meta-llama/Meta-Llama-3-8B-Instruct
```

### Install optimum-cli
Install python on your baremetal system from `https://www.python.org/downloads/` and run the commands:
```console
pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt
```

or use the python binary from the ovms_windows_python_on.zip or ovms.tar.gz package - see [deployment instructions](deploying_server_baremetal.md) for details.
## Add optimum-cli to OVMS installation on Windows

```bat
curl -L https://github.com/openvinotoolkit/model_server/releases/download/v2025.3/ovms_windows_python_on.zip -o ovms.zip
tar -xf ovms.zip
```
```bat
ovms\setupvars.bat
ovms\python\python -m pip install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/main/demos/common/export_models/requirements.txt
```
@@ -50,14 +26,12 @@ Using `--pull` parameter, we can use OVMS to download the model, quantize and co
**Required:** Docker Engine installed

```text
docker run $(id -u):$(id -g) --rm -v <model_repository_path>:/models:rw openvino/model_server:py --pull --source_model <model_name_in_HF> --model_repository_path /models --model_name <external_model_name> --target_device <DEVICE> --task <task> [TASK_SPECIFIC_PARAMETERS]
docker run --user $(id -u):$(id -g) --rm -v <model_repository_path>:/models:rw openvino/model_server:latest-py --pull --source_model <model_name_in_HF> --model_repository_path /models --model_name <external_model_name> --target_device <DEVICE> --task <task> [TASK_SPECIFIC_PARAMETERS]
```
:::

:::{tab-item} On Baremetal Host
:sync: baremetal
**Required:** OpenVINO Model Server package - see [deployment instructions](./deploying_server_baremetal.md) for details.

```text
ovms --pull --source_model <model_name_in_HF> --model_repository_path <model_repository_path> --model_name <external_model_name> --target_device <DEVICE> --task <task> [TASK_SPECIFIC_PARAMETERS]
```
@@ -67,15 +41,15 @@ ovms --pull --source_model <model_name_in_HF> --model_repository_path <model_rep
Example for pulling `Qwen/Qwen3-8B`:

```bat
ovms --pull --source_model "Qwen/Qwen3-8B" --model_repository_path /models --model_name Qwen3-8B --target_device CPU --task text_generation
ovms --pull --source_model "Qwen/Qwen3-8B" --model_repository_path /models --model_name Qwen3-8B --target_device CPU --task text_generation --weight-format int8
```
::::{tab-set}
:::{tab-item} With Docker
:sync: docker
**Required:** Docker Engine installed

```text
docker run $(id -u):$(id -g) --rm -v <model_repository_path>:/models:rw openvino/model_server:py --pull --source_model "Qwen/Qwen3-8B" --model_repository_path /models --model_name Qwen3-8B --task text_generation
docker run $(id -u):$(id -g) --rm -v <model_repository_path>:/models:rw openvino/model_server:latest-py --pull --source_model "Qwen/Qwen3-8B" --model_repository_path /models --model_name Qwen3-8B --task text_generation --weight-format int8
```
:::

@@ -84,14 +58,26 @@ docker run $(id -u):$(id -g) --rm -v <model_repository_path>:/models:rw openvino
**Required:** OpenVINO Model Server package - see [deployment instructions](./deploying_server_baremetal.md) for details.

```bat
ovms --pull --source_model "Qwen/Qwen3-8B" --model_repository_path /models --model_name Qwen3-8B --task text_generation
ovms --pull --source_model "Qwen/Qwen3-8B" --model_repository_path /models --model_name Qwen3-8B --task text_generation --weight-format int8
```
:::
::::


It will prepare all needed configuration files to support LLMS with OVMS in the model repository. Check [parameters page](./parameters.md) for detailed descriptions of configuration options and parameter usage.
Check the [parameters page](./parameters.md) for detailed descriptions of configuration options and parameter usage.


## Additional considerations

When using pull mode, you need both read and write access rights to the local model repository.

You need read permission for the source model on Hugging Face Hub. Pass the access token via an environment variable, just like with the `huggingface-cli` application.
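For example, on a baremetal Linux host the token could be exported before pulling (a sketch; the token value and model are placeholders):

```bash
# Placeholder token; use your own Hugging Face access token
export HF_TOKEN=hf_YOURTOKEN
ovms --pull --source_model meta-llama/Meta-Llama-3-8B-Instruct --model_repository_path /models --model_name Meta-Llama-3-8B-Instruct --task text_generation
```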

You can mount the Hugging Face cache to avoid downloading the original model again if it was pulled earlier.

Below is an example pull command that shares the optimum model cache directory and sets the HF_TOKEN environment variable for model download authentication:

```bash
docker run -e HF_TOKEN=hf_YOURTOKEN -e HF_HOME=/hf_home/cache --user $(id -u):$(id -g) --group-add=$(id -g) -v /opt/home/user/.cache/huggingface/:/hf_home/cache -v $(pwd)/models:/models:rw openvino/model_server:latest-py --pull --model_repository_path /models --source_model meta-llama/Meta-Llama-3-8B-Instruct
```

*Note:*
When using pull mode you need both read and write access rights to models repository.
11 changes: 8 additions & 3 deletions src/cli_parser.cpp
@@ -590,8 +590,13 @@ void CLIParser::prepareGraph(ServerSettingsImpl& serverSettings, HFSettingsImpl&
hfSettings.overwriteModels = result->operator[]("overwrite_models").as<bool>();
if (result->count("source_model")) {
hfSettings.sourceModel = result->operator[]("source_model").as<std::string>();
// TODO: Currently we use git clone only for OpenVINO, we will change this method of detection to parsing model files
if (isOptimumCliDownload(serverSettings.hfSettings.sourceModel, hfSettings.ggufFilename)) {
// Cloning the repository is allowed only for OpenVINO models determined by the name pattern
// Other models will be downloaded and converted using optimum-cli or just downloaded as GGUF
std::string lowerSourceModel = toLower(hfSettings.sourceModel);
if (lowerSourceModel.find("openvino") == std::string::npos &&
lowerSourceModel.find("-ov") == std::string::npos &&
lowerSourceModel.find("_ov") == std::string::npos &&
(hfSettings.ggufFilename == std::nullopt)) {
hfSettings.downloadType = OPTIMUM_CLI_DOWNLOAD;
}
}
@@ -600,7 +605,7 @@ void CLIParser::prepareGraph(ServerSettingsImpl& serverSettings, HFSettingsImpl&
throw std::logic_error("--weight-format parameter unsupported for Openvino huggingface organization models.");
}
if (result->count("extra_quantization_params") && hfSettings.downloadType == GIT_CLONE_DOWNLOAD) {
throw std::logic_error("--extra_quantization_params parameter unsupported for Openvino huggingface organization models.");
throw std::logic_error("--extra_quantization_params parameter unsupported for OpenVINO models.");
}

if (result->count("weight-format"))
4 changes: 2 additions & 2 deletions src/test/ovmsconfig_test.cpp
@@ -1169,7 +1169,7 @@ TEST(OvmsExportHfSettingsTest, positiveDefault) {
}

TEST(OvmsExportHfSettingsTest, allChanged) {
std::string modelName = "NonOpenVINO/Phi-3-mini-FastDraft-50M-int8-ov";
std::string modelName = "Unknown/Phi-3-mini-FastDraft-50M-int8";
std::string downloadPath = "test/repository";
char* n_argv[] = {
(char*)"ovms",
@@ -1205,7 +1205,7 @@ TEST(OvmsExportHfSettingsTest, allChanged) {
}

TEST(OvmsExportHfSettingsTest, allChangedPullAndStart) {
std::string modelName = "NonOpenVINO/Phi-3-mini-FastDraft-50M-int8-ov";
std::string modelName = "Unknown/Phi-3-mini-FastDraft-50M-int8";
std::string downloadPath = "test/repository";
char* n_argv[] = {
(char*)"ovms",