Merge branch 'main' into unit_test

XkunW · web-flow · commit 40af1031d134 · 2025-05-26T17:00:57.000-04:00
diff --git a/.github/workflows/docker.yml b/.github/workflows/docker.yml
@@ -45,7 +45,7 @@ jobs:
           images: vectorinstitute/vector-inference
 
       - name: Build and push Docker image
-        uses: docker/build-push-action@14487ce63c7a62a4a324b0bfb37086795e31c6c1
+        uses: docker/build-push-action@1dc73863535b631f98b2378be8619f83b136f4a0
         with:
           context: .
           file: ./Dockerfile
diff --git a/.github/workflows/unit_tests.yml b/.github/workflows/unit_tests.yml
@@ -72,7 +72,7 @@ jobs:
           uv run pytest tests/test_imports.py
 
       - name: Upload coverage to Codecov
-        uses: codecov/codecov-action@v5.4.2
+        uses: codecov/codecov-action@v5.4.3
         with:
           token: ${{ secrets.CODECOV_TOKEN }}
           file: ./coverage.xml
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -17,7 +17,7 @@ repos:
     - id: check-toml
 
   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: 'v0.11.9'
+    rev: 'v0.11.11'
     hooks:
     - id: ruff
       args: [--fix, --exit-non-zero-on-fix]
diff --git a/README.md b/README.md
@@ -85,7 +85,7 @@ models:
     vllm_args:
       --max-model-len: 1010000
       --max-num-seqs: 256
-      --compilation-confi: 3
+      --compilation-config: 3
 ```
 
 You would then set the `VEC_INF_CONFIG` path using:
@@ -94,7 +94,11 @@ You would then set the `VEC_INF_CONFIG` path using:
 export VEC_INF_CONFIG=/h/<username>/my-model-config.yaml
 ```
 
-Note that there are other parameters that can also be added to the config but not shown in this example, check the [`ModelConfig`](vec_inf/client/config.py) for details.
+**NOTE**
+* Other parameters can also be added to the config but are not shown in this example; see [`ModelConfig`](vec_inf/client/config.py) for details.
+* See [vLLM Engine Arguments](https://docs.vllm.ai/en/stable/serving/engine_args.html) for the full list of available vLLM engine arguments. Every parallel size defaults to 1, so none of them is set explicitly in this example.
+* BF16 isn't supported on GPU partitions with non-Ampere architectures, e.g. `rtx6000` and `t4v2`. For models that default to BF16, use FP16 on these GPUs instead, i.e. `--dtype: float16`.
+* Setting `--compilation-config` to `3` currently breaks multi-node model launches, so we don't set it for models that require multiple GPU nodes.
 
 #### Other commands
 
@@ -161,7 +165,7 @@ Once the inference server is ready, you can start sending in inference requests.
     "prompt_logprobs":null
 }
 ```
-**NOTE**: For multimodal models, currently only `ChatCompletion` is available, and only one image can be provided for each prompt.
+**NOTE**: Certain models, e.g. the Mistral family, don't adhere to OpenAI's chat template. For these models, you can either change your prompt to follow the model's default chat template or provide your own chat template via `--chat-template: TEMPLATE_PATH`.
 
 ## SSH tunnel from your local device
 If you want to run inference from your local device, you can open a SSH tunnel to your cluster environment like the following:
diff --git a/examples/README.md b/examples/README.md
@@ -9,3 +9,4 @@
   - [`logits.py`](logits/logits.py): Python example of getting logits from hosted model.
 - [`api`](api): Examples for using the Python API
   - [`basic_usage.py`](api/basic_usage.py): Basic Python example demonstrating the Vector Inference API
+- [`slurm_dependency`](slurm_dependency): Example of launching a model with `vec-inf` and running a downstream SLURM job that waits for the server to be ready before sending a request.
diff --git a/examples/slurm_dependency/README.md b/examples/slurm_dependency/README.md
@@ -0,0 +1,33 @@
+# SLURM Dependency Workflow Example
+
+This example demonstrates how to launch a model server using `vec-inf`, and run a downstream SLURM job that waits for the server to become ready before querying it.
+
+## Files
+
+This directory contains the following:
+
+1. [run_workflow.sh](run_workflow.sh)
+   Launches the model server and submits the downstream job with a dependency, so it starts only after the server job begins running.
+
+2. [downstream_job.sbatch](downstream_job.sbatch)
+   A SLURM job script that runs the downstream logic (e.g., prompting the model).
+
+3. [run_downstream.py](run_downstream.py)
+   A Python script that waits until the inference server is ready, then sends a request using the OpenAI-compatible API.
+
+## What to update
+
+Before running this example, update the following in [downstream_job.sbatch](downstream_job.sbatch):
+
+- `--job-name`, `--output`, and `--error` paths
+- Virtual environment path in the `source` line
+- SLURM resource configuration (e.g., partition, memory, GPU)
+
+Also update the model name in [run_downstream.py](run_downstream.py) to match what you're launching.
+
+## Running the example
+
+First, activate a virtual environment where `vec-inf` is installed. Then, from this directory, run:
+
+```bash
+bash run_workflow.sh
diff --git a/examples/slurm_dependency/downstream_job.sbatch b/examples/slurm_dependency/downstream_job.sbatch
@@ -0,0 +1,18 @@
+#!/bin/bash
+#SBATCH --job-name=Meta-Llama-3.1-8B-Instruct-downstream
+#SBATCH --partition=a40
+#SBATCH --qos=m2
+#SBATCH --time=08:00:00
+#SBATCH --nodes=1
+#SBATCH --gpus-per-node=1
+#SBATCH --cpus-per-task=4
+#SBATCH --mem=8G
+#SBATCH --output=$HOME/.vec-inf-logs/Meta-Llama-3.1-8B-Instruct-downstream.%j.out
+#SBATCH --error=$HOME/.vec-inf-logs/Meta-Llama-3.1-8B-Instruct-downstream.%j.err
+
+# Activate your environment
+# TODO: update this path to match your venv location
+source "$HOME/vector-inference/.venv/bin/activate"
+
+# SERVER_JOB_ID is exported by run_workflow.sh; pass it as a CLI arg so the script waits for that server to be ready
+python run_downstream.py "$SERVER_JOB_ID"
diff --git a/examples/slurm_dependency/run_downstream.py b/examples/slurm_dependency/run_downstream.py
@@ -0,0 +1,26 @@
+"""Example script to query a launched model via the OpenAI-compatible API."""
+
+import sys
+
+from openai import OpenAI
+
+from vec_inf.client import VecInfClient
+
+
+if len(sys.argv) < 2:
+    raise ValueError("Expected server job ID as the first argument.")
+job_id = int(sys.argv[1])
+
+vi_client = VecInfClient()
+print(f"Waiting for SLURM job {job_id} to be ready...")
+status = vi_client.wait_until_ready(slurm_job_id=job_id)
+print(f"Server is ready at {status.base_url}")
+
+api_client = OpenAI(base_url=status.base_url, api_key="EMPTY")
+resp = api_client.completions.create(
+    model="Meta-Llama-3.1-8B-Instruct",
+    prompt="What is the capital of Canada?",
+    max_tokens=20,
+)
+
+print(resp)
diff --git a/examples/slurm_dependency/run_workflow.sh b/examples/slurm_dependency/run_workflow.sh
@@ -0,0 +1,14 @@
+#!/bin/bash
+
+# ---- Config ----
+MODEL_NAME="Meta-Llama-3.1-8B-Instruct"
+LAUNCH_ARGS="$MODEL_NAME"
+
+# ---- Step 1: Launch the server ----
+RAW_JSON=$(vec-inf launch $LAUNCH_ARGS --json-mode)
+SERVER_JOB_ID=$(echo "$RAW_JSON" | python3 -c "import sys, json; print(json.load(sys.stdin)['slurm_job_id'])")
+echo "Launched server as job $SERVER_JOB_ID"
+echo "$RAW_JSON"
+
+# ---- Step 2: Submit downstream job ----
+sbatch --dependency=after:$SERVER_JOB_ID --export=SERVER_JOB_ID=$SERVER_JOB_ID downstream_job.sbatch
diff --git a/tests/vec_inf/client/test_api.py b/tests/vec_inf/client/test_api.py
@@ -5,6 +5,7 @@
 import pytest
 
 from vec_inf.client import ModelStatus, ModelType, VecInfClient
+from vec_inf.client._exceptions import ServerError, SlurmJobError
 
 
 @pytest.fixture
@@ -128,3 +129,84 @@ def test_wait_until_ready():
             assert result.server_status == ModelStatus.READY
             assert result.base_url == "http://gpu123:8080/v1"
             assert mock_status.call_count == 2
+
+
+def test_shutdown_model_success():
+    """Test model shutdown success."""
+    client = VecInfClient()
+    with patch("vec_inf.client.api.run_bash_command") as mock_command:
+        mock_command.return_value = ("", "")
+        result = client.shutdown_model(12345)
+
+        assert result is True
+        mock_command.assert_called_once_with("scancel 12345")
+
+
+def test_shutdown_model_failure():
+    """Test model shutdown failure."""
+    client = VecInfClient()
+    with patch("vec_inf.client.api.run_bash_command") as mock_command:
+        mock_command.return_value = ("", "Error: Job not found")
+        with pytest.raises(
+            SlurmJobError, match="Failed to shutdown model: Error: Job not found"
+        ):
+            client.shutdown_model(12345)
+
+
+def test_wait_until_ready_timeout():
+    """Test timeout in wait_until_ready."""
+    client = VecInfClient()
+
+    with patch.object(client, "get_status") as mock_status:
+        mock_response = MagicMock()
+        mock_response.server_status = ModelStatus.LAUNCHING
+        mock_status.return_value = mock_response
+
+        with (
+            patch("time.sleep"),
+            pytest.raises(ServerError, match="Timed out waiting for model"),
+        ):
+            client.wait_until_ready(12345, timeout_seconds=1, poll_interval_seconds=0.5)
+
+
+def test_wait_until_ready_failed_status():
+    """Test wait_until_ready when model fails."""
+    client = VecInfClient()
+
+    with patch.object(client, "get_status") as mock_status:
+        mock_response = MagicMock()
+        mock_response.server_status = ModelStatus.FAILED
+        mock_response.failed_reason = "Out of memory"
+        mock_status.return_value = mock_response
+
+        with pytest.raises(ServerError, match="Model failed to start: Out of memory"):
+            client.wait_until_ready(12345)
+
+
+def test_wait_until_ready_failed_no_reason():
+    """Test wait_until_ready when model fails without reason."""
+    client = VecInfClient()
+
+    with patch.object(client, "get_status") as mock_status:
+        mock_response = MagicMock()
+        mock_response.server_status = ModelStatus.FAILED
+        mock_response.failed_reason = None
+        mock_status.return_value = mock_response
+
+        with pytest.raises(ServerError, match="Model failed to start: Unknown error"):
+            client.wait_until_ready(12345)
+
+
+def test_wait_until_ready_shutdown():
+    """Test wait_until_ready when model is shutdown."""
+    client = VecInfClient()
+
+    with patch.object(client, "get_status") as mock_status:
+        mock_response = MagicMock()
+        mock_response.server_status = ModelStatus.SHUTDOWN
+        mock_status.return_value = mock_response
+
+        with pytest.raises(
+            ServerError, match="Model was shutdown before it became ready"
+        ):
+            client.wait_until_ready(12345)
diff --git a/vec_inf/cli/_cli.py b/vec_inf/cli/_cli.py
@@ -18,6 +18,7 @@
     Stream real-time performance metrics
 """
 
+import json
 import time
 from typing import Optional, Union
 
@@ -72,6 +73,21 @@ def cli() -> None:
     type=str,
     help="Quality of service",
 )
+@click.option(
+    "--exclude",
+    type=str,
+    help="Exclude certain nodes from the resources granted to the job",
+)
+@click.option(
+    "--node-list",
+    type=str,
+    help="Request a specific list of nodes for deployment",
+)
+@click.option(
+    "--bind",
+    type=str,
+    help="Additional binds for the singularity container as a comma-separated list of bind paths",
+)
 @click.option(
     "--time",
     type=str,
@@ -124,8 +140,16 @@ def launch(
             Number of nodes to use
         - gpus_per_node : int, optional
             Number of GPUs per node
+        - account : str, optional
+            Charge resources used by this job to specified account
         - qos : str, optional
             Quality of service tier
+        - exclude : str, optional
+            Exclude certain nodes from the resources granted to the job
+        - node_list : str, optional
+            Request a specific list of nodes for deployment
+        - bind : str, optional
+            Additional binds for the singularity container
         - time : str, optional
             Time limit for job
         - venv : str, optional
@@ -157,8 +181,9 @@ def launch(
 
         # Display launch information
         launch_formatter = LaunchResponseFormatter(model_name, launch_response.config)
+
         if json_mode:
-            click.echo(launch_response.config)
+            click.echo(json.dumps(launch_response.config))
         else:
             launch_info_table = launch_formatter.format_table_output()
             CONSOLE.print(launch_info_table)
diff --git a/vec_inf/client/_client_vars.py b/vec_inf/client/_client_vars.py
@@ -21,7 +21,12 @@
 from pathlib import Path
 from typing import TypedDict
 
-from vec_inf.client.slurm_vars import SINGULARITY_LOAD_CMD
+from vec_inf.client.slurm_vars import (
+    LD_LIBRARY_PATH,
+    SINGULARITY_IMAGE,
+    SINGULARITY_LOAD_CMD,
+    VLLM_NCCL_SO_PATH,
+)
 
 
 MODEL_READY_SIGNATURE = "INFO:     Application startup complete."
@@ -60,6 +65,8 @@
     "qos": "qos",
     "time": "time",
     "nodes": "num_nodes",
+    "exclude": "exclude",
+    "nodelist": "node_list",
     "gpus-per-node": "gpus_per_node",
     "cpus-per-task": "cpus_per_task",
     "mem": "mem_per_node",
@@ -71,7 +78,12 @@
 VLLM_SHORT_TO_LONG_MAP = {
     "-tp": "--tensor-parallel-size",
     "-pp": "--pipeline-parallel-size",
+    "-dp": "--data-parallel-size",
+    "-dpl": "--data-parallel-size-local",
+    "-dpa": "--data-parallel-address",
+    "-dpp": "--data-parallel-rpc-port",
     "-O": "--compilation-config",
+    "-q": "--quantization",
 }
 
 
@@ -117,6 +129,8 @@ class SlurmScriptTemplate(TypedDict):
         Commands for Singularity container setup
     imports : str
         Import statements and source commands
+    env_vars : list[str]
+        Environment variables to set
     singularity_command : str
         Template for Singularity execution command
     activate_venv : str
@@ -134,6 +148,7 @@ class SlurmScriptTemplate(TypedDict):
     shebang: ShebangConfig
     singularity_setup: list[str]
     imports: str
+    env_vars: list[str]
     singularity_command: str
     activate_venv: str
     server_setup: ServerSetupConfig
@@ -152,10 +167,14 @@ class SlurmScriptTemplate(TypedDict):
     },
     "singularity_setup": [
         SINGULARITY_LOAD_CMD,
-        "singularity exec {singularity_image} ray stop",
+        f"singularity exec {SINGULARITY_IMAGE} ray stop",
     ],
     "imports": "source {src_dir}/find_port.sh",
-    "singularity_command": "singularity exec --nv --bind {model_weights_path}:{model_weights_path} --containall {singularity_image}",
+    "env_vars": [
+        f"export LD_LIBRARY_PATH={LD_LIBRARY_PATH}",
+        f"export VLLM_NCCL_SO_PATH={VLLM_NCCL_SO_PATH}",
+    ],
+    "singularity_command": f"singularity exec --nv --bind {{model_weights_path}}{{additional_binds}} --containall {SINGULARITY_IMAGE}",
     "activate_venv": "source {venv}/bin/activate",
     "server_setup": {
         "single_node": [
@@ -203,8 +222,7 @@ class SlurmScriptTemplate(TypedDict):
         '    && mv temp.json "$json_path"',
     ],
     "launch_cmd": [
-        "python3.10 -m vllm.entrypoints.openai.api_server \\",
-        "    --model {model_weights_path} \\",
+        "vllm serve {model_weights_path} \\",
         "    --served-model-name {model_name} \\",
         '    --host "0.0.0.0" \\',
         "    --port $vllm_port_number \\",
diff --git a/vec_inf/client/_helper.py b/vec_inf/client/_helper.py
diff --git a/vec_inf/client/_slurm_script_generator.py b/vec_inf/client/_slurm_script_generator.py
diff --git a/vec_inf/client/config.py b/vec_inf/client/config.py
diff --git a/vec_inf/client/models.py b/vec_inf/client/models.py
diff --git a/vec_inf/config/models.yaml b/vec_inf/config/models.yaml