6 changes: 3 additions & 3 deletions INSTALLING_ONTO_EXISTING_CLUSTER_README.md
@@ -83,7 +83,7 @@ If you have existing node pools in your original OKE cluster that you'd like Blu
- If you get a warning about security, sometimes it takes a bit for the certificates to get signed. This will go away once that process completes on the OKE side.
3. Login with the `Admin Username` and `Admin Password` in the Application information tab.
4. Click the link next to "deployment" which will take you to a page with "Deployment List", and a content box.
- 5. Paste in the sample blueprint json found [here](docs/sample_blueprints/platform_feature_blueprints/exisiting_cluster_installation/add_node_to_control_plane.json).
+ 5. Paste in the sample blueprint json found [here](docs/sample_blueprints/other/exisiting_cluster_installation/add_node_to_control_plane.json).
6. Modify the "recipe_node_name" field to the private IP address you found in step 1 above.
7. Click "POST". This is a fast operation.
8. Wait about 20 seconds and refresh the page. It should look like:
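The JSON edit in steps 5–6 amounts to setting one field before POSTing the blueprint. A minimal sketch, assuming a reduced excerpt of `add_node_to_control_plane.json` (the real file has more keys) and a made-up private IP:

```python
import json

# Hypothetical minimal excerpt of add_node_to_control_plane.json;
# the actual sample blueprint contains additional keys.
blueprint = json.loads('{"recipe_node_name": "REPLACE_ME"}')

# Step 6: set recipe_node_name to the node's private IP from step 1
# (10.0.10.123 is an illustrative placeholder).
blueprint["recipe_node_name"] = "10.0.10.123"

# This serialized body is what you paste into the deployment content
# box before clicking "POST" in step 7.
payload = json.dumps(blueprint)
print(payload)
```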
@@ -108,10 +108,10 @@ If you have existing node pools in your original OKE cluster that you'd like Blu
- If you get a warning about security, sometimes it takes a bit for the certificates to get signed. This will go away once that process completes on the OKE side.
3. Login with the `Admin Username` and `Admin Password` in the Application information tab.
4. Click the link next to "deployment" which will take you to a page with "Deployment List", and a content box.
- 5. If you added a node from [Step 4](./INSTALLING_ONTO_EXISTING_CLUSTER_README.md#step-4-add-existing-nodes-to-cluster-optional), use the following shared node pool [blueprint](docs/sample_blueprints/platform_feature_blueprints/shared_node_pools/vllm_inference_sample_shared_pool_blueprint.json).
+ 5. If you added a node from [Step 4](./INSTALLING_ONTO_EXISTING_CLUSTER_README.md#step-4-add-existing-nodes-to-cluster-optional), use the following shared node pool [blueprint](docs/sample_blueprints/platform_features/shared_node_pools/vllm_inference_sample_shared_pool_blueprint.json).
- Depending on the node shape, you will need to change:
`"recipe_node_shape": "BM.GPU.A10.4"` to match your shape.
- 6. If you did not add a node, or just want to deploy a fresh node, use the following [blueprint](docs/sample_blueprints/workload_blueprints/llm_inference_with_vllm/vllm-open-hf-model.json).
+ 6. If you did not add a node, or just want to deploy a fresh node, use the following [blueprint](docs/sample_blueprints/model_serving/llm_inference_with_vllm/vllm-open-hf-model.json).
7. Paste the blueprint you selected into context box on the deployment page and click "POST"
8. To monitor the deployment, go back to "Api Root" and click "deployment_logs".
- If you are deploying without a shared node pool, it can take 10-30 minutes to bring up a node, depending on shape and whether it is bare-metal or virtual.
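The shape change called out under step 5 is likewise a one-field edit. A sketch under stated assumptions — only the `recipe_node_shape` field named in the docs is shown, and the replacement shape is an arbitrary example, not a recommendation:

```python
import json

# Hypothetical excerpt of the shared-pool vLLM blueprint; the real
# file contains many more keys.
blueprint = {"recipe_node_shape": "BM.GPU.A10.4"}

# Change the shape to match the node you actually have, e.g. an
# H100 bare-metal shape (illustrative value).
blueprint["recipe_node_shape"] = "BM.GPU.H100.8"

print(json.dumps(blueprint))
```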
20 changes: 10 additions & 10 deletions README.md
@@ -52,16 +52,16 @@ After you install OCI AI Blueprints to an OKE cluster in your tenancy, you can d

| Blueprint | Description |
| --------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
- | [**LLM & VLM Inference with vLLM**](docs/sample_blueprints/workload_blueprints/llm_inference_with_vllm/README.md) | Deploy Llama 2/3/3.1 7B/8B models using NVIDIA GPU shapes and the vLLM inference engine with auto-scaling. |
- | [**Llama Stack**](docs/sample_blueprints/workload_blueprints/llama-stack/README.md) | Complete GenAI runtime with vLLM, ChromaDB, Postgres, and Jaeger for production deployments with unified API for inference, RAG, and telemetry. |
- | [**Fine-Tuning Benchmarking**](docs/sample_blueprints/workload_blueprints/lora-benchmarking/README.md) | Run MLCommons quantized Llama-2 70B LoRA finetuning on A100 for performance benchmarking. |
- | [**LoRA Fine-Tuning**](docs/sample_blueprints/workload_blueprints/lora-fine-tuning/README.md) | LoRA fine-tuning of custom or HuggingFace models using any dataset. Includes flexible hyperparameter tuning. |
- | [**GPU Performance Benchmarking**](docs/sample_blueprints/workload_blueprints/gpu-health-check/README.md) | Comprehensive evaluation of GPU performance to ensure optimal hardware readiness before initiating any intensive computational workload. |
- | [**CPU Inference**](docs/sample_blueprints/workload_blueprints/cpu-inference/README.md) | Leverage Ollama to test CPU-based inference with models like Mistral, Gemma, and more. |
- | [**Multi-node Inference with RDMA and vLLM**](docs/sample_blueprints/workload_blueprints/multi-node-inference/README.md) | Deploy Llama-405B sized LLMs across multiple nodes with RDMA using H100 nodes with vLLM and LeaderWorkerSet. |
- | [**Autoscaling Inference with vLLM**](docs/sample_blueprints/platform_feature_blueprints/auto_scaling/README.md) | Serve LLMs with auto-scaling using KEDA, which scales to multiple GPUs and nodes using application metrics like inference latency. |
- | [**LLM Inference with MIG**](docs/sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/README.md) | Deploy LLMs to a fraction of a GPU with Nvidia’s multi-instance GPUs and serve them with vLLM. |
- | [**Job Queuing**](docs/sample_blueprints/platform_feature_blueprints/teams/README.md) | Take advantage of job queuing and enforce resource quotas and fair sharing between teams. |
+ | [**LLM & VLM Inference with vLLM**](docs/sample_blueprints/model_serving/llm_inference_with_vllm/README.md) | Deploy Llama 2/3/3.1 7B/8B models using NVIDIA GPU shapes and the vLLM inference engine with auto-scaling. |
+ | [**Llama Stack**](docs/sample_blueprints/other/llama-stack/README.md) | Complete GenAI runtime with vLLM, ChromaDB, Postgres, and Jaeger for production deployments with unified API for inference, RAG, and telemetry. |
+ | [**Fine-Tuning Benchmarking**](docs/sample_blueprints/gpu_benchmarking/lora-benchmarking/README.md) | Run MLCommons quantized Llama-2 70B LoRA finetuning on A100 for performance benchmarking. |
+ | [**LoRA Fine-Tuning**](docs/sample_blueprints/model_fine_tuning/lora-fine-tuning/README.md) | LoRA fine-tuning of custom or HuggingFace models using any dataset. Includes flexible hyperparameter tuning. |
+ | [**GPU Performance Benchmarking**](docs/sample_blueprints/gpu_health_check/gpu-health-check/README.md) | Comprehensive evaluation of GPU performance to ensure optimal hardware readiness before initiating any intensive computational workload. |
+ | [**CPU Inference**](docs/sample_blueprints/model_serving/cpu-inference/README.md) | Leverage Ollama to test CPU-based inference with models like Mistral, Gemma, and more. |
+ | [**Multi-node Inference with RDMA and vLLM**](docs/sample_blueprints/model_serving/multi-node-inference/README.md) | Deploy Llama-405B sized LLMs across multiple nodes with RDMA using H100 nodes with vLLM and LeaderWorkerSet. |
+ | [**Autoscaling Inference with vLLM**](docs/sample_blueprints/model_serving/auto_scaling/README.md) | Serve LLMs with auto-scaling using KEDA, which scales to multiple GPUs and nodes using application metrics like inference latency. |
+ | [**LLM Inference with MIG**](docs/sample_blueprints/model_serving/mig_multi_instance_gpu/README.md) | Deploy LLMs to a fraction of a GPU with Nvidia’s multi-instance GPUs and serve them with vLLM. |
+ | [**Job Queuing**](docs/sample_blueprints/platform_features/teams/README.md) | Take advantage of job queuing and enforce resource quotas and fair sharing between teams. |

## Support & Contact

10 changes: 5 additions & 5 deletions docs/about.md
@@ -36,8 +36,8 @@
| ------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------- |
| **Customize Blueprints** | Tailor existing OCI AI Blueprints to suit your exact AI workload needs—everything from hyperparameters to node counts and hardware. | [Read More](custom_blueprints/README.md) |
| **Updating OCI AI Blueprints** | Keep your OCI AI Blueprints environment current with the latest control plane and portal updates. | [Read More](../INSTALLING_ONTO_EXISTING_CLUSTER_README.md) |
- | **Shared Node Pool** | Use longer-lived resources (e.g., bare metal nodes) across multiple blueprints or to persist resources after a blueprint is undeployed. | [Read More](sample_blueprints/platform_feature_blueprints/shared_node_pools/README.md) |
- | **Auto-Scaling** | Automatically adjust resource usage based on infrastructure or application-level metrics to optimize performance and costs. | [Read More](sample_blueprints/platform_feature_blueprints/auto_scaling/README.md) |
+ | **Shared Node Pool** | Use longer-lived resources (e.g., bare metal nodes) across multiple blueprints or to persist resources after a blueprint is undeployed. | [Read More](sample_blueprints/platform_features/shared_node_pools/README.md) |
+ | **Auto-Scaling** | Automatically adjust resource usage based on infrastructure or application-level metrics to optimize performance and costs. | [Read More](sample_blueprints/model_serving/auto_scaling/README.md) |

---

@@ -76,13 +76,13 @@ A:
A: Deploy a vLLM blueprint, then use a tool like LLMPerf to run benchmarking against your inference endpoint. Contact us for more details.

**Q: Where can I see the full list of blueprints?**
- A: All available blueprints are listed [here](sample_blueprints/platform_feature_blueprints/exisiting_cluster_installation/README.md). If you need something custom, please let us know.
+ A: All available blueprints are listed [here](sample_blueprints/other/exisiting_cluster_installation/README.md). If you need something custom, please let us know.

**Q: How do I check logs for troubleshooting?**
A: Use `kubectl` to inspect pod logs in your OKE cluster.

**Q: Does OCI AI Blueprints support auto-scaling?**
- A: Yes, we leverage KEDA for application-driven auto-scaling. See [documentation](sample_blueprints/platform_feature_blueprints/auto_scaling/README.md).
+ A: Yes, we leverage KEDA for application-driven auto-scaling. See [documentation](sample_blueprints/model_serving/auto_scaling/README.md).

**Q: Which GPUs are compatible?**
A: Any NVIDIA GPUs available in your OCI region (A10, A100, H100, etc.).
@@ -91,4 +91,4 @@ A: Any NVIDIA GPUs available in your OCI region (A10, A100, H100, etc.).
A: Yes, though testing on clusters running other workloads is ongoing. We recommend a clean cluster for best stability.

**Q: How do I run multiple blueprints on the same node?**
- A: Enable shared node pools. [Read more here](sample_blueprints/platform_feature_blueprints/shared_node_pools/README.md).
+ A: Enable shared node pools. [Read more here](sample_blueprints/platform_features/shared_node_pools/README.md).
20 changes: 10 additions & 10 deletions docs/api_documentation.md
@@ -36,11 +36,11 @@
| recipe_container_env | string | No | Values of the recipe container init arguments. See the Blueprint Arguments section below for details. Example: `[{"key": "tensor_parallel_size","value": "2"},{"key": "model_name","value": "NousResearch/Meta-Llama-3.1-8B-Instruct"},{"key": "Model_Path","value": "/models/NousResearch/Meta-Llama-3.1-8B-Instruct"}]` |
| skip_capacity_validation | boolean | No | Determines whether validation checks on shape capacity are performed before initiating deployment. If your deployment is failing validation due to capacity errors but you believe this not to be true, you should set `skip_capacity_validation` to be `true` in the recipe JSON to bypass all checks for Shape capacity. |

- For autoscaling parameters, visit [autoscaling](sample_blueprints/platform_feature_blueprints/auto_scaling/README.md).
+ For autoscaling parameters, visit [autoscaling](sample_blueprints/model_serving/auto_scaling/README.md).

- For multinode inference parameters, visit [multinode inference](sample_blueprints/workload_blueprints/multi-node-inference/README.md)
+ For multinode inference parameters, visit [multinode inference](sample_blueprints/model_serving/multi-node-inference/README.md)

- For MIG parameters, visit [MIG shared pool configurations](sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/mig_inference_single_replica.json), [update MIG configuration](sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/mig_inference_single_replica.json), and [MIG recipe configuration](sample_blueprints/platform_feature_blueprints/mig_multi_instance_gpu/mig_inference_single_replica.json).
+ For MIG parameters, visit [MIG shared pool configurations](sample_blueprints/model_serving/mig_multi_instance_gpu/mig_inference_single_replica.json), [update MIG configuration](sample_blueprints/model_serving/mig_multi_instance_gpu/mig_inference_single_replica.json), and [MIG recipe configuration](sample_blueprints/model_serving/mig_multi_instance_gpu/mig_inference_single_replica.json).
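The `recipe_container_env` and `skip_capacity_validation` parameters described in the table above can be sketched together; the key/value list below is the exact example from the parameter table, while combining them into one recipe dict is illustrative:

```python
import json

# recipe_container_env as documented: a list of key/value pairs
# passed to the recipe container at init time.
recipe = {
    "recipe_container_env": [
        {"key": "tensor_parallel_size", "value": "2"},
        {"key": "model_name", "value": "NousResearch/Meta-Llama-3.1-8B-Instruct"},
        {"key": "Model_Path", "value": "/models/NousResearch/Meta-Llama-3.1-8B-Instruct"},
    ],
    # Set true only to bypass shape-capacity validation when you
    # believe the capacity check is reporting a false failure.
    "skip_capacity_validation": True,
}

# Flatten the env list into a dict for easy lookup.
env = {e["key"]: e["value"] for e in recipe["recipe_container_env"]}
print(json.dumps(env, indent=2))
```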

### Blueprint Container Arguments

@@ -94,13 +94,13 @@ This recipe deploys the vLLM container image. Follow the vLLM docs to pass the c
There are 3 blueprints that we are providing out of the box. Following are example recipe.json snippets that you can use to deploy the blueprints quickly for a test run.
|Blueprint|Scenario|Sample JSON|
|----|----|----
- |LLM Inference using NVIDIA shapes and vLLM|Deployment with default Llama-3.1-8B model using PAR|View sample JSON here [here](sample_blueprints/workload_blueprints/llm_inference_with_vllm/vllm-open-hf-model.json)
- |MLCommons Llama-2 Quantized 70B LORA Fine-Tuning on A100|Default deployment with model and dataset ingested using PAR|View sample JSON here [here](sample_blueprints/workload_blueprints/lora-benchmarking/mlcommons_lora_finetune_nvidia_sample_recipe.json)
- |LORA Fine-Tune Blueprint|Open Access Model Open Access Dataset Download from Huggingface (no token required)|View sample JSON [here](sample_blueprints/workload_blueprints/lora-fine-tuning/open_model_open_dataset_hf.backend.json)
- |LORA Fine-Tune Blueprint|Closed Access Model Open Access Dataset Download from Huggingface (Valid Auth Token Is Required!!)|View sample JSON [here](sample_blueprints/workload_blueprints/lora-fine-tuning/closed_model_open_dataset_hf.backend.json)
- |LORA Fine-Tune Blueprint|Bucket Model Open Access Dataset Download from Huggingface (no token required)|View sample JSON [here](sample_blueprints/workload_blueprints/lora-fine-tuning/bucket_par_open_dataset.backend.json)
- |LORA Fine-Tune Blueprint|Get Model from Bucket in Another Region / Tenancy using Pre-Authenticated_Requests (PAR) Open Access Dataset Download from Huggingface (no token required)|View sample JSON [here](sample_blueprints/workload_blueprints/lora-fine-tuning/bucket_model_open_dataset.backend.json)
- |LORA Fine-Tune Blueprint|Bucket Model Bucket Checkpoint Open Access Dataset Download from Huggingface (no token required)|View sample JSON [here](sample_blueprints/workload_blueprints/lora-fine-tuning/bucket_par_open_dataset.backend.json)
+ |LLM Inference using NVIDIA shapes and vLLM|Deployment with default Llama-3.1-8B model using PAR|View sample JSON here [here](sample_blueprints/model_serving/llm_inference_with_vllm/vllm-open-hf-model.json)
+ |MLCommons Llama-2 Quantized 70B LORA Fine-Tuning on A100|Default deployment with model and dataset ingested using PAR|View sample JSON here [here](sample_blueprints/gpu_benchmarking/lora-benchmarking/mlcommons_lora_finetune_nvidia_sample_recipe.json)
+ |LORA Fine-Tune Blueprint|Open Access Model Open Access Dataset Download from Huggingface (no token required)|View sample JSON [here](sample_blueprints/model_fine_tuning/lora-fine-tuning/open_model_open_dataset_hf.backend.json)
+ |LORA Fine-Tune Blueprint|Closed Access Model Open Access Dataset Download from Huggingface (Valid Auth Token Is Required!!)|View sample JSON [here](sample_blueprints/model_fine_tuning/lora-fine-tuning/closed_model_open_dataset_hf.backend.json)
+ |LORA Fine-Tune Blueprint|Bucket Model Open Access Dataset Download from Huggingface (no token required)|View sample JSON [here](sample_blueprints/model_fine_tuning/lora-fine-tuning/bucket_par_open_dataset.backend.json)
+ |LORA Fine-Tune Blueprint|Get Model from Bucket in Another Region / Tenancy using Pre-Authenticated_Requests (PAR) Open Access Dataset Download from Huggingface (no token required)|View sample JSON [here](sample_blueprints/model_fine_tuning/lora-fine-tuning/bucket_model_open_dataset.backend.json)
+ |LORA Fine-Tune Blueprint|Bucket Model Bucket Checkpoint Open Access Dataset Download from Huggingface (no token required)|View sample JSON [here](sample_blueprints/model_fine_tuning/lora-fine-tuning/bucket_par_open_dataset.backend.json)

## Undeploy a Blueprint

@@ -21,7 +21,7 @@ If you have existing node pools in your original OKE cluster that you'd like Blu
2. Go to the stack and click "Application information". Click the API Url.
3. Login with the `Admin Username` and `Admin Password` in the Application information tab.
4. Click the link next to "deployment" which will take you to a page with "Deployment List", and a content box.
- 5. Paste in the sample blueprint json found [here](../../sample_blueprints/platform_feature_blueprints/exisiting_cluster_installation/add_node_to_control_plane.json).
+ 5. Paste in the sample blueprint json found [here](../../sample_blueprints/other/exisiting_cluster_installation/add_node_to_control_plane.json).
6. Modify the "recipe_node_name" field to the private IP address you found in step 1 above.
7. Click "POST". This is a fast operation.
8. Wait about 20 seconds and refresh the page. It should look like: