on prem changes to disable cloud solutions #700
Open
jen-scymanski-scale wants to merge 12 commits into scaleapi:main from jen-scymanski-scale:js-model-engine-on-prrem-test
Changes from all commits (12 commits):
- 16bea80: on prem changes to disable cloud solutions (jen-scymanski-scale)
- d72f6c7: update to use config (jen-scymanski-scale)
- 4442dd8: updated based on comments (jen-scymanski-scale)
- 3febc44: remove old vars and update logging (jen-scymanski-scale)
- a1fe268: remove update to roles.py (jen-scymanski-scale)
- 61a67ae: add back optional field (jen-scymanski-scale)
- 10f9e4f: WIP -needs cleaned up onprem updates - doesnt include charts folder u… (jen-scymanski-scale)
- 0dcd54a: adding latest updates (jen-scymanski-scale)
- de8cb1a: TEMP KT Doc (jen-scymanski-scale)
- 324fe4d: on prem modifications (jen-scymanski-scale)
- b138a2e: update vllm image "onprem-vllm-0.10.0" update build script for local … (jen-scymanski-scale)
- 1ee453a: sandesh's updates into this branch as well ass updated vllm_adapter p… (jen-scymanski-scale)

# Model Engine Knowledge Transfer Documentation

## 🎯 Executive Summary & Current State

### **Current Working Configuration (STABLE)**
- **Model-Engine Image**: `onprem20` (PVC removed, optimized AWS CLI)
- **VLLM Image**: `vllm-onprem` (model architecture fixes for Qwen3ForCausalLM)
- **Storage Configuration**: `50GB ephemeral` (prevents container termination)
- **Status**: First stable endpoint deployment with active model downloads (8+ minutes uptime, 655MB downloaded)

---

## **Image Ecosystem & Build Process**

#### **Model-Engine Images** (Business Logic)
- **Repository**: `registry.odp.om/odp-development/oman-national-llm/model-engine:onpremXX`
- **Contents**: Python application code, endpoint builder logic, Kubernetes resource generation
- **Build Source**: `llm-engine` repository
- **Current Working**: `onprem20` (PVC removed, optimized AWS CLI)
- **Build Trigger**: Code changes in the llm-engine repository

#### **VLLM Images** (Inference Runtime)
- **Repository**: `registry.odp.om/odp-development/oman-national-llm/vllm:TAG`
- **Contents**: VLLM inference framework, model serving logic, runtime dependencies
- **Build Source**: Separate VLLM Dockerfile (not in the main repos)
- **Current Working**: `vllm-onprem` (with Qwen3ForCausalLM compatibility)
- **Build Trigger**: VLLM framework updates or model architecture fixes

#### **Image Relationship**
```
model-engine image (onprem20)
  ↓ (generates Kubernetes manifests)
VLLM container (vllm-onprem)
  ↓ (downloads models and runs inference)
Model Files (S3) → VLLM Server → API Endpoints
```

### **Storage Architecture**
- **Ephemeral Storage**: Node-local, lost on pod restart, 189GB total capacity
- **PVC Storage**: Persistent, Ceph RBD backed, attempted but has async bugs
- **Current**: Using ephemeral storage with 50GB limits (within node capacity); a verification sketch follows this list
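
The current ephemeral setup can be checked against what the endpoint builder actually generated. A minimal sketch, reusing the `ENDPOINT_DEPLOYMENT` placeholder from the operational commands later in this doc (the node name is the GPU node referenced under Step 3):

```bash
# Read back the ephemeral-storage limits the builder generated per container
kubectl get deployment ENDPOINT_DEPLOYMENT -n llm-core \
  -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{": "}{.resources.limits.ephemeral-storage}{"\n"}{end}'

# Compare against what the GPU node can actually offer (189GB total)
kubectl describe node hpc-k8s-phy-wrk-g01 | grep -A5 "Allocated resources"
```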

---

#### **Model Architecture Compatibility**
- **Problem**: `ValueError: Model architectures ['Qwen3ForCausalLM'] are not supported`
- **Impact**: VLLM failed to load Qwen3 models
- **Solution**: Updated to the `vllm-onprem` image with architecture fixes (a quick architecture check is sketched below)
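
A quick way to confirm which architecture a downloaded checkpoint declares (and therefore whether the VLLM image must support it) is to read `config.json` inside the main container. A minimal sketch, assuming the files land in `model_files/` under the container's working directory as described elsewhere in this doc:

```bash
# Print the architectures field from the downloaded checkpoint's config.json
kubectl exec ENDPOINT_POD -n llm-core -c main -- \
  python -c "import json; print(json.load(open('model_files/config.json'))['architectures'])"
# Expected for this model: ['Qwen3ForCausalLM']
```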

---

## 🔧 Technical Deep Dive

### **Working Configuration Details**

#### **Image Configuration**
```yaml
# values.yaml
tag: onprem20
vllm_repository: "odp-development/oman-national-llm/vllm"
vllm_tag: "vllm-onprem"
```
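
After changing these values it is worth confirming the release actually picked them up before creating endpoints. A sketch, assuming the chart is installed as a Helm release named `model-engine` in `llm-core` (adjust the release name if it differs in your environment):

```bash
# Show the tag/vllm settings currently applied to the release
helm get values model-engine -n llm-core | grep -E "tag|vllm"

# Cross-check the image the running deployment was rendered with
kubectl get deployment model-engine -n llm-core \
  -o jsonpath='{.spec.template.spec.containers[*].image}'
```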

### **S3 Integration Details**

#### **Working Environment Variables**
```bash
AWS_ACCESS_KEY_ID=<from-kubernetes-secret>
AWS_SECRET_ACCESS_KEY=<from-kubernetes-secret>
AWS_ENDPOINT_URL=https://oss.odp.om
AWS_REGION=us-east-1
AWS_EC2_METADATA_DISABLED=true
```
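
These values are injected from a Kubernetes secret rather than hard-coded in the chart. A sketch for checking what a running endpoint container actually sees and where the credentials come from (the placeholders follow the monitoring commands later in this doc):

```bash
# Confirm the AWS-related environment inside the main container
kubectl exec ENDPOINT_POD -n llm-core -c main -- env | grep -E "^AWS_"

# Inspect which secret the generated deployment pulls credentials from
kubectl get deployment ENDPOINT_DEPLOYMENT -n llm-core -o yaml | grep -B2 -A4 secretKeyRef
```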

#### **S3 Download Command**
```bash
# Full command with environment variables
AWS_ACCESS_KEY_ID=<from-kubernetes-secret> \
AWS_SECRET_ACCESS_KEY=<from-kubernetes-secret> \
AWS_ENDPOINT_URL=https://oss.odp.om \
AWS_REGION=us-east-1 \
AWS_EC2_METADATA_DISABLED=true \
aws s3 sync s3://scale-gp-models/intermediate-model-aws model_files --no-progress

# S3 Endpoint Details
# Scality S3 Endpoint: https://oss.odp.om
# Bucket: scale-gp-models
# Path: intermediate-model-aws/
```

### **Timing Coordination Logic**
The working timing coordination waits for the following (a sketch of the equivalent wait loop follows this list):
1. **config.json** file to exist
2. **All .safetensors files** to be present
3. **No temp suffixes** on any files (indicating AWS CLI completion)
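
A minimal sketch of that wait loop (the real version is embedded in the generated container command; the temp-suffix pattern is an assumption based on the file-finalization check in the monitoring commands below):

```bash
# Wait until config.json exists, at least one .safetensors file is present,
# and no file still carries an AWS CLI temp suffix (e.g. model.safetensors.AbCd1234)
while true; do
  if [ -f model_files/config.json ] \
     && ls model_files/*.safetensors >/dev/null 2>&1 \
     && ! ls model_files/ | grep -qE '\.([A-Za-z0-9]{8}|tmp)$'; then
    echo "Model files ready, starting VLLM"
    break
  fi
  echo "Waiting for model download to finish..."
  sleep 10
done
```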

### **Endpoint Creation Workflow**

When an endpoint is created via an API call, here is the complete workflow:

#### **Step 1: API Request Processing**
```
curl -X POST /v1/llm/model-endpoints → model-engine service
```
- **model-engine** receives the API request
- Validates parameters and creates the endpoint record
- Queues a build task for **endpoint-builder**

#### **Step 2: Kubernetes Resource Generation**
```
endpoint-builder → reads hardware config → generates K8s manifests
```
- **endpoint-builder** processes the build task
- Reads `recommendedHardware` from the ConfigMap
- Generates template variables: `${STORAGE_DICT}`, `${WORKDIR_VOLUME_CONFIG}`
- Substitutes the variables into the deployment template (an illustrative substitution sketch follows)
- Creates: Deployment, Service, HPA
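
Conceptually, the substitution step behaves like shell variable expansion over the deployment template stored in `service_template_config_map.yaml`. The sketch below is illustrative only: the variable values are made up, `deployment_template.yaml` is a hypothetical extract of that template, and the builder performs the substitution in its own Python code rather than with `envsubst`:

```bash
# Hypothetical stand-in values for the generated template variables
export STORAGE_DICT='ephemeral-storage: 50Gi'
export WORKDIR_VOLUME_CONFIG='emptyDir: {}'

# Conceptual equivalent of the builder's substitution step
# (deployment_template.yaml is a hypothetical extract of the template ConfigMap)
envsubst '${STORAGE_DICT} ${WORKDIR_VOLUME_CONFIG}' < deployment_template.yaml > deployment.yaml
```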

#### **Step 3: Pod Scheduling & Container Creation**
```
K8s Scheduler → GPU Node → Container Creation
```
- **Scheduler** assigns the pod to `hpc-k8s-phy-wrk-g01` (the only GPU node)
- **kubelet** pulls images: `model-engine:onprem20`, `vllm:vllm-onprem`
- Creates **2 containers**: `http-forwarder` + `main`

#### **Step 4: Model Download & Preparation**
```
main container → AWS CLI install → S3 download → File verification
```
- **AWS CLI installation**: `pip install --quiet awscli --no-cache-dir`
- **S3 download**: `aws s3 sync s3://scale-gp-models/intermediate-model-aws model_files`
- **File verification**: Wait for temp suffixes to be removed
- **Timing coordination**: Verify `config.json` and `.safetensors` files are ready

#### **Step 5: VLLM Server Startup**
```
Model files ready → VLLM startup → Health checks → Service ready
```
- **VLLM startup**: `python -m vllm_server --model model_files`
- **Health checks**: `/health` endpoint on port 5005 (a spot-check is sketched below)
- **Service routing**: `http-forwarder` routes traffic to VLLM
- **Pod status**: Transitions from `0/2` → `2/2` Running
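
Once the pod reports `2/2 Running`, the VLLM health endpoint can be spot-checked directly. A minimal sketch using a port-forward so nothing extra has to exist inside the container (`ENDPOINT_POD` is the placeholder used in the monitoring commands below):

```bash
# Forward the VLLM port locally and hit the health endpoint
kubectl port-forward ENDPOINT_POD -n llm-core 5005:5005 &
curl -sf http://localhost:5005/health && echo "healthy"
```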

#### **Step 6: Inference Ready**
```
2/2 Running → Load balancer → External access
```
- Both containers healthy and ready
- Service endpoints accessible
- Ready for inference requests

### **Container Architecture**
```
Pod: launch-endpoint-id-end-{ID}
├── Container: http-forwarder (model-engine:onprem20)
│   └── Routes traffic to main container
└── Container: main (vllm:vllm-onprem)
    ├── AWS CLI installation (~5-10 min)
    ├── S3 model download (~30-60 min)
    ├── File verification & timing coordination
    └── VLLM server startup
```

---

## 🛠️ Operational Procedures

### **Testing Workflow**

#### **1. Deploy New Image Version**
```bash
# Update the values.yaml tag, then:
kubectl rollout restart deployment model-engine -n llm-core
kubectl rollout restart deployment model-engine-endpoint-builder -n llm-core

# Verify image deployment
kubectl describe pod $(kubectl get pods -n llm-core | grep "model-engine" | head -1 | awk '{print $1}') -n llm-core | grep "Image:"
```

#### **2. Create Test Endpoint**
```bash
# Start port-forward
kubectl port-forward svc/model-engine -n llm-core 5000:80 &

# Create endpoint (50GB storage is critical!)
curl -X POST -H "Content-Type: application/json" -u "test-user-id:" "http://localhost:5000/v1/llm/model-endpoints" -d '{
"name": "test-endpoint-v1",
"model_name": "test-model",
"endpoint_type": "streaming",
"inference_framework": "vllm",
"inference_framework_image_tag": "vllm-onprem",
"source": "hugging_face",
"checkpoint_path": "s3://scale-gp-models/intermediate-model-aws/",
"num_shards": 1,
"cpus": 4,
"memory": "16Gi",
"storage": "50Gi",
"gpus": 1,
"gpu_type": "nvidia-tesla-t4",
"nodes_per_worker": 1,
"min_workers": 1,
"max_workers": 1,
"per_worker": 1,
"metadata": {"team": "test", "product": "llm-engine"},
"labels": {"team": "test", "product": "llm-engine"}
}'
```

#### **3. Monitor Endpoint Progress**
```bash
# Check pod creation
kubectl get all -n llm-core | grep "launch-endpoint"

# Monitor container processes
kubectl exec ENDPOINT_POD -n llm-core -c main -- ps aux

# Check download progress
kubectl exec ENDPOINT_POD -n llm-core -c main -- ls -la model_files/
kubectl exec ENDPOINT_POD -n llm-core -c main -- du -sh model_files/

# Monitor logs
kubectl logs ENDPOINT_POD -n llm-core -c main --tail=10 -f
```

#### **4. Cleanup Failed Endpoints**
```bash
# Delete endpoint resources
kubectl delete deployment ENDPOINT_DEPLOYMENT -n llm-core
kubectl delete service ENDPOINT_SERVICE -n llm-core
kubectl delete hpa ENDPOINT_HPA -n llm-core

# Clean up old replica sets
kubectl get replicasets -n llm-core | grep model-engine | awk '$3 == 0 {print $1}' | xargs -r kubectl delete replicaset -n llm-core
```
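
To see everything a test endpoint created (and to get the exact names to plug into the delete commands above), the resources can be listed in one pass:

```bash
# List the Deployment, Service, and HPA a test endpoint created
kubectl get deployments,services,hpa -n llm-core | grep "launch-endpoint"
```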

### **Common Issues & Quick Fixes**

| Issue | Symptoms | Root Cause | Solution |
|-------|----------|------------|----------|
| **Container Termination** | Exit Code 137, pod dies in <5min | Storage limits exceeded | Use 50GB storage (not 100GB+) |
| **Slow AWS CLI Install** | 30+ minute installations | Missing optimization flag | Verify `--no-cache-dir` in command |
| **Architecture Errors** | `Qwen3ForCausalLM not supported` | Wrong VLLM image | Use `vllm-onprem` tag |
| **Download Fails** | No model_files directory | AWS CLI or S3 auth issues | Check `which aws`, verify credentials |
| **Premature VLLM Start** | `No config format found` | Timing coordination missing | Verify `while` loop in command |

### **Key Monitoring Commands**
```bash
# Check cluster storage capacity
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable.'nvidia\.com/gpu',EPHEMERAL-STORAGE:.status.allocatable.ephemeral-storage

# Monitor active downloads
kubectl exec ENDPOINT_POD -n llm-core -c main -- ps aux | grep aws

# Check file finalization status
kubectl exec ENDPOINT_POD -n llm-core -c main -- ls -la model_files/ | grep -E "\.tmp|\..*[A-Za-z0-9]{8}$"

# Monitor endpoint builder
kubectl logs deployment/model-engine-endpoint-builder -n llm-core --tail=20
```

---

## 🚨 Known Issues & Future Work

### **Critical Unresolved Issues**

#### **1. PVC Functionality Broken**
- **Status**: All attempts to use PVC storage fail
- **Root Cause**: Async hardware config bug in the application code
- **Error**: `RuntimeWarning: coroutine '_get_recommended_hardware_config_map' was never awaited` (see the log check below)
- **Impact**: Always falls back to EmptyDir instead of PVC
- **Workaround**: Using ephemeral storage with reduced limits
- **PVC Code Status**: PVC implementation has been **reverted from both repositories** and is **scheduled for rework next week**
- **Fix Required**: Changes to the `llm-engine` repository to properly await the async hardware config function
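
Until the rework lands, the quickest way to confirm a failed PVC attempt hit this specific bug (rather than a storage-class or quota problem) is to look for the warning in the endpoint builder logs and check what the pod actually mounted:

```bash
# Check whether the endpoint builder hit the un-awaited coroutine bug
kubectl logs deployment/model-engine-endpoint-builder -n llm-core --tail=500 | grep -i "never awaited"

# Confirm the fallback: the pod's volumes will show emptyDir instead of a persistentVolumeClaim
kubectl get pod ENDPOINT_POD -n llm-core -o jsonpath='{.spec.volumes}'
```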

#### **2. Storage Scaling Limitations**
- **Current**: Single GPU node with 189GB ephemeral storage
- **Constraint**: Large models require more storage than available
- **Options**: Add GPU nodes, expand node storage, or implement working PVC

#### **3. Download Performance**
- **Current**: ~4MB/s download speeds from Scality S3
- **Optimization**: Could pre-install AWS CLI in base images (an untested CLI tuning sketch follows)
- **Alternative**: Use faster download tools or local mirrors
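
One option that has not been validated here but costs little to try before building new tooling: raising the AWS CLI's S3 transfer concurrency in the container right before the sync runs. Treat the numbers below as starting points, not tested values:

```bash
# Untested assumption: higher concurrency / larger chunks may improve the ~4MB/s sync speed
aws configure set default.s3.max_concurrent_requests 20
aws configure set default.s3.multipart_chunksize 64MB
```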

### **Prevention Guidelines**
- **Always use 50GB storage** for tesla-t4 hardware (not 100GB+)
- **Always use `vllm-onprem` tag** (not version-specific like `0.6.3-rc1`)
- **Always include `--no-cache-dir`** in AWS CLI installation commands
- **Test endpoint creation** immediately after any image updates
- **Monitor container uptime** - quick termination indicates problems

---

## 📁 Critical File Locations

### **oman-national-llm Repository**
```
infra/charts/model-engine/
├── values.yaml                              # Main configuration
├── templates/
│   ├── service_template_config_map.yaml     # Pod/deployment templates
│   ├── recommended_hardware_config_map.yaml # Hardware specifications
│   ├── service_config_map.yaml              # Service configuration
│   └── _helpers.tpl                         # Helm helper functions
```

---

## 🚀 Quick Reference

### **Working API Call**
```bash
# CRITICAL: "storage" must be 50Gi or less for the current GPU node
curl -X POST -H "Content-Type: application/json" -u "test-user-id:" "http://localhost:5000/v1/llm/model-endpoints" -d '{
"name": "test-endpoint-v1",
"model_name": "test-model",
"endpoint_type": "streaming",
"inference_framework": "vllm",
"inference_framework_image_tag": "vllm-onprem",
"source": "hugging_face",
"checkpoint_path": "s3://scale-gp-models/intermediate-model-aws/",
"num_shards": 1,
"cpus": 4,
"memory": "16Gi",
"storage": "50Gi",
"gpus": 1,
"gpu_type": "nvidia-tesla-t4",
"nodes_per_worker": 1,
"min_workers": 1,
"max_workers": 1,
"per_worker": 1,
"metadata": {"team": "test", "product": "llm-engine"},
"labels": {"team": "test", "product": "llm-engine"}
}'
```

### **Emergency Revert Procedure**
```bash
# Revert to last working state
kubectl set image deployment/model-engine model-engine=registry.odp.om/odp-development/oman-national-llm/model-engine:onprem20 -n llm-core
kubectl set image deployment/model-engine-endpoint-builder model-engine-endpoint-builder=registry.odp.om/odp-development/oman-national-llm/model-engine:onprem20 -n llm-core

# Update values.yaml:
#   tag: onprem20
#   vllm_tag: "vllm-onprem"

# Verify storage configuration in hardware specs:
#   storage: 50Gi
```
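
After the revert, a quick way to confirm both deployments finished rolling back and are running the expected image:

```bash
# Wait for the rollback to complete
kubectl rollout status deployment/model-engine -n llm-core
kubectl rollout status deployment/model-engine-endpoint-builder -n llm-core

# Confirm the running image tag
kubectl get deployment model-engine -n llm-core \
  -o jsonpath='{.spec.template.spec.containers[*].image}'
```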

---

*This documentation represents the culmination of extensive testing and debugging to achieve the first stable model-engine deployment. Preserve this configuration as the baseline for future development.*
Anecdotally, we found it a lot easier to performance-tune pure uvicorn, so we actually migrated most usage of gunicorn back to uvicorn. That said, this won't block your usage of it.