Launch vLLM as a custom application
This tutorial walks you through deploying language model inference endpoints on Denvr Dataworks using the Custom Application feature. We'll set up a vLLM server hosting mistralai/Mistral-7B-v0.1 with OpenAI-compatible APIs, leveraging Denvr Dataworks's GPU infrastructure and container orchestration capabilities.
For detailed information about the Mistral-7B model, including its architecture, capabilities, and licensing terms, visit the official model page on Hugging Face.
💡 Using Different Models: To use a different model, replace mistralai/Mistral-7B-v0.1 in the command override, select the appropriate hardware, accept any license agreements on the model's Hugging Face page, and update the API examples accordingly.
Understanding Custom Applications in Denvr Dataworks
Denvr Dataworks's Custom Application feature is a powerful deployment method that gives you complete control over container configuration. Unlike other applications with predefined settings, custom applications allow you to:
Bring Your Own Container: Use any Docker image from public or private registries
Reverse Proxy with API Authentication: Configure an optional reverse proxy layer with API key authentication for secure access control
Readiness Monitoring: Configure a readiness port that gets pinged to verify your application is ready
Command Override: Replace the default container entry point with your own commands
Environment Customization: Set custom environment variables for your specific needs
User Scripts: Upload custom scripts that are mounted at /etc/script/user-scripts and are executable from your command override
ℹ️ Deployment Requirement: Custom applications are currently only available to launch through reserved resource pools. On-demand resource pools are not yet supported for custom application deployments.
Prerequisites
For this tutorial you need:
A Denvr Dataworks account with Custom Application deployment capability (found through the Application Catalog)
A Hugging Face account with a User Access Token (you'll create this in the next section)
Command-line access with curl for testing the deployment (or any other HTTP client)
Hugging Face Model Access
This tutorial requires a Hugging Face token to be configured as an environment variable in your custom application.
Creating Your Hugging Face Token:
Log into your Hugging Face account
Navigate to Settings → Access Tokens
Create a new token with Read permissions
Copy the token value for use in Denvr Dataworks's environment variable configuration
Model License Requirements: Before deployment, visit the Mistral-7B model page and accept any required license agreements. vLLM will not complete initialization if your account hasn't accepted the model's usage terms.
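If you want to confirm up front that your token can reach the gated model, you can check it locally before deploying. This is an optional sketch using the huggingface_hub Python package (installing the package and the placeholder token value are assumptions, not part of the Denvr Dataworks setup):

from huggingface_hub import model_info

# Replace with the token you created in the steps above.
token = "hf_xxxxxxxxxxxxxxxxxxxxx"

# Raises an error if the token is invalid or if this account has not
# accepted the model's license agreement.
info = model_info("mistralai/Mistral-7B-v0.1", token=token)
print("Token can access:", info.id)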
Create the Deployment Using Denvr Dataworks
Step 1: Navigate to Application Catalog
Login to Denvr Dataworks: Access your organization's Denvr Dataworks console
Navigate to Applications: Go to Applications → Catalog from the main navigation
Select Custom Application: Look for the Custom Application card
Initialize Configuration: Click to start the custom application deployment
Step 2: Basic Information Configuration
Application Name: Use a descriptive, systematic naming convention:
Example:
vllm-[model-name]-[environment]-[version]
Reserved Resource Pools Only: Custom applications are currently only allowed through reserved resource pools
Step 3: Custom Application Configuration
In the Application section (custom application specific fields):
Image Repository Type: Select Public (since we're using the official Docker Hub image)
Container Image URL: docker.io/vllm/vllm-openai:v0.10.1.1
Step 4: Hardware Configuration
In the Instance Type Configuration section, select a GPU with sufficient VRAM for your model:
For Mistral-7B (14-16 GB VRAM required):
Recommended: A100 MIG 3g.20gb (20 GB) - Cost-effective for 7B models
Alternative: A100 Full (40 GB) - Higher throughput
💡 VRAM Estimates: Requirements assume FP16 inference. Using quantization (4-bit/8-bit) can significantly reduce VRAM needs.
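As a rough sanity check on these numbers, the weight memory alone is roughly the parameter count times the bytes per parameter; a short illustrative Python sketch (KV cache and activation overhead come on top of this, which is why the 20 GB MIG slice is a comfortable fit):

# Approximate VRAM needed just for the model weights.
params = 7.24e9          # Mistral-7B has roughly 7.2B parameters
bytes_per_param = 2      # FP16/BF16 stores 2 bytes per parameter
weights_gb = params * bytes_per_param / 1e9
print(f"Weights alone: ~{weights_gb:.1f} GB")   # ~14.5 GB
# 4-bit quantization would cut the weight footprint to roughly a quarter.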
Step 5: Environment Variables
In the Environment Variables section, configure the following variables:
Required:
HF_TOKEN - Your Hugging Face User Access Token obtained in the prerequisites
Recommended - Cache Configuration:
To avoid filling up the limited root filesystem storage, configure caching to use direct-attached storage:
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxx
XDG_CACHE_HOME=/mnt/direct-attached/.cache
HF_HOME=/mnt/direct-attached/.cache/huggingface
HUGGINGFACE_HUB_CACHE=/mnt/direct-attached/.cache/huggingface/hub
CUDA_CACHE_PATH=/mnt/direct-attached/.cache/nv/ComputeCache
PIP_CACHE_DIR=/mnt/direct-attached/.cache/pip
VLLM_CACHE_ROOT=/mnt/direct-attached/.cache/vllm
VLLM_ASSETS_CACHE=/mnt/direct-attached/.cache/vllm/assets
💡 Cache Benefits: Configuring cache directories on direct-attached storage prevents filling up the limited root filesystem. Model downloads, compiled CUDA kernels, and other cached data can consume significant space and should be stored on direct-attached storage.
Step 6: Advanced Settings
In the Advanced Settings section, configure both the command override and port settings:
Command Override Configuration:
In the Command Override field, enter the following command. You can add additional vLLM engine arguments (like --max-model-len 8192 or --enable-prefix-caching) after --port 8000 and before the & symbol to customize performance and behavior.
For a complete list of available vLLM engine arguments, see the vLLM Engine Arguments documentation.
["/bin/bash", "-c", "vllm serve mistralai/Mistral-7B-v0.1 --download-dir /mnt/direct-attached/vllm-cache --host 127.0.0.1 --port 8000 & sleep infinity"]ℹ️ About Command Override: This field allows you to specify exactly how the container should start, overriding the default Docker image command.
Step 7: Proxy Settings
In the Proxy Settings section, configure the reverse proxy for your deployment:
Port Configuration:
Proxy Port: Set to 8000 (required to enable the reverse proxy and route external traffic to your vLLM service)
API Authentication (Optional but Recommended):
API Keys: Add your API keys in the API Keys field to enable secure access through the reverse proxy.
ℹ️ Proxy Port: This setting is required to enable the reverse proxy layer. When set, all external traffic routes through the reverse proxy to your vLLM service on this port.
💡 API Key Best Practices: Use descriptive prefixes like sk-prod-, sk-staging-, or sk-dev- to identify different keys. Keep these keys secure and never commit them to version control.
ℹ️ Without API Keys: If you don't configure any API keys but set the Proxy Port, requests will pass through the reverse proxy without authentication requirements.
Step 8: Deploy the Application
Review all configuration settings
Click Deploy Application
Monitor the deployment status in the Applications Overview or Application Details
To view runtime logs, click on your deployment name and navigate to the Runtime Logs tab to monitor the model download and vLLM initialization progress
The vLLM initialization process will:
Pull the vLLM Docker image
Start the container with your specified configuration
Download the Mistral-7B model weights from Hugging Face
Initialize the vLLM server with the OpenAI-compatible API
ℹ️ Deployment Time: Initial deployment may take 5-10 minutes as the system downloads the ~14GB model weights. The application will show as ONLINE, but the vLLM OpenAI-compatible endpoints will not be accessible until initialization is complete. You can monitor progress by viewing the runtime logs in the application details page.
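If you prefer to script the wait instead of watching the logs, you can poll the /health endpoint (described under Testing Your Deployment below) until it returns 200. A minimal Python sketch; the DNS address, API key, and timing values are placeholders:

import time
import requests

ENDPOINT = "https://your-dns-address"   # your deployment's DNS address
API_KEY = "your-api-key"                # omit the header if you didn't configure keys

# Poll every 30 seconds for up to 20 minutes while the model downloads and loads.
for _ in range(40):
    try:
        r = requests.get(f"{ENDPOINT}/health",
                         headers={"Authorization": f"Bearer {API_KEY}"},
                         timeout=10)
        if r.status_code == 200:
            print("vLLM is ready")
            break
    except requests.RequestException:
        pass  # server not reachable yet
    time.sleep(30)
else:
    print("Timed out waiting for vLLM to become ready")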
Alternative: Deploy Using API
You can also use the Denvr Dataworks API directly to create Custom Applications.
Using the Swagger API Interface
Access Swagger UI: Navigate to https://api.cloud.denvrdata.com/
Login: Log in with your user information (tenancy name, username/email, and password)
Find the Endpoint: Locate POST /api/v1/servers/applications/CreateCustomApplication
Try It Out: Click "Try it out" to enable the request body editor
Configure Request: Customize and paste the JSON configuration below into the request body
Execute: Click "Execute" to create your custom application
JSON Configuration Template
{
"name": "vllm-mistral-7b-v1",
"cluster": "Msc1",
"hardwarePackageName": "g-nvidia-1xa100-20gb-pcie-6vcpu-45gb",
"imageUrl": "docker.io/vllm/vllm-openai:v0.10.1.1",
"imageCmdOverride": [
"/bin/bash",
"-c",
"vllm serve mistralai/Mistral-7B-v0.1 --download-dir /mnt/direct-attached/vllm-cache --host 127.0.0.1 --port 8000 & sleep infinity"
],
"environmentVariables": {
"HF_TOKEN": "hf_xxxxxxxxxxxxxxxxxxxxx",
"XDG_CACHE_HOME": "/mnt/direct-attached/.cache",
"HF_HOME": "/mnt/direct-attached/.cache/huggingface",
"HUGGINGFACE_HUB_CACHE": "/mnt/direct-attached/.cache/huggingface/hub",
"CUDA_CACHE_PATH": "/mnt/direct-attached/.cache/nv/ComputeCache",
"PIP_CACHE_DIR": "/mnt/direct-attached/.cache/pip",
"VLLM_CACHE_ROOT": "/mnt/direct-attached/.cache/vllm",
"VLLM_ASSETS_CACHE": "/mnt/direct-attached/.cache/vllm/assets"
},
"imageRepository": {
"hostname": "https://index.docker.io/v1/",
},
"resourcePool": "your-reserved-resource-pool",
"proxyPort": 8000,
"proxyApiKeys": ["your-api-key1", "your-api-key2s"],
"persistDirectAttachedStorage": false,
"personalSharedStorage": true,
"tenantSharedStorage": true,
"securityContext": {
"runAsRoot": true,
"containerUid": null,
"containerGid": null
}
}
Key Parameters to Customize:
name: Your application name
cluster: Your cluster name (e.g., "Msc1")
hardwarePackageName: Match available hardware in your cluster
resourcePool: Your reserved resource pool name
HF_TOKEN: Your Hugging Face token
proxyApiKeys: Array of API keys for reverse proxy authentication
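If you'd rather send the request from a script than from the Swagger UI, the same call can be made programmatically. A hedged sketch using Python's requests library: the bearer-token Authorization header is an assumption (use whatever authentication your Denvr Dataworks account requires), and custom-app.json is a hypothetical filename containing the JSON template above:

import json
import requests

API_BASE = "https://api.cloud.denvrdata.com"
TOKEN = "your-denvr-api-token"   # assumption: a valid API credential for your account

# Load the JSON configuration template shown above from a local file.
with open("custom-app.json") as f:
    payload = json.load(f)

resp = requests.post(
    f"{API_BASE}/api/v1/servers/applications/CreateCustomApplication",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())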
Accessing Your vLLM Deployment
Finding Your Application's DNS Address
Navigate to Applications → Overview
Find your vLLM deployment in the list - the DNS address is displayed in the overview table
Alternatively, click on the deployment name to view Application Details where you'll also find the DNS address
Once you have the DNS address, you can access your vLLM endpoint at that address (e.g., https://abc-123-456-789-012.cloud.denvrdata.com)
Authentication
If you configured API keys, your vLLM endpoint will require authentication through the reverse proxy. Include one of your configured API keys in all requests as a bearer token:
-H "Authorization: Bearer your-api-key"Replace your-api-key with one of the API keys you set (e.g., API_KEY_1, API_KEY_2, etc.).
ℹ️ Without API Keys: If you didn't configure any API keys but set the Proxy Port, requests will pass through the reverse proxy without authentication requirements.
Testing Your Deployment
1. Health Check
First, verify that your vLLM server is running by checking the health endpoint:
curl https://your-dns-address/health \
-H "Authorization: Bearer your-api-key"A successful response returns HTTP/1.1 200 OK with no body, indicating the server is healthy and ready to accept requests.
2. List Available Models
Once the health check passes, verify your model is loaded and accessible:
curl https://your-dns-address/v1/models \
-H "Authorization: Bearer your-api-key"This should return a response showing the mistralai/Mistral-7B-v0.1 model is available:
{
"object": "list",
"data": [
{
"id": "mistralai/Mistral-7B-v0.1",
"object": "model",
"created": 1737380356,
"owned_by": "vllm",
"root": "mistralai/Mistral-7B-v0.1",
"parent": null,
"max_model_len": 32768,
"permission": [
{
"id": "modelperm-e3d0f87e19a548b2be64ca274a4550a6",
"object": "model_permission",
"created": 1737380356,
"allow_create_engine": false,
"allow_sampling": true,
"allow_logprobs": true,
"allow_search_indices": false,
"allow_view": true,
"allow_fine_tuning": false,
"organization": "*",
"group": null,
"is_blocking": false
}
]
}
]
}
Using Your vLLM Deployment
Using curl (Command Line)
The simplest way to test your deployment is with curl commands:
Text Completion
curl https://your-dns-address/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-api-key" \
-d '{
"model": "mistralai/Mistral-7B-v0.1",
"prompt": "Explain machine learning in simple terms:",
"max_tokens": 150,
"temperature": 0.7
}'
Metrics
vLLM exposes Prometheus metrics for monitoring performance and resource utilization:
curl https://your-dns-address/metrics \
-H "Authorization: Bearer your-api-key"This returns metrics like request counts, GPU memory usage, and latency measurements that you can use for monitoring and optimization.
ℹ️ Note: If you didn't configure API keys, omit the -H "Authorization: Bearer your-api-key" header from all commands.
Simple Python Client
Here's a simple Python client to get you started with your vLLM deployment:
import requests

def test_vllm_deployment(endpoint_url, api_key):
    """Simple function to test your vLLM deployment."""
    # Test basic completion
    response = requests.post(
        f"{endpoint_url}/v1/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": "mistralai/Mistral-7B-v0.1",
            "prompt": "Explain machine learning in simple terms:",
            "max_tokens": 100,
            "temperature": 0.7
        }
    )

    if response.status_code == 200:
        result = response.json()
        print("Generated text:", result['choices'][0]['text'])
        return True
    else:
        print(f"Error: {response.status_code} - {response.text}")
        return False

# Example usage
if __name__ == "__main__":
    endpoint_url = "https://your-dns-address"  # Replace with your actual DNS address
    api_key = "your-api-key"  # Replace with one of your configured API keys
    test_vllm_deployment(endpoint_url, api_key)
ℹ️ Note: If you didn't configure API keys, you can omit the Authorization header or pass None as the api_key.
Using with OpenAI SDK
Your vLLM deployment is compatible with the OpenAI Python SDK:
from openai import OpenAI
# Initialize client with your vLLM endpoint
client = OpenAI(
    base_url="https://your-dns-address/v1",
    api_key="your-api-key",  # Use one of your configured API keys
)

# Generate text completion
completion = client.completions.create(
    model="mistralai/Mistral-7B-v0.1",
    prompt="Explain the benefits of containerized ML inference:",
    max_tokens=150,
    temperature=0.7
)
print(completion.choices[0].text)
ℹ️ Note: If you didn't configure API keys, you can set api_key="not-needed" or any string value.
💡 Base vs Instruction Models: mistralai/Mistral-7B-v0.1 is a base model, so we use client.completions.create() with a prompt. For instruction-tuned models (like mistralai/Mistral-7B-Instruct-v0.1), you would use client.chat.completions.create() with messages.
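As an illustration of the chat-style call, here is a hedged sketch that assumes you deployed an instruction-tuned model such as mistralai/Mistral-7B-Instruct-v0.1 instead of the base model:

from openai import OpenAI

client = OpenAI(
    base_url="https://your-dns-address/v1",
    api_key="your-api-key",
)

# Chat completions expect a list of messages rather than a raw prompt.
chat = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",  # assumes this model was deployed
    messages=[
        {"role": "user", "content": "Explain machine learning in simple terms."}
    ],
    max_tokens=150,
    temperature=0.7,
)
print(chat.choices[0].message.content)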
Quick Reference
Essential curl Commands
Once your deployment is running, use these commands to interact with your vLLM endpoint. Replace your-dns-address with your deployment's DNS address and your-api-key with one of your configured API keys:
# Health check
curl https://your-dns-address/health \
-H "Authorization: Bearer your-api-key"
# List available models
curl https://your-dns-address/v1/models \
-H "Authorization: Bearer your-api-key"
# Text completion
curl https://your-dns-address/v1/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-api-key" \
-d '{
"model": "mistralai/Mistral-7B-v0.1",
"prompt": "Your prompt here",
"max_tokens": 100,
"temperature": 0.7
}'
Common Request Parameters
Key parameters for the /v1/completions endpoint:
prompt: The input text to generate a completion for (required)
max_tokens: Maximum tokens to generate (optional, defaults to the model's max length)
temperature: Randomness control, 0.0-2.0 (default: 1.0)
stream: Enable streaming responses (default: false)
See the vLLM OpenAI-Compatible Server docs for all parameters.
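As an example of the stream parameter listed above, the OpenAI SDK can receive tokens as they are generated. A minimal sketch using the same endpoint and model as the rest of this tutorial:

from openai import OpenAI

client = OpenAI(
    base_url="https://your-dns-address/v1",
    api_key="your-api-key",
)

# With stream=True the SDK yields chunks as the server produces tokens.
stream = client.completions.create(
    model="mistralai/Mistral-7B-v0.1",
    prompt="Explain machine learning in simple terms:",
    max_tokens=100,
    temperature=0.7,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)
print()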
Troubleshooting
Connection refused
Check the runtime logs to see if vLLM is still initializing
Unauthorized error
Verify your API key matches one of the API keys (API_KEY_*) configured in the proxy settings
Model not found
Ensure model name exactly matches: mistralai/Mistral-7B-v0.1
RuntimeError: Engine core initialization failed.
This is likely an out-of-memory error. Choose a GPU with more VRAM, use a smaller model, reduce the batch size, or enable quantization (e.g., 4-bit).