Launch vLLM as a custom application

This tutorial walks you through deploying language model inference endpoints on Denvr Dataworks using the Custom Application feature. We'll set up a vLLM server hosting mistralai/Mistral-7B-v0.1 with OpenAI-compatible APIs, leveraging Denvr Dataworks's GPU infrastructure and container orchestration capabilities.

For detailed information about the Mistral-7B model, including its architecture, capabilities, and licensing terms, visit the official model page on Hugging Face.

💡 Using Different Models: To use a different model, replace mistralai/Mistral-7B-v0.1 in the command override, select the appropriate hardware, accept any license agreements on the model's Hugging Face page, and update API examples accordingly.


Understanding Custom Applications in Denvr Dataworks

Denvr Dataworks's Custom Application feature is a powerful deployment method that gives you complete control over container configuration. Unlike other applications with predefined settings, custom applications allow you to:

  • Bring Your Own Container: Use any Docker image from public or private registries

  • Reverse Proxy with API Authentication: Configure an optional reverse proxy layer with API key authentication for secure access control

  • Readiness Monitoring: Configure a readiness port that gets pinged to verify your application is ready

  • Command Override: Replace the default container entry point with your own commands

  • Environment Customization: Set custom environment variables for your specific needs

  • User Scripts: Upload custom scripts that are mounted at /etc/script/user-scripts and executable from your command override

ℹ️ Deployment Requirement: Custom applications are currently only available to launch through reserved resource pools. On-demand resource pools are not yet supported for custom application deployments.

Prerequisites

For this tutorial you need:

  • A Denvr Dataworks account with access to Custom Application deployments (available through the Application Catalog)

  • A Hugging Face account with a User Access Token (you'll create this in the next section)

  • Command line access with curl for testing the deployment (or any HTTP client)

Hugging Face Model Access

This tutorial requires a Hugging Face token to be configured as an environment variable in your custom application.

Creating Your Hugging Face Token:

  1. Log into your Hugging Face account

  2. Navigate to Settings → Access Tokens

  3. Create a new token with Read permissions

  4. Copy the token value for use in Denvr Dataworks's environment variable configuration
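
Optionally, you can sanity-check the token locally before pasting it into the deployment. This is a minimal sketch, assuming the huggingface_hub Python package is installed on your workstation:

from huggingface_hub import whoami

# Replace with the token you just created; a valid token returns your account info.
info = whoami(token="hf_xxxxxxxxxxxxxxxxxxxxx")
print("Token belongs to:", info["name"])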

Model License Requirements: Before deployment, visit the Mistral-7B model page and accept any required license agreements. vLLM will not complete initialization if your account hasn't accepted the model's usage terms.

Create the Deployment Using Denvr Dataworks

Step 1: Navigate to Application Catalog

  1. Login to Denvr Dataworks: Access your organization's Denvr Dataworks console

  2. Navigate to Applications: Go to Applications → Catalog from the main navigation

  3. Select Custom Application: Look for the Custom Application card

  4. Initialize Configuration: Click to start the custom application deployment

Step 2: Basic Information Configuration

  • Application Name: Use a descriptive, systematic naming convention:

    • Example: vllm-[model-name]-[environment]-[version]

  • Reserved Resource Pools Only: Custom applications are currently only allowed through reserved resource pools

Step 3: Custom Application Configuration

In the Application section (custom application specific fields):

  • Image Repository Type: Select Public (since we're using the official Docker Hub image)

  • Container Image URL: docker.io/vllm/vllm-openai:v0.10.1.1

Step 4: Hardware Configuration

In the Instance Type Configuration section, select a GPU with sufficient VRAM for your model:

For Mistral-7B (14-16GB VRAM required):

  • Recommended: A100 MIG 3g.20gb (20GB) - Cost-effective for 7B models

  • Alternative: A100 Full (40GB) - Higher throughput

💡 VRAM Estimates: Requirements assume FP16 inference. Using quantization (4-bit/8-bit) can significantly reduce VRAM needs.
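
As a rough check: FP16 weights use about 2 bytes per parameter, so a 7B-parameter model needs roughly 7 × 2 ≈ 14 GB for the weights alone, with the KV cache and activations adding a few more gigabytes on top. That is why the 20GB MIG slice is a comfortable fit for this tutorial.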

Step 5: Environment Variables

In the Environment Variables section, configure the following variables:

Required:

  • HF_TOKEN - Your Hugging Face User Access Token obtained in the prerequisites

Recommended - Cache Configuration:

To avoid filling up the limited root filesystem storage, configure caching to use direct-attached storage:

HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxx
XDG_CACHE_HOME=/mnt/direct-attached/.cache
HF_HOME=/mnt/direct-attached/.cache/huggingface
HUGGINGFACE_HUB_CACHE=/mnt/direct-attached/.cache/huggingface/hub
CUDA_CACHE_PATH=/mnt/direct-attached/.cache/nv/ComputeCache
PIP_CACHE_DIR=/mnt/direct-attached/.cache/pip
VLLM_CACHE_ROOT=/mnt/direct-attached/.cache/vllm
VLLM_ASSETS_CACHE=/mnt/direct-attached/.cache/vllm/assets

💡 Cache Benefits: Configuring cache directories in direct-attached storage prevents filling up the limited root filesystem. Model downloads, compiled CUDA kernels, and other cached data can consume significant space and should be stored on direct-attached storage.

Step 6: Advanced Settings

In the Advanced Settings section, configure both the command override and port settings:

Command Override Configuration:

In the Command Override field, enter the following command. You can add additional vLLM engine arguments (like --max-model-len 8192 or --enable-prefix-caching) after --port 8000 and before the & symbol to customize performance and behavior.

For a complete list of available vLLM engine arguments, see the vLLM Engine Arguments documentation.

["/bin/bash", "-c", "vllm serve mistralai/Mistral-7B-v0.1 --download-dir /mnt/direct-attached/vllm-cache --host 127.0.0.1 --port 8000 & sleep infinity"]

ℹ️ About Command Override: This field allows you to specify exactly how the container should start, overriding the default Docker image command.

Step 7: Proxy Settings

In the Proxy Settings section, configure the reverse proxy for your deployment:

Port Configuration:

  • Proxy Port: Set to 8000 (required to enable the reverse proxy and route external traffic to your vLLM service)

API Authentication (Optional but Recommended):

  • API Keys: Add your API keys in the API Keys field to enable secure access through the reverse proxy.

ℹ️ Proxy Port: This setting is required to enable the reverse proxy layer. When set, all external traffic routes through the reverse proxy to your vLLM service on this port.

💡 API Key Best Practices: Use descriptive prefixes like sk-prod-, sk-staging-, or sk-dev- to identify different keys. Keep these keys secure and never commit them to version control.

ℹ️ Without API Keys: If you don't configure any API keys but set the Proxy Port, requests will pass through the reverse proxy without authentication requirements.

Step 8: Deploy the Application

  1. Review all configuration settings

  2. Click Deploy Application

  3. Monitor the deployment status in the Applications Overview or Application Details

  4. To view runtime logs, click on your deployment name and navigate to the Runtime Logs tab to monitor the model download and vLLM initialization progress

The vLLM initialization process will:

  • Pull the vLLM Docker image

  • Start the container with your specified configuration

  • Download the Mistral-7B model weights from Hugging Face

  • Initialize the vLLM server with the OpenAI-compatible API

ℹ️ Deployment Time: Initial deployment may take 5-10 minutes as the system downloads the ~14GB model weights. The application will show as ONLINE, but the vLLM OpenAI-compatible endpoints will not be accessible until initialization is complete. You can monitor progress by viewing the runtime logs in the application details page.
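
If you would rather script the wait than watch the logs, a minimal polling sketch (assuming the Python requests package and the DNS address and API key placeholders used later in this guide) could look like this:

import time
import requests

def wait_until_ready(endpoint_url, api_key=None, timeout_s=1800):
    """Poll the vLLM /health endpoint until it returns 200 or we time out."""
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            r = requests.get(f"{endpoint_url}/health", headers=headers, timeout=10)
            if r.status_code == 200:
                print("vLLM is ready.")
                return True
        except requests.RequestException:
            pass  # the proxy may not be routing yet; keep waiting
        time.sleep(30)
    print("Timed out waiting for vLLM to become ready.")
    return False

wait_until_ready("https://your-dns-address", "your-api-key")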

Alternative: Deploy Using API

You can also use the Denvr Dataworks API directly to create Custom Applications.

Using the Swagger API Interface

  1. Access Swagger UI: Navigate to https://api.cloud.denvrdata.com/

  2. Login: Log in with your user information (tenancy name, username/email, and password)

  3. Find the Endpoint: Locate POST /api/v1/servers/applications/CreateCustomApplication

  4. Try It Out: Click "Try it out" to enable the request body editor

  5. Configure Request: Customize and paste the JSON configuration below into the request body

  6. Execute: Click "Execute" to create your custom application

JSON Configuration Template

{
  "name": "vllm-mistral-7b-v1",
  "cluster": "Msc1",
  "hardwarePackageName": "g-nvidia-1xa100-20gb-pcie-6vcpu-45gb",
  "imageUrl": "docker.io/vllm/vllm-openai:v0.10.1.1",
  "imageCmdOverride": [
    "/bin/bash",
    "-c",
    "vllm serve mistralai/Mistral-7B-v0.1 --download-dir /mnt/direct-attached/vllm-cache --host 127.0.0.1 --port 8000 & sleep infinity"
  ],
  "environmentVariables": {
    "HF_TOKEN": "hf_xxxxxxxxxxxxxxxxxxxxx",
    "XDG_CACHE_HOME": "/mnt/direct-attached/.cache",
    "HF_HOME": "/mnt/direct-attached/.cache/huggingface",
    "HUGGINGFACE_HUB_CACHE": "/mnt/direct-attached/.cache/huggingface/hub",
    "CUDA_CACHE_PATH": "/mnt/direct-attached/.cache/nv/ComputeCache",
    "PIP_CACHE_DIR": "/mnt/direct-attached/.cache/pip",
    "VLLM_CACHE_ROOT": "/mnt/direct-attached/.cache/vllm",
    "VLLM_ASSETS_CACHE": "/mnt/direct-attached/.cache/vllm/assets"
  },
  "imageRepository": {
    "hostname": "https://index.docker.io/v1/",
  },
  "resourcePool": "your-reserved-resource-pool",
  "proxyPort": 8000,
  "proxyApiKeys": ["your-api-key1", "your-api-key2s"],
  "persistDirectAttachedStorage": false,
  "personalSharedStorage": true,
  "tenantSharedStorage": true,
  "securityContext": {
    "runAsRoot": true,
    "containerUid": null,
    "containerGid": null
  }
}

Key Parameters to Customize:

  • name: Your application name

  • cluster: Your cluster name (e.g., "Msc1")

  • hardwarePackageName: Match available hardware in your cluster

  • resourcePool: Your reserved resource pool name

  • HF_TOKEN: Your Hugging Face token

  • proxyApiKeys: Array of API keys for reverse proxy authentication
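
If you prefer to script the same call outside Swagger, a rough sketch with Python's requests might look like the following. It assumes you already hold a valid bearer token for the Denvr Dataworks API (for example, obtained through your normal API login); consult the API documentation for the exact authentication flow.

import requests

API_BASE = "https://api.cloud.denvrdata.com"
BEARER_TOKEN = "your-denvr-api-token"  # assumption: obtained via your normal API login

payload = {
    "name": "vllm-mistral-7b-v1",
    "cluster": "Msc1",
    "hardwarePackageName": "g-nvidia-1xa100-20gb-pcie-6vcpu-45gb",
    # ... remaining fields exactly as in the JSON template above ...
}

response = requests.post(
    f"{API_BASE}/api/v1/servers/applications/CreateCustomApplication",
    headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
    json=payload,
)
response.raise_for_status()
print(response.json())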

Accessing Your vLLM Deployment

Finding Your Application's DNS Address

  1. Navigate to Applications → Overview

  2. Find your vLLM deployment in the list - the DNS address is displayed in the overview table

  3. Alternatively, click on the deployment name to view Application Details where you'll also find the DNS address

Once you have the DNS address, you can access your vLLM endpoint at that address (e.g., https://abc-123-456-789-012.cloud.denvrdata.com)

Authentication

If you configured API keys, your vLLM endpoint will require authentication through the reverse proxy. Include one of your configured API keys in all requests as a bearer token:

-H "Authorization: Bearer your-api-key"

Replace your-api-key with one of the API keys you configured (they appear as API_KEY_1, API_KEY_2, etc. in the proxy settings).

ℹ️ Without API Keys: If you didn't configure any API keys but set the Proxy Port, requests will pass through the reverse proxy without authentication requirements.

Testing Your Deployment

1. Health Check

First, verify that your vLLM server is running by checking the health endpoint:

curl https://your-dns-address/health \
  -H "Authorization: Bearer your-api-key"

A successful response returns HTTP/1.1 200 OK with no body, indicating the server is healthy and ready to accept requests.

2. List Available Models

Once the health check passes, verify your model is loaded and accessible:

curl https://your-dns-address/v1/models \
  -H "Authorization: Bearer your-api-key"

This should return a response showing the mistralai/Mistral-7B-v0.1 model is available:

{
  "object": "list",
  "data": [
    {
      "id": "mistralai/Mistral-7B-v0.1",
      "object": "model",
      "created": 1737380356,
      "owned_by": "vllm",
      "root": "mistralai/Mistral-7B-v0.1",
      "parent": null,
      "max_model_len": 32768,
      "permission": [
        {
          "id": "modelperm-e3d0f87e19a548b2be64ca274a4550a6",
          "object": "model_permission",
          "created": 1737380356,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}

Using Your vLLM Deployment

Using curl (Command Line)

The simplest way to test your deployment is with curl commands:

Text Completion

curl https://your-dns-address/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-api-key" \
  -d '{
    "model": "mistralai/Mistral-7B-v0.1",
    "prompt": "Explain machine learning in simple terms:",
    "max_tokens": 150,
    "temperature": 0.7
  }'

Metrics

vLLM exposes Prometheus metrics for monitoring performance and resource utilization:

curl https://your-dns-address/metrics \
  -H "Authorization: Bearer your-api-key"

This returns metrics like request counts, GPU memory usage, and latency measurements that you can use for monitoring and optimization.
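
To eyeball a few key gauges without a full Prometheus setup, you can fetch and filter the metrics text directly. This short sketch assumes vLLM's metric names carry the usual vllm: prefix:

import requests

resp = requests.get(
    "https://your-dns-address/metrics",
    headers={"Authorization": "Bearer your-api-key"},
)
resp.raise_for_status()

# Print only vLLM's own metrics, skipping comment lines and unrelated exporters.
for line in resp.text.splitlines():
    if line.startswith("vllm:"):
        print(line)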

ℹ️ Note: If you didn't configure API keys, omit the -H "Authorization: Bearer your-api-key" header from all commands.

Simple Python Client

Here's a simple Python client to get you started with your vLLM deployment:

import requests

def test_vllm_deployment(endpoint_url, api_key):
    """Simple function to test your vLLM deployment."""
    
    # Test basic completion
    response = requests.post(
        f"{endpoint_url}/v1/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": "mistralai/Mistral-7B-v0.1",
            "prompt": "Explain machine learning in simple terms:",
            "max_tokens": 100,
            "temperature": 0.7
        }
    )
    
    if response.status_code == 200:
        result = response.json()
        print("Generated text:", result['choices'][0]['text'])
        return True
    else:
        print(f"Error: {response.status_code} - {response.text}")
        return False

# Example usage
if __name__ == "__main__":
    endpoint_url = "https://your-dns-address"  # Replace with your actual DNS address
    api_key = "your-api-key"                   # Replace with one of your configured API keys
    
    test_vllm_deployment(endpoint_url, api_key)

ℹ️ Note: If you didn't configure API keys, you can omit the Authorization header or pass None as the api_key.

Using with OpenAI SDK

Your vLLM deployment is compatible with the OpenAI Python SDK:

from openai import OpenAI

# Initialize client with your vLLM endpoint
client = OpenAI(
    base_url="https://your-dns-address/v1",
    api_key="your-api-key",  # Use one of your configured API keys
)

# Generate text completion
completion = client.completions.create(
    model="mistralai/Mistral-7B-v0.1",
    prompt="Explain the benefits of containerized ML inference:",
    max_tokens=150,
    temperature=0.7
)

print(completion.choices[0].text)

ℹ️ Note: If you didn't configure API keys, you can set api_key="not-needed" or any string value.

💡 Base vs Instruction Models: mistralai/Mistral-7B-v0.1 is a base model, so we use client.completions.create() with a prompt. For instruction-tuned models (like mistralai/Mistral-7B-Instruct-v0.1), you would use client.chat.completions.create() with messages.
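
For illustration, if you had deployed an instruction-tuned model such as mistralai/Mistral-7B-Instruct-v0.1 instead (and updated the command override accordingly), the equivalent chat-style call would look like this:

from openai import OpenAI

client = OpenAI(
    base_url="https://your-dns-address/v1",
    api_key="your-api-key",
)

# Chat-style request for an instruction-tuned model (hypothetical deployment).
chat = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    messages=[
        {"role": "user", "content": "Explain the benefits of containerized ML inference."},
    ],
    max_tokens=150,
    temperature=0.7,
)

print(chat.choices[0].message.content)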

Quick Reference

Essential curl Commands

Once your deployment is running, use these commands to interact with your vLLM endpoint. Replace your-dns-address with your deployment's DNS address and your-api-key with one of your configured API keys:

# Health check
curl https://your-dns-address/health \
  -H "Authorization: Bearer your-api-key"

# List available models
curl https://your-dns-address/v1/models \
  -H "Authorization: Bearer your-api-key"

# Text completion
curl https://your-dns-address/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-api-key" \
  -d '{
    "model": "mistralai/Mistral-7B-v0.1",
    "prompt": "Your prompt here",
    "max_tokens": 100,
    "temperature": 0.7
  }'

Common Request Parameters

Key parameters for the /v1/completions endpoint:

  • prompt: The input text to generate completion for (required)

  • max_tokens: Maximum tokens to generate (optional, defaults to model's max length)

  • temperature: Randomness control, 0.0-2.0 (default: 1.0)

  • stream: Enable streaming responses (default: false)

See the vLLM OpenAI-Compatible Server docs for all parameters.
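
Streaming is also supported. As a minimal sketch using the same placeholders as above, you can set stream=True with the OpenAI SDK and print tokens as they arrive:

from openai import OpenAI

client = OpenAI(base_url="https://your-dns-address/v1", api_key="your-api-key")

# Stream completion tokens as they are generated instead of waiting for the full response.
stream = client.completions.create(
    model="mistralai/Mistral-7B-v0.1",
    prompt="Write a haiku about GPUs:",
    max_tokens=60,
    temperature=0.7,
    stream=True,
)

for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)
print()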

Troubleshooting

  • Connection refused: Check the runtime logs to see if vLLM is still initializing.

  • Unauthorized error: Verify your API key matches one of the configured API_KEY_* values shown in the proxy settings.

  • Model not found: Ensure the model name exactly matches mistralai/Mistral-7B-v0.1.

  • "RuntimeError: Engine core initialization failed.": This usually indicates an out-of-memory error; choose a GPU with more VRAM, use a smaller model, reduce batch size, or enable quantization (e.g., 4-bit).
