# Launch vLLM as a custom application

This tutorial walks you through deploying language model inference endpoints on Denvr Dataworks using the Custom Application feature. We'll set up a [vLLM](https://docs.vllm.ai/) server hosting `mistralai/Mistral-7B-v0.1` with OpenAI-compatible APIs, leveraging Denvr Dataworks's GPU infrastructure and container orchestration capabilities.

For detailed information about the Mistral-7B model, including its architecture, capabilities, and licensing terms, visit the [official model page on Hugging Face](https://huggingface.co/mistralai/Mistral-7B-v0.1).

> **💡 Using Different Models**: To use a different model, replace `mistralai/Mistral-7B-v0.1` in the command override, select the appropriate hardware, accept any license agreements on the model's Hugging Face page, and update API examples accordingly.

***

### Understanding Custom Applications in Denvr Dataworks

Denvr Dataworks's **Custom Application** feature lets you deploy your own container image with full control over its configuration. Unlike other applications with predefined settings, custom applications allow you to:

* **Bring Your Own Container**: Use any Docker image from public or private registries
* **Reverse Proxy with API Authentication**: Configure an optional reverse proxy layer with API key authentication for secure access control
* **Readiness Monitoring**: Configure a readiness port that gets pinged to verify your application is ready
* **Command Override**: Replace the default container entry point with your own commands
* **Environment Customization**: Set custom environment variables for your specific needs
* **User Scripts**: Upload custom scripts that are mounted at `/etc/script/user-scripts` and executable from your command override

> **ℹ️ Deployment Requirement**: Custom applications are currently only available to launch through reserved resource pools. On-demand resource pools are not yet supported for custom application deployments.

### Prerequisites

For this tutorial you need:

* A Denvr Dataworks account with **Custom Application** deployment capability (found through the **Application Catalog**)
* A Hugging Face account with a User Access Token (you'll create this in the next section)
* Command line access with `curl` for testing the deployment (or any HTTP client)

#### Hugging Face Model Access

This tutorial requires a Hugging Face token to be configured as an environment variable in your custom application.

**Creating Your Hugging Face Token:**

1. Log into your [Hugging Face account](https://huggingface.co/)
2. Navigate to **Settings** → **Access Tokens**
3. Create a new token with **Read** permissions
4. Copy the token value for use in Denvr Dataworks's environment variable configuration

**Model License Requirements:** Before deployment, visit the [Mistral-7B model page](https://huggingface.co/mistralai/Mistral-7B-v0.1) and accept any required license agreements. vLLM will not complete initialization if your account hasn't accepted the model's usage terms.

### Create the Deployment Using Denvr Dataworks

#### Step 1: Navigate to Application Catalog

1. **Log in to Denvr Dataworks**: Access your organization's Denvr Dataworks console
2. **Navigate to Applications**: Go to **Applications** → **Catalog** from the main navigation
3. **Select Custom Application**: Look for the **Custom Application** card
4. **Initialize Configuration**: Click to start the custom application deployment

#### Step 2: Basic Information Configuration

* **Application Name**: Use a descriptive, systematic naming convention:
  * Example: `vllm-[model-name]-[environment]-[version]`
* **Reserved Resource Pools Only**: Custom applications are currently only allowed through reserved resource pools

#### Step 3: Custom Application Configuration

In the **Application** section (custom application specific fields):

* **Image Repository Type**: Select `Public` (since we're using the official Docker Hub image)
* **Container Image URL**: `docker.io/vllm/vllm-openai:v0.10.1.1`

#### Step 4: Hardware Configuration

In the **Instance Type Configuration** section, select a GPU with sufficient VRAM for your model:

**For Mistral-7B (14-16GB VRAM required):**

* **Recommended**: A100 MIG `3g.20gb` (20GB) - Cost-effective for 7B models
* **Alternative**: A100 Full (40GB) - Higher throughput

> **💡 VRAM Estimates**: Requirements assume FP16 inference. Using quantization (4-bit/8-bit) can significantly reduce VRAM needs.
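
As a rough sanity check on these figures, weight memory is approximately the parameter count multiplied by the bytes per parameter. The sketch below assumes about 7.2 billion parameters for Mistral-7B (an approximation; check the model card for the exact count) and covers weights only. The KV cache, activations, and CUDA overhead need additional VRAM on top of this.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# PARAMS is an approximation for Mistral-7B, not an exact count.
PARAMS = 7.2e9

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params: float, dtype: str) -> float:
    """Return the approximate weight size in GB (10^9 bytes) for a dtype."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_gb(PARAMS, dtype):.1f} GB for weights alone")
```

The FP16 result lands near the lower end of the 14-16GB range, which is why a 20GB slice leaves only a few gigabytes for the KV cache.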

#### Step 5: Environment Variables

In the **Environment Variables** section, configure the following variables:

**Required:**

* `HF_TOKEN` - Your Hugging Face User Access Token obtained in the prerequisites

**Recommended - Cache Configuration:**

To avoid filling up the limited root filesystem storage, configure caching to use direct-attached storage:

```env
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxx
XDG_CACHE_HOME=/mnt/direct-attached/.cache
HF_HOME=/mnt/direct-attached/.cache/huggingface
HUGGINGFACE_HUB_CACHE=/mnt/direct-attached/.cache/huggingface/hub
CUDA_CACHE_PATH=/mnt/direct-attached/.cache/nv/ComputeCache
PIP_CACHE_DIR=/mnt/direct-attached/.cache/pip
VLLM_CACHE_ROOT=/mnt/direct-attached/.cache/vllm
VLLM_ASSETS_CACHE=/mnt/direct-attached/.cache/vllm/assets
```

> **💡 Cache Benefits**: Configuring cache directories in direct-attached storage prevents filling up the limited root filesystem. Model downloads, compiled CUDA kernels, and other cached data can consume significant space and should be stored on direct-attached storage.
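
If you keep deployment settings in code, the cache variables can be derived from a single base directory so that they stay consistent. This is a minimal sketch; the `cache_env` helper is a hypothetical name, and the paths simply mirror the list above.

```python
from pathlib import PurePosixPath


def cache_env(base: str = "/mnt/direct-attached/.cache") -> dict:
    """Build the cache-related environment variables under one base directory."""
    root = PurePosixPath(base)
    return {
        "XDG_CACHE_HOME": str(root),
        "HF_HOME": str(root / "huggingface"),
        "HUGGINGFACE_HUB_CACHE": str(root / "huggingface" / "hub"),
        "CUDA_CACHE_PATH": str(root / "nv" / "ComputeCache"),
        "PIP_CACHE_DIR": str(root / "pip"),
        "VLLM_CACHE_ROOT": str(root / "vllm"),
        "VLLM_ASSETS_CACHE": str(root / "vllm" / "assets"),
    }


# Print in the same KEY=value form used in the Environment Variables section
for name, value in cache_env().items():
    print(f"{name}={value}")
```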

#### Step 6: Advanced Settings

In the **Advanced Settings** section, configure the command override. The proxy port is configured separately in Step 7.

**Command Override Configuration:**

In the **Command Override** field, enter the following command. You can add additional vLLM engine arguments (like `--max-model-len 8192` or `--enable-prefix-caching`) after `--port 8000` and before the `&` symbol to customize performance and behavior.

For a complete list of available vLLM engine arguments, see the [vLLM Engine Arguments documentation](https://docs.vllm.ai/en/latest/configuration/engine_args.html).

```json
["/bin/bash", "-c", "vllm serve mistralai/Mistral-7B-v0.1 --download-dir /mnt/direct-attached/vllm-cache --host 127.0.0.1 --port 8000 & sleep infinity"]
```

> **ℹ️ About Command Override**: This field allows you to specify exactly how the container should start, overriding the default Docker image command.
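
When you switch models or add engine arguments, it can be easier to generate the override than to edit the string by hand. The sketch below rebuilds the same array shown above; `command_override` is a hypothetical helper, and `--max-model-len 8192` is only an example argument.

```python
import json
import shlex
from typing import Optional, Sequence


def command_override(model: str, extra_args: Optional[Sequence[str]] = None) -> str:
    """Return the JSON array to paste into the Command Override field."""
    serve = [
        "vllm", "serve", model,
        "--download-dir", "/mnt/direct-attached/vllm-cache",
        "--host", "127.0.0.1",
        "--port", "8000",
        *(extra_args or []),
    ]
    # shlex.join quotes each argument for the shell that bash -c starts
    shell_line = f"{shlex.join(serve)} & sleep infinity"
    return json.dumps(["/bin/bash", "-c", shell_line])


print(command_override("mistralai/Mistral-7B-v0.1", ["--max-model-len", "8192"]))
```

Extra arguments land after `--port 8000` and before the `&`, matching the placement described above.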

#### Step 7: Proxy Settings

In the **Proxy Settings** section, configure the reverse proxy for your deployment:

**Port Configuration:**

* **Proxy Port**: Set to `8000` (required to enable the reverse proxy and route external traffic to your vLLM service)

**API Authentication (Optional but Recommended):**

* **API Keys**: Add your API keys in the **API Keys** field to enable secure access through the reverse proxy.

> **ℹ️ Proxy Port**: This setting is required to enable the reverse proxy layer. When set, all external traffic routes through the reverse proxy to your vLLM service on this port.

> **💡 API Key Best Practices**: Use descriptive prefixes like `sk-prod-`, `sk-staging-`, or `sk-dev-` to identify different keys. Keep these keys secure and never commit them to version control.

> **ℹ️ Without API Keys**: If you don't configure any API keys but set the Proxy Port, requests will pass through the reverse proxy without authentication requirements.
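
One way to create keys that follow the prefix advice is Python's standard `secrets` module. This is a minimal sketch: the `make_api_key` helper and the 32-byte length are illustrative choices, and this tutorial does not specify any key format that the proxy requires.

```python
import secrets


def make_api_key(environment: str, nbytes: int = 32) -> str:
    """Return a random key of the form 'sk-<environment>-<random text>'."""
    # token_urlsafe draws from the OS random source and is URL- and header-safe
    return f"sk-{environment}-{secrets.token_urlsafe(nbytes)}"


for environment in ("prod", "staging", "dev"):
    print(make_api_key(environment))
```

Store the generated values in a secrets manager rather than in source files.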

#### Step 8: Deploy the Application

1. Review all configuration settings
2. Click **Deploy Application**
3. Monitor the deployment status in the **Applications Overview** or **Application Details**
4. To view runtime logs, click on your deployment name and navigate to the **Runtime Logs** tab to monitor the model download and vLLM initialization progress

The vLLM initialization process will:

* Pull the vLLM Docker image
* Start the container with your specified configuration
* Download the Mistral-7B model weights from Hugging Face
* Initialize the vLLM server with the OpenAI-compatible API

> **ℹ️ Deployment Time**: Initial deployment may take 5-10 minutes as the system downloads the \~14GB model weights. The application will show as **ONLINE**, but the vLLM OpenAI-compatible endpoints will not be accessible until initialization is complete. You can monitor progress by viewing the runtime logs in the application details page.
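
Because the application can report **ONLINE** before vLLM is ready, a script that depends on the endpoint may want to poll `/health` first. The sketch below uses only the standard library; `health_probe` and `wait_until_ready` are hypothetical helpers, the timeout and interval are arbitrary defaults, and any error response is assumed to mean "still starting".

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def health_probe(base_url: str, api_key: Optional[str] = None) -> Callable[[], bool]:
    """Build a probe that returns True when GET /health answers 200."""
    def probe() -> bool:
        request = urllib.request.Request(f"{base_url}/health")
        if api_key:
            request.add_header("Authorization", f"Bearer {api_key}")
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                return response.status == 200
        except (urllib.error.URLError, OSError):
            # Unreachable or error status: treat as not ready yet
            return False
    return probe


def wait_until_ready(probe: Callable[[], bool], timeout_s: float = 900,
                     interval_s: float = 15) -> bool:
    """Call probe() repeatedly until it succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() + interval_s > deadline:
            return False
        time.sleep(interval_s)


# Offline demonstration with a fake probe that succeeds on the third call
attempts = iter([False, False, True])
print(wait_until_ready(lambda: next(attempts), timeout_s=5, interval_s=0.01))
```

Against a real deployment you would call `wait_until_ready(health_probe("https://your-dns-address", "your-api-key"))`.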

### Alternative: Deploy Using API

You can also use the Denvr Dataworks API directly to create Custom Applications.

#### Using the Swagger API Interface

1. **Access Swagger UI**: Navigate to <https://api.cloud.denvrdata.com/>
2. **Login**: Log in with your user information (tenancy name, username/email, and password)
3. **Find the Endpoint**: Locate `POST /api/v1/servers/applications/CreateCustomApplication`
4. **Try It Out**: Click "Try it out" to enable the request body editor
5. **Configure Request**: Customize and paste the JSON configuration below into the request body
6. **Execute**: Click "Execute" to create your custom application

#### JSON Configuration Template

```json
{
  "name": "vllm-mistral-7b-v1",
  "cluster": "Msc1",
  "hardwarePackageName": "g-nvidia-1xa100-20gb-pcie-6vcpu-45gb",
  "imageUrl": "docker.io/vllm/vllm-openai:v0.10.1.1",
  "imageCmdOverride": [
    "/bin/bash",
    "-c",
    "vllm serve mistralai/Mistral-7B-v0.1 --download-dir /mnt/direct-attached/vllm-cache --host 127.0.0.1 --port 8000 & sleep infinity"
  ],
  "environmentVariables": {
    "HF_TOKEN": "hf_xxxxxxxxxxxxxxxxxxxxx",
    "XDG_CACHE_HOME": "/mnt/direct-attached/.cache",
    "HF_HOME": "/mnt/direct-attached/.cache/huggingface",
    "HUGGINGFACE_HUB_CACHE": "/mnt/direct-attached/.cache/huggingface/hub",
    "CUDA_CACHE_PATH": "/mnt/direct-attached/.cache/nv/ComputeCache",
    "PIP_CACHE_DIR": "/mnt/direct-attached/.cache/pip",
    "VLLM_CACHE_ROOT": "/mnt/direct-attached/.cache/vllm",
    "VLLM_ASSETS_CACHE": "/mnt/direct-attached/.cache/vllm/assets"
  },
  "imageRepository": {
    "hostname": "https://index.docker.io/v1/"
  },
  "resourcePool": "your-reserved-resource-pool",
  "proxyPort": 8000,
  "proxyApiKeys": ["your-api-key1", "your-api-key2"],
  "persistDirectAttachedStorage": false,
  "personalSharedStorage": true,
  "tenantSharedStorage": true,
  "securityContext": {
    "runAsRoot": true,
    "containerUid": null,
    "containerGid": null
  }
}
```

**Key Parameters to Customize:**

* `name`: Your application name
* `cluster`: Your cluster name (e.g., "Msc1")
* `hardwarePackageName`: Match available hardware in your cluster
* `resourcePool`: Your reserved resource pool name
* `HF_TOKEN`: Your Hugging Face token
* `proxyApiKeys`: Array of API keys for reverse proxy authentication
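
Before pasting a customized body into Swagger, a quick local check can catch JSON syntax slips and leftover placeholders. This is a sketch under assumptions: `check_payload` is a hypothetical helper, and the set of expected keys reflects this tutorial's template rather than the API's actual schema.

```python
import json

# Keys this tutorial's template customizes. Whether the API treats each one as
# mandatory is an assumption; adjust the set to match the Swagger schema.
EXPECTED_KEYS = {"name", "cluster", "hardwarePackageName", "imageUrl", "resourcePool"}


def check_payload(text: str) -> list:
    """Return a list of problems found in a request body (empty if none)."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"]
    if not isinstance(payload, dict):
        return ["top-level value must be a JSON object"]

    problems = [f"missing key: {key}" for key in sorted(EXPECTED_KEYS - set(payload))]

    env = payload.get("environmentVariables")
    token = env.get("HF_TOKEN", "") if isinstance(env, dict) else ""
    if not token or "xxxx" in token:
        problems.append("HF_TOKEN is empty or still a placeholder")
    return problems


# A body with a trailing comma is rejected by strict JSON parsers
sample = '{"name": "vllm-mistral-7b-v1", "imageRepository": {"hostname": "x",}}'
print(check_payload(sample))
```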

### Accessing Your vLLM Deployment

#### Finding Your Application's DNS Address

1. Navigate to **Applications** → **Overview**
2. Find your vLLM deployment in the list - the DNS address is displayed in the overview table
3. Alternatively, click on the deployment name to view **Application Details** where you'll also find the DNS address

Once you have the DNS address, you can access your vLLM endpoint at that address (e.g., `https://abc-123-456-789-012.cloud.denvrdata.com`).

#### Authentication

If you configured API keys, your vLLM endpoint will require authentication through the reverse proxy. Include one of your configured API keys in all requests as a bearer token:

```bash
-H "Authorization: Bearer your-api-key"
```

Replace `your-api-key` with one of the API keys you added in the **API Keys** field of the proxy settings.

> **ℹ️ Without API Keys**: If you didn't configure any API keys but set the Proxy Port, requests will pass through the reverse proxy without authentication requirements.

### Testing Your Deployment

#### 1. Health Check

First, verify that your vLLM server is running by checking the health endpoint:

```bash
curl https://your-dns-address/health \
  -H "Authorization: Bearer your-api-key"
```

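A successful response has HTTP status `200 OK` and an empty body, indicating the server is healthy and ready to accept requests. Because curl prints only the response body by default, add `-i` to the command to see the status line.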

#### 2. List Available Models

Once the health check passes, verify your model is loaded and accessible:

```bash
curl https://your-dns-address/v1/models \
  -H "Authorization: Bearer your-api-key"
```

This should return a response showing the `mistralai/Mistral-7B-v0.1` model is available:

```json
{
  "object": "list",
  "data": [
    {
      "id": "mistralai/Mistral-7B-v0.1",
      "object": "model",
      "created": 1737380356,
      "owned_by": "vllm",
      "root": "mistralai/Mistral-7B-v0.1",
      "parent": null,
      "max_model_len": 32768,
      "permission": [
        {
          "id": "modelperm-e3d0f87e19a548b2be64ca274a4550a6",
          "object": "model_permission",
          "created": 1737380356,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": false,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    }
  ]
}
```
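
In a script, you would typically pull just the model IDs and context length out of this response. The sketch below works on a trimmed copy of the example above; `model_ids` and `context_length` are hypothetical helper names.

```python
from typing import Optional


def model_ids(models_response: dict) -> list:
    """Return the IDs listed in a /v1/models response."""
    return [entry["id"] for entry in models_response.get("data", [])]


def context_length(models_response: dict, model_id: str) -> Optional[int]:
    """Return max_model_len for model_id, or None if it is not listed."""
    for entry in models_response.get("data", []):
        if entry["id"] == model_id:
            return entry.get("max_model_len")
    return None


# Trimmed copy of the example response
sample = {
    "object": "list",
    "data": [
        {"id": "mistralai/Mistral-7B-v0.1", "object": "model", "max_model_len": 32768}
    ],
}

print(model_ids(sample))
print(context_length(sample, "mistralai/Mistral-7B-v0.1"))
```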

### Using Your vLLM Deployment

#### Using curl (Command Line)

The simplest way to test your deployment is with curl commands:

**Text Completion**

```bash
curl https://your-dns-address/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-api-key" \
  -d '{
    "model": "mistralai/Mistral-7B-v0.1",
    "prompt": "Explain machine learning in simple terms:",
    "max_tokens": 150,
    "temperature": 0.7
  }'
```

**Metrics**

vLLM exposes Prometheus metrics for monitoring performance and resource utilization:

```bash
curl https://your-dns-address/metrics \
  -H "Authorization: Bearer your-api-key"
```

This returns metrics such as running and waiting request counts, KV cache usage, and latency histograms that you can use for monitoring and optimization.
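
The response uses the Prometheus text format, which is simple enough to read without a client library. The sketch below is a simplified parser that assumes no trailing timestamps on sample lines. The sample text is made up for illustration, and metric names may differ between vLLM versions.

```python
def parse_metrics(text: str) -> dict:
    """Map each 'name{labels}' series to its value, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Assumes no trailing timestamp, so the value is the last field
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Illustrative sample only; real output contains many more series
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="mistralai/Mistral-7B-v0.1"} 2.0
vllm:num_requests_waiting{model_name="mistralai/Mistral-7B-v0.1"} 0.0
"""

for series, value in parse_metrics(SAMPLE).items():
    print(series, value)
```

For continuous monitoring, pointing a Prometheus server at the endpoint is usually a better fit than ad hoc parsing.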

> **ℹ️ Note**: If you didn't configure API keys, omit the `-H "Authorization: Bearer your-api-key"` header from all commands.

#### Simple Python Client

Here's a simple Python client to get you started with your vLLM deployment:

```python
import requests

def test_vllm_deployment(endpoint_url, api_key=None):
    """Simple function to test your vLLM deployment."""

    # Send the Authorization header only when an API key is provided
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}

    # Test basic completion; the timeout keeps the call from hanging indefinitely
    response = requests.post(
        f"{endpoint_url}/v1/completions",
        headers=headers,
        json={
            "model": "mistralai/Mistral-7B-v0.1",
            "prompt": "Explain machine learning in simple terms:",
            "max_tokens": 100,
            "temperature": 0.7
        },
        timeout=60,
    )
    
    if response.status_code == 200:
        result = response.json()
        print("Generated text:", result['choices'][0]['text'])
        return True
    else:
        print(f"Error: {response.status_code} - {response.text}")
        return False

# Example usage
if __name__ == "__main__":
    endpoint_url = "https://your-dns-address"  # Replace with your actual DNS address
    api_key = "your-api-key"                   # Replace with one of your configured API keys
    
    test_vllm_deployment(endpoint_url, api_key)
```

> **ℹ️ Note**: If you didn't configure API keys, you can omit the `Authorization` header or pass `None` as the `api_key`.

#### Using with OpenAI SDK

Your vLLM deployment is compatible with the OpenAI Python SDK:

```python
from openai import OpenAI

# Initialize client with your vLLM endpoint
client = OpenAI(
    base_url="https://your-dns-address/v1",
    api_key="your-api-key",  # Use one of your configured API keys
)

# Generate text completion
completion = client.completions.create(
    model="mistralai/Mistral-7B-v0.1",
    prompt="Explain the benefits of containerized ML inference:",
    max_tokens=150,
    temperature=0.7
)

print(completion.choices[0].text)
```

> **ℹ️ Note**: If you didn't configure API keys, you can set `api_key="not-needed"` or any string value.
>
> **💡 Base vs Instruction Models**: `mistralai/Mistral-7B-v0.1` is a base model, so we use `client.completions.create()` with a `prompt`. For instruction-tuned models (like `mistralai/Mistral-7B-Instruct-v0.1`), you would use `client.chat.completions.create()` with `messages`.

### Quick Reference

#### Essential curl Commands

Once your deployment is running, use these commands to interact with your vLLM endpoint. Replace `your-dns-address` with your deployment's DNS address and `your-api-key` with one of your configured API keys:

```bash
# Health check
curl https://your-dns-address/health \
  -H "Authorization: Bearer your-api-key"

# List available models
curl https://your-dns-address/v1/models \
  -H "Authorization: Bearer your-api-key"

# Text completion
curl https://your-dns-address/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-api-key" \
  -d '{
    "model": "mistralai/Mistral-7B-v0.1",
    "prompt": "Your prompt here",
    "max_tokens": 100,
    "temperature": 0.7
  }'
```

#### Common Request Parameters

Key parameters for the `/v1/completions` endpoint:

* **`prompt`**: The input text to generate completion for (required)
* **`max_tokens`**: Maximum tokens to generate (optional; vLLM's completions endpoint defaults to 16, so set it explicitly for longer outputs)
* **`temperature`**: Randomness control, 0.0-2.0 (default: 1.0)
* **`stream`**: Enable streaming responses (default: false)

See the [vLLM OpenAI-Compatible Server docs](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) for all parameters.
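
With `stream` set to `true`, the server sends the completion as server-sent events: each `data:` line carries a JSON chunk, and `data: [DONE]` marks the end. The sketch below parses such lines offline; `stream_text` is a hypothetical helper, and the sample chunks are trimmed illustrations rather than captured output.

```python
import json
from typing import Iterable, Iterator


def stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from the lines of a streamed /v1/completions response."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            yield choice.get("text", "")


# Illustrative chunks only; real chunks carry more fields
SAMPLE = [
    'data: {"choices": [{"index": 0, "text": "Machine"}]}',
    "",
    'data: {"choices": [{"index": 0, "text": " learning"}]}',
    "",
    "data: [DONE]",
]

print("".join(stream_text(SAMPLE)))
```

With the `requests` library, you would pass `stream=True` to `requests.post` and feed `response.iter_lines(decode_unicode=True)` into `stream_text`.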

#### Troubleshooting

| Issue                                            | Solution                                                                                                                                            |
| ------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| Connection refused                               | Check the runtime logs to see if vLLM is still initializing                                                                                         |
| Unauthorized error                               | Verify that the bearer token in your request exactly matches one of the API keys configured in the proxy settings                                   |
| Model not found                                  | Ensure model name exactly matches: `mistralai/Mistral-7B-v0.1`                                                                                      |
| RuntimeError: Engine core initialization failed. | This is often an out-of-memory error; check the runtime logs to confirm. Choose a GPU with more VRAM, use a smaller model or reduce batch size, lower `--max-model-len`, or enable quantization (e.g., 4-bit) |


