🚀 c3-vllm

A Docker container that seamlessly runs vLLM with automatic HuggingFace model downloads. Deploy any LLM with blazing-fast inference using a single command!

🌟 Features

  • High-Performance Inference: Built on vLLM for production-grade serving
  • Automatic Model Downloads: Downloads models from HuggingFace at runtime
  • OpenAI Compatible API: Drop-in replacement for OpenAI API endpoints
  • Advanced Features: Tool calling, FP8 quantization, tensor parallelism
  • Flexible Configuration: Environment variables for easy customization
  • Production Ready: Health checks, metrics, and Traefik integration

πŸ™ Credits

This project builds upon the excellent work of:

  • vLLM - High-performance LLM inference engine
  • The HuggingFace community for hosting and sharing models

📦 Quick Start

Pull the image

docker pull ghcr.io/comput3ai/c3-vllm:latest

Run with your model

docker run --gpus all \
  -e MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
  -e SERVED_MODEL_NAME=llama-3.1-8b \
  -e TENSOR_PARALLEL_SIZE=1 \
  -p 8080:8000 \
  ghcr.io/comput3ai/c3-vllm:latest

🔧 Configuration

Core Environment Variables

| Variable | Description | Default |
| --- | --- | --- |
| MODEL_NAME | HuggingFace model repository | Required |
| SERVED_MODEL_NAME | Model name in API responses | - |
| TENSOR_PARALLEL_SIZE | Number of GPUs for tensor parallelism | 1 |
| MAX_MODEL_LEN | Maximum context length (e.g., 128000 for Kimi K2) | Auto-detected |
| DTYPE | Model data type (auto, float16, bfloat16, float32) | auto |
| TRUST_REMOTE_CODE | Allow remote code execution (required for some models) | false |

Performance Tuning

| Variable | Description | Default |
| --- | --- | --- |
| GPU_MEMORY_UTILIZATION | Fraction of GPU memory to use (0.0-1.0) | 0.9 |
| MAX_NUM_SEQS | Maximum sequences to process concurrently | 256 |
| MAX_NUM_BATCHED_TOKENS | Maximum tokens per iteration | Auto |
| QUANTIZATION | Quantization method (fp8, awq, gptq, etc.) | - |
| KV_CACHE_DTYPE | KV cache data type (auto, fp8, fp8_e4m3, fp8_e5m2) | auto |
| ENABLE_CHUNKED_PREFILL | Enable chunked prefill for long prompts | false |
| ENFORCE_EAGER | Disable CUDA graphs (may help with large models) | false |
| DISABLE_SLIDING_WINDOW | Disable sliding window attention | false |
| VLLM_USE_V1 | Use vLLM V1 engine (set to 0 for V0 engine) | 1 |
| DISTRIBUTED_EXECUTOR_BACKEND | Backend for distributed execution (ray, mp) | - |
| PIPELINE_PARALLEL_SIZE | Number of pipeline parallel ranks | - |
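
For example, to push a single GPU a bit harder while compressing the KV cache (the values below are illustrative starting points, not tuned recommendations):

docker run --gpus all \
  -e MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
  -e SERVED_MODEL_NAME=llama-3.1-8b \
  -e GPU_MEMORY_UTILIZATION=0.95 \
  -e MAX_NUM_SEQS=128 \
  -e KV_CACHE_DTYPE=fp8 \
  -e ENABLE_CHUNKED_PREFILL=true \
  -p 8080:8000 \
  ghcr.io/comput3ai/c3-vllm:latest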

Tool Calling Support

| Variable | Description | Default |
| --- | --- | --- |
| ENABLE_AUTO_TOOL_CHOICE | Enable automatic tool choice | false |
| TOOL_CALL_PARSER | Tool call parser to use (e.g., kimi_k2, hermes) | - |
| CHAT_TEMPLATE | Custom chat template file path or URL (must be a raw URL) | - |

API Configuration

| Variable | Description | Default |
| --- | --- | --- |
| PORT | Internal server port | 8000 |
| API_KEY | API key for authentication | - |
| HF_TOKEN | HuggingFace token for gated models | - |
| DOWNLOAD_DIR | Directory to download/cache models | /models |
| DISABLE_LOG_STATS | Disable logging statistics | false |
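
For example, to serve a gated model behind an API key while caching downloads on the host (the host path below is only an example):

# The host path /opt/vllm-models is an example; mount any directory onto DOWNLOAD_DIR
docker run --gpus all \
  -e MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
  -e SERVED_MODEL_NAME=llama-3.1-8b \
  -e API_KEY=your-secret-api-key-here \
  -e HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
  -e DOWNLOAD_DIR=/models \
  -v /opt/vllm-models:/models \
  -p 8080:8000 \
  ghcr.io/comput3ai/c3-vllm:latest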

📚 Examples

Running Different Models

Small Model (3B-13B) on a Single GPU

docker run --gpus all \
  -e MODEL_NAME=meta-llama/Llama-3.2-3B-Instruct \
  -e SERVED_MODEL_NAME=llama-3.2-3b \
  -p 8080:8000 \
  ghcr.io/comput3ai/c3-vllm:latest

Large Model with Tensor Parallelism

docker run --gpus all \
  -e MODEL_NAME=meta-llama/Llama-3.1-70B-Instruct \
  -e SERVED_MODEL_NAME=llama-3.1-70b \
  -e TENSOR_PARALLEL_SIZE=4 \
  -e MAX_MODEL_LEN=8192 \
  -p 8080:8000 \
  ghcr.io/comput3ai/c3-vllm:latest

Kimi-K2-0905 with FP8 Quantization (256K Context)

docker run --gpus all \
  -e MODEL_NAME=moonshotai/Kimi-K2-Instruct-0905 \
  -e SERVED_MODEL_NAME=kimi-k2 \
  -e TENSOR_PARALLEL_SIZE=8 \
  -e QUANTIZATION=fp8 \
  -e MAX_MODEL_LEN=262144 \
  -e ENABLE_AUTO_TOOL_CHOICE=true \
  -e TOOL_CALL_PARSER=kimi_k2 \
  -e TRUST_REMOTE_CODE=true \
  -p 8080:8000 \
  ghcr.io/comput3ai/c3-vllm:latest

Kimi-K2-0905 with Custom Upstream Chat Template

To use the latest upstream chat template from HuggingFace:

docker run --gpus all \
  -e MODEL_NAME=moonshotai/Kimi-K2-Instruct-0905 \
  -e SERVED_MODEL_NAME=kimi-k2 \
  -e TENSOR_PARALLEL_SIZE=8 \
  -e QUANTIZATION=fp8 \
  -e MAX_MODEL_LEN=262144 \
  -e ENABLE_AUTO_TOOL_CHOICE=true \
  -e TOOL_CALL_PARSER=kimi_k2 \
  -e CHAT_TEMPLATE=https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905/raw/main/chat_template.jinja \
  -e TRUST_REMOTE_CODE=true \
  -p 8080:8000 \
  ghcr.io/comput3ai/c3-vllm:latest

Important: When using URLs for CHAT_TEMPLATE, ensure you use the raw file URL:

  • ✅ Correct: https://huggingface.co/repo/model/raw/main/template.jinja
  • ❌ Wrong: https://huggingface.co/repo/model/blob/main/template.jinja

API Authentication

To secure your vLLM API with authentication:

  1. Set the API_KEY environment variable in your .env file:

API_KEY=your-secret-api-key-here

  2. All API requests must include the API key in the Authorization header:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-api-key-here" \
  -d '{...}'

Note: Without the API_KEY environment variable set, the API will accept all requests without authentication.

Using the API

With curl

# Chat completion (no auth)
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b",
    "messages": [
      {"role": "user", "content": "Hello! How are you?"}
    ],
    "temperature": 0.7
  }'

# Chat completion (with auth)
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-api-key-here" \
  -d '{
    "model": "llama-3.1-8b",
    "messages": [
      {"role": "user", "content": "Hello! How are you?"}
    ]
  }'

# Streaming response
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-api-key-here" \
  -d '{
    "model": "llama-3.1-8b",
    "messages": [
      {"role": "user", "content": "Write a haiku about coding"}
    ],
    "stream": true
  }'

# Check model info
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer your-secret-api-key-here"

# Health check (usually doesn't require auth)
curl http://localhost:8000/health
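
Because the model is downloaded and loaded at startup, the server can take several minutes before it responds. A simple way to wait for readiness (assuming /health returns HTTP 200 once the engine is up) is to poll it:

# Poll /health until the server reports ready (assumption: HTTP 200 means ready)
until curl -sf http://localhost:8000/health > /dev/null; do
  echo "waiting for vLLM to become ready..."
  sleep 5
done
echo "vLLM is ready"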

With Python

from openai import OpenAI

# Without authentication
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # Required by client but ignored by server
)

# With authentication
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-secret-api-key-here"  # Must match API_KEY env var
)

response = client.chat.completions.create(
    model="llama-3.1-8b",  # Uses the SERVED_MODEL_NAME
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
print(response.choices[0].message.content)
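
Streaming works from Python as well; a short sketch using the same client:

# Stream tokens as they are generated
stream = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Write a haiku about coding"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()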

🐳 Deployment Options

Option 1: Docker Run (Simple)

docker run -t --gpus all \
  -e MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
  -e SERVED_MODEL_NAME=llama-3.1-8b \
  -p 8080:8000 \
  ghcr.io/comput3ai/c3-vllm:latest

Option 2: Docker Compose (Recommended)

  1. Copy the environment template:

cp env.sample .env

  2. Edit .env with your model configuration:

MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
SERVED_MODEL_NAME=llama-3.1-8b
TENSOR_PARALLEL_SIZE=1
MAX_MODEL_LEN=16384

# Optional but recommended: Enable API authentication
# API_KEY=your-secret-api-key-here

# Optional: For private/gated models
# HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx

  3. Start the service:

docker-compose up -d
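
For reference, a minimal docker-compose.yaml for this kind of setup might look like the sketch below (illustrative only, not the repository's actual file; the service name, volume mapping, and GPU reservation are assumptions):

# Illustrative compose sketch -- see the repository's docker-compose.yaml for the real configuration
services:
  vllm:
    image: ghcr.io/comput3ai/c3-vllm:latest
    env_file: .env
    ports:
      - "8080:8000"
    volumes:
      - ./models:/models   # cache model downloads (DOWNLOAD_DIR defaults to /models)
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]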

Option 3: Production Deployment

For production deployments with SSL/TLS and optimized storage:

docker-compose -f docker-compose.prod.yaml up -d

This uses:

  • NVMe storage for models (/opt/dlami/nvme/vllm-models)
  • Traefik for SSL termination
  • Health checks and monitoring

🔨 Building Locally

Quick Build (Standard Images)

For standard vLLM releases:

git clone https://github.com/comput3ai/c3-vllm
cd c3-vllm
docker build -t c3-vllm:latest .

Building All Image Variants

We provide three image variants with different vLLM versions:

./build-all.sh

This builds:

  • c3-vllm:latest - Based on upstream vllm/vllm-openai:latest (V1 engine)
  • c3-vllm:v0 - Based on upstream vllm/vllm-openai:v0.10.2 (V0 engine, for compatibility)
  • c3-vllm:minimax - Custom build from source with MiniMax M2 support (requires triton-kernels)

Image Architecture

Our build system consists of:

Dockerfiles

  • Dockerfile - Parameterized base (takes BASE_IMAGE arg), adds C3 customizations:
    • Custom entrypoint wrapper for model download
    • Additional dependencies (tiktoken, python-dotenv, etc.)
    • Runtime chat template download support
  • Dockerfile.minimax - Patches vLLM with triton-kernels for MiniMax M2 model support
  • Dockerfile.v0 - Legacy v0 build (kept for reference)

Build Scripts

  • build-all.sh - Comprehensive build of all variants:
    1. Builds vllm:git from upstream source (submodule)
    2. Patches it to create vllm:minimax with triton-kernels
    3. Builds all c3-vllm variants (latest, v0, minimax)
  • build-minimax.sh - Targeted build of just the minimax variant (faster if the other images are already built)

vLLM Submodule

The repository includes the upstream vLLM as a git submodule at vllm/. This allows us to:

  • Build bleeding-edge vLLM from source
  • Apply custom patches (e.g., triton-kernels for MiniMax M2)
  • Track specific upstream commits

To update the submodule:

git submodule update --remote vllm
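
If you are cloning fresh, pull the submodule along with the repository:

git clone --recurse-submodules https://github.com/comput3ai/c3-vllm
# or, in an existing checkout:
git submodule update --init vllm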

Building Individual Variants

You can build specific variants by passing BASE_IMAGE:

# Build with custom base
docker build -t c3-vllm:custom \
  --build-arg BASE_IMAGE=vllm/vllm-openai:v0.11.0 \
  -f Dockerfile \
  .

# Build minimax variant only
./build-minimax.sh

πŸ› οΈ How It Works

  1. The container starts and checks for the model in the download directory
  2. If not present, it downloads the model from HuggingFace
  3. vLLM server starts with optimized settings for your hardware
  4. OpenAI-compatible API endpoints become available
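
A simplified sketch of that startup flow is shown below (illustrative only, not the container's actual entrypoint; it assumes the huggingface-cli and vllm CLIs are available in the image and omits most options and error handling):

#!/bin/bash
# Illustrative startup sketch -- the real entrypoint handles many more options
set -e

MODEL_DIR="${DOWNLOAD_DIR:-/models}/${MODEL_NAME}"

# Steps 1-2: download the model from HuggingFace if it is not cached yet
if [ ! -d "$MODEL_DIR" ]; then
  huggingface-cli download "$MODEL_NAME" --local-dir "$MODEL_DIR"
fi

# Steps 3-4: start the OpenAI-compatible vLLM server
exec vllm serve "$MODEL_DIR" \
  --served-model-name "${SERVED_MODEL_NAME:-$MODEL_NAME}" \
  --tensor-parallel-size "${TENSOR_PARALLEL_SIZE:-1}" \
  --port "${PORT:-8000}"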

📊 Performance Tips

General Recommendations

  • Tensor Parallelism: Use TENSOR_PARALLEL_SIZE equal to your GPU count for large models
  • Quantization: Use QUANTIZATION=fp8 for models with FP8 weights (e.g., Kimi-K2)
  • KV Cache: Enable KV_CACHE_DTYPE=fp8 for 2x memory efficiency with minimal quality loss
  • Context Length: Set MAX_MODEL_LEN based on your GPU memory
  • Batch Size: Tune MAX_NUM_BATCHED_TOKENS for your workload

Large MoE Models (e.g., Kimi K2)

For 1T+ parameter MoE models with long-context support (128k-256k tokens):

# Recommended settings for 8xB200 or 8xH100
TENSOR_PARALLEL_SIZE=8
MAX_MODEL_LEN=262144         # Full 256k context for Kimi-K2-0905
MAX_NUM_SEQS=256             # High concurrency
MAX_NUM_BATCHED_TOKENS=32768 # Large batch for throughput
GPU_MEMORY_UTILIZATION=0.95  # Can push higher on datacenter GPUs
QUANTIZATION=fp8             # For FP8 models like Kimi-K2
ENABLE_CHUNKED_PREFILL=true  # Helps with long prompts
TRUST_REMOTE_CODE=true       # Required for custom architectures
ENABLE_AUTO_TOOL_CHOICE=true # For models with tool calling
TOOL_CALL_PARSER=kimi_k2     # Model-specific parser

Note: If you encounter "TritonMLA V1 with FP8 KV cache not yet supported" errors on Blackwell (B200) GPUs, set VLLM_USE_V1=0 to fall back to the V0 engine, which has better compatibility.

Memory Estimation

For FP8 models with an FP8 KV cache:

  • Model weights: ~1.2TB for a 1T-parameter model
  • KV cache: ~0.5MB per token per sequence
  • Total VRAM needed: model weights + (context_length × batch_size × 0.5MB)
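
As a rough worked example of that formula (back-of-the-envelope only; the context length and concurrency below are illustrative):

# Back-of-the-envelope VRAM estimate using the figures above.
# Worst case: every concurrent sequence uses the full context window.
WEIGHTS_GB = 1200            # ~1.2TB of weights for a ~1T-parameter FP8 model
KV_MB_PER_TOKEN = 0.5        # ~0.5MB of KV cache per token per sequence

context_length = 131_072     # example MAX_MODEL_LEN
concurrent_seqs = 8          # example number of resident sequences

kv_gb = context_length * concurrent_seqs * KV_MB_PER_TOKEN / 1024
print(f"KV cache (worst case): ~{kv_gb:,.0f} GB")
print(f"Total VRAM needed:     ~{WEIGHTS_GB + kv_gb:,.0f} GB")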

📄 License

BSD 3-Clause License

Copyright (c) 2025, Comput3.ai

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
