A Docker container that seamlessly runs vLLM with automatic HuggingFace model downloads. Deploy any LLM with blazing-fast inference using a single command!
- High-Performance Inference: Built on vLLM for production-grade serving
- Automatic Model Downloads: Downloads models from HuggingFace at runtime
- OpenAI Compatible API: Drop-in replacement for OpenAI API endpoints
- Advanced Features: Tool calling, FP8 quantization, tensor parallelism
- Flexible Configuration: Environment variables for easy customization
- Production Ready: Health checks, metrics, and Traefik integration
This project builds upon the excellent work of:
- vLLM - High-performance LLM inference engine
- The HuggingFace community for hosting and sharing models
```
docker pull ghcr.io/comput3ai/c3-vllm:latest
```

```
docker run --gpus all \
-e MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
-e SERVED_MODEL_NAME=llama-3.1-8b \
-e TENSOR_PARALLEL_SIZE=1 \
-p 8080:8000 \
ghcr.io/comput3ai/c3-vllm:latest
```

| Variable | Description | Default |
|---|---|---|
| `MODEL_NAME` | HuggingFace model repository | Required |
| `SERVED_MODEL_NAME` | Model name in API responses | - |
| `TENSOR_PARALLEL_SIZE` | Number of GPUs for tensor parallelism | 1 |
| `MAX_MODEL_LEN` | Maximum context length (e.g., 128000 for Kimi K2) | Auto-detected |
| `DTYPE` | Model data type (auto, float16, bfloat16, float32) | auto |
| `TRUST_REMOTE_CODE` | Allow remote code execution (required for some models) | false |
| Variable | Description | Default |
|---|---|---|
| `GPU_MEMORY_UTILIZATION` | Fraction of GPU memory to use (0.0-1.0) | 0.9 |
| `MAX_NUM_SEQS` | Maximum sequences to process concurrently | 256 |
| `MAX_NUM_BATCHED_TOKENS` | Maximum tokens per iteration | Auto |
| `QUANTIZATION` | Quantization method (fp8, awq, gptq, etc.) | - |
| `KV_CACHE_DTYPE` | KV cache data type (auto, fp8, fp8_e4m3, fp8_e5m2) | auto |
| `ENABLE_CHUNKED_PREFILL` | Enable chunked prefill for long prompts | false |
| `ENFORCE_EAGER` | Disable CUDA graphs (may help with large models) | false |
| `DISABLE_SLIDING_WINDOW` | Disable sliding window attention | false |
| `VLLM_USE_V1` | Use vLLM V1 engine (set to 0 for V0 engine) | 1 |
| `DISTRIBUTED_EXECUTOR_BACKEND` | Backend for distributed execution (ray, mp) | - |
| `PIPELINE_PARALLEL_SIZE` | Number of pipeline parallel ranks | - |
| Variable | Description | Default |
|---|---|---|
| `ENABLE_AUTO_TOOL_CHOICE` | Enable automatic tool choice | false |
| `TOOL_CALL_PARSER` | Tool call parser to use (e.g., kimi_k2, hermes) | - |
| `CHAT_TEMPLATE` | Custom chat template file path or URL (must be raw URL) | - |
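As a usage sketch (not part of this repository), the OpenAI Python client can exercise tool calling against a deployment started with `ENABLE_AUTO_TOOL_CHOICE=true` and a `TOOL_CALL_PARSER` matching your model; the `get_weather` tool below is hypothetical, and the base URL assumes the port mapping used in the docker run examples in this README.

```python
# Hedged sketch: tool calling through the OpenAI-compatible API.
# Assumes the server was launched with ENABLE_AUTO_TOOL_CHOICE=true and a
# suitable TOOL_CALL_PARSER; "get_weather" is a hypothetical tool definition.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # host port from the docker run examples
    api_key="your-secret-api-key-here",   # only required if API_KEY is set
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="kimi-k2",  # must match SERVED_MODEL_NAME
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",
)

# When the model opts to call a tool, the structured call appears here
# instead of plain message content.
print(response.choices[0].message.tool_calls)
```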
| Variable | Description | Default |
|---|---|---|
| `PORT` | Internal server port | 8000 |
| `API_KEY` | API key for authentication | - |
| `HF_TOKEN` | HuggingFace token for gated models | - |
| `DOWNLOAD_DIR` | Directory to download/cache models | /models |
| `DISABLE_LOG_STATS` | Disable logging statistics | false |
```
docker run --gpus all \
-e MODEL_NAME=meta-llama/Llama-3.2-3B-Instruct \
-e SERVED_MODEL_NAME=llama-3.2-3b \
-p 8080:8000 \
ghcr.io/comput3ai/c3-vllm:latest
```

```
docker run --gpus all \
-e MODEL_NAME=meta-llama/Llama-3.1-70B-Instruct \
-e SERVED_MODEL_NAME=llama-3.1-70b \
-e TENSOR_PARALLEL_SIZE=4 \
-e MAX_MODEL_LEN=8192 \
-p 8080:8000 \
ghcr.io/comput3ai/c3-vllm:latest
```

```
docker run --gpus all \
-e MODEL_NAME=moonshotai/Kimi-K2-Instruct-0905 \
-e SERVED_MODEL_NAME=kimi-k2 \
-e TENSOR_PARALLEL_SIZE=8 \
-e QUANTIZATION=fp8 \
-e MAX_MODEL_LEN=262144 \
-e ENABLE_AUTO_TOOL_CHOICE=true \
-e TOOL_CALL_PARSER=kimi_k2 \
-e TRUST_REMOTE_CODE=true \
-p 8080:8000 \
ghcr.io/comput3ai/c3-vllm:latest
```

To use the latest upstream chat template from HuggingFace:

```
docker run --gpus all \
-e MODEL_NAME=moonshotai/Kimi-K2-Instruct-0905 \
-e SERVED_MODEL_NAME=kimi-k2 \
-e TENSOR_PARALLEL_SIZE=8 \
-e QUANTIZATION=fp8 \
-e MAX_MODEL_LEN=262144 \
-e ENABLE_AUTO_TOOL_CHOICE=true \
-e TOOL_CALL_PARSER=kimi_k2 \
-e CHAT_TEMPLATE=https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905/raw/main/chat_template.jinja \
-e TRUST_REMOTE_CODE=true \
-p 8080:8000 \
ghcr.io/comput3ai/c3-vllm:latest
```

Important: When using URLs for `CHAT_TEMPLATE`, ensure you use the raw file URL:
- β Correct: `https://huggingface.co/repo/model/raw/main/template.jinja`
- β Wrong: `https://huggingface.co/repo/model/blob/main/template.jinja`
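As an illustrative aside (not a helper shipped with this image), converting a HuggingFace "blob" page URL into the raw URL that `CHAT_TEMPLATE` expects is a simple string rewrite:

```python
# Hypothetical convenience function: rewrite a HuggingFace "blob" page URL
# into the raw file URL required by CHAT_TEMPLATE.
def to_raw_hf_url(url: str) -> str:
    """Turn .../blob/<revision>/<path> into .../raw/<revision>/<path>."""
    return url.replace("/blob/", "/raw/", 1)

print(to_raw_hf_url(
    "https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905/blob/main/chat_template.jinja"
))
# https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905/raw/main/chat_template.jinja
```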
To secure your vLLM API with authentication:
- Set the `API_KEY` environment variable in your `.env` file:

```
API_KEY=your-secret-api-key-here
```

- All API requests must include the API key in the Authorization header:

```
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-secret-api-key-here" \
-d '{...}'
```

Note: Without the `API_KEY` environment variable set, the API will accept all requests without authentication.

```
# Chat completion (no auth)
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.1-8b",
"messages": [
{"role": "user", "content": "Hello! How are you?"}
],
"temperature": 0.7
}'
# Chat completion (with auth)
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-secret-api-key-here" \
-d '{
"model": "llama-3.1-8b",
"messages": [
{"role": "user", "content": "Hello! How are you?"}
]
}'
# Streaming response
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-secret-api-key-here" \
-d '{
"model": "llama-3.1-8b",
"messages": [
{"role": "user", "content": "Write a haiku about coding"}
],
"stream": true
}'
# Check model info
curl http://localhost:8000/v1/models \
-H "Authorization: Bearer your-secret-api-key-here"
# Health check (usually doesn't require auth)
curl http://localhost:8000/health
```

```python
from openai import OpenAI
# Without authentication
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="not-needed" # Required by client but ignored by server
)
# With authentication
client = OpenAI(
base_url="http://localhost:8000/v1",
api_key="your-secret-api-key-here" # Must match API_KEY env var
)
response = client.chat.completions.create(
model="llama-3.1-8b", # Uses the SERVED_MODEL_NAME
messages=[
{"role": "user", "content": "Hello!"}
]
)
print(response.choices[0].message.content)
```

```
docker run -t --gpus all \
-e MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
-e SERVED_MODEL_NAME=llama-3.1-8b \
-p 8080:8000 \
ghcr.io/comput3ai/c3-vllm:latest
```

- Copy the environment template:

```
cp env.sample .env
```

- Edit `.env` with your model configuration:

```
MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
SERVED_MODEL_NAME=llama-3.1-8b
TENSOR_PARALLEL_SIZE=1
MAX_MODEL_LEN=16384
# Optional but recommended: Enable API authentication
# API_KEY=your-secret-api-key-here
# Optional: For private/gated models
# HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```

- Start the service:

```
docker-compose up -d
```

For production deployments with SSL/TLS and optimized storage:

```
docker-compose -f docker-compose.prod.yaml up -d
```

This uses:
- NVMe storage for models (`/opt/dlami/nvme/vllm-models`)
- Traefik for SSL termination
- Health checks and monitoring
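For monitoring, the vLLM OpenAI-compatible server also exposes Prometheus-format metrics at `/metrics`. The short sketch below is an illustration only and assumes the container's port 8000 is published on `localhost:8080` as in the earlier examples:

```python
# Fetch the Prometheus metrics exposed by the vLLM OpenAI server.
# Exact metric names vary by vLLM version; this prints queue/running gauges.
import urllib.request

with urllib.request.urlopen("http://localhost:8080/metrics", timeout=5) as resp:
    metrics_text = resp.read().decode()

for line in metrics_text.splitlines():
    if line.startswith("vllm:") and ("running" in line or "waiting" in line):
        print(line)
```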
For standard vLLM releases:
```
git clone https://github.com/comput3ai/c3-vllm
cd c3-vllm
docker build -t c3-vllm:latest .
```

We provide three image variants with different vLLM versions:

```
./build-all.sh
```

This builds:
- `c3-vllm:latest` - Based on upstream `vllm/vllm-openai:latest` (V1 engine)
- `c3-vllm:v0` - Based on upstream `vllm/vllm-openai:v0.10.2` (V0 engine, for compatibility)
- `c3-vllm:minimax` - Custom build from source with MiniMax M2 support (requires triton-kernels)
Our build system consists of:
- `Dockerfile` - Parameterized base (takes `BASE_IMAGE` arg), adds C3 customizations:
  - Custom entrypoint wrapper for model download
  - Additional dependencies (tiktoken, python-dotenv, etc.)
  - Runtime chat template download support
- `Dockerfile.minimax` - Patches vLLM with triton-kernels for MiniMax M2 model support
- `Dockerfile.v0` - Legacy v0 build (kept for reference)
- `build-all.sh` - Comprehensive build of all variants:
  - Builds `vllm:git` from upstream source (submodule)
  - Patches to create `vllm:minimax` with triton-kernels
  - Builds all c3-vllm variants (latest, v0, minimax)
- `build-minimax.sh` - Targeted build for just the minimax variant (faster if you already have other variants)
The repository includes the upstream vLLM as a git submodule at vllm/. This allows us to:
- Build bleeding-edge vLLM from source
- Apply custom patches (e.g., triton-kernels for MiniMax M2)
- Track specific upstream commits
To update the submodule:
```
git submodule update --remote vllm
```

You can build specific variants by passing `BASE_IMAGE`:

```
# Build with custom base
docker build -t c3-vllm:custom \
--build-arg BASE_IMAGE=vllm/vllm-openai:v0.11.0 \
-f Dockerfile \
.
# Build minimax variant only
./build-minimax.sh
```

- The container starts and checks for the model in the download directory
- If not present, it downloads the model from HuggingFace
- vLLM server starts with optimized settings for your hardware
- OpenAI-compatible API endpoints become available
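Because the model download happens inside the container at startup, the API can take a while to come up on the first run. A small client-side sketch (illustrative only, assuming the container's port 8000 is published on `localhost:8080`) is to poll `/health` until it returns 200 and then list the served models:

```python
# Poll /health until the model has downloaded and the vLLM engine is serving,
# then list the available models. Uses only the standard library.
import json
import time
import urllib.request

BASE_URL = "http://localhost:8080"

def wait_until_ready(timeout_s: float = 3600.0, poll_s: float = 10.0) -> None:
    """Block until GET /health returns HTTP 200."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{BASE_URL}/health", timeout=5) as resp:
                if resp.status == 200:
                    return
        except OSError:
            pass  # server not reachable yet (model likely still downloading)
        time.sleep(poll_s)
    raise TimeoutError("vLLM server did not become healthy in time")

wait_until_ready()
# Add an "Authorization: Bearer <API_KEY>" header here if API_KEY is set.
with urllib.request.urlopen(f"{BASE_URL}/v1/models") as resp:
    print(json.dumps(json.load(resp), indent=2))
```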
- Tensor Parallelism: Use `TENSOR_PARALLEL_SIZE` equal to your GPU count for large models
- Quantization: Use `QUANTIZATION=fp8` for models with FP8 weights (e.g., Kimi-K2)
- KV Cache: Enable `KV_CACHE_DTYPE=fp8` for 2x memory efficiency with minimal quality loss
- Context Length: Set `MAX_MODEL_LEN` based on your GPU memory
- Batch Size: Tune `MAX_NUM_BATCHED_TOKENS` for your workload
For models with 1T+ parameters and 128k+ context support:

```
# Recommended settings for 8xB200 or 8xH100
TENSOR_PARALLEL_SIZE=8
MAX_MODEL_LEN=262144 # Full 256k context for Kimi-K2-0905
MAX_NUM_SEQS=256 # High concurrency
MAX_NUM_BATCHED_TOKENS=32768 # Large batch for throughput
GPU_MEMORY_UTILIZATION=0.95 # Can push higher on datacenter GPUs
QUANTIZATION=fp8 # For FP8 models like Kimi-K2
ENABLE_CHUNKED_PREFILL=true # Helps with long prompts
TRUST_REMOTE_CODE=true # Required for custom architectures
ENABLE_AUTO_TOOL_CHOICE=true # For models with tool calling
TOOL_CALL_PARSER=kimi_k2 # Model-specific parser
```

Note: If you encounter "TritonMLA V1 with FP8 KV cache not yet supported" errors on Blackwell (B200) GPUs, set `VLLM_USE_V1=0` to use the V0 engine which has better compatibility.
For FP8 models with FP8 KV cache:
- Model weights: ~1.2TB for 1T parameter model
- KV cache: ~0.5MB per token per sequence
- Total VRAM needed: Model weights + (context_length × batch_size × 0.5MB); see the worked example below
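As a rough, back-of-the-envelope illustration of that formula (using the approximate numbers quoted above; actual requirements depend on the model and settings):

```python
# Worked example of the VRAM estimate above:
# total = model weights + (context_length x batch_size x 0.5MB per token).
def estimate_total_vram_gb(weights_gb: float, context_length: int,
                           batch_size: int, kv_mb_per_token: float = 0.5) -> float:
    kv_cache_gb = context_length * batch_size * kv_mb_per_token / 1024
    return weights_gb + kv_cache_gb

# ~1.2TB of FP8 weights for a 1T-parameter model, 262144-token context,
# 4 concurrent sequences
total_gb = estimate_total_vram_gb(weights_gb=1.2 * 1024,
                                  context_length=262144, batch_size=4)
print(f"Estimated total VRAM: ~{total_gb:.0f} GB")  # ~1741 GB in this example
```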
BSD 3-Clause License
Copyright (c) 2025, Comput3.ai
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
- Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.