Native, Apple Silicon–only local LLM server. Similar to Ollama, but built on Apple's MLX for maximum performance on M‑series chips. SwiftUI app + SwiftNIO server with OpenAI‑compatible endpoints.
Created by Dinoki Labs (dinoki.ai), makers of a fully native desktop AI assistant and companion.
- Native MLX runtime: Optimized for Apple Silicon using MLX/MLXLLM
- Apple Silicon only: Designed and tested for M‑series Macs
- OpenAI API compatible: `/v1/models` and `/v1/chat/completions` (stream and non‑stream)
- Function/Tool calling: OpenAI‑style `tools` + `tool_choice`, with `tool_calls` parsing and streaming deltas
- Chat templates: Uses model‑provided Jinja `chat_template` with BOS/EOS, with smart fallback
- Session reuse (KV cache): Faster multi‑turn chats via `session_id`
- Fast token streaming: Server‑Sent Events for low‑latency output
- Model manager UI: Browse, download, and manage MLX models from `mlx-community`
- System resource monitor: Real-time CPU and RAM usage visualization
- Self‑contained: SwiftUI app with an embedded SwiftNIO HTTP server
- macOS 15.5+
- Apple Silicon (M1 or newer)
- Xcode 16.4+ (to build from source)
```
osaurus/
├── Core/
│ ├── AppDelegate.swift
│ └── osaurusApp.swift
├── Controllers/
│ ├── ServerController.swift # NIO server lifecycle
│ └── ModelManager.swift # Model discovery & downloads (Hugging Face)
├── Models/
│ ├── MLXModel.swift
│ ├── OpenAIAPI.swift # OpenAI‑compatible DTOs
│ ├── ServerConfiguration.swift
│ └── ServerHealth.swift
├── Networking/
│ ├── HTTPHandler.swift # Request parsing & routing entry
│ ├── Router.swift # Routes → handlers
│ └── AsyncHTTPHandler.swift # SSE streaming for chat completions
├── Services/
│ ├── MLXService.swift # MLX loading, session caching, generation
│ ├── SearchService.swift
│ └── SystemMonitorService.swift # Real-time CPU and RAM monitoring
├── Theme/
│ └── Theme.swift
├── Views/
│ ├── Components/SimpleComponents.swift
│ ├── ContentView.swift # Start/stop server, quick controls
│ └── ModelDownloadView.swift # Browse/download/manage models
└── Assets.xcassets/
```
- Native MLX text generation with model session caching
- Model manager with curated suggestions (Llama, Qwen, Gemma, Mistral, etc.)
- Download sizes estimated via Hugging Face metadata
- Streaming and non‑streaming chat completions
- OpenAI‑compatible function calling with robust parser for model outputs (handles code fences/formatting noise)
- Chat templates from the model's `tokenizer_config.json` (Jinja), auto‑uses BOS/EOS; falls back to a friendly transcript format when missing
- Session reuse across turns via `session_id` (reuses KV cache when possible)
- Auto‑detects stop sequences and BOS token from tokenizer configs
- Health endpoint and simple status UI
- Real-time system resource monitoring
The following are 20-run averages from our batch benchmark suite. See raw results for details and variance.
| Server | Model | TTFT avg (ms) | Total avg (ms) | Chars/s avg | Success |
|---|---|---|---|---|---|
| Osaurus | llama-3.2-3b-instruct-4bit | 191 | 1461 | 521 | 100% |
| Ollama | llama3.2 | 59 | 1667 | 439 | 100% |
| LM Studio | llama-3.2-3b-instruct | 56 | 1205 | 605 | 100% |
- Metrics: TTFT = time-to-first-token, Total = time to final token, Chars/s = streaming throughput.
- Data sources: `results/osaurus-vs-ollama-lmstudio-batch.summary.json`, `results/osaurus-vs-ollama-lmstudio-batch.results.csv`.
- How to reproduce: `scripts/run_bench.sh` calls `scripts/benchmark_models.py` to run prompts across servers and write results.
- `GET /` → Plain text status
- `GET /health` → JSON health info
- `GET /models` and `GET /v1/models` → OpenAI‑compatible models list
- `POST /chat/completions` and `POST /v1/chat/completions` → OpenAI‑compatible chat completions
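For a quick liveness check from Python, here is a minimal sketch against the health endpoint (assumes the server is already running on the default port `8080`; the payload fields are whatever the server returns):

```python
import json
import urllib.request

# Query the /health endpoint and print the JSON payload.
with urllib.request.urlopen("http://127.0.0.1:8080/health") as resp:
    print(json.loads(resp.read().decode("utf-8")))
```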
Download the latest signed build from the Releases page.
- Open `osaurus.xcodeproj` in Xcode 16.4+
- Build and run the `osaurus` target
- In the UI, configure the port via the gear icon (default `8080`) and press Start
- Open the model manager to download a model (e.g., "Llama 3.2 3B Instruct 4bit")
Models are stored by default at `~/Documents/MLXModels`. Override with the `OSU_MODELS_DIR` environment variable.
Base URL: http://127.0.0.1:8080 (or your chosen port)
List models:

```bash
curl -s http://127.0.0.1:8080/v1/models | jq
```

Non‑streaming chat completion:

```bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b-instruct-4bit",
    "messages": [{"role":"user","content":"Write a haiku about dinosaurs"}],
    "max_tokens": 200
  }'
```

Streaming chat completion (SSE):

```bash
curl -N http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b-instruct-4bit",
    "messages": [{"role":"user","content":"Summarize Jurassic Park in one paragraph"}],
    "stream": true
  }'
```

Tip: Model names are lower‑cased with hyphens (derived from the friendly name), for example: Llama 3.2 3B Instruct 4bit → `llama-3.2-3b-instruct-4bit`.
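The same streaming request can be consumed from Python with the OpenAI SDK (a minimal sketch; the default port, placeholder API key, and model name are assumptions that should match your setup):

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

# stream=True makes the SDK consume the SSE stream and yield incremental deltas.
stream = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "Summarize Jurassic Park in one paragraph"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```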
Osaurus supports OpenAI‑style function calling. Send `tools` and an optional `tool_choice` in your request. The model is instructed to reply with an exact JSON object containing `tool_calls`, and the server parses it, tolerating common formatting noise such as code fences.

Define tools and let the model decide (`tool_choice: "auto"`):
```bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b-instruct-4bit",
    "messages": [
      {"role":"system","content":"You can call functions to answer queries succinctly."},
      {"role":"user","content":"What'\''s the weather in SF?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get weather by city name",
          "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
          }
        }
      }
    ],
    "tool_choice": "auto"
  }'
```

A non‑stream response will include `message.tool_calls` and `finish_reason: "tool_calls"`. Streaming responses emit OpenAI‑style deltas for `tool_calls` (id, type, function name, and chunked arguments), finishing with `finish_reason: "tool_calls"` and `[DONE]`.
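When streaming, the tool‑call deltas have to be accumulated on the client. Here is a hedged sketch using the OpenAI Python SDK (the tool definition mirrors the curl example above):

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather by city name",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

stream = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "What is the weather in SF?"}],
    tools=tools,
    tool_choice="auto",
    stream=True,
)

# Each delta may introduce a tool call (id/name) or append another chunk of its arguments.
calls = {}
for chunk in stream:
    if not chunk.choices:
        continue
    for tc in chunk.choices[0].delta.tool_calls or []:
        call = calls.setdefault(tc.index, {"id": None, "name": None, "arguments": ""})
        call["id"] = tc.id or call["id"]
        if tc.function:
            call["name"] = tc.function.name or call["name"]
            call["arguments"] += tc.function.arguments or ""

print(calls)  # e.g. {0: {"id": "...", "name": "get_weather", "arguments": '{"city":"SF"}'}}
```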
After you execute a tool, continue the conversation by sending a `tool` role message with `tool_call_id`:

```bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b-instruct-4bit",
    "messages": [
      {"role":"user","content":"What'\''s the weather in SF?"},
      {"role":"assistant","content":"","tool_calls":[{"id":"call_1","type":"function","function":{"name":"get_weather","arguments":"{\"city\":\"SF\"}"}}]},
      {"role":"tool","tool_call_id":"call_1","content":"{\"tempC\":18,\"conditions\":\"Foggy\"}"}
    ]
  }'
```

Notes:
- Only `type: "function"` tools are supported.
- Arguments must be a JSON‑escaped string in the assistant response; Osaurus also tolerates a nested `parameters` object and will normalize it.
- The parser accepts minor formatting noise like code fences and `assistant:` prefixes.
Osaurus loads Jinja chat templates from a model's `tokenizer_config.json` when available:

- Sources: `chat_template` or `default_chat_template` (string or `{text|content|template}` object).
- Context: Renders with `messages`, `add_generation_prompt: true`, and includes `bos_token`/`eos_token` if defined.
- System handling: If your request includes `role: system` messages, Osaurus combines them and passes them as model instructions while rendering the template over the remaining turns.
- Fallback: If no template is present or rendering fails, Osaurus uses a concise transcript format (`User: ... / Assistant: ...`), with the system text prepended.
- Tools: When `tools`/`tool_choice` are provided, a compact tools block is appended to the rendered prompt.
This keeps prompts aligned with each model’s native formatting while remaining OpenAI‑compatible at the API level.
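To make the fallback concrete, here is a hedged illustration of roughly what that transcript prompt looks like (the exact rendering happens inside Osaurus and may differ in spacing or labels):

```python
# Hypothetical sketch of the fallback transcript format described above.
system_text = "You are a concise assistant."
turns = [
    ("User", "Name a small dinosaur."),
    ("Assistant", "Compsognathus."),
    ("User", "How big was it?"),
]
prompt = system_text + "\n\n" + "\n".join(f"{who}: {text}" for who, text in turns) + "\nAssistant:"
print(prompt)
```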
For faster multi‑turn conversations, you can reuse a chat session's KV cache by providing `session_id` in your request. When possible (and not concurrently in use), Osaurus will reuse the session for the same model to reduce latency and cost.

```bash
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-3b-instruct-4bit",
    "session_id": "my-session-1",
    "messages": [
      {"role":"user","content":"Tell me a fact about stegosaurs"}
    ]
  }'
```

Notes:
- Sessions are opportunistically reused for a short window and only when not actively used by another request.
- Keep `session_id` stable per ongoing conversation and per model.
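With the OpenAI Python SDK, a non‑standard field like `session_id` can be passed through `extra_body` (a minimal sketch; the session id value is arbitrary):

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

# extra_body merges extra JSON fields into the request payload,
# which is how the non-standard session_id field reaches Osaurus.
resp = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "Tell me a fact about stegosaurs"}],
    extra_body={"session_id": "my-session-1"},
)
print(resp.choices[0].message.content)
```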
Point your client at Osaurus and use any placeholder API key.
Python example:
```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

resp = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "Hello there!"}],
)
print(resp.choices[0].message.content)
```

Python with tools (non‑stream):
```python
import json

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8080/v1", api_key="osaurus")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather by city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

resp = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "Weather in SF?"}],
    tools=tools,
    tool_choice="auto",
)

tool_calls = resp.choices[0].message.tool_calls or []
for call in tool_calls:
    args = json.loads(call.function.arguments)  # arguments your tool would receive
    result = {"tempC": 18, "conditions": "Foggy"}  # your tool result
    followup = client.chat.completions.create(
        model="llama-3.2-3b-instruct-4bit",
        messages=[
            {"role": "user", "content": "Weather in SF?"},
            {"role": "assistant", "content": "", "tool_calls": tool_calls},
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
        ],
    )
    print(followup.choices[0].message.content)
```

- Curated suggestions include Llama, Qwen, Gemma, Mistral, Phi, DeepSeek, etc. (4‑bit variants for speed)
- Discovery pulls from Hugging Face `mlx-community` and computes size estimates
- Required files are fetched automatically (tokenizer/config/weights)
- Change the models directory with `OSU_MODELS_DIR`
- Apple Silicon only (requires MLX); Intel Macs are not supported
- Localhost only, no authentication; put behind a proxy if exposing externally
- `/transcribe` endpoints are placeholders pending Whisper integration
- SwiftNIO (HTTP server)
- SwiftUI/AppKit (UI)
- MLX‑Swift, MLXLLM (runtime and chat session)
- wizardeur — first PR creator
- Join us on Discord
- Read the Contributing Guide and our Code of Conduct
- See our Security Policy for reporting vulnerabilities
- Get help in Support
- Pick up a good first issue or help wanted
If you find Osaurus useful, please ⭐ the repo and share it!