"Models are born production-ready"
Born is a modern deep learning framework for Go, inspired by Burn (Rust). Build ML models in pure Go and deploy as single binaries - no Python runtime, no complex dependencies.
Project Status: 🚀 v0.7.1 Released! (Code Quality + Burn Patterns!)
Latest: 🔧 Applied Burn framework patterns, reduced Flash Attention complexity from 111 to under 30, added a new internal/parallel package
Pure Go ML with GPU acceleration - no CGO required!
Deploying ML models is hard:
- Python runtime required
- Complex dependency management
- Large Docker images
- Slow startup times
- Integration friction with Go backends
import "github.com/born-ml/born"
// Models "born" ready for production
model := born.Load("resnet50.born")
prediction := model.Predict(image)
// That's it. No Python. No containers. Just Go.

Benefits:
- Single binary deployment
- Fast startup (< 100ms)
- Small memory footprint
- Native Go integration
- Cross-platform out of the box
- Pure Go - No CGO dependencies, trivial cross-compilation
- Type Safe - Generics-powered API for compile-time guarantees
- Autodiff - Automatic differentiation via decorator pattern
- Production Ready - Single binary deployment, fast startup
- WebAssembly - Run inference in browsers natively
- WebGPU Backend - Zero-CGO GPU via go-webgpu, 123x MatMul speedup
- 38+ GPU Operations - MatMul, BatchMatMul, Conv2D, MaxPool2D, Softmax, and more
- Lazy Evaluation - GPU-resident tensors, command batching (~90s → <5s/step)
- Multi-dim Transpose - GPU-accelerated 3D/4D/5D/6D tensors
- Automatic Memory - runtime.SetFinalizer for GPU buffer cleanup (see the sketch after this list)
- Flash Attention 2 - O(N) memory, WebGPU WGSL shader, 2x+ speedup on long sequences
- Speculative Decoding - Draft model + verification, 2-4x inference speedup
- Multi-Head Attention - MHA, SDPA, Grouped Query Attention (GQA)
- KV-Cache - Efficient autoregressive generation (3.94x speedup)
- Positional Encodings - RoPE, ALiBi, Sinusoidal, Learned
- Modern FFN - SwiGLU, GeGLU, ReGLU with gated activations
- Normalizations - LayerNorm, RMSNorm (LLaMA style)
- Tokenizers - TikToken, BPE, HuggingFace format, chat templates
- Sampling - Temperature, Top-K, Top-P, Min-P, repetition penalty
- Text Generation - Streaming API, stop sequences
- ONNX Import - Load PyTorch/TensorFlow models via .onnx (30+ operators)
- GGUF Import - llama.cpp format with K-quant dequantization (Q4_K, Q5_K, Q6_K, Q8_0)
- Native Format - .born format with nn.Save()/nn.Load()
- Checkpoints - Resume training with optimizer state preservation
- SafeTensors - HuggingFace compatible export
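The Automatic Memory bullet refers to Go's standard finalizer pattern. A minimal sketch of the idea, with an illustrative `gpuBuffer` type standing in for Born's internal buffer handle:

```go
import "runtime"

// gpuBuffer is an illustrative stand-in for a device allocation.
type gpuBuffer struct{ handle uintptr }

// Release frees the underlying GPU allocation via the backend.
func (b *gpuBuffer) Release() { /* e.g. destroy the wgpu buffer */ }

// newGPUBuffer ties cleanup to garbage collection: once the last
// reference is dropped, the finalizer releases the device memory.
// Explicit Release calls are still preferable on hot paths, since
// finalizers run at the GC's discretion.
func newGPUBuffer() *gpuBuffer {
	b := &gpuBuffer{}
	runtime.SetFinalizer(b, (*gpuBuffer).Release)
	return b
}
```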
# Clone repository
git clone https://github.com/born-ml/born.git
cd born
# Build
make build
# Or install CLI
make install

Requirements:
- Go 1.25+
- Make (optional, but recommended)
- golangci-lint (for linting)
Build:
make build # Build all binaries
make test # Run tests
make lint # Run linter
make bench # Run benchmarks

Working example included! See examples/mnist/ for the complete implementation.
package main
import (
"fmt"

"github.com/born-ml/born/autodiff"
"github.com/born-ml/born/backend/cpu"
"github.com/born-ml/born/nn"
"github.com/born-ml/born/optim"
)
func main() {
// Create backend with autodiff
backend := autodiff.New(cpu.New())
// Define model (784 → 128 → 10)
model := NewMNISTNet(backend)
// Create loss and optimizer
criterion := nn.NewCrossEntropyLoss(backend)
optimizer := optim.NewAdam(model.Parameters(), optim.AdamConfig{
LR: 0.001,
Betas: [2]float32{0.9, 0.999},
}, backend)
// Training loop
for epoch := range 10 {
// Forward pass (batch comes from the MNIST data loader; see examples/mnist)
logits := model.Forward(batch.ImagesTensor)
loss := criterion.Forward(logits, batch.LabelsTensor)
// Backward pass
optimizer.ZeroGrad()
grads := backend.Backward(loss.Raw())
optimizer.Step(grads)
// Log progress
acc := nn.Accuracy(logits, batch.LabelsTensor)
fmt.Printf("Epoch %d: Loss=%.4f, Accuracy=%.2f%%\n",
epoch, loss.Raw().AsFloat32()[0], acc*100)
}
}

Run it: cd examples/mnist && go run .
package main
import (
"fmt"
"github.com/born-ml/born/generate"
"github.com/born-ml/born/tokenizer"
"github.com/born-ml/born/loader"
)
func main() {
// Load tokenizer
tok, _ := tokenizer.NewTikTokenForModel("gpt-4")
// Load model (GGUF format)
model, _ := loader.OpenModel("llama-7b.gguf")
// Create generator with sampling config
gen := generate.NewTextGenerator(model, tok, generate.SamplingConfig{
Temperature: 0.7,
TopP: 0.9,
TopK: 40,
})
// Generate text
result, _ := gen.Generate("Hello, world!", generate.GenerateConfig{
MaxTokens: 100,
})
fmt.Println(result)
// Or use streaming
stream, _ := gen.GenerateStream("Once upon a time", generate.GenerateConfig{
MaxTokens: 50,
Stream: true,
})
for chunk := range stream {
fmt.Print(chunk.Token)
}
}

Core Features:
- ✅ Tensor operations (Add, MatMul, Reshape, Exp, Sqrt, Cat, etc.)
- ✅ 38+ GPU operations (BatchMatMul, Conv2D, MaxPool2D, Comparisons, Reductions)
- ✅ 31 type-safe public API operations (MulScalar, Greater, Softmax, Int32, etc.)
- ✅ Automatic differentiation with gradient tape
- ✅ Neural network modules (Linear, Conv2D, ReLU, SiLU, RMSNorm, Embedding)
- ✅ Optimizers (SGD with momentum, Adam with bias correction)
- ✅ Losses (CrossEntropyLoss with numerical stability)
- ✅ Complete WebGPU backend (zero-CGO, 123x MatMul speedup)
- ✅ Transformer primitives (for LLaMA, GPT, Mistral architectures)
Born uses a backend interface for device independence:
type Backend interface {
Add(a, b *RawTensor) *RawTensor
MatMul(a, b *RawTensor) *RawTensor
// ... other operations
}

Available Backends:
| Backend | Status | Description |
|---|---|---|
| CPU | ✅ Available | Pure Go implementation, all operations |
| WebGPU | ✅ Available | Zero-CGO GPU via go-webgpu |
| Vulkan | 📋 Planned | Cross-platform GPU compute (Linux focus) |
| CUDA | 📋 Planned | NVIDIA GPU via zero-CGO |
| Metal | 📋 Planned | Apple GPU (macOS/iOS) |
WebGPU Operation Support 🎉
| Category | Operations | Backend |
|---|---|---|
| Math | Add, Sub, Mul, Div (float32 + int32), Exp, Sqrt, Rsqrt, Log, Cos, Sin | ✅ GPU |
| Matrix | MatMul, BatchMatMul (3D/4D), Transpose, Reshape | ✅ GPU |
| CNN | Conv2D, MaxPool2D | ✅ GPU |
| Activation | ReLU, Sigmoid, Tanh, Softmax | ✅ GPU |
| Scalar | MulScalar, AddScalar, SubScalar, DivScalar | ✅ GPU |
| Reduction | Sum, SumDim, MeanDim, Argmax | ✅ GPU/CPU hybrid |
| Compare | Greater, Lower, GreaterEqual, LowerEqual, Equal, NotEqual | ✅ GPU |
| Boolean | And, Or, Not | ✅ GPU |
| Shape | Cat, Chunk, Unsqueeze, Squeeze, Expand | ✅ CPU (efficient) |
| Selection | Where, Gather, Embedding | ✅ GPU |
| Type | Cast (float32, int32) | ✅ CPU |
Total: 38+ GPU-accelerated operations!
All operations required for LLM inference (Attention, RoPE, LayerNorm, etc.) are fully supported on GPU.
GPU Backend Setup:
WebGPU requires the wgpu_native library. Download from wgpu-native releases:
Windows (x64):
# Download latest release
curl -LO https://github.com/gfx-rs/wgpu-native/releases/latest/download/wgpu-windows-x86_64-msvc-release.zip
unzip wgpu-windows-x86_64-msvc-release.zip
# Install DLL system-wide (requires admin)
copy lib\wgpu_native.dll C:\Windows\System32\
# Or place next to your executable
copy lib\wgpu_native.dll .\your-app\

Linux (x64):
curl -LO https://github.com/gfx-rs/wgpu-native/releases/latest/download/wgpu-linux-x86_64-release.zip
unzip wgpu-linux-x86_64-release.zip
sudo cp lib/libwgpu_native.so /usr/local/lib/
sudo ldconfig

macOS (ARM64):
curl -LO https://github.com/gfx-rs/wgpu-native/releases/latest/download/wgpu-macos-aarch64-release.zip
unzip wgpu-macos-aarch64-release.zip
sudo cp lib/libwgpu_native.dylib /usr/local/lib/

Usage:
import (
"github.com/born-ml/born/autodiff"
"github.com/born-ml/born/backend/cpu"
"github.com/born-ml/born/backend/webgpu"
)
// Automatic GPU/CPU selection with graceful fallback
var backend tensor.Backend
if webgpu.IsAvailable() {
gpu, err := webgpu.New()
if err == nil {
backend = autodiff.New(gpu)
defer gpu.Release() // Don't forget to release GPU resources
}
}
if backend == nil {
backend = autodiff.New(cpu.New())
}

Functionality is composed via decorators (inspired by Burn):
// Basic backend
base := cpu.New()
// Add autodiff
withAutodiff := autodiff.New(base)
// Add kernel fusion
optimized := fusion.New(withAutodiff)
// Your code works with any backend!
model := createModel(optimized)

Type-safe tensor API via generics:

type Tensor[T DType, B Backend] struct {
raw *RawTensor
backend B
}
// Compile-time type checking
func (t *Tensor[float32, B]) MatMul(other *Tensor[float32, B]) *Tensor[float32, B]

Core Framework
- Tensor API with generics, autodiff, NN modules (Linear, Conv2D, ReLU, etc.)
- Optimizers (SGD, Adam), losses (CrossEntropyLoss)
- MNIST: 97.44% MLP, 98.18% CNN accuracy
GPU Acceleration
- WebGPU backend with 38+ operations (123x MatMul speedup)
- Lazy evaluation, command batching (~90s → <5s/step)
- CNN support (Conv2D, MaxPool2D, BatchMatMul)
LLM & Transformers
- Multi-Head Attention, GQA, KV-Cache (3.94x speedup)
- RoPE, ALiBi, RMSNorm, SwiGLU (see the RoPE sketch below)
- Tokenizers (TikToken, BPE), text generation with streaming
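For a flavor of the math behind these primitives, here is a scalar sketch of LLaMA-style rotary embeddings (RoPE). It is illustrative only, not Born's GPU implementation:

```go
import "math"

// applyRoPE rotates consecutive (even, odd) feature pairs of one head
// vector in place. The pair starting at index i is rotated by
// theta = pos * base^(-i/d) (base is typically 10000), so early
// dimensions rotate fastest and encode fine-grained relative positions.
func applyRoPE(x []float32, pos int, base float64) {
	d := len(x)
	for i := 0; i < d; i += 2 {
		theta := float64(pos) * math.Pow(base, -float64(i)/float64(d))
		sin, cos := math.Sincos(theta)
		x0, x1 := float64(x[i]), float64(x[i+1])
		x[i] = float32(x0*cos - x1*sin)
		x[i+1] = float32(x0*sin + x1*cos)
	}
}
```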
Model Import & Export
- ONNX import (30+ operators)
- GGUF loading (LLaMA, Mistral, DeepSeek; see the dequantization sketch below)
- Native .born format, SafeTensors export
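For a flavor of what GGUF dequantization involves, here is a sketch of the simplest quant format, Q8_0 (blocks of 32 signed 8-bit values sharing one scale). It assumes the scales were already decoded from float16, and it is not Born's loader code:

```go
// dequantizeQ8_0 expands Q8_0 data: value = blockScale * quant.
// scales holds one float32 per 32-element block of quants.
func dequantizeQ8_0(scales []float32, quants []int8) []float32 {
	const blockSize = 32
	out := make([]float32, len(quants))
	for i, q := range quants {
		out[i] = scales[i/blockSize] * float32(q)
	}
	return out
}
```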
Quantization (v0.8.0) - GPTQ/AWQ (4x smaller), KV Cache compression, Model Zoo
Production Serving - PagedAttention, Continuous Batching, OpenAI-compatible API
Scale & Stability - Multi-GPU, CPU SIMD (AVX2/Neon), Gradient Checkpointing
v1.0 LTS - API freeze, 3+ years support, production hardening
Full roadmap & changelog: See ROADMAP.md and CHANGELOG.md
- Philosophy - Production-first design principles
- Use Cases - When to use Born (and when not)
- Getting Started - Installation and first steps (coming soon)
- API Reference - Complete API documentation
- Examples - Sample code (MNIST MLP, CNN, GPU inference)
- Contributing - How to contribute
- GitHub Issues - Report bugs or request features
Models trained anywhere (PyTorch, TensorFlow) are imported and born production-ready:
Training → Birth → Production
(Burn) (Born) (Run)
PyTorch trains → Born imports → Born deploys
TensorFlow trains → Born imports → Born deploys
Born trains → Born ready → Born serves
- Single Binary: Entire model in one executable
- No Runtime: No Python, no dependencies
- Fast Startup: < 100ms cold start
- Small Memory: Minimal footprint
- Cloud Native: Natural fit for Go services
- Type Safe: Catch errors at compile time
- Clean API: Intuitive and ergonomic
- Great Docs: Comprehensive documentation
- Easy Deploy: go build and you're done
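Cross-compilation is the standard Go toolchain switch; with no CGO there is nothing else to configure:

```bash
# Build for Linux, Windows, or the browser from any host
GOOS=linux GOARCH=amd64 go build -o server .
GOOS=windows GOARCH=amd64 go build -o server.exe .
GOOS=js GOARCH=wasm go build -o model.wasm .
```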
Actual Benchmarks (AMD Ryzen 9 5950X, NVIDIA RTX 3080):
| Operation | CPU | GPU | Speedup |
|---|---|---|---|
| MatMul 1024x1024 | 7143ms | 58ms | 123x |
| MatMul 512x512 | 499ms | 12ms | 41x |
| MatMul 256x256 | 56ms | 3.7ms | 15x |
| Batch Size | CPU | GPU | Speedup | Throughput |
|---|---|---|---|---|
| 64 | 48ms | 19ms | 2.5x | 3,357/s |
| 256 | 182ms | 21ms | 8.5x | 11,883/s |
| 512 | 348ms | 32ms | 10.9x | 15,973/s |
Note: CPU backend uses naive O(n³) MatMul. SIMD optimizations planned for future releases.
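For reference, the naive algorithm that note refers to looks like the following (a generic sketch, not Born's actual CPU kernel):

```go
// matmulNaive computes C = A*B for row-major m×k and k×n matrices.
// The three nested loops are the O(n³) cost measured above; SIMD and
// cache blocking are the usual next optimizations.
func matmulNaive(a, b []float32, m, k, n int) []float32 {
	c := make([]float32, m*n)
	for i := 0; i < m; i++ {
		for j := 0; j < n; j++ {
			var sum float32
			for p := 0; p < k; p++ {
				sum += a[i*k+p] * b[p*n+j]
			}
			c[i*n+j] = sum
		}
	}
	return c
}
```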
Born includes 30+ optimized WGSL compute shaders:
| Shader | Workgroup | Description |
|---|---|---|
| `addShader` | 256 | Element-wise addition |
| `subShader` | 256 | Element-wise subtraction |
| `mulShader` | 256 | Element-wise multiplication |
| `divShader` | 256 | Element-wise division |
| `matmulShader` | 16x16 | Matrix multiplication (2D) |
| `batchMatMulShader` | 8x8x1 | Batched matmul (3D/4D) |
| `conv2dShader` | 8x8x1 | 2D convolution with padding |
| `maxPool2dShader` | 8x8x1 | 2D max pooling |
| `transposeShader` | 16x16 | Matrix transpose |
| `reluShader` | 256 | ReLU activation |
| `sigmoidShader` | 256 | Sigmoid activation |
| `tanhShader` | 256 | Tanh activation |
| `softmaxShader` | 256 | Softmax (numerically stable) |
| `expShader` | 256 | Element-wise exp |
| `sqrtShader` | 256 | Element-wise sqrt |
| `rsqrtShader` | 256 | Reciprocal sqrt (1/√x) |
| `cosShader` | 256 | Element-wise cosine |
| `sinShader` | 256 | Element-wise sine |
| `greaterShader` | 256 | Greater-than comparison |
| `lowerShader` | 256 | Less-than comparison |
| `equalShader` | 256 | Equality comparison |
| `andShader` | 256 | Logical AND |
| `orShader` | 256 | Logical OR |
| `notShader` | 256 | Logical NOT |
| `argmaxShader` | 256 | Argmax along dimension |
| `globalSumShader` | 256 | Parallel sum reduction |
| `scalarMulShader` | 256 | Scalar multiplication |
| `scalarAddShader` | 256 | Scalar addition |
| `addShaderInt32` | 256 | Int32 element-wise addition |
| `subShaderInt32` | 256 | Int32 element-wise subtraction |
| `mulShaderInt32` | 256 | Int32 element-wise multiplication |
| `divShaderInt32` | 256 | Int32 element-wise division |
All shaders use workgroup shared memory for optimal performance and support bounds checking for safety.
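The "numerically stable" note on softmaxShader refers to the standard max-subtraction trick; in scalar Go it looks like this (illustrative, not the WGSL source):

```go
import "math"

// softmaxStable computes softmax(x)[i] = exp(x[i]-m) / sum_j exp(x[j]-m)
// with m = max(x). Shifting by the max keeps exp from overflowing
// without changing the result.
func softmaxStable(x []float32) []float32 {
	m := x[0]
	for _, v := range x[1:] {
		if v > m {
			m = v
		}
	}
	out := make([]float32, len(x))
	var sum float32
	for i, v := range x {
		e := float32(math.Exp(float64(v - m)))
		out[i] = e
		sum += e
	}
	for i := range out {
		out[i] /= sum
	}
	return out
}
```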
Born is inspired by and learns from:
- Burn - Architecture patterns, decorator design
- PyTorch - API ergonomics
- TinyGrad - Simplicity principles
- Gonum - Go numerical computing
- HDF5 for Go - Model serialization, dataset storage (planned)
Special thanks to the projects that made Born possible:
🙏 go-webgpu & wgpu-native
Born's GPU acceleration is powered by go-webgpu - a remarkable pure Go binding for WebGPU via wgpu-native.
Why this stack is special:
- Zero CGO - Pure Go bindings using goffi for FFI
- Cross-platform - Works on Windows (D3D12), Linux (Vulkan), macOS (Metal)
- Modern API - Clean, idiomatic Go interface to WebGPU
- wgpu-native - Battle-tested Rust implementation of WebGPU by gfx-rs
- Active development - Both projects are actively maintained
Without go-webgpu and wgpu-native, Born would need CGO for GPU support, making cross-compilation complex and defeating our "pure Go" goal. This stack enables us to offer production-ready GPU acceleration while maintaining the simplicity of go build.
Thank you to Alfred Dobra, gfx-rs team, and all contributors!
Project is in active development. Star the repo to follow progress!
- GitHub Org: github.com/born-ml
- Main Repo: github.com/born-ml/born
- Discussions: GitHub Discussions
- Issues: Report bugs or request features
Licensed under the Apache License, Version 2.0.
Why Apache 2.0?
- ✅ Patent protection - Critical for ML algorithms and production use
- ✅ Enterprise-friendly - Clear legal framework for commercial adoption
- ✅ Industry standard - Same as TensorFlow, battle-tested in ML ecosystem
- ✅ Contributor protection - Explicit patent grant and termination clauses
See LICENSE file for full terms.
Q: Why not use Gorgonia? A: Gorgonia is great but uses a different approach. Born focuses on modern Go (generics), pure Go (no CGO), and production-first design inspired by Burn.
Q: Can I run LLMs with Born? A: Yes! Full LLM support included - GGUF model loading, tokenizers, sampling strategies, and text generation with streaming. Load LLaMA, Mistral, or DeepSeek models directly.
Q: When will it be ready? A: Core features are released! CPU/GPU backends, transformers, LLM support, and ONNX import all work. See ROADMAP.md for upcoming features.
Q: Can I use PyTorch models? A: Yes! Via ONNX import. Train in PyTorch, export to ONNX, deploy with Born. GGUF models are also supported.
Q: WebAssembly support? A: Yes! Pure Go compiles to WASM natively. Inference in browsers out of the box.
Q: What LLM architectures are supported? A: LLaMA 2/3, Mistral, DeepSeek, and compatible architectures. GQA, RoPE, SwiGLU are all supported.
Q: How do I enable GPU acceleration?
A: Install wgpu_native library from wgpu-native releases, then use webgpu.IsAvailable() to check GPU support. See Architecture for setup instructions. 38+ GPU operations included - everything needed for LLM inference!
Q: What GPU operations are supported? A: All operations needed for production ML! Math (Add, Mul, Exp, etc.), Matrix (MatMul, BatchMatMul, Conv2D), Activations (ReLU, Softmax), Comparisons (Greater, Equal), Boolean (And, Or, Not), Reductions (Sum, Argmax), and more. See the WebGPU Operation Table.
Q: How can I help? A: Check our Contributing Guide and GitHub Issues!
Born for Production. Ready from Day One.
Made with ❤️ by the Born ML team