"Models are born production-ready"
Born is a modern deep learning framework for Go, inspired by Burn (Rust). Build ML models in pure Go and deploy as single binaries - no Python runtime, no complex dependencies.
Project Status: 🚀 v0.7.1 Released! (Code Quality + Burn Patterns!)
Latest: 🔧 Applied Burn framework patterns, reduced Flash Attention complexity from 111 to under 30, added a new internal/parallel package
Pure Go ML with GPU acceleration - no CGO required!
Deploying ML models is hard:
- Python runtime required
- Complex dependency management
- Large Docker images
- Slow startup times
- Integration friction with Go backends
import "github.com/born-ml/born"
// Models "born" ready for production
model := born.Load("resnet50.born")
prediction := model.Predict(image)
// That's it. No Python. No containers. Just Go.

Benefits:
- Single binary deployment
- Fast startup (< 100ms)
- Small memory footprint
- Native Go integration
- Cross-platform out of the box
- Pure Go - No CGO dependencies, trivial cross-compilation
- Type Safe - Generics-powered API for compile-time guarantees
- Autodiff - Automatic differentiation via decorator pattern
- Production Ready - Single binary deployment, fast startup
- WebAssembly - Run inference in browsers natively
- WebGPU Backend - Zero-CGO GPU via go-webgpu, 123x MatMul speedup
- 38+ GPU Operations - MatMul, BatchMatMul, Conv2D, MaxPool2D, Softmax, and more
- Lazy Evaluation - GPU-resident tensors, command batching (~90s → <5s/step)
- Multi-dim Transpose - GPU-accelerated 3D/4D/5D/6D tensors
- Automatic Memory - runtime.SetFinalizer for GPU buffer cleanup (see the sketch after this list)
- Flash Attention 2 - O(N) memory, WebGPU WGSL shader, 2x+ speedup on long sequences
- Speculative Decoding - Draft model + verification, 2-4x inference speedup
- Multi-Head Attention - MHA, SDPA, Grouped Query Attention (GQA)
- KV-Cache - Efficient autoregressive generation (3.94x speedup)
- Positional Encodings - RoPE, ALiBi, Sinusoidal, Learned
- Modern FFN - SwiGLU, GeGLU, ReGLU with gated activations
- Normalizations - LayerNorm, RMSNorm (LLaMA style)
- Tokenizers - TikToken, BPE, HuggingFace format, chat templates
- Sampling - Temperature, Top-K, Top-P, Min-P, repetition penalty
- Text Generation - Streaming API, stop sequences
- ONNX Import - Load PyTorch/TensorFlow models via .onnx (30+ operators)
- GGUF Import - llama.cpp format with K-quant dequantization (Q4_K, Q5_K, Q6_K, Q8_0)
- Native Format - .born format with nn.Save()/nn.Load()
- Checkpoints - Resume training with optimizer state preservation
- SafeTensors - HuggingFace compatible export
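The Automatic Memory bullet refers to Go's standard finalizer pattern. A minimal sketch of the idea, with an illustrative `gpuBuffer` type standing in for Born's internal buffer handle:

```go
import "runtime"

// gpuBuffer is an illustrative stand-in for a device allocation.
type gpuBuffer struct{ handle uintptr }

// Release frees the underlying GPU allocation via the backend.
func (b *gpuBuffer) Release() { /* e.g. destroy the wgpu buffer */ }

// newGPUBuffer ties cleanup to garbage collection: once the last
// reference is dropped, the finalizer releases the device memory.
// Explicit Release calls are still preferable on hot paths, since
// finalizers run at the GC's discretion.
func newGPUBuffer() *gpuBuffer {
	b := &gpuBuffer{}
	runtime.SetFinalizer(b, (*gpuBuffer).Release)
	return b
}
```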
# Clone repository
git clone https://github.com/born-ml/born.git
cd born
# Build
make build
# Or install CLI
make install

Requirements:
- Go 1.25+
- Make (optional, but recommended)
- golangci-lint (for linting)
Build:
make build # Build all binaries
make test # Run tests
make lint # Run linter
make bench # Run benchmarks

Working example included! See examples/mnist/ for the complete implementation.
package main
import (
"fmt"

"github.com/born-ml/born/autodiff"
"github.com/born-ml/born/backend/cpu"
"github.com/born-ml/born/nn"
"github.com/born-ml/born/optim"
)
func main() {
// Create backend with autodiff
backend := autodiff.New(cpu.New())
// Define model (784 → 128 → 10)
model := NewMNISTNet(backend)
// Create loss and optimizer
criterion := nn.NewCrossEntropyLoss(backend)
optimizer := optim.NewAdam(model.Parameters(), optim.AdamConfig{
LR: 0.001,
Betas: [2]float32{0.9, 0.999},
}, backend)
// Training loop
for epoch := range 10 {
// Forward pass (batch comes from the MNIST data loader; see examples/mnist)
logits := model.Forward(batch.ImagesTensor)
loss := criterion.Forward(logits, batch.LabelsTensor)
// Backward pass
optimizer.ZeroGrad()
grads := backend.Backward(loss.Raw())
optimizer.Step(grads)
// Log progress
acc := nn.Accuracy(logits, batch.LabelsTensor)
fmt.Printf("Epoch %d: Loss=%.4f, Accuracy=%.2f%%\n",
epoch, loss.Raw().AsFloat32()[0], acc*100)
}
}

Run it: cd examples/mnist && go run .
package main
import (
"fmt"
"github.com/born-ml/born/generate"
"github.com/born-ml/born/tokenizer"
"github.com/born-ml/born/loader"
)
func main() {
// Load tokenizer
tok, _ := tokenizer.NewTikTokenForModel("gpt-4")
// Load model (GGUF format)
model, _ := loader.OpenModel("llama-7b.gguf")
// Create generator with sampling config
gen := generate.NewTextGenerator(model, tok, generate.SamplingConfig{
Temperature: 0.7,
TopP: 0.9,
TopK: 40,
})
// Generate text
result, _ := gen.Generate("Hello, world!", generate.GenerateConfig{
MaxTokens: 100,
})
fmt.Println(result)
// Or use streaming
stream, _ := gen.GenerateStream("Once upon a time", generate.GenerateConfig{
MaxTokens: 50,
Stream: true,
})
for chunk := range stream {
fmt.Print(chunk.Token)
}
}

Core Features:
- ✅ Tensor operations (Add, MatMul, Reshape, Exp, Sqrt, Cat, etc.)
- ✅ 38+ GPU operations (BatchMatMul, Conv2D, MaxPool2D, Comparisons, Reductions)
- ✅ 31 type-safe public API operations (MulScalar, Greater, Softmax, Int32, etc.)
- ✅ Automatic differentiation with gradient tape
- ✅ Neural network modules (Linear, Conv2D, ReLU, SiLU, RMSNorm, Embedding)
- ✅ Optimizers (SGD with momentum, Adam with bias correction)
- ✅ Losses (CrossEntropyLoss with numerical stability)
- ✅ Complete WebGPU backend (zero-CGO, 123x MatMul speedup)
- ✅ Transformer primitives (for LLaMA, GPT, Mistral architectures)
Born uses a backend interface for device independence:
type Backend interface {
Add(a, b *RawTensor) *RawTensor
MatMul(a, b *RawTensor) *RawTensor
// ... other operations
}

Available Backends:
| Backend | Status | Description |
|---|---|---|
| CPU | ✅ Available | Pure Go implementation, all operations |
| WebGPU | ✅ Available | Zero-CGO GPU via go-webgpu |
| Vulkan | 📋 Planned | Cross-platform GPU compute (Linux focus) |
| CUDA | 📋 Planned | NVIDIA GPU via zero-CGO |
| Metal | 📋 Planned | Apple GPU (macOS/iOS) |
WebGPU Operation Support 🎉
| Category | Operations | Backend |
|---|---|---|
| Math | Add, Sub, Mul, Div (float32 + int32), Exp, Sqrt, Rsqrt, Log, Cos, Sin | ✅ GPU |
| Matrix | MatMul, BatchMatMul (3D/4D), Transpose, Reshape | ✅ GPU |
| CNN | Conv2D, MaxPool2D | ✅ GPU |
| Activation | ReLU, Sigmoid, Tanh, Softmax | ✅ GPU |
| Scalar | MulScalar, AddScalar, SubScalar, DivScalar | ✅ GPU |
| Reduction | Sum, SumDim, MeanDim, Argmax | ✅ GPU/CPU hybrid |
| Compare | Greater, Lower, GreaterEqual, LowerEqual, Equal, NotEqual | ✅ GPU |
| Boolean | And, Or, Not | ✅ GPU |
| Shape | Cat, Chunk, Unsqueeze, Squeeze, Expand | ✅ CPU (efficient) |
| Selection | Where, Gather, Embedding | ✅ GPU |
| Type | Cast (float32, int32) | ✅ CPU |
Total: 38+ GPU-accelerated operations!
All operations required for LLM inference (Attention, RoPE, LayerNorm, etc.) are fully supported on GPU.
GPU Backend Setup:
WebGPU requires the wgpu_native library. Download from wgpu-native releases:
Windows (x64):
# Download latest release
curl -LO https://github.com/gfx-rs/wgpu-native/releases/latest/download/wgpu-windows-x86_64-msvc-release.zip
unzip wgpu-windows-x86_64-msvc-release.zip
# Install DLL system-wide (requires admin)
copy lib\wgpu_native.dll C:\Windows\System32\
# Or place next to your executable
copy lib\wgpu_native.dll .\your-app\

Linux (x64):
curl -LO https://github.com/gfx-rs/wgpu-native/releases/latest/download/wgpu-linux-x86_64-release.zip
unzip wgpu-linux-x86_64-release.zip
sudo cp lib/libwgpu_native.so /usr/local/lib/
sudo ldconfig

macOS (ARM64):
curl -LO https://github.com/gfx-rs/wgpu-native/releases/latest/download/wgpu-macos-aarch64-release.zip
unzip wgpu-macos-aarch64-release.zip
sudo cp lib/libwgpu_native.dylib /usr/local/lib/

Usage:
import (
"github.com/born-ml/born/autodiff"
"github.com/born-ml/born/backend/cpu"
"github.com/born-ml/born/backend/webgpu"
)
// Automatic GPU/CPU selection with graceful fallback
var backend tensor.Backend
if webgpu.IsAvailable() {
gpu, err := webgpu.New()
if err == nil {
backend = autodiff.New(gpu)
defer gpu.Release() // Don't forget to release GPU resources
}
}
if backend == nil {
backend = autodiff.New(cpu.New())
}

Functionality is composed via decorators (inspired by Burn):
// Basic backend
base := cpu.New()
// Add autodiff
withAutodiff := autodiff.New(base)
// Add kernel fusion
optimized := fusion.New(withAutodiff)
// Your code works with any backend!
model := createModel(optimized)

Type-safe tensor API via generics:

type Tensor[T DType, B Backend] struct {
raw *RawTensor
backend B
}
// Compile-time type checking
func (t *Tensor[float32, B]) MatMul(other *Tensor[float32, B]) *Tensor[float32, B]

Core Framework
- Tensor API with generics, autodiff, NN modules (Linear, Conv2D, ReLU, etc.)
- Optimizers (SGD, Adam), losses (CrossEntropyLoss)
- MNIST: 97.44% MLP, 98.18% CNN accuracy
GPU Acceleration
- WebGPU backend with 38+ operations (123x MatMul speedup)
- Lazy evaluation, command batching (~90s → <5s/step)
- CNN support (Conv2D, MaxPool2D, BatchMatMul)
LLM & Transformers
- Multi-Head Attention, GQA, KV-Cache (3.94x speedup)
- RoPE, ALiBi, RMSNorm, SwiGLU (see the RoPE sketch below)
- Tokenizers (TikToken, BPE), text generation with streaming
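For a flavor of the math behind these primitives, here is a scalar sketch of LLaMA-style rotary embeddings (RoPE). It is illustrative only, not Born's GPU implementation:

```go
import "math"

// applyRoPE rotates consecutive (even, odd) feature pairs of one head
// vector in place. The pair starting at index i is rotated by
// theta = pos * base^(-i/d) (base is typically 10000), so early
// dimensions rotate fastest and encode fine-grained relative positions.
func applyRoPE(x []float32, pos int, base float64) {
	d := len(x)
	for i := 0; i < d; i += 2 {
		theta := float64(pos) * math.Pow(base, -float64(i)/float64(d))
		sin, cos := math.Sincos(theta)
		x0, x1 := float64(x[i]), float64(x[i+1])
		x[i] = float32(x0*cos - x1*sin)
		x[i+1] = float32(x0*sin + x1*cos)
	}
}
```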
Model Import & Export
- ONNX import (30+ operators)
- GGUF loading (LLaMA, Mistral, DeepSeek; see the dequantization sketch below)
- Native .born format, SafeTensors export
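For a flavor of what GGUF dequantization involves, here is a sketch of the simplest quant format, Q8_0 (blocks of 32 signed 8-bit values sharing one scale). It assumes the scales were already decoded from float16, and it is not Born's loader code:

```go
// dequantizeQ8_0 expands Q8_0 data: value = blockScale * quant.
// scales holds one float32 per 32-element block of quants.
func dequantizeQ8_0(scales []float32, quants []int8) []float32 {
	const blockSize = 32
	out := make([]float32, len(quants))
	for i, q := range quants {
		out[i] = scales[i/blockSize] * float32(q)
	}
	return out
}
```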
Quantization (v0.8.0) - GPTQ/AWQ (4x smaller), KV Cache compression, Model Zoo
Production Serving - PagedAttention, Continuous Batching, OpenAI-compatible API
Scale & Stability - Multi-GPU, CPU SIMD (AVX2/Neon), Gradient Checkpointing
v1.0 LTS - API freeze, 3+ years support, production hardening
Full roadmap & changelog: See ROADMAP.md and CHANGELOG.md
- Philosophy - Production-first design principles
- Use Cases - When to use Born (and when not)
- Getting Started - Installation and first steps (coming soon)
- API Reference - Complete API documentation
- Examples - Sample code (MNIST MLP, CNN, GPU inference)
- Contributing - How to contribute
- GitHub Issues - Report bugs or request features
Models trained anywhere (PyTorch, TensorFlow) are imported and born production-ready:
Training → Birth → Production
(Burn) (Born) (Run)
PyTorch trains → Born imports → Born deploys
TensorFlow trains → Born imports → Born deploys
Born trains → Born ready → Born serves
- Single Binary: Entire model in one executable
- No Runtime: No Python, no dependencies
- Fast Startup: < 100ms cold start
- Small Memory: Minimal footprint
- Cloud Native: Natural fit for Go services
- Type Safe: Catch errors at compile time
- Clean API: Intuitive and ergonomic
- Great Docs: Comprehensive documentation
- Easy Deploy: go build and you're done
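Cross-compilation is the standard Go toolchain switch; with no CGO there is nothing else to configure:

```bash
# Build for Linux, Windows, or the browser from any host
GOOS=linux GOARCH=amd64 go build -o server .
GOOS=windows GOARCH=amd64 go build -o server.exe .
GOOS=js GOARCH=wasm go build -o model.wasm .
```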
Actual Benchmarks (AMD Ryzen 9 5950X, NVIDIA RTX 3080):
| Operation | CPU | GPU | Speedup |
|---|---|---|---|
| MatMul 1024x1024 | 7143ms | 58ms | 123x |
| MatMul 512x512 | 499ms | 12ms | 41x |
| MatMul 256x256 | 56ms | 3.7ms | 15x |
| Batch Size | CPU | GPU | Speedup | Throughput |
|---|---|---|---|---|
| 64 | 48ms | 19ms | 2.5x | 3,357/s |
| 256 | 182ms | 21ms | 8.5x | 11,883/s |
| 512 | 348ms | 32ms | 10.9x | 15,973/s |
Note: CPU backend uses naive O(n³) MatMul. SIMD optimizations planned for future releases.
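For reference, the naive algorithm that note refers to looks like the following (a generic sketch, not Born's actual CPU kernel):

```go
// matmulNaive computes C = A*B for row-major m×k and k×n matrices.
// The three nested loops are the O(n³) cost measured above; SIMD and
// cache blocking are the usual next optimizations.
func matmulNaive(a, b []float32, m, k, n int) []float32 {
	c := make([]float32, m*n)
	for i := 0; i < m; i++ {
		for j := 0; j < n; j++ {
			var sum float32
			for p := 0; p < k; p++ {
				sum += a[i*k+p] * b[p*n+j]
			}
			c[i*n+j] = sum
		}
	}
	return c
}
```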
Born includes 30+ optimized WGSL compute shaders:
| Shader | Workgroup | Description |
|---|---|---|
| `addShader` | 256 | Element-wise addition |
| `subShader` | 256 | Element-wise subtraction |
| `mulShader` | 256 | Element-wise multiplication |
| `divShader` | 256 | Element-wise division |
| `matmulShader` | 16x16 | Matrix multiplication (2D) |
| `batchMatMulShader` | 8x8x1 | Batched matmul (3D/4D) |
| `conv2dShader` | 8x8x1 | 2D convolution with padding |
| `maxPool2dShader` | 8x8x1 | 2D max pooling |
| `transposeShader` | 16x16 | Matrix transpose |
| `reluShader` | 256 | ReLU activation |
| `sigmoidShader` | 256 | Sigmoid activation |
| `tanhShader` | 256 | Tanh activation |
| `softmaxShader` | 256 | Softmax (numerically stable) |
| `expShader` | 256 | Element-wise exp |
| `sqrtShader` | 256 | Element-wise sqrt |
| `rsqrtShader` | 256 | Reciprocal sqrt (1/√x) |
| `cosShader` | 256 | Element-wise cosine |
| `sinShader` | 256 | Element-wise sine |
| `greaterShader` | 256 | Greater-than comparison |
| `lowerShader` | 256 | Less-than comparison |
| `equalShader` | 256 | Equality comparison |
| `andShader` | 256 | Logical AND |
| `orShader` | 256 | Logical OR |
| `notShader` | 256 | Logical NOT |
| `argmaxShader` | 256 | Argmax along dimension |
| `globalSumShader` | 256 | Parallel sum reduction |
| `scalarMulShader` | 256 | Scalar multiplication |
| `scalarAddShader` | 256 | Scalar addition |
| `addShaderInt32` | 256 | Int32 element-wise addition |
| `subShaderInt32` | 256 | Int32 element-wise subtraction |
| `mulShaderInt32` | 256 | Int32 element-wise multiplication |
| `divShaderInt32` | 256 | Int32 element-wise division |
All shaders use workgroup shared memory for optimal performance and support bounds checking for safety.
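The "numerically stable" note on softmaxShader refers to the standard max-subtraction trick; in scalar Go it looks like this (illustrative, not the WGSL source):

```go
import "math"

// softmaxStable computes softmax(x)[i] = exp(x[i]-m) / sum_j exp(x[j]-m)
// with m = max(x). Shifting by the max keeps exp from overflowing
// without changing the result.
func softmaxStable(x []float32) []float32 {
	m := x[0]
	for _, v := range x[1:] {
		if v > m {
			m = v
		}
	}
	out := make([]float32, len(x))
	var sum float32
	for i, v := range x {
		e := float32(math.Exp(float64(v - m)))
		out[i] = e
		sum += e
	}
	for i := range out {
		out[i] /= sum
	}
	return out
}
```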
Born is inspired by and learns from:
- Burn - Architecture patterns, decorator design
- PyTorch - API ergonomics
- TinyGrad - Simplicity principles
- Gonum - Go numerical computing
- HDF5 for Go - Model serialization, dataset storage (planned)
Special thanks to the projects that made Born possible:
🙏 go-webgpu & wgpu-native
Born's GPU acceleration is powered by go-webgpu - a remarkable pure Go binding for WebGPU via wgpu-native.
Why this stack is special:
- Zero CGO - Pure Go bindings using goffi for FFI
- Cross-platform - Works on Windows (D3D12), Linux (Vulkan), macOS (Metal)
- Modern API - Clean, idiomatic Go interface to WebGPU
- wgpu-native - Battle-tested Rust implementation of WebGPU by gfx-rs
- Active development - Both projects are actively maintained
Without go-webgpu and wgpu-native, Born would need CGO for GPU support, making cross-compilation complex and defeating our "pure Go" goal. This stack enables us to offer production-ready GPU acceleration while maintaining the simplicity of go build.
Thank you to Alfred Dobra, gfx-rs team, and all contributors!
Project is in active development. Star the repo to follow progress!
- GitHub Org: github.com/born-ml
- Main Repo: github.com/born-ml/born
- Discussions: GitHub Discussions
- Issues: Report bugs or request features
Licensed under the Apache License, Version 2.0.
Why Apache 2.0?
- ✅ Patent protection - Critical for ML algorithms and production use
- ✅ Enterprise-friendly - Clear legal framework for commercial adoption
- ✅ Industry standard - Same as TensorFlow, battle-tested in ML ecosystem
- ✅ Contributor protection - Explicit patent grant and termination clauses
See LICENSE file for full terms.
Q: Why not use Gorgonia? A: Gorgonia is great but uses a different approach. Born focuses on modern Go (generics), pure Go (no CGO), and production-first design inspired by Burn.
Q: Can I run LLMs with Born? A: Yes! Full LLM support included - GGUF model loading, tokenizers, sampling strategies, and text generation with streaming. Load LLaMA, Mistral, or DeepSeek models directly.
Q: When will it be ready? A: Core features are released! CPU/GPU backends, transformers, LLM support, and ONNX import all work. See ROADMAP.md for upcoming features.
Q: Can I use PyTorch models? A: Yes! Via ONNX import. Train in PyTorch, export to ONNX, deploy with Born. GGUF models are also supported.
Q: WebAssembly support? A: Yes! Pure Go compiles to WASM natively. Inference in browsers out of the box.
Q: What LLM architectures are supported? A: LLaMA 2/3, Mistral, DeepSeek, and compatible architectures. GQA, RoPE, SwiGLU are all supported.
Q: How do I enable GPU acceleration?
A: Install wgpu_native library from wgpu-native releases, then use webgpu.IsAvailable() to check GPU support. See Architecture for setup instructions. 38+ GPU operations included - everything needed for LLM inference!
Q: What GPU operations are supported? A: All operations needed for production ML! Math (Add, Mul, Exp, etc.), Matrix (MatMul, BatchMatMul, Conv2D), Activations (ReLU, Softmax), Comparisons (Greater, Equal), Boolean (And, Or, Not), Reductions (Sum, Argmax), and more. See the WebGPU Operation Table.
Q: How can I help? A: Check our Contributing Guide and GitHub Issues!
Born for Production. Ready from Day One.
Made with ❤️ by the Born ML team