macOS · Linux · Windows · Pure Rust

Edge inference for small language models

SAPIENT runs LLMs locally — on your laptop, a Raspberry Pi, or a CI runner. No Python. No Docker. No GPU required.

curl -fsSL https://sapient.openhorizon.so/install | sh

Installs to ~/.local/bin. If sapient isn't found afterward, run: export PATH="$HOME/.local/bin:$PATH"

Binary size: 10 MB
Cold load vs Ollama: 1.9× faster
CPU throughput: 16 tok/s
External runtime dependencies: 0

Benchmarked on an Apple M4 Pro against Ollama 0.12.6: SAPIENT cold-loads 1.9× faster, ships a binary 3× smaller, and needs no background daemon, which suits edge devices, embedded systems, and CI pipelines.

Installation

One command. Any platform.

macOS & Linux

curl -fsSL https://sapient.openhorizon.so/install | sh

Installs to ~/.local/bin. Add to PATH if needed: export PATH="$HOME/.local/bin:$PATH"
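If the PATH line ends up in a shell startup file, it helps to make it idempotent so repeated sourcing does not keep growing PATH. A minimal sketch, assuming a POSIX shell; path_prepend is a hypothetical helper name, not something the installer provides:

```shell
# Prepend a directory to PATH only if it is not already listed.
# (path_prepend is an illustrative helper, not part of sapient.)
path_prepend() {
  case ":$PATH:" in
    *":$1:"*) ;;               # already present, leave PATH unchanged
    *) PATH="$1:$PATH" ;;      # otherwise put it at the front
  esac
}

path_prepend "$HOME/.local/bin"
export PATH

# Report (without failing) if the binary still cannot be resolved.
command -v sapient >/dev/null 2>&1 || echo "sapient is still not on PATH" >&2
```

Which startup file to put this in (for example ~/.profile or ~/.zshrc) depends on your shell.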

Windows — PowerShell

irm https://sapient.openhorizon.so/install | iex

Same URL — the endpoint detects PowerShell and serves the .ps1 script automatically.

Homebrew — macOS

brew install skidgod4444/tap/sapient

Direct Binary Download

All releases ↗

Usage

30 seconds to running a model.

Inside sapient chat, use /help, /clear, and /exit.

# Browse the model catalog

sapient models

# Interactive chat with streaming output

sapient chat openhorizon/qwen2.5-0.5b

# One-shot completion

sapient run openhorizon/phi-2 --prompt "Explain transformers simply"

# Download a model to local cache

sapient pull openhorizon/phi-2

# List or remove downloaded models

sapient list
sapient rm openhorizon/phi-2

# Force Metal GPU on Apple Silicon

sapient chat openhorizon/phi-2 --backend metal

# Authenticate for gated models (Llama, Mistral)

sapient login

# Update to the latest release

sapient update
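For CI runners, the pull and run commands above can be chained into one non-interactive step. The sketch below uses only the subcommands and the --prompt flag shown in this section. The function name smoke_test is hypothetical, and it assumes sapient exits non-zero when a command fails, which this page does not state.

```shell
# Sketch of a CI smoke test: cache the model, then run one prompt.
# Assumption: sapient returns a non-zero exit status on failure.
smoke_test() {
  model="$1"
  prompt="$2"

  # Fail early with a clear message if the binary cannot be resolved.
  if ! command -v sapient >/dev/null 2>&1; then
    echo "sapient is not installed or not on PATH" >&2
    return 127
  fi

  # Download to the local cache first so the run step does not fetch.
  sapient pull "$model" || return 1

  # One-shot completion; the output goes to stdout.
  sapient run "$model" --prompt "$prompt"
}
```

Call it as: smoke_test openhorizon/qwen2.5-0.5b "Say hello"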

Models

Curated model registry.

Every openhorizon/* alias resolves to the upstream Hugging Face repo. Models are downloaded automatically on first use.

Gated models require sapient login.

Alias                        Family    Size   Access
openhorizon/phi-2            Phi       2.7B   Open
openhorizon/phi-1.5          Phi       1.3B   Open
openhorizon/qwen2.5-0.5b     Qwen2.5   0.5B   Open
openhorizon/qwen2.5-1.5b     Qwen2.5   1.5B   Open
openhorizon/qwen2.5-3b       Qwen2.5   3B     Open
openhorizon/smollm2-360m     Llama     360M   Open
openhorizon/smollm2-1.7b     Llama     1.7B   Open
openhorizon/tinyllama-1.1b   Llama     1.1B   Open
openhorizon/llama-3.2-1b     Llama     1B     Gated
openhorizon/llama-3.2-3b     Llama     3B     Gated
openhorizon/mistral-7b       Mistral   7B     Gated
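A script can use the Access column to decide whether to prompt for sapient login before pulling. This is a minimal sketch: is_gated is a hypothetical helper, and its alias list is copied by hand from the table above, so it needs updating whenever the registry changes.

```shell
# Return success (0) if the alias is marked Gated in the table above.
# The list is hard-coded from this page, not queried from sapient.
is_gated() {
  case "$1" in
    openhorizon/llama-3.2-1b|openhorizon/llama-3.2-3b|openhorizon/mistral-7b)
      return 0 ;;
    *)
      return 1 ;;
  esac
}
```

Usage: is_gated "$model" && sapient login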

Architecture

Pure Rust, zero overhead.

No Python runtime, no ONNX Runtime, no CUDA toolkit. The sapient-generate pipeline crate sits on top of eight focused crates, each with a single responsibility.

sapient-generate       Pipeline API — from_pretrained, generate, chat, stream
├── sapient-hub        HuggingFace client — parallel downloads, cache, auth
├── sapient-tokenizers HF tokenizers + Jinja2 chat templates
├── sapient-models     Forward engines — Phi & Llama (Qwen2.5, Mistral…)
│
├── sapient-runtime    InferenceSession — graph execution + telemetry
│   ├── sapient-ir     Computation graph IR + optimization passes
│   └── sapient-io     Safetensors, GGUF Q4/Q8, ONNX loaders
│
├── sapient-backends-cpu    CPU kernels — GQA, RoPE, RMSNorm, MatMul
└── sapient-backends-metal  Apple Silicon Metal/MLX backend