macOS · Linux · Windows · Pure Rust
Edge inference for small language models
SAPIENT runs LLMs locally — on your laptop, a Raspberry Pi, or a CI runner. No Python. No Docker. No GPU required.
curl -fsSL https://sapient.openhorizon.so/install | sh
Installs to ~/.local/bin. If sapient isn't found afterward, run:
export PATH="$HOME/.local/bin:$PATH"
Benchmarked on Apple M4 Pro against Ollama 0.12.6. SAPIENT loads 1.9× faster cold, ships a 3× smaller binary, and requires no background daemon — making it ideal for edge devices, embedded systems, and CI pipelines.
Installation
One command.
Any platform.
macOS & Linux
curl -fsSL https://sapient.openhorizon.so/install | sh
Installs to ~/.local/bin. Add to PATH if needed:
export PATH="$HOME/.local/bin:$PATH"
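The PATH step can be made safe to run more than once, for example from a shell profile or a CI setup script. The following is a minimal sketch for a POSIX shell. The function name ensure_on_path is hypothetical and not part of SAPIENT.

```shell
# Hypothetical helper (not shipped with SAPIENT):
# prepend a directory to PATH only if it is not already listed.
ensure_on_path() {
    dir="$1"
    case ":$PATH:" in
        *":$dir:"*) ;;              # already present, leave PATH unchanged
        *) PATH="$dir:$PATH" ;;     # prepend so the freshly installed binary is found first
    esac
    export PATH
}

ensure_on_path "$HOME/.local/bin"
```

Wrapping PATH in colons before matching avoids false positives on directories that merely share a prefix, such as ~/.local/bin-old.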
Windows — PowerShell
irm https://sapient.openhorizon.so/install | iex
Same URL: the endpoint detects PowerShell and serves the .ps1 script automatically.
Homebrew — macOS
brew install skidgod4444/tap/sapient
Direct Binary Download
All releases
Usage
30 seconds to
running a model.
Inside sapient chat, use /help, /clear, and /exit.
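For CI runners, the pull and run commands listed in this section can be combined into a single smoke test. The following is a minimal sketch, not an official script. The function name ci_smoke_test and the default prompt are hypothetical, and the sketch assumes that sapient exits with a non-zero status on failure, which this page does not state.

```shell
# Hypothetical CI helper (not shipped with SAPIENT).
# Assumption: `sapient pull` and `sapient run` exit non-zero on failure.
ci_smoke_test() {
    model="$1"
    prompt="${2:-Say hello in one sentence.}"

    # Fail early with a clear message if the CLI is not on PATH.
    if ! command -v sapient >/dev/null 2>&1; then
        echo "sapient not found; add \$HOME/.local/bin to PATH" >&2
        return 127
    fi

    # Download to the local cache first so the run step does not trigger a download.
    sapient pull "$model" || return 1

    # One-shot completion with the given prompt.
    sapient run "$model" --prompt "$prompt"
}
```

A pipeline step could then call ci_smoke_test openhorizon/qwen2.5-0.5b and treat a non-zero return value as a failed job.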
# Browse the model catalog
sapient models

# Interactive chat with streaming output
sapient chat openhorizon/qwen2.5-0.5b

# One-shot completion
sapient run openhorizon/phi-2 --prompt "Explain transformers simply"

# Download a model to local cache
sapient pull openhorizon/phi-2

# List or remove downloaded models
sapient list
sapient rm openhorizon/phi-2

# Force Metal GPU on Apple Silicon
sapient chat openhorizon/phi-2 --backend metal

# Authenticate for gated models (Llama, Mistral)
sapient login

# Update to the latest release
sapient update
Models
Curated
model registry.
Every openhorizon/* alias resolves to the upstream Hugging Face repo. Downloaded automatically on first use.
Gated models require sapient login.
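Because gated aliases need sapient login first, a script can check an alias before pulling it. The following is a minimal sketch in POSIX shell. The helper is_gated is hypothetical, and its alias list is copied by hand from the registry in this section, so it would need updating whenever the registry changes.

```shell
# Hypothetical helper (not shipped with SAPIENT): succeeds if the alias is gated.
# The three aliases are hard-coded from the registry listing, not queried from sapient.
is_gated() {
    case "$1" in
        openhorizon/llama-3.2-1b|openhorizon/llama-3.2-3b|openhorizon/mistral-7b) return 0 ;;
        *) return 1 ;;
    esac
}

model="openhorizon/llama-3.2-1b"
if is_gated "$model"; then
    echo "$model is gated: run 'sapient login' before pulling"
fi
```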
Model                        Architecture  Parameters  Access
openhorizon/phi-2            Phi           2.7B        Open
openhorizon/phi-1.5          Phi           1.3B        Open
openhorizon/qwen2.5-0.5b     Qwen2.5       0.5B        Open
openhorizon/qwen2.5-1.5b     Qwen2.5       1.5B        Open
openhorizon/qwen2.5-3b       Qwen2.5       3B          Open
openhorizon/smollm2-360m     Llama         360M        Open
openhorizon/smollm2-1.7b     Llama         1.7B        Open
openhorizon/tinyllama-1.1b   Llama         1.1B        Open
openhorizon/llama-3.2-1b     Llama         1B          Gated
openhorizon/llama-3.2-3b     Llama         3B          Gated
openhorizon/mistral-7b       Mistral       7B          Gated
Architecture
Pure Rust,
zero overhead.
No Python runtime, no ONNX Runtime, no CUDA toolkit. Eight focused crates, each with a single responsibility.
sapient-generate                 Pipeline API: from_pretrained, generate, chat, stream
├── sapient-hub                  HuggingFace client: parallel downloads, cache, auth
├── sapient-tokenizers           HF tokenizers + Jinja2 chat templates
├── sapient-models               Forward engines: Phi & Llama (Qwen2.5, Mistral…)
│   ├── sapient-runtime          InferenceSession: graph execution + telemetry
│   ├── sapient-ir               Computation graph IR + optimization passes
│   └── sapient-io               Safetensors, GGUF Q4/Q8, ONNX loaders
│
├── sapient-backends-cpu         CPU kernels: GQA, RoPE, RMSNorm, MatMul
└── sapient-backends-metal       Apple Silicon Metal/MLX backend