New in v0.6.0: mobile SDKs for Swift, Kotlin & React Native, on-device

Make your hardware think, see, listen and speak.

The entire AI pipeline — language, vision and voice — running on your own silicon, from a MacBook to the Pi inside a robot. Pure Rust, one 14 MB download, nothing ever leaves the device.

Install Sapient · View on GitHub

$ curl -fsSL https://sapient.openhorizon.so/install | sh

macOS · Linux · Windows | No Python · No Docker · No CUDA

Runs the models you already use

Chat: Llama 3.2 (1B–3B) · Qwen2.5 (0.5B–3B) · Gemma 3 (1B–4B) · Phi-2 (2.7B) · Mistral (7B) · SmolLM2 (135M–1.7B) · TinyLlama (1.1B) · DeepSeek-R1 (8B)

Medical: MedGemma (4B)
Vision: SmolVLM (256M) · Gemma 3 (4B)
Speech-to-text: Whisper (tiny–small)
Text-to-speech: Kokoro (82M) · Orpheus (3B)

//What it is[1/11]

One pure-Rust engine. A CLI for your terminal, a crate for your codebase, an SDK for your phone.

For your terminal

The Sapient CLI & server.

Streaming chat with live Markdown, Whisper transcription, Kokoro speech, a live voice mode — and a drop-in OpenAI-compatible /v1 server. One line to install, one line to run.

Chat · See · Transcribe · Speak · Converse — No daemon · GPL-3.0

Install in 30 seconds

For your codebase

The Rust crate.

Embed inference directly in your app with sapient-generate — the same engine, as a git dependency.

Pipeline API: from_pretrained · generate · stream

Chat templates: HF tokenizers + Jinja2

Curated registry: 37 models · chat, vision, STT & TTS

0 external runtime dependencies

Read the quickstart

//See it work[2/11]

Real sessions, real outputs.

Every line in this terminal is a genuine, unedited output from our benchmark runs — a medical radiograph read by MedGemma, Gemma 3 answering on a bare CPU, a spoken conversation you can interrupt. Nothing staged, nothing cloud.

Reproduce any of it: docs/BENCHMARKS.md


//Engine catalog[3/11]

All the engines to run a model anywhere.

Pluggable compute backends paired with hand-written forward engines for each model architecture — the same alias runs on a CPU, an Apple GPU, or a discrete card with the same one-line command.

fig. 01: one pipeline, every backend

01 CPU · 02 Apple Metal · 03 wgpu · 04 Phi engine · 05 Llama engine · 06 Whisper engine · 07 Vision engines · 08 Speech engines

01 · Runs everywhere

Stable

CPU

Within ~1.3–1.6× of llama.cpp decode after the v0.5.1 int8 kernel ladder — row-interleaved weight repacking, SDOT/SMMLA dot products, Flash-Edge attention. GGUF K-quants load via mmap; 1B-class chat is interactive on a Raspberry Pi 5.

02 · Apple Silicon GPU

Stable

Apple Metal

MlxForwardEngine runs the whole forward pass as one MLX lazy graph — every activation stays on the GPU, one eval() per token. Up to 9.4× faster decode than CPU, with a 21 ms time-to-first-token on 0.5B.

03 · Cross-platform GPU

Stable

wgpu

Portable WGSL compute shaders over Vulkan, DX12 & Metal — Intel Arc, AMD Radeon and Nvidia. Q4_K/Q6_K/Q8_0 weights stay quantized on the GPU and dequantize in-shader, so VRAM ≈ the GGUF file size — verified on a Jetson with zero CUDA.

04 · Forward engine

LayerNorm + partial RoPE

Phi engine

Powers Phi-1 · Phi-1.5 · Phi-2

05 · Forward engine

RMSNorm + RoPE + grouped-query attention

Llama engine

Powers Llama 3.2 · Qwen2.5 · SmolLM2 · TinyLlama · Mistral

06 · Speech-to-text

Pure-Rust log-mel front-end + encoder/decoder

Whisper engine

Powers whisper-tiny · whisper-base · whisper-small — sapient transcribe

07 · Vision-language

SigLIP towers + SmolLM2 / Gemma3 backbones

Vision engines

Powers smolvlm-256m · gemma-3-4b · medgemma-4b (medical imaging) — sapient see

08 · Text-to-speech

StyleTTS2 + ISTFTNet · Llama → SNAC codec

Speech engines

Powers kokoro-82m (~2× real-time on CPU, 54 voices) · orpheus-3b — sapient speak
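The wgpu card's "VRAM ≈ the GGUF file size" claim follows from the weights staying block-quantized on the GPU. As a rough illustration, the sketch below estimates the storage cost of a weight tensor in each of the three formats named there. The block layouts (weights per block, bytes per block) are the ggml/GGUF ones as commonly documented. The 1B-parameter model is a hypothetical example, and real GGUF files mix formats across tensors, so treat the output as an order-of-magnitude estimate and not a statement about Sapient's memory use.

```rust
/// Bytes needed to store `n_weights` in a block-quantized format where
/// every `block_weights` weights occupy `block_bytes` bytes.
fn quantized_bytes(n_weights: u64, block_weights: u64, block_bytes: u64) -> u64 {
    // Round up: a partial block still occupies a whole block.
    let blocks = (n_weights + block_weights - 1) / block_weights;
    blocks * block_bytes
}

/// Average storage cost per weight, in bits, including per-block scales.
fn bits_per_weight(block_weights: u64, block_bytes: u64) -> f64 {
    (block_bytes * 8) as f64 / block_weights as f64
}

fn main() {
    // (name, weights per block, bytes per block)
    let formats = [("Q4_K", 256, 144), ("Q6_K", 256, 210), ("Q8_0", 32, 34)];
    // Hypothetical model size, chosen only for illustration.
    let n_weights: u64 = 1_000_000_000;

    for (name, block_weights, block_bytes) in formats {
        let bytes = quantized_bytes(n_weights, block_weights, block_bytes);
        let gib = bytes as f64 / (1u64 << 30) as f64;
        println!(
            "{name}: {:.2} bits/weight, ~{:.2} GiB",
            bits_per_weight(block_weights, block_bytes),
            gib
        );
    }
}
```

Because dequantization happens in the shader, no f16 or f32 copy of the weights needs to be resident, which is why the on-GPU footprint can track the file size.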

//Benchmarks[4/11]

We don't think benchmarks tell the full story.

But we run them head-to-head anyway — against llama.cpp, Ollama and mlx-lm on the same hardware, model and quantization. Sapient runs within 10–20% of llama.cpp on Apple Metal and beats Ollama by 1.5× on Llama-1B — with the lowest time-to-first-token of the three.

Metal: 90.6 tok/s on Llama-1B — lowest TTFT (52 ms warm), 1.5× Ollama
CPU within ~1.3–1.6× of llama.cpp (was 1.8–3.8× at v0.5.0)
4.2× faster server TTFT than Ollama (14 ms vs 59 ms)
Raspberry Pi 5: Llama-1B at 11.6 tok/s — 8.9× across two releases

Apple M4 · 4-bit GGUF · Sapient v0.5.3 · same file, same session vs llama.cpp b9860 / Ollama 0.12.6 · 2026-07-09

llama-3.2-1b-q4

Engine      Backend    Decode throughput   TTFT     Peak RAM
llama.cpp   Metal      111.3 tok/s         -        -
SAPIENT     Metal      90.6 tok/s          52 ms    -          (lowest TTFT)
llama.cpp   CPU · 4t   83.1 tok/s          -        -
Ollama      Metal      60.4 tok/s          152 ms   -
SAPIENT     CPU        56.7 tok/s          366 ms   1.19 GB

4-bit · Apple M4 · v0.5.3 head-to-head (2026-07-09), same GGUF file, same session vs llama.cpp b9860 — SAPIENT keeps the lowest TTFT and is 1.5× Ollama (whose default 1b tag ships Q8_0).
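The headline ratios can be rechecked from the published numbers. The sketch below only does the arithmetic on the figures quoted in this section and measures nothing itself. The function names are illustrative.

```rust
/// How many times larger `a` is than `b`.
fn speedup(a: f64, b: f64) -> f64 {
    a / b
}

/// Percentage by which `ours` trails `reference` throughput.
fn percent_behind(ours: f64, reference: f64) -> f64 {
    (1.0 - ours / reference) * 100.0
}

fn main() {
    // Decode throughput in tok/s, copied from the llama-3.2-1b-q4 results.
    let (llama_metal, sapient_metal, llama_cpu, ollama_metal, sapient_cpu) =
        (111.3, 90.6, 83.1, 60.4, 56.7);

    println!(
        "Sapient Metal vs Ollama Metal: {:.2}x",
        speedup(sapient_metal, ollama_metal)
    );
    println!(
        "Sapient Metal behind llama.cpp Metal: {:.1}%",
        percent_behind(sapient_metal, llama_metal)
    );
    println!(
        "llama.cpp CPU vs Sapient CPU: {:.2}x",
        speedup(llama_cpu, sapient_cpu)
    );
    // Server time-to-first-token in ms (lower is better), from the summary list.
    println!("Server TTFT, Ollama vs Sapient: {:.1}x", speedup(59.0, 14.0));
}
```

On these figures Sapient on Metal is about 1.5× Ollama and roughly 19% behind llama.cpp, the CPU gap is about 1.47×, and the server TTFT ratio is about 4.2×, all consistent with the claims above.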

Backend matrix: CPU · Metal (fastest) · wgpu

Runs without a GPU

Apple Silicon acceleration

Intel Arc · AMD · Nvidia

GGUF Q4/Q8/K-quants on-device

Compact KV cache (Q8 / f16)

Whisper speech-to-text

Vision-language (sapient see): Partial

Speculative decoding

On-device iOS & Android

Decode, Qwen2.5-0.5B (M4): CPU 57 tok/s · Metal 187 tok/s · wgpu -

Vision-language on Metal is partial: SmolVLM's text backbone runs on the GPU, while the SigLIP vision tower stays on CPU — and the Gemma 3 / MedGemma path is CPU-only today.

//Built for[5/11]

Where local wins.

Four markets where the cloud can't follow — every command below runs today.

Robotics

A voice in every robot

The streaming loop runs on the Pi inside the chassis: it hears while you speak, answers out loud, and you can interrupt it mid-sentence. No uplink.

$ sapient converse llama-3.2-1b --speak

Healthcare

Medical imaging, air-gapped

MedGemma reads radiographs on a laptop that never touches a network — the entire scan-to-narrative path stays inside the clinic.

$ sapient see xray.png --model medgemma-4b

Product teams

A drop-in local backend

An OpenAI-compatible /v1 server with multi-model residency. Point your existing SDK at localhost and ship the private version of your feature.

$ sapient serve --port 8080

Field & industry

Edge boxes that just run

Thermal-aware sustained decode on passive hardware, models bigger than RAM via mmap, one static binary per architecture. Built for the cabinet, not the rack.

$ sapient chat qwen2.5-1.5b

//How it works[6/11]

How it works.

Four commands, one binary. Install, pull, run, update — 30 seconds to a model streaming in your terminal.

01 / Install

One command, any platform.

The endpoint detects your shell and serves the right script.

$ curl -fsSL https://sapient.openhorizon.so/install | sh

02 / Pull

Grab a curated model.

Curated aliases resolve to upstream Hugging Face repos.

$ sapient pull qwen2.5-0.5b

03 / Run

Chat, speak, or serve.

Streaming chat, Whisper transcription, Kokoro speech — or an OpenAI-compatible /v1 server.

$ sapient chat qwen2.5-0.5b

04 / Update

Stay current in place.

The CLI updates itself whenever a new release ships.

$ sapient update

//Ways to run[7/11]

A CLI, a server, a library, and mobile SDKs. One engine underneath.

CLI

Command line

Streaming chat with live Markdown rendering — plus on-device voice (a streaming converse you can interrupt mid-sentence) and on-device vision: ask questions about any image, including medical imaging with MedGemma.

sapient chat phi-2

sapient see photo.jpg -p "What's here?"

sapient converse qwen2.5-1.5b --speak

Server

OpenAI-compatible server

A drop-in /v1 server — chat (text and images, as base64 data URIs), completions and audio transcriptions — with lazy loading, multi-model LRU residency and speculative decoding. Point the OpenAI SDK or LangChain at localhost.

sapient serve --port 8080

curl localhost:8080/v1/chat/completions \
  -d '{"model":"smolvlm-256m",
       "messages":[{"role":"user","content":[
        {"type":"text","text":"What is this?"},
        {"type":"image_url","image_url":{
         "url":"data:image/png;base64,…"}}]}]}'
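The request above carries the image inline as a base64 data URI. As a minimal sketch of how a client might build that body without external crates, the code below hand-rolls standard base64 (RFC 4648 alphabet with padding) and assembles the same JSON shape as the curl example. The helper names `base64_encode`, `json_escape`, `data_uri` and `chat_body` are illustrative and not part of any Sapient API, and the eight bytes used as the "image" are just the PNG file signature standing in for real file contents.

```rust
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Standard base64 with '=' padding.
fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling the tail.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(ALPHABET[(n >> 18) as usize & 63] as char);
        out.push(ALPHABET[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 { ALPHABET[(n >> 6) as usize & 63] as char } else { '=' });
        out.push(if chunk.len() > 2 { ALPHABET[n as usize & 63] as char } else { '=' });
    }
    out
}

/// Escape a string for use inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

fn data_uri(mime: &str, bytes: &[u8]) -> String {
    format!("data:{};base64,{}", mime, base64_encode(bytes))
}

/// Same JSON shape as the curl example: one user message with text + image parts.
fn chat_body(model: &str, prompt: &str, image_uri: &str) -> String {
    format!(
        r#"{{"model":"{}","messages":[{{"role":"user","content":[{{"type":"text","text":"{}"}},{{"type":"image_url","image_url":{{"url":"{}"}}}}]}}]}}"#,
        json_escape(model),
        json_escape(prompt),
        image_uri
    )
}

fn main() {
    // Placeholder bytes (the PNG signature); a real client would use std::fs::read.
    let image = [0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A];
    let body = chat_body("smolvlm-256m", "What is this?", &data_uri("image/png", &image));
    println!("{body}");
}
```

The resulting string can be sent as the POST body to /v1/chat/completions with whichever HTTP client the application already uses.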

Crate

Rust library

Embed inference directly with the sapient-generate crate, pulled straight from GitHub — streaming, chat templates and custom sampling.

use sapient_generate::Pipeline;

// Run inside an async function that returns a Result,
// since both calls are awaited and use `?`.
let p = Pipeline::from_pretrained("phi-2").await?;
println!("{}", p.generate("The key to good software is").await?);

iOS · Android · RN

Mobile SDKs

The same engine inside your app — Swift, Kotlin and React Native, generated from one Rust FFI crate. GPU by default (Metal on iOS, Vulkan on Android) and the engine sheds decode threads as the phone heats up.

// Swift — add the package by URL, then:
let session = try LlmSession.load(
  model: "qwen2.5-0.5b",
  options: GenerationOptions(maxTokens: 256))
let reply = try session.chat(
  userMessage: "Hi!")
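The idea that the engine sheds decode threads as the phone heats up can be pictured as a small policy function. The sketch below is purely hypothetical. The `ThermalState` levels borrow the names iOS uses for its thermal states, and the scaling factors are invented for illustration. It does not describe how Sapient actually governs threads.

```rust
/// Hypothetical device thermal levels, coolest to hottest.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ThermalState {
    Nominal,
    Fair,
    Serious,
    Critical,
}

/// Pick a decode thread count for the current thermal state.
/// The fractions are illustrative assumptions, not measured values.
fn decode_threads(max_threads: usize, state: ThermalState) -> usize {
    let scaled = match state {
        ThermalState::Nominal => max_threads,
        ThermalState::Fair => max_threads * 3 / 4,
        ThermalState::Serious => max_threads / 2,
        ThermalState::Critical => 1,
    };
    // Always keep at least one thread so decoding can continue.
    scaled.clamp(1, max_threads.max(1))
}

fn main() {
    let states = [
        ThermalState::Nominal,
        ThermalState::Fair,
        ThermalState::Serious,
        ThermalState::Critical,
    ];
    for state in states {
        println!("{:?}: {} decode threads", state, decode_threads(8, state));
    }
}
```

A policy like this trades peak tokens per second for sustained throughput: fewer active cores produce less heat, so the device is less likely to hit a hard throttle mid-response.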

//Install[8/11]

One command.
Any platform.

On x86_64 the installer detects your graphics card and pulls the -gpu build (Intel / AMD / Nvidia via wgpu) when one is present. Force a choice with SAPIENT_VARIANT=cpu|gpu. Prefer a tarball? Grab a static binary below.

$ curl -fsSL https://sapient.openhorizon.so/install | sh

Installs to ~/.local/bin. If sapient isn't found afterward, run: export PATH="$HOME/.local/bin:$PATH"

Direct binary download

macOS, Apple Silicon: sapient-aarch64-apple-darwin.tar.gz

macOS, Apple Silicon + Metal GPU: sapient-aarch64-apple-darwin-metal.tar.gz

macOS, Intel: sapient-x86_64-apple-darwin.tar.gz

macOS, Intel GPU (wgpu→Metal): sapient-x86_64-apple-darwin-gpu.tar.gz

Linux, x86_64: sapient-x86_64-unknown-linux-gnu.tar.gz

Linux, x86_64 GPU (Vulkan): sapient-x86_64-unknown-linux-gnu-gpu.tar.gz

Linux, ARM64 (Pi 4/5, cloud ARM): sapient-aarch64-unknown-linux-gnu.tar.gz

Windows, x86_64: sapient-x86_64-pc-windows-msvc.zip

Windows, x86_64 GPU (DX12): sapient-x86_64-pc-windows-msvc-gpu.zip

Windows, ARM64: sapient-aarch64-pc-windows-msvc.zip

//Model registry[9/11]

A curated model registry.

37 curated aliases, grouped into text generation, vision, speech-to-text and text-to-speech. Every alias resolves to an upstream Hugging Face repo and downloads on first use — Safetensors (F16/BF16/F32) or -q4 GGUF.

Gated models require sapient login.

$ sapient models

Alias             Tag             Engine      Size    Access
phi-2             Default         Phi         2.7B    Open
phi-1.5                           Phi         1.3B    Open
phi-1                             Phi         1.3B    Open
qwen2.5-0.5b      Smallest        Qwen2.5     0.5B    Open
qwen2.5-1.5b                      Qwen2.5     1.5B    Open
qwen2.5-3b                        Qwen2.5     3B      Open
smollm2-360m                      Llama       360M    Open
smollm2-1.7b                      Llama       1.7B    Open
tinyllama-1.1b                    Llama       1.1B    Open
llama-3.2-1b                      Llama       1B      Open
llama-3.2-3b                      Llama       3B      Gated
mistral-7b                        Mistral     7B      Gated
mixtral-8x7b-q4   MoE · 32 GB+    Mixtral     47B     Open
glm-4.5-air-q4    MoE · 96 GB+    GLM4-MoE    106B    Open
whisper-tiny      STT             Whisper     39M     Open
whisper-base      STT             Whisper     74M     Open
whisper-small     STT             Whisper     244M    Open
kokoro-82m        TTS             StyleTTS2   82M     Open
orpheus-3b        TTS             Orpheus     3B      Open
smolvlm-256m      Vision          SmolVLM     256M    Open
gemma-3-1b                        Gemma3      1B      Open
gemma-3-4b        Vision          Gemma3      4B      Open
medgemma-4b       Medical         Gemma3      4B      Gated

//Roadmap[10/11]

One PR per phase. Ship gradually, never a big bang.

01 / Shipped

Mobile SDKs — Swift, Kotlin & React Native running the engine on-device, GPU by default, engine-level thermal governance (v0.6.0)
Sparse MoE — Mixtral 47B & GLM-4.5-Air 106B on a Jetson Thor, pure Rust, zero CUDA (v0.5.3)
Vision over HTTP — OpenAI-compatible image inputs in /v1/chat/completions (v0.5.3)
Vision — sapient see: SmolVLM, Gemma3, and MedGemma medical imaging, fully offline (v0.5.2)
Streaming voice loop — STT while you talk, replies you can interrupt; ~2.4 s perceived latency (v0.5.2)
Fully-quantized wgpu — VRAM ≈ GGUF file size on any GPU vendor (v0.5.0)
CPU int8 kernel ladder — llama.cpp CPU decode gap cut from 1.8–3.8× to ~1.3–1.6× (v0.5.1)

02 / In progress

Sub-second voice replies (Kokoro decoder acceleration — RTF already 0.66 → 0.48)
Server-ARM decode kernels (KleidiAI-class parity on Graviton/Grace/Thor)
Lower peak RAM on Metal (MLX-Q4 embeddings)
Jetson / Orin Vulkan release build
Intel Arc & AMD benchmark datapoints

03 / Planned

Python bindings over the same UniFFI layer
Continuous batching & paged KV in the server
Faster vision towers (blocked W8A8 GEMM) — MedGemma first token < 15 s

//Get started[11/11]

Run a model on your machine in under a minute.

$ curl -fsSL https://sapient.openhorizon.so/install | sh

Get started · Star on GitHub

macOS · Linux · Windows — no Python, no Docker, no CUDA
