Alloy · A Prysm project

The fastest inference engine for Apple Silicon.

Run any model on your Mac. Faster than llama.cpp and MLX.

View on GitHub

Benchmarks

Model
Device
M4 Max

Qwen3 0.6B · MLX 4-bit

tg128 decode · pp4096 prefill

Generation tok/s

Alloy
709.6 ± 17.2
MLX
395.8 ± 3.1

Prompt Processing tok/s

Alloy
8,824.6 ± 5.8
MLX
7,541.3 ± 57.6

Built For

App developers

Ship private AI in your app. Alloy serves any model behind an OpenAI-compatible API.

ML researchers

Write, train and fine-tune models on your Mac with Alloy's PyTorch backend.

Performance engineers

Alloy provides a Triton-like DSL so you can write and run Metal kernels from Python.

Get Started

From install to a running model in under a minute.

01Install
uv add 'alloy-kit[serve]'

macOS 13+, Python 3.10+

02Serve
alloy serve -m qwen3:0.6b

Compiles any GGUF or MLX model.

03Run
http://127.0.0.1:11434

Point any OpenAI client at localhost. Done.

Built-in Features

Speculative decoding
Constrained decoding
Tool calling
KV-cache quantization
Vision & audio input
Embeddings

The Prysm Stack

Edge deploymentPrysmCompile any ONNX. Ship to Hailo-8, Jetson Orin, or Kria K26.
On-device · Apple SiliconAlloyRun and train any model on Apple Silicon. You're here.

Get Started

Free and open source.

View on GitHub