Alloy · A Prysm project

The fastest inference engine for Apple Silicon.

Run any model on your Mac. Faster than llama.cpp and MLX.

View on GitHub

Benchmarks

Model

Device

M4 Max

Qwen3 0.6B · MLX 4-bit

tg128 decode · pp4096 prefill

Generation tok/s

Alloy

709.6 ± 17.2

MLX

395.8 ± 3.1

Prompt Processing tok/s

Alloy

8,824.6 ± 5.8

MLX

7,541.3 ± 57.6

Built For

App developers

Ship private AI in your app. Alloy serves any model behind an OpenAI-compatible API.

ML researchers

Write, train and fine-tune models on your Mac with Alloy's PyTorch backend.

Performance engineers

Alloy provides a Triton-like DSL so you can write and run Metal kernels from Python.

Get Started

From install to a running model in under a minute.

01Install

uv add 'alloy-kit[serve]'

macOS 13+, Python 3.10+

02Serve

alloy serve -m qwen3:0.6b

Compiles any GGUF or MLX model.

03Run

http://127.0.0.1:11434

Point any OpenAI client at localhost. Done.

Built-in Features

Speculative decoding

Constrained decoding

Tool calling

KV-cache quantization

Vision & audio input

Embeddings

The Prysm Stack

Edge deploymentPrysmCompile any ONNX. Ship to Hailo-8, Jetson Orin, or Kria K26.

On-device · Apple SiliconAlloyRun and train any model on Apple Silicon. You're here.

Get Started

Free and open source.

View on GitHub