Apple Silicon Benchmark · March 2026

The M5 Max
AI Benchmark

20+ open-source models tested on real hardware. Every number measured, not estimated.

0tok/s
Peak Speed
0+
Models Tested
0GB/s
Bandwidth
$0
API Cost
M5 Max 40-Core GPU 128 GB Unified Memory MLX Framework 4-Bit Quantization

Independent benchmarks · No vendor sponsorship

NEW: Quality evals added. ARC-Challenge, GSM8K, IFEval — per-model, on-device. Open Model Leaderboard →

The M5 Max has 614 GB/s of memory bandwidth and up to 128 GB of unified memory accessible by the GPU at full speed. LLM decode is memory-bandwidth-bound — not compute-bound — so this single number predicts throughput better than TFLOPS. The formula: tok/s ≈ 614 / model_size_GB. This page has measured numbers for 20+ models that confirm it within ~20-30%.

All data comes from real hardware (M5 Max 40-core, 128 GB), real framework versions (mlx 0.31.1, mlx-lm 0.31.1), and isolated subprocess runs. No Ollama, no wrappers — direct mlx_lm.generate calls. Below: the full benchmark tables, interactive comparison tools, a RAM calculator, VLM results, MLX vs GGUF analysis, cloud API cost comparison, and reproduction instructions.

# Quick start: install and run your first model pip install mlx-lm # Generate text (downloads model on first run) mlx_lm.generate --model mlx-community/gemma-3-4b-it-4bit \ --prompt "Explain the attention mechanism in 3 sentences" \ --max-tokens 256 # Benchmark a model (rough one-liner) python -c " from mlx_lm import load, generate model, tok = load('mlx-community/Qwen3-30B-A3B-4bit') generate(model, tok, prompt='Hello', max_tokens=200, verbose=True) "

Hardware: Apple M5 Max

TSMC 3nm. 614 GB/s bandwidth on the 40-core SKU. Bandwidth determines tok/s, not TFLOPS.

LLM decode reads the entire weight matrix once per token. On a memory-bandwidth-bound workload, compute is not the bottleneck — the memory bus is. The 40-core M5 Max has 614 GB/s; the 32-core variant has 460 GB/s. Same die, same RAM options, but ~25% less bandwidth. GPU core count determines bandwidth, not RAM amount. A 64 GB / 40-core Mac is faster than a 128 GB / 32-core Mac for inference. The formula tok/s ≈ bandwidth / model_size_GB holds within 20-30% across all models we tested, with the delta attributable to KV cache overhead, attention compute, and framework efficiency.

M5 Max Chip Specifications

Spec M5 Max (32-core GPU) M5 Max (40-core GPU)
CPU Cores18 (6 Super + 12 Performance)18 (6 Super + 12 Performance)
GPU Cores3240
Neural Engine16-core16-core
GPU Neural AcceleratorsYes (new in M5)Yes (new in M5)
Memory Bandwidth460 GB/s614 GB/s
Max Unified Memory128 GB128 GB
ProcessTSMC 3nm (3rd gen)TSMC 3nm (3rd gen)
The number that matters: Memory bandwidth is set by GPU core count, not RAM size. 32-core = 460 GB/s. 40-core = 614 GB/s. Same RAM, ~25% different tok/s.

Unified memory means no VRAM wall. An RTX 4090 has 24 GB VRAM; an RTX 5090 has 32 GB. Exceed that and you spill to PCIe system RAM at a fraction of the bandwidth. The M5 Max GPU sees all 128 GB at full 614 GB/s. No copies, no offloading. This is why a 37 GB model like Llama 3.3 70B Q4 runs fine on a laptop — something impossible on any consumer NVIDIA GPU without severe throughput penalties.

Config Matrix

Configuration Unified Memory Bandwidth Max Model Size
M5 Max 32-core GPU36 GB460 GB/s~14B dense Q4 comfortably
M5 Max 32-core GPU64 GB460 GB/s~70B Q4 (slow: ~10 tok/s)
M5 Max 40-core GPU64 GB614 GB/s~70B Q4 at full speed
M5 Max 40-core GPU128 GB614 GB/s235B MoE, 70B Q8, multi-model
All benchmarks below: 40-core GPU, 128 GB config.

Model Comparison Tool

Pick two models. Compare tok/s, RAM, and efficiency side by side.

Head to Head

Speed (tok/s)

RAM Usage (GB)

Efficiency (tok/s per GB)

RAM Calculator

Which models fit? 8 GB reserved for macOS. Remainder available for model weights + KV cache.

RAM Allocation

Compatible (0)

Too Large (0)

Best Pick at 128 GB

Text Generation Benchmarks

All Q4, MLX, M5 Max 40-core, 128 GB. 3-5 subprocess-isolated runs, averaged. Click headers to sort.

Dense models read all parameters per token. A 70B Q4 model is ~37 GB, so you get 614 / 37 ≈ 16.6 theoretical tok/s (measured: 12.6, delta from KV cache + compute). MoE models only read active expert weights per token. Qwen 3 30B-A3B activates 3B of 30B params — it decodes at 127 tok/s despite needing 16 GB resident. 70B models are slow but smart; 4B models are fast but dumb. MoE splits the difference. At Q4, memory footprint follows ~0.5 GB per billion parameters for dense models.

Text Generation Models

Model ▲▼ Params ▲▼ Type ▲▼ tok/s ▲▼ Speed TTFT ▲▼ Memory ▲▼ Tier
Table 1: Text generation benchmarks on M5 Max 40-core GPU, 128 GB, MLX 4-bit quantization. Click column headers to sort.

Vision Language Models

ModelParamstok/sSpeedTTFTMemoryTier
Table 2: Vision Language Model benchmarks on M5 Max. See the VLM section for HuggingFace model IDs.
MoE note: Qwen 3 30B-A3B activates 3B of 30B params per token via sparse expert routing. Result: 127.4 tok/s decode at 16.1 GB resident. Effective compute cost of a 3B model, knowledge capacity of a 30B model.

Throughput Formula

Decode is memory-bandwidth-bound, not compute-bound:

tok/s ≈ 614 GB/s ÷ Model Size (GB)

Predicts within 20-30%. Gap = KV cache reads + attention compute + framework overhead.

Quick Reference

  • TTFT < 200ms for all models under 15B params
  • 70B models: TTFT ~730ms, still interactive
  • Q4 memory: ~0.5 * params_B GB
  • MoE breaks the speed-vs-size curve

Vision Language Models (VLMs)

Image + text input. Local OCR, document analysis, screenshot parsing. No data leaves your machine.

VLMs process image inputs alongside text prompts. Use cases: local OCR, chart/diagram parsing, screenshot-to-code pipelines, document classification. All processing stays on-device — no image data hits an API endpoint. The text decode speed is identical to the text-only variants for the same architecture (the vision encoder adds latency to prefill, not decode). HuggingFace model IDs: mlx-community/gemma-3-4b-it-4bit (VLM-capable), mlx-community/Qwen3-VL-8B-4bit, mlx-community/Qwen3-VL-32B-4bit.

178.7
tok/s
Gemma 3 4B VLM
Fastest — 2.4 GB RSS
110.7
tok/s
Qwen3-VL 8B
Best quality/speed — 4.4 GB
27.3
tok/s
Qwen3-VL 32B
Highest accuracy — 17.3 GB

For bulk document processing and classification, Gemma 3 4B VLM at 179 tok/s is faster than any cloud vision API round-trip. For tasks requiring strong spatial reasoning or fine-grained image understanding, Qwen3-VL 32B at 27.3 tok/s fits in 64 GB and delivers the best accuracy we measured.

128 GB Memory Map

Each cell = 1 GB of unified memory. Click a category to highlight.

128 GB Unified Memory Pool

1 cell = 1 GB

Speed Tiers

Models bucketed by decode throughput. >100 tok/s = faster than you can read. >30 tok/s = real-time chat. <25 tok/s = noticeable latency.

Efficiency: tok/s per GB

Speed normalized by memory footprint. Higher = more throughput per GB of RAM consumed.

Rank Model Type tok/s Memory tok/s per GB Quality Agentic Efficiency
Table 3: Efficiency = tok/s / memory_GB. Higher is better. Small dense models dominate; MoE penalized by total weight residency.

Quality Evaluation (lm-eval-harness)

ARC-Challenge, GSM8K, IFEval via lm-evaluation-harness. 4-bit quantization, greedy decoding.

Scores below are from lm-eval with our custom task configs (0-shot ARC with regex extraction, 8-shot GSM8K CoT multiturn, IFEval strict). All runs on MLX with greedy decoding (temp=0). Composite = mean of available scores.

Loading quality evaluation data...

Agentic Eval (terminal-bench)

14 tasks via terminal-bench-core v0.1.1 with terminus-thinking agent. Docker sandbox per task.

Each model runs an agentic loop: receive task description, execute shell commands in a Docker container, iterate until tests pass or timeout. The agent uses structured JSON output (CommandBatchResponse schema). Parse errors = model can't produce valid JSON. Run with: python3 scripts/run_agentic_bench_batch.py --task <name>

Loading agentic benchmark data...

View detailed failure analysis in the Eval Dashboard →

RAM Tier Picks

Q4 memory: ~params_B * 0.5 GB. Reserve ~20% for OS + KV cache + apps.

Pick the model that fits your RAM with headroom to spare. Running at 95% memory utilization works but leaves no room for KV cache growth on long contexts or running anything else. These picks assume you want to keep the model loaded while using other apps.

32 GB

~24 GB usable

Covers all models up to 27B dense. MoE models fit too.

  • Qwen 3 30B-A3B127.4 tok/s · 16.1 GB
  • Gemma 3 27B30.9 tok/s · 15.2 GB
  • Phi-4 14B62.0 tok/s · 7.8 GB
  • Gemma 3 4B178.7 tok/s · 2.4 GB
Best pick: Qwen 3 30B-A3B — 30B quality at 127 tok/s.

128 GB

Everything fits

Multi-model setups. 70B at Q8 for better quality. Frontier 235B MoE.

  • Qwen 3 30B-A3B127.4 tok/s · 16.1 GB
  • Llama 3.3 70B12.6 tok/s · 37.1 GB
  • Devstral 24B39.3 tok/s · 12.6 GB
  • DeepSeek R1 32B24.9 tok/s · 17.3 GB
Qwen 3 235B-A22B fits at ~118 GB Q4. Frontier MoE on a laptop.

MLX vs GGUF

MLX: Apple-native, Metal 4, unified memory optimized. GGUF: cross-platform, powers Ollama/LM Studio.

On M5 Max, MLX delivers ~20-30% higher decode tok/s than llama.cpp (GGUF backend). MLX is built for Metal 4 and Apple's unified memory — no CPU-to-GPU copies, native mx.array operations. As of March 2026, Ollama 0.18.2 has Metal 4 shader compilation bugs on M5 Max. If you need Ollama, pin to an older version or wait for a fix. GGUF wins on model variety (tens of thousands of quants on HuggingFace) and cross-platform support (Linux/Windows CUDA).

# MLX: install and run pip install mlx-lm mlx_lm.generate --model mlx-community/Qwen3-30B-A3B-4bit \ --prompt "Write a Python async HTTP server" --max-tokens 512 # GGUF via llama.cpp (if you need it) brew install llama.cpp llama-cli -m Qwen3-30B-A3B-Q4_K_M.gguf \ -p "Write a Python async HTTP server" -n 512 # MLX server (OpenAI-compatible API) mlx_lm.server --model mlx-community/Qwen3-30B-A3B-4bit --port 8080 # Then: curl localhost:8080/v1/chat/completions ...

MLX

  • Native Metal 4 GPU + unified memory integration
  • pip install mlx-lm — Python-first, fine-tuning support
  • ~20-30% faster decode vs llama.cpp on Apple Silicon
  • Faster prefill via zero-copy memory access
  • Recommended on M5 Max as of March 2026
  • Models: mlx-community/* on HuggingFace

GGUF / llama.cpp

  • Cross-platform: CPU, CUDA, Metal, Vulkan, ROCm
  • Largest model ecosystem on HuggingFace
  • IQ quants, mixed quantization, k-quants
  • Powers Ollama and LM Studio GUIs
  • Bug: Ollama 0.18.2 Metal 4 shader issues on M5
  • Better choice for Linux/CUDA dual-boot setups
TL;DR: Use mlx-lm on M5 Max for maximum tok/s. Use GGUF if you need cross-platform compat or a specific quant variant not yet in mlx-community.

Cloud API Pricing Reference

March 2026 pricing. Elo ratings from LM Arena. Local models are free after hardware cost.

Cloud APIs still win on absolute quality (Elo 1490-1510 vs ~1420-1450 for the best open models). For most dev tasks — code gen, summarization, data extraction, chat — a local 30B-70B model is good enough. The real question is cost at your usage volume. See the break-even table.

Model Provider Input $/M tok Output $/M tok Context Arena Elo Vision
Claude Opus 4.6Anthropic$5.00$25.001M~1505Yes
Gemini 3.1 ProGoogle$2.00$12.001M~1503Yes
GPT-5.2OpenAI$1.75$14.00400K~1490Yes
Claude Sonnet 4.6Anthropic$3.00$15.00200K~1480Yes
GPT-5.2 ProOpenAI$21.00$168.00400K~1510Yes
Gemini 2.5 FlashGoogleFreeFree1M~1450Yes
DeepSeek V3.2 APIDeepSeek$0.14$0.28164K~1421No
DeepSeek R1 APIDeepSeek$0.55$2.19164K~1430No
GPT-4oOpenAI$2.50$10.00128K~1460Yes
Claude Haiku 4.5Anthropic$1.00$5.00200K~1420Yes
Table 4: Cloud API pricing, March 2026. Elo from lmarena.ai.

Local Wins When

  • Data cannot leave the machine (HIPAA, source code, etc.)
  • TTFT matters: 100-200ms local vs 500ms-2s cloud
  • High volume: $0 marginal cost after hardware amortizes
  • Offline / air-gapped / airplane mode
  • No rate limits, no API key management

Cloud Wins When

  • You need Elo 1490+ quality (frontier reasoning)
  • Low volume (<100K tok/day): cheaper than hardware
  • 1M+ context windows not feasible locally
  • Always-latest models without manual updates
  • No upfront capex

Cost Break-Even

M5 Max 128 GB = ~$4,999. Blended Sonnet-tier API cost = ~$9/M tokens. Electricity: ~$5-10/mo under load.

Simple math: $4,999 / ($9/M_tok * daily_tok * 30) = months to break even. At 100K tok/day that is $27/month cloud spend, so ~15 months. At 500K tok/day it is $135/month, so ~3 months. If you are running batch jobs, agents, or CI pipelines against a local model, you hit 1M+ tok/day easily and break even in weeks. After break-even, marginal cost is zero (electricity is negligible).

Daily Token Usage Monthly Cloud Cost Break-Even Verdict
10K tokens/day~$2.70/mo154 yearsCloud wins
100K tokens/day~$27/mo15 monthsToss-up
500K tokens/day~$135/mo3 monthsLocal wins
1M tokens/day~$270/mo1.5 monthsLocal wins
5M tokens/day~$1,350/mo~11 daysLocal wins
Table 5: Break-even assuming $4,999 hardware, ~$9/M blended API tokens.
Rule of thumb: If you generate >100K tokens/day consistently, buy the hardware. If you run batch/agent workloads, break-even is measured in weeks.

Methodology & Reproduction

Exact versions, isolation strategy, and how to reproduce these numbers yourself.

Every benchmark run executes in a fresh subprocess (subprocess.Popen with a new Python interpreter) to ensure clean memory measurement. No model weights persist between runs. RSS is sampled at peak during generation. We use mlx_lm.generate directly — no Ollama, no LM Studio, no HTTP server overhead. Four standardized prompts per model (Q&A, reasoning, code gen, structured output). 3-5 runs per model, results averaged. Outliers beyond 2 standard deviations discarded.

# Reproduction steps (approximate) pip install mlx==0.31.1 mlx-lm==0.31.1 # Run a single model benchmark python -c " import time, subprocess, json from mlx_lm import load, generate model, tokenizer = load('mlx-community/Qwen3-30B-A3B-4bit') prompts = [ 'What causes ocean tides?', 'Write a binary search in Rust.', 'Output JSON: {name, age, city} for 3 fictional people.', 'Explain why P != NP is hard to prove.' ] for p in prompts: t0 = time.perf_counter() out = generate(model, tokenizer, prompt=p, max_tokens=256, verbose=True) elapsed = time.perf_counter() - t0 print(f'Prompt: {p[:40]}... | Time: {elapsed:.2f}s') " # For clean memory measurement, wrap each run in subprocess: # subprocess.run([sys.executable, 'bench_single.py', '--model', model_id])

Test Configuration

  • Hardware: MacBook Pro 16" M5 Max, 40-core GPU, 128 GB unified memory
  • OS: macOS 16.x (Darwin 25.3.0)
  • Framework: mlx 0.31.1, mlx-lm 0.31.1
  • Quantization: 4-bit (Q4) for all models
  • Prompts: 4 standardized (Q&A, reasoning, code, structured output)
  • Runs: 3-5 per model, averaged, outliers discarded
  • Isolation: Each run in a fresh subprocess for clean RSS measurement
  • Cooling: Stock laptop cooling, no external. Thermal throttling not observed.

FAQ

Common questions. Answers reference measured data from the benchmarks above.

Data-backed answers. If the answer involves a number, it came from our test runs, not a spec sheet.

Gemma 3 4B is the fastest model we tested, generating 178.7 tokens per second on the M5 Max 40-core GPU using MLX at 4-bit quantization. It uses only 2.4 GB of memory and supports both text and vision input. For models with more knowledge, Qwen 3 30B-A3B (MoE) achieves 127.4 tok/s while packing 30 billion parameters of intelligence into 16.1 GB of memory. See the full benchmark table for all results.
It follows the formula tok/s ≈ 614 GB/s ÷ model_size_GB. In practice: a 4B model generates ~179 tok/s, 8B ~105-113 tok/s, 14B ~55-62 tok/s, 27B ~31 tok/s, and 70B ~12-13 tok/s. MoE models like Qwen 3 30B-A3B break this curve with 127 tok/s despite using 16GB of memory because only 3B parameters are active per token.
Yes, but you need at least 64 GB of unified memory. At 4-bit quantization, a 70B dense model like Llama 3.3 70B uses 37.1 GB, leaving enough headroom on a 64 GB Mac for the OS and context window. On the M5 Max 40-core GPU, it generates at 12.6 tok/s with a 724ms time to first token. For 128 GB configurations, you can run 70B at Q8 for higher quality or keep multiple models loaded at once. See the RAM Tier Guide for configuration-specific recommendations.
On M5 Max, MLX is the better choice. It is purpose-built for Apple Silicon's unified memory and Metal 4 GPU, providing ~20-30% better decode performance than llama.cpp. Ollama (GGUF-based) also has Metal 4 shader compilation issues on M5 Max as of March 2026. MLX tends to have faster prompt processing as well. See the full MLX vs GGUF comparison for details.
At 4-bit quantization, model size in GB is roughly 0.5 times the number of billion parameters. Leave about 20% headroom for the OS and KV cache. 32 GB runs excellent 12B-27B models (31-69 tok/s). 64 GB runs frontier 70B dense models comfortably. 128 GB is for 70B at Q8, multi-model setups, or frontier MoE like Qwen 3 235B. The RAM Tier Guide has specific recommendations for each configuration.
It depends on usage. At 100K tokens/day (~50-100 interactions), a MacBook Pro M5 Max 128 GB ($4,999) breaks even with cloud APIs within 15 months. At 500K tokens/day, break-even is 3 months. At 1M tokens/day, just 1.5 months. After break-even, every token is free. Electricity adds only ~$5-10/month. See the full cost break-even analysis.
MoE is a model architecture where only a fraction of parameters are activated per token. Qwen 3 30B-A3B has 30 billion total parameters but activates only 3 billion per token, giving it the speed of a 3B model (127.4 tok/s) with the quality of a 30B model. The tradeoff: MoE models still need memory for all parameters (16.1 GB). On Apple Silicon with ample unified memory, MoE offers the best quality-to-speed ratio available — making it ideal for all RAM tiers.
For coding tasks, Devstral Small 24B (39.3 tok/s, 12.6 GB) is purpose-built for code generation and fits on any Mac with 32 GB+. Qwen 3 30B-A3B (127.4 tok/s) is excellent for coding while also being fast enough for real-time use. For smaller setups, Qwen 3 8B (105.4 tok/s, 4.4 GB) offers solid coding at ultra-fast speed. For maximum quality with 64 GB+, Llama 3.3 70B provides the strongest overall performance.
Yes. Once downloaded, models run entirely on local hardware with no internet needed. This is a key advantage for travel, air-gapped environments, restricted networks, and privacy-sensitive work. Models are stored on SSD and inference uses only your Mac's CPU, GPU, and unified memory.
Several VLMs run well via MLX: Gemma 3 4B VLM (179 tok/s, 2.4 GB) is the fastest. Qwen3-VL 8B (111 tok/s, 4.4 GB) is the best value. Gemma 3 27B VLM (32 tok/s, 15.2 GB) and Qwen3-VL 32B (27 tok/s, 17.3 GB) provide the highest quality image understanding. All run entirely locally. See the VLM section for more.
Yes. Apple Silicon uses a unified memory architecture where the CPU and GPU share the same memory pool. You do not need to copy model weights between system RAM and VRAM like you would with a discrete GPU. You can run AI inference alongside regular workloads such as a web browser, IDE, or creative apps. However, the model's memory footprint reduces what is available for other applications, so choose a model size that leaves enough headroom. For example, running a 17 GB model on a 64 GB Mac still leaves roughly 39 GB for macOS and your other apps.
Average human reading speed is roughly 4–5 tokens per second. So any model generating above 5 tok/s is producing text faster than you can read it. At 25–30 tok/s, responses appear nearly instantaneous for short answers. At 100+ tok/s, even long multi-paragraph responses complete in just a few seconds. For coding assistants, higher speeds mean faster completions and less waiting between edits. In practice, anything above 30 tok/s feels real-time for interactive chat.
Yes, as long as the combined memory footprint fits in your available RAM. For example, on a 128 GB Mac with approximately 8 GB reserved for the OS, you could simultaneously load Qwen 3 30B-A3B (16.1 GB) + Devstral 24B (12.6 GB) + Gemma 3 4B (2.4 GB) for a total of 31.1 GB, leaving over 88 GB free. On a 64 GB Mac, you could run a 7–8B model (4.4 GB) alongside a 14B model (7.8 GB) for about 12.2 GB total. Only one model generates tokens at a time using the GPU, but having multiple loaded avoids reload latency when switching between them.
32 GB: Qwen 3 30B-A3B (127.4 tok/s, 16.1 GB) is the best overall pick, offering 30B-class quality at blazing speed. 64 GB: Llama 3.3 70B (12.6 tok/s, 37.1 GB) provides the strongest quality, while Qwen 3 30B-A3B remains the best speed-to-quality ratio. 128 GB: You can run everything, but Qwen 3 30B-A3B is still the daily driver for speed, Llama 3.3 70B for quality, and Devstral 24B (39.3 tok/s, 12.6 GB) for coding. The sweet spot for most users is the 64 GB configuration with the 40-core GPU.

Recommendations

Summary of what to buy, what to run, and what to skip.

Hardware

  • Best value: 64 GB + 40-core GPU. Runs 70B dense, all MoE models, at full 614 GB/s.
  • Max headroom: 128 GB + 40-core. Multi-model, 70B Q8, frontier 235B MoE.
  • Avoid: 32-core GPU if inference speed matters. 25% less bandwidth, same price tier.

Software

  • Use mlx-lm on M5 Max. Skip Ollama until Metal 4 shaders are fixed.
  • mlx_lm.server exposes an OpenAI-compatible API for integration with existing tools.
  • Models: mlx-community/* on HuggingFace. Q4 variants for speed, Q8 if RAM allows.

Model Picks

  • Daily driver: Qwen 3 30B-A3B — 127 tok/s, 16.1 GB. Best speed-to-quality ratio.
  • Coding: Devstral 24B — 39 tok/s, 12.6 GB. Purpose-built for code gen.
  • Max quality: Llama 3.3 70B — 12.6 tok/s, 37.1 GB. Needs 64 GB+.
  • Fast vision: Gemma 3 4B VLM — 179 tok/s, 2.4 GB. OCR/doc analysis.
  • Lightweight: Qwen 3 8B — 105 tok/s, 4.4 GB. Fits everywhere.