The M5 Max has 614 GB/s of memory bandwidth and up to 128 GB of unified memory accessible by the GPU at full speed. LLM decode is memory-bandwidth-bound — not compute-bound — so this single number predicts throughput better than TFLOPS. The formula: tok/s ≈ 614 / model_size_GB. This page has measured numbers for 20+ models that confirm it within ~20-30%.
All data comes from real hardware (M5 Max 40-core, 128 GB), real framework versions (mlx 0.31.1, mlx-lm 0.31.1), and isolated subprocess runs. No Ollama, no wrappers — direct mlx_lm.generate calls. Below: the full benchmark tables, interactive comparison tools, a RAM calculator, VLM results, MLX vs GGUF analysis, cloud API cost comparison, and reproduction instructions.
Hardware: Apple M5 Max
TSMC 3nm. 614 GB/s bandwidth on the 40-core SKU. Bandwidth determines tok/s, not TFLOPS.
LLM decode reads the entire weight matrix once per token. On a memory-bandwidth-bound workload, compute is not the bottleneck — the memory bus is. The 40-core M5 Max has 614 GB/s; the 32-core variant has 460 GB/s. Same die, same RAM options, but ~25% less bandwidth. GPU core count determines bandwidth, not RAM amount. A 64 GB / 40-core Mac is faster than a 128 GB / 32-core Mac for inference. The formula tok/s ≈ bandwidth / model_size_GB holds within 20-30% across all models we tested, with the delta attributable to KV cache overhead, attention compute, and framework efficiency.
M5 Max Chip Specifications
| Spec | M5 Max (32-core GPU) | M5 Max (40-core GPU) |
|---|---|---|
| CPU Cores | 18 (6 Super + 12 Performance) | 18 (6 Super + 12 Performance) |
| GPU Cores | 32 | 40 |
| Neural Engine | 16-core | 16-core |
| GPU Neural Accelerators | Yes (new in M5) | Yes (new in M5) |
| Memory Bandwidth | 460 GB/s | 614 GB/s |
| Max Unified Memory | 128 GB | 128 GB |
| Process | TSMC 3nm (3rd gen) | TSMC 3nm (3rd gen) |
Unified memory means no VRAM wall. An RTX 4090 has 24 GB VRAM; an RTX 5090 has 32 GB. Exceed that and you spill to PCIe system RAM at a fraction of the bandwidth. The M5 Max GPU sees all 128 GB at full 614 GB/s. No copies, no offloading. This is why a 37 GB model like Llama 3.3 70B Q4 runs fine on a laptop — something impossible on any consumer NVIDIA GPU without severe throughput penalties.
Config Matrix
| Configuration | Unified Memory | Bandwidth | Max Model Size |
|---|---|---|---|
| M5 Max 32-core GPU | 36 GB | 460 GB/s | ~14B dense Q4 comfortably |
| M5 Max 32-core GPU | 64 GB | 460 GB/s | ~70B Q4 (slow: ~10 tok/s) |
| M5 Max 40-core GPU | 64 GB | 614 GB/s | ~70B Q4 at full speed |
| M5 Max 40-core GPU | 128 GB | 614 GB/s | 235B MoE, 70B Q8, multi-model |
Model Comparison Tool
Pick two models. Compare tok/s, RAM, and efficiency side by side.
Head to Head
Speed (tok/s)
RAM Usage (GB)
Efficiency (tok/s per GB)
RAM Calculator
Which models fit? 8 GB reserved for macOS. Remainder available for model weights + KV cache.
Compatible (0)
Too Large (0)
Best Pick at 128 GB
Text Generation Benchmarks
All Q4, MLX, M5 Max 40-core, 128 GB. 3-5 subprocess-isolated runs, averaged. Click headers to sort.
Dense models read all parameters per token. A 70B Q4 model is ~37 GB, so you get 614 / 37 ≈ 16.6 theoretical tok/s (measured: 12.6, delta from KV cache + compute). MoE models only read active expert weights per token. Qwen 3 30B-A3B activates 3B of 30B params — it decodes at 127 tok/s despite needing 16 GB resident. 70B models are slow but smart; 4B models are fast but dumb. MoE splits the difference. At Q4, memory footprint follows ~0.5 GB per billion parameters for dense models.
Text Generation Models
| Model ▲▼ | Params ▲▼ | Type ▲▼ | tok/s ▲▼ | Speed | TTFT ▲▼ | Memory ▲▼ | Tier |
|---|
Vision Language Models
| Model | Params | tok/s | Speed | TTFT | Memory | Tier |
|---|
Qwen 3 30B-A3B activates 3B of 30B params per token via sparse expert routing. Result: 127.4 tok/s decode at 16.1 GB resident. Effective compute cost of a 3B model, knowledge capacity of a 30B model.
Throughput Formula
Decode is memory-bandwidth-bound, not compute-bound:
Predicts within 20-30%. Gap = KV cache reads + attention compute + framework overhead.
Quick Reference
- TTFT < 200ms for all models under 15B params
- 70B models: TTFT ~730ms, still interactive
- Q4 memory:
~0.5 * params_BGB - MoE breaks the speed-vs-size curve
Vision Language Models (VLMs)
Image + text input. Local OCR, document analysis, screenshot parsing. No data leaves your machine.
VLMs process image inputs alongside text prompts. Use cases: local OCR, chart/diagram parsing, screenshot-to-code pipelines, document classification. All processing stays on-device — no image data hits an API endpoint. The text decode speed is identical to the text-only variants for the same architecture (the vision encoder adds latency to prefill, not decode). HuggingFace model IDs: mlx-community/gemma-3-4b-it-4bit (VLM-capable), mlx-community/Qwen3-VL-8B-4bit, mlx-community/Qwen3-VL-32B-4bit.
For bulk document processing and classification, Gemma 3 4B VLM at 179 tok/s is faster than any cloud vision API round-trip. For tasks requiring strong spatial reasoning or fine-grained image understanding, Qwen3-VL 32B at 27.3 tok/s fits in 64 GB and delivers the best accuracy we measured.
128 GB Memory Map
Each cell = 1 GB of unified memory. Click a category to highlight.
128 GB Unified Memory Pool
1 cell = 1 GBSpeed Tiers
Models bucketed by decode throughput. >100 tok/s = faster than you can read. >30 tok/s = real-time chat. <25 tok/s = noticeable latency.
Efficiency: tok/s per GB
Speed normalized by memory footprint. Higher = more throughput per GB of RAM consumed.
| Rank | Model | Type | tok/s | Memory | tok/s per GB | Quality | Agentic | Efficiency |
|---|
Quality Evaluation (lm-eval-harness)
ARC-Challenge, GSM8K, IFEval via lm-evaluation-harness. 4-bit quantization, greedy decoding.
Scores below are from lm-eval with our custom task configs (0-shot ARC with regex extraction, 8-shot GSM8K CoT multiturn, IFEval strict). All runs on MLX with greedy decoding (temp=0). Composite = mean of available scores.
Loading quality evaluation data...
Agentic Eval (terminal-bench)
14 tasks via terminal-bench-core v0.1.1 with terminus-thinking agent. Docker sandbox per task.
Each model runs an agentic loop: receive task description, execute shell commands in a Docker container, iterate until tests pass or timeout. The agent uses structured JSON output (CommandBatchResponse schema). Parse errors = model can't produce valid JSON. Run with: python3 scripts/run_agentic_bench_batch.py --task <name>
Loading agentic benchmark data...
RAM Tier Picks
Q4 memory: ~params_B * 0.5 GB. Reserve ~20% for OS + KV cache + apps.
Pick the model that fits your RAM with headroom to spare. Running at 95% memory utilization works but leaves no room for KV cache growth on long contexts or running anything else. These picks assume you want to keep the model loaded while using other apps.
32 GB
~24 GB usableCovers all models up to 27B dense. MoE models fit too.
- Qwen 3 30B-A3B127.4 tok/s · 16.1 GB
- Gemma 3 27B30.9 tok/s · 15.2 GB
- Phi-4 14B62.0 tok/s · 7.8 GB
- Gemma 3 4B178.7 tok/s · 2.4 GB
Qwen 3 30B-A3B — 30B quality at 127 tok/s.64 GB
Sweet SpotRuns 70B dense models. Get the 40-core GPU variant for full 614 GB/s bandwidth.
- Qwen 3 30B-A3B127.4 tok/s · 16.1 GB
- Llama 3.3 70B12.6 tok/s · 37.1 GB
- Devstral 24B39.3 tok/s · 12.6 GB
- Qwen 3 32B25.7 tok/s · 17.3 GB
128 GB
Everything fitsMulti-model setups. 70B at Q8 for better quality. Frontier 235B MoE.
- Qwen 3 30B-A3B127.4 tok/s · 16.1 GB
- Llama 3.3 70B12.6 tok/s · 37.1 GB
- Devstral 24B39.3 tok/s · 12.6 GB
- DeepSeek R1 32B24.9 tok/s · 17.3 GB
Qwen 3 235B-A22B fits at ~118 GB Q4. Frontier MoE on a laptop.MLX vs GGUF
MLX: Apple-native, Metal 4, unified memory optimized. GGUF: cross-platform, powers Ollama/LM Studio.
On M5 Max, MLX delivers ~20-30% higher decode tok/s than llama.cpp (GGUF backend). MLX is built for Metal 4 and Apple's unified memory — no CPU-to-GPU copies, native mx.array operations. As of March 2026, Ollama 0.18.2 has Metal 4 shader compilation bugs on M5 Max. If you need Ollama, pin to an older version or wait for a fix. GGUF wins on model variety (tens of thousands of quants on HuggingFace) and cross-platform support (Linux/Windows CUDA).
MLX
- Native Metal 4 GPU + unified memory integration
pip install mlx-lm— Python-first, fine-tuning support- ~20-30% faster decode vs llama.cpp on Apple Silicon
- Faster prefill via zero-copy memory access
- Recommended on M5 Max as of March 2026
- Models:
mlx-community/*on HuggingFace
GGUF / llama.cpp
- Cross-platform: CPU, CUDA, Metal, Vulkan, ROCm
- Largest model ecosystem on HuggingFace
- IQ quants, mixed quantization, k-quants
- Powers Ollama and LM Studio GUIs
- Bug: Ollama 0.18.2 Metal 4 shader issues on M5
- Better choice for Linux/CUDA dual-boot setups
mlx-lm on M5 Max for maximum tok/s. Use GGUF if you need cross-platform compat or a specific quant variant not yet in mlx-community.
Cloud API Pricing Reference
March 2026 pricing. Elo ratings from LM Arena. Local models are free after hardware cost.
Cloud APIs still win on absolute quality (Elo 1490-1510 vs ~1420-1450 for the best open models). For most dev tasks — code gen, summarization, data extraction, chat — a local 30B-70B model is good enough. The real question is cost at your usage volume. See the break-even table.
| Model | Provider | Input $/M tok | Output $/M tok | Context | Arena Elo | Vision |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | $5.00 | $25.00 | 1M | ~1505 | Yes |
| Gemini 3.1 Pro | $2.00 | $12.00 | 1M | ~1503 | Yes | |
| GPT-5.2 | OpenAI | $1.75 | $14.00 | 400K | ~1490 | Yes |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | 200K | ~1480 | Yes |
| GPT-5.2 Pro | OpenAI | $21.00 | $168.00 | 400K | ~1510 | Yes |
| Gemini 2.5 Flash | Free | Free | 1M | ~1450 | Yes | |
| DeepSeek V3.2 API | DeepSeek | $0.14 | $0.28 | 164K | ~1421 | No |
| DeepSeek R1 API | DeepSeek | $0.55 | $2.19 | 164K | ~1430 | No |
| GPT-4o | OpenAI | $2.50 | $10.00 | 128K | ~1460 | Yes |
| Claude Haiku 4.5 | Anthropic | $1.00 | $5.00 | 200K | ~1420 | Yes |
Local Wins When
- ✓ Data cannot leave the machine (HIPAA, source code, etc.)
- ✓ TTFT matters: 100-200ms local vs 500ms-2s cloud
- ✓ High volume: $0 marginal cost after hardware amortizes
- ✓ Offline / air-gapped / airplane mode
- ✓ No rate limits, no API key management
Cloud Wins When
- ✓ You need Elo 1490+ quality (frontier reasoning)
- ✓ Low volume (<100K tok/day): cheaper than hardware
- ✓ 1M+ context windows not feasible locally
- ✓ Always-latest models without manual updates
- ✓ No upfront capex
Cost Break-Even
M5 Max 128 GB = ~$4,999. Blended Sonnet-tier API cost = ~$9/M tokens. Electricity: ~$5-10/mo under load.
Simple math: $4,999 / ($9/M_tok * daily_tok * 30) = months to break even. At 100K tok/day that is $27/month cloud spend, so ~15 months. At 500K tok/day it is $135/month, so ~3 months. If you are running batch jobs, agents, or CI pipelines against a local model, you hit 1M+ tok/day easily and break even in weeks. After break-even, marginal cost is zero (electricity is negligible).
| Daily Token Usage | Monthly Cloud Cost | Break-Even | Verdict |
|---|---|---|---|
| 10K tokens/day | ~$2.70/mo | 154 years | Cloud wins |
| 100K tokens/day | ~$27/mo | 15 months | Toss-up |
| 500K tokens/day | ~$135/mo | 3 months | Local wins |
| 1M tokens/day | ~$270/mo | 1.5 months | Local wins |
| 5M tokens/day | ~$1,350/mo | ~11 days | Local wins |
Methodology & Reproduction
Exact versions, isolation strategy, and how to reproduce these numbers yourself.
Every benchmark run executes in a fresh subprocess (subprocess.Popen with a new Python interpreter) to ensure clean memory measurement. No model weights persist between runs. RSS is sampled at peak during generation. We use mlx_lm.generate directly — no Ollama, no LM Studio, no HTTP server overhead. Four standardized prompts per model (Q&A, reasoning, code gen, structured output). 3-5 runs per model, results averaged. Outliers beyond 2 standard deviations discarded.
Test Configuration
- Hardware: MacBook Pro 16" M5 Max, 40-core GPU, 128 GB unified memory
- OS: macOS 16.x (Darwin 25.3.0)
- Framework:
mlx0.31.1,mlx-lm0.31.1 - Quantization: 4-bit (Q4) for all models
- Prompts: 4 standardized (Q&A, reasoning, code, structured output)
- Runs: 3-5 per model, averaged, outliers discarded
- Isolation: Each run in a fresh subprocess for clean RSS measurement
- Cooling: Stock laptop cooling, no external. Thermal throttling not observed.
FAQ
Common questions. Answers reference measured data from the benchmarks above.
Data-backed answers. If the answer involves a number, it came from our test runs, not a spec sheet.
tok/s ≈ 614 GB/s ÷ model_size_GB. In practice: a 4B model generates ~179 tok/s, 8B ~105-113 tok/s, 14B ~55-62 tok/s, 27B ~31 tok/s, and 70B ~12-13 tok/s. MoE models like Qwen 3 30B-A3B break this curve with 127 tok/s despite using 16GB of memory because only 3B parameters are active per token.
Recommendations
Summary of what to buy, what to run, and what to skip.
Hardware
- • Best value: 64 GB + 40-core GPU. Runs 70B dense, all MoE models, at full 614 GB/s.
- • Max headroom: 128 GB + 40-core. Multi-model, 70B Q8, frontier 235B MoE.
- • Avoid: 32-core GPU if inference speed matters. 25% less bandwidth, same price tier.
Software
- • Use
mlx-lmon M5 Max. Skip Ollama until Metal 4 shaders are fixed. - •
mlx_lm.serverexposes an OpenAI-compatible API for integration with existing tools. - • Models:
mlx-community/*on HuggingFace. Q4 variants for speed, Q8 if RAM allows.
Model Picks
- • Daily driver:
Qwen 3 30B-A3B— 127 tok/s, 16.1 GB. Best speed-to-quality ratio. - • Coding:
Devstral 24B— 39 tok/s, 12.6 GB. Purpose-built for code gen. - • Max quality:
Llama 3.3 70B— 12.6 tok/s, 37.1 GB. Needs 64 GB+. - • Fast vision:
Gemma 3 4B VLM— 179 tok/s, 2.4 GB. OCR/doc analysis. - • Lightweight:
Qwen 3 8B— 105 tok/s, 4.4 GB. Fits everywhere.