How many tokens per second can M5 Max generate?

Token generation speed depends on model size and follows the formula: tok/s is approximately equal to 614 GB/s divided by the model size in GB. In practice, a 4B model generates about 179 tok/s, an 8B model about 105-113 tok/s, a 14B model about 55-62 tok/s, a 27B model about 31 tok/s, and a 70B model about 12-13 tok/s. MoE models like Qwen 3 30B-A3B break this curve, achieving 127 tok/s despite requiring 16GB of memory.

What models support vision and image input on Mac?

Several Vision Language Models (VLMs) run well on M5 Max via MLX. Gemma 3 4B VLM is the fastest at 178.7 tok/s using only 2.4 GB of memory. Qwen3-VL 8B (110.7 tok/s, 4.4 GB) offers the best value for vision tasks. For highest quality image understanding, Qwen3-VL 32B (27.3 tok/s, 17.3 GB) is the top choice. These models can analyze documents, charts, screenshots, and photos entirely locally with no data leaving your machine.

M5 Max Local LLM Benchmarks: 20+ Models, MLX, Real Numbers

Q: What is the fastest AI model on MacBook Pro M5 Max?

Gemma 3 4B is the fastest model we tested, generating 178.7 tokens per second on the M5 Max 40-core GPU using MLX at 4-bit quantization. It uses only 2.4 GB of memory and supports both text and vision input. For models with more knowledge, Qwen 3 30B-A3B (MoE) achieves 127.4 tok/s while packing 30 billion parameters of intelligence into 16.1 GB of memory.

Q: Can I run a 70B model on MacBook Pro?

Yes, but you need at least 64GB of unified memory. At 4-bit quantization, a 70B dense model like Llama 3.3 70B uses 37.1 GB of memory, leaving enough headroom on a 64GB Mac for the OS and context window. On the M5 Max 40-core GPU, it generates at 12.6 tok/s with a 724ms time to first token. For 128GB configurations, you can run 70B models at Q8 quantization for higher quality, or load multiple models simultaneously.

Q: Is MLX faster than llama.cpp on Apple Silicon?

On M5 Max hardware, MLX is the recommended framework. It is purpose-built for Apple Silicon's unified memory architecture and Metal 4 GPU, delivering approximately 20-30% better token generation performance than llama.cpp for decode tasks. MLX also tends to have faster prompt processing due to deep unified memory integration. Additionally, Ollama (which uses GGUF/llama.cpp) has Metal 4 shader compilation issues on M5 Max as of March 2026.

Q: How much RAM do I need to run local AI models?

At 4-bit quantization, model size in GB is roughly 0.5 times the number of billion parameters. You should leave about 20% headroom for the OS and KV cache. With 32GB, you can run excellent 12B-27B models comfortably. With 64GB, you can run 70B dense models and frontier MoE models. With 128GB, you can run frontier MoE models like Qwen 3 235B, 70B models at Q8 quality, and multi-model setups.

Q: Is local AI cheaper than cloud APIs?

It depends on usage. A MacBook Pro M5 Max 128GB costs about $4,999. At 100K tokens per day (roughly 50-100 substantial AI interactions), local inference breaks even with cloud API costs within 15 months. At 500K tokens per day, break-even is 3 months. At 1M tokens per day, just 1.5 months. After break-even, every additional token is free. Electricity adds only $5-10 per month under heavy use.

Q: What is Mixture of Experts (MoE) and why does it matter?

MoE (Mixture of Experts) is a model architecture where only a fraction of parameters are activated per token. For example, Qwen 3 30B-A3B has 30 billion total parameters but activates only 3 billion per token. This gives it the speed of a 3B model (127.4 tok/s) with the quality of a 30B model. The tradeoff is that MoE models still need memory for all parameters (16.1 GB for Qwen 3 30B-A3B). On Apple Silicon machines with ample unified memory, MoE offers the best quality-to-speed ratio.

Q: Which model should I choose for coding on Mac?

For coding tasks, Devstral Small 24B (39.3 tok/s, 12.6 GB) is purpose-built for code generation and fits on any Mac with 32GB or more. Qwen 3 30B-A3B (127.4 tok/s, 16.1 GB) is excellent for coding while also being fast enough for real-time use. For smaller setups, Qwen 3 8B (105.4 tok/s, 4.4 GB) offers solid coding ability at ultra-fast speeds. For maximum quality and you have 64GB+, Llama 3.3 70B provides the strongest overall performance including code.

Q: Can I run AI models completely offline on a MacBook Pro?

Yes. Once you download a model, it runs entirely on local hardware with no internet connection required. Models are stored on your SSD and inference uses only your Mac's CPU, GPU, and unified memory. This is a key advantage over cloud APIs for travel, air-gapped environments, restricted networks, and privacy-sensitive work involving code, legal documents, or medical data.

The M5 Max has 614 GB/s of memory bandwidth and up to 128 GB of unified memory accessible by the GPU at full speed. LLM decode is memory-bandwidth-bound — not compute-bound — so this single number predicts throughput better than TFLOPS. The formula: tok/s ≈ 614 / model_size_GB. This page has measured numbers for 20+ models that confirm it within ~20-30%.

All data comes from real hardware (M5 Max 40-core, 128 GB), real framework versions (mlx 0.31.1, mlx-lm 0.31.1), and isolated subprocess runs. No Ollama, no wrappers — direct mlx_lm.generate calls. Below: the full benchmark tables, interactive comparison tools, a RAM calculator, VLM results, MLX vs GGUF analysis, cloud API cost comparison, and reproduction instructions.

# Quick start: install and run your first model
pip install mlx-lm

# Generate text (downloads model on first run)
mlx_lm.generate --model mlx-community/gemma-3-4b-it-4bit \
  --prompt "Explain the attention mechanism in 3 sentences" \
  --max-tokens 256

# Benchmark a model (rough one-liner)
python -c "
from mlx_lm import load, generate
model, tok = load('mlx-community/Qwen3-30B-A3B-4bit')
generate(model, tok, prompt='Hello', max_tokens=200, verbose=True)
"
  

Hardware: Apple M5 Max

TSMC 3nm. 614 GB/s bandwidth on the 40-core SKU. Bandwidth determines tok/s, not TFLOPS.

LLM decode reads the entire weight matrix once per token. On a memory-bandwidth-bound workload, compute is not the bottleneck — the memory bus is. The 40-core M5 Max has 614 GB/s; the 32-core variant has 460 GB/s. Same die, same RAM options, but ~25% less bandwidth. GPU core count determines bandwidth, not RAM amount. A 64 GB / 40-core Mac is faster than a 128 GB / 32-core Mac for inference. The formula tok/s ≈ bandwidth / model_size_GB holds within 20-30% across all models we tested, with the delta attributable to KV cache overhead, attention compute, and framework efficiency.

M5 Max Chip Specifications

Spec	M5 Max (32-core GPU)	M5 Max (40-core GPU)
CPU Cores	18 (6 Super + 12 Performance)	18 (6 Super + 12 Performance)
GPU Cores	32	40
Neural Engine	16-core	16-core
GPU Neural Accelerators	Yes (new in M5)	Yes (new in M5)
Memory Bandwidth	460 GB/s	614 GB/s
Max Unified Memory	128 GB	128 GB
Process	TSMC 3nm (3rd gen)	TSMC 3nm (3rd gen)

      The number that matters: Memory bandwidth is set by GPU core count, not RAM size. 32-core = 460 GB/s. 40-core = 614 GB/s. Same RAM, ~25% different tok/s.
    

Unified memory means no VRAM wall. An RTX 4090 has 24 GB VRAM; an RTX 5090 has 32 GB. Exceed that and you spill to PCIe system RAM at a fraction of the bandwidth. The M5 Max GPU sees all 128 GB at full 614 GB/s. No copies, no offloading. This is why a 37 GB model like Llama 3.3 70B Q4 runs fine on a laptop — something impossible on any consumer NVIDIA GPU without severe throughput penalties.

Config Matrix

Configuration	Unified Memory	Bandwidth	Max Model Size
M5 Max 32-core GPU	36 GB	460 GB/s	~14B dense Q4 comfortably
M5 Max 32-core GPU	64 GB	460 GB/s	~70B Q4 (slow: ~10 tok/s)
M5 Max 40-core GPU	64 GB	614 GB/s	~70B Q4 at full speed
M5 Max 40-core GPU	128 GB	614 GB/s	235B MoE, 70B Q8, multi-model

All benchmarks below: 40-core GPU, 128 GB config.

Model Comparison Tool

Pick two models. Compare tok/s, RAM, and efficiency side by side.

Model A

Model B

Head to Head

Speed (tok/s)

RAM Usage (GB)

Efficiency (tok/s per GB)

RAM Calculator

Which models fit? 8 GB reserved for macOS. Remainder available for model weights + KV cache.

RAM Allocation

Compatible (0)

Too Large (0)

Best Pick at 128 GB

Text Generation Benchmarks

All Q4, MLX, M5 Max 40-core, 128 GB. 3-5 subprocess-isolated runs, averaged. Click headers to sort.

Dense models read all parameters per token. A 70B Q4 model is ~37 GB, so you get 614 / 37 ≈ 16.6 theoretical tok/s (measured: 12.6, delta from KV cache + compute). MoE models only read active expert weights per token. Qwen 3 30B-A3B activates 3B of 30B params — it decodes at 127 tok/s despite needing 16 GB resident. 70B models are slow but smart; 4B models are fast but dumb. MoE splits the difference. At Q4, memory footprint follows ~0.5 GB per billion parameters for dense models.

Text Generation Models

Model ▲▼	Params ▲▼	Type ▲▼	tok/s ▲▼	Speed	TTFT ▲▼	Memory ▲▼	Tier

Table 1: Text generation benchmarks on M5 Max 40-core GPU, 128 GB, MLX 4-bit quantization. Click column headers to sort.

Vision Language Models

Model	Params	tok/s	Speed	TTFT	Memory	Tier

Table 2: Vision Language Model benchmarks on M5 Max. See the VLM section for HuggingFace model IDs.

    MoE note: Qwen 3 30B-A3B activates 3B of 30B params per token via sparse expert routing. Result: 127.4 tok/s decode at 16.1 GB resident. Effective compute cost of a 3B model, knowledge capacity of a 30B model.
  

Throughput Formula

Decode is memory-bandwidth-bound, not compute-bound:

        tok/s ≈ 614 GB/s ÷ Model Size (GB)
      

Predicts within 20-30%. Gap = KV cache reads + attention compute + framework overhead.

Quick Reference

TTFT < 200ms for all models under 15B params
70B models: TTFT ~730ms, still interactive
Q4 memory: ~0.5 * params_B GB
MoE breaks the speed-vs-size curve

Vision Language Models (VLMs)

Image + text input. Local OCR, document analysis, screenshot parsing. No data leaves your machine.

VLMs process image inputs alongside text prompts. Use cases: local OCR, chart/diagram parsing, screenshot-to-code pipelines, document classification. All processing stays on-device — no image data hits an API endpoint. The text decode speed is identical to the text-only variants for the same architecture (the vision encoder adds latency to prefill, not decode). HuggingFace model IDs: mlx-community/gemma-3-4b-it-4bit (VLM-capable), mlx-community/Qwen3-VL-8B-4bit, mlx-community/Qwen3-VL-32B-4bit.

178.7

tok/s

Gemma 3 4B VLM

Fastest — 2.4 GB RSS

110.7

tok/s

Qwen3-VL 8B

Best quality/speed — 4.4 GB

27.3

tok/s

Qwen3-VL 32B

Highest accuracy — 17.3 GB

For bulk document processing and classification, Gemma 3 4B VLM at 179 tok/s is faster than any cloud vision API round-trip. For tasks requiring strong spatial reasoning or fine-grained image understanding, Qwen3-VL 32B at 27.3 tok/s fits in 64 GB and delivers the best accuracy we measured.

128 GB Memory Map

Each cell = 1 GB of unified memory. Click a category to highlight.

128 GB Unified Memory Pool

1 cell = 1 GB

Speed Tiers

Models bucketed by decode throughput. >100 tok/s = faster than you can read. >30 tok/s = real-time chat. <25 tok/s = noticeable latency.

Efficiency: tok/s per GB

Speed normalized by memory footprint. Higher = more throughput per GB of RAM consumed.

Rank	Model	Type	tok/s	Memory	tok/s per GB	Quality	Agentic	Efficiency

Table 3: Efficiency = tok/s / memory_GB. Higher is better. Small dense models dominate; MoE penalized by total weight residency.

Quality Evaluation (lm-eval-harness)

ARC-Challenge, GSM8K, IFEval via lm-evaluation-harness. 4-bit quantization, greedy decoding.

Scores below are from lm-eval with our custom task configs (0-shot ARC with regex extraction, 8-shot GSM8K CoT multiturn, IFEval strict). All runs on MLX with greedy decoding (temp=0). Composite = mean of available scores.

Loading quality evaluation data...

Agentic Eval (terminal-bench)

14 tasks via terminal-bench-core v0.1.1 with terminus-thinking agent. Docker sandbox per task.

Each model runs an agentic loop: receive task description, execute shell commands in a Docker container, iterate until tests pass or timeout. The agent uses structured JSON output (CommandBatchResponse schema). Parse errors = model can't produce valid JSON. Run with: python3 scripts/run_agentic_bench_batch.py --task <name>

Loading agentic benchmark data...

View detailed failure analysis in the Eval Dashboard →

RAM Tier Picks

Q4 memory: ~params_B * 0.5 GB. Reserve ~20% for OS + KV cache + apps.

Pick the model that fits your RAM with headroom to spare. Running at 95% memory utilization works but leaves no room for KV cache growth on long contexts or running anything else. These picks assume you want to keep the model loaded while using other apps.

32 GB

~24 GB usable

Covers all models up to 27B dense. MoE models fit too.

Qwen 3 30B-A3B127.4 tok/s · 16.1 GB
Gemma 3 27B30.9 tok/s · 15.2 GB
Phi-4 14B62.0 tok/s · 7.8 GB
Gemma 3 4B178.7 tok/s · 2.4 GB

Best pick: Qwen 3 30B-A3B — 30B quality at 127 tok/s.

64 GB

Sweet Spot

Runs 70B dense models. Get the 40-core GPU variant for full 614 GB/s bandwidth.

Qwen 3 30B-A3B127.4 tok/s · 16.1 GB
Llama 3.3 70B12.6 tok/s · 37.1 GB
Devstral 24B39.3 tok/s · 12.6 GB
Qwen 3 32B25.7 tok/s · 17.3 GB

32-core GPU = ~25% slower than 40-core at same RAM. Pay for the GPU cores.

128 GB

Everything fits

Multi-model setups. 70B at Q8 for better quality. Frontier 235B MoE.

Qwen 3 30B-A3B127.4 tok/s · 16.1 GB
Llama 3.3 70B12.6 tok/s · 37.1 GB
Devstral 24B39.3 tok/s · 12.6 GB
DeepSeek R1 32B24.9 tok/s · 17.3 GB

Qwen 3 235B-A22B fits at ~118 GB Q4. Frontier MoE on a laptop.

MLX vs GGUF

MLX: Apple-native, Metal 4, unified memory optimized. GGUF: cross-platform, powers Ollama/LM Studio.

On M5 Max, MLX delivers ~20-30% higher decode tok/s than llama.cpp (GGUF backend). MLX is built for Metal 4 and Apple's unified memory — no CPU-to-GPU copies, native mx.array operations. As of March 2026, Ollama 0.18.2 has Metal 4 shader compilation bugs on M5 Max. If you need Ollama, pin to an older version or wait for a fix. GGUF wins on model variety (tens of thousands of quants on HuggingFace) and cross-platform support (Linux/Windows CUDA).

# MLX: install and run
pip install mlx-lm
mlx_lm.generate --model mlx-community/Qwen3-30B-A3B-4bit \
  --prompt "Write a Python async HTTP server" --max-tokens 512

# GGUF via llama.cpp (if you need it)
brew install llama.cpp
llama-cli -m Qwen3-30B-A3B-Q4_K_M.gguf \
  -p "Write a Python async HTTP server" -n 512

# MLX server (OpenAI-compatible API)
mlx_lm.server --model mlx-community/Qwen3-30B-A3B-4bit --port 8080
# Then: curl localhost:8080/v1/chat/completions ...
  

MLX

Native Metal 4 GPU + unified memory integration
pip install mlx-lm — Python-first, fine-tuning support
~20-30% faster decode vs llama.cpp on Apple Silicon
Faster prefill via zero-copy memory access
Recommended on M5 Max as of March 2026
Models: mlx-community/* on HuggingFace

GGUF / llama.cpp

Cross-platform: CPU, CUDA, Metal, Vulkan, ROCm
Largest model ecosystem on HuggingFace
IQ quants, mixed quantization, k-quants
Powers Ollama and LM Studio GUIs
Bug: Ollama 0.18.2 Metal 4 shader issues on M5
Better choice for Linux/CUDA dual-boot setups

    TL;DR: Use mlx-lm on M5 Max for maximum tok/s. Use GGUF if you need cross-platform compat or a specific quant variant not yet in mlx-community.
  

Cloud API Pricing Reference

March 2026 pricing. Elo ratings from LM Arena. Local models are free after hardware cost.

Cloud APIs still win on absolute quality (Elo 1490-1510 vs ~1420-1450 for the best open models). For most dev tasks — code gen, summarization, data extraction, chat — a local 30B-70B model is good enough. The real question is cost at your usage volume. See the break-even table.

Model	Provider	Input $/M tok	Output $/M tok	Context	Arena Elo	Vision
Claude Opus 4.6	Anthropic	$5.00	$25.00	1M	~1505	Yes
Gemini 3.1 Pro	Google	$2.00	$12.00	1M	~1503	Yes
GPT-5.2	OpenAI	$1.75	$14.00	400K	~1490	Yes
Claude Sonnet 4.6	Anthropic	$3.00	$15.00	200K	~1480	Yes
GPT-5.2 Pro	OpenAI	$21.00	$168.00	400K	~1510	Yes
Gemini 2.5 Flash	Google	Free	Free	1M	~1450	Yes
DeepSeek V3.2 API	DeepSeek	$0.14	$0.28	164K	~1421	No
DeepSeek R1 API	DeepSeek	$0.55	$2.19	164K	~1430	No
GPT-4o	OpenAI	$2.50	$10.00	128K	~1460	Yes
Claude Haiku 4.5	Anthropic	$1.00	$5.00	200K	~1420	Yes

Table 4: Cloud API pricing, March 2026. Elo from lmarena.ai.

Local Wins When

✓ Data cannot leave the machine (HIPAA, source code, etc.)
✓ TTFT matters: 100-200ms local vs 500ms-2s cloud
✓ High volume: $0 marginal cost after hardware amortizes
✓ Offline / air-gapped / airplane mode
✓ No rate limits, no API key management

Cloud Wins When

✓ You need Elo 1490+ quality (frontier reasoning)
✓ Low volume (<100K tok/day): cheaper than hardware
✓ 1M+ context windows not feasible locally
✓ Always-latest models without manual updates
✓ No upfront capex

Cost Break-Even

M5 Max 128 GB = ~$4,999. Blended Sonnet-tier API cost = ~$9/M tokens. Electricity: ~$5-10/mo under load.

Simple math: $4,999 / ($9/M_tok * daily_tok * 30) = months to break even. At 100K tok/day that is $27/month cloud spend, so ~15 months. At 500K tok/day it is $135/month, so ~3 months. If you are running batch jobs, agents, or CI pipelines against a local model, you hit 1M+ tok/day easily and break even in weeks. After break-even, marginal cost is zero (electricity is negligible).

Daily Token Usage	Monthly Cloud Cost	Break-Even	Verdict
10K tokens/day	~$2.70/mo	154 years	Cloud wins
100K tokens/day	~$27/mo	15 months	Toss-up
500K tokens/day	~$135/mo	3 months	Local wins
1M tokens/day	~$270/mo	1.5 months	Local wins
5M tokens/day	~$1,350/mo	~11 days	Local wins

Table 5: Break-even assuming $4,999 hardware, ~$9/M blended API tokens.

    Rule of thumb: If you generate >100K tokens/day consistently, buy the hardware. If you run batch/agent workloads, break-even is measured in weeks.
  

Methodology & Reproduction

Exact versions, isolation strategy, and how to reproduce these numbers yourself.

Every benchmark run executes in a fresh subprocess (subprocess.Popen with a new Python interpreter) to ensure clean memory measurement. No model weights persist between runs. RSS is sampled at peak during generation. We use mlx_lm.generate directly — no Ollama, no LM Studio, no HTTP server overhead. Four standardized prompts per model (Q&A, reasoning, code gen, structured output). 3-5 runs per model, results averaged. Outliers beyond 2 standard deviations discarded.

# Reproduction steps (approximate)
pip install mlx==0.31.1 mlx-lm==0.31.1

# Run a single model benchmark
python -c "
import time, subprocess, json
from mlx_lm import load, generate

model, tokenizer = load('mlx-community/Qwen3-30B-A3B-4bit')

prompts = [
    'What causes ocean tides?',
    'Write a binary search in Rust.',
    'Output JSON: {name, age, city} for 3 fictional people.',
    'Explain why P != NP is hard to prove.'
]

for p in prompts:
    t0 = time.perf_counter()
    out = generate(model, tokenizer, prompt=p, max_tokens=256, verbose=True)
    elapsed = time.perf_counter() - t0
    print(f'Prompt: {p[:40]}... | Time: {elapsed:.2f}s')
"

# For clean memory measurement, wrap each run in subprocess:
# subprocess.run([sys.executable, 'bench_single.py', '--model', model_id])
  

Test Configuration

Hardware: MacBook Pro 16" M5 Max, 40-core GPU, 128 GB unified memory
OS: macOS 16.x (Darwin 25.3.0)
Framework: mlx 0.31.1, mlx-lm 0.31.1
Quantization: 4-bit (Q4) for all models
Prompts: 4 standardized (Q&A, reasoning, code, structured output)
Runs: 3-5 per model, averaged, outliers discarded
Isolation: Each run in a fresh subprocess for clean RSS measurement
Cooling: Stock laptop cooling, no external. Thermal throttling not observed.

FAQ

Common questions. Answers reference measured data from the benchmarks above.

Data-backed answers. If the answer involves a number, it came from our test runs, not a spec sheet.

What is the fastest AI model on MacBook Pro M5 Max?+

Gemma 3 4B is the fastest model we tested, generating 178.7 tokens per second on the M5 Max 40-core GPU using MLX at 4-bit quantization. It uses only 2.4 GB of memory and supports both text and vision input. For models with more knowledge, Qwen 3 30B-A3B (MoE) achieves 127.4 tok/s while packing 30 billion parameters of intelligence into 16.1 GB of memory. See the full benchmark table for all results.

How many tokens per second can the M5 Max generate?+

It follows the formula tok/s ≈ 614 GB/s ÷ model_size_GB. In practice: a 4B model generates ~179 tok/s, 8B ~105-113 tok/s, 14B ~55-62 tok/s, 27B ~31 tok/s, and 70B ~12-13 tok/s. MoE models like Qwen 3 30B-A3B break this curve with 127 tok/s despite using 16GB of memory because only 3B parameters are active per token.

Can I run a 70B model on MacBook Pro?+

Yes, but you need at least 64 GB of unified memory. At 4-bit quantization, a 70B dense model like Llama 3.3 70B uses 37.1 GB, leaving enough headroom on a 64 GB Mac for the OS and context window. On the M5 Max 40-core GPU, it generates at 12.6 tok/s with a 724ms time to first token. For 128 GB configurations, you can run 70B at Q8 for higher quality or keep multiple models loaded at once. See the RAM Tier Guide for configuration-specific recommendations.

Is MLX faster than llama.cpp on Apple Silicon?+

On M5 Max, MLX is the better choice. It is purpose-built for Apple Silicon's unified memory and Metal 4 GPU, providing ~20-30% better decode performance than llama.cpp. Ollama (GGUF-based) also has Metal 4 shader compilation issues on M5 Max as of March 2026. MLX tends to have faster prompt processing as well. See the full MLX vs GGUF comparison for details.

How much RAM do I need to run local AI models?+

At 4-bit quantization, model size in GB is roughly 0.5 times the number of billion parameters. Leave about 20% headroom for the OS and KV cache. 32 GB runs excellent 12B-27B models (31-69 tok/s). 64 GB runs frontier 70B dense models comfortably. 128 GB is for 70B at Q8, multi-model setups, or frontier MoE like Qwen 3 235B. The RAM Tier Guide has specific recommendations for each configuration.

Is local AI cheaper than cloud APIs?+

It depends on usage. At 100K tokens/day (~50-100 interactions), a MacBook Pro M5 Max 128 GB ($4,999) breaks even with cloud APIs within 15 months. At 500K tokens/day, break-even is 3 months. At 1M tokens/day, just 1.5 months. After break-even, every token is free. Electricity adds only ~$5-10/month. See the full cost break-even analysis.

What is Mixture of Experts (MoE) and why does it matter?+

MoE is a model architecture where only a fraction of parameters are activated per token. Qwen 3 30B-A3B has 30 billion total parameters but activates only 3 billion per token, giving it the speed of a 3B model (127.4 tok/s) with the quality of a 30B model. The tradeoff: MoE models still need memory for all parameters (16.1 GB). On Apple Silicon with ample unified memory, MoE offers the best quality-to-speed ratio available — making it ideal for all RAM tiers.

Which model should I choose for coding on Mac?+

For coding tasks, Devstral Small 24B (39.3 tok/s, 12.6 GB) is purpose-built for code generation and fits on any Mac with 32 GB+. Qwen 3 30B-A3B (127.4 tok/s) is excellent for coding while also being fast enough for real-time use. For smaller setups, Qwen 3 8B (105.4 tok/s, 4.4 GB) offers solid coding at ultra-fast speed. For maximum quality with 64 GB+, Llama 3.3 70B provides the strongest overall performance.

Can I run AI models completely offline on a MacBook Pro?+

Yes. Once downloaded, models run entirely on local hardware with no internet needed. This is a key advantage for travel, air-gapped environments, restricted networks, and privacy-sensitive work. Models are stored on SSD and inference uses only your Mac's CPU, GPU, and unified memory.

What models support vision (image input) on Mac?+

Several VLMs run well via MLX: Gemma 3 4B VLM (179 tok/s, 2.4 GB) is the fastest. Qwen3-VL 8B (111 tok/s, 4.4 GB) is the best value. Gemma 3 27B VLM (32 tok/s, 15.2 GB) and Qwen3-VL 32B (27 tok/s, 17.3 GB) provide the highest quality image understanding. All run entirely locally. See the VLM section for more.

Can I run AI models while doing other work?+

Yes. Apple Silicon uses a unified memory architecture where the CPU and GPU share the same memory pool. You do not need to copy model weights between system RAM and VRAM like you would with a discrete GPU. You can run AI inference alongside regular workloads such as a web browser, IDE, or creative apps. However, the model's memory footprint reduces what is available for other applications, so choose a model size that leaves enough headroom. For example, running a 17 GB model on a 64 GB Mac still leaves roughly 39 GB for macOS and your other apps.

What does tokens per second actually feel like?+

Average human reading speed is roughly 4–5 tokens per second. So any model generating above 5 tok/s is producing text faster than you can read it. At 25–30 tok/s, responses appear nearly instantaneous for short answers. At 100+ tok/s, even long multi-paragraph responses complete in just a few seconds. For coding assistants, higher speeds mean faster completions and less waiting between edits. In practice, anything above 30 tok/s feels real-time for interactive chat.

Can I run multiple models at once?+

Yes, as long as the combined memory footprint fits in your available RAM. For example, on a 128 GB Mac with approximately 8 GB reserved for the OS, you could simultaneously load Qwen 3 30B-A3B (16.1 GB) + Devstral 24B (12.6 GB) + Gemma 3 4B (2.4 GB) for a total of 31.1 GB, leaving over 88 GB free. On a 64 GB Mac, you could run a 7–8B model (4.4 GB) alongside a 14B model (7.8 GB) for about 12.2 GB total. Only one model generates tokens at a time using the GPU, but having multiple loaded avoids reload latency when switching between them.

What is the best model for each RAM tier?+

32 GB: Qwen 3 30B-A3B (127.4 tok/s, 16.1 GB) is the best overall pick, offering 30B-class quality at blazing speed. 64 GB: Llama 3.3 70B (12.6 tok/s, 37.1 GB) provides the strongest quality, while Qwen 3 30B-A3B remains the best speed-to-quality ratio. 128 GB: You can run everything, but Qwen 3 30B-A3B is still the daily driver for speed, Llama 3.3 70B for quality, and Devstral 24B (39.3 tok/s, 12.6 GB) for coding. The sweet spot for most users is the 64 GB configuration with the 40-core GPU.

Recommendations

Summary of what to buy, what to run, and what to skip.

Hardware

• Best value: 64 GB + 40-core GPU. Runs 70B dense, all MoE models, at full 614 GB/s.
• Max headroom: 128 GB + 40-core. Multi-model, 70B Q8, frontier 235B MoE.
• Avoid: 32-core GPU if inference speed matters. 25% less bandwidth, same price tier.

Software

• Use mlx-lm on M5 Max. Skip Ollama until Metal 4 shaders are fixed.
• mlx_lm.server exposes an OpenAI-compatible API for integration with existing tools.
• Models: mlx-community/* on HuggingFace. Q4 variants for speed, Q8 if RAM allows.

Model Picks

• Daily driver: Qwen 3 30B-A3B — 127 tok/s, 16.1 GB. Best speed-to-quality ratio.
• Coding: Devstral 24B — 39 tok/s, 12.6 GB. Purpose-built for code gen.
• Max quality: Llama 3.3 70B — 12.6 tok/s, 37.1 GB. Needs 64 GB+.
• Fast vision: Gemma 3 4B VLM — 179 tok/s, 2.4 GB. OCR/doc analysis.
• Lightweight: Qwen 3 8B — 105 tok/s, 4.4 GB. Fits everywhere.