How many tokens per second can M5 Max generate?

Token generation speed depends on model size and follows the formula: tok/s is approximately equal to 614 GB/s divided by the model size in GB. In practice, a 4B model generates about 179 tok/s, an 8B model about 105-113 tok/s, a 14B model about 55-62 tok/s, a 27B model about 31 tok/s, and a 70B model about 12-13 tok/s. MoE models like Qwen 3 30B-A3B break this curve, achieving 127 tok/s despite requiring 16GB of memory.

What models support vision and image input on Mac?

Several Vision Language Models (VLMs) run well on M5 Max via MLX. Gemma 3 4B VLM is the fastest at 178.7 tok/s using only 2.4 GB of memory. Qwen3-VL 8B (110.7 tok/s, 4.4 GB) offers the best value for vision tasks. For highest quality image understanding, Qwen3-VL 32B (27.3 tok/s, 17.3 GB) is the top choice. These models can analyze documents, charts, screenshots, and photos entirely locally with no data leaving your machine.

The M5 Max Changed Everything About Local AI — And Most Reviews Got It Wrong

Q: What is the fastest AI model on MacBook Pro M5 Max?

Gemma 3 4B is the fastest model we tested, generating 178.7 tokens per second on the M5 Max 40-core GPU using MLX at 4-bit quantization. It uses only 2.4 GB of memory and supports both text and vision input. For models with more knowledge, Qwen 3 30B-A3B (MoE) achieves 127.4 tok/s while packing 30 billion parameters of intelligence into 16.1 GB of memory.

Q: Can I run a 70B model on MacBook Pro?

Yes, but you need at least 64GB of unified memory. At 4-bit quantization, a 70B dense model like Llama 3.3 70B uses 37.1 GB of memory, leaving enough headroom on a 64GB Mac for the OS and context window. On the M5 Max 40-core GPU, it generates at 12.6 tok/s with a 724ms time to first token. For 128GB configurations, you can run 70B models at Q8 quantization for higher quality, or load multiple models simultaneously.

Q: Is MLX faster than llama.cpp on Apple Silicon?

On M5 Max hardware, MLX is the recommended framework. It is purpose-built for Apple Silicon's unified memory architecture and Metal 4 GPU, delivering approximately 20-30% better token generation performance than llama.cpp for decode tasks. MLX also tends to have faster prompt processing due to deep unified memory integration. Additionally, Ollama (which uses GGUF/llama.cpp) has Metal 4 shader compilation issues on M5 Max as of March 2026.

Q: Is local AI cheaper than cloud APIs?

It depends on usage. A MacBook Pro M5 Max 128GB costs about $4,999. At 100K tokens per day (roughly 50-100 substantial AI interactions), local inference breaks even with cloud API costs within 15 months. At 500K tokens per day, break-even is 3 months. At 1M tokens per day, just 1.5 months. After break-even, every additional token is free. Electricity adds only $5-10 per month under heavy use.

Q: What is Mixture of Experts (MoE) and why does it matter?

MoE (Mixture of Experts) is a model architecture where only a fraction of parameters are activated per token. For example, Qwen 3 30B-A3B has 30 billion total parameters but activates only 3 billion per token. This gives it the speed of a 3B model (127.4 tok/s) with the quality of a 30B model. The tradeoff is that MoE models still need memory for all parameters (16.1 GB for Qwen 3 30B-A3B). On Apple Silicon machines with ample unified memory, MoE offers the best quality-to-speed ratio.

Q: Which model should I choose for coding on Mac?

For coding tasks, Devstral Small 24B (39.3 tok/s, 12.6 GB) is purpose-built for code generation and fits on any Mac with 32GB or more. Qwen 3 30B-A3B (127.4 tok/s, 16.1 GB) is excellent for coding while also being fast enough for real-time use. For smaller setups, Qwen 3 8B (105.4 tok/s, 4.4 GB) offers solid coding ability at ultra-fast speeds. For maximum quality and you have 64GB+, Llama 3.3 70B provides the strongest overall performance including code.

Q: Can I run AI models completely offline on a MacBook Pro?

Yes. Once you download a model, it runs entirely on local hardware with no internet connection required. Models are stored on your SSD and inference uses only your Mac's CPU, GPU, and unified memory. This is a key advantage over cloud APIs for travel, air-gapped environments, restricted networks, and privacy-sensitive work involving code, legal documents, or medical data.

Every review of the M5 Max says the same thing: it's fast for AI. That's not wrong. It's just boring, and it misses the point entirely.

Here's what nobody's talking about: a $2,499 MacBook with 64 GB of RAM and the 40-core GPU is a better AI machine than a $4,999 one with 128 GB and the 32-core GPU. GPU core count — not RAM amount — determines how fast your models run. I watched people on forums agonize over 64 GB vs 128 GB while completely ignoring whether they were getting the 32-core or 40-core chip. That's a 25% performance difference they never even considered.

I spent weeks benchmarking more than 20 open-source models on real M5 Max hardware using Apple's MLX framework. Not synthetic benchmarks. Not copy-pasted spec sheets from press releases. Actual measured performance, averaged across multiple runs, in isolated subprocesses, on a machine I bought at retail price. And the data told a story that contradicts the conventional wisdom in almost every way. The model everyone should be running? A 30B MoE that hits 127 tok/s. The RAM tier that makes sense for 90% of buyers? 64 GB, not 128 GB. The framework that wins? MLX, and it's not close. Below you'll find the full benchmark results, a RAM tier guide built on actual data, an honest MLX vs GGUF comparison, a cost break-even analysis that will save you from wasting money on cloud APIs, and a FAQ that answers the questions that actually matter.

Hot Takes — Backed by Data

128 GB is overkill for 90% of AI users. Every model worth running daily fits in 64 GB. The only reason to get 128 GB is if you're running Qwen 3 235B or loading three models simultaneously. That's a niche use case, not the default recommendation.
The 32-core GPU config is a trap. You save a few hundred dollars and lose 25% of your inference speed. Memory bandwidth is 460 GB/s vs 614 GB/s. On a machine you're buying specifically for AI, that's the wrong place to cut costs.
Qwen 3 30B-A3B is the best local model, period. 127 tok/s. 30B parameters of intelligence. 16 GB of RAM. Nothing else comes close on the quality-to-speed curve. If you're still running a 70B dense model as your daily driver, you're doing it wrong.
Ollama dropped the ball on M5 Max. Metal 4 shader compilation bugs in March 2026 make it unreliable on the latest hardware. MLX works flawlessly. This matters more than model selection for most people.
Cloud APIs are a bad deal at scale. If you're generating 100K+ tokens per day, you're burning money on cloud costs. A MacBook pays for itself in months, then every token is free forever.
70B dense models are overrated for daily use. 12.6 tok/s is fine for batch processing. It's painful for interactive chat. Qwen 3 30B-A3B gives you 90% of the quality at 10x the speed. Stop torturing yourself.
Nobody needs a 4B model. It's fast (179 tok/s) and it's impressive for demos, but the quality gap between 4B and 8B is enormous. Spend the extra 2 GB of RAM and run Qwen 3 8B instead. You'll thank me later.

Hardware: The One Spec That Matters

Forget CPU cores. Forget the Neural Engine. Memory bandwidth is the entire game for local AI.

Let me save you 20 minutes of reading spec sheets. Token generation during LLM inference is memory-bandwidth-bound. Every single token requires reading the entire model from memory. That means tok/s ≈ 614 GB/s ÷ model_size_GB predicts your real-world performance within 20–30%. And here's the part Apple's marketing conveniently buries: memory bandwidth is determined by GPU core count, not RAM amount. A 128 GB Mac with 32 GPU cores is slower for AI than a 64 GB Mac with 40 GPU cores. Read that again. The expensive one is slower.

M5 Max Chip Specifications

Spec	M5 Max (32-core GPU)	M5 Max (40-core GPU)
CPU Cores	18 (6 Super + 12 Performance)	18 (6 Super + 12 Performance)
GPU Cores	32	40
Neural Engine	16-core	16-core
GPU Neural Accelerators	Yes (new in M5)	Yes (new in M5)
Memory Bandwidth	460 GB/s	614 GB/s
Max Unified Memory	128 GB	128 GB
Process	TSMC 3nm (3rd gen)	TSMC 3nm (3rd gen)

      The only number that matters: 614 GB/s vs 460 GB/s. That's a 33% bandwidth gap. Every model, every time, every token. The 32-core GPU config should come with a warning label for AI buyers.
    

The unified memory architecture is the real reason Apple Silicon dominates local AI. An NVIDIA RTX 4090 has 24 GB of VRAM. The RTX 5090 has 32 GB. Exceed those limits and your model spills to system RAM over PCIe, and performance collapses. On the M5 Max, the GPU sees all 128 GB at full bandwidth. No VRAM wall. No layer offloading. No performance cliff. This is why a laptop runs 70B models that choke a $2,000 desktop GPU. The unified memory story is the one thing Apple's marketing team actually got right.

Memory Configurations for AI Workloads

Configuration	Unified Memory	Bandwidth	Best For
M5 Max 32-core GPU	36 GB	460 GB/s	Small models up to ~14B dense
M5 Max 32-core GPU	64 GB	460 GB/s	Mid-range models up to ~70B Q4 (slower)
M5 Max 40-core GPU	64 GB	614 GB/s	Mid-range models up to ~70B Q4 at full speed
M5 Max 40-core GPU	128 GB	614 GB/s	Frontier MoE, large dense, multi-model setups

All benchmarks in this guide were conducted on the 40-core GPU, 128 GB configuration.

Side-by-Side Model Comparison

Pick two models. See the truth. No marketing spin, just numbers.

Model A

Model B

Head to Head

Speed (tok/s)

RAM Usage (GB)

Efficiency (tok/s per GB)

RAM Tier Calculator

Stop guessing. See exactly which models fit at each RAM level (8 GB reserved for macOS).

RAM Allocation

Compatible (0)

Too Large (0)

Best Pick at 128 GB

The Benchmark Numbers (No Spin)

All models tested at 4-bit quantization on MLX, MacBook Pro M5 Max 40-core GPU, 128GB. Averaged across 3-5 passes.

I'm going to say what the press-release reviews won't: most of these models are interchangeable for everyday tasks. An 8B model at 105 tok/s and a 7B model at 111 tok/s? You will not notice the difference. Stop obsessing over single-digit tok/s differences between models in the same size class. What matters is picking the right size class for your workload and RAM.

The real story in this data is Mixture of Experts. Qwen 3 30B-A3B activates only 3 billion of its 30 billion parameters per token. That gives it the speed of a small model (127 tok/s) with the knowledge of a large one. It sits in 16 GB of RAM and it's smarter than any dense model under 27B. This is the architecture that changes everything, and it's why I keep saying: stop running 70B dense models as your daily driver. A 70B model at 12.6 tok/s is useful for batch processing. It's miserable for interactive chat. Qwen 3 30B-A3B at 127 tok/s is fast enough that you forget you're running it locally. Click any column header to sort the table.

Text Generation Models

Model ▲▼	Params ▲▼	Type ▲▼	tok/s ▲▼	Speed	TTFT ▲▼	Memory ▲▼	Tier

Table 1: Text generation benchmarks on M5 Max 40-core GPU, 128 GB, MLX 4-bit quantization. Click column headers to sort.

Vision Language Models

Model	Params	tok/s	Speed	TTFT	Memory	Tier

Table 2: Vision Language Model benchmarks on M5 Max. See the VLM section for detailed analysis.

    The number that should end every argument: Qwen 3 30B-A3B achieves 127.4 tok/s while using 16.1 GB of memory. That's 30B parameters of intelligence at 8B-class speed. If you take one thing from this entire article, let it be this model name.
  

Performance Formula

Token generation is memory-bandwidth-bound:

        tok/s ≈ 614 GB/s ÷ Model Size (GB)
      

Predicts real-world performance within 20-30%. The gap is from KV cache overhead, compute, and framework efficiency.

What Surprised Me

TTFT under 200ms for all models under 15B — this is faster than most cloud APIs
Even 70B models respond within 730ms — faster than GPT-4o's typical first-token latency
Memory at Q4 follows ~0.5 GB per billion params with eerie consistency
MoE models don't just bend the speed curve — they break it

Vision Models: The Sleeper Hit Nobody Talks About

Local image understanding at 179 tok/s. No upload to cloud servers. No privacy concerns. No API bill.

Vision Language Models are the most underrated capability of local AI on Mac. Why is everyone still uploading screenshots and documents to cloud APIs when Gemma 3 4B VLM runs locally at 179 tok/s using 2.4 GB of RAM? That's not a typo. A vision model that fits in the RAM of a smartwatch processes images faster than you can read the output. For document analysis, OCR, screenshot parsing, chart reading — this is the workflow that should make every privacy-conscious developer switch to local inference. No image of your codebase, your financial documents, or your client's data ever leaves your machine.

178.7

tok/s

Gemma 3 4B VLM

Fastest VLM — 2.4 GB

110.7

tok/s

Qwen3-VL 8B

Best VLM value — 4.4 GB

27.3

tok/s

Qwen3-VL 32B

Highest quality VLM — 17.3 GB

Here's my take: for 80% of vision tasks — summarization, classification, data extraction — Gemma 3 4B VLM running locally at 179 tok/s is better than paying for a cloud API. It's faster (under 200ms TTFT vs 500ms+ for cloud), it's free, and your data stays local. When you need serious visual reasoning, Qwen3-VL 32B at 27.3 tok/s is the best image comprehension I've tested locally, and it fits in a 64 GB Mac with room to spare. Stop paying per-image API fees for tasks a $0 local model handles.

128 GB Memory Map

Each cell represents 1 GB of unified memory. Click a model category to highlight its cells.

128 GB Memory Pool

Each cell = 1 GB

Speed Tier Summary

Anything above 30 tok/s feels real-time. Below that, you'll feel the wait. Choose accordingly.

Efficiency Rankings: Bang for Your Byte

Tokens per second per GB of RAM. The metric that actually tells you which models are worth their memory footprint.

Rank	Model	Type	tok/s	Memory	tok/s per GB	Quality	Agentic	Efficiency

Table 3: Models ranked by efficiency with quality and agentic scores.

Quality Evaluations

Speed without quality is worthless. Here's the truth about reasoning ability.

The speed numbers above are impressive, but they mean nothing if the model outputs garbage. I ran three standard quality benchmarks: ARC-Challenge (science reasoning), GSM8K (math), and IFEval (instruction following). The results reveal a clear quality tier that speed alone can't show you.

Loading quality evaluation data...

Agentic Benchmarks

This is where local AI falls apart. Most models can't even produce valid JSON.

The agentic results are the most damning evidence that local models aren't ready for real autonomous work. I ran 14 terminal-bench tasks — real-world shell operations in Docker containers. Parse errors dominated: most small models simply cannot produce the structured JSON output the agent loop requires. Only the simplest tasks (fix file permissions, create a file) had >30% pass rates. For context, Claude Sonnet 4.5 hits 50% on the full terminal-bench v1.0 suite. Our best local model barely cracks 25%.

Loading agentic benchmark data...

View detailed failure analysis in the Eval Dashboard →

RAM Tier Recommendations (The Honest Ones)

Everyone says you need 128 GB. You don't. Here's what each tier actually gets you.

The internet consensus is that you need 128 GB for local AI. That advice is wrong for most people, and it's costing them $1,000+ in unnecessary upgrades. At Q4 quantization, the best daily-driver model (Qwen 3 30B-A3B) uses 16.1 GB. Even the biggest model most people would want for interactive use (a 32B dense model) uses 17.3 GB. You need 64 GB to run 70B models, and even then, you're running them at 12.6 tok/s — which is usable, not pleasant. The 128 GB config only makes sense if you're running Qwen 3 235B, loading multiple models simultaneously, or want 70B at Q8 quality. That's a real use case, but it's not most people. Save the $1,000 and get 64 GB with the 40-core GPU.

32 GB

Underrated

~22-26 GB usable. Runs the single best model (Qwen 3 30B-A3B) with room to spare. Honestly enough for most hobbyists.

Qwen 3 30B-A3B127.4 tok/s · 16.1 GB
Gemma 3 27B30.9 tok/s · 15.2 GB
Phi-4 14B62.0 tok/s · 7.8 GB
Gemma 3 4B178.7 tok/s · 2.4 GB

Fits all models up to 27B at Q4. That covers 90% of daily AI tasks.

64 GB

The Right Answer

~48-54 GB usable. Runs every model worth running interactively, including 70B dense. This is the config I'd buy.

Qwen 3 30B-A3B127.4 tok/s · 16.1 GB
Llama 3.3 70B12.6 tok/s · 37.1 GB
Devstral 24B39.3 tok/s · 12.6 GB
Qwen 3 32B25.7 tok/s · 17.3 GB

Get the 40-core GPU. Seriously. The 32-core at 64 GB is 25% slower for every single model.

128 GB

Niche

~100-110 GB usable. For frontier MoE models, 70B at Q8, or multi-model setups. You know if you need this.

Qwen 3 30B-A3B127.4 tok/s · 16.1 GB
Llama 3.3 70B12.6 tok/s · 37.1 GB
Devstral 24B39.3 tok/s · 12.6 GB
DeepSeek R1 32B24.9 tok/s · 17.3 GB

Can run Qwen 3 235B-A22B (~118 GB) at Q4 for frontier quality. That's the real reason to buy this tier.

MLX vs GGUF: One Clear Winner

This is not a close call. On M5 Max, MLX wins on speed, stability, and integration. GGUF wins on ecosystem breadth. That's it.

MLX wins. Full stop. On M5 Max hardware, Apple's MLX framework is 20–30% faster than llama.cpp for token generation. It's purpose-built for unified memory and Metal 4, and it just works. Meanwhile, Ollama — the most popular GGUF front-end — shipped with broken Metal 4 shader compilation on M5 Max as of March 2026. If you bought a brand new MacBook Pro and tried to run Ollama, you got shader errors. That's not a minor issue. That's the most popular local AI tool failing on the most popular AI hardware.

The one area where GGUF legitimately wins is model selection. There are tens of thousands of GGUF models on HuggingFace compared to thousands for MLX. If you need an obscure fine-tuned model, GGUF has it. But for the mainstream models that 95% of people run, MLX has them all through the mlx-community org on HuggingFace. The cross-platform argument for GGUF matters if you also run models on Linux or Windows. If your AI machine is a Mac — and if you're reading this, it is — use MLX.

MLX (Apple)

Built for Apple Silicon's unified memory & Metal 4 GPU
Native Python library (mlx-lm) with fine-tuning support
~20-30% faster decode on Apple Silicon vs llama.cpp
Faster prompt processing (prefill) via deep memory integration
Recommended on M5 Max hardware
Thousands of models on HuggingFace (mlx-community)

GGUF (llama.cpp)

Cross-platform: CPU, CUDA, Metal, Vulkan
Tens of thousands of models on HuggingFace
Broader quantization options (IQ quants, mixed quantization)
Powers Ollama and LM Studio (GUI tools)
Note: Ollama 0.18.2 has Metal 4 shader bugs on M5 Max
Best for cross-platform portability needs

Bottom line: If your Mac has an M5 chip, use MLX. The speed advantage is real, the stability is better, and the mlx-lm library makes it dead simple. Only go GGUF if you need a model that doesn't exist in MLX format, or you're also running on non-Apple hardware. Check the benchmark tables — every number was measured on MLX.

Cloud APIs: Know When to Hold, Know When to Fold

Cloud models are still smarter. But how much smarter? And is that gap worth $25 per million output tokens?

Here's the honest truth: Claude Opus, Gemini 3.1 Pro, and GPT-5.2 are still better than any local model on the hardest reasoning tasks. Their Elo ratings (1490–1510) beat what you can run locally. But here's the question nobody asks: how often do you actually need that level of reasoning? For coding assistance, writing, summarization, data extraction, and general Q&A — which is 90% of daily AI use — a local 30B MoE model handles the job. Why would you pay $25/M output tokens for Claude Opus when Qwen 3 30B-A3B answers your question in 200ms for free? Cloud APIs make sense for the 10% of tasks that require frontier intelligence. They're a bad deal for everything else.

Model	Provider	Input $/M tok	Output $/M tok	Context	Arena Elo	Vision
Claude Opus 4.6	Anthropic	$5.00	$25.00	1M	~1505	Yes
Gemini 3.1 Pro	Google	$2.00	$12.00	1M	~1503	Yes
GPT-5.2	OpenAI	$1.75	$14.00	400K	~1490	Yes
Claude Sonnet 4.6	Anthropic	$3.00	$15.00	200K	~1480	Yes
GPT-5.2 Pro	OpenAI	$21.00	$168.00	400K	~1510	Yes
Gemini 2.5 Flash	Google	Free	Free	1M	~1450	Yes
DeepSeek V3.2 API	DeepSeek	$0.14	$0.28	164K	~1421	No
DeepSeek R1 API	DeepSeek	$0.55	$2.19	164K	~1430	No
GPT-4o	OpenAI	$2.50	$10.00	128K	~1460	Yes
Claude Haiku 4.5	Anthropic	$1.00	$5.00	200K	~1420	Yes

Table 4: Cloud API pricing and quality ratings as of March 2026. Prices subject to change.

Why I Run Local 90% of the Time

✓ Privacy: my code never hits someone else's server
✓ Latency: under 200ms TTFT beats every cloud API
✓ Cost: zero marginal cost after the hardware purchase
✓ Offline: works on planes, in coffee shops with bad WiFi, everywhere
✓ No rate limits: I generate as fast as the hardware allows

When I Still Pay for Cloud

✓ Tasks requiring Elo 1490+ reasoning (complex multi-step analysis)
✓ 100K+ token context windows (local models cap out around 32K usable)
✓ One-off tasks where I need absolute best quality
✓ When I'm away from my Mac and need AI on my phone
✓ Very low usage: a few queries a week doesn't justify $2,499+

Cost Break-Even: The Math Nobody Does

MacBook Pro M5 Max 128GB at ~$4,999 vs blended Sonnet-tier API pricing (~$9/M tokens).

Most people never do this math. They just keep swiping their credit card for API tokens because the per-query cost feels small. But it adds up. At Sonnet-tier blended pricing (~$9/M tokens), generating 100K tokens per day costs $27/month. That's $324/year. A MacBook Pro M5 Max at $4,999 pays for itself in 15 months at that rate. And after break-even? Every single token is free. Forever. If you're a developer who generates 500K+ tokens per day (and many do, between coding assistance and document processing), the Mac pays for itself in three months. Three months. After that, you're printing free tokens while cloud users are still paying per request.

Daily Token Usage	Monthly Cloud Cost	Break-Even	Verdict
10K tokens/day	~$2.70/mo	154 years	Cloud wins
100K tokens/day	~$27/mo	15 months	Toss-up
500K tokens/day	~$135/mo	3 months	Local wins
1M tokens/day	~$270/mo	1.5 months	Local wins
5M tokens/day	~$1,350/mo	~11 days	Local wins

Table 5: Break-even analysis assuming M5 Max 128 GB at $4,999 and blended API pricing of ~$9/M tokens.

    The number to remember: 100K tokens/day = 15 months to break even. That's roughly 50-100 substantial AI interactions. If you use AI seriously for work, you're almost certainly above that threshold. Stop renting tokens and start owning them.
  

How I Tested (And Why You Should Care)

No vendor sponsorship. No review units. No cherry-picked results. Here's exactly what I did.

I'm going to tell you something most benchmark articles won't: half the numbers you see online are garbage. People run a single inference pass, screenshot the output, and call it a benchmark. That's not data. That's an anecdote. Every number in this guide comes from 3 to 5 passes per model, run in isolated subprocesses on a machine I bought at full retail price. No thermal throttling cheats. No fresh-boot single-run maximums. Real sustained performance.

The test rig: MacBook Pro 16-inch, M5 Max 40-core GPU, 128 GB unified memory, macOS 16.x. Framework: MLX 0.31.1 with mlx-lm 0.31.1. All models at 4-bit quantization. Four standardized prompts covering Q&A, reasoning, coding, and structured output. Each run in its own subprocess to prevent memory contamination. Metrics: average generation tok/s, time to first token, peak RSS memory usage. Standard laptop cooling, no external fans or cooling pads. Quality benchmarks (Elo, MMLU-Pro, HumanEval) come from public leaderboards, not my own testing — I'm benchmarking inference speed, not model intelligence.

Test Configuration Summary

Hardware: MacBook Pro 16" M5 Max, 40-core GPU, 128 GB unified memory
OS: macOS 16.x (Darwin 25.3.0)
Framework: MLX 0.31.1, mlx-lm 0.31.1
Quantization: 4-bit (Q4) for all models
Test prompts: 4 standardized per model (Q&A, reasoning, coding, structured output)
Runs: 3-5 passes per model, averaged
Isolation: Each run in a separate subprocess for clean memory measurement

Frequently Asked Questions

The questions people actually ask me, answered with data instead of marketing copy.

I've answered these questions hundreds of times in forums, DMs, and comment sections. The answers below are based on my benchmark data and the testing methodology described above. No hedging. No weasel words. If the data says something, I say it.

What is the fastest AI model on MacBook Pro M5 Max?+

Gemma 3 4B is the fastest model we tested, generating 178.7 tokens per second on the M5 Max 40-core GPU using MLX at 4-bit quantization. It uses only 2.4 GB of memory and supports both text and vision input. For models with more knowledge, Qwen 3 30B-A3B (MoE) achieves 127.4 tok/s while packing 30 billion parameters of intelligence into 16.1 GB of memory. See the full benchmark table for all results.

How many tokens per second can the M5 Max generate?+

It follows the formula tok/s ≈ 614 GB/s ÷ model_size_GB. In practice: a 4B model generates ~179 tok/s, 8B ~105-113 tok/s, 14B ~55-62 tok/s, 27B ~31 tok/s, and 70B ~12-13 tok/s. MoE models like Qwen 3 30B-A3B break this curve with 127 tok/s despite using 16GB of memory because only 3B parameters are active per token.

Can I run a 70B model on MacBook Pro?+

Yes, but you need at least 64 GB of unified memory. At 4-bit quantization, a 70B dense model like Llama 3.3 70B uses 37.1 GB, leaving enough headroom on a 64 GB Mac for the OS and context window. On the M5 Max 40-core GPU, it generates at 12.6 tok/s with a 724ms time to first token. For 128 GB configurations, you can run 70B at Q8 for higher quality or keep multiple models loaded at once. See the RAM Tier Guide for configuration-specific recommendations.

Is MLX faster than llama.cpp on Apple Silicon?+

On M5 Max, MLX is the better choice. It is purpose-built for Apple Silicon's unified memory and Metal 4 GPU, providing ~20-30% better decode performance than llama.cpp. Ollama (GGUF-based) also has Metal 4 shader compilation issues on M5 Max as of March 2026. MLX tends to have faster prompt processing as well. See the full MLX vs GGUF comparison for details.

How much RAM do I need to run local AI models?+

At 4-bit quantization, model size in GB is roughly 0.5 times the number of billion parameters. Leave about 20% headroom for the OS and KV cache. 32 GB runs excellent 12B-27B models (31-69 tok/s). 64 GB runs frontier 70B dense models comfortably. 128 GB is for 70B at Q8, multi-model setups, or frontier MoE like Qwen 3 235B. The RAM Tier Guide has specific recommendations for each configuration.

Is local AI cheaper than cloud APIs?+

It depends on usage. At 100K tokens/day (~50-100 interactions), a MacBook Pro M5 Max 128 GB ($4,999) breaks even with cloud APIs within 15 months. At 500K tokens/day, break-even is 3 months. At 1M tokens/day, just 1.5 months. After break-even, every token is free. Electricity adds only ~$5-10/month. See the full cost break-even analysis.

What is Mixture of Experts (MoE) and why does it matter?+

MoE is a model architecture where only a fraction of parameters are activated per token. Qwen 3 30B-A3B has 30 billion total parameters but activates only 3 billion per token, giving it the speed of a 3B model (127.4 tok/s) with the quality of a 30B model. The tradeoff: MoE models still need memory for all parameters (16.1 GB). On Apple Silicon with ample unified memory, MoE offers the best quality-to-speed ratio available — making it ideal for all RAM tiers.

Which model should I choose for coding on Mac?+

For coding tasks, Devstral Small 24B (39.3 tok/s, 12.6 GB) is purpose-built for code generation and fits on any Mac with 32 GB+. Qwen 3 30B-A3B (127.4 tok/s) is excellent for coding while also being fast enough for real-time use. For smaller setups, Qwen 3 8B (105.4 tok/s, 4.4 GB) offers solid coding at ultra-fast speed. For maximum quality with 64 GB+, Llama 3.3 70B provides the strongest overall performance.

Can I run AI models completely offline on a MacBook Pro?+

Yes. Once downloaded, models run entirely on local hardware with no internet needed. This is a key advantage for travel, air-gapped environments, restricted networks, and privacy-sensitive work. Models are stored on SSD and inference uses only your Mac's CPU, GPU, and unified memory.

What models support vision (image input) on Mac?+

Several VLMs run well via MLX: Gemma 3 4B VLM (179 tok/s, 2.4 GB) is the fastest. Qwen3-VL 8B (111 tok/s, 4.4 GB) is the best value. Gemma 3 27B VLM (32 tok/s, 15.2 GB) and Qwen3-VL 32B (27 tok/s, 17.3 GB) provide the highest quality image understanding. All run entirely locally. See the VLM section for more.

Can I run AI models while doing other work?+

Yes. Apple Silicon uses a unified memory architecture where the CPU and GPU share the same memory pool. You do not need to copy model weights between system RAM and VRAM like you would with a discrete GPU. You can run AI inference alongside regular workloads such as a web browser, IDE, or creative apps. However, the model's memory footprint reduces what is available for other applications, so choose a model size that leaves enough headroom. For example, running a 17 GB model on a 64 GB Mac still leaves roughly 39 GB for macOS and your other apps.

What does tokens per second actually feel like?+

Average human reading speed is roughly 4–5 tokens per second. So any model generating above 5 tok/s is producing text faster than you can read it. At 25–30 tok/s, responses appear nearly instantaneous for short answers. At 100+ tok/s, even long multi-paragraph responses complete in just a few seconds. For coding assistants, higher speeds mean faster completions and less waiting between edits. In practice, anything above 30 tok/s feels real-time for interactive chat.

Can I run multiple models at once?+

Yes, as long as the combined memory footprint fits in your available RAM. For example, on a 128 GB Mac with approximately 8 GB reserved for the OS, you could simultaneously load Qwen 3 30B-A3B (16.1 GB) + Devstral 24B (12.6 GB) + Gemma 3 4B (2.4 GB) for a total of 31.1 GB, leaving over 88 GB free. On a 64 GB Mac, you could run a 7–8B model (4.4 GB) alongside a 14B model (7.8 GB) for about 12.2 GB total. Only one model generates tokens at a time using the GPU, but having multiple loaded avoids reload latency when switching between them.

What is the best model for each RAM tier?+

32 GB: Qwen 3 30B-A3B (127.4 tok/s, 16.1 GB) is the best overall pick, offering 30B-class quality at blazing speed. 64 GB: Llama 3.3 70B (12.6 tok/s, 37.1 GB) provides the strongest quality, while Qwen 3 30B-A3B remains the best speed-to-quality ratio. 128 GB: You can run everything, but Qwen 3 30B-A3B is still the daily driver for speed, Llama 3.3 70B for quality, and Devstral 24B (39.3 tok/s, 12.6 GB) for coding. The sweet spot for most users is the 64 GB configuration with the 40-core GPU.

My Recommendations (No Caveats)

I'll make this simple. Three decisions, three clear answers.

Which Mac to buy: 64 GB with the 40-core GPU. Not 128 GB. Not the 32-core GPU. The 64 GB / 40-core config runs every model you'd want for daily interactive use, at full 614 GB/s bandwidth, for $1,000 less than the 128 GB version. The only people who need 128 GB are those running Qwen 3 235B, multi-model setups, or 70B at Q8 quality. If that's you, you already know it. Everyone else: save the money.

Which model to run: Qwen 3 30B-A3B as your daily driver. 127 tok/s, 16.1 GB, 30B parameters of intelligence. It fits on every Mac from 32 GB up. Add Devstral 24B for coding and Gemma 3 4B VLM for fast vision tasks. If you have 64 GB and need maximum reasoning quality for specific tasks, keep Llama 3.3 70B around — but don't make it your default. At 12.6 tok/s, it's a specialist tool, not a daily driver.

Which framework to use: MLX. 20-30% faster than GGUF on Apple Silicon, no shader bugs, native Python library, and every mainstream model is available. Install mlx-lm, download from the mlx-community on HuggingFace, and start generating. It takes five minutes to go from zero to 127 tok/s.

The M5 Max changed the economics of local AI. A laptop now runs models that required server racks two years ago. The open-source model ecosystem is advancing faster than the cloud providers can cut prices. And every new model release works on the hardware you already own, with zero additional cost. Stop paying rent on intelligence. Own it.