I Benchmarked 20+ AI Models on My M5 Max — Here's What Actually Happened

So I got my hands on the new M5 Max MacBook Pro — the 128 GB, 40-core GPU model — and I did what any reasonable person would do: I spent the entire weekend benchmarking AI models on it. I had a spreadsheet. I had coffee. I had a Python script that spawned isolated subprocesses. It was a whole thing.

Here's the thing: I went into this expecting the small models to be fast and the big models to be slow. That part was obvious. What I didn't expect was how fast the small ones would be (nearly 180 tokens per second, which is genuinely absurd), or that a 70B model would feel usable on a laptop, or that Ollama would just... break because of a Metal 4 shader bug. I also didn't expect a 30-billion-parameter MoE model to outrun most 8B models. But I'm getting ahead of myself.

This page has everything: the raw benchmark numbers, interactive tools so you can compare models side by side, a RAM calculator that tells you what fits on your Mac, a RAM tier guide with my actual picks, an honest look at MLX vs GGUF, a cost breakdown vs cloud APIs, and a FAQ that covers the questions I kept getting asked while writing this up.

The Hardware: What's Actually in This Machine

Apple M5 Max, announced March 3, 2026 — TSMC 3nm, with Neural Accelerators baked into every GPU core.

Before I get into the numbers, it helps to understand why the hardware matters so much. When you're generating tokens with a language model, the bottleneck isn't compute — it's memory bandwidth. Every single token the model generates requires reading the entire set of model weights from memory. So the speed formula is pretty simple: tok/s ≈ 614 GB/s ÷ model_size_GB. That's it. That formula predicted my real-world results within about 20–30%, which I found kind of wild. The gap comes from KV cache overhead, compute ops, and framework efficiency. But bandwidth is the main event.
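Here's a minimal sketch of that formula in Python, checked against three of the measured results from this post. One assumption to flag: I'm plugging in each model's measured memory footprint as its "size", which slightly overstates the weights-only figure. The function names are just illustrative.

```python
# Bandwidth-bound estimate: every generated token reads all model weights once.
BANDWIDTH_GBPS = 614.0  # M5 Max 40-core GPU, from the spec table


def estimate_tok_per_s(model_size_gb: float, bandwidth_gbps: float = BANDWIDTH_GBPS) -> float:
    """Ceiling on decode speed if memory bandwidth were the only cost."""
    return bandwidth_gbps / model_size_gb


def gap_percent(measured: float, estimated: float) -> float:
    """How far the measured speed falls below the estimate, in percent."""
    return (1.0 - measured / estimated) * 100.0


# (model, memory footprint in GB, measured tok/s) as quoted in this post
MEASURED = [
    ("Gemma 3 4B", 2.4, 178.7),
    ("Qwen 3 8B", 4.4, 105.4),
    ("Llama 3.3 70B", 37.1, 12.6),
]

for name, size_gb, tok_s in MEASURED:
    est = estimate_tok_per_s(size_gb)
    gap = gap_percent(tok_s, est)
    print(f"{name}: estimate {est:.1f} tok/s, measured {tok_s} tok/s, gap {gap:.0f}%")
```

Swap in 460.0 for the bandwidth to see what the same models would be capped at on the 32-core GPU.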

M5 Max Chip Specifications

Spec	M5 Max (32-core GPU)	M5 Max (40-core GPU)
CPU Cores	18 (6 Super + 12 Performance)	18 (6 Super + 12 Performance)
GPU Cores	32	40
Neural Engine	16-core	16-core
GPU Neural Accelerators	Yes (new in M5)	Yes (new in M5)
Memory Bandwidth	460 GB/s	614 GB/s
Max Unified Memory	128 GB	128 GB
Process	TSMC 3nm (3rd gen)	TSMC 3nm (3rd gen)

This tripped me up at first: memory bandwidth is set by GPU core count, NOT RAM amount. A 64GB Mac with the 32-core GPU (460 GB/s) is ~25% slower than a 64GB Mac with the 40-core GPU (614 GB/s). Same RAM, very different speeds.

The other big deal is the unified memory architecture. On a regular PC with an NVIDIA GPU, you've got 24 GB of VRAM on an RTX 4090, or 32 GB on a 5090. If your model doesn't fit in VRAM, you're offloading layers to system RAM over PCIe, and that's dramatically slower. On the M5 Max, the GPU just... accesses all 128 GB directly at full bandwidth. There's no VRAM vs system RAM distinction. It's one big pool. That's what makes running a 70B model on a laptop even possible.

Memory Configurations for AI Workloads

Configuration	Unified Memory	Bandwidth	Best For
M5 Max 32-core GPU	36 GB	460 GB/s	Small models up to ~14B dense
M5 Max 32-core GPU	64 GB	460 GB/s	Mid-range models up to ~70B Q4 (slower)
M5 Max 40-core GPU	64 GB	614 GB/s	Mid-range models up to ~70B Q4 at full speed
M5 Max 40-core GPU	128 GB	614 GB/s	Frontier MoE, large dense, multi-model setups

All my benchmarks were run on the 40-core GPU, 128 GB configuration.

Compare Any Two Models

Pick two models and see how they stack up on speed, memory, and efficiency.
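If you'd rather do the comparison by hand, this is roughly what it boils down to. It's a sketch, not the code behind this page: the `ModelResult` class and `head_to_head` function are names I made up for illustration, and the numbers are the ones quoted elsewhere in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    name: str
    tok_per_s: float
    memory_gb: float

    @property
    def efficiency(self) -> float:
        """Tokens per second per GB of memory used."""
        return self.tok_per_s / self.memory_gb


def head_to_head(a: ModelResult, b: ModelResult) -> dict:
    """Return the winner's name for each metric (lower memory wins)."""
    return {
        "speed": a.name if a.tok_per_s >= b.tok_per_s else b.name,
        "memory": a.name if a.memory_gb <= b.memory_gb else b.name,
        "efficiency": a.name if a.efficiency >= b.efficiency else b.name,
    }


qwen_moe = ModelResult("Qwen 3 30B-A3B", 127.4, 16.1)
qwen_8b = ModelResult("Qwen 3 8B", 105.4, 4.4)

print(head_to_head(qwen_moe, qwen_8b))
```

In this pairing the MoE model wins on raw speed, while the 8B wins on memory and on tok/s per GB.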

RAM Calculator: What Fits on Your Mac

Click a RAM tier to see which models you can actually run (I'm reserving 8 GB for macOS).
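The logic is simple enough to sketch in a few lines of Python. This is an illustration under my stated assumption (8 GB reserved for macOS), using the memory footprints I measured at 4-bit. It ignores KV cache growth, so treat "compatible" as "loads", not "comfortable". The `split_by_fit` name is hypothetical.

```python
OS_RESERVE_GB = 8.0  # what I set aside for macOS

# Measured memory footprints at 4-bit quantization, in GB (from this post)
MODELS_GB = {
    "Gemma 3 4B": 2.4,
    "Qwen 3 8B": 4.4,
    "Phi-4 14B": 7.8,
    "Devstral 24B": 12.6,
    "Gemma 3 27B": 15.2,
    "Qwen 3 30B-A3B": 16.1,
    "Qwen 3 32B": 17.3,
    "Llama 3.3 70B": 37.1,
}


def split_by_fit(total_ram_gb, models=MODELS_GB, os_reserve_gb=OS_RESERVE_GB):
    """Split models into (compatible, too_large) for a given RAM tier."""
    budget = total_ram_gb - os_reserve_gb
    compatible = sorted(name for name, gb in models.items() if gb <= budget)
    too_large = sorted(name for name, gb in models.items() if gb > budget)
    return compatible, too_large


for tier in (32, 64, 128):
    ok, too_big = split_by_fit(tier)
    print(f"{tier} GB: {len(ok)} compatible, {len(too_big)} too large {too_big}")
```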

The Actual Benchmark Numbers

All models at 4-bit quantization on MLX, M5 Max 40-core GPU, 128GB. Averaged across 3-5 runs.

Alright, here's the part you probably scrolled down for. I tested every model at 4-bit quantization using MLX, ran each one 3-5 times in isolated subprocesses, and averaged the results. The table below has everything. A few things jumped out at me. First, Gemma 3 4B at 179 tok/s is generating text so fast that a full paragraph appears in under a second. Second, the Qwen 3 30B-A3B (that's a Mixture of Experts model) somehow hits 127 tok/s despite having 30 billion parameters. It only activates 3B of them per token, which gives it 8B-class speed with way more knowledge. That felt like cheating, honestly. And third, even the 70B models are usable: 12-13 tok/s is nowhere near the small models, but it's still faster than most people read (roughly 4-5 tok/s), so it's not painful for longer tasks.

Text Generation Models

Model	Params	Type	tok/s	Speed	TTFT	Memory	Tier

Table 1: Text generation benchmarks on M5 Max 40-core GPU, 128 GB, MLX 4-bit quantization.

Vision Language Models

Model	Params	tok/s	Speed	TTFT	Memory	Tier

Table 2: Vision Language Model benchmarks on M5 Max. See the VLM section for my thoughts.

The MoE surprise: Qwen 3 30B-A3B hits 127.4 tok/s despite needing 16.1 GB of memory. It only fires 3B of its 30B params per token, so you get 8B speed with 30B brains. I didn't believe the numbers at first and re-ran it twice. More on MoE in the FAQ.

The Speed Formula

Token generation is memory-bandwidth-bound:

tok/s ≈ 614 GB/s ÷ Model Size (GB)

This predicted my real results within 20-30%. The gap is KV cache overhead, compute, and framework efficiency.

Things That Surprised Me

TTFT under 200ms for everything under 15B params
Even 70B models respond within 730ms
Memory at Q4 follows ~0.5 GB per billion params
MoE models completely break the speed-vs-size curve

Vision Models: Feeding Images to Local AI

Models that take images + text as input. Local document analysis, screenshot parsing, OCR, and visual Q&A — all without uploading anything.

I was curious whether the vision models would be noticeably slower than their text-only counterparts. Turns out, not really. Gemma 3 4B handles images at the same 179 tok/s as text. Qwen3-VL 8B was right there at 111 tok/s. The privacy angle here is what sold me — I was testing these by feeding in screenshots of my own code, photos of handwritten notes, and a few receipts. None of that data left my machine. I also tried running Pixtral (Mistral's VLM) but kept hitting an auth error during weight download, so it didn't make the final list. Something to revisit later.

Model	tok/s	Memory	Best for
Gemma 3 4B VLM	178.7	2.4 GB	Fastest VLM
Qwen3-VL 8B	110.7	4.4 GB	Best value VLM
Qwen3-VL 32B	27.3	17.3 GB	Best quality VLM

For quick vision tasks — summarizing a screenshot, reading a chart, extracting text from a photo — Gemma 3 4B VLM at 179 tok/s is faster than any cloud API I've used, and it costs nothing per query. When you need the model to actually understand complex images (dense diagrams, multi-page documents), Qwen3-VL 32B at 27.3 tok/s is the strongest I tested, and it fits comfortably on a 64 GB Mac. Check the RAM guide to see which VLMs work at your tier.

Efficiency Rankings: Speed per GB of RAM

Tokens per second divided by GB of memory used. Higher = more bang for your RAM buck.

Rank	Model	Type	tok/s	Memory	tok/s per GB	Quality	Agentic	Efficiency

Table 3: Models ranked by efficiency with quality and agentic scores.
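The efficiency number is just tok/s divided by memory, so you can recompute the ranking yourself. Here's a small sketch using the speed and memory figures quoted in this post (it leaves out the quality and agentic scores, and `rank_by_efficiency` is an illustrative name):

```python
# (model, tok/s, memory in GB) as quoted in this post
RESULTS = [
    ("Gemma 3 4B", 178.7, 2.4),
    ("Qwen 3 8B", 105.4, 4.4),
    ("Phi-4 14B", 62.0, 7.8),
    ("Qwen 3 30B-A3B", 127.4, 16.1),
    ("Devstral 24B", 39.3, 12.6),
    ("Llama 3.3 70B", 12.6, 37.1),
]


def rank_by_efficiency(results):
    """Sort models by tok/s per GB of memory, highest first."""
    scored = [(name, tok_s / mem_gb) for name, tok_s, mem_gb in results]
    return sorted(scored, key=lambda row: row[1], reverse=True)


for rank, (name, eff) in enumerate(rank_by_efficiency(RESULTS), start=1):
    print(f"{rank}. {name}: {eff:.1f} tok/s per GB")
```

Small dense models dominate this metric by construction, which is why I wouldn't pick a model on efficiency alone.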

Quality Evaluations

But can they actually think? I ran ARC-Challenge, GSM8K, and IFEval to find out.

Speed is great, but I also wanted to know if these models can actually reason. So I ran three standard benchmarks via lm-evaluation-harness: ARC-Challenge (grade-school science), GSM8K (math word problems with chain-of-thought), and IFEval (can the model follow precise formatting instructions). The composite score averages all three.

Agentic Benchmarks

I also tried making them do real terminal tasks. Most of them... couldn't.

Here's where things get interesting. I used terminal-bench to give each model 14 real tasks in fresh Docker containers: create files, fix git repos, install packages, write servers. The model has to figure everything out autonomously through shell commands. Green means it passed, red means it failed. Hover over the red cells to see why — most failures are "parse error" (the model couldn't even produce valid JSON) or "wrong commands" (it tried but got the shell commands wrong). For reference, Claude Sonnet 4.5 scores 50% on the full terminal-bench v1.0 suite.

My Actual Picks by RAM Tier

At Q4 quantization, model size is roughly params × 0.5 GB. I'd leave ~20% headroom for macOS and KV cache.

Look, picking the right model for your RAM is probably the most important thing you can do here. Running a model that barely fits means you've got no room for context windows, KV cache growth, or having Chrome open (and you'll have Chrome open). My picks below are based on the benchmarks I ran, with real headroom accounted for. Rule of thumb: don't use more than 80% of your memory for the model itself. Leave the rest for macOS, context, and whatever else you've got running.
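Both rules of thumb fit in a few lines of Python. This is a rough sketch: the 0.5 GB per billion parameters figure is an approximation (my measured footprints came in a bit higher), and the function names are mine.

```python
GB_PER_BILLION_PARAMS_Q4 = 0.5  # rough Q4 rule of thumb
MAX_MODEL_FRACTION = 0.80       # leave ~20% for macOS, KV cache, and apps


def estimate_q4_size_gb(params_billion: float) -> float:
    """Approximate model size at 4-bit quantization."""
    return params_billion * GB_PER_BILLION_PARAMS_Q4


def fits_with_headroom(model_gb: float, total_ram_gb: float) -> bool:
    """True if the model uses no more than 80% of total memory."""
    return model_gb <= total_ram_gb * MAX_MODEL_FRACTION


# Estimate vs. what I measured for Llama 3.3 70B (37.1 GB)
print(estimate_q4_size_gb(70))       # 35.0
print(fits_with_headroom(37.1, 64))  # True: 37.1 GB is under 51.2 GB
print(fits_with_headroom(37.1, 36))  # False: 37.1 GB is over 28.8 GB
```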

32 GB

Capable

~22-26 GB usable for models. You'd be surprised how much you can do here.

Qwen 3 30B-A3B: 127.4 tok/s · 16.1 GB
Gemma 3 27B: 30.9 tok/s · 15.2 GB
Phi-4 14B: 62.0 tok/s · 7.8 GB
Gemma 3 4B: 178.7 tok/s · 2.4 GB

Everything up to 27B fits at Q4. Qwen 3 30B-A3B is the star here.

64 GB

Sweet Spot

~48-54 GB usable. This is where it gets fun. 70B models fit. MoE models have tons of room.

Qwen 3 30B-A3B: 127.4 tok/s · 16.1 GB
Llama 3.3 70B: 12.6 tok/s · 37.1 GB
Devstral 24B: 39.3 tok/s · 12.6 GB
Qwen 3 32B: 25.7 tok/s · 17.3 GB

Make sure you have the 40-core GPU. The 32-core is ~25% slower at the same RAM.

128 GB

Overkill (in a good way)

~100-110 GB usable. Frontier MoE, 70B at Q8, or just load three models at once and switch between them.

Qwen 3 30B-A3B: 127.4 tok/s · 16.1 GB
Llama 3.3 70B: 12.6 tok/s · 37.1 GB
Devstral 24B: 39.3 tok/s · 12.6 GB
DeepSeek R1 32B: 24.9 tok/s · 17.3 GB

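Qwen 3 235B-A22B (~118 GB at Q4) squeezes into 128 GB if you want to go all out, but that's past the ~100-110 GB I'd call usable, so expect almost no headroom for context.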

MLX vs GGUF: Why I Went with MLX (and Why Ollama Broke)

Apple's MLX framework vs. the GGUF/llama.cpp ecosystem. Both work, but one works better on this hardware.

So I started this project using Ollama because, honestly, it's the easiest way to get a model running. Download, run, done. But on the M5 Max, Ollama kept throwing Metal 4 shader compilation errors. This is a known issue with Ollama 0.18.2 on M5 hardware — it uses GGUF/llama.cpp under the hood, and the Metal shaders haven't been updated for Metal 4 yet. I wasted a good two hours debugging that before switching to MLX. Once I did, everything just worked, and the performance was noticeably better anyway — roughly 20-30% faster on decode. MLX is purpose-built for Apple Silicon's unified memory and Metal 4, so that makes sense. The downside is the model ecosystem is smaller (though the mlx-community on HuggingFace has thousands of models now) and you don't get the nice GUI that Ollama or LM Studio provide.

MLX (Apple)

Built for Apple Silicon's unified memory & Metal 4 GPU
Native Python library (mlx-lm) with fine-tuning support
~20-30% faster decode on Apple Silicon vs llama.cpp
Faster prompt processing (prefill) via deep memory integration
Recommended on M5 Max hardware
Thousands of models on HuggingFace (mlx-community)

GGUF (llama.cpp)

Cross-platform: CPU, CUDA, Metal, Vulkan
Tens of thousands of models on HuggingFace
Broader quantization options (IQ quants, mixed quantization)
Powers Ollama and LM Studio (GUI tools)
Note: Ollama 0.18.2 has Metal 4 shader bugs on M5 Max
Best for cross-platform portability needs

My take: Use MLX if you're on M5 Max. It's just faster, and it actually works without shader bugs right now. Use GGUF if you need cross-platform support or a wider model selection. Both are improving fast. The benchmark tables above are all MLX numbers.
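If you want to try MLX yourself, this is about the smallest useful script. It's a sketch, not my benchmark harness: it uses the `load` and `generate` functions from mlx-lm, the model repo name is an assumption (check mlx-community on HuggingFace for the exact ID), and the timing includes prompt processing, so it will read a bit lower than pure decode speed.

```python
import time


def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Plain throughput: tokens produced divided by wall-clock seconds."""
    return n_tokens / elapsed_s


def time_generation(model_id: str, prompt: str, max_tokens: int = 256) -> float:
    """Load an MLX model, generate once, and return rough end-to-end tok/s."""
    # Imported here so the helper above works without MLX installed.
    # Requires `pip install mlx-lm` on an Apple Silicon Mac.
    from mlx_lm import load, generate

    model, tokenizer = load(model_id)
    start = time.perf_counter()
    text = generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)
    elapsed = time.perf_counter() - start
    # Re-tokenize the output to count generated tokens (approximate).
    n_tokens = len(tokenizer.encode(text))
    return tokens_per_second(n_tokens, elapsed)


if __name__ == "__main__":
    # Assumed repo name for illustration; substitute the model you downloaded.
    speed = time_generation(
        "mlx-community/Qwen3-30B-A3B-4bit",
        "Explain unified memory in two sentences.",
    )
    print(f"{speed:.1f} tok/s (includes prefill time)")
```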

How Do These Compare to Cloud APIs?

API pricing as of March 2026. Cloud is still better at the very top of the quality curve, but local is closer than I expected.

I think it's only fair to compare local models against the cloud options. The top cloud models (Claude Opus, Gemini 3.1 Pro, GPT-5.2) still outperform anything you can run locally on the hardest reasoning benchmarks. That's just where things are in 2026. But for the stuff I actually use AI for day-to-day — coding help, writing drafts, summarizing docs, data extraction — a locally-run 30B or 70B model handles it well. The real wins for local are privacy (nothing leaves my machine), latency (sub-200ms TTFT vs 500ms-2s for APIs), and zero marginal cost once you've bought the hardware. The cost analysis below breaks down when local starts saving you money.

Model	Provider	Input $/M tok	Output $/M tok	Context	Arena Elo	Vision
Claude Opus 4.6	Anthropic	$5.00	$25.00	1M	~1505	Yes
Gemini 3.1 Pro	Google	$2.00	$12.00	1M	~1503	Yes
GPT-5.2	OpenAI	$1.75	$14.00	400K	~1490	Yes
Claude Sonnet 4.6	Anthropic	$3.00	$15.00	200K	~1480	Yes
GPT-5.2 Pro	OpenAI	$21.00	$168.00	400K	~1510	Yes
Gemini 2.5 Flash	Google	Free	Free	1M	~1450	Yes
DeepSeek V3.2 API	DeepSeek	$0.14	$0.28	164K	~1421	No
DeepSeek R1 API	DeepSeek	$0.55	$2.19	164K	~1430	No
GPT-4o	OpenAI	$2.50	$10.00	128K	~1460	Yes
Claude Haiku 4.5	Anthropic	$1.00	$5.00	200K	~1420	Yes

Table 4: Cloud API pricing and quality ratings as of March 2026. Prices change constantly.

Where Local Wins

✓ Privacy: sensitive data never leaves your machine
✓ Latency: under 200ms TTFT vs 500ms-2s for APIs
✓ Cost at scale: zero marginal cost after hardware
✓ Offline access: works anywhere, no internet needed
✓ No rate limits: generate as fast as hardware allows

Where Cloud Wins

✓ Highest absolute quality (Elo 1490-1510)
✓ No upfront hardware investment
✓ Always the latest frontier models
✓ 1M+ token context windows
✓ Low usage: cheaper than buying hardware

When Does Local Actually Save You Money?

M5 Max 128GB at ~$4,999 vs. roughly $9/M tokens (blended Sonnet-tier pricing).

This is the question everyone asks, and the honest answer is: it depends on how much you use it. I ran the numbers assuming a $4,999 MacBook Pro and blended API costs of about $9 per million tokens (that's roughly what Sonnet-tier models cost when you average input and output). If you're doing 10K tokens a day, maybe a handful of short conversations, local doesn't make financial sense. You'd be waiting 154 years to break even. Even at 100K tokens a day, which is where I land on heavy coding and writing days, that's only about $27 a month in API spend, so the hardware takes roughly 15 years to pay for itself. The math only starts working in your favor at much higher volume: about 3 years at 500K tokens/day, about 1.5 years at 1M tokens/day, and under 4 months at 5M tokens/day. After break-even, every token is free. Electricity is maybe $5-10/month even if you're running it hard.

Daily Token Usage	Monthly Cloud Cost	Break-Even	Verdict
10K tokens/day	~$2.70/mo	~154 years	Cloud wins
100K tokens/day	~$27/mo	~15 years	Cloud wins
500K tokens/day	~$135/mo	~3 years	Toss-up
1M tokens/day	~$270/mo	~1.5 years	Local wins
5M tokens/day	~$1,350/mo	~3.7 months	Local wins

Table 5: Break-even assuming M5 Max 128 GB at $4,999 and blended API pricing of ~$9/M tokens.
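The break-even arithmetic is easy to rerun with your own numbers. Here's a minimal sketch assuming a 30-day month and a flat blended price, and ignoring electricity and resale value. The function names are illustrative.

```python
def monthly_cloud_cost(tokens_per_day: float, price_per_million: float = 9.0,
                       days_per_month: int = 30) -> float:
    """Monthly API spend in dollars at a flat blended price per million tokens."""
    return tokens_per_day * days_per_month / 1_000_000 * price_per_million


def break_even_months(hardware_cost: float, tokens_per_day: float,
                      price_per_million: float = 9.0) -> float:
    """Months of API spend needed to equal the hardware purchase price."""
    return hardware_cost / monthly_cloud_cost(tokens_per_day, price_per_million)


# Light use: 10K tokens/day costs about $2.70/month in the cloud,
# which works out to roughly 154 years to pay off a $4,999 laptop.
months = break_even_months(4999, 10_000)
print(f"${monthly_cloud_cost(10_000):.2f}/mo, {months / 12:.0f} years to break even")
```

Change `price_per_million` to match the model you'd actually be replacing; against pricier frontier models the break-even point arrives sooner.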

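The short version: at ~$9/M tokens, cost alone only justifies the hardware at very high volume, roughly 3 years to break even at 500K tokens/day and about 1.5 years at 1M tokens/day. At 100K tokens/day (50-100 solid interactions) it's closer to 15 years, so at that level buy it for privacy, latency, and offline use rather than for savings. Electricity is negligible.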

How I Tested All of This

Hardware, software, prompts, and what I did to keep the numbers honest.

I wanted these numbers to be reproducible, so I was pretty careful about the setup. Everything ran on a single MacBook Pro 16-inch, M5 Max with the 40-core GPU and 128 GB of unified memory, running macOS 16.x. I used MLX 0.31.1 with mlx-lm 0.31.1, and every model was tested at 4-bit quantization. No cherry-picking quantization levels to make certain models look better.

Each model got four standardized prompts: a simple Q&A, a reasoning task, a coding task, and a structured output task. I ran 3-5 passes per model and averaged the results. The important part: every benchmark run happened in an isolated subprocess. That means clean memory measurement, no contamination from previous runs, and no warm-cache advantages. I measured tok/s (generation speed), time to first token (TTFT), and peak memory (RSS). The laptop was on its standard cooling — no external fans or cooling pads. Quality numbers (Elo ratings, MMLU-Pro, HumanEval) come from public leaderboards, not my own testing. And for the record: I bought this laptop at retail. No review unit, no vendor sponsorship, no early access.

Test Setup

Hardware: MacBook Pro 16" M5 Max, 40-core GPU, 128 GB unified memory
OS: macOS 16.x (Darwin 25.3.0)
Framework: MLX 0.31.1, mlx-lm 0.31.1
Quantization: 4-bit (Q4) for all models
Test prompts: 4 standardized per model (Q&A, reasoning, coding, structured output)
Runs: 3-5 passes per model, averaged
Isolation: Each run in a separate subprocess for clean memory measurement
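For anyone who wants to reproduce the isolation approach, here's a stripped-down sketch of the idea using only the standard library. It is not my actual benchmark script: the workload is a placeholder string of Python code, and the names (`run_isolated`, `average_elapsed`) are illustrative. Each pass runs in a fresh interpreter, so peak memory and caches don't carry over between runs.

```python
import json
import statistics
import subprocess
import sys

# Code executed inside each child process. The workload is pasted in verbatim.
CHILD_TEMPLATE = """
import json, resource, sys, time
start = time.perf_counter()
{workload}
elapsed = time.perf_counter() - start
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(json.dumps({{"elapsed_s": elapsed, "peak_rss_raw": peak, "platform": sys.platform}}))
"""


def run_isolated(workload: str) -> dict:
    """Run `workload` (unindented Python source) in a fresh subprocess."""
    code = CHILD_TEMPLATE.format(workload=workload)
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, check=True,
    )
    # The child prints its JSON report as the last line of stdout.
    result = json.loads(proc.stdout.strip().splitlines()[-1])
    # ru_maxrss is reported in bytes on macOS and kilobytes on Linux.
    divisor = 1024 ** 3 if result["platform"] == "darwin" else 1024 ** 2
    result["peak_rss_gb"] = result["peak_rss_raw"] / divisor
    return result


def average_elapsed(workload: str, passes: int = 3) -> float:
    """Average wall-clock time over several isolated passes."""
    return statistics.mean(run_isolated(workload)["elapsed_s"] for _ in range(passes))


if __name__ == "__main__":
    # Placeholder workload: allocate ~50 MB so peak RSS visibly moves.
    report = run_isolated("data = bytearray(50 * 1024 * 1024)")
    print(f"{report['elapsed_s']:.3f} s, peak RSS {report['peak_rss_gb']:.2f} GB")
```

In a real run the workload would load a model and generate tokens; the subprocess boundary is the part that matters.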

Questions I Keep Getting Asked

Stuff people asked me after reading early drafts of this post.

I shared a draft of this with a few friends and got a lot of the same questions back. I've collected them here. Most of the answers reference actual numbers from my testing, and I've tried to be specific rather than hand-wavy. If you don't see your question, the RAM Tier Guide and MLX vs GGUF section might cover it.

What is the fastest AI model on MacBook Pro M5 Max?

Gemma 3 4B is the fastest model I tested, generating 178.7 tokens per second on the M5 Max 40-core GPU using MLX at 4-bit quantization. It uses only 2.4 GB of memory and supports both text and vision input. For models with more knowledge, Qwen 3 30B-A3B (MoE) achieves 127.4 tok/s while packing 30 billion parameters of intelligence into 16.1 GB of memory. See the full benchmark table for all results.

How many tokens per second can the M5 Max generate?

It follows the formula tok/s ≈ 614 GB/s ÷ model_size_GB. In practice: a 4B model generates ~179 tok/s, 8B ~105-113 tok/s, 14B ~55-62 tok/s, 27B ~31 tok/s, and 70B ~12-13 tok/s. MoE models like Qwen 3 30B-A3B break this curve with 127 tok/s despite using 16GB of memory because only 3B parameters are active per token.

Can I run a 70B model on MacBook Pro?

Yes, but you need at least 64 GB of unified memory. At 4-bit quantization, a 70B dense model like Llama 3.3 70B uses 37.1 GB, leaving enough headroom on a 64 GB Mac for the OS and context window. On the M5 Max 40-core GPU, it generates at 12.6 tok/s with a 724ms time to first token. I fully expected the 70B model to choke, but it just... worked. For 128 GB configurations, you can run 70B at Q8 for higher quality or keep multiple models loaded at once. See the RAM Tier Guide for configuration-specific recommendations.

Is MLX faster than llama.cpp on Apple Silicon?

On M5 Max, MLX is the better choice. It's purpose-built for Apple Silicon's unified memory and Metal 4 GPU, providing ~20-30% better decode performance than llama.cpp. Ollama (GGUF-based) also has Metal 4 shader compilation issues on M5 Max as of March 2026 — I hit this myself and wasted a couple hours on it. MLX tends to have faster prompt processing as well. See the full MLX vs GGUF comparison for details.

How much RAM do I need to run local AI models?

At 4-bit quantization, model size in GB is roughly 0.5 times the number of billion parameters. Leave about 20% headroom for the OS and KV cache. 32 GB runs excellent 12B-27B models (31-69 tok/s). 64 GB runs frontier 70B dense models comfortably. 128 GB is for 70B at Q8, multi-model setups, or frontier MoE like Qwen 3 235B. The RAM Tier Guide has specific recommendations for each configuration.

Is local AI cheaper than cloud APIs?

It depends on usage. At a blended ~$9 per million tokens, 100K tokens/day (~50-100 interactions) is only about $27/month of API spend, so a MacBook Pro M5 Max 128 GB ($4,999) takes roughly 15 years to break even. At 500K tokens/day, break-even is about 3 years. At 1M tokens/day, about 1.5 years. After break-even, every token is free. Electricity adds only ~$5-10/month. See the full cost break-even analysis.

What is Mixture of Experts (MoE) and why does it matter?

MoE is a model architecture where only a fraction of parameters are activated per token. Qwen 3 30B-A3B has 30 billion total parameters but activates only 3 billion per token. In my testing that gave it 127.4 tok/s, faster than the dense 8B models I ran (105-113 tok/s), with the quality of a 30B model. The tradeoff: MoE models still need memory for all parameters (16.1 GB). On Apple Silicon with ample unified memory, MoE offers the best quality-to-speed ratio available, which makes it a good fit for any RAM tier with room for the full weights.

Which model should I choose for coding on Mac?

For coding tasks, Devstral Small 24B (39.3 tok/s, 12.6 GB) is purpose-built for code generation and fits on any Mac with 32 GB+. Qwen 3 30B-A3B (127.4 tok/s) is excellent for coding while also being fast enough for real-time use. For smaller setups, Qwen 3 8B (105.4 tok/s, 4.4 GB) offers solid coding at ultra-fast speed. For maximum quality with 64 GB+, Llama 3.3 70B provides the strongest overall performance.

Can I run AI models completely offline on a MacBook Pro?

Yes. Once downloaded, models run entirely on local hardware with no internet needed. This is a key advantage for travel, air-gapped environments, restricted networks, and privacy-sensitive work. Models are stored on SSD and inference uses only your Mac's CPU, GPU, and unified memory.

What models support vision (image input) on Mac?

Several VLMs run well via MLX: Gemma 3 4B VLM (179 tok/s, 2.4 GB) is the fastest. Qwen3-VL 8B (111 tok/s, 4.4 GB) is the best value. Gemma 3 27B VLM (32 tok/s, 15.2 GB) and Qwen3-VL 32B (27 tok/s, 17.3 GB) provide the highest quality image understanding. All run entirely locally. See the VLM section for more.

Can I run AI models while doing other work?

Yes. Apple Silicon uses a unified memory architecture where the CPU and GPU share the same memory pool. You don't need to copy model weights between system RAM and VRAM like you would with a discrete GPU. You can run AI inference alongside your browser, IDE, or creative apps. However, the model's memory footprint reduces what's available for other applications, so choose a model size that leaves enough headroom. For example, running a 17 GB model on a 64 GB Mac still leaves roughly 47 GB for macOS and your other apps, or about 39 GB for apps after the 8 GB I reserve for macOS.

What does tokens per second actually feel like?

Average human reading speed is roughly 4–5 tokens per second. So any model generating above 5 tok/s is producing text faster than you can read it. At 25–30 tok/s, responses appear nearly instantaneous for short answers. At 100+ tok/s, even long multi-paragraph responses complete in just a few seconds. For coding assistants, higher speeds mean faster completions and less waiting between edits. In practice, anything above 30 tok/s feels real-time for interactive chat.
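To put numbers on that, here's a quick sketch that converts tok/s into wall-clock wait time for a response of a given length. The 500-token response length is an arbitrary example, and the function name is mine. The speeds and the 724 ms TTFT are from my benchmarks.

```python
def seconds_to_generate(n_tokens: int, tok_per_s: float, ttft_s: float = 0.0) -> float:
    """Total wait: time to first token plus steady-state generation time."""
    return ttft_s + n_tokens / tok_per_s


RESPONSE_TOKENS = 500  # arbitrary example: a long multi-paragraph answer

for name, tok_s, ttft in [
    ("Gemma 3 4B", 178.7, 0.0),
    ("Qwen 3 30B-A3B", 127.4, 0.0),
    ("Llama 3.3 70B", 12.6, 0.724),
]:
    wait = seconds_to_generate(RESPONSE_TOKENS, tok_s, ttft)
    print(f"{name}: about {wait:.1f} s for {RESPONSE_TOKENS} tokens")
```

I only had a TTFT figure handy for the 70B model, so the other two rows leave it at zero.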

Can I run multiple models at once?

Yes, as long as the combined memory footprint fits in your available RAM. For example, on a 128 GB Mac with approximately 8 GB reserved for the OS, you could simultaneously load Qwen 3 30B-A3B (16.1 GB) + Devstral 24B (12.6 GB) + Gemma 3 4B (2.4 GB) for a total of 31.1 GB, leaving over 88 GB free. On a 64 GB Mac, you could run a 7–8B model (4.4 GB) alongside a 14B model (7.8 GB) for about 12.2 GB total. Only one model generates tokens at a time using the GPU, but having multiple loaded avoids reload latency when switching between them.
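The same arithmetic as a tiny Python sketch, in case you want to plug in your own combination. It assumes the 8 GB macOS reserve and counts only model weights, not KV cache. The function name is illustrative.

```python
def free_after_loading(total_ram_gb: float, models_gb: dict, os_reserve_gb: float = 8.0) -> float:
    """Memory left after reserving for macOS and loading every model in the dict."""
    return total_ram_gb - os_reserve_gb - sum(models_gb.values())


loaded = {
    "Qwen 3 30B-A3B": 16.1,
    "Devstral 24B": 12.6,
    "Gemma 3 4B": 2.4,
}

print(f"Combined footprint: {sum(loaded.values()):.1f} GB")
print(f"Free on 128 GB: {free_after_loading(128, loaded):.1f} GB")
```

A negative result means the combination doesn't fit and you'll need to unload something first.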

What is the best model for each RAM tier?

32 GB: Qwen 3 30B-A3B (127.4 tok/s, 16.1 GB) is the best overall pick, offering 30B-class quality at blazing speed. 64 GB: Llama 3.3 70B (12.6 tok/s, 37.1 GB) provides the strongest quality, while Qwen 3 30B-A3B remains the best speed-to-quality ratio. 128 GB: You can run everything, but Qwen 3 30B-A3B is still the daily driver for speed, Llama 3.3 70B for quality, and Devstral 24B (39.3 tok/s, 12.6 GB) for coding. The sweet spot for most users is the 64 GB configuration with the 40-core GPU.

What I'd Actually Recommend

So after a weekend of testing, here's where I landed. The M5 Max is genuinely good at this. Small models (4B-8B) generate text faster than you can read it — 105 to 179 tok/s. Mid-range models (12B-27B) feel comfortable for conversation at 31-69 tok/s. And the 70B models, which I honestly expected to struggle, run at 12-13 tok/s — not blazing, but totally usable for longer tasks. The standout is the Qwen 3 30B-A3B with its MoE architecture: 127 tok/s, 30B params worth of smarts, 16 GB of memory. I keep going back to it.

If you're buying a MacBook Pro for AI work, get the 64 GB configuration with the 40-core GPU. It runs 70B dense models and every MoE model, at the full 614 GB/s bandwidth. If you can swing 128 GB, you get frontier MoE models, higher quantization, and multi-model workflows. For software, MLX is the way to go on M5 Max right now — it's faster and doesn't have the shader bugs that are plaguing Ollama. Start with Qwen 3 30B-A3B for everyday use, Devstral 24B for code, and Gemma 3 4B VLM for fast vision tasks. The open-source model ecosystem is moving fast, and having this hardware means you're ready for whatever drops next without needing to pay per token.