So I got my hands on the new M5 Max MacBook Pro — the 128 GB, 40-core GPU model — and I did what any reasonable person would do: I spent the entire weekend benchmarking AI models on it. I had a spreadsheet. I had coffee. I had a Python script that spawned isolated subprocesses. It was a whole thing.
Here's the thing: I went into this expecting the small models to be fast and the big models to be slow. That part was obvious. What I didn't expect was how fast the small ones would be (nearly 180 tokens per second, which is genuinely absurd), or that a 70B model would feel usable on a laptop, or that Ollama would just... break because of a Metal 4 shader bug. I also didn't expect a 30-billion-parameter MoE model to outrun most 8B models. But I'm getting ahead of myself.
This page has everything: the raw benchmark numbers, interactive tools so you can compare models side by side, a RAM calculator that tells you what fits on your Mac, a RAM tier guide with my actual picks, an honest look at MLX vs GGUF, a cost breakdown vs cloud APIs, and a FAQ that covers the questions I kept getting asked while writing this up.
The Hardware: What's Actually in This Machine
Apple M5 Max, announced March 3, 2026 — TSMC 3nm, with Neural Accelerators baked into every GPU core.
Before I get into the numbers, it helps to understand why the hardware matters so much. When you're generating tokens with a language model, the bottleneck isn't compute, it's memory bandwidth. For a dense model, every single token requires reading the entire set of model weights from memory. So the speed formula is pretty simple: tok/s ≈ 614 GB/s ÷ model_size_GB, where 614 GB/s is the memory bandwidth of the 40-core chip I tested. That's it. That formula predicted my real-world results within about 20–30%, which I found kind of wild. The gap comes from KV cache overhead, compute ops, and framework efficiency. But bandwidth is the main event.
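To make the formula concrete, here's a minimal sketch of it in Python. The function names are mine, not from any library, and the two sample models use the memory and tok/s figures from my picks further down. Treat the estimate as a rough ceiling for dense models, not a prediction.

```python
# Bandwidth-bound decode estimate: tok/s ~= bandwidth / model size.
# Function names are illustrative; the numbers come from this post.
M5_MAX_40_CORE_BANDWIDTH_GBPS = 614.0  # from the spec table


def estimate_tok_s(model_size_gb, bandwidth_gbps=M5_MAX_40_CORE_BANDWIDTH_GBPS):
    """Decode speed if each token streams every weight from memory once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gbps / model_size_gb


def shortfall(measured_tok_s, model_size_gb,
              bandwidth_gbps=M5_MAX_40_CORE_BANDWIDTH_GBPS):
    """Fraction by which a measured speed falls below the estimate."""
    return 1.0 - measured_tok_s / estimate_tok_s(model_size_gb, bandwidth_gbps)


if __name__ == "__main__":
    # (name, measured memory in GB, measured tok/s) as reported in this post
    for name, size_gb, measured in [("Phi-4 14B", 7.8, 62.0),
                                    ("Llama 3.3 70B", 37.1, 12.6)]:
        est = estimate_tok_s(size_gb)
        gap = shortfall(measured, size_gb)
        print(f"{name}: estimate {est:.1f} tok/s, measured {measured}, gap {gap:.0%}")
```

Swap in 460 GB/s for `bandwidth_gbps` to see what the 32-core chip would give you on paper.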
M5 Max Chip Specifications
| Spec | M5 Max (32-core GPU) | M5 Max (40-core GPU) |
|---|---|---|
| CPU Cores | 18 (6 Super + 12 Performance) | 18 (6 Super + 12 Performance) |
| GPU Cores | 32 | 40 |
| Neural Engine | 16-core | 16-core |
| GPU Neural Accelerators | Yes (new in M5) | Yes (new in M5) |
| Memory Bandwidth | 460 GB/s | 614 GB/s |
| Max Unified Memory | 128 GB | 128 GB |
| Process | TSMC 3nm (3rd gen) | TSMC 3nm (3rd gen) |
The other big deal is the unified memory architecture. On a regular PC with an NVIDIA GPU, you've got 24 GB of VRAM on an RTX 4090, or 32 GB on a 5090. If your model doesn't fit in VRAM, you're offloading layers to system RAM over PCIe, and that's dramatically slower. On the M5 Max, the GPU just... accesses all 128 GB directly at full bandwidth. There's no VRAM vs system RAM distinction. It's one big pool. That's what makes running a 70B model on a laptop even possible.
Memory Configurations for AI Workloads
| Configuration | Unified Memory | Bandwidth | Best For |
|---|---|---|---|
| M5 Max 32-core GPU | 36 GB | 460 GB/s | Small models up to ~14B dense |
| M5 Max 32-core GPU | 64 GB | 460 GB/s | Mid-range models up to ~70B Q4 (slower) |
| M5 Max 40-core GPU | 64 GB | 614 GB/s | Mid-range models up to ~70B Q4 at full speed |
| M5 Max 40-core GPU | 128 GB | 614 GB/s | Frontier MoE, large dense, multi-model setups |
Compare Any Two Models
Pick two models and see how they stack up on speed, memory, and efficiency.
RAM Calculator: What Fits on Your Mac
Click a RAM tier to see which models you can actually run (I'm reserving 8 GB for macOS).
The Actual Benchmark Numbers
All models at 4-bit quantization on MLX, M5 Max 40-core GPU, 128GB. Averaged across 3-5 runs.
Alright, here's the part you probably scrolled down for. I tested every model at 4-bit quantization using MLX, ran each one 3-5 times in isolated subprocesses, and averaged the results. The table below has everything. A few things jumped out at me. First, Gemma 3 4B at 179 tok/s is generating text so fast that a full paragraph appears in under a second. Second, the Qwen 3 30B-A3B (that's a Mixture of Experts model) somehow hits 127 tok/s despite having 30 billion parameters. It only activates 3B of them per token, which gives it better-than-8B-class speed with way more knowledge. That felt like cheating, honestly. And third, even the 70B models are usable: 12-13 tok/s is slower than reading speed, but it's not painful for longer tasks.
Text Generation Models
| Model | Params | Type | tok/s | Speed | TTFT | Memory | Tier |
|---|---|---|---|---|---|---|---|
Vision Language Models
| Model | Params | tok/s | Speed | TTFT | Memory | Tier |
|---|---|---|---|---|---|---|
The Speed Formula
Token generation is memory-bandwidth-bound: tok/s ≈ 614 GB/s ÷ model_size_GB.
This predicted my real results within 20-30%. The gap is KV cache overhead, compute, and framework efficiency.
Things That Surprised Me
- TTFT under 200ms for everything under 15B params
- Even 70B models respond within 730ms
- Memory at Q4 follows ~0.5 GB per billion params
- MoE models completely break the speed-vs-size curve
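Here's a rough sketch of why the MoE result isn't magic. It reuses the bandwidth formula and the ~0.5 GB per billion params rule, but counts only the parameters read per token. That's a simplifying assumption on my part: it ignores routing, attention, and KV cache traffic, so the result is a loose upper bound, not a prediction. The function name is illustrative.

```python
# Back-of-envelope ceilings for Qwen 3 30B-A3B on a 614 GB/s chip.
# Assumption: at Q4, each billion parameters read costs ~0.5 GB of traffic.
BANDWIDTH_GBPS = 614.0
GB_PER_BILLION_PARAMS_Q4 = 0.5


def ceiling_tok_s(params_read_per_token_b):
    """Bandwidth-bound ceiling given billions of parameters read per token."""
    gb_read = params_read_per_token_b * GB_PER_BILLION_PARAMS_Q4
    return BANDWIDTH_GBPS / gb_read


if __name__ == "__main__":
    as_if_dense = ceiling_tok_s(30)   # if all 30B parameters were read
    active_only = ceiling_tok_s(3)    # if only the 3B active ones are read
    measured = 127.4                  # measured tok/s from this post
    print(f"dense-style ceiling: {as_if_dense:.1f} tok/s")
    print(f"active-only ceiling: {active_only:.1f} tok/s")
    print(f"measured:            {measured} tok/s")
```

The measured number lands between the two ceilings, which is what you'd expect if most, but not all, of the per-token memory traffic is the active experts.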
Vision Models: Feeding Images to Local AI
Models that take images + text as input. Local document analysis, screenshot parsing, OCR, and visual Q&A — all without uploading anything.
I was curious whether the vision models would be noticeably slower than their text-only counterparts. Turns out, not really. Gemma 3 4B handles images at the same 179 tok/s as text. Qwen3-VL 8B was right there at 111 tok/s. The privacy angle here is what sold me — I was testing these by feeding in screenshots of my own code, photos of handwritten notes, and a few receipts. None of that data left my machine. I also tried running Pixtral (Mistral's VLM) but kept hitting an auth error during weight download, so it didn't make the final list. Something to revisit later.
For quick vision tasks — summarizing a screenshot, reading a chart, extracting text from a photo — Gemma 3 4B VLM at 179 tok/s is faster than any cloud API I've used, and it costs nothing per query. When you need the model to actually understand complex images (dense diagrams, multi-page documents), Qwen3-VL 32B at 27.3 tok/s is the strongest I tested, and it fits comfortably on a 64 GB Mac. Check the RAM guide to see which VLMs work at your tier.
128 GB Memory Map
Each cell is 1 GB of unified memory. Click a category to see what's using what.
Speed Tiers at a Glance
I grouped the models by how fast they actually feel in practice.
Efficiency Rankings: Speed per GB of RAM
Tokens per second divided by GB of memory used. Higher = more bang for your RAM buck.
| Rank | Model | Type | tok/s | Memory | tok/s per GB | Quality | Agentic | Efficiency |
|---|---|---|---|---|---|---|---|---|
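
The efficiency metric is simple enough to compute yourself. Here's a minimal sketch using the tok/s and memory figures from my RAM tier picks; the function name and tuple layout are just for illustration, and this covers only the models listed in those picks.

```python
# Efficiency = tokens per second divided by GB of memory used.
# (name, tok/s, memory in GB), taken from the RAM tier picks in this post.
MODELS = [
    ("Qwen 3 30B-A3B", 127.4, 16.1),
    ("Gemma 3 27B", 30.9, 15.2),
    ("Phi-4 14B", 62.0, 7.8),
    ("Gemma 3 4B", 178.7, 2.4),
    ("Llama 3.3 70B", 12.6, 37.1),
    ("Devstral 24B", 39.3, 12.6),
    ("Qwen 3 32B", 25.7, 17.3),
    ("DeepSeek R1 32B", 24.9, 17.3),
]


def rank_by_efficiency(models):
    """Return (name, tok/s per GB) pairs, most efficient first."""
    scored = [(name, tok_s / mem_gb) for name, tok_s, mem_gb in models]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for rank, (name, score) in enumerate(rank_by_efficiency(MODELS), start=1):
        print(f"{rank}. {name}: {score:.2f} tok/s per GB")
```

Keep in mind this metric rewards small models heavily, so it says nothing about answer quality.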
Quality Evaluations
But can they actually think? I ran ARC-Challenge, GSM8K, and IFEval to find out.
Speed is great, but I also wanted to know if these models can actually reason. So I ran three standard benchmarks via lm-evaluation-harness: ARC-Challenge (grade-school science), GSM8K (math word problems with chain-of-thought), and IFEval (can the model follow precise formatting instructions). The composite score averages all three.
Agentic Benchmarks
I also tried making them do real terminal tasks. Most of them... couldn't.
Here's where things get interesting. I used terminal-bench to give each model 14 real tasks in fresh Docker containers: create files, fix git repos, install packages, write servers. The model has to figure everything out autonomously through shell commands. Most failures fall into two buckets: "parse error" (the model couldn't even produce valid JSON) or "wrong commands" (it tried but got the shell commands wrong). For reference, Claude Sonnet 4.5 scores 50% on the full terminal-bench v1.0 suite.
My Actual Picks by RAM Tier
At Q4 quantization, model size is roughly 0.5 GB per billion parameters. I'd leave ~20% headroom for macOS and KV cache.
Look, picking the right model for your RAM is probably the most important thing you can do here. Running a model that barely fits means you've got no room for context windows, KV cache growth, or having Chrome open (and you'll have Chrome open). My picks below are based on the benchmarks I ran, with real headroom accounted for. Rule of thumb: don't use more than 80% of your memory for the model itself. Leave the rest for macOS, context, and whatever else you've got running.
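The rule of thumb above is easy to turn into a quick check. This is a minimal sketch with helper names I made up for illustration. It uses the ~0.5 GB per billion params estimate rather than measured memory, so real footprints run a bit higher (Llama 3.3 70B measured 37.1 GB against an estimated 35 GB).

```python
# Does a Q4 model fit once you keep the model under 80% of unified memory?
GB_PER_BILLION_PARAMS_Q4 = 0.5   # rough Q4 size rule from this post
MAX_MODEL_FRACTION = 0.8         # leave ~20% for macOS, context, and apps


def q4_size_gb(params_billion):
    """Estimated Q4 footprint in GB for a model with this many billion params."""
    return params_billion * GB_PER_BILLION_PARAMS_Q4


def model_budget_gb(ram_gb, max_fraction=MAX_MODEL_FRACTION):
    """Memory you can hand to the model itself under the 80% rule."""
    return ram_gb * max_fraction


def fits(params_billion, ram_gb):
    """True if the estimated Q4 size stays within the model budget."""
    return q4_size_gb(params_billion) <= model_budget_gb(ram_gb)


if __name__ == "__main__":
    for ram_gb in (32, 64, 128):
        ok = [p for p in (4, 14, 27, 32, 70) if fits(p, ram_gb)]
        print(f"{ram_gb} GB: budget {model_budget_gb(ram_gb):.1f} GB, fits {ok} (B params)")
```

For a MoE model, pass the total parameter count, not the active count: all the experts still have to sit in memory.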
32 GB
Capable: ~22-26 GB usable for models. You'd be surprised how much you can do here.
- Qwen 3 30B-A3B: 127.4 tok/s · 16.1 GB
- Gemma 3 27B: 30.9 tok/s · 15.2 GB
- Phi-4 14B: 62.0 tok/s · 7.8 GB
- Gemma 3 4B: 178.7 tok/s · 2.4 GB
64 GB
Sweet Spot: ~48-54 GB usable. This is where it gets fun. 70B models fit. MoE models have tons of room.
- Qwen 3 30B-A3B: 127.4 tok/s · 16.1 GB
- Llama 3.3 70B: 12.6 tok/s · 37.1 GB
- Devstral 24B: 39.3 tok/s · 12.6 GB
- Qwen 3 32B: 25.7 tok/s · 17.3 GB
128 GB
Overkill (in a good way): ~100-110 GB usable. Frontier MoE, 70B at Q8, or just load three models at once and switch between them.
- Qwen 3 30B-A3B: 127.4 tok/s · 16.1 GB
- Llama 3.3 70B: 12.6 tok/s · 37.1 GB
- Devstral 24B: 39.3 tok/s · 12.6 GB
- DeepSeek R1 32B: 24.9 tok/s · 17.3 GB
MLX vs GGUF: Why I Went with MLX (and Why Ollama Broke)
Apple's MLX framework vs. the GGUF/llama.cpp ecosystem. Both work, but one works better on this hardware.
So I started this project using Ollama because, honestly, it's the easiest way to get a model running. Download, run, done. But on the M5 Max, Ollama kept throwing Metal 4 shader compilation errors. This is a known issue with Ollama 0.18.2 on M5 hardware — it uses GGUF/llama.cpp under the hood, and the Metal shaders haven't been updated for Metal 4 yet. I wasted a good two hours debugging that before switching to MLX. Once I did, everything just worked, and the performance was noticeably better anyway — roughly 20-30% faster on decode. MLX is purpose-built for Apple Silicon's unified memory and Metal 4, so that makes sense. The downside is the model ecosystem is smaller (though the mlx-community on HuggingFace has thousands of models now) and you don't get the nice GUI that Ollama or LM Studio provide.
MLX (Apple)
- Built for Apple Silicon's unified memory & Metal 4 GPU
- Native Python library (mlx-lm) with fine-tuning support
- ~20-30% faster decode on Apple Silicon vs llama.cpp
- Faster prompt processing (prefill) via deep memory integration
- Recommended on M5 Max hardware
- Thousands of models on HuggingFace (mlx-community)
GGUF (llama.cpp)
- Cross-platform: CPU, CUDA, Metal, Vulkan
- Tens of thousands of models on HuggingFace
- Broader quantization options (IQ quants, mixed quantization)
- Powers Ollama and LM Studio (GUI tools)
- Note: Ollama 0.18.2 has Metal 4 shader bugs on M5 Max
- Best for cross-platform portability needs
How Do These Compare to Cloud APIs?
API pricing as of March 2026. Cloud is still better at the very top of the quality curve, but local is closer than I expected.
I think it's only fair to compare local models against the cloud options. The top cloud models (Claude Opus, Gemini 3.1 Pro, GPT-5.2) still outperform anything you can run locally on the hardest reasoning benchmarks. That's just where things are in 2026. But for the stuff I actually use AI for day-to-day — coding help, writing drafts, summarizing docs, data extraction — a locally-run 30B or 70B model handles it well. The real wins for local are privacy (nothing leaves my machine), latency (sub-200ms TTFT vs 500ms-2s for APIs), and zero marginal cost once you've bought the hardware. The cost analysis below breaks down when local starts saving you money.
| Model | Provider | Input $/M tok | Output $/M tok | Context | Arena Elo | Vision |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | $5.00 | $25.00 | 1M | ~1505 | Yes |
| Gemini 3.1 Pro | Google | $2.00 | $12.00 | 1M | ~1503 | Yes |
| GPT-5.2 | OpenAI | $1.75 | $14.00 | 400K | ~1490 | Yes |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | 200K | ~1480 | Yes |
| GPT-5.2 Pro | OpenAI | $21.00 | $168.00 | 400K | ~1510 | Yes |
| Gemini 2.5 Flash | Google | Free | Free | 1M | ~1450 | Yes |
| DeepSeek V3.2 API | DeepSeek | $0.14 | $0.28 | 164K | ~1421 | No |
| DeepSeek R1 API | DeepSeek | $0.55 | $2.19 | 164K | ~1430 | No |
| GPT-4o | OpenAI | $2.50 | $10.00 | 128K | ~1460 | Yes |
| Claude Haiku 4.5 | Anthropic | $1.00 | $5.00 | 200K | ~1420 | Yes |
Where Local Wins
- ✓ Privacy: sensitive data never leaves your machine
- ✓ Latency: under 200ms TTFT vs 500ms-2s for APIs
- ✓ Cost at scale: zero marginal cost after hardware
- ✓ Offline access: works anywhere, no internet needed
- ✓ No rate limits: generate as fast as hardware allows
Where Cloud Wins
- ✓ Highest absolute quality (Elo 1490-1510)
- ✓ No upfront hardware investment
- ✓ Always the latest frontier models
- ✓ 1M+ token context windows
- ✓ Low usage: cheaper than buying hardware
When Does Local Actually Save You Money?
M5 Max 128GB at ~$4,999 vs. roughly $9/M tokens (blended Sonnet-tier pricing).
This is the question everyone asks, and the honest answer is: it depends on how much you use it. I ran the numbers assuming a $4,999 MacBook Pro and blended API costs of about $9 per million tokens (that's roughly what Sonnet-tier models cost when you average input and output). If you're doing 10K tokens a day, maybe a handful of short conversations, local doesn't make financial sense. You'd be waiting 154 years to break even. If you're like me and you're burning through 100K+ tokens a day on coding tasks and writing, that's still only about $27 a month of API spend, so break-even is roughly 15 years out. At 500K tokens/day it's about 3 years, at 1M tokens/day about a year and a half, and at 5M tokens/day under 4 months. After that, every token is free apart from electricity, which is maybe $5-10/month even if you're running it hard.
| Daily Token Usage | Monthly Cloud Cost | Break-Even | Verdict |
|---|---|---|---|
| 10K tokens/day | ~$2.70/mo | ~154 years | Cloud wins |
| 100K tokens/day | ~$27/mo | ~15 years | Cloud wins |
| 500K tokens/day | ~$135/mo | ~3 years | Toss-up |
| 1M tokens/day | ~$270/mo | ~1.5 years | Local wins |
| 5M tokens/day | ~$1,350/mo | ~3.7 months | Local wins |
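
If you want to plug in your own usage, the break-even arithmetic can be sketched like this. The function names are mine, the 30-day month and the equal weighting of input and output prices are simplifying assumptions, and the optional electricity term is something I'm adding for illustration.

```python
# Break-even: hardware price divided by monthly cloud spend avoided.
HARDWARE_COST_USD = 4999.0        # MacBook Pro price used in this post
SONNET_INPUT_USD_PER_M = 3.00     # Claude Sonnet 4.6 row in the pricing table
SONNET_OUTPUT_USD_PER_M = 15.00


def blended_price(input_usd_per_m, output_usd_per_m):
    """Simple average of input and output price per million tokens."""
    return (input_usd_per_m + output_usd_per_m) / 2


def monthly_cloud_cost(tokens_per_day, usd_per_m, days_per_month=30):
    """Cloud spend per month at a flat blended price."""
    return tokens_per_day * days_per_month / 1_000_000 * usd_per_m


def break_even_months(tokens_per_day, usd_per_m,
                      hardware_usd=HARDWARE_COST_USD,
                      electricity_usd_per_month=0.0):
    """Months until avoided cloud spend covers the hardware price."""
    savings = monthly_cloud_cost(tokens_per_day, usd_per_m) - electricity_usd_per_month
    if savings <= 0:
        return float("inf")  # local never pays for itself at this usage
    return hardware_usd / savings


if __name__ == "__main__":
    price = blended_price(SONNET_INPUT_USD_PER_M, SONNET_OUTPUT_USD_PER_M)
    for tokens_per_day in (10_000, 100_000, 500_000, 1_000_000, 5_000_000):
        months = break_even_months(tokens_per_day, price)
        print(f"{tokens_per_day:>9,} tokens/day: {months:,.1f} months")
```

Real workloads are usually input-heavy, so your own blended price may come in lower than a straight average, which pushes break-even further out.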
How I Tested All of This
Hardware, software, prompts, and what I did to keep the numbers honest.
I wanted these numbers to be reproducible, so I was pretty careful about the setup. Everything ran on a single MacBook Pro 16-inch, M5 Max with the 40-core GPU and 128 GB of unified memory, running macOS 16.x. I used MLX 0.31.1 with mlx-lm 0.31.1, and every model was tested at 4-bit quantization. No cherry-picking quantization levels to make certain models look better.
Each model got four standardized prompts: a simple Q&A, a reasoning task, a coding task, and a structured output task. I ran 3-5 passes per model and averaged the results. The important part: every benchmark run happened in an isolated subprocess. That means clean memory measurement, no contamination from previous runs, and no warm-cache advantages. I measured tok/s (generation speed), time to first token (TTFT), and peak memory (RSS). The laptop was on its standard cooling — no external fans or cooling pads. Quality numbers (Elo ratings, MMLU-Pro, HumanEval) come from public leaderboards, not my own testing. And for the record: I bought this laptop at retail. No review unit, no vendor sponsorship, no early access.
Test Setup
- Hardware: MacBook Pro 16" M5 Max, 40-core GPU, 128 GB unified memory
- OS: macOS 16.x (Darwin 25.3.0)
- Framework: MLX 0.31.1, mlx-lm 0.31.1
- Quantization: 4-bit (Q4) for all models
- Test prompts: 4 standardized per model (Q&A, reasoning, coding, structured output)
- Runs: 3-5 passes per model, averaged
- Isolation: Each run in a separate subprocess for clean memory measurement
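For anyone who wants to reproduce the isolation part, here's a minimal sketch of the idea. It is not my exact script. The harness is plain standard library; the worker assumes mlx-lm is installed and uses its `load` and `generate` functions. The timing in this worker includes prompt processing and the token count is approximate, so it will understate pure decode speed a little. All names here are illustrative.

```python
import json
import statistics
import subprocess
import sys

# Worker executed in a fresh interpreter per run, so peak memory is not
# polluted by earlier runs. Assumes mlx-lm is installed in that interpreter.
MLX_WORKER = r'''
import json, resource, sys, time
from mlx_lm import load, generate

model_path, prompt, max_tokens = sys.argv[1], sys.argv[2], int(sys.argv[3])
model, tokenizer = load(model_path)
start = time.perf_counter()
text = generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)
elapsed = time.perf_counter() - start
n_tokens = len(tokenizer.encode(text))  # approximate count of generated tokens
print(json.dumps({
    "tok_s": n_tokens / elapsed,
    # ru_maxrss is reported in bytes on macOS (kilobytes on Linux)
    "peak_rss_bytes": resource.getrusage(resource.RUSAGE_SELF).ru_maxrss,
}))
'''


def run_once(worker_source, *args):
    """Run the worker in a new Python process and parse its last stdout line as JSON."""
    proc = subprocess.run(
        [sys.executable, "-c", worker_source, *[str(a) for a in args]],
        capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout.strip().splitlines()[-1])


def benchmark(worker_source, *args, runs=3):
    """Average tok/s and keep the highest peak RSS over several isolated runs."""
    results = [run_once(worker_source, *args) for _ in range(runs)]
    return {
        "tok_s": statistics.mean([r["tok_s"] for r in results]),
        "peak_rss_bytes": max(r["peak_rss_bytes"] for r in results),
    }

# Example call (needs a local or Hugging Face path to a 4-bit MLX model):
# benchmark(MLX_WORKER, "<path-to-mlx-model>", "Explain unified memory.", 256, runs=3)
```

Spawning a new process per run costs a model reload each time, but that's the price of clean memory numbers.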
Questions I Keep Getting Asked
Stuff people asked me after reading early drafts of this post.
I shared a draft of this with a few friends and got a lot of the same questions back. I've collected them here. Most of the answers reference actual numbers from my testing, and I've tried to be specific rather than hand-wavy. If you don't see your question, the RAM Tier Guide and MLX vs GGUF section might cover it.
On raw speed, the rule of thumb is tok/s ≈ 614 GB/s ÷ model_size_GB. In practice: a 4B model generates ~179 tok/s, 8B ~105-113 tok/s, 14B ~55-62 tok/s, 27B ~31 tok/s, and 70B ~12-13 tok/s. MoE models like Qwen 3 30B-A3B break this curve with 127 tok/s despite using 16 GB of memory, because only 3B parameters are active per token.
What I'd Actually Recommend
So after a weekend of testing, here's where I landed. The M5 Max is genuinely good at this. Small models (4B-8B) generate text faster than you can read it — 105 to 179 tok/s. Mid-range models (12B-27B) feel comfortable for conversation at 31-69 tok/s. And the 70B models, which I honestly expected to struggle, run at 12-13 tok/s — not blazing, but totally usable for longer tasks. The standout is the Qwen 3 30B-A3B with its MoE architecture: 127 tok/s, 30B params worth of smarts, 16 GB of memory. I keep going back to it.
If you're buying a MacBook Pro for AI work, get the 64 GB configuration with the 40-core GPU. It runs 70B dense models at Q4 and MoE models like Qwen 3 30B-A3B, at the full 614 GB/s bandwidth. If you can swing 128 GB, you get frontier MoE models, higher quantization, and multi-model workflows. For software, MLX is the way to go on M5 Max right now: it's faster and doesn't have the shader bugs that are plaguing Ollama. Start with Qwen 3 30B-A3B for everyday use, Devstral 24B for code, and Gemma 3 4B VLM for fast vision tasks. The open-source model ecosystem is moving fast, and having this hardware means you're ready for whatever drops next without needing to pay per token.