I Benchmarked 27 AI Models on My MacBook Pro M5 Max — Speed, Quality, and Agentic

Local AI inference is faster, cheaper, and more private than you think. I tested every major open-source model to prove it.

I just got a MacBook Pro M5 Max with 128GB of unified memory. The first thing I did was install every open-source AI model I could get my hands on and run them all through a standardized benchmark suite.

Not because anyone asked me to. Because I was genuinely curious: can a laptop actually replace cloud AI APIs?

The short answer? For a surprising number of use cases, yes. A 4-billion-parameter model generates text at 179 tokens per second on this machine – faster than you can read. A 30-billion-parameter model, thanks to a clever architectural trick, hits 127 tokens per second. Even a massive 70-billion-parameter model runs at a usable 13 tokens per second.

Here’s everything I learned.

Why Local AI Matters

Before the benchmarks, let me make the case for running models on your own hardware.

Privacy is the obvious one. When you send a prompt to an API, your data leaves your machine. Maybe the provider promises not to train on it. Maybe they don’t. With local inference, your code, your documents, your medical records, your proprietary business logic – none of it ever leaves your laptop. For some industries, this isn’t a preference. It’s a compliance requirement.

Cost is the less obvious one. Cloud APIs charge per token. A MacBook Pro M5 Max 128GB costs about $4,999. At Sonnet-tier pricing of roughly $9 per million tokens, 500K tokens per day (roughly 200-300 substantive AI interactions) costs about $135 a month, so local inference pays for the laptop in roughly three years. At 1M tokens per day, break-even comes in about 18 months, and against frontier-tier pricing it comes much sooner. After that, every token is free.

Latency is the sneaky one. Most models I tested produce their first token in under 200 milliseconds. Cloud APIs typically take 500ms to 2 seconds to return a first token once the network round-trip and server-side queueing are included. For interactive coding assistants or real-time applications, that difference is night and day.

And then there’s the simple fact that local models work offline. Airplanes, restricted networks, spotty hotel Wi-Fi – your AI assistant doesn’t care.

The Setup

Hardware: MacBook Pro 16" with M5 Max, 40-core GPU, 128GB unified memory

Framework: Apple’s MLX (version 0.31.1), the purpose-built machine learning framework for Apple Silicon

Quantization: 4-bit (Q4) for all models. At this precision, model size runs roughly 0.5 GB per billion parameters. Quality loss compared to full precision is minimal for most tasks.
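The 0.5 GB-per-billion figure falls straight out of the bit width: 4 bits is half a byte per weight. Here is a minimal sketch of that arithmetic. The `overhead_bits` parameter is an assumption added for illustration, covering the per-group scales and offsets that quantized formats store next to the weights; the value 0.5 corresponds to a 16-bit scale and a 16-bit bias per group of 64 weights, which is one common layout and not something confirmed for every model tested here.

```python
def quantized_size_gb(params_billion: float, bits: float = 4.0,
                      overhead_bits: float = 0.0) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes) for a quantized model."""
    bytes_per_weight = (bits + overhead_bits) / 8
    # billions of parameters * bytes per parameter = gigabytes
    return params_billion * bytes_per_weight


for label, params in [("8B", 8), ("30B", 30), ("70B", 70)]:
    pure = quantized_size_gb(params)
    padded = quantized_size_gb(params, overhead_bits=0.5)  # assumed overhead
    print(f"{label}: {pure:.1f} GB at pure 4-bit, {padded:.1f} GB with group overhead")
```

The measured footprints in the tables below sit a little above the pure 4-bit figure, which is what you would expect once quantization metadata and any layers kept at higher precision are counted.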

Methodology: 4 standardized prompts per model covering simple Q&A, reasoning, coding, and structured output. Each model got 3-5 benchmark passes in isolated subprocesses for clean memory measurement. Averages are reported throughout.
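For readers who want to reproduce the idea, here is a rough sketch of a throughput harness in the spirit of that methodology. It is not the actual benchmark script from the repo. `generate_fn` is a hypothetical stand-in for whatever call produces tokens (in a real run it would wrap the MLX generation call), and `fake_generate` exists only so the sketch runs anywhere. It also skips the subprocess isolation used for memory measurement.

```python
import time
from statistics import mean
from typing import Callable, List


def tokens_per_second(generate_fn: Callable[[str], List[int]], prompt: str) -> float:
    """Time one generation call and return generated tokens per second."""
    start = time.perf_counter()
    tokens = generate_fn(prompt)
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed


def benchmark(generate_fn: Callable[[str], List[int]],
              prompts: List[str], passes: int = 3) -> float:
    """Average tok/s over every prompt and every pass."""
    rates = [tokens_per_second(generate_fn, p)
             for _ in range(passes) for p in prompts]
    return mean(rates)


def fake_generate(prompt: str) -> List[int]:
    """Hypothetical model: sleeps briefly, then returns 50 fake token ids."""
    time.sleep(0.01)
    return list(range(50))


PROMPTS = ["simple Q&A", "reasoning", "coding", "structured output"]
print(f"{benchmark(fake_generate, PROMPTS):.0f} tok/s (fake model)")
```

Note that this simple version times the whole call, so prompt processing is folded into the rate; a stricter harness would time only the generation phase.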

Why MLX instead of Ollama or llama.cpp? On M5 Max hardware, MLX currently delivers roughly 20-30% better token generation performance. There’s also a Metal 4 shader compilation bug in Ollama 0.18.2 on M5 Max that produces type mismatches. MLX is the path of least resistance right now for Apple Silicon.

The Results: Speed Tiers

I’m going to organize findings by speed tier, because that’s what actually matters for usability.

Tier 1: Instant (100+ tok/s)

These models generate text faster than you can read it. They’re suitable for real-time applications, autocomplete, and interactive use where latency needs to feel invisible.

Model	Params	tok/s	Memory	Type
Gemma 3 4B	4B	178.7	2.4 GB	VLM
Qwen 3 30B-A3B	30B (3B active)	127.4	16.1 GB	MoE
Llama 3.1 8B	8B	112.7	4.4 GB	Dense
DeepSeek R1 Distill 7B	7B	110.8	4.1 GB	Dense
Qwen3-VL 8B	8B	110.7	4.4 GB	VLM
Mistral 7B v0.3	7B	106.7	3.9 GB	Dense
Qwen 3 8B	8B	105.4	4.4 GB	Dense

The fastest model overall is Gemma 3 4B at 178.7 tok/s, using a mere 2.4 GB of memory. It handles both text and images (it’s a vision-language model), and it’s fast enough for tasks like summarization, classification, and extraction where you simply don’t need a larger model.

Tier 2: Comfortable (50-99 tok/s)

Smooth conversational speed. You won’t notice any lag in interactive use.

Model	Params	tok/s	Memory	Type
Gemma 3 12B	12B	69.2	7.0 GB	Dense
Phi-4 14B	14B	62.0	7.8 GB	Dense
DeepSeek R1 Distill 14B	14B	55.1	7.9 GB	Dense
Phi-4 Reasoning 14B	14B	54.3	8.2 GB	Dense

Phi-4 14B deserves a callout: at 62 tok/s and only 7.8 GB, it punches well above its weight on reasoning and math. If you have a 32GB Mac and need analytical horsepower, this is the model.

Tier 3: Moderate (25-49 tok/s)

Usable for longer generation tasks. You’ll notice tokens arriving one by one, but it’s perfectly fine for anything that isn’t real-time.

Model	Params	tok/s	Memory
Devstral Small 24B	24B	39.3	12.6 GB
Gemma 3 27B	27B	30.9	15.2 GB
Qwen 3 32B	32B	25.7	17.3 GB
DeepSeek R1 Distill 32B	32B	24.9	17.3 GB

Tier 4: Steady (10-24 tok/s)

Patience required. But these are 70-billion-parameter models running on a laptop, and the quality gap over the smaller models is real.

Model	Params	tok/s	Memory
Llama 3.1 70B	70B	13.1	37.1 GB
Llama 3.3 70B	70B	12.6	37.1 GB

The MoE Surprise: Qwen 3 30B-A3B

The single most interesting result in this entire benchmark is Qwen 3 30B-A3B.

This model uses a Mixture of Experts (MoE) architecture. It has 30 billion total parameters, but only activates 3 billion per token. The result: it runs at 127.4 tok/s – faster than every 7B-8B dense model in the Tier 1 table – while packing 30B parameters worth of knowledge.

The tradeoff is memory. It still needs 16.1 GB to hold all 30 billion weights in RAM. But the speed is determined by the active parameters, not the total count. You get the intelligence of a 30B model at the speed of a small one.
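To make "only activates a few experts per token" concrete, here is a toy sketch of top-k routing. Everything in it is illustrative: the eight experts are simple scaling functions, the gate scores are random rather than produced by a learned router, and the expert count and `k` are not the real Qwen 3 configuration.

```python
import math
import random


def top_k_route(gate_scores, k):
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    top = sorted(range(len(gate_scores)),
                 key=lambda i: gate_scores[i], reverse=True)[:k]
    exps = [math.exp(gate_scores[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, gate_scores, k=2):
    """Weighted sum of the outputs of only the selected experts."""
    routed = top_k_route(gate_scores, k)
    output = sum(weight * experts[i](x) for i, weight in routed)
    return output, [i for i, _ in routed]


# Toy setup: 8 "experts", each just scales its input by a different factor.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
random.seed(0)
scores = [random.gauss(0, 1) for _ in experts]  # stand-in for router output

y, used = moe_forward(1.0, experts, scores, k=2)
print(f"ran {len(used)} of {len(experts)} experts: {used}, output {y:.3f}")
```

All eight experts have to live in memory, but only two are evaluated for this token. That is the same split between footprint and per-token work described above.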

For anyone with 32GB or more of RAM, this model is arguably the best value proposition in local AI right now.

Quality Evaluations: How Smart Are These Models?

Speed is only half the story. I’m also running the models through three standardized quality benchmarks using lm-evaluation-harness to measure how well they actually reason, follow instructions, and solve problems.

The benchmarks:

ARC-Challenge – 1,172 grade-school science questions (multiple choice, 0-shot). Tests general knowledge and reasoning.
GSM8K – 1,319 grade-school math word problems (8-shot chain of thought). Tests mathematical reasoning.
IFEval – 541 instructions with verifiable constraints like “write more than 400 words” or “include the word ‘AI’ at least 3 times.” Tests instruction following.

All tests used greedy decoding (temperature=0) for reproducibility. Each model was tested with 25 samples per benchmark – enough to get directional signal, with full 200+ sample runs coming soon for the top performers.
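How directional is a 25-sample signal? One way to get a feel for it is a 95% Wilson score interval around the observed accuracy. The sketch below uses a hypothetical score of 23 out of 25; that number is an example, not a result from these runs.

```python
from math import sqrt


def wilson_interval(correct: int, n: int, z: float = 1.96):
    """Wilson score interval for an accuracy of correct/n (z=1.96 gives ~95%)."""
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


low, high = wilson_interval(23, 25)  # hypothetical: 92% observed on 25 questions
print(f"observed 92%, plausible range {low:.0%} to {high:.0%}")
```

With 25 questions the interval spans roughly 75% to 98%, so two models a few points apart at this sample size cannot really be told apart.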

Configuration matters. I initially got 0% scores across the board on thinking models (DeepSeek R1, Qwen 3) because the default max generation length of 100 tokens truncated their <think> blocks before they could produce an answer. With my custom ARC task (2,048 max tokens, proper stop sequences, regex answer extraction), the same models scored near-perfect.
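The failure is easy to reproduce in miniature. The sketch below shows one plausible way to strip a `<think>` block and pull out a multiple-choice letter. The regular expressions and the `extract_choice` helper are hypothetical illustrations, not the custom task configuration from the repo.

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
# Matches phrases such as "The answer is B", "Answer: (C)" or "answer: A".
ANSWER = re.compile(r"\b[Aa]nswer\s*(?:is|:)?\s*\(?([A-D])\b")


def extract_choice(output: str):
    """Return the chosen letter A-D, or None if no final answer is present."""
    if "<think>" in output and "</think>" not in output:
        return None  # reasoning was cut off before the model could answer
    visible = THINK_BLOCK.sub("", output)  # drop the reasoning, keep the rest
    match = ANSWER.search(visible)
    return match.group(1) if match else None


complete = "<think>Plants need light, so option B fits.</think>\nThe answer is B."
truncated = "<think>Let me compare the options. Option A says that the"

print(extract_choice(complete))   # B
print(extract_choice(truncated))  # None
```

A truncated response scores as wrong no matter how good the reasoning was, which is how a 100-token limit turns into a 0% result.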

Early Results (25-sample calibration)

All 12 Q4 models tested so far show strong performance on ARC-Challenge, with most scoring above 90%. The more interesting differentiation is on GSM8K (math) and IFEval (instruction following), where model size and architecture create clear quality tiers.

The full quality leaderboard with per-model breakdowns is available in the Interactive Leaderboard – it loads live data as evaluations complete.

Quality vs Speed: The Real Tradeoff

The key finding so far: you don’t need a 70B model for most tasks. On ARC-Challenge, even the 4B Gemma 3 matches larger models at 25 samples. The differentiation happens on harder benchmarks and with larger sample sizes, which is why I’m expanding evaluation to 200+ samples for the top performers.

For the sweet spot of quality + speed, the 8B-14B range (Qwen 3 8B, Phi-4 14B, Gemma 3 12B) delivers excellent results while maintaining 60+ tok/s generation speed.

Agentic Benchmarks: Can They Actually Do Real Work?

This is where things get humbling. I used terminal-bench to test whether these models can autonomously operate a terminal – installing packages, fixing code, creating servers, manipulating files. Each model runs in a fresh Docker container and must solve the task through an agentic loop of shell commands.

I ran 14 tasks across all 27 models. The results:

Task	Pass Rate	Difficulty	Top Failure Mode
hello-world	17/27 (63%)	Easy	parse error (21)
fix-permissions	14/27 (52%)	Easy	parse error (9)
create-bucket	8/27 (30%)	Easy	parse error (9)
csv-to-parquet	5/27 (19%)	Easy	wrong commands (11)
extract-safely	2/27 (7%)	Easy	wrong output (15)
fix-git	1/27 (4%)	Medium	parse error (9)
grid-pattern-transform	1/27 (4%)	Easy	wrong commands (11)
processing-pipeline	1/27 (4%)	Easy	wrong output (11)
fibonacci-server	0/27 (0%)	Medium	wrong commands (12)
simple-web-scraper	0/27 (0%)	Easy	wrong commands (25)
polyglot-c-py	0/27 (0%)	Medium	wrong commands (13)
openssl-selfsigned-cert	0/27 (0%)	Medium	wrong output (14)
modernize-fortran-build	0/27 (0%)	Easy	wrong commands (14)
swe-bench-langcodes	0/27 (0%)	Medium	wrong commands (25)

The failure modes tell the story clearly:

parse error – model can’t produce valid JSON for the agent loop. Dominates the easy tasks where even getting started requires structured output. Most small models (3B-8B) fail here.
wrong commands – model produces valid JSON but issues the wrong shell commands. Dominates the harder tasks where models understand the format but not the problem. Common pattern: installing wrong packages, using wrong CLI flags, misunderstanding the task.
wrong output – model gets partially there (some tests pass) but fails others. Shows the model understands the task but can’t fully solve it.
loop – model enters a cycle of 20+ episodes repeating the same failed approach. Rare but happens on tasks like fix-git where models get stuck retrying.
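The parse-error category is worth seeing in code, because it fails before any command runs. Here is a minimal sketch of how a harness might validate a reply. The `{"commands": [...]}` schema and the `parse_agent_reply` helper are assumptions made for illustration; terminal-bench's real message format may differ.

```python
import json


def parse_agent_reply(reply: str):
    """Return (commands, None) on success or (None, reason) on a parse error."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        return None, f"parse error: {exc.msg}"
    if not isinstance(data, dict):
        return None, "parse error: top-level value is not an object"
    commands = data.get("commands")
    if not isinstance(commands, list) or not all(isinstance(c, str) for c in commands):
        return None, "parse error: 'commands' must be a list of strings"
    return commands, None


good = '{"commands": ["echo hello > hello.txt"]}'
chatty = 'Sure! Here is the JSON: {"commands": ["ls"]}'

print(parse_agent_reply(good))
print(parse_agent_reply(chatty))
```

The second reply contains a perfectly reasonable command, but the friendly preamble makes the whole message invalid JSON, so the episode fails before anything is executed.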

Overall: parse errors account for 111 of 329 failures (34%), wrong commands for 129 (39%), and wrong output for 47 (14%). The remaining 42 (13%) are loops and other failures.
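Those totals can be re-derived from the pass-rate table, which is a useful sanity check. The sketch below only uses numbers quoted in this section; tasks with zero passes are left out of the dictionary because they add nothing to the sum.

```python
TASKS, MODELS = 14, 27

# Passing models per task, copied from the table (zero-pass tasks omitted).
passes = {
    "hello-world": 17, "fix-permissions": 14, "create-bucket": 8,
    "csv-to-parquet": 5, "extract-safely": 2, "fix-git": 1,
    "grid-pattern-transform": 1, "processing-pipeline": 1,
}
by_mode = {"parse error": 111, "wrong commands": 129, "wrong output": 47}

total_runs = TASKS * MODELS
total_passes = sum(passes.values())
failures = total_runs - total_passes
uncategorised = failures - sum(by_mode.values())

print(f"{total_passes}/{total_runs} runs passed ({total_passes / total_runs:.0%})")
for mode, count in by_mode.items():
    print(f"{mode}: {count}/{failures} ({count / failures:.0%})")
print(f"not in the three main categories: {uncategorised}")
```

Across all 378 runs that works out to 49 passes, an overall pass rate of about 13%.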

For context, the best cloud models score much higher: Claude Sonnet 4.5 hits 50% on terminal-bench v1.0, and GPT-5.3 Codex reaches 77.3% on v2.0. Local open-source models are clearly a generation behind on agentic capability.

The takeaway: local models are great for text generation, reasoning, and coding assistance. But autonomous agent workflows – where the model must produce structured output and execute multi-step plans – remain firmly in cloud-model territory for now.

The full interactive agentic results grid is in the Eval Dashboard and the Leaderboard.

The Physics of Local Inference

There’s an elegant formula that governs all of this:

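Tokens/sec ≈ Memory Bandwidth (GB/s) / Weights Read per Token (GB)

The M5 Max 40-core has 614 GB/s of bandwidth. For a dense model, every weight is read for every token, so the denominator is simply the model size in memory, and the division gives a theoretical ceiling. Measured performance on the dense models lands 20-30% below that ceiling, with the gap explained by KV cache operations, compute overhead, and framework efficiency. For an MoE model such as Qwen 3 30B-A3B, only the active experts are read per token, which is why it runs far above the 38 tok/s that its 16.1 GB footprint alone would predict.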
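As a quick check of that formula against the measurements, the sketch below divides the 614 GB/s figure by the memory footprint of a few dense models from the speed tables and compares the result with the measured rate. Only numbers already quoted in this post are used.

```python
BANDWIDTH_GBPS = 614.0  # M5 Max 40-core, as quoted above

# model: (memory footprint in GB, measured tok/s), taken from the speed tables
measured = {
    "Gemma 3 4B": (2.4, 178.7),
    "Llama 3.1 8B": (4.4, 112.7),
    "Phi-4 14B": (7.8, 62.0),
    "Gemma 3 27B": (15.2, 30.9),
    "Llama 3.1 70B": (37.1, 13.1),
}


def ceiling_tok_s(size_gb: float, bandwidth: float = BANDWIDTH_GBPS) -> float:
    """Bandwidth-bound upper limit when every weight is read once per token."""
    return bandwidth / size_gb


for name, (size_gb, tok_s) in measured.items():
    ceiling = ceiling_tok_s(size_gb)
    print(f"{name:14s} ceiling {ceiling:6.1f}  measured {tok_s:6.1f}  "
          f"({tok_s / ceiling:.0%} of ceiling)")
```

These dense models come out at roughly 70-81% of the bandwidth ceiling. The calculation does not apply directly to MoE models, where only the active experts are read for each token.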

This is why the M5 Max is so effective at inference. It’s not about raw compute – it’s about how fast you can read model weights from memory. Apple’s unified memory architecture means the GPU has direct access to all 128GB at full bandwidth. No VRAM bottleneck. No PCIe bus. No layer offloading.

An NVIDIA RTX 4090 has 24GB of VRAM. An RTX 5090 has 32GB. If your model doesn’t fit, layers spill into system RAM across the PCIe bus and throughput drops sharply.

Local vs. Cloud: A Cost Reality Check

Here’s the math that surprised me. At Sonnet-tier blended pricing (~$9 per million tokens), measured against the $4,999 purchase price:

Daily Usage	Monthly Cloud Cost	Break-Even Point
100K tokens/day	~$27/month	~185 months
500K tokens/day	~$135/month	~37 months
1M tokens/day	~$270/month	~18.5 months

And that’s comparing against local models in the 8B-30B range. The frontier cloud models (Claude Opus 4.6, GPT-5.2 Pro) cost $25-168 per million output tokens. Against those, local inference starts looking economical at even modest usage.
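The break-even arithmetic is simple enough to check yourself. The sketch below assumes the $4,999 purchase price and the per-million-token prices quoted in this post, a 30-day month, and nothing else: electricity, resale value, and the quality difference between local and cloud models are all ignored.

```python
HARDWARE_USD = 4999.0  # purchase price quoted earlier in the post


def monthly_cloud_cost(tokens_per_day: int, usd_per_million: float,
                       days: int = 30) -> float:
    """Cloud spend per month for a given daily token volume."""
    return tokens_per_day * days / 1_000_000 * usd_per_million


def break_even_months(tokens_per_day: int, usd_per_million: float,
                      hardware_usd: float = HARDWARE_USD) -> float:
    """Months of cloud spend needed to equal the hardware price."""
    return hardware_usd / monthly_cloud_cost(tokens_per_day, usd_per_million)


for tokens in (100_000, 500_000, 1_000_000):
    for price in (9, 25, 168):  # Sonnet-tier blended, then the frontier range
        months = break_even_months(tokens, price)
        print(f"{tokens:>9,} tok/day at ${price}/M: "
              f"${monthly_cloud_cost(tokens, price):,.0f}/month, "
              f"break-even in {months:.1f} months")
```

At $9 per million tokens this gives roughly 185, 37, and 18.5 months for the three usage levels. The payback period shrinks quickly as the per-token price rises toward the frontier tier.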

Of course, cloud models are still significantly smarter for complex tasks. This isn’t an either/or. The right approach is using local models for the 80% of tasks that don’t need frontier intelligence, and reserving API calls for the 20% that do.

My Recommendation Matrix

After weeks of testing, here’s what I’d actually install:

Best for Coding: Devstral Small 24B (39.3 tok/s, 12.6 GB) – purpose-built for code generation, Apache 2.0 licensed.

Best for General Use: Qwen 3 30B-A3B (127.4 tok/s, 16.1 GB) – the MoE architecture gives it the best balance of quality and speed of any model I tested.

Best for Reasoning/Math: Phi-4 14B (62.0 tok/s, 7.8 GB) – MIT licensed, exceptional analytical performance for its size.

Best for Vision Tasks: Gemma 3 4B (178.7 tok/s, 2.4 GB) for speed, Qwen3-VL 32B (27.3 tok/s, 17.3 GB) for quality.

Best for Deep Reasoning: DeepSeek R1 Distill 32B (24.9 tok/s, 17.3 GB) – chain-of-thought reasoning, Apache 2.0.

Best for Maximum Quality: Llama 3.3 70B (12.6 tok/s, 37.1 GB) – if you have the RAM and the patience.

Best for 32GB Macs: Qwen 3 30B-A3B. Not close. MoE architecture at this tier is a game-changer.

RAM Tier Quick Guide

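128GB: You can run anything up to 70B at high-quality quantization. Qwen 3 235B-A22B (a 235B MoE model) is a tight squeeze: at roughly 0.5 GB per billion parameters, the Q4 weights alone come to about 118 GB, which leaves little headroom for context. This is the configuration for people who want no compromises.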

64GB: The sweet spot. Llama 3.3 70B fits with room for context. Qwen 3 30B-A3B flies. All mid-range models are comfortable. Note: the 40-core GPU variant (614 GB/s) has about a third more memory bandwidth than the 32-core variant (460 GB/s), so expect token generation to be up to roughly 33% faster at the same RAM.

32GB: More capable than you’d expect. Qwen 3 30B-A3B still fits at 16.1 GB. Gemma 3 27B fits at 15.2 GB. Phi-4 14B at 7.8 GB leaves room for everything else.
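If you want to check a different RAM size, the fit test behind these tiers can be sketched in a few lines. The footprints come from the tables above. The two tuning knobs are assumptions chosen for illustration, not measured values: `usable_fraction` is the share of RAM assumed to be available to the model, and `context_reserve_gb` is headroom set aside for the KV cache.

```python
# Memory footprints in GB at Q4, copied from the speed tables.
MODELS_GB = {
    "Gemma 3 4B": 2.4,
    "Phi-4 14B": 7.8,
    "Devstral Small 24B": 12.6,
    "Gemma 3 27B": 15.2,
    "Qwen 3 30B-A3B": 16.1,
    "DeepSeek R1 Distill 32B": 17.3,
    "Llama 3.3 70B": 37.1,
}


def models_that_fit(ram_gb: float, usable_fraction: float = 0.75,
                    context_reserve_gb: float = 4.0):
    """List models whose weights fit in the assumed budget, largest first."""
    budget = ram_gb * usable_fraction - context_reserve_gb
    fitting = [name for name, size in MODELS_GB.items() if size <= budget]
    return sorted(fitting, key=MODELS_GB.get, reverse=True)


for ram in (32, 64, 128):
    print(f"{ram} GB: {', '.join(models_that_fit(ram))}")
```

With these assumed defaults a 32GB machine has about 20 GB to work with, which is why everything up to the 32B class fits and the 70B models do not.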

Try It Yourself

I’ve published the full benchmark data and an interactive comparison tool where you can filter, sort, and explore all the results.

Interactive benchmark explorer: o.ml/benchmarks/m5-max-ai-models/explorer/

The site has multiple views for different audiences:

Open Model Leaderboard – quality + speed rankings, local vs cloud comparison
Benchmark Explorer – interactive tools, sortable tables
Casual Blog – first-person narrative
Dev Reference – terse, code snippets, model IDs
Editorial – opinionated takes
Enterprise Report – formal analysis, ROI framing

The entire project – benchmarking scripts, raw data, and the comparison website – is open source. If you have an M-series Mac, you can run the benchmarks yourself and contribute your results.

GitHub repo: github.com/joshmouch/Eval-Apple-Models

If this was useful, consider starring the repo. And if you run the benchmarks on different hardware (M4 Pro, M5 Pro, M5 Ultra), I’d love to see your numbers.

The era of local AI is here. Your laptop is more capable than you think.

Speed benchmarks were run on March 24-25, 2026 using MLX 0.31.1 on macOS Darwin 25.3.0. Quality evaluations are ongoing, using lm-evaluation-harness 0.4.11 with custom task configurations. Models were sourced from the mlx-community collection on Hugging Face. Speed numbers represent averages across 3-5 standardized benchmark passes. Quality scores use greedy decoding (temp=0) with fixed seeds for reproducibility.