Apple Silicon Benchmark · March 2026

The M5 Max AI Benchmark

20+ open-source models tested on real hardware. Every number measured, not estimated.

Peak speed: 178.7 tok/s · Models tested: 20+ · Bandwidth: 614 GB/s · API cost: $0
M5 Max · 40-Core GPU · 128 GB Unified Memory · MLX Framework · 4-Bit Quantization

Independent benchmarks · No vendor sponsorship

NEW: Quality Evaluations — I'm now running ARC-Challenge, GSM8K, and IFEval benchmarks on every model, because fast doesn't mean smart. Check out the live leaderboard →

So I got my hands on the new M5 Max MacBook Pro — the 128 GB, 40-core GPU model — and I did what any reasonable person would do: I spent the entire weekend benchmarking AI models on it. I had a spreadsheet. I had coffee. I had a Python script that spawned isolated subprocesses. It was a whole thing.

Here's the thing: I went into this expecting the small models to be fast and the big models to be slow. That part was obvious. What I didn't expect was how fast the small ones would be (nearly 180 tokens per second, which is genuinely absurd), or that a 70B model would feel usable on a laptop, or that Ollama would just... break because of a Metal 4 shader bug. I also didn't expect a 30-billion-parameter MoE model to outrun most 8B models. But I'm getting ahead of myself.

This page has everything: the raw benchmark numbers, interactive tools so you can compare models side by side, a RAM calculator that tells you what fits on your Mac, a RAM tier guide with my actual picks, an honest look at MLX vs GGUF, a cost breakdown vs cloud APIs, and a FAQ that covers the questions I kept getting asked while writing this up.

The Hardware: What's Actually in This Machine

Apple M5 Max, announced March 3, 2026 — TSMC 3nm, with Neural Accelerators baked into every GPU core.

Before I get into the numbers, it helps to understand why the hardware matters so much. When you're generating tokens with a language model, the bottleneck isn't compute, it's memory bandwidth. Every single token a dense model generates requires reading the entire set of model weights from memory. So the speed formula is pretty simple: tok/s ≈ 614 GB/s ÷ model_size_GB. That's it. Treat it as a ceiling: my real-world results landed about 20-30% below it, which I found kind of wild. The gap comes from KV cache overhead, compute ops, and framework efficiency. But bandwidth is the main event.
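Here's a minimal sketch of that formula in Python, checked against three of the measurements quoted later in this post. It assumes the peak memory figure is a fair stand-in for the size of the weights, which slightly overstates model size; the function names are just for illustration.

```python
BANDWIDTH_GBPS = 614.0  # M5 Max 40-core GPU memory bandwidth


def predicted_tok_s(model_size_gb: float, bandwidth_gbps: float = BANDWIDTH_GBPS) -> float:
    """Ceiling on decode speed if each token reads all weights exactly once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gbps / model_size_gb


def shortfall(measured_tok_s: float, model_size_gb: float) -> float:
    """Fraction by which the measured speed falls below the ceiling."""
    return 1.0 - measured_tok_s / predicted_tok_s(model_size_gb)


# name -> (memory in GB, measured tok/s), as quoted in this post
MEASURED = {
    "Phi-4 14B": (7.8, 62.0),
    "Gemma 3 27B": (15.2, 30.9),
    "Llama 3.3 70B": (37.1, 12.6),
}

for name, (size_gb, tok_s) in MEASURED.items():
    print(
        f"{name}: ceiling {predicted_tok_s(size_gb):.1f} tok/s, "
        f"measured {tok_s} tok/s, gap {shortfall(tok_s, size_gb):.0%}"
    )
```

All three dense models come out between 20% and 30% under the ceiling. MoE models don't follow this version of the formula, because only the active experts are read per token.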

M5 Max Chip Specifications

| Spec | M5 Max (32-core GPU) | M5 Max (40-core GPU) |
|---|---|---|
| CPU Cores | 18 (6 Super + 12 Performance) | 18 (6 Super + 12 Performance) |
| GPU Cores | 32 | 40 |
| Neural Engine | 16-core | 16-core |
| GPU Neural Accelerators | Yes (new in M5) | Yes (new in M5) |
| Memory Bandwidth | 460 GB/s | 614 GB/s |
| Max Unified Memory | 128 GB | 128 GB |
| Process | TSMC 3nm (3rd gen) | TSMC 3nm (3rd gen) |

This tripped me up at first: Memory bandwidth is set by GPU core count, NOT RAM amount. A 64 GB Mac with the 32-core GPU (460 GB/s) is ~25% slower than a 64 GB Mac with the 40-core GPU (614 GB/s). Same RAM, very different speeds.

The other big deal is the unified memory architecture. On a regular PC with an NVIDIA GPU, you've got 24 GB of VRAM on an RTX 4090, or 32 GB on a 5090. If your model doesn't fit in VRAM, you're offloading layers to system RAM over PCIe, and that's dramatically slower. On the M5 Max, the GPU just... accesses all 128 GB directly at full bandwidth. There's no VRAM vs system RAM distinction. It's one big pool. That's what makes running a 70B model on a laptop even possible.

Memory Configurations for AI Workloads

| Configuration | Unified Memory | Bandwidth | Best For |
|---|---|---|---|
| M5 Max 32-core GPU | 36 GB | 460 GB/s | Small models up to ~14B dense |
| M5 Max 32-core GPU | 64 GB | 460 GB/s | Mid-range models up to ~70B Q4 (slower) |
| M5 Max 40-core GPU | 64 GB | 614 GB/s | Mid-range models up to ~70B Q4 at full speed |
| M5 Max 40-core GPU | 128 GB | 614 GB/s | Frontier MoE, large dense, multi-model setups |

All my benchmarks were run on the 40-core GPU, 128 GB configuration.

Compare Any Two Models

Pick two models and see how they stack up on speed, memory, and efficiency.

RAM Calculator: What Fits on Your Mac

Click a RAM tier to see which models you can actually run (I'm reserving 8 GB for macOS).
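The calculator's logic is simple enough to sketch in a few lines of Python. This is an illustration, not the page's actual script: it uses the same 8 GB macOS reserve, and the model list is just the subset whose memory figures are quoted in this post.

```python
OS_RESERVE_GB = 8.0  # memory held back for macOS, as in the calculator

# Peak memory at 4-bit quantization, in GB, as quoted in this post
MODELS_GB = {
    "Gemma 3 4B": 2.4,
    "Qwen 3 8B": 4.4,
    "Phi-4 14B": 7.8,
    "Devstral 24B": 12.6,
    "Gemma 3 27B": 15.2,
    "Qwen 3 30B-A3B": 16.1,
    "DeepSeek R1 32B": 17.3,
    "Llama 3.3 70B": 37.1,
}


def split_by_fit(total_ram_gb, models=MODELS_GB, reserve_gb=OS_RESERVE_GB):
    """Return (compatible, too_large) model names for a given RAM tier."""
    usable_gb = total_ram_gb - reserve_gb
    compatible = sorted(name for name, gb in models.items() if gb <= usable_gb)
    too_large = sorted(name for name, gb in models.items() if gb > usable_gb)
    return compatible, too_large


for tier_gb in (16, 32, 64, 128):
    fits, too_big = split_by_fit(tier_gb)
    print(f"{tier_gb} GB: {len(fits)} compatible, {len(too_big)} too large")
```

This only checks whether the weights fit. It leaves no room for KV cache growth, so a model that lands right at the edge of the usable pool is still a poor pick.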

The Actual Benchmark Numbers

All models at 4-bit quantization on MLX, M5 Max 40-core GPU, 128GB. Averaged across 3-5 runs.

Alright, here's the part you probably scrolled down for. I tested every model at 4-bit quantization using MLX, ran each one 3-5 times in isolated subprocesses, and averaged the results. The table below has everything. A few things jumped out at me. First, Gemma 3 4B at 179 tok/s is generating text so fast that a full paragraph appears in under a second. Second, the Qwen 3 30B-A3B (that's a Mixture of Experts model) somehow hits 127 tok/s despite having 30 billion parameters. It only activates 3B of them per token, which gives it 8B-class speed with way more knowledge. That felt like cheating, honestly. And third, even the 70B models are usable: 12-13 tok/s is still a couple of times faster than the 4-5 tok/s most people read at, so it's not painful for longer tasks.

Text Generation Models

| Model | Params | Type | tok/s | Speed | TTFT | Memory | Tier |
|---|---|---|---|---|---|---|---|

Table 1: Text generation benchmarks on M5 Max 40-core GPU, 128 GB, MLX 4-bit quantization.

Vision Language Models

| Model | Params | tok/s | Speed | TTFT | Memory | Tier |
|---|---|---|---|---|---|---|

Table 2: Vision Language Model benchmarks on M5 Max. See the VLM section for my thoughts.

The MoE surprise: Qwen 3 30B-A3B hits 127.4 tok/s despite needing 16.1 GB of memory. It only fires 3B of its 30B params per token, so you get 8B speed with 30B brains. I didn't believe the numbers at first and re-ran it twice. More on MoE in the FAQ.

The Speed Formula

Token generation is memory-bandwidth-bound:

tok/s ≈ 614 GB/s ÷ Model Size (GB)

This predicted my real results within 20-30%. The gap is KV cache overhead, compute, and framework efficiency.

Things That Surprised Me

  • TTFT under 200ms for everything under 15B params
  • Even 70B models respond within 730ms
  • Memory at Q4 follows ~0.5 GB per billion params
  • MoE models completely break the speed-vs-size curve
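The 0.5 GB per billion parameters rule falls straight out of the bit width: 4 bits is half a byte per weight. A small sketch of that arithmetic, compared against a few of the footprints I measured (nominal parameter counts; the helper names are just for illustration):

```python
def quantized_size_gb(params_billion: float, bits_per_weight: int = 4) -> float:
    """Weights-only size: billions of params times bytes per weight gives GB."""
    return params_billion * bits_per_weight / 8


def overhead_vs_rule(params_billion: float, measured_gb: float) -> float:
    """How far the measured footprint sits above the weights-only estimate."""
    return measured_gb / quantized_size_gb(params_billion) - 1.0


# name -> (nominal params in billions, measured memory in GB) from this post
MEASURED = {
    "Gemma 3 4B": (4, 2.4),
    "Phi-4 14B": (14, 7.8),
    "Gemma 3 27B": (27, 15.2),
    "Llama 3.3 70B": (70, 37.1),
}

for name, (params_b, measured_gb) in MEASURED.items():
    estimate = quantized_size_gb(params_b)
    extra = overhead_vs_rule(params_b, measured_gb)
    print(f"{name}: rule says {estimate:.1f} GB, measured {measured_gb} GB (+{extra:.0%})")
```

The measured numbers run a bit above the rule because peak memory also includes quantization scales, the KV cache, and runtime buffers. The same function with `bits_per_weight=8` gives 70 GB for a 70B model, which is why Q8 at that size needs the 128 GB configuration.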

Vision Models: Feeding Images to Local AI

Models that take images + text as input. Local document analysis, screenshot parsing, OCR, and visual Q&A — all without uploading anything.

I was curious whether the vision models would be noticeably slower than their text-only counterparts. Turns out, not really. Gemma 3 4B handles images at the same 179 tok/s as text. Qwen3-VL 8B was right there at 111 tok/s. The privacy angle here is what sold me — I was testing these by feeding in screenshots of my own code, photos of handwritten notes, and a few receipts. None of that data left my machine. I also tried running Pixtral (Mistral's VLM) but kept hitting an auth error during weight download, so it didn't make the final list. Something to revisit later.

  • Gemma 3 4B VLM: 178.7 tok/s · 2.4 GB (fastest VLM)
  • Qwen3-VL 8B: 110.7 tok/s · 4.4 GB (best value VLM)
  • Qwen3-VL 32B: 27.3 tok/s · 17.3 GB (best quality VLM)

For quick vision tasks — summarizing a screenshot, reading a chart, extracting text from a photo — Gemma 3 4B VLM at 179 tok/s is faster than any cloud API I've used, and it costs nothing per query. When you need the model to actually understand complex images (dense diagrams, multi-page documents), Qwen3-VL 32B at 27.3 tok/s is the strongest I tested, and it fits comfortably on a 64 GB Mac. Check the RAM guide to see which VLMs work at your tier.

Speed Tiers at a Glance

I grouped the models by how fast they actually feel in practice.

Efficiency Rankings: Speed per GB of RAM

Tokens per second divided by GB of memory used. Higher = more bang for your RAM buck.

| Rank | Model | Type | tok/s | Memory | tok/s per GB | Quality | Agentic | Efficiency |
|---|---|---|---|---|---|---|---|---|

Table 3: Models ranked by efficiency with quality and agentic scores.
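The efficiency metric is just speed divided by memory. Here's a short sketch that ranks a handful of models using the speed and memory figures quoted in this post (a subset, not the full table):

```python
# (model, tok/s, memory in GB) as quoted in this post
RESULTS = [
    ("Gemma 3 4B", 178.7, 2.4),
    ("Qwen 3 8B", 105.4, 4.4),
    ("Phi-4 14B", 62.0, 7.8),
    ("Devstral 24B", 39.3, 12.6),
    ("Qwen 3 30B-A3B", 127.4, 16.1),
    ("Llama 3.3 70B", 12.6, 37.1),
]


def efficiency(tok_s: float, memory_gb: float) -> float:
    """Tokens per second delivered per GB of memory used."""
    return tok_s / memory_gb


ranked = sorted(RESULTS, key=lambda row: efficiency(row[1], row[2]), reverse=True)

for rank, (name, tok_s, memory_gb) in enumerate(ranked, start=1):
    print(f"{rank}. {name}: {efficiency(tok_s, memory_gb):.1f} tok/s per GB")
```

Small dense models dominate this metric by construction, so it's best read alongside the quality scores and not on its own.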

Quality Evaluations

But can they actually think? I ran ARC-Challenge, GSM8K, and IFEval to find out.

Speed is great, but I also wanted to know if these models can actually reason. So I ran three standard benchmarks via lm-evaluation-harness: ARC-Challenge (grade-school science), GSM8K (math word problems with chain-of-thought), and IFEval (can the model follow precise formatting instructions). The composite score averages all three.

Agentic Benchmarks

I also tried making them do real terminal tasks. Most of them... couldn't.

Here's where things get interesting. I used terminal-bench to give each model 14 real tasks in fresh Docker containers: create files, fix git repos, install packages, write servers. The model has to figure everything out autonomously through shell commands. Green means it passed, red means it failed. Hover over the red cells to see why — most failures are "parse error" (the model couldn't even produce valid JSON) or "wrong commands" (it tried but got the shell commands wrong). For reference, Claude Sonnet 4.5 scores 50% on the full terminal-bench v1.0 suite.
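To make the "parse error" category concrete, here's a hedged sketch of the kind of check a harness can run on a model's reply before executing anything. The reply schema (a JSON object with a `commands` list of strings) is an assumption for illustration, not terminal-bench's actual format.

```python
import json


def classify_reply(reply: str) -> str:
    """Return 'ok' if the reply is usable, otherwise 'parse error'."""
    try:
        payload = json.loads(reply)
    except json.JSONDecodeError:
        return "parse error"  # not valid JSON at all
    if not isinstance(payload, dict):
        return "parse error"  # valid JSON, but not an object
    commands = payload.get("commands")  # hypothetical field name
    if not isinstance(commands, list):
        return "parse error"
    if not all(isinstance(cmd, str) for cmd in commands):
        return "parse error"
    return "ok"


print(classify_reply('{"commands": ["git status", "ls -la"]}'))
print(classify_reply("Sure! First, run git status to see what changed."))
```

A reply that passes this check can still fail the task with wrong commands, which is the other failure bucket.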

My Actual Picks by RAM Tier

At Q4 quantization, model size in GB is roughly 0.5 × the parameter count in billions. I'd leave ~20% headroom for macOS and KV cache.

Look, picking the right model for your RAM is probably the most important thing you can do here. Running a model that barely fits means you've got no room for context windows, KV cache growth, or having Chrome open (and you'll have Chrome open). My picks below are based on the benchmarks I ran, with real headroom accounted for. Rule of thumb: don't use more than 80% of your memory for the model itself. Leave the rest for macOS, context, and whatever else you've got running.

32 GB

Capable

~22-26 GB usable for models. You'd be surprised how much you can do here.

  • Qwen 3 30B-A3B: 127.4 tok/s · 16.1 GB
  • Gemma 3 27B: 30.9 tok/s · 15.2 GB
  • Phi-4 14B: 62.0 tok/s · 7.8 GB
  • Gemma 3 4B: 178.7 tok/s · 2.4 GB

Everything up to 27B fits at Q4. Qwen 3 30B-A3B is the star here.

128 GB

Overkill (in a good way)

~100-110 GB usable. Frontier MoE, 70B at Q8, or just load three models at once and switch between them.

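  • Qwen 3 30B-A3B: 127.4 tok/s · 16.1 GB
  • Llama 3.3 70B: 12.6 tok/s · 37.1 GB
  • Devstral 24B: 39.3 tok/s · 12.6 GB
  • DeepSeek R1 32B: 24.9 tok/s · 17.3 GB

Qwen 3 235B-A22B (~118 GB at Q4) is right at the limit if you want to go all out: it leaves only about 10 GB for macOS and KV cache, well under the headroom I'd normally keep.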

MLX vs GGUF: Why I Went with MLX (and Why Ollama Broke)

Apple's MLX framework vs. the GGUF/llama.cpp ecosystem. Both work, but one works better on this hardware.

So I started this project using Ollama because, honestly, it's the easiest way to get a model running. Download, run, done. But on the M5 Max, Ollama kept throwing Metal 4 shader compilation errors. This is a known issue with Ollama 0.18.2 on M5 hardware — it uses GGUF/llama.cpp under the hood, and the Metal shaders haven't been updated for Metal 4 yet. I wasted a good two hours debugging that before switching to MLX. Once I did, everything just worked, and the performance was noticeably better anyway — roughly 20-30% faster on decode. MLX is purpose-built for Apple Silicon's unified memory and Metal 4, so that makes sense. The downside is the model ecosystem is smaller (though the mlx-community on HuggingFace has thousands of models now) and you don't get the nice GUI that Ollama or LM Studio provide.

MLX (Apple)

  • Built for Apple Silicon's unified memory & Metal 4 GPU
  • Native Python library (mlx-lm) with fine-tuning support
  • ~20-30% faster decode on Apple Silicon vs llama.cpp
  • Faster prompt processing (prefill) via deep memory integration
  • Recommended on M5 Max hardware
  • Thousands of models on HuggingFace (mlx-community)

GGUF (llama.cpp)

  • Cross-platform: CPU, CUDA, Metal, Vulkan
  • Tens of thousands of models on HuggingFace
  • Broader quantization options (IQ quants, mixed quantization)
  • Powers Ollama and LM Studio (GUI tools)
  • Note: Ollama 0.18.2 has Metal 4 shader bugs on M5 Max
  • Best for cross-platform portability needs
My take: Use MLX if you're on M5 Max. It's just faster, and it actually works without shader bugs right now. Use GGUF if you need cross-platform support or a wider model selection. Both are improving fast. The benchmark tables above are all MLX numbers.

How Do These Compare to Cloud APIs?

API pricing as of March 2026. Cloud is still better at the very top of the quality curve, but local is closer than I expected.

I think it's only fair to compare local models against the cloud options. The top cloud models (Claude Opus, Gemini 3.1 Pro, GPT-5.2) still outperform anything you can run locally on the hardest reasoning benchmarks. That's just where things are in 2026. But for the stuff I actually use AI for day-to-day — coding help, writing drafts, summarizing docs, data extraction — a locally-run 30B or 70B model handles it well. The real wins for local are privacy (nothing leaves my machine), latency (sub-200ms TTFT vs 500ms-2s for APIs), and zero marginal cost once you've bought the hardware. The cost analysis below breaks down when local starts saving you money.

| Model | Provider | Input $/M tok | Output $/M tok | Context | Arena Elo | Vision |
|---|---|---|---|---|---|---|
| Claude Opus 4.6 | Anthropic | $5.00 | $25.00 | 1M | ~1505 | Yes |
| Gemini 3.1 Pro | Google | $2.00 | $12.00 | 1M | ~1503 | Yes |
| GPT-5.2 | OpenAI | $1.75 | $14.00 | 400K | ~1490 | Yes |
| Claude Sonnet 4.6 | Anthropic | $3.00 | $15.00 | 200K | ~1480 | Yes |
| GPT-5.2 Pro | OpenAI | $21.00 | $168.00 | 400K | ~1510 | Yes |
| Gemini 2.5 Flash | Google | Free | Free | 1M | ~1450 | Yes |
| DeepSeek V3.2 API | DeepSeek | $0.14 | $0.28 | 164K | ~1421 | No |
| DeepSeek R1 API | DeepSeek | $0.55 | $2.19 | 164K | ~1430 | No |
| GPT-4o | OpenAI | $2.50 | $10.00 | 128K | ~1460 | Yes |
| Claude Haiku 4.5 | Anthropic | $1.00 | $5.00 | 200K | ~1420 | Yes |

Table 4: Cloud API pricing and quality ratings as of March 2026. Prices change constantly.

Where Local Wins

  • Privacy: sensitive data never leaves your machine
  • Latency: under 200ms TTFT vs 500ms-2s for APIs
  • Cost at scale: zero marginal cost after hardware
  • Offline access: works anywhere, no internet needed
  • No rate limits: generate as fast as hardware allows

Where Cloud Wins

  • Highest absolute quality (Elo 1490-1510)
  • No upfront hardware investment
  • Always the latest frontier models
  • 1M+ token context windows
  • Low usage: cheaper than buying hardware

When Does Local Actually Save You Money?

M5 Max 128GB at ~$4,999 vs. roughly $9/M tokens (blended Sonnet-tier pricing).

This is the question everyone asks, and the honest answer is: it depends on how much you use it. I ran the numbers assuming a $4,999 MacBook Pro and blended API costs of about $9 per million tokens (that's roughly what Sonnet-tier models cost when you average input and output). If you're doing 10K tokens a day, maybe a handful of short conversations, local doesn't make financial sense. You'd be waiting 154 years to break even. Even at 100K tokens a day, which is about what I burn through on coding tasks and writing, that's $27 a month in API spend, so break-even on cost alone is still roughly 15 years. The math only starts working at higher volume: about 3 years at 500K tokens/day and about 18 months at 1M tokens/day. After that, every token is free. Electricity is maybe $5-10/month even if you're running it hard.

| Daily Token Usage | Monthly Cloud Cost | Break-Even | Verdict |
|---|---|---|---|
| 10K tokens/day | ~$2.70/mo | ~154 years | Cloud wins |
| 100K tokens/day | ~$27/mo | ~15 years | Cloud wins |
| 500K tokens/day | ~$135/mo | ~3 years (37 months) | Toss-up |
| 1M tokens/day | ~$270/mo | ~18.5 months | Local wins |
| 5M tokens/day | ~$1,350/mo | ~3.7 months | Local wins |

Table 5: Break-even assuming M5 Max 128 GB at $4,999 and blended API pricing of ~$9/M tokens. Break-even is $4,999 divided by the monthly cloud cost.
The short version: At 100K tokens/day (50-100 solid interactions), break-even against Sonnet-tier pricing takes about 15 years, so cost alone doesn't justify the hardware. Local starts paying for itself at around 1M tokens/day (about 18 months), and after that it's all free. Electricity is negligible.
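The break-even arithmetic is easy to rerun with your own numbers. A minimal sketch, assuming the $4,999 price and ~$9/M blended rate from above, a 30-day month, and ignoring electricity:

```python
HARDWARE_USD = 4999.0           # M5 Max 128 GB price used in this post
BLENDED_USD_PER_M_TOKENS = 9.0  # blended Sonnet-tier API price
DAYS_PER_MONTH = 30             # simplifying assumption


def monthly_cloud_cost(tokens_per_day: float) -> float:
    """What the same token volume would cost on a cloud API per month."""
    millions_per_day = tokens_per_day / 1_000_000
    return millions_per_day * BLENDED_USD_PER_M_TOKENS * DAYS_PER_MONTH


def break_even_months(tokens_per_day: float) -> float:
    """Months of cloud spend needed to equal the hardware price."""
    return HARDWARE_USD / monthly_cloud_cost(tokens_per_day)


for tokens_per_day in (10_000, 100_000, 500_000, 1_000_000, 5_000_000):
    months = break_even_months(tokens_per_day)
    print(
        f"{tokens_per_day:>9,} tokens/day: "
        f"${monthly_cloud_cost(tokens_per_day):,.2f}/mo, "
        f"break-even in {months:,.1f} months ({months / 12:.1f} years)"
    )
```

Swap in a different blended rate to see how sensitive the result is: at Opus-tier pricing the break-even point arrives several times sooner, and at DeepSeek-tier pricing it effectively never does.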

How I Tested All of This

Hardware, software, prompts, and what I did to keep the numbers honest.

I wanted these numbers to be reproducible, so I was pretty careful about the setup. Everything ran on a single MacBook Pro 16-inch, M5 Max with the 40-core GPU and 128 GB of unified memory, running macOS 16.x. I used MLX 0.31.1 with mlx-lm 0.31.1, and every model was tested at 4-bit quantization. No cherry-picking quantization levels to make certain models look better.

Each model got four standardized prompts: a simple Q&A, a reasoning task, a coding task, and a structured output task. I ran 3-5 passes per model and averaged the results. The important part: every benchmark run happened in an isolated subprocess. That means clean memory measurement, no contamination from previous runs, and no warm-cache advantages. I measured tok/s (generation speed), time to first token (TTFT), and peak memory (RSS). The laptop was on its standard cooling — no external fans or cooling pads. Quality numbers (Elo ratings, MMLU-Pro, HumanEval) come from public leaderboards, not my own testing. And for the record: I bought this laptop at retail. No review unit, no vendor sponsorship, no early access.

Test Setup

  • Hardware: MacBook Pro 16" M5 Max, 40-core GPU, 128 GB unified memory
  • OS: macOS 16.x (Darwin 25.3.0)
  • Framework: MLX 0.31.1, mlx-lm 0.31.1
  • Quantization: 4-bit (Q4) for all models
  • Test prompts: 4 standardized per model (Q&A, reasoning, coding, structured output)
  • Runs: 3-5 passes per model, averaged
  • Isolation: Each run in a separate subprocess for clean memory measurement
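For anyone who wants to reproduce the isolation setup, here's a minimal sketch of the pattern, not my exact script. Each run happens in a fresh Python interpreter that prints one JSON line of stats. The worker below is a stand-in that reports fixed numbers; a real worker would load the model with mlx-lm, time the generation, and print the same fields.

```python
import json
import statistics
import subprocess
import sys


def run_isolated(worker_code: str, timeout_s: float = 600.0) -> dict:
    """Run worker_code in a fresh interpreter and parse its JSON stats."""
    result = subprocess.run(
        [sys.executable, "-c", worker_code],
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,  # raise if the worker crashes
    )
    # Only the last stdout line is parsed, so earlier log output is ignored
    return json.loads(result.stdout.strip().splitlines()[-1])


def average_tok_s(worker_code: str, runs: int = 3) -> float:
    """Average generation speed over several isolated runs."""
    speeds = []
    for _ in range(runs):
        stats = run_isolated(worker_code)
        speeds.append(stats["tokens"] / stats["seconds"])
    return statistics.mean(speeds)


# Stand-in worker (hypothetical): reports 100 tokens generated in 2 seconds
STUB_WORKER = 'import json; print(json.dumps({"tokens": 100, "seconds": 2.0}))'

print(f"{average_tok_s(STUB_WORKER, runs=3):.1f} tok/s")
```

Because every run starts from a new process, nothing from the previous model stays resident, and the memory figure for each run reflects that model alone.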

Questions I Keep Getting Asked

Stuff people asked me after reading early drafts of this post.

I shared a draft of this with a few friends and got a lot of the same questions back. I've collected them here. Most of the answers reference actual numbers from my testing, and I've tried to be specific rather than hand-wavy. If you don't see your question, the RAM Tier Guide and MLX vs GGUF section might cover it.

Gemma 3 4B is the fastest model I tested, generating 178.7 tokens per second on the M5 Max 40-core GPU using MLX at 4-bit quantization. It uses only 2.4 GB of memory and supports both text and vision input. For models with more knowledge, Qwen 3 30B-A3B (MoE) achieves 127.4 tok/s while packing 30 billion parameters of intelligence into 16.1 GB of memory. See the full benchmark table for all results.
It follows the formula tok/s ≈ 614 GB/s ÷ model_size_GB. In practice: a 4B model generates ~179 tok/s, 8B ~105-113 tok/s, 14B ~55-62 tok/s, 27B ~31 tok/s, and 70B ~12-13 tok/s. MoE models like Qwen 3 30B-A3B break this curve with 127 tok/s despite using 16GB of memory because only 3B parameters are active per token.
Yes, but you need at least 64 GB of unified memory. At 4-bit quantization, a 70B dense model like Llama 3.3 70B uses 37.1 GB, leaving enough headroom on a 64 GB Mac for the OS and context window. On the M5 Max 40-core GPU, it generates at 12.6 tok/s with a 724ms time to first token. I fully expected the 70B model to choke, but it just... worked. For 128 GB configurations, you can run 70B at Q8 for higher quality or keep multiple models loaded at once. See the RAM Tier Guide for configuration-specific recommendations.
On M5 Max, MLX is the better choice. It's purpose-built for Apple Silicon's unified memory and Metal 4 GPU, providing ~20-30% better decode performance than llama.cpp. Ollama (GGUF-based) also has Metal 4 shader compilation issues on M5 Max as of March 2026 — I hit this myself and wasted a couple hours on it. MLX tends to have faster prompt processing as well. See the full MLX vs GGUF comparison for details.
At 4-bit quantization, model size in GB is roughly 0.5 times the number of billion parameters. Leave about 20% headroom for the OS and KV cache. 32 GB runs excellent 12B-27B models (31-69 tok/s). 64 GB runs frontier 70B dense models comfortably. 128 GB is for 70B at Q8, multi-model setups, or frontier MoE like Qwen 3 235B. The RAM Tier Guide has specific recommendations for each configuration.
It depends on usage. At 100K tokens/day (~50-100 interactions), a MacBook Pro M5 Max 128 GB ($4,999) takes about 15 years to break even against cloud APIs at ~$9/M tokens. At 500K tokens/day, break-even is about 3 years. At 1M tokens/day, about 18 months. After break-even, every token is free. Electricity adds only ~$5-10/month. See the full cost break-even analysis.
MoE is a model architecture where only a fraction of parameters are activated per token. Qwen 3 30B-A3B has 30 billion total parameters but activates only 3 billion per token, giving it roughly 8B-class speed (127.4 tok/s) with the quality of a 30B model. The tradeoff: MoE models still need memory for all parameters (16.1 GB). On Apple Silicon with ample unified memory, MoE offers the best quality-to-speed ratio available, which makes it a strong pick at every RAM tier it fits in.
For coding tasks, Devstral Small 24B (39.3 tok/s, 12.6 GB) is purpose-built for code generation and fits on any Mac with 32 GB+. Qwen 3 30B-A3B (127.4 tok/s) is excellent for coding while also being fast enough for real-time use. For smaller setups, Qwen 3 8B (105.4 tok/s, 4.4 GB) offers solid coding at ultra-fast speed. For maximum quality with 64 GB+, Llama 3.3 70B provides the strongest overall performance.
Yes. Once downloaded, models run entirely on local hardware with no internet needed. This is a key advantage for travel, air-gapped environments, restricted networks, and privacy-sensitive work. Models are stored on SSD and inference uses only your Mac's CPU, GPU, and unified memory.
Several VLMs run well via MLX: Gemma 3 4B VLM (179 tok/s, 2.4 GB) is the fastest. Qwen3-VL 8B (111 tok/s, 4.4 GB) is the best value. Gemma 3 27B VLM (32 tok/s, 15.2 GB) and Qwen3-VL 32B (27 tok/s, 17.3 GB) provide the highest quality image understanding. All run entirely locally. See the VLM section for more.
Yes. Apple Silicon uses a unified memory architecture where the CPU and GPU share the same memory pool. You don't need to copy model weights between system RAM and VRAM like you would with a discrete GPU. You can run AI inference alongside your browser, IDE, or creative apps. However, the model's memory footprint reduces what's available for other applications, so choose a model size that leaves enough headroom. For example, running a 17 GB model on a 64 GB Mac still leaves roughly 47 GB for macOS and your other apps (about 39 GB for apps after the 8 GB I reserve for macOS).
Average human reading speed is roughly 4–5 tokens per second. So any model generating above 5 tok/s is producing text faster than you can read it. At 25–30 tok/s, responses appear nearly instantaneous for short answers. At 100+ tok/s, even long multi-paragraph responses complete in just a few seconds. For coding assistants, higher speeds mean faster completions and less waiting between edits. In practice, anything above 30 tok/s feels real-time for interactive chat.
Yes, as long as the combined memory footprint fits in your available RAM. For example, on a 128 GB Mac with approximately 8 GB reserved for the OS, you could simultaneously load Qwen 3 30B-A3B (16.1 GB) + Devstral 24B (12.6 GB) + Gemma 3 4B (2.4 GB) for a total of 31.1 GB, leaving over 88 GB free. On a 64 GB Mac, you could run a 7–8B model (4.4 GB) alongside a 14B model (7.8 GB) for about 12.2 GB total. Only one model generates tokens at a time using the GPU, but having multiple loaded avoids reload latency when switching between them.
32 GB: Qwen 3 30B-A3B (127.4 tok/s, 16.1 GB) is the best overall pick, offering 30B-class quality at blazing speed. 64 GB: Llama 3.3 70B (12.6 tok/s, 37.1 GB) provides the strongest quality, while Qwen 3 30B-A3B remains the best speed-to-quality ratio. 128 GB: You can run everything, but Qwen 3 30B-A3B is still the daily driver for speed, Llama 3.3 70B for quality, and Devstral 24B (39.3 tok/s, 12.6 GB) for coding. The sweet spot for most users is the 64 GB configuration with the 40-core GPU.

What I'd Actually Recommend

So after a weekend of testing, here's where I landed. The M5 Max is genuinely good at this. Small models (4B-8B) generate text faster than you can read it — 105 to 179 tok/s. Mid-range models (12B-27B) feel comfortable for conversation at 31-69 tok/s. And the 70B models, which I honestly expected to struggle, run at 12-13 tok/s — not blazing, but totally usable for longer tasks. The standout is the Qwen 3 30B-A3B with its MoE architecture: 127 tok/s, 30B params worth of smarts, 16 GB of memory. I keep going back to it.

If you're buying a MacBook Pro for AI work, get the 64 GB configuration with the 40-core GPU. It runs 70B dense models and every MoE model, at the full 614 GB/s bandwidth. If you can swing 128 GB, you get frontier MoE models, higher quantization, and multi-model workflows. For software, MLX is the way to go on M5 Max right now — it's faster and doesn't have the shader bugs that are plaguing Ollama. Start with Qwen 3 30B-A3B for everyday use, Devstral 24B for code, and Gemma 3 4B VLM for fast vision tasks. The open-source model ecosystem is moving fast, and having this hardware means you're ready for whatever drops next without needing to pay per token.