Qwen 3.5: Open Weights Closing the Gap to Proprietary Models
How the 27B and 35B-A3B variants are delivering frontier performance on consumer hardware
It’s been a while since my last post about open-weight models — and honestly, the pace of improvement has been wild. Every few months, something new drops that makes you question whether proprietary models are still worth the hype.
Case in point: Devstral Small 2 dropped on Dec 22, 2025 with a solid 68% SWE-Bench score. Impressive for a ~24B model running on consumer hardware. But then just 57 days later, on Feb 17, 2026, Alibaba released Qwen 3.5 — and the open-weight game changed again.
The star players for local deployment? The 35B-A3B (MoE architecture) and the dense 27B, both released Feb 17, 2026. They deliver performance that rivals proprietary models released as recently as Feb 2026, and yes, you can run them on your laptop right now. No API bills, no rate limits.
Let’s dive in.
What Makes Qwen 3.5 Special?
Qwen 3.5 isn’t just another model release. It’s a hybrid reasoning family supporting 256K context across 201 languages, with both thinking and non-thinking modes. For local deployment, the 27B and 35B-A3B variants are particularly interesting:
| Model | Architecture | Parameters Activated | Full Precision Size | Consumer Hardware? |
|---|---|---|---|---|
| Qwen3.5-35B-A3B | MoE (Mixture of Experts) | ~3B per forward pass | ~72GB F16 | ✅ 24GB RAM/VRAM |
| Qwen3.5-27B | Dense | Full 27B | ~54GB F16 | ✅ 18-24GB RAM/VRAM |
💡 Quick guide: Pick 27B if you want slightly more accurate results and can fit it in your hardware. Go for 35B-A3B if you prioritize speed — the MoE architecture means only ~3B parameters activate per token, making inference much faster despite having 35B total params.
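A quick sanity check on the sizes in the table: F16 stores 2 bytes per parameter, so the weight footprint is roughly params × 2 (the table's ~72GB figure includes a bit of extra overhead beyond the raw weights):

```python
# Back-of-envelope: full-precision (F16) weights cost 2 bytes per parameter.
def f16_size_gb(params_billion):
    """Approximate F16 weight size in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * 2 / 1e9

print(f16_size_gb(35))  # 70.0 -> close to the table's ~72GB for 35B-A3B
print(f16_size_gb(27))  # 54.0 -> matches the table's ~54GB for the dense 27B
```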
The Benchmark Evidence
Here’s where it gets interesting. Let’s look at the actual numbers that show Qwen 3.5 is closing in on proprietary models:
SWE-Bench Verified (Software Engineering - solving real GitHub issues):
| Model | Score | Release Date | Type |
|---|---|---|---|
| Claude Opus 4.5 | 80.9% | Nov 24, 2025 | Proprietary (current leader) |
| Claude Opus 4.6 | 80.8% | Feb 17, 2026 | Proprietary |
| Gemini 3.1 Pro | 80.6% | Feb 2026 | Proprietary |
| GPT-5.2 | 80.0% | Dec 11, 2025 | Proprietary |
| Claude Sonnet 4.6 | 79.6% | Feb 17, 2026 | Proprietary |
| Qwen 3.5-27B | 72.4% | Feb 17, 2026 | ✅ Open-weight (runs locally) |
| Claude 3.7 Sonnet | 70.3% | Feb 24, 2025 | Proprietary (older gen) |
| Devstral Small 2 (my previous post) | 68.0% | Dec 22, 2025 | ✅ Open-weight (~24B, runs on consumer hardware) |
| GPT-4o | ~65% | May 2024 | Proprietary (older gen) |
What this means: Qwen 3.5-27B (open-weight, free to run locally) beats older-generation proprietary models like Claude 3.7 Sonnet (Feb 2025) and GPT-4o (May 2024), and surpasses Devstral Small 2 from just two months prior. That's a model you can download and run on your own hardware outperforming proprietary models that cost $15-75 per million tokens.
💡 Context: Remember Devstral Small 2 from my last post? Released Dec 22, 2025, it scored 68% on SWE-Bench and was already impressive for a ~24B model. Now Qwen 3.5-27B (released Feb 17, 2026, just 57 days later) is at 72.4%. That's a 4.4-point jump in under two months, narrowing the gap to the top proprietary model (Opus 4.6 at 80.8%) to under 10 points.
For those with modest hardware: Qwen 3.5 also comes in smaller variants (0.8B to 9B, released in March 2026) that you can try out.
Other Benchmarks Where Qwen 3.5-27B Shines:
| Benchmark | Qwen 3.5-27B | GPT-4o | Claude Sonnet 4.5 |
|---|---|---|---|
| MMLU-Pro (reasoning) | 86.1% | 74.7% | 80.8% |
| GPQA Diamond (science) | 85.5% | 70.1% | 78.9% |
| IFEval (instruction following) | 95.0% | 81.0% | 87.2% |
The pattern is clear: on reasoning-heavy tasks, Qwen 3.5-27B isn’t just competitive — it’s often beating proprietary models that are 10x larger and cost thousands per month to run via API.
The real win? You're getting this performance locally with quantized versions that fit in ~18GB of RAM/VRAM. That puts it within reach of a MacBook Pro with 36GB unified memory or a single RTX 4090: no API bills, no rate limits.
Quick Demo: Running Qwen 3.5 Locally
Let’s get this running on your machine. I’ll use llama.cpp since it’s the most reliable backend for GGUF models right now (Ollama doesn’t support Qwen 3.5 yet due to separate vision projection files).
Prerequisites
Before we dive in:
- Git: For cloning llama.cpp
- CMake + build tools: For compiling from source
- 18-24GB RAM/VRAM: Depending on which model you choose
- Optional GPU: CUDA for NVIDIA, Metal works on macOS out of the box
Step 1: Build llama.cpp
First, grab the latest version and compile it:
apt-get update && apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# For NVIDIA GPU (CUDA)
cmake -B build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
# For CPU only or macOS (Metal auto-detects)
# cmake -B build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=OFF
cmake --build build --config Release -j --target llama-cli llama-server
cp build/bin/llama-* .
💡 First compile takes 5-10 minutes. This is normal: the CUDA/Metal kernels take a while to build. Don't cancel it.
Step 2: Download the Model
Unsloth provides GGUFs with their Dynamic Quantization (UD-Q4_K_XL recommended for best quality-speed balance):
# Install download tools first
pip install huggingface_hub hf_transfer
# For Qwen3.5-35B-A3B (~18GB quantized)
hf download unsloth/Qwen3.5-35B-A3B-GGUF \
--include "*UD-Q4_K_XL*" \
--include "*mmproj-F16*"
# For Qwen3.5-27B (~16GB quantized)
hf download unsloth/Qwen3.5-27B-GGUF \
--include "*UD-Q4_K_XL*" \
--include "*mmproj-F16*"
The mmproj file is for vision tasks — you’ll need it if you want to use multimodal features.
Step 3: Run the Model
Now let’s spin it up with llama-server. This gives you an OpenAI-compatible API endpoint at http://localhost:8080:
For Qwen3.5-35B-A3B (Thinking mode for general tasks):
export LLAMA_CACHE="unsloth/Qwen3.5-35B-A3B-GGUF"
./llama-server \
-hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL \
--ctx-size 16384 \
--temp 1.0 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00 \
--alias "Qwen3.5-35B-A3B" \
--port 8080
For Qwen3.5-27B (Non-thinking mode):
export LLAMA_CACHE="unsloth/Qwen3.5-27B-GGUF"
./llama-server \
-hf unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL \
--ctx-size 16384 \
--temp 0.7 \
--top-p 0.8 \
--top-k 20 \
--min-p 0.00 \
--chat-template-kwargs '{"enable_thinking":false}' \
--alias "Qwen3.5-27B" \
--port 8080
💡 Pro tip: Use --chat-template-kwargs '{"enable_thinking":false}' to disable reasoning mode for faster responses on simple tasks. For the Small series (0.8B, 2B, 4B, 9B), thinking is disabled by default; you need to explicitly enable it with "enable_thinking":true.
Test It Out
Once the server starts, hit the API:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3.5-35B-A3B",
"messages": [
{"role": "user", "content": "Explain quantum computing in simple terms"}
],
"temperature": 0.7,
"max_tokens": 256
}'
You should see a JSON response with the model’s answer. Pretty neat for something running entirely on your machine!
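You can hit the same endpoint from Python, too. Here's a minimal stdlib-only helper, assuming the llama-server instance from Step 3 is running on localhost:8080 (the function names here are my own, not part of any library):

```python
import json
import urllib.request

# llama-server's OpenAI-compatible chat endpoint (from Step 3).
API_URL = "http://localhost:8080/v1/chat/completions"

def build_chat_request(model, prompt, temperature=0.7, max_tokens=256):
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def ask(model, prompt, **kwargs):
    """POST the request and return the assistant's reply text."""
    payload = build_chat_request(model, prompt, **kwargs)
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (requires the server to be up):
# print(ask("Qwen3.5-35B-A3B", "Explain quantum computing in simple terms"))
```

Because the endpoint speaks the OpenAI wire format, any OpenAI-compatible client library should work against it as well.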
Performance Expectations
What kind of speeds can you expect? Here’s roughly what I’ve seen:
| Hardware | Model | Tokens/sec (4-bit) | Notes |
|---|---|---|---|
| RTX 4090 (24GB) | 35B-A3B | ~40-50 tok/s | Full GPU offload |
| M3 Max (64GB) | 35B-A3B | ~30-40 tok/s | Metal acceleration |
| RTX 4090 | 27B | ~50-60 tok/s | Faster due to dense architecture |
| CPU only (16 cores) | 27B-Q4 | ~8-12 tok/s | Still usable for chat |
The MoE variant (35B-A3B) trades some accuracy for speed — since it only activates ~3B params per token, it’s noticeably faster than the dense 27B on GPU. On CPU, the difference shrinks because memory bandwidth becomes the bottleneck.
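As a sanity check on those numbers: when decoding is memory-bandwidth bound, every generated token must stream the active weights through memory once, so tokens/sec is roughly bandwidth divided by the active-weight footprint. A rough sketch (the bandwidth and bytes-per-weight figures below are illustrative assumptions, not measurements):

```python
# Decode-speed ceiling under a memory-bandwidth-bound assumption.
def tokens_per_sec(bandwidth_gb_s, active_params_billion, bytes_per_param=0.56):
    """Q4_K quants average roughly 4.5 bits (~0.56 bytes) per weight."""
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Dense 27B on an RTX 4090 (~1008 GB/s assumed): ceiling ~67 tok/s,
# roughly in line with the ~50-60 tok/s observed once overheads are added.
print(round(tokens_per_sec(1008, 27)))
```

For the MoE variant only ~3B parameters are active per token, so its bandwidth ceiling is far higher, which is why routing and compute overheads (not bandwidth) dominate on GPU.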
Model Use Cases
Here’s where Qwen 3.5 shines in real-world scenarios:
Agentic Coding via a Harness (Claude Code, OpenCode, Kilo, Aider, etc.):
Qwen 3.5-27B works great as a main or sub-agent model for your daily SWE tasks; early community reports suggest it's more than capable.
General Agentic Workflows (OpenClaw, AutoGen):
The 35B-A3B variant excels at multi-step reasoning tasks:
- OpenClaw: Build autonomous agents that can browse, plan, and execute complex workflows
- AutoGen: Create conversational agents that collaborate on tasks
- Custom tool use: The model’s strong instruction following (95% IFEval) makes it reliable for function calling
Why local matters for agents: When building agentic systems, you’re making hundreds or thousands of API calls. Running locally means:
- No $20-50/month API bills for experimentation
- Zero latency when iterating on agent logic
- Full privacy — your code and data never leave your machine
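For the function-calling point above, the local endpoint accepts OpenAI-style tool definitions (support depends on your llama.cpp version and flags). Here's a sketch of what such a definition looks like; "get_weather" is a hypothetical example tool, not part of Qwen or llama.cpp:

```python
import json

# Build an OpenAI-style tool definition to send in a chat request's
# "tools" array. The helper and the get_weather tool are illustrative.
def make_tool(name, description, properties, required):
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }

weather_tool = make_tool(
    "get_weather",
    "Look up current weather for a city",
    {"city": {"type": "string", "description": "City name"}},
    ["city"],
)
print(json.dumps(weather_tool, indent=2))
```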
💡 Pro tip: For coding agents, use the 27B variant (better performance). For general-purpose agentic workflows requiring speed, go with 35B-A3B's MoE architecture.
Limitations on Consumer Hardware
The model supports 256K context officially, but here’s what actually works on consumer GPUs:
| RAM/VRAM | Max Practical Context | Model Fit | Notes |
|---|---|---|---|
| 24GB | ~16K-32K | 35B-A3B Q4_K_M | Full GPU offload, fast inference |
| 24GB | ~8K-16K | 27B Q4_K_XL | Higher quality quantization fits |
| 32GB+ (MacBook Pro) | ~32K-64K | Either model | Unified memory helps, slower than discrete GPU |
| 16GB | ~8K max | Small series only (9B) | 27B/35B won’t fit comfortably |
Reality check on context windows:
- The advertised 256K context requires ~80-100GB VRAM at full precision — not feasible for consumer hardware
- At Q4 quantization with 24GB VRAM, you’re realistically looking at 16K-64K context window before running out of memory
- Each additional 8K context adds ~2-4GB of RAM usage depending on model size
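The per-token cost behind that last bullet comes from the KV cache: two tensors (K and V) per layer, per KV head. A sketch of the math, where the layer and head counts are placeholder assumptions rather than Qwen 3.5's published config:

```python
# KV-cache memory: 2 (K and V) x layers x kv_heads x head_dim x bytes, per token.
# Architecture numbers below are assumed placeholders, not Qwen 3.5's real config.
def kv_cache_gb(tokens, layers=64, kv_heads=8, head_dim=128, bytes_per=2):
    per_token = 2 * layers * kv_heads * head_dim * bytes_per
    return tokens * per_token / 1e9

print(round(kv_cache_gb(8192), 1))  # ~2.1 GB for an extra 8K of context
```

With these assumed numbers an extra 8K of F16 context costs ~2GB, consistent with the 2-4GB range above; grouped-query attention (fewer KV heads) is what keeps this manageable at all.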
What this means practically:
- ✅ Codebases: You can load most single-repo codebases (typical dev workflow)
- ✅ Long documents: Technical papers, books, or lengthy conversations work fine
Quantization tradeoffs:
- Q4_K_XL: Best quality, ~2-3GB larger than Q4_K_M
- Q4_K_M: Sweet spot for most users, minimal quality loss
- Q5/Q6: Only use if you have 32GB+ and need every bit of accuracy
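To see where those size differences come from, you can estimate on-disk size from bits per weight. The figures below are rough averages for llama.cpp K-quants (assumptions; block overhead varies by scheme and model):

```python
# Approximate bits-per-weight for common llama.cpp K-quants (rough averages).
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q6_K": 6.6}

def quant_size_gb(params_billion, scheme):
    """Estimate quantized weight size in GB from bits per weight."""
    bits = BITS_PER_WEIGHT[scheme]
    return params_billion * 1e9 * bits / 8 / 1e9

for scheme in BITS_PER_WEIGHT:
    print(scheme, round(quant_size_gb(27, scheme), 1))
```

For the 27B this lands around 16GB at Q4_K_M, matching the download size quoted earlier, with Q5/Q6 adding a few GB each.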
💡 Rule of thumb: Start with --ctx-size 16384 (16K). If you OOM (out of memory), drop to 8192. Don't go higher unless you have 32GB+ RAM/VRAM available.
Final Thoughts
We’re at an inflection point where open-weight models like Qwen 3.5 are delivering performance that used to require massive proprietary APIs. The 27B and 35B-A3B variants prove you don’t need 100B+ parameters when the architecture is well-designed.
The fact that these run on a single consumer GPU (or even a high-RAM MacBook) changes the game for local AI development. No more paying per token or worrying about API rate limits.
Key takeaways:
- Qwen 3.5 brings frontier performance to consumer hardware
- The MoE variant (35B-A3B) offers speed, while 27B gives you dense accuracy
- llama.cpp is your best bet for running GGUFs locally right now
- Both models support 256K context and multimodal tasks
What’s Next?
Once you’ve got this running, explore:
- Model harness: Build something with Qwen 3.5, either the 35B-A3B MoE or the dense 27B variant
- Model performance: Compare performance against other open-weight models
- Small series: Explore the smaller variants (0.8B to 9B) if you need ultra-fast inference on mobile devices
The open model ecosystem is moving fast — and honestly, it’s been fun watching the gap close. What will drop next month? I’m here for it. 🚀
Have questions or hit a snag running Qwen 3.5 locally? Drop me a note — happy to help debug.