Qwen 3.5: Open Weights Closing the Gap to Proprietary Models
How the 27B and 35B-A3B variants are delivering frontier performance on consumer hardware

The open-weight LLM scene has been moving fast lately, but most of the noise is just bigger parameter counts chasing diminishing returns. What's actually interesting right now isn't how massive a model can get, but how much capability we're packing into something that runs on consumer hardware.
Enter Qwen 3.5, which Alibaba released in February with two variants designed for exactly this moment: the 27B dense model and the 35B-A3B MoE. These aren't trying to be GPT-5 replacements. They're asking a different question entirely: what if you could run frontier-level reasoning locally, without needing an API key or worrying about token costs?
I’ve spent the last week running both variants through their paces, from coding tasks to building AI agents for daily use. The results are worth digging into, not because they’re perfect, but because they represent something we haven’t really had before: genuinely useful models that you can actually run yourself.
What Makes Qwen 3.5 Special?
Qwen 3.5 isn’t just another model release. It’s a hybrid reasoning family supporting 256K context across 201 languages, with both thinking and non-thinking modes. For local deployment, the 27B and 35B-A3B variants are particularly interesting:
| Model | Architecture | Parameters Activated | Full Precision Size | Consumer Hardware? |
|---|---|---|---|---|
| Qwen3.5-35B-A3B | MoE (Mixture of Experts) | ~3B per forward pass | ~72GB F16 | ✅ 24GB RAM/VRAM |
| Qwen3.5-27B | Dense | Full 27B | ~54GB F16 | ✅ 18-24GB RAM/VRAM |
💡 Quick guide: Pick 27B if you want slightly more accurate results and it fits your hardware. Go with 35B-A3B if you prioritize speed: the MoE architecture activates only ~3B parameters per token, making inference much faster despite the 35B total.
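A quick back-of-envelope sketch makes the speed argument concrete. Token-by-token generation is mostly memory-bandwidth-bound, so decode speed roughly tracks how many weight bytes get read per token. The 3B and 35B figures come from the table above; the bandwidth-bound framing is a simplification:

```shell
# Decode is roughly memory-bandwidth-bound: per-token cost tracks ACTIVE params.
# 35B-A3B reads ~3B of its 35B weights per token; the dense 27B reads all 27B.
awk 'BEGIN {
  printf "MoE active fraction: %.2f\n", 3 / 35
  printf "Dense vs MoE weights read per token: %.0fx\n", 27 / 3
}'
```

In practice, attention, KV-cache reads, and expert routing eat into that 9x headroom, but it explains why a 35B-total model can decode faster than a 27B dense one.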
The Benchmark Evidence
Here’s where it gets interesting. Let’s look at the actual numbers that show Qwen 3.5 is closing in on proprietary models:
SWE-Bench Verified (Software Engineering - solving real GitHub issues):
| Model | Score | Release Date | Type |
|---|---|---|---|
| Claude Opus 4.5 | 80.9% | Nov 24, 2025 | Proprietary (current leader) |
| Claude Opus 4.6 | 80.8% | Feb 17, 2026 | Proprietary |
| Gemini 3.1 Pro | 80.6% | Feb 2026 | Proprietary |
| GPT-5.2 | 80.0% | Dec 11, 2025 | Proprietary |
| Claude Sonnet 4.6 | 79.6% | Feb 17, 2026 | Proprietary |
| Qwen 3.5-27B | 72.4% | Feb 17, 2026 | ✅ Open-weight (runs locally) |
| Claude 3.7 Sonnet | 70.3% | Feb 24, 2025 | Proprietary (older gen) |
| Devstral Small 2 (my previous post) | 68.0% | Dec 22, 2025 | ✅ Open-weight (~24B, runs on consumer hardware) |
| GPT-4o | ~65% | May 2024 | Proprietary (older gen) |
What this means: Qwen 3.5-27B (open-weight, free to run locally) beats proprietary models from earlier generations, Claude 3.7 Sonnet (Feb 2025) and GPT-4o (May 2024), and outperforms Devstral Small 2 released just two months prior.
💡 Context: Remember Devstral Small 2 from my last post? Released Dec 22, 2025, it scored 68% on SWE-Bench and was already impressive for a ~24B model. Now Qwen 3.5-27B (released Feb 17, 2026, just 57 days later) is at 72.4%. That's a 4.4-point jump in under two months, putting the gap to the top proprietary model (Opus 4.6 at 80.8%) at under 10 points.
For those with modest hardware: Qwen 3.5 also has smaller variants (0.8B to 9B, released in March 2026) that you can try out.
Other Benchmarks Where Qwen 3.5-27B Shines:
| Benchmark | Qwen 3.5-27B | GPT-4o | Claude Sonnet 4.5 |
|---|---|---|---|
| MMLU-Pro (reasoning) | 86.1% | 74.7% | 80.8% |
| GPQA Diamond (science) | 85.5% | 70.1% | 78.9% |
| IFEval (instruction following) | 95.0% | 81.0% | 87.2% |
The pattern is clear: on reasoning-heavy tasks, Qwen 3.5-27B isn't just competitive; it's often beating proprietary models that are 10x larger and cost thousands per month to run via API.
But here's the thing about benchmarks: they're useful, but they don't tell the whole story. A model can ace MMLU-Pro and still stumble on your specific use case. The only way to know if Qwen 3.5 works for you? Run it locally and test it yourself. I'll show you how in just a few commands below; it's genuinely easy now.
The real win? You're getting this performance locally with quantized versions that fit in ~18GB of RAM/VRAM. That's MacBook Pro with 36GB unified memory or single RTX 4090 territory: no API bills, no rate limits.
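If you want to sanity-check whether a quant will fit before downloading, the arithmetic is simple: parameters times average bits per weight, divided by 8. The ~4.5 bits/weight figure for Q4_K-class quants is my assumption (GGUF quants mix tensor precisions), so treat the result as a ballpark:

```shell
# Rough GGUF size: params x average bits-per-weight / 8, reported in GiB.
# ~4.5 bits/weight for Q4_K-class quants is an assumption, not a spec value.
est_gib() {
  awk -v bp="$1" -v bpw="$2" \
    'BEGIN { printf "%.1f\n", bp * 1e9 * bpw / 8 / (1024*1024*1024) }'
}
est_gib 27 4.5   # dense 27B
est_gib 35 4.5   # 35B-A3B (all experts are stored, even though only ~3B are active)
```

This lands at roughly 14.1 GiB for the 27B and 18.3 GiB for the 35B, in the same ballpark as the quantized download sizes quoted below.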
Quick Demo: Running Qwen 3.5 Locally
Let’s get this running on your machine. I’ll use llama.cpp since it’s the most reliable backend for GGUF models right now (Ollama doesn’t support Qwen 3.5 yet due to separate vision projection files).
Prerequisites
Before we dive in:
- Git: For cloning llama.cpp
- CMake + build tools: For compiling from source
- 18-24GB RAM/VRAM: Depending on which model you choose
- Optional GPU: CUDA for NVIDIA, Metal works on macOS out of the box
Step 1: Build llama.cpp
First, grab the latest version and compile it:
```shell
# Debian/Ubuntu build deps (on macOS the Xcode command-line tools are enough)
apt-get update && apt-get install build-essential cmake curl libcurl4-openssl-dev -y

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# For NVIDIA GPU (CUDA)
cmake -B build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
# For CPU only or macOS (Metal auto-detects)
# cmake -B build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=OFF

cmake --build build --config Release -j --target llama-cli llama-server
cp build/bin/llama-* .
```
💡 The first compile takes 5-10 minutes. This is normal: the CUDA/Metal kernels take a while to build. Don't cancel it.
Step 2: Download the Model
Unsloth provides GGUFs with their Dynamic Quantization (UD-Q4_K_XL recommended for best quality-speed balance):
```shell
# Install download tools first
pip install huggingface_hub hf_transfer

# For Qwen3.5-35B-A3B (~18GB quantized)
hf download unsloth/Qwen3.5-35B-A3B-GGUF \
  --include "*UD-Q4_K_XL*" \
  --include "*mmproj-F16*"

# For Qwen3.5-27B (~16GB quantized)
hf download unsloth/Qwen3.5-27B-GGUF \
  --include "*UD-Q4_K_XL*" \
  --include "*mmproj-F16*"
```
The mmproj file is for vision tasks; you'll need it if you want to use multimodal features.
Step 3: Run the Model
Now let’s spin it up with llama-server. This gives you an OpenAI-compatible API endpoint at http://localhost:8080:
For Qwen3.5-35B-A3B (Thinking mode for general tasks):
```shell
export LLAMA_CACHE="unsloth/Qwen3.5-35B-A3B-GGUF"
./llama-server \
  -hf unsloth/Qwen3.5-35B-A3B-GGUF:UD-Q4_K_XL \
  --ctx-size 16384 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.00 \
  --alias "Qwen3.5-35B-A3B" \
  --port 8080
```
For Qwen3.5-27B (Non-thinking mode):
```shell
export LLAMA_CACHE="unsloth/Qwen3.5-27B-GGUF"
./llama-server \
  -hf unsloth/Qwen3.5-27B-GGUF:UD-Q4_K_XL \
  --ctx-size 16384 \
  --temp 0.7 \
  --top-p 0.8 \
  --top-k 20 \
  --min-p 0.00 \
  --chat-template-kwargs '{"enable_thinking":false}' \
  --alias "Qwen3.5-27B" \
  --port 8080
```
💡 Pro tip: Use `--chat-template-kwargs '{"enable_thinking":false}'` to disable reasoning mode for faster responses on simple tasks. For the Small series (0.8B, 2B, 4B, 9B), thinking is disabled by default; you need to explicitly enable it with `"enable_thinking":true`.
Test It Out
Once the server starts, hit the API:
```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-35B-A3B",
    "messages": [
      {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'
```
You should see a JSON response with the model’s answer. Pretty neat for something running entirely on your machine!
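If you only want the reply text rather than the whole JSON envelope, jq (assuming you have it installed) gets you there in one pipe. The sample JSON below is a canned stand-in for a real server response:

```shell
# Pull just the assistant's message out of an OpenAI-style response.
# $response here is a placeholder for what llama-server actually returns.
response='{"choices":[{"message":{"role":"assistant","content":"Hello from Qwen"}}]}'
echo "$response" | jq -r '.choices[0].message.content'
```

Against the live server, just append `| jq -r '.choices[0].message.content'` to the curl command above.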
Performance Expectations
What kind of speeds can you expect? Here’s roughly what I’ve seen:
| Hardware | Model | Tokens/sec (4-bit) | Notes |
|---|---|---|---|
| RTX 4090 (24GB) | 35B-A3B | ~50-60 tok/s | Full GPU offload; only ~3B params active per token |
| M3 Max (64GB) | 35B-A3B | ~30-40 tok/s | Metal acceleration |
| RTX 4090 | 27B | ~40-50 tok/s | Slower per token: dense, reads all 27B params |
| CPU only (16 cores) | 27B Q4 | ~8-12 tok/s | Still usable for chat |
The MoE variant (35B-A3B) trades some accuracy for speed: since it only activates ~3B params per token, it's noticeably faster than the dense 27B on GPU. On CPU, the difference shrinks because memory bandwidth becomes the bottleneck.
Model Use Cases
Here’s where Qwen 3.5 shines in real-world scenarios:
Agentic coding using a harness (Claude Code, OpenCode, Kilo, Aider, etc.):
Qwen 3.5-27B works great as a main or sub-agent model for daily SWE tasks; with a 72.4% SWE-Bench Verified score, it's more than capable.
General Agentic Workflows (OpenClaw, AutoGen):
The 35B-A3B variant excels at multi-step reasoning tasks:
- OpenClaw: Build autonomous agents that can browse, plan, and execute complex workflows
- AutoGen: Create conversational agents that collaborate on tasks
- Custom tool use: The model’s strong instruction following (95% IFEval) makes it reliable for function calling
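Because llama-server exposes the OpenAI chat API, function calling uses the standard `tools` array. Below is a minimal sketch with a made-up `get_weather` tool (the name and schema are illustrative only, and depending on your llama.cpp build you may need to launch the server with `--jinja` for tool calls to be parsed):

```shell
# Hypothetical tool definition; it is sent like any other chat completion.
payload='{
  "model": "Qwen3.5-35B-A3B",
  "messages": [{"role": "user", "content": "What is the weather in Tokyo?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Look up current weather for a city",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    }
  }]
}'
# Send it to the running server (same endpoint as before):
# curl http://localhost:8080/v1/chat/completions \
#   -H "Content-Type: application/json" -d "$payload"
# Sanity-check that the payload is valid JSON before sending:
echo "$payload" | python3 -c 'import json,sys; print(json.load(sys.stdin)["tools"][0]["function"]["name"])'
```

When the model decides to call the tool, the response carries a `tool_calls` entry instead of plain content; your agent loop runs the function and feeds the result back as a `tool` role message.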
Why local matters for agents: When building agentic systems, you’re making hundreds or thousands of API calls. Running locally means:
- No $20-50/month API bills for experimentation
- Zero latency when iterating on agent logic
- Full privacy: your code and data never leave your machine
💡 Pro tip: For coding agents, use the 27B variant (better performance). For general-purpose agentic workflows requiring speed, go with the 35B-A3B's MoE architecture.
Limitations on Consumer Hardware
The model supports 256K context officially, but here’s what actually works on consumer GPUs:
| RAM/VRAM | Max Practical Context | Model Fit | Notes |
|---|---|---|---|
| 24GB | ~16K-32K | 35B-A3B Q4_K_M | Full GPU offload, fast inference |
| 24GB | ~8K-16K | 27B Q4_K_XL | Higher quality quantization fits |
| 32GB+ (MacBook Pro) | ~32K-64K | Either model | Unified memory helps, slower than discrete GPU |
| 16GB | ~8K max | Small series only (9B) | 27B/35B won’t fit comfortably |
Reality check on context windows:
- The advertised 256K context requires ~80-100GB VRAM at full precision, which is not feasible on consumer hardware
- At Q4 quantization with 24GB VRAM, you’re realistically looking at 16K-64K context window before running out of memory
- Each additional 8K context adds ~2-4GB of RAM usage depending on model size
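Where does that per-8K cost come from? KV-cache memory grows linearly with context: 2 (K and V) x layers x KV heads x head dim x bytes per element, per token. The layer and head counts below are hypothetical stand-ins, not Qwen 3.5's published config, so this is a shape-of-the-math sketch rather than an exact figure:

```shell
# KV-cache size for a given context length, in GiB (F16 cache).
# layers/kv_heads/head_dim are hypothetical stand-ins for a ~27B model.
kv_gib() {
  awk -v ctx="$1" 'BEGIN {
    layers = 64; kv_heads = 8; head_dim = 128; bytes = 2
    printf "%.1f\n", 2 * layers * kv_heads * head_dim * bytes * ctx / (1024*1024*1024)
  }'
}
kv_gib 8192
kv_gib 16384
kv_gib 32768
```

With these assumed dimensions, each extra 8K of context costs about 2 GiB on top of the weights, at the low end of the 2-4GB range above.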
What this means practically:
- ✅ Codebases: You can load most single-repo codebases (typical dev workflow)
- ✅ Long documents: Technical papers, books, or lengthy conversations work fine
Quantization tradeoffs:
- Q4_K_XL: Best quality, ~2-3GB larger than Q4_K_M
- Q4_K_M: Sweet spot for most users, minimal quality loss
- Q5/Q6: Only use if you have 32GB+ and need every bit of accuracy
💡 Rule of thumb: Start with `--ctx-size 16384` (16K). If you hit an out-of-memory (OOM) error, drop to 8192. Don't go higher unless you have 32GB+ RAM/VRAM available.
Final Thoughts
We’re at an inflection point where open-weight models like Qwen 3.5 are delivering performance that used to require massive proprietary APIs. The 27B and 35B-A3B variants prove you don’t need 100B+ parameters when the architecture is well-designed.
The fact that these run on a single consumer GPU (or even a high-RAM MacBook) changes the game for local AI development. No more paying per token or worrying about API rate limits.
Key takeaways:
- Qwen 3.5 brings frontier performance to consumer hardware
- The MoE variant (35B-A3B) offers speed, while 27B gives you dense accuracy
- llama.cpp is your best bet for running GGUFs locally right now
- Both models support 256K context (hardware permitting; see the limits above) and multimodal tasks
What’s Next?
Once you’ve got this running, explore:
- Model harness: Build something with Qwen 3.5, using either the 35B-A3B MoE or the dense 27B variant
- Model performance: Compare performance against other open-weight models
- Small series: Explore the smaller variants (0.8B to 9B) if you need ultra-fast inference on mobile devices
The open model ecosystem is moving fast, and honestly, it's been fun watching the gap close. What will drop next month? I'm here for it.
Have questions or hit a snag running Qwen 3.5 locally? Drop me a note β happy to help debug.