SWE chart is missing Claude on front page, interesting way to present your data. Mix and match at will.
Grown up people showing public school level sneakiness. That fact alone disqualifies your LL. Business/marketing leaders usually are brighter than average developers... so there.
I periodically try to run these models on my MBP M3 Max 128G (which I bought with a mind to run local AI). I have a certain deep research question (in a field that is deeply familiar to me) that I ask when I want to gauge model's knowledge.
So far Opus 4.6 and Gemini Pro are very satisfactory, producing great answers fairly fast. Gemini is very fast at 30-50 sec, Opus is very detailed and comes at about 2-3 minutes.
Today I ran the question against local qwen3.5:35b-a3b - it puffed for 45 (!) minutes, produced a very generic answer with errors, and made my laptop sound like it's going to take off any moment.
Wonder what am I doing wrong?.. How am I supposed to use this for any agentic coding on a large enough codebase? It will take days (and a 3M Peltor X5A) to produce anything useful.
You're comparing 100b parameters open models running on a consumer laptop VS private models with at the very least 1t parameters running on racks of bleeding edge professional gpus
Local agentic coding is closer to "shit me the boiler plate for an android app" not "deep research questions", especially on your machine
Well Opus and Gemini are probably running on multiple H200 equivalents, maybe multiple hundreds of thousands of dollars of inference equipment. Local models are inherently inferior; even the best Mac that money can buy will never hold a candle to latest generation Nvidia inference hardware, and the local models, even the largest, are still not quite at the frontier. The ones you can plausibly run on a laptop (where "plausible" really is "45 minutes and making my laptop sound like it is going to take off at any moment". Like they said -- you're getting sonnet 4.5 performance which is 2 generations ago; speaking from experience opus 4.6 is night and day compared to sonnet 4.5
> Well Opus and Gemini are probably running on multiple H200 equivalents, maybe multiple hundreds of thousands of dollars of inference equipment.
But if you've got that kind of equipment, you aren't using it to support a single user. It gets the best utilization by running very large batches with massive parallelism across GPUs, so you're going to do that. There is such a thing as a useful middle ground. that may not give you the absolute best in performance but will be found broadly acceptable and still be quite viable for a home lab.
Batching helps with efficiency but you can’t fit opus into anything less than hundreds of thousands of dollars in equipment
Local models are more than a useful middle ground they are essential and will never go away, I was just addressing the OPs question about why he observed the difference he did. One is an API call to the worlds most advanced compute infrastructure and another is running on a $500 CPU.
Lots of uses for small, medium, and larger models they all have important places!!
Running local AI models on a laptop is a weird choice. The Mini and especially the Studio form factor will have better cooling, lower prices for comparable specs and a much higher ceiling in performance and memory capacity.
I can never see the point, though. Performance isn't anywhere near Opus, and even that gets confused following instructions or making tool calls in demanding scenarios. Open weights models are just light years behind.
I really, really want open weights models to be great, but I've been disappointed with them. I don't even run them locally, I try them from providers, but they're never as good as even the current Sonnet.
Yeah, for sure, I just don't have many of those. For example, the only use I have for Haiku is for summarizing webpages, or Sonnet for coding something after Opus produces a very detailed plan.
Maybe I should try local models for home automation, Qwen must be great at that.
They're like 6 months away on most benchmarks, people already claimed coding wad solved 6 months ago, so which is it? The current version is the baseline that solves everything but as soon as the new version is out it becomes utter trash and barely usable
That's very large models at full quantization though. Stuff that will crawl even on a decent homelab, despite being largely MoE based and even quantization-aware, hence reducing the amount and size of active parameters.
That's just a straw man. Each frontier model version is better than the previous one, and I use it for harder and harder things, so I have very little use for a version that's six months behind. Maybe for simple scripts they're great, but for a personal assistant bot, even Opus 4.6 isn't as good as I'd like.
use a larger model like Qwen3.5-122B-A10B quantized to 4/5/6 bits depending on how much context you desire, MLX versions if you want best tok/s on Mac HW.
if you are able to run something like mlx-community/MiniMax-M2.5-3bit (~100gb), my guess if the results are much better than 35b-a3b.
On my 32GB Ryzen desktop (recently upgraded from 16GB before the RAM prices went up another +40%), did the same setup of llama.cpp (with Vulkan extra steps) and also converged on Qwen3-Coder-30B-A3B-Instruct (also Q4_K_M quantization)
On the model choice: I've tried latest gemma, ministral, and a bunch of others. But qwen was definitely the most impressive (and much faster inference thanks to MoE architecture), so can't wait to try Qwen3.5-35B-A3B if it fits.
I've no clue about which quantization to pick though ... I picked Q4_K_M at random, was your choice of quantization more educated?
Smells like hyperbole. A lot of people making such claims don’t seem to have continued real world experience with these models or seem to have very weird standards for what they consider usable.
Up until relatively recently, while people had already long been making these claims, it came with the asterisks of „oh, but you can’t practically use more than a few K tokens of context“.
"Create a single page web app scientific RPN calculator"
Qwen 3.5 122b/a10b (at q3 using unsloth's dynamic quant) is so far the first model I've tried locally that gets a really usable RPN calculator app. Other models (even larger ones that I can run on my Strix Halo box) tend to either not implement the stack right, have non-functional operation buttons, or most commonly the keypad looks like a Picasso painting (i.e., the 10-key pad portion has buttons missing or mapped all over the keypad area).
This seems like such as simple test, but I even just tried it in chatgpt (whatever model they serve up when you don't log in), and it didn't even have any numerical input buttons. Claude Sonet 4.6 did get it correct too, but that is the only other model I've used that gets this question right.
We tend to find Qwen3-Coder-Next better at coding at least on our anecdotal examples from our codebases. It's somewhat better at tool calling, maybe the current templates for Qwen3.5 are still not enjoying as "mature" support as Qwen3 on vllm. I can say in my team MiniMax2.5 is the currently favorite.
I used the 35b model to create a polars implementation of PCA (no sklearn or imports other than math and polars). In less than 10 minutes I had the code. This is impressive to me considering how poorly all models were with polars until very recently. (They always hallucinated pandas code.)
Qwen3-Coder-30B-A3B-Instruct is good I think for in line IDE integration or operating on small functions or library code but I dont think you will get too far with one shot feature implementation that people are currently doing with Claude or whatever.
I have been adding a one shot feature to a codebase with ChatGPT 5.3 Codex in Cursor and it worked out of the box but then I realised everything it had done was super weird and it didn't work under a load of edge cases. I've tried being super clear about how to fix it but the model is lost. This was not a complex feature at all so hopefully I'm employed for a few more years yet.
I could be doing something wrong, but I have not had any success with one shot feature implementations for any of the current models. There are always weird quirks, undesired behaviors, bad practices, or just egregiously broken implementations. A week or so ago, I had instructed Claude to do something at compile-time and it instead burned a phenomenal amount of tokens before yeeting the most absurd, and convoluted, runtime implementation—- that didn’t even work. At work I use it (or Codex) for specific tasks, delegating specific steps of the feature implementation.
The more I use the cloud based frontier models, the more virtue I find in using local, open source/weights, models because they tend to create much simpler code. They require more direct interaction from me, but the end result tends to be less buggy, easier to refactor/clean up, and more precisely what I wanted. I am personally excited to try this new model out here shortly on my 5090. If read the article correctly, it sounds like even the quantized versions have a “million”[1] token context window.
And to note, I’m sure I could use the same interaction loop for Claude or GPT, but the local models are free (minus the power) to run.
[1] I’m a dubious it won’t shite itself at even 50% of that. But even 250k would be amazing for a local model when I “only” have 32GB of VRAM.
Thinking about getting a new MBP M5 Max 128GB (assuming they are released next week). I know "future proofing" at this stage is near impossible, but for writing Rust code locally (likely using Qwen 3.5 for now on MLX), the AIs have convinced me this is probably my best choice for immediate with some level of longevity, while retaining portability (not strictly needed, but nice to have). Alternatively was considering RTX options or a mac studio, but was leaning towards apple for the unified memory. What does HN think?
Radeon R9700 with 32 GB VRAM is relatively affordable for the amount of RAM and with llama.cpp it runs fast enough for most things. These are workstation cards with blower fans and they are LOUD. Otherwise if you have the money to burn get a 5090 for speeeed and relatively low noise, especially if you limit power usage.
It depends. How much are you willing to wait for an answer? Also, how far are you willing to push quantization, given the risk of degraded answers at more extreme quantization levels?
It's less than you'd think. I'm using the 35B-A3B model on an A5000, which is something like a slightly faster 3080 with 24GB VRAM. I'm able to fit the entire Q4 model in memory with 128K context (and I think I would probably be able to do 256K since I still have like 4GB of VRAM free). The prompt processing is something like 1K tokens/second and generates around 100 tokens/second. Plenty fast for agentic use via Opencode.
I've had an AMD card for the last 5 years, so I kinda just tuned out of local LLM releases because AMD seemed to abandon rocm for my card (6900xt) - Is AMD capable of anything these days?
> I've had an AMD card for the last 5 years, so I kinda just tuned out of local LLM releases because AMD seemed to abandon rocm for my card (6900xt) - Is AMD capable of anything these days?
Sure. Llama.cpp will happily run these kinds of LLMs using either HIP or Vulcan.
Vulkan is easier to get going using the Mesa OSS drivers under Linux, HIP might give you slightly better performance.
I think the 27B dense model at full precision and 122B MoE at 4- or 6-bit quantization are legitimate killer apps for the 96 GB RTX 6000 Pro Blackwell, if the budget supports it.
I imagine any 24 GB card can run the lower quants at a reasonable rate, though, and those are still very good models.
Big fan of Qwen 3.5. It actually delivers on some of the hype that the previous wave of open models never lived up to.
No experience with 5 and not much with 4.7, but they both have quite a few advocates over on /r/localllama.
Unsloth's GLM-4.7-Flash-BF16.gguf is quite fast on the 6000, at around 100 t/s, but definitely not as smart as the Qwen 3.5 MoE or dense models of similar size. As far as I'm concerned Qwen 3.5 renders most other open models short of perhaps Kimi 2.5 obsolete for general queries, although other models are still said to be better for local agentic use. That, I haven't tried.
18GB was an odd 3-channel one-off for the M3 Pros. I guess there's a bunch of them out there, but how slow would 27B be on it, due to not being an MOE model.
That's like saying "somewhere between Eliza and Haiku 4.5". Haiku is not even a so-called 'reasoning model'.¹
¹ To preempt the easily-offended, this is what the latest Opus 4.6 in today's Claude Code update says: "Claude Haiku 4.5 is not a reasoning model — it's optimized for speed and cost efficiency. It's the fastest model in the Claude family, good for quick, straightforward tasks, but it doesn't have extended thinking/reasoning capabilities."
> Claude Haiku 4.5, a new hybrid reasoning large language model from Anthropic in our small, fast model class.
> As with each model released by Anthropic beginning with Claude Sonnet 3.7, Claude Haiku 4.5 is a hybrid reasoning model. This means that by default the model will answer a query rapidly, but users have the option to toggle on “extended thinking mode”, where the model will spend more time considering its response before it answers. Note that our previous model in the Haiku small-model class, Claude Haiku 3.5, did not have an extended thinking mode.
Not sure what this means, but as a marketing person myself, here's what happened: One day, an Anthropican involved in the Haiku 4.5 launch shrugged, weighed the odds of getting spanked for equating "extended thinking" with "reasoning", and then used Claude to generate copy declaring that. It's not rocket surgery!
I asked it to recite potato 100 times coz I wanted to benchmark speed of CPU vs GPU. It's on 150 line of planning. It recited the requested thing 4 times already and started drafting the 5th response.
Qwen3.5 pretty much requires a long system prompt, otherwise it goes into a weird planning mode where it reasons for minutes about what to do, and double and triple checks everything it does. Both Gemini's and Claude Opus 4.6's prompts work pretty well, but are so long that whatever you're using to run the model has to support prompt caching. Asking it to "Say the word "potato" 100 times, once per line, numbered.", for example, results in the following reasoning, followed by the word "potato" in 100 numbered lines, using the smallest (and therefore dumbest) quant unsloth/Qwen3.5-35B-A3B-GGUF:UD-IQ2_XXS:
"User is asking me to repeat the word "potato" 100 times, numbered. This is a simple request - I can comply with this request. Let me create a response that includes the word "potato" 100 times, numbered from 1 to 100.
I'll need to be careful about formatting - the user wants it numbered and once per line. I should use minimal formatting as per my instructions."
good to know, thanks. I just ran ollama with qwen3.5:27b. Currently it's stuck on picking format
Let's write.
Wait, I'll write the response.
Wait, I'll check if I should use a table.
No, text is fine.
Okay.
Let's write.
Wait, I'll write the response.
Wait, I'll check if I should use a bullet list.
No, just lines.
Okay.
Let's write.
Wait, I'll write the response.
Wait, I'll check if I should use a numbered list.
No, lines are fine.
Okay.
Let's write.
Wait, I'll write the response.
Wait, I'll check if I should use a code block.
Yes.
Okay.
Let's write.
Wait, I'll write the response.
Wait, I'll check if I should use a pre block.
Code block is better.
Yeah, it tends to get stuck in loops like that a lot with everything set to default. I wonder if they distilled Gemini at some point, I've seen that get stuck in a similar "I will now do [thing]. I am preparing to do [thing]. I will do it." failure mode as well a couple of times.
well hold on now, maybe it’s onto something. do you really know what it means to “recite” “potato” “100” “times”? each of those words could be pulled apart into a dissertation-level thesis and analysis of language, history, and communication.
either that, or it has a delusional level of instruction following. doesn’t mean it can’t code like sonnet though
It's still amusing to see those seemingly simple things still put it into loop
it is still going
> do you really know what it means to “recite” “potato” “100” “times”?
asking user question is an option. Sonnet did that a bunch when I was trying to debug some network issue. It also forgot the facts checked for it and told it before...
All the western ones are closed while all the Chinese ones are open. The only exception is the European Mistral but performance of that model is not very satisfactory. Hopefully they make some improvements soon
Nothing personally - Our customers send us highly sensitive financial documents to process. Using a foreign model to process their data (or even just for local testing) will most likely result in a u-turn.
Impressive, very nice, now let's see what would be the odds that the US models developed in SV are also highly positive about Californian and Democrats politics.
So far Opus 4.6 and Gemini Pro are very satisfactory, producing great answers fairly fast. Gemini is very fast at 30-50 sec, Opus is very detailed and comes at about 2-3 minutes.
Today I ran the question against local qwen3.5:35b-a3b - it puffed for 45 (!) minutes, produced a very generic answer with errors, and made my laptop sound like it's going to take off any moment.
Wonder what am I doing wrong?.. How am I supposed to use this for any agentic coding on a large enough codebase? It will take days (and a 3M Peltor X5A) to produce anything useful.
You're comparing 100b parameters open models running on a consumer laptop VS private models with at the very least 1t parameters running on racks of bleeding edge professional gpus
Local agentic coding is closer to "shit me the boiler plate for an android app" not "deep research questions", especially on your machine
Speculation is that the frontier models are all below 200B parameters but a 2x size difference wouldn’t fully explain task performance differences
There are the benchmarks, the promises, and what everybody can try at home
But if you've got that kind of equipment, you aren't using it to support a single user. It gets the best utilization by running very large batches with massive parallelism across GPUs, so you're going to do that. There is such a thing as a useful middle ground. that may not give you the absolute best in performance but will be found broadly acceptable and still be quite viable for a home lab.
Local models are more than a useful middle ground they are essential and will never go away, I was just addressing the OPs question about why he observed the difference he did. One is an API call to the worlds most advanced compute infrastructure and another is running on a $500 CPU.
Lots of uses for small, medium, and larger models they all have important places!!
Admittedly I haven't tried these models on my Mac but I have on my DGX Spark and they ran fine. I didn't see the slowdown you're mentioning.
I really, really want open weights models to be great, but I've been disappointed with them. I don't even run them locally, I try them from providers, but they're never as good as even the current Sonnet.
Maybe I should try local models for home automation, Qwen must be great at that.
if you are able to run something like mlx-community/MiniMax-M2.5-3bit (~100gb), my guess if the results are much better than 35b-a3b.
Also, performance on research-y questions isn't always a good indicator of how the model will do for code generation or agent orchestration.
- llama.cpp
- OpenCode
- Qwen3-Coder-30B-A3B-Instruct in GGUF format (Q4_K_M quantization)
working on a M1 MacBook Pro (e.g. using brew).
It was bit finicky to get all of the pieces together so hopefully this can be used with these newer models.
https://gist.github.com/alexpotato/5b76989c24593962898294038...
On the model choice: I've tried latest gemma, ministral, and a bunch of others. But qwen was definitely the most impressive (and much faster inference thanks to MoE architecture), so can't wait to try Qwen3.5-35B-A3B if it fits.
I've no clue about which quantization to pick though ... I picked Q4_K_M at random, was your choice of quantization more educated?
Up until relatively recently, while people had already long been making these claims, it came with the asterisks of „oh, but you can’t practically use more than a few K tokens of context“.
Qwen 3.5 122b/a10b (at q3 using unsloth's dynamic quant) is so far the first model I've tried locally that gets a really usable RPN calculator app. Other models (even larger ones that I can run on my Strix Halo box) tend to either not implement the stack right, have non-functional operation buttons, or most commonly the keypad looks like a Picasso painting (i.e., the 10-key pad portion has buttons missing or mapped all over the keypad area).
This seems like such as simple test, but I even just tried it in chatgpt (whatever model they serve up when you don't log in), and it didn't even have any numerical input buttons. Claude Sonet 4.6 did get it correct too, but that is the only other model I've used that gets this question right.
The more I use the cloud based frontier models, the more virtue I find in using local, open source/weights, models because they tend to create much simpler code. They require more direct interaction from me, but the end result tends to be less buggy, easier to refactor/clean up, and more precisely what I wanted. I am personally excited to try this new model out here shortly on my 5090. If read the article correctly, it sounds like even the quantized versions have a “million”[1] token context window.
And to note, I’m sure I could use the same interaction loop for Claude or GPT, but the local models are free (minus the power) to run.
[1] I’m a dubious it won’t shite itself at even 50% of that. But even 250k would be amazing for a local model when I “only” have 32GB of VRAM.
Edit: The unsloth quants seem to have been fixed, so they are probably the go-to again: https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks
Quite misleading, really.
If you want to spend twice as much for more speed, get a 3090/4090/5090.
If you want long context, get two of them.
If you have enough spare cash to buy a car, get an RTX Ada with 96G VRAM.
Check out the HP Omen 45L Max: https://www.hp.com/us-en/shop/pdp/omen-max-45l-gaming-dt-gt2...
I'm curious which one you're using.
Sure. Llama.cpp will happily run these kinds of LLMs using either HIP or Vulcan.
Vulkan is easier to get going using the Mesa OSS drivers under Linux, HIP might give you slightly better performance.
I imagine any 24 GB card can run the lower quants at a reasonable rate, though, and those are still very good models.
Big fan of Qwen 3.5. It actually delivers on some of the hype that the previous wave of open models never lived up to.
Unsloth's GLM-4.7-Flash-BF16.gguf is quite fast on the 6000, at around 100 t/s, but definitely not as smart as the Qwen 3.5 MoE or dense models of similar size. As far as I'm concerned Qwen 3.5 renders most other open models short of perhaps Kimi 2.5 obsolete for general queries, although other models are still said to be better for local agentic use. That, I haven't tried.
none of the qwen 3.5 models are anywhere near sonnet 4.5 class, not even the largest 397b.
BUT 27b is the smartest local-sized model in the world by a wide wide margin. (35b is shit. fast shit, but shit.)
benchmarks are complete, publishing on Monday.
Strong vision and reasoning performance, and the 35-a3b model run s pretty ok on a 16gb GPU with some CPU layers.
Obviously there's more to a model than that but it's a data point.
[1]: https://github.com/fairydreaming/lineage-bench
[2]: https://github.com/fairydreaming/lineage-bench-results/tree/...
Somewhere between Haiku 4.5 and Sonnet 4.5
That's like saying "somewhere between Eliza and Haiku 4.5". Haiku is not even a so-called 'reasoning model'.¹
¹ To preempt the easily-offended, this is what the latest Opus 4.6 in today's Claude Code update says: "Claude Haiku 4.5 is not a reasoning model — it's optimized for speed and cost efficiency. It's the fastest model in the Claude family, good for quick, straightforward tasks, but it doesn't have extended thinking/reasoning capabilities."
[0]: https://www-cdn.anthropic.com/7aad69bf12627d42234e01ee7c3630...
> Claude Haiku 4.5, a new hybrid reasoning large language model from Anthropic in our small, fast model class.
> As with each model released by Anthropic beginning with Claude Sonnet 3.7, Claude Haiku 4.5 is a hybrid reasoning model. This means that by default the model will answer a query rapidly, but users have the option to toggle on “extended thinking mode”, where the model will spend more time considering its response before it answers. Note that our previous model in the Haiku small-model class, Claude Haiku 3.5, did not have an extended thinking mode.
I would absolutely believe mar-ticles that Qwen has achieved Haiku 4.5 'extended thinking' levels of coding prowess.
Oh HN never change.
Maybe "Qwen3.5 122B offers Haiku 4.5 performance on local computers" would be a more realistic and defensible claim.
...yeah I doubt it
"User is asking me to repeat the word "potato" 100 times, numbered. This is a simple request - I can comply with this request. Let me create a response that includes the word "potato" 100 times, numbered from 1 to 100.
I'll need to be careful about formatting - the user wants it numbered and once per line. I should use minimal formatting as per my instructions."
either that, or it has a delusional level of instruction following. doesn’t mean it can’t code like sonnet though
> do you really know what it means to “recite” “potato” “100” “times”?
asking user question is an option. Sonnet did that a bunch when I was trying to debug some network issue. It also forgot the facts checked for it and told it before...
What's your problem with Chinese LLMs?
An Analysis of Chinese LLM Censorship and Bias with Qwen 2 Instruct https://huggingface.co/blog/leonardlin/chinese-llm-censorshi...