Hacker News

What hardware do you have it running on? Do you feel you could replace the frontier models with it for everyday coding? Would/will you?


Around 20 tokens/s with a 6-bit quant at very long context lengths on my AMD Ryzen AI Max+ 395.

I’m trying to use local models whenever possible. Still need to lean on the frontier models sometimes.
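As a rough sanity check on what a 6-bit quant needs (the figures below are assumptions, not from the thread), a weights-only footprint is roughly params × bits / 8 bytes. A minimal sketch:

```python
def quant_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough weights-only memory estimate: params * bits / 8 bytes.

    Ignores the KV cache, activations, and runtime overhead, all of
    which grow with context length, so treat this as a lower bound.
    """
    return n_params * bits_per_weight / 8 / 1e9

# Hypothetical example: a 30B-parameter model at a 6-bit quant
print(quant_footprint_gb(30e9, 6.0))  # 22.5 (GB of weights alone)
```

This is why a long-context session on a 30B-class model fits comfortably on a 128 GB unified-memory machine but is a squeeze on a typical consumer GPU.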


I'm getting ~30 tok/s on the A3B model with my 3070 Ti and 32k context.

> Do you feel you could replace the frontier models with it for everyday coding? Would/will you?

Probably not yet, but it's really good at composing shell commands; for scripting and one-liner generation, the A3B shines. Its web-development skills are markedly better than Qwen's prior models in this parameter range, too.
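For the shell one-liner use case, local servers such as llama.cpp's llama-server or Ollama expose an OpenAI-compatible chat endpoint. A minimal sketch of building such a request (the endpoint URL, port, and model name are assumptions, not from the thread):

```python
def shell_oneliner_request(task: str, model: str = "qwen3-30b-a3b") -> dict:
    """Build an OpenAI-compatible chat payload that asks for a single
    shell command. POST it to a local server, e.g.
    http://localhost:8080/v1/chat/completions (hypothetical endpoint).
    """
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "Reply with exactly one POSIX shell command, no prose."},
            {"role": "user", "content": task},
        ],
        "temperature": 0.2,  # low temperature keeps one-liners predictable
    }

payload = shell_oneliner_request("delete all .log files older than 7 days")
print(payload["messages"][1]["content"])
```

The strict system prompt matters more than the model size here: it keeps the reply pipeable straight into a terminal instead of wrapped in explanation.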


That seems oddly low, a fair amount slower than what I get on my M4 (~45 tok/s, I believe).

What quant are you using? How much RAM does it have?


60 to 70 tok/s on a 5080, but I'm only tinkering for now. The smaller models seem exceptionally good for what they are, and some can even do OCR reliably.


What quantization are you running on the 5080? I'm waiting to receive mine.




