
I'm comparing Gemma3 12B (https://ollama.com/library/gemma3; running fully on my 3060 12GB) and Mistral Small 3 24B (https://ollama.com/library/mistral-small; 10% offloaded to the CPU).

- Gemma3 12B: ~100 t/s on prompt eval; 15 t/s on eval

- Mistral Small 3 24B: ~500 t/s on prompt eval; 10 t/s on eval

Do you know what difference in architecture could make the prompt eval (prefill) so much slower on the 2x smaller Gemma3 model?
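
For reference, a rough sketch of how these t/s figures can be pulled from the Ollama HTTP API directly (assuming the default localhost:11434 endpoint and the gemma3:12b / mistral-small:24b tags; the API reports durations in nanoseconds):

    # Sketch: compare prompt-eval (prefill) vs eval t/s via the Ollama API.
    # Assumes Ollama is serving on the default localhost:11434 endpoint.
    import requests

    def benchmark(model, prompt):
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
        )
        d = r.json()
        # prompt_eval_* may be missing if the prompt was served from cache.
        prefill_tps = d["prompt_eval_count"] / (d["prompt_eval_duration"] / 1e9)
        eval_tps = d["eval_count"] / (d["eval_duration"] / 1e9)
        print(f"{model}: prompt eval {prefill_tps:.0f} t/s, eval {eval_tps:.0f} t/s")

    for m in ("gemma3:12b", "mistral-small:24b"):
        benchmark(m, "Summarize the history of GPUs in three paragraphs. " * 20)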



Thank you for the report! We are working with the Ollama team directly and will look into it.


At what context sizes? I've just run the same prompt and query on my RTX 3080 with openwebui as the frontend.

When I set the context size to 2048 (openwebui's default), the inference is almost twice as fast as when I set it to 4096. I can't set the context size any higher because my GPU only has 12GB of VRAM and ollama crashes for larger context sizes.

Still, I find that thoroughly odd. With the larger context size (4096), GPU usage is only 50% as seen in nvtop. I have no idea why.
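
If anyone wants to reproduce this without the frontend in the loop, here's a rough sketch that pins num_ctx explicitly when calling Ollama directly (assuming the default localhost:11434 endpoint and the gemma3:12b tag):

    # Sketch: force a specific context window (num_ctx) per request, so the
    # speed/VRAM comparison doesn't depend on openwebui's default setting.
    import requests

    def run_with_ctx(model, prompt, num_ctx):
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": model,
                "prompt": prompt,
                "stream": False,
                "options": {"num_ctx": num_ctx},  # context window in tokens
            },
        )
        d = r.json()
        eval_tps = d["eval_count"] / (d["eval_duration"] / 1e9)
        print(f"num_ctx={num_ctx}: eval {eval_tps:.1f} t/s")

    for ctx in (2048, 4096):
        run_with_ctx("gemma3:12b", "Explain why KV-cache memory grows with context.", ctx)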



