
I'm comparing Gemma3 12B (https://ollama.com/library/gemma3; running fully on my 3060 12GB) and Mistral Small 3 24B (https://ollama.com/library/mistral-small; 10% offloaded to the CPU).

- Gemma3 12B: ~100 t/s on prompt eval; 15 t/s on eval

- Mistral Small 3 24B: ~500 t/s on prompt eval; 10 t/s on eval

Do you know what difference in architecture could make the prompt eval (prefill) so much slower on the 2x smaller Gemma3 model?
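
For reference, a rough sketch of how these t/s figures can be pulled from the Ollama HTTP API directly (assuming the default localhost:11434 endpoint and the gemma3:12b / mistral-small:24b tags; the API reports durations in nanoseconds):

    # Sketch: compare prompt-eval (prefill) vs eval t/s via the Ollama API.
    # Assumes Ollama is serving on the default localhost:11434 endpoint.
    import requests

    def benchmark(model, prompt):
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
        )
        d = r.json()
        # prompt_eval_* may be missing if the prompt was served from cache.
        prefill_tps = d["prompt_eval_count"] / (d["prompt_eval_duration"] / 1e9)
        eval_tps = d["eval_count"] / (d["eval_duration"] / 1e9)
        print(f"{model}: prompt eval {prefill_tps:.0f} t/s, eval {eval_tps:.0f} t/s")

    for m in ("gemma3:12b", "mistral-small:24b"):
        benchmark(m, "Summarize the history of GPUs in three paragraphs. " * 20)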



Thank you for the report! We are working with the Ollama team directly and will look into it.


At what context sizes? I've just run the same prompt and query on my RTX 3080 with openwebui as the frontend.

When I set the context size to 2048 (openwebui's default), the inference is almost twice as fast as when I set it to 4096. I can't set the context size any higher because my GPU only has 12GB of VRAM and ollama crashes for larger context sizes.

Still, I find that thoroughly odd. With the larger context size (4096), GPU usage is only 50% as seen in nvtop. I have no idea why.
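
If anyone wants to reproduce this without the frontend in the loop, here's a rough sketch that pins num_ctx explicitly when calling Ollama directly (assuming the default localhost:11434 endpoint and the gemma3:12b tag):

    # Sketch: force a specific context window (num_ctx) per request, so the
    # speed/VRAM comparison doesn't depend on openwebui's default setting.
    import requests

    def run_with_ctx(model, prompt, num_ctx):
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={
                "model": model,
                "prompt": prompt,
                "stream": False,
                "options": {"num_ctx": num_ctx},  # context window in tokens
            },
        )
        d = r.json()
        eval_tps = d["eval_count"] / (d["eval_duration"] / 1e9)
        print(f"num_ctx={num_ctx}: eval {eval_tps:.1f} t/s")

    for ctx in (2048, 4096):
        run_with_ctx("gemma3:12b", "Explain why KV-cache memory grows with context.", ctx)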



