>How large LLMs can be run at reasonable speed on 12GB (3060), 32GB RAM?
If you want to offload fully to VRAM, I'd say 8B is the limit. If you're keeping some on RAM, 15-20B can still give OK performance, depending on your tolerance.
>How much does quantization impact output quality?
Basically with more quantization the output becomes more incoherent and less realistic. At the extreme end it's basically just gibberish. I think the sweet spot generally is at 4 bits. At that point the model is pretty compact and the quality isn't diminished too much.
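Those sizing claims follow from simple arithmetic: weights take roughly (parameter count × bits per weight / 8) bytes, plus some overhead for the KV cache and activations. A rough sketch (the 20% overhead factor is an assumption, not a measured value):

```python
# Rough VRAM estimate: params (in billions) * bits / 8 gives GB of weights;
# multiply by an assumed ~20% overhead for KV cache and activations.
def model_size_gb(params_b: float, bits: int, overhead: float = 1.2) -> float:
    return params_b * bits / 8 * overhead

# An 8B model at 4-bit quantization: ~4.8 GB, fits comfortably in 12 GB VRAM.
print(round(model_size_gb(8, 4), 1))
# A 20B model at 4-bit: ~12 GB, so part of it has to sit in system RAM.
print(round(model_size_gb(20, 4), 1))
```

This is why 8B fits fully on a 3060 while 15–20B needs partial CPU offload.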