How does the whole kv cache situation work for diffusion models? Like are there ... | Hacker News


nowittyusername 3 days ago | on: Mercury 2: Fast reasoning LLM powered by diffusion

How does the whole KV cache situation work for diffusion models? Are there latency and compute/monetary savings from caching? Is the curve similar to autoregressive caching options? Or maybe such things don't apply at all, and you can just mess with the system prompt and dynamically change it every turn because there are no savings to be had? Or maybe you can make dynamic changes to the head but still get cache savings because of the diffusion-based architecture? So many ideas...


volodia 3 days ago

There are many ways to do it, but the simplest approach is block diffusion: https://m-arriola.com/bd3lms/

There are also more advanced approaches, for example FlexMDM, which essentially predicts the length of the "canvas" as it "paints tokens" on it.
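
To make the block diffusion caching idea a bit more concrete, here is a rough cost sketch in Python. It is a toy counting model, not the BD3-LM implementation: the function name `kv_computations` and all its parameters are made up for illustration. It assumes blocks are generated left to right, each block is denoised for a fixed number of steps, finished blocks never change, and one extra pass over each finished block is needed to store its clean K/V.

```python
# Toy cost model (hypothetical, for illustration only): counts how many
# token positions need fresh key/value computation during block-wise
# diffusion decoding, with and without a KV cache.

def kv_computations(prompt_len: int, num_blocks: int, block_size: int,
                    denoise_steps: int, use_cache: bool) -> int:
    """Return the number of per-token K/V computations.

    Assumptions: left-to-right blocks, `denoise_steps` denoising steps per
    block, and finished blocks are frozen once generated.
    """
    total = 0
    if use_cache:
        total += prompt_len  # encode the prompt once and keep its K/V
    for b in range(num_blocks):
        # Prompt plus all blocks finished before block b.
        context = prompt_len + b * block_size
        for _ in range(denoise_steps):
            if use_cache:
                # Only the block being denoised is recomputed; the
                # context is read from the cache.
                total += block_size
            else:
                # Everything is recomputed at every denoising step.
                total += context + block_size
        if use_cache:
            # Assumed extra pass to store K/V of the finished (clean) block.
            total += block_size
    return total


if __name__ == "__main__":
    args = dict(prompt_len=100, num_blocks=4, block_size=16, denoise_steps=8)
    print("with cache:   ", kv_computations(use_cache=True, **args))
    print("without cache:", kv_computations(use_cache=False, **args))
```

Under these assumptions the prompt and finished blocks behave much like an autoregressive prefix cache, so editing the system prompt between turns would throw away the cached prefix; only the block currently being denoised is recomputed on every step.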
