I haven't always done this, and the knowledge base used to visibly degrade over time. Reviewing a PR takes only a few minutes, and the payoff compounds.
LLM evaluations are very sensitive to the details of prompt structure. This post shows how structured generation reduces both the variance of the results and the shifts in ranking.