I worked on a project that did fine-tuning and RLHF[1] for a major provider, and you would not believe just how utterly broken a large proportion of the prompts (from real users) were. The project rules practically required reading tea leaves to divine how to give the best response, even to prompts that were not remotely coherent human language.
[1] Reinforcement learning from human feedback; basically participants got two model responses and had to judge them on multiple criteria relative to the prompt
I made the argument multiple times that the right answer to many prompts would be a question, and it was allowed under some rare circumstances, but far too few.
I suspect that was in part because the provider also didn't want to create an easy cop-out for the people working on the fine-tuning side (a lot of my work was auditing and reviewing output, and there was indeed a lot of really sloppy work, up to and including cutting and pasting output from other LLMs - we knew, because on more than one occasion I caught people who had managed to include part of Claude's website footer in their answer...)
I upgraded to a new model (gpt-4o-mini to grok-4.1-fast) and suddenly all my workflows were broken. I thought "this new model is shit!", but then I looked into my prompts and realized the model was actually better at following instructions, and my instructions were wrong and contradictory.
After I fixed my prompts, it did exactly what I asked for.
Maybe models should have another tuneable parameter for how strictly they should respect the user prompt. This reminds me of imagegen models, where you can choose the CFG/guidance scale or the diffusion strength.
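For comparison, a minimal sketch of that knob on the imagegen side, assuming the Hugging Face diffusers library (the checkpoint name and the scale values are just illustrative): the guidance_scale argument is roughly the "how much should the output respect the prompt" dial being suggested here.

    # Illustrative sketch using Hugging Face diffusers; assumes torch and a GPU.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    prompt = "a watercolor painting of a lighthouse at dusk"

    # Low guidance scale: generation drifts more freely from the prompt.
    loose = pipe(prompt, guidance_scale=3.0).images[0]

    # High guidance scale: generation sticks much more closely to the prompt.
    strict = pipe(prompt, guidance_scale=12.0).images[0]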