I regularly find broken thermal solutions in Azure, and when I worked at Google it was also a low-level but constant irritant. When I joined Dropbox I told my team on my first day that I could find a machine in their fleet running at 400 MHz, and I was right: a bogus redundant PSU controller was asserting PROCHOT. These things happen whenever you have a lot of machines.
The term PROCHOT just brought me back to vivid memories of debugging exactly that at Facebook a while ago.
It was very non-obvious to debug, since almost all of the emitted metrics looked reasonable, apart from mysterious errors and timeouts to our service. Even the CPU usage and CPU temperature graphs looked normal, since it was a bogus PROCHOT and not real thermal throttling.
And it brought me back to memories of debugging that on my friend's laptop.
It kept dropping to 400 MHz. I suspected thermal throttling, so we got it cleaned, the thermal paste replaced, and all that.
Still throttled. We replaced Windows with Linux, since that was at least a bit more usable.
At the time I didn't know about PROCHOT. And my googling skills clearly weren't sufficient.
One fine day, during lunch at a place on campus, I remembered I'd recently read about BD_PROCHOT. So I wrote a script that used msrprobe, or whatever it was, to disable it. That "extended" the lifespan of the thing.
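For anyone wondering what such a script might look like, here is a rough sketch, not the script described above. It assumes an Intel CPU where BD_PROCHOT is enabled by bit 0 of MSR 0x1FC (commonly called MSR_POWER_CTL), and a Linux box with the msr kernel module loaded so that /dev/cpu/N/msr exists. The function names are made up for illustration. Check the register layout for your own CPU before writing to any MSR.

```python
import os
import struct

# Assumptions (verify for your CPU): MSR 0x1FC is the power-control register
# and bit 0 enables bi-directional PROCHOT (BD_PROCHOT).
MSR_POWER_CTL = 0x1FC
BD_PROCHOT_BIT = 1 << 0


def clear_bd_prochot(value: int) -> int:
    """Return the register value with the BD_PROCHOT enable bit cleared."""
    return value & ~BD_PROCHOT_BIT


def read_msr(cpu: int, reg: int) -> int:
    """Read a 64-bit MSR via the Linux msr driver (needs root and `modprobe msr`)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr device uses the file offset as the register address.
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)


def write_msr(cpu: int, reg: int, value: int) -> None:
    """Write a 64-bit MSR via the Linux msr driver (needs root)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), reg)
    finally:
        os.close(fd)


def disable_bd_prochot(cpu: int = 0) -> None:
    """Hypothetical helper: read-modify-write so the other bits are left alone."""
    current = read_msr(cpu, MSR_POWER_CTL)
    write_msr(cpu, MSR_POWER_CTL, clear_bd_prochot(current))
```

Calling `disable_bd_prochot()` as root would do the read-modify-write on CPU 0. The change most likely does not survive a reboot or suspend, so it would need to be reapplied, and it also turns off a signal that exists to protect the hardware, so it is a workaround rather than a fix.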
I once had a Dell laptop that, after about three years, started complaining on power-up that I wasn't using a genuine Dell PSU and should switch to one. I ignored it at first because you could just hit Enter and carry on, but after a while I noticed that every time this happened, the CPU would clock at a fixed 800 MHz. I ordered a new power brick but the message didn't go away, so I returned the brick and decided never to buy Dell again.
A laptop that I had would assert PROCHOT if it didn't like the power supply you plugged into it. It actually took an embarrassing amount of time for me to notice that this is what was causing Slack to be inexplicably slower at my desk than when I was out working in a common area in the building.
I've seen it on nearly every brand. I have some Lenovo servers in the basement that also down-clock if both PSUs aren't installed.
I have alerts on PSUs and frequency for this reason.
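A frequency check of that kind could be sketched roughly like this. It is not my actual alerting setup: it assumes Linux with the cpufreq sysfs files (`scaling_cur_freq`, reported in kHz), and the 1 GHz floor and the function names are arbitrary choices for illustration. Idle cores clock down on their own, so this only flags the case where every core is below the floor, and a real alert would probably also require the condition to persist while the machine is under load.

```python
from pathlib import Path


def current_frequencies_khz(root: str = "/sys/devices/system/cpu") -> dict:
    """Map each CPU name (e.g. 'cpu0') to its current frequency in kHz."""
    freqs = {}
    for f in Path(root).glob("cpu[0-9]*/cpufreq/scaling_cur_freq"):
        # Path layout is .../cpuN/cpufreq/scaling_cur_freq
        freqs[f.parent.parent.name] = int(f.read_text().strip())
    return freqs


def looks_throttled(freqs_khz: dict, floor_khz: int = 1_000_000) -> bool:
    """True if there is at least one reading and every core is below the floor.

    floor_khz is an assumed threshold; pick one that suits your hardware.
    """
    return bool(freqs_khz) and max(freqs_khz.values()) < floor_khz
```

Something like `looks_throttled(current_frequencies_khz())` run from cron or a metrics agent would catch a box pinned at 400 or 800 MHz, which is the signature of the PROCHOT cases in this thread.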
The servers are so cheap that overcommitting them by double is still significantly cheaper than using cloud hosting, which tends to have the same issue, except that monitoring it is harder. Most people using cloud seem happy not to know, though, and it's been a known thing that there's a 5x variation between instances of the same size on AWS: https://www.brendangregg.com/Slides/AWSreInvent2017_performa...