I regularly find broken thermal solutions in azure and when I worked at Google i...

radicality · on Feb 19, 2025

The term PROCHOT just brought me back to vivid memories of debugging exactly that at Facebook a while ago.

It was very non-obvious to debug, since nearly all emitted metrics, apart from mysterious errors and timeouts in our service, looked reasonable. Even the CPU usage and CPU temperature graphs looked normal, since it was a bogus PROCHOT and not real thermal throttling.

porridgeraisin · on Feb 19, 2025

And it brought me back to memories of debugging that on my friend's laptop.

It kept dropping to 400 MHz. I suspected throttling, so we got it cleaned, the thermal paste replaced, and all that.

Still throttled. We replaced Windows with Linux, since it was at least a bit more usable.

At the time I didn't know about PROCHOT. And my googling skills clearly weren't sufficient.

One fine day, during lunch at a place on campus, I remembered that I'd recently read about BD_PROCHOT. So I wrote a script to msrprobe (or whatever it was) and disabled it. That "extended" the lifespan of the thing.
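For anyone curious what such a script might look like, here is a minimal sketch, not the script from the comment above. It assumes a Linux machine with the `msr` kernel module loaded and root access, and it assumes the commonly cited layout on Intel CPUs where BD_PROCHOT is bit 0 of MSR 0x1FC; check that against the documentation for your CPU before trusting it. The function names are made up for illustration.

```python
import os
import struct

# Assumption (not from the thread): BD_PROCHOT is commonly documented as
# bit 0 of MSR 0x1FC (MSR_POWER_CTL) on many Intel CPUs.
MSR_POWER_CTL = 0x1FC
BD_PROCHOT_BIT = 1 << 0


def clear_bd_prochot(value: int) -> int:
    """Return the register value with the BD_PROCHOT enable bit cleared."""
    return value & ~BD_PROCHOT_BIT


def read_msr(cpu: int, reg: int) -> int:
    """Read a 64-bit MSR via the Linux msr driver (needs root and `modprobe msr`)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr device uses the file offset as the register address.
        return struct.unpack("<Q", os.pread(fd, 8, reg))[0]
    finally:
        os.close(fd)


def write_msr(cpu: int, reg: int, value: int) -> None:
    """Write a 64-bit MSR via the Linux msr driver (needs root)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), reg)
    finally:
        os.close(fd)


def disable_bd_prochot(cpu: int = 0) -> None:
    """Read-modify-write: clear only the BD_PROCHOT bit, keep the other bits."""
    current = read_msr(cpu, MSR_POWER_CTL)
    write_msr(cpu, MSR_POWER_CTL, clear_bd_prochot(current))
```

Calling `disable_bd_prochot()` as root would apply the change until the next reboot or resume. Keep in mind that BD_PROCHOT lets other components ask the CPU to throttle, so turning it off only makes sense when the signal is known to be bogus.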

iforgotpassword · on Feb 20, 2025

I once had a Dell laptop that, after about three years, started complaining on power-up that I wasn't using a genuine Dell PSU and should switch to one. I ignored it at first because you could just hit Enter and carry on, but after a while I noticed that every time this happened, the CPU would clock at a fixed 800 MHz. I ordered a new power brick, but the message didn't go away, so I returned the brick and decided to never buy Dell again.

bityard · on Feb 19, 2025

A laptop that I had would assert PROCHOT if it didn't like the power supply you plugged into it. It actually took an embarrassing amount of time for me to notice that this is what was causing Slack to be inexplicably slower at my desk than when I was out working in a common area in the building.

tryauuum · on Feb 19, 2025

In my (limited) experience, this only happened with GIGABYTE servers.

Very weird behavior. I'd prefer my servers to crash instead of lowering frequency to 400MHz.

dijit · on Feb 19, 2025

I've seen it on nearly every brand. I have some Lenovo servers in the basement that also down-clock if both PSUs aren't installed.

I have alerts on PSUs and frequency for this reason.
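A frequency alert of this kind could be sketched roughly as follows. This is not the actual alerting setup described here, just an illustration. It assumes Linux with the cpufreq sysfs interface, which reports `scaling_cur_freq` in kHz; the 500 MHz floor and the function names are hypothetical choices, picked so that a 400 MHz PROCHOT clamp trips the check while ordinary idle clocks on most machines do not.

```python
import glob

# Hypothetical alert threshold: anything below 500 MHz is treated as suspicious.
FLOOR_KHZ = 500_000


def read_cur_freqs_khz() -> dict:
    """Collect the current frequency (kHz) of every CPU that exposes cpufreq."""
    freqs = {}
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_cur_freq"
    for path in glob.glob(pattern):
        cpu = path.split("/")[5]  # e.g. "cpu0"
        with open(path) as f:
            freqs[cpu] = int(f.read().strip())
    return freqs


def throttled_cpus(freqs_khz: dict, floor_khz: int = FLOOR_KHZ) -> list:
    """Return the names of CPUs running below the floor, sorted by name."""
    return sorted(cpu for cpu, khz in freqs_khz.items() if khz < floor_khz)
```

In a monitoring loop, something like `throttled_cpus(read_cur_freqs_khz())` returning a non-empty list for several consecutive samples would be the signal to raise an alert. The right floor depends on the hardware's normal minimum frequency.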

The servers are so cheap that overcommitting them by double is still significantly cheaper than using cloud hosting, which tends to have the same issue, only it is harder to monitor. Most people using cloud seem happy not to know, though, and it's been a known thing that there's a 5x variation between instances of the same size on AWS: https://www.brendangregg.com/Slides/AWSreInvent2017_performa...

jeffbee · on Feb 19, 2025

> I'd prefer my servers to crash instead of lowering frequency to 400MHz.

100% agreed. There is nothing worse than a slow server in your fleet. This behavior reeks of "pet" thinking.

formerly_proven · on Feb 19, 2025

Stuff like this just comes up from time to time once you run a four-digit (or larger) number of systems.