
> Looking back, waiting six months could have helped us avoid many issues. Early adopters usually find problems that get fixed later.

This is really good advice, and it's what I follow for all systems that need to be stable. If there aren't any security issues, I either wait a few months or stay one or two versions behind.
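A minimal sketch of that "wait a few months" policy, assuming you can get publication dates for each release (most registries expose them); the package data here is entirely hypothetical:

```python
from datetime import datetime, timedelta

def stable_versions(releases, min_age_days=90, now=None):
    """Return the versions that have been published for at least
    min_age_days, i.e. long enough for early adopters to hit the bugs.

    releases: mapping of version string -> publication datetime.
    """
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=min_age_days)
    return sorted(v for v, published in releases.items() if published <= cutoff)

# Hypothetical release history for an imaginary package.
releases = {
    "1.0.0": datetime(2024, 1, 10),
    "1.1.0": datetime(2024, 6, 1),
    "1.2.0": datetime(2024, 9, 20),  # too fresh, excluded below
}

print(stable_versions(releases, min_age_days=90, now=datetime(2024, 10, 1)))
# → ['1.0.0', '1.1.0']
```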



GitHub is looking to add this feature to dependabot: https://github.com/dependabot/dependabot-core/issues/3651


Being so deep into dependencies that you have to find more dependencies and features to make your dependency less of a clusterfuck is sad.


Are you referring to dependabot? You are free to update your dependencies manually.


In theory that works; in practice, nope. You get a random update with a possible bug inside that is only fixed by a newer version you won't get until later. The other strategy is to wait for a package to be fully stable (no recent updates), but then packages that receive daily or weekly updates never get updated at all.


It does help, because major version updates are more likely to cause breakage than minor ones, so you benefit if you wait for a few minor version updates. That is not to say minor versions can't introduce bugs.

Windows is a well-known example; people used to wait for a service pack or two before upgrading.


We could even wait for a patch version, or for the minor to be out a certain amount of time. For a major, I'd wait even longer, potentially until a second patch.
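That kind of policy can be stated mechanically. A rough sketch, assuming plain `x.y.z` semver strings; the thresholds and the function name are illustrative, not from any real tool:

```python
def ok_to_adopt(version: str, age_days: int) -> bool:
    """Heuristic: skip brand-new majors and .0 releases until they age."""
    major, minor, patch = (int(p) for p in version.split("."))
    if patch == 0:                # x.y.0: wait for the first patch release,
        return age_days >= 60     # or for the release to age a couple of months
    if minor == 0 and patch < 2:  # x.0.z: be extra cautious on a new major
        return age_days >= 120
    return age_days >= 14         # ordinary patch: short cooldown

print(ok_to_adopt("2.0.0", age_days=30))   # False: new major, no patch yet
print(ok_to_adopt("2.0.1", age_days=30))   # False: first patch of a new major
print(ok_to_adopt("1.4.2", age_days=30))   # True: seasoned patch release
```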


And then they went towards a more evergreen update strategy, causing some major outages when some releases caused issues.

I mean, evergreen releases make sense imo, as the overhead of maintaining older versions for a long time is huge, but you need canary releases, monitoring, and gradual rollout plans; for something like Windows, this should be done with a lot of care. Even a 1% rollout will affect hundreds of thousands, if not millions, of systems.
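One common way to implement that gradual rollout is deterministic hash bucketing: each machine hashes into a stable bucket, so ramping 1% → 5% → 25% only ever adds machines and never flip-flops anyone. A sketch, with hypothetical names:

```python
import hashlib

def in_rollout(machine_id: str, feature: str, percent: float) -> bool:
    """Deterministically map a machine into a bucket in [0, 100) and
    compare against the rollout percentage. The same machine always
    lands in the same bucket for a given feature."""
    digest = hashlib.sha256(f"{feature}:{machine_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000 / 100.0
    return bucket < percent

machines = [f"host-{i}" for i in range(10000)]
enrolled = sum(in_rollout(m, "update-2024-10", 1.0) for m in machines)
print(f"{enrolled} of {len(machines)} machines in the 1% canary")
```

If the canary cohort looks healthy, you raise `percent`; because the buckets are stable, the original 1% stays enrolled throughout the ramp.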


This is a wildly successful pattern in nature: the old using the young and inexperienced as enthusiastic test units.

In forests, for example, old boars give safety squeaks to send the younglings ahead into a clearing they do not trust. The tech equivalent would be writing a blog entry that hypes up a technology that is not yet production ready.


Just out of curiosity: do you have a source?


Author of the blog post here.

Yeah, this is generally a good practice. The silver lining is that our suffering helped uncover the underlying issue faster. :)

This isn’t part of the blog post, but in the future we're also considering getting the servers and keeping them idle, without actual customer workload, for about a month. This would be more expensive, but it could help identify potential issues without impacting our users. In our case, the crashes started three weeks after we deployed our first AX162 server, so we'd need at least a month (or maybe even longer) as a buffer period.


>The silver lining is that our suffering helped uncover the underlying issue faster.

Did you actually uncover the true root cause? Or did they finally uncap the power consumption without telling you, just as they neither confirmed nor denied having limited it?


The root cause was a problem with the motherboard, though the exact issue remains unknown to us. I suspect that a component on the motherboard may have been vulnerable to power limitations or fluctuations and that the newer-generation motherboards included additional protection against this. However, this is purely my speculation.

I don't believe they simply lifted a power cap (if there was one in the first place). I genuinely think the fix came with the motherboard replacements. We had two batches of motherboard replacements, and after that the issue disappeared.

If someone from Hetzner is here, maybe they can give extra information.


Hetzner is currently replacing motherboards of their dedicated servers [1], but I don't know if that's the same issue that was mentioned in the article.

[1] https://status.hetzner.com/incident/7fae9cca-b38c-4154-8a27-...


That's the same issue, yes.


Customers are the best QA. And they pay you too, instead of the reverse!


I'm pretty sure they pay for QA. QA cannot always catch every possible bug.


These crashes should have been caught easily.


Were you able to identify the manufacturer and model/revision of the failing motherboards? This would be extremely helpful when shopping for second-hand servers.


I cannot find the link now, but it was mentioned that it was ASRock mobos.


Thanks. This comment above does mention ASRock: https://news.ycombinator.com/item?id=43112594

On the other hand, dmidecode output in the article shows:

Manufacturer: Dell Inc.
Product Name: 0H3K7P


It varies by system. As the legendary (to some) Kelly Johnson of the Skunk Works had as one of his main rules:

> The inspection system as currently used by the Skunk Works, which has been approved by both the Air Force and the Navy, meets the intent of existing military requirements and should be used on new projects. Push more basic inspection responsibility back to the subcontractors and vendors. Don't duplicate so much inspection.

But this will be the first and last time Ubicloud does not burn in a new model, or even new tranches of purchases (I also work there... and am a founder).



