Quick takes on the GCP public incident write-up (surfingcomplexity.blog)
8 points by todsacerdoti 8 months ago | 1 comment


The author raised some great questions! But as he admits, they are unlikely to be answered in public, which is why I usually find public retrospectives a bit underwhelming.

Service Control did not have the appropriate randomized exponential backoff implemented to avoid this. It took up to ~2h 40 mins to fully resolve in us-central-1 as we throttled task creation to minimize the impact on the underlying infrastructure and routed traffic to multi-regional databases to reduce the load.

If nothing else, this section of the incident report reminded me of my favorite distributed systems paper: Metastable Failures in Distributed Systems. You should definitely check it out if you haven't already:

https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s...
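
For anyone who hasn't run into the "randomized exponential backoff" mentioned in the quoted passage, here is a minimal Python sketch of the general idea, using the common "full jitter" variant. To be clear, this is not Service Control's code or anything from the incident report; `jittered_delay`, `call_with_retries`, and the `base`/`cap`/`retries` values are names and defaults I made up for illustration.

```python
import random
import time


def jittered_delay(attempt, base=0.1, cap=30.0, rng=random):
    """Return a random delay in seconds for the given retry attempt (0-based).

    The upper bound doubles with each attempt and is clamped at `cap`.
    Drawing uniformly from [0, bound] spreads retries from many clients
    over time so they do not all hit the backend at the same instant.
    """
    bound = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, bound)


def call_with_retries(op, retries=5, sleep=time.sleep, base=0.1, cap=30.0):
    """Call `op`, retrying up to `retries` times with jittered backoff.

    Hypothetical helper: a real client would only retry errors known to
    be transient, and would usually also bound the total time spent.
    """
    for attempt in range(retries + 1):
        try:
            return op()
        except Exception:
            if attempt == retries:
                raise  # out of retries, surface the last error
            sleep(jittered_delay(attempt, base=base, cap=cap))
```

The randomization is the part that matters for the herd effect: with plain exponential backoff, clients that failed together also retry together, so the load arrives in synchronized spikes instead of being smoothed out.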



