This feels very apropos of a recent banking outage we had here in Singapore. On Oct 14, I was out buying groceries and the NTUC FairPrice supermarket cashier asked me a very strange question: "Which bank is your card from?" Expecting the usual "would you like this product on offer" type of question, I didn't even register it for a second.
Turns out there was a major outage at a data centre that served DBS, one of the largest lenders, as well as Citi [1].
The disruption was attributed to a cooling system failure at a data centre operated by Equinix [2]. Further digging showed that the culprit was the SG3 data centre [3], marketed as their largest IBX data centre in the Asia Pacific region and one of the newest. It turned out that the cooling system was being upgraded under contract by an external vendor, who applied incorrect settings that brought it down.
Further, this particular data centre has 2N electrical redundancy but only N+1 redundancy for its chillers. Compare that with other financial services organizations such as the Singapore Exchange (SGX), which offers [4] CoLo hosting with 2N chillers, something I believe is essential in warm equatorial climates.
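To make the N+1 versus 2N difference concrete, here is a rough sketch of how much of the required cooling load is still covered as chillers drop out. The value N = 4 and the function name are made up for illustration, not SG3's actual figures, and the count ignores common-mode faults (such as a bad setting applied across all units), which no amount of spare units protects against.

```python
def surviving_capacity(installed: int, required: int, failed: int) -> float:
    """Fraction of the required cooling load still covered after `failed` units drop out."""
    working = max(installed - failed, 0)
    return min(working / required, 1.0)


N = 4  # hypothetical number of chillers needed to carry the full load

for failed in range(0, 3):
    n_plus_1 = surviving_capacity(N + 1, N, failed)  # N+1: one spare unit
    two_n = surviving_capacity(2 * N, N, failed)     # 2N: a full second set
    print(f"{failed} failed: N+1 covers {n_plus_1:.0%}, 2N covers {two_n:.0%}")
```

Under these assumptions N+1 rides through a single chiller failure but falls short at the second, while 2N keeps covering the full load until more than N units are lost.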
Sadly, this outage was followed by yet another, smaller, payment-related outage the following week, making it the fifth outage this year. DBS had been trumpeting their move to the cloud [5] as part of their grand plan to transform themselves from a bank into a software company that also offers banking services [6][7].
In going all out with this questionable and misguided transformation, they've lost focus on what made people trust them in the first place: decades of solid, reliable, transparent and efficient banking services.
There were questions about why their backup data centre didn't kick in, and there are no answers to date.
It's clear that the recovery mechanisms weren't tested, degraded-performance modes were either not implemented or not tested, disaster resilience utterly failed, and there were no working mitigations for cooling system failures or DC failures. As a result, ATMs and other services were down from 3pm on Oct 14 until the following morning. For a country that prides itself on digital transformation, this is just the latest banking systems failure, which makes it far more than egg on the face: it's an erosion of trust.
Items (2), (3), (7), (8), (9) from the Google report directly apply to this failure. I can only hope something good comes out of it and lessons are learnt.
[1] https://www.channelnewsasia.com/singapore/dbs-citibank-outag...
[2] https://www.zdnet.com/article/equinixs-data-center-system-up...
[3] https://www.equinix.com/data-centers/asia-pacific-colocation...
[4] https://www.sgx.com/data-connectivity/co-location
[5] https://www.dbs.com/newsroom/First_bank_in_Singapore_to_laun...
[6] https://bankinginnovation.qorusglobal.com/content/articles/t...
[7] https://www.dbs.com/technology-future/dbs-redefining-the-fut...