Cenlar Network Outage – 3/9 and 3/10
Major Incident Report (RCA)
March 14, 2022
Background

Cenlar is taking a phased approach to migrating systems from our on-premises 425 Data Center to our new hosting presence in the Microsoft Azure Cloud. Some production systems remain in the 425 Data Center and are planned for migration to Azure later this year. Starting on Wednesday, March 9, we experienced two concurrent, but unrelated, Severity 1 incidents that impacted our services at the 425 Data Center. For ease of reference, the first incident will be referred to as the Data Center Incident and the second as the Internet Circuit Incident.
The Data Center Incident interrupted system authentication services for applications hosted in the 425 Data Center, and the Internet Circuit Incident interrupted internet access to and from the 425 Data Center. This Incident Report details the specifics of each, as well as the root causes.
Overall Incident Timeline

Outage Start Times:
03/09/2022, 8:50 a.m. ET (Data Center Incident)
03/09/2022, 5:00 p.m. ET (Internet Circuit Incident)

Resolution Times:
03/10/2022, 4:00 a.m. ET (Data Center Incident)
03/11/2022, 12:00 a.m. ET (Internet Circuit Incident)

Situation

On 03/09/2022 at 8:50 a.m., employees began reporting service interruptions when using Microsoft Outlook. Within minutes, the Service Desk received calls from employees reporting latency and other anomalous behavior in Outlook and an inability to access some systems. The IT team established a technical conference bridge to coordinate diagnostic and troubleshooting efforts. Monitoring tools indicated that systems were available and healthy, yet employees continued to report persistent access failures and related issues for a subset of systems. External monitoring of perimeter protection systems did not indicate any unusual traffic patterns or cybersecurity attack. Business continuity plans were invoked and the Crisis Response Team was deployed.
A pattern emerged indicating that systems hosted in the 425 Data Center were disproportionately impacted, even as monitoring tools continued to indicate those systems were available. Eventually, the monitoring tools themselves were impacted by the outage. Externally hosted systems, including those hosted in Azure (e.g., Citrix, VPN, and several business applications), remained available, and remote employees continued to access them throughout the incident. Core operations teams continued to process payments and remittances without issue. The Call Center was limited in its ability to maintain systems access and, by 1:00 p.m., the Call Centers were closed. However, the Interactive Voice Response (IVR) system and Chatbot remained available to service borrowers. Self-service tools hosted in CenNet, the borrower website, remained available for all borrowers except those who authenticate through client Single Sign-On (SSO). Client-facing systems (e.g., CenAccess, Global Teller, APIs) were unavailable during the incident. Email and messaging services were also down, complicating internal and external communications.
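For illustration of this monitoring gap: a host can accept connections and report healthy while an authentication-dependent application path fails. The sketch below contrasts the two kinds of checks. It is a simplified illustration using hypothetical hostnames and endpoints, not a description of Cenlar's monitoring tooling.

    # Minimal sketch: why "host is up" monitoring can miss an authentication
    # outage. Hostname and login URL are hypothetical placeholders.
    import socket
    import urllib.error
    import urllib.request

    def tcp_alive(host: str, port: int, timeout: float = 3.0) -> bool:
        """Host-level check: does the server accept a TCP connection?"""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def app_probe(url: str, timeout: float = 5.0) -> bool:
        """End-to-end check: does the application actually answer a request?

        A server can pass tcp_alive() yet fail here if a dependency it
        needs (e.g., the domain controllers it authenticates against)
        is down.
        """
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    if __name__ == "__main__":
        host = "app.example.internal"            # hypothetical
        url = "https://app.example.internal/login"  # hypothetical
        print(f"TCP reachable:     {tcp_alive(host, 443)}")
        print(f"Login page serves: {app_probe(url)}")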
Throughout the day, IT teams recycled locally hosted services, isolated and tested infrastructure components, and executed numerous other tactical actions to pinpoint and correct the problem(s). The Crisis Response Team was kept apprised of troubleshooting efforts and results, and it managed ongoing stakeholder communications and the operational response. The team elected to forgo a full disaster recovery response because the majority of systems were still available and executing the complete DR plan would have interrupted other services and pulled attention away from local troubleshooting efforts.
On 3/9, we received notification that the circuit connecting the 425 Data Center to the internet had gone down at 5:00 p.m. This marked the start of the Internet Circuit Incident.
Since several servers were unable to connect to their storage drives, the team began to isolate the potential root cause(s) to either our Storage Array System or the storage networking switch that physically connects the storage devices to our servers. Vendors were engaged and troubleshooting activities continued throughout the evening.
The team then turned its attention to the added complication of the Internet Circuit Incident. The team engaged our third-party network services provider and learned that the circuit interruption resulted from a serious traffic accident near the 425 Data Center. The accident brought down utility poles, completely severing a fiber cable and cutting off power and internet service over an extended geographic area. Because the cut occurred within the so-called "last mile" between the closest network Point of Presence (PoP) and the demarcation point at the 425 Data Center, the network services provider was unable to route around the fiber cut to re-establish internet access. Since resolution of the Internet Circuit Incident rested solely with the network services provider and the circuit owner, we refocused our attention on the Data Center Incident.
At 3:15 a.m. on 3/10, the IT team isolated the Data Center Incident to one node of a pair of management controllers in the Storage Array configured for high availability. The team made the necessary changes to isolate the faulty management controller from the pair and was subsequently able to bring up the locally hosted services that had been unable to connect to their storage (including the domain servers). By 4:00 a.m., the domain controllers were once again able to authenticate users, restoring systems access, and the Data Center Incident was resolved. However, any systems dependent on internet access through the downed circuit remained impacted. For example, internal email access was restored, but external email remained down due to the fiber cut.
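The controller isolation followed the general fencing pattern for a high-availability pair: identify the failing node and remove it so the healthy node serves alone. The sketch below illustrates that pattern only; the node names, check_health, and fence_node are hypothetical stand-ins for the storage vendor's actual diagnostic and isolation commands.

    # Illustrative sketch of fencing a faulty node in a high-availability
    # pair. Node names and states are simulated; check_health() and
    # fence_node() stand in for vendor-specific commands.
    NODE_STATE = {"controller-a": "healthy", "controller-b": "failing"}

    def check_health(node: str) -> bool:
        """Stand-in for the vendor's controller diagnostics."""
        return NODE_STATE[node] == "healthy"

    def fence_node(node: str) -> None:
        """Stand-in for the vendor command that isolates a controller."""
        NODE_STATE[node] = "fenced"
        print(f"{node} fenced and removed from the HA pair")

    def recover_pair(nodes: list) -> str:
        """Fence any failing controller so the healthy one serves alone.

        A misbehaving node in an HA pair can degrade the whole array until
        it is explicitly isolated; restarting the services that sit on top
        of it does not help, consistent with what was observed on 3/9.
        """
        healthy = [n for n in nodes if check_health(n)]
        if len(healthy) == len(nodes):
            return "pair healthy; no action needed"
        if not healthy:
            raise RuntimeError("both controllers failing; escalate to vendor")
        for n in nodes:
            if n not in healthy:
                fence_node(n)
        return f"serving from {healthy[0]} only until the fenced node is replaced"

    print(recover_pair(["controller-a", "controller-b"]))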
The incident team continued to monitor the Internet Circuit Incident through the network services provider, but the circuit owner was unable to access the site of the fiber cut for 27 hours due to extended fire, medical, and police department activity. This was followed by the need to restore electrical power to the impacted area. The circuit owner was finally able to begin physical cable restoration at 8:00 p.m. on 3/10. At 9:45 p.m., our incident team noted that internet traffic was once again flowing through the circuit. However, the circuit owner continued full remediation efforts until approximately midnight on 3/11, after which the incident team confirmed the availability and full functionality of all internet-dependent systems.
Root Causes

The initial root cause of the Data Center Incident appears to be a malfunctioning management controller within the Storage Array System. Subsequent testing should validate this assumption.
The root cause of the Internet Circuit Incident was the traffic accident that resulted in the fiber cable cut, interrupting internet service to the 425 Data Center.
Contributing Factors

Troubleshooting efforts for the Data Center Incident were significantly hampered because the incident also impaired our monitoring and administration consoles. In addition, time was lost sending engineers and operators to the 425 Data Center for on-site monitoring and troubleshooting.
The Internet Circuit Incident was significantly prolonged by extended fire, medical, and police activity at the accident site, followed by a lengthy power restoration effort. Local authorities prioritized these efforts ahead of the fiber cable restoration. Once the circuit owner was permitted access to the site, remediation occurred quickly.
Future Outage Prevention

We have made significant investments to date in deploying a new Wide Area Network in anticipation of our full migration from the 425 Data Center to the Azure Cloud. The 425 internet circuit is our last remaining circuit without full redundancy and is slated for decommissioning upon completion of the Cloud migration. Because the fiber cut occurred within the last mile to our data center demarcation point, our network services provider was unable to re-route around it.
The 425 Data Center design did not account for a fully redundant secondary internet circuit. Our new Software-Defined Wide Area Network (SD-WAN) and Azure data center design provides every Cenlar site with dual active/active WAN connections over physically diverse paths from the on-site demarcation point to two different telco Points of Presence (PoPs).
Under this design, an outage like the Internet Circuit Incident would not impact service. The new WAN also includes four connections to the Azure primary and secondary environments (two High Availability connections in the East and two DR connections with High Availability in the West). In addition, all partners/vendors are being configured for both East and West as their systems/services are migrated to Azure. All Cenlar sites utilize this new design today, with the exception of the remaining data center components that will be addressed as part of their scheduled Azure migration.
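Dual active/active paths protect against a last-mile cut only while both paths remain healthy, so each path should be probed independently. The sketch below is a simplified illustration using hypothetical gateway addresses; it is not a description of our actual SD-WAN tooling.

    # Minimal sketch: independently probing both WAN paths of an
    # active/active pair. Gateway addresses are hypothetical placeholders.
    import socket

    # One gateway per physically diverse path to its telco PoP (assumed).
    WAN_GATEWAYS = {"path-a": "198.51.100.1", "path-b": "203.0.113.1"}

    def path_up(gateway: str, port: int = 443, timeout: float = 3.0) -> bool:
        """Crude reachability check: can we open a TCP connection via this path?"""
        try:
            with socket.create_connection((gateway, port), timeout=timeout):
                return True
        except OSError:
            return False

    def redundancy_status() -> str:
        """Redundancy exists only while BOTH paths are healthy; alert early."""
        up = {name for name, gw in WAN_GATEWAYS.items() if path_up(gw)}
        if len(up) == len(WAN_GATEWAYS):
            return "OK: both paths up, redundancy intact"
        if up:
            return f"DEGRADED: only {', '.join(sorted(up))} up; single point of failure"
        return "OUTAGE: no WAN path available"

    if __name__ == "__main__":
        print(redundancy_status())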
During the Internet Circuit Incident, the team considered forcing traffic over to the new SD-WAN internet circuits, but doing so would have required configuration changes both locally and by all of our clients and vendors. Those changes would have taken considerable time and coordination and distracted significantly from troubleshooting efforts. In addition, the changes would have needed time to propagate before external stakeholders could reach inbound systems. The team decided the fastest path to resolution was still to rely on the internet vendors, based on the expectation that the circuit owner would be able to access the site of the fiber cut much sooner than the 27 hours it actually took. 27 hours is an inordinately long time for local authorities to prevent a circuit owner from accessing a downed line.
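The propagation delay noted above is driven largely by DNS caching: external parties continue resolving the old addresses until the records' time-to-live (TTL) expires. The sketch below illustrates how that window can be estimated; it uses the third-party dnspython library and a hypothetical hostname, not an actual Cenlar endpoint.

    # Minimal sketch: estimating how long an emergency cutover would take
    # to propagate, by inspecting DNS record TTLs. Requires the dnspython
    # library (pip install dnspython); the hostname is a placeholder.
    import dns.resolver

    def cutover_propagation_window(hostname: str) -> int:
        """Worst-case seconds before all resolvers see a changed A record.

        Resolvers may cache the current answer for up to its remaining TTL,
        so a repointed record is not fully visible until that TTL runs out.
        """
        answer = dns.resolver.resolve(hostname, "A")
        return answer.rrset.ttl

    if __name__ == "__main__":
        host = "transfers.example.com"  # hypothetical client-facing endpoint
        ttl = cutover_propagation_window(host)
        print(f"{host}: clients may use cached addresses for up to {ttl}s")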
Work to Date

We have been actively working to migrate all Cenlar-hosted systems to a highly available, resilient, and robust data center. For more than a year, we have been carefully migrating systems to the Microsoft Azure Cloud and deploying a resilient Software-Defined Wide Area Network (SD-WAN) that includes internet and telephony circuits. We have completed seven of the ten planned migration waves.
Migration of our non-production systems was completed last year, and some of our production systems were migrated last month. The remaining production systems slated for Azure hosting are scheduled to be migrated through the second half of this year. Once that is complete, we will retire the 425 Data Center and have world-class availability and recovery capabilities.
Some of this Azure Cloud functionality was already available and was utilized during the outage to redirect processing away from the 425 Data Center. Workarounds built on it, such as employee remote access and file transmissions to vendors, enabled the continuity of overnight processing and the subsequent delivery of files and data extracts to internal stakeholders, and eventually to clients once the internet circuit was restored.
Next Steps

1) Further testing and troubleshooting with the storage vendor is scheduled to validate the assumed root cause of the Data Center Incident and determine further corrective actions.
2) An after-action review is scheduled for the week of 3/14 to assess the incident response and business continuity planning results for further enhancement.
3) Systems hosted in the 425 Data Center will continue to be migrated to our state-of-the-art Microsoft Azure cloud hosting environment, supported by the fully redundant SD-WAN. We expect to complete the migration and full SD-WAN utilization in the second half of 2022.
4) We will notify you if any of these reviews identify new information that was not included in this report.