The post Learn from these Microsoft’s Azure outage postmortem takeaways appeared first on The Troposphere.


Microsoft shed more light on last week’s major Azure outage in reports that generally confirm what everyone already knew: a storm near the Azure South Central US region knocked cooling systems offline and shut down systems that took days to recover because of issues with the cloud platform’s architecture.

But the reports also illuminate the scope of systems damage, the infrastructure dependencies that crippled the systems, and plans to increase resiliency for customers.

What we know now

The storm damaged hardware. Multiple voltage surges and sags in the utility power supply caused part of the data center to transfer to generator power, and knocked the cooling system offline despite the existence of surge protectors, according to Microsoft’s overall root-cause analysis (RCA). A thermal buffer in the cooling system eventually depleted and temperatures quickly rose, which triggered the automated systems shutdown.

But that shutdown wasn’t soon enough. “A significant number of storage servers were damaged, as well as a small number of network devices and power units,” according to the company.

Microsoft will now look for more environmentally resilient storage hardware designs, and try to improve its software to help automate and accelerate recovery efforts.

Microsoft wants more zone redundancy. Earlier this year Microsoft introduced Azure Availability Zones, defined as one or more physical data centers in a region with independent power, cooling and networking. AWS and Google already broadly offer these zones, and Azure provides zone-redundant storage in some regions, but not in South Central US.

For Visual Studio Team Services (VSTS), this was the worst outage in its seven-year history, according to the team’s postmortem, written by Buck Hodges, VSTS director of engineering. Ten regions worldwide, including the affected one, host VSTS customers, and many of those regions don’t have availability zones. Going forward, Microsoft will enable VSTS to use availability zones and move to regions that support them, though the service won’t move out of geographic regions where customers have specific data sovereignty requirements.

Service dependencies hurt everyone. Various Azure infrastructure and systems dependencies harmed services outside the region and slowed recovery efforts:

  • The Azure South Central US region is the primary site for Azure Service Manager (ASM), which customers typically use for classic resource types. ASM does not support automatic failover, so ASM requests everywhere experienced higher latencies and failures.
  • Authentication traffic from Azure Active Directory automatically routed to other regions, which triggered throttling mechanisms and created latency and timeouts for customers in those regions.
  • Many Azure regions depend on services in VSTS, which led to slowdowns and inaccessibility for several related services.
  • Dependencies on Azure Active Directory and platform services affected Application Insights, according to the group’s postmortem.
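
The throttling effect described in the list above can be illustrated with a token-bucket limiter. This is a minimal sketch, not Azure Active Directory's actual mechanism, which the postmortems do not describe; the class, function names, and traffic figures are all assumptions chosen for illustration. It shows how a region sized for its normal load starts rejecting requests once rerouted traffic arrives on top of it.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    """Token-bucket throttle (hypothetical; not Azure AD's real design)."""
    capacity: float      # maximum burst size, in requests
    refill_rate: float   # tokens added per second
    tokens: float = 0.0

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # start with a full bucket

    def tick(self, seconds: float) -> None:
        """Advance time and refill, never exceeding capacity."""
        self.tokens = min(self.capacity, self.tokens + self.refill_rate * seconds)

    def allow(self) -> bool:
        """Admit one request if a token is available, otherwise throttle it."""
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def simulate(requests_per_second: int, seconds: int, bucket: TokenBucket) -> int:
    """Return how many requests were throttled over the simulated period."""
    throttled = 0
    for _ in range(seconds):
        bucket.tick(1.0)
        for _ in range(requests_per_second):
            if not bucket.allow():
                throttled += 1
    return throttled


# Assumed numbers: a region provisioned for 100 requests per second.
normal = simulate(80, 10, TokenBucket(capacity=100, refill_rate=100))
surge = simulate(150, 10, TokenBucket(capacity=100, refill_rate=100))
print(f"throttled under normal load: {normal}, with rerouted traffic: {surge}")
```

In this sketch the healthy region serves its own users without rejections, but the same limiter turns away a share of every second's requests once another region's traffic is added, which is how customers far from the storm can see timeouts.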

Microsoft will review these ASM dependencies, and determine how to migrate services to Azure Resource Manager APIs.

Time to rethink replication options? The VSTS team further explained failover options: wait for recovery, or access data from a read-only backup copy. The latter option would cause latency and data loss, and users of services such as Git, Team Foundation Version Control and Build would still be unable to check in, save or deploy code.

Synchronous replication ideally prevents data loss in failovers, but in practice it’s hard to do. All services involved must be ready to commit data and respond at any point in time, and that’s not possible, the company said.
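
To see why that requirement is so demanding, consider a minimal sketch of a synchronous write path. The `Replica` class and `synchronous_write` function are hypothetical and stand in for whatever storage services are involved; nothing here reflects VSTS internals.

```python
from typing import List


class Replica:
    """In-memory stand-in for a storage replica in one region (hypothetical)."""

    def __init__(self, name: str, available: bool = True) -> None:
        self.name = name
        self.available = available
        self.log: List[str] = []

    def prepare(self) -> bool:
        """Report whether this replica can accept a write right now."""
        return self.available

    def commit(self, record: str) -> None:
        """Durably apply the record (here, just append it to a list)."""
        self.log.append(record)


def synchronous_write(record: str, replicas: List[Replica]) -> bool:
    """Commit only if every replica is ready; otherwise reject the write.

    No replica ever falls behind, so a failover loses no data, but a single
    unavailable replica blocks writes for everyone.
    """
    if not all(replica.prepare() for replica in replicas):
        return False
    for replica in replicas:
        replica.commit(record)
    return True


primary = Replica("primary")
secondary = Replica("secondary")
print(synchronous_write("commit-1", [primary, secondary]))  # both replicas ready

secondary.available = False  # simulate the secondary region becoming unreachable
print(synchronous_write("commit-2", [primary, secondary]))  # write is rejected
```

The trade-off shows up in the second call: the copies stay identical, but only because the write is refused while one region is unreachable.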

Lessons learned? Microsoft said it will reexamine asynchronous replication, and explore active geo-replication for Azure SQL and Azure Storage to asynchronously write data to primary and secondary regions and keep a copy ready for failover.
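
As a rough illustration of the asynchronous approach, the sketch below acknowledges writes at the primary immediately and ships them to the secondary later. The `AsyncReplicatedStore` class is an assumption made for this example and is not the Azure SQL or Azure Storage API; it only shows where the data-loss window in a failover comes from.

```python
from collections import deque
from typing import Deque, List


class AsyncReplicatedStore:
    """Toy model of asynchronous replication to a secondary region (hypothetical)."""

    def __init__(self) -> None:
        self.primary: List[str] = []
        self.secondary: List[str] = []
        self._pending: Deque[str] = deque()  # written at primary, not yet shipped

    def write(self, record: str) -> None:
        """Acknowledge as soon as the primary has the record."""
        self.primary.append(record)
        self._pending.append(record)

    def replicate(self, max_records: int) -> int:
        """Ship up to max_records pending writes to the secondary, in order."""
        shipped = 0
        while self._pending and shipped < max_records:
            self.secondary.append(self._pending.popleft())
            shipped += 1
        return shipped

    def fail_over(self) -> List[str]:
        """Promote the secondary and return the writes it never received."""
        missing = list(self._pending)
        self._pending.clear()
        self.primary = list(self.secondary)
        return missing


store = AsyncReplicatedStore()
for record in ["commit-1", "commit-2", "commit-3"]:
    store.write(record)
store.replicate(max_records=2)   # replication lags one write behind
missing = store.fail_over()
print("writes absent from the promoted copy:", missing)
```

Writes never wait on the remote region, so the primary stays fast, but anything still in the pending queue at failover time is missing from the promoted copy until it can be reconciled.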

The VSTS team also will explore how to let customers choose a recovery method based on whether they prioritize faster recovery and productivity over potential loss of data. The system would indicate whether the secondary copy is up to date, and data would be reconciled manually once the primary data center is back up and running.
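
One way such a customer choice might be expressed is a small policy function. This is a sketch under stated assumptions: the `Priority` and `Action` names and the decision rules are invented for illustration, and the VSTS team has not published how the feature would work.

```python
from enum import Enum


class Priority(Enum):
    """What the customer cares about most during a regional outage (hypothetical)."""
    FAST_RECOVERY = "fast_recovery"
    NO_DATA_LOSS = "no_data_loss"


class Action(Enum):
    FAIL_OVER = "fail_over"
    WAIT_FOR_PRIMARY = "wait_for_primary"


def choose_recovery(priority: Priority, secondary_up_to_date: bool) -> Action:
    """Pick a recovery action from the customer's priority and replica status."""
    if secondary_up_to_date:
        # Nothing would be lost, so failing over is safe for either priority.
        return Action.FAIL_OVER
    if priority is Priority.FAST_RECOVERY:
        # Accept possible data loss now; reconcile after the primary returns.
        return Action.FAIL_OVER
    # Protect the data and stay offline until the primary recovers.
    return Action.WAIT_FOR_PRIMARY


print(choose_recovery(Priority.NO_DATA_LOSS, secondary_up_to_date=False).value)
print(choose_recovery(Priority.FAST_RECOVERY, secondary_up_to_date=False).value)
```

The up-to-date indicator matters because it makes the decision trivial in the best case: if the secondary has everything, even a customer who cannot tolerate data loss can fail over.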


Read full article on Cloud Computing from IT knowledge exchange


