The post Learn from these Microsoft’s Azure outage postmortem takeaways appeared first on The Troposphere.


Microsoft shed more light on last week’s major Azure outage in reports that generally confirm what everyone already knew: a storm near the Azure South Central US region knocked cooling systems offline and shut down systems, which then took days to recover because of issues with the cloud platform’s architecture.

But the reports also illuminate the scope of systems damage, the infrastructure dependencies that crippled the systems, and plans to increase resiliency for customers.

What we know now

The storm damaged hardware. Multiple voltage surges and sags in the utility power supply caused part of the data center to transfer to generator power, and knocked the cooling system offline despite the existence of surge protectors, according to Microsoft’s overall root-cause analysis (RCA). A thermal buffer in the cooling system eventually depleted and temperatures quickly rose, which triggered the automated systems shutdown.

But that shutdown wasn’t soon enough. “A significant number of storage servers were damaged, as well as a small number of network devices and power units,” according to the company.

Microsoft will now look for more environmentally resilient storage hardware designs, and try to improve its software to help automate and accelerate recovery efforts.

Microsoft wants more zone redundancy. Earlier this year Microsoft introduced Azure Availability Zones, defined as one or more physical data centers in a region with independent power, cooling and networking. AWS and Google already broadly offer these zones, and Azure provides zone-redundant storage in some regions, but not in South Central US.

For Visual Studio Team Services (VSTS), this was the worst outage in its seven-year history, according to the team’s postmortem, written by Buck Hodges, VSTS director of engineering. Ten regions worldwide, including the affected one, host VSTS customers, and many of those regions don’t have availability zones. Going forward, Microsoft will enable VSTS to use availability zones and move to regions that support them, though the service won’t move out of geographic regions where customers have specific data sovereignty requirements.

Service dependencies hurt everyone. Various Azure infrastructure and systems dependencies harmed services outside the region and slowed recovery efforts:

  • The Azure South Central US region is the primary site for Azure Service Manager (ASM), which customers typically use for classic resource types. ASM does not support automatic failover, so ASM requests everywhere experienced higher latencies and failures.
  • Authentication traffic from Azure Active Directory automatically rerouted to other regions, which triggered throttling mechanisms and created latency and timeouts for customers in those regions.
  • Many Azure regions depend on services in VSTS, which led to slowdowns and inaccessibility for several related services.
  • Dependencies on Azure Active Directory and platform services affected Application Insights, according to the group’s postmortem.

Microsoft will review these ASM dependencies, and determine how to migrate services to Azure Resource Manager APIs.
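The Azure Active Directory item above shows how a failover can spread an outage: a region sized for its own traffic starts rejecting requests once it also absorbs a neighbor’s load. The Python sketch below illustrates that effect with a simple token-bucket limiter. Microsoft has not described its throttling mechanism, so the token bucket, the class and function names, and all of the numbers are assumptions chosen purely for illustration.

```python
class TokenBucket:
    """Hypothetical rate limiter; not Azure's actual throttling mechanism."""

    def __init__(self, capacity, refill_per_tick):
        self.capacity = capacity
        self.refill_per_tick = refill_per_tick
        self.tokens = capacity

    def tick(self):
        # Add tokens once per time step, never exceeding the bucket's capacity.
        self.tokens = min(self.capacity, self.tokens + self.refill_per_tick)

    def allow(self):
        # Spend one token per request; reject the request when none are left.
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def simulate(requests_per_tick, ticks, bucket):
    """Return how many requests the limiter rejects over the given ticks."""
    rejected = 0
    for _ in range(ticks):
        bucket.tick()
        for _ in range(requests_per_tick):
            if not bucket.allow():
                rejected += 1
    return rejected


# Made-up load figures: 80 local requests per tick, plus 70 rerouted ones.
normal = simulate(80, 10, TokenBucket(capacity=100, refill_per_tick=100))
rerouted = simulate(80 + 70, 10, TokenBucket(capacity=100, refill_per_tick=100))
print(normal, rerouted)
```

With these illustrative numbers, the healthy region rejects nothing under its own load but starts turning away requests as soon as the rerouted traffic pushes it past its limit.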

Time to rethink replication options? The VSTS team further explained failover options: wait for recovery, or access data from a read-only backup copy. The latter option would cause latency and data loss, and because the copy is read-only, users of services such as Git, Team Foundation Version Control and Build would be unable to check in, save or deploy code.

Synchronous replication ideally prevents data loss in failovers, but in practice it’s hard to do. All services involved must be ready to commit data and respond at any point in time, and that’s not possible, the company said.

Lessons learned? Microsoft said it will reexamine asynchronous replication, and explore active geo-replication for Azure SQL and Azure Storage to asynchronously write data to primary and secondary regions and keep a copy ready for failover.
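The trade-off between the two replication modes can be sketched in a few lines of Python. This is a toy in-memory model, not Azure SQL or Azure Storage code: the `Replica` class and the `write_sync`, `write_async` and `ship_pending` functions are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical in-memory stand-in for a regional data store."""
    name: str
    available: bool = True
    log: list = field(default_factory=list)


def write_sync(primary, secondary, record):
    """Synchronous: commit only if BOTH replicas can accept the write."""
    if not (primary.available and secondary.available):
        return False  # write rejected: nothing is lost, but nothing is saved
    primary.log.append(record)
    secondary.log.append(record)
    return True


def write_async(primary, pending, record):
    """Asynchronous: commit on the primary and queue the copy for later."""
    if not primary.available:
        return False
    primary.log.append(record)
    pending.append(record)  # delivered to the secondary by ship_pending()
    return True


def ship_pending(secondary, pending):
    """Drain queued records into the secondary while it is reachable."""
    while pending and secondary.available:
        secondary.log.append(pending.pop(0))


primary = Replica("primary")
secondary = Replica("secondary")
pending = []

write_async(primary, pending, "commit-1")
ship_pending(secondary, pending)           # commit-1 reaches the secondary
write_async(primary, pending, "commit-2")  # acknowledged, not yet shipped

primary.available = False                  # the primary region goes down

# Records the primary acknowledged that the secondary never received.
lost_on_failover = [r for r in primary.log if r not in secondary.log]
print(lost_on_failover)
```

In this model the asynchronous path keeps accepting writes quickly but leaves a window in which acknowledged data exists only on the primary. The synchronous path closes that window by refusing any write that both replicas cannot accept, which is the availability cost the VSTS team described.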

The VSTS team also will explore how to let customers choose a recovery method based on whether they prioritize faster recovery and productivity over potential loss of data. The system would indicate whether the secondary copy is up to date, and any differences would be reconciled manually once the primary data center is back up and running.
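A customer-selectable recovery policy of this kind might look something like the following Python sketch. It is only a guess at the shape of such a feature, which Microsoft has not published: the enum values, the `SecondaryStatus` fields and the `choose_recovery` function are all hypothetical, and the sequence numbers stand in for whatever replication metadata a real service would track.

```python
from dataclasses import dataclass
from enum import Enum


class Preference(Enum):
    FAST_RECOVERY = "fast_recovery"  # accept possible data loss to resume work
    NO_DATA_LOSS = "no_data_loss"    # wait for the primary rather than lose writes


class Action(Enum):
    FAIL_OVER = "fail_over"
    WAIT_FOR_PRIMARY = "wait_for_primary"


@dataclass
class SecondaryStatus:
    """Hypothetical replication metadata for the secondary copy."""
    last_replicated_seq: int  # newest write present on the secondary
    last_primary_seq: int     # newest write known to be committed on the primary

    @property
    def up_to_date(self):
        return self.last_replicated_seq >= self.last_primary_seq

    @property
    def writes_at_risk(self):
        # Writes that would need manual reconciliation after a failover.
        return max(0, self.last_primary_seq - self.last_replicated_seq)


def choose_recovery(status, preference):
    """Pick a recovery action from replica freshness and customer preference."""
    if status.up_to_date:
        return Action.FAIL_OVER  # nothing to lose, so failing over is safe
    if preference is Preference.FAST_RECOVERY:
        return Action.FAIL_OVER  # customer accepts losing the lagging writes
    return Action.WAIT_FOR_PRIMARY


lagging = SecondaryStatus(last_replicated_seq=97, last_primary_seq=100)
print(choose_recovery(lagging, Preference.FAST_RECOVERY), lagging.writes_at_risk)
print(choose_recovery(lagging, Preference.NO_DATA_LOSS))
```

The point of the design is that the decision depends on two inputs: what the system knows about the secondary copy, and how much data loss the customer is willing to accept in exchange for getting back to work.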



Read full article on Cloud Computing from IT knowledge exchange


