The post Learn from these Microsoft’s Azure outage postmortem takeaways appeared first on The Troposphere.


Microsoft shed more light on last week’s major Azure outage in reports that generally confirm what everyone already knew: a storm near the Azure South Central US region knocked cooling systems offline and shut down systems that took days to recover because of issues with the cloud platform’s architecture.

But the reports also illuminate the scope of systems damage, the infrastructure dependencies that crippled the systems, and plans to increase resiliency for customers.

What we know now

The storm damaged hardware. Multiple voltage surges and sags in the utility power supply caused part of the data center to transfer to generator power, and knocked the cooling system offline despite the existence of surge protectors, according to Microsoft’s overall root-cause analysis (RCA). A thermal buffer in the cooling system eventually depleted and temperatures quickly rose, which triggered the automated systems shutdown.

But that shutdown didn’t come soon enough. “A significant number of storage servers were damaged, as well as a small number of network devices and power units,” according to the company.

Microsoft will now look for more environmentally resilient storage hardware designs, and try to improve its software to help automate and accelerate recovery efforts.

Microsoft wants more zone redundancy. Earlier this year Microsoft introduced Azure Availability Zones, defined as one or more physical data centers in a region with independent power, cooling and networking. AWS and Google already broadly offer these zones, and Azure provides zone-redundant storage in some regions, but not in South Central US.

For Visual Studio Team Services (VSTS), this was the worst outage in its seven-year history, according to the team’s postmortem, written by Buck Hodges, VSTS director of engineering. Ten regions worldwide, including the affected one, host VSTS customers, and many of those regions don’t have availability zones. Going forward, Microsoft will enable VSTS to use availability zones and move to regions that support them, though the service won’t move out of geographic regions where customers have specific data sovereignty requirements.

Service dependencies hurt everyone. Various Azure infrastructure and systems dependencies harmed services outside the region and slowed recovery efforts:

  • The Azure South Central US region is the primary site for Azure Service Manager (ASM), which customers typically use for classic resource types. ASM does not support automatic failover, so ASM requests everywhere experienced higher latencies and failures.
  • Authentication traffic from Azure Active Directory automatically rerouted to other regions, which triggered throttling mechanisms and created latency and timeouts for customers in those regions.
  • Many Azure regions depend on services in VSTS, which led to slowdowns and inaccessibility for several related services.
  • Dependencies on Azure Active Directory and platform services affected Application Insights, according to the group’s postmortem.

Microsoft will review these ASM dependencies, and determine how to migrate services to Azure Resource Manager APIs.
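The way one regional failure spreads through service dependencies can be sketched as a graph traversal. The following is a minimal illustration in Python; the dependency edges are a simplified, hypothetical reading of the bullets above, not Microsoft’s actual service map, and the name `impacted_by` is invented for this example.

```python
from collections import deque

# Hypothetical, simplified edges: component -> components it depends on.
DEPENDS_ON = {
    "Azure Service Manager": ["South Central US"],
    "Azure Active Directory": ["South Central US"],
    "VSTS": ["South Central US"],
    "Application Insights": ["Azure Active Directory"],
    "Services built on VSTS": ["VSTS"],
}


def impacted_by(failed, depends_on):
    """Return every component that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a failed component to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted


print(sorted(impacted_by("South Central US", DEPENDS_ON)))
```

Running the same traversal against a real dependency inventory is one way to spot services, such as ASM here, whose reach extends well beyond their home region.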

Time to rethink replication options? The VSTS team further explained its failover options: wait for recovery, or access data from a read-only backup copy. The latter option would cause latency and data loss, and users of services such as Git, Team Foundation Version Control and Build would be unable to check in, save or deploy code.

Synchronous replication ideally prevents data loss in failovers, but in practice it’s hard to do. All services involved must be ready to commit data and respond at any point in time, and that’s not possible, the company said.
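The tradeoff between the two replication modes can be shown with a small in-memory sketch in Python. It is an illustration under stated assumptions, not how VSTS or Azure SQL implement replication: `Replica`, `write_sync` and `AsyncReplicator` are hypothetical names, and a plain list stands in for durable storage.

```python
class Replica:
    """A toy data store; `log` stands in for durable storage."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.log = []

    def commit(self, record):
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        self.log.append(record)


def write_sync(primary, secondary, record):
    """Acknowledge only after both replicas commit; fail if either is down."""
    if not (primary.available and secondary.available):
        raise ConnectionError("synchronous write needs every replica ready")
    primary.commit(record)
    secondary.commit(record)


class AsyncReplicator:
    """Acknowledge after the primary commits; ship to the secondary later."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary
        self.pending = []  # writes the secondary has not received yet

    def write(self, record):
        self.primary.commit(record)
        self.pending.append(record)

    def flush(self):
        # Ship queued writes in order for as long as the secondary is reachable.
        while self.pending and self.secondary.available:
            self.secondary.commit(self.pending.pop(0))

    def lag(self):
        """Number of writes that would be lost in a failover right now."""
        return len(self.pending)
```

In this sketch the synchronous path refuses writes whenever the secondary is unreachable, which protects data at the cost of availability. The asynchronous path keeps accepting writes, and `lag()` measures how much data a failover would lose.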

Lessons learned? Microsoft said it will reexamine asynchronous replication, and explore active geo-replication for Azure SQL and Azure Storage to asynchronously write data to primary and secondary regions and keep a copy ready for failover.

The VSTS team also will explore how to let customers choose a recovery method based on whether they prioritize faster recovery and productivity over potential loss of data. The system would indicate whether the secondary copy is up to date, and data would be manually reconciled once the primary data center is back up and running.
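A customer-selectable recovery policy of this kind might look like the following Python sketch. Everything here is hypothetical: the postmortem does not describe an API, so `Priority`, `Action`, `SecondaryStatus` and `choose_recovery` are illustrative names, and the sequence numbers assume the last acknowledged primary write is known at decision time.

```python
from dataclasses import dataclass
from enum import Enum


class Priority(Enum):
    FAST_RECOVERY = "fast_recovery"  # accept possible data loss
    NO_DATA_LOSS = "no_data_loss"    # accept a longer outage


class Action(Enum):
    FAIL_OVER = "fail_over"
    FAIL_OVER_AND_RECONCILE = "fail_over_and_reconcile"
    WAIT_FOR_PRIMARY = "wait_for_primary"


@dataclass
class SecondaryStatus:
    last_primary_seq: int     # last write the primary acknowledged (assumed known)
    last_replicated_seq: int  # last write the secondary received

    @property
    def unreplicated_writes(self):
        return self.last_primary_seq - self.last_replicated_seq

    @property
    def up_to_date(self):
        return self.unreplicated_writes == 0


def choose_recovery(priority, status):
    """Pick a recovery action from the customer's priority and replica state."""
    if status.up_to_date:
        # Nothing would be lost, so failing over is safe for everyone.
        return Action.FAIL_OVER
    if priority is Priority.FAST_RECOVERY:
        # Resume work now; missing writes are reconciled when the primary returns.
        return Action.FAIL_OVER_AND_RECONCILE
    return Action.WAIT_FOR_PRIMARY


status = SecondaryStatus(last_primary_seq=1200, last_replicated_seq=1185)
print(choose_recovery(Priority.FAST_RECOVERY, status).value)
print(choose_recovery(Priority.NO_DATA_LOSS, status).value)
```

The design choice worth noting is that the decision depends on both inputs: a current secondary makes failover safe regardless of preference, while a lagging one forces the customer’s stated priority to break the tie.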

Read full article on Cloud Computing from IT knowledge exchange


