
The post Learn from these Microsoft’s Azure outage postmortem takeaways appeared first on The Troposphere.


Microsoft has shed more light on last week’s major Azure outage in reports that generally confirm what everyone already knew: a storm near the Azure South Central US region knocked cooling systems offline and shut down systems that took days to recover because of issues with the cloud platform’s architecture.

But the reports also illuminate the scope of systems damage, the infrastructure dependencies that crippled the systems, and plans to increase resiliency for customers.

What we know now

The storm damaged hardware. Multiple voltage surges and sags in the utility power supply caused part of the data center to transfer to generator power, and knocked the cooling system offline despite the existence of surge protectors, according to Microsoft’s overall root-cause analysis (RCA). A thermal buffer in the cooling system eventually depleted and temperatures quickly rose, which triggered the automated systems shutdown.

But that shutdown didn’t come soon enough. “A significant number of storage servers were damaged, as well as a small number of network devices and power units,” according to the company.

Microsoft will now look for more environmentally resilient storage hardware designs, and try to improve its software to help automate and accelerate recovery efforts.

Microsoft wants more zone redundancy. Earlier this year Microsoft introduced Azure Availability Zones, defined as one or more physical data centers in a region with independent power, cooling and networking. AWS and Google already broadly offer these zones, and Azure provides zone-redundant storage in some regions, but not in South Central US.

For Visual Studio Team Services (VSTS), this was the worst outage in its seven-year history, according to the team’s postmortem, written by Buck Hodges, VSTS director of engineering. Ten regions worldwide, including the affected one, host VSTS customers, and many of those regions don’t have availability zones. Going forward, Microsoft will enable VSTS to use availability zones and move to regions that support them, though the service won’t move out of geographic regions where customers have specific data sovereignty requirements.

Service dependencies hurt everyone. Various Azure infrastructure and systems dependencies harmed services outside the region and slowed recovery efforts:

  • The Azure South Central US region is the primary site for Azure Service Manager (ASM), which customers typically use for classic resource types. ASM does not support automatic failover, so ASM requests everywhere experienced higher latencies and failures.
  • Authentication traffic from Azure Active Directory automatically routed to other regions, which triggered throttling mechanisms and created latency and timeouts for customers in those regions.
  • Many Azure regions depend on services in VSTS, which led to slowdowns and inaccessibility for several related services.
  • Dependencies on Azure Active Directory and platform services affected Application Insights, according to the group’s postmortem.

Microsoft will review these ASM dependencies and determine how to migrate services to Azure Resource Manager APIs.
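To see why rerouted authentication traffic can trip throttling in otherwise healthy regions, here is a minimal sketch of a token-bucket throttle driven by a simulated clock. It is an illustration only, not Microsoft's implementation: the `TokenBucket` class, the `simulate` helper and all of the numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class TokenBucket:
    """Hypothetical token-bucket throttle driven by a simulated clock."""
    rate_per_tick: int  # tokens added each tick (assumed regional budget)
    capacity: int       # maximum tokens the bucket can hold
    tokens: int = 0

    def tick(self) -> None:
        # Refill at the start of every tick, never above capacity.
        self.tokens = min(self.capacity, self.tokens + self.rate_per_tick)

    def allow(self) -> bool:
        # Admit a request only if a token is available.
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False


def simulate(requests_per_tick: int, ticks: int, bucket: TokenBucket) -> Tuple[int, int]:
    """Return (accepted, throttled) request counts over the simulated ticks."""
    accepted = throttled = 0
    for _ in range(ticks):
        bucket.tick()
        for _ in range(requests_per_tick):
            if bucket.allow():
                accepted += 1
            else:
                throttled += 1
    return accepted, throttled


# Normal load fits the budget; rerouted traffic from a failed region does not.
normal = simulate(80, 10, TokenBucket(rate_per_tick=100, capacity=100))
with_failover = simulate(150, 10, TokenBucket(rate_per_tick=100, capacity=100))
print(normal)         # (800, 0)
print(with_failover)  # (1000, 500)
```

In this toy model, a region sized for 100 requests per tick handles its own 80 without rejections, but starts throttling a third of all requests once it also absorbs traffic from the failed region.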

Time to rethink replication options? The VSTS team further explained failover options: wait for recovery, or access data from a read-only backup copy. The latter option would cause latency and data loss, and users of services such as Git, Team Foundation Version Control and Build would be unable to check in, save or deploy code.

Synchronous replication ideally prevents data loss in failovers, but in practice it’s hard to do. All services involved must be ready to commit data and respond at any point in time, and that’s not possible, the company said.
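The constraint can be sketched in a few lines. In this simplified model, a write is committed only if every replica reports that it is ready, so a single unavailable replica blocks writes everywhere. The `Replica` class and `synchronous_write` function are hypothetical names for illustration, not part of any Azure API.

```python
from typing import List


class Replica:
    """Hypothetical replica that may or may not be ready to commit."""

    def __init__(self, name: str, available: bool = True) -> None:
        self.name = name
        self.available = available
        self.log: List[str] = []

    def prepare(self, record: str) -> bool:
        # A real system would reserve resources here; this sketch only
        # reports whether the replica is reachable.
        return self.available

    def commit(self, record: str) -> None:
        self.log.append(record)


def synchronous_write(record: str, replicas: List[Replica]) -> bool:
    """Commit the record on all replicas, or on none of them."""
    if not all(r.prepare(record) for r in replicas):
        return False  # one unready replica rejects the write everywhere
    for r in replicas:
        r.commit(record)
    return True


primary = Replica("primary")
secondary = Replica("secondary")
print(synchronous_write("commit-1", [primary, secondary]))  # True

secondary.available = False  # the secondary region becomes unreachable
print(synchronous_write("commit-2", [primary, secondary]))  # False
print(primary.log)  # ['commit-1']
```

No data is lost on failover because both copies always match, but the price is that write availability depends on every replica being reachable at all times.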

Lessons learned? Microsoft said it will reexamine asynchronous replication, and explore active geo-replication for Azure SQL and Azure Storage to asynchronously write data to primary and secondary regions and keep a copy ready for failover.

The VSTS team also will explore how to let customers choose a recovery method based on whether they prioritize faster recovery and productivity over potential loss of data. The system would indicate whether the secondary copy is up to date, and any differences would be reconciled manually once the primary data center is back up and running.
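As a rough sketch of that tradeoff, the following model acknowledges writes on the primary immediately and ships them to the secondary later, then reports what a customer-selected recovery policy would mean during an outage. All names here (`AsyncReplicatedStore`, `RecoveryPolicy`, `plan_recovery`) are hypothetical, and the model assumes the replication lag is known at failover time, which a real system would have to track outside the failed region.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class RecoveryPolicy(Enum):
    WAIT_FOR_PRIMARY = "wait"   # no data loss, longer downtime
    FAIL_OVER_NOW = "failover"  # faster recovery, recent writes may be missing


@dataclass
class AsyncReplicatedStore:
    """Hypothetical store with an asynchronously replicated secondary."""
    primary: List[str] = field(default_factory=list)
    secondary: List[str] = field(default_factory=list)

    def write(self, record: str) -> None:
        # Acknowledged as soon as the primary has it; the secondary lags.
        self.primary.append(record)

    def replicate(self, max_records: int) -> None:
        # Ship up to max_records pending writes to the secondary, in order.
        start = len(self.secondary)
        self.secondary.extend(self.primary[start:start + max_records])

    def pending(self) -> List[str]:
        # Writes acknowledged by the primary but not yet on the secondary.
        return self.primary[len(self.secondary):]


def plan_recovery(store: AsyncReplicatedStore, policy: RecoveryPolicy) -> Dict[str, object]:
    """Describe what happens if the primary goes down right now."""
    pending = store.pending()
    fail_over = policy is RecoveryPolicy.FAIL_OVER_NOW
    return {
        "secondary_up_to_date": not pending,
        "serve_from": "secondary" if fail_over else "primary_when_restored",
        # Writes to reconcile by hand once the primary is back.
        "to_reconcile": pending if fail_over else [],
    }


store = AsyncReplicatedStore()
for record in ["commit-1", "commit-2", "commit-3"]:
    store.write(record)
store.replicate(max_records=2)  # the third write has not shipped yet

print(plan_recovery(store, RecoveryPolicy.FAIL_OVER_NOW))
print(plan_recovery(store, RecoveryPolicy.WAIT_FOR_PRIMARY))
```

With `FAIL_OVER_NOW`, the customer is back to work on the secondary right away, but `commit-3` has to be reconciled later. With `WAIT_FOR_PRIMARY`, nothing is lost, at the cost of waiting out the outage.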



Read full article on Cloud Computing from IT knowledge exchange


