

Posted by tuzhanbei2010 on 2024-03-28, from Inner Mongolia



Staffing levels: are data centers at risk of unnecessary outages?

January 16, 2024 






With increasing data center automation, it’s only natural for clients to want assurance that their data will be available as close to 100 percent of the time as possible, and to ask whether enough data center staff are available to achieve a high level of uptime. They also want to know that when a potential outage occurs, there are enough technicians on duty or available to restore services as soon as possible.

Microsoft suffered an outage on 30th August 2023 in its Australia East region in Sydney (launched in 2014 and located in New South Wales), lasting 46 hours. The company says it began at 10.30 UTC that day.

Customers experienced issues with accessing or using Azure, Microsoft 365, and Power Platform services. The outage was triggered by a utility power sag at 08.41 UTC and impacted one of the three Availability Zones of the region.


Microsoft explains: “This power sag tripped a subset of the cooling system chiller units offline and, while working to restore cooling, temperatures in the data center increased to levels above operational thresholds. We powered down a small subset of selected compute and storage scale units, both to lower temperatures and to prevent damage to hardware.”

The vast majority of services were recovered by 22.40 UTC, but Microsoft was unable to complete a full mitigation until 20.00 UTC on 3rd September 2023. The company says this was because some services experienced a prolonged impact, “predominantly as a result of dependencies on recovering subsets of Storage, SQL Database, and/or Cosmos DB services.”

Voltage sag cause


According to the company, the utility voltage sag was caused by a lightning strike on electrical infrastructure situated 18 miles from the impacted Availability Zone of the Australia East region. It adds: “The voltage sag caused cooling system chillers for multiple data centers to shut down. While some chillers automatically restarted, 13 failed to restart and required manual intervention. To do so, the onsite team accessed the data center rooftop facilities, where the chillers are located, and proceeded to sequentially restart chillers moving from one data center to the next.”

What was the impact?


“By the time the team reached the final five chillers requiring a manual restart, the water inside the pump system for these chillers (chilled water loop) had reached temperatures that were too high to allow them to be restarted. In this scenario, the restart is inhibited by a self-protection mechanism that acts to prevent damage to the chiller that would occur by processing water at the elevated temperatures. The five chillers that could not be restarted supported cooling for the two adjacent data halls which were impacted in this incident.”

Microsoft says the two impacted data halls require at least four chillers to be operational. The cooling capacity before the voltage sag consisted of seven chillers, with five of them in operation and two on standby. The company says that some networking, compute, and storage infrastructure began to shut down automatically as data hall temperatures increased, and that this temperature increase impacted service availability. The onsite data center team then had to begin a remote shutdown of any remaining networking, compute, and storage infrastructure at 11.34 UTC to protect data durability and infrastructure health, and to address the thermal runaway.
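The arithmetic behind those figures can be sketched in a few lines of Python. This is only an illustration: the function name `cooling_margin` is hypothetical, and the calculation assumes every chiller contributes equally, which the article does not state. With seven chillers installed and four required, the halls can lose up to three units; with five out of action, at most two remain.

```python
def cooling_margin(installed: int, required: int, unavailable: int) -> int:
    """Return the number of spare chillers; a negative value is a shortfall.

    Hypothetical helper: assumes all chillers have equal capacity.
    """
    if installed < 0 or required < 0 or unavailable < 0:
        raise ValueError("counts must be non-negative")
    if unavailable > installed:
        raise ValueError("cannot lose more chillers than are installed")
    available = installed - unavailable
    return available - required


# Figures quoted in the article: 7 installed, 4 required, 5 could not be restarted.
margin = cooling_margin(installed=7, required=4, unavailable=5)
print(f"Margin: {margin} chillers")
```

A negative margin means the halls cannot be cooled within their operational thresholds, which is consistent with the decision to shut down infrastructure.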

Subsequently, the chilled water loop was permitted to return to a safe temperature, allowing the chillers to be restarted. This nevertheless meant a further infrastructure shutdown and a further reduction in service availability for this Availability Zone. The chillers were eventually brought back online at 12.12 UTC, and the data hall temperatures returned to operational thresholds by 13.30 UTC. Power was then restored to the affected infrastructure, and a phased process to bring the infrastructure back online began.


Microsoft adds that this permitted its team to restore all power to infrastructure by 15.10 UTC, and once the power was restored all compute scale units were returned to operation. This allowed Azure services to recover. However, some services still experienced issues with coming back online.


In the post-incident review, staffing was considered an issue. So, it’s only natural to ask why that was the case, and to consider what could have been done better. It’s not about lambasting the company itself. Even the best-laid plans to prevent outages can go wrong, and across the industry, there is a shortage of data center talent. So, by examining case studies such as this one, there is an opportunity to establish best practices.

Staffing review


Amongst the many mitigations, Microsoft says it increased its technician staffing levels at the data center “to be prepared to execute manual restart procedures of our chillers prior to the change to the Chiller Management System to prevent restart failures.” The night team was temporarily increased from three to seven technicians to enable them to properly understand the underlying issues, so that appropriate mitigations could be put in place. It nevertheless believes the staffing levels at “the time would have been sufficient to prevent impact if a 'load based' chiller restart sequence had been followed, which we have since implemented.”


It adds: “Data center staffing levels published in the Preliminary PIR only accounted for ‘critical environment’ staff onsite. This did not characterize our total data center staffing levels accurately. To alleviate this misconception, we made a change to the preliminary public PIR posted on the Status History page.”

Yet in a Deep Dive, 'Azure Incident Retrospective: VVTQ-J98’, Michael Hughes, VP of APAC datacenter operations at Microsoft, responded to comments about more staff being onsite than the company had originally said were present. It was also suggested that the real fix wasn’t necessarily to have more people onsite, but rather a mode-based sequence in the emergency operating procedures (EOPs), which may not change staffing levels.


Hughes explains: “The three that came out in the report just relate to people who are available to reset the chillers. There were people in their operation staff onsite, and there were also people in the operations center. So that information was incorrect, but you’re right.” He asks us to put ourselves in the moment with 20 chillers posting 3 sags and all in an erroneous state. Then 13 require a manual restart, requiring the deployment of manpower across a very large site.


“You’ve got to run out onto the roof of the building to go and manually reset the chiller, and you’re on the clock”, he adds. With chillers impacted and temperatures rising, staff are having to scramble across the site to try to reset the chillers. They don’t quite get to the pod in time, leading to the thermal runaway. The answer in terms of optimization is to go to the highest load data centers – those that have the highest thermal load and highest number of racks operating to recover cooling there.


So, the focus was to recover the chillers with the highest thermal load. This amounts to a tweak on how Microsoft’s EOP is deployed, and it’s about what the system is supposed to do, which he says should have been taken care of by the software. The auto-restart should have happened, and Hughes argues that there shouldn’t have had to be any manual intervention. This has now been fixed. He believes that “you never want to deploy humans to fix problems if you get software to do it for you.” This led to an update of the chiller management system to stop the incident from occurring again.
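The load-based idea Hughes describes can be illustrated with a short Python sketch. It is not Microsoft's Chiller Management System or its EOP: the `Chiller` class, the unit names, and the thermal load figures are all hypothetical, chosen only to show how a restart queue ordered by thermal load differs from a walk from one building to the next.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chiller:
    name: str               # hypothetical unit label
    data_center: str        # hypothetical building label
    thermal_load_kw: float  # hypothetical load served by this chiller


def load_based_restart_order(chillers: list[Chiller]) -> list[Chiller]:
    """Order chillers so the highest thermal load is restarted first.

    sorted() is stable, so units with equal load keep their input order.
    """
    return sorted(chillers, key=lambda c: c.thermal_load_kw, reverse=True)


# Chillers listed in the order a technician would reach them on foot.
awaiting_restart = [
    Chiller("CH-01", "DC-A", 400.0),
    Chiller("CH-02", "DC-B", 950.0),
    Chiller("CH-03", "DC-C", 700.0),
]

for chiller in load_based_restart_order(awaiting_restart):
    print(chiller.name, chiller.data_center, chiller.thermal_load_kw)
```

In this made-up example the sequential walk would start at DC-A, while the load-based order starts at DC-B, where temperatures would rise fastest.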

Industry issue and risk

Ron Davis, vice president of digital infrastructure operations at the Uptime Institute, notes that these issues and the risks associated with them exist beyond the Microsoft event. “I have been involved in this sort of incident, when a power event occurred and redundant equipment failed to rotate in, and the chilled water temperature quickly increased to a level that prohibited any associated chiller(s) from starting,” he comments before adding:


“This happens. And it can potentially happen to any organization. Data center operations are critical. From a facilities standpoint, uptime and availability is a primary mission for data centers, to keep them up and running.” Then there is the issue of why the industry is experiencing a staffing shortage. He says the industry is maturing from an equipment, systems, and infrastructure perspective. Even remote monitoring and data center automation are getting better. Yet there is still a heavy reliance on the presence and activities of critical operating technicians - especially during an emergency response as outlined in the Microsoft case.

Davis adds: “At Uptime, we have been doing operational assessments for over a decade, including those related to our Management and Operations stamp of approval, and our Tier Certification of Operational Sustainability. During those assessments, we weigh staffing and organization quite highly.”

Optimal staffing levels

As for whether there were sufficient staff onsite during the Microsoft outage, and what the optimal number of staff present should be, John Booth, Managing Director of Carbon3IT Ltd and Chair of the Energy Efficiency Group of the Data Centre Alliance, says it very much depends on the design and scale of the data center, as well as on the level of automation for monitoring and maintenance. Data centers are also often reliant on outsourced personnel for specific maintenance and emergency tasks, who offer a 4-hour response. Beyond this, he suggests there is a need for more information to determine whether 7 staff were sufficient, but admits that 3 members of staff are usually the norm for a night shift, “with perhaps more during the day depending on the rate of churn of equipment.”


Davis adds that there is no reliable rule of thumb because each and every organization and site is different. However, there are generally accepted staff calculation techniques that can determine the right staffing levels for a particular data center site. As for the Microsoft incident, he’d need to formally do the calculations to decide whether 3 or 7 technicians were sufficient. It’s otherwise just a guess.


He adds: “I am sure Microsoft has gone through this; any well-developed operating programs must perform these calculations. This is something we look for during our assessments: have they done the staff calculations that are necessary? Some of the factors to include in the calculations are shift presence requirements – what is the number of technicians required to be on-site at all times, in order to do system checks and perform emergency response? Another key consideration is site equipment, systems, and infrastructure: what maintenance hours are required for associated planned, corrective, and other maintenance? Any staffing calculation considers all of these factors and more, including in-house resources and contractors as well.”
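The two factors Davis names, shift presence and maintenance hours, can be combined in a rough estimate. The sketch below is not Uptime Institute's method: the function name, the figure of 1,800 productive hours per technician per year, and the assumption that maintenance hours come on top of shift coverage are all illustrative choices.

```python
import math

HOURS_PER_YEAR = 24 * 365  # continuous coverage, ignoring leap years


def estimate_technicians(
    min_on_shift: int,
    maintenance_hours_per_year: float,
    productive_hours_per_technician: float = 1800.0,  # assumed, not from the article
) -> int:
    """Estimate headcount from shift presence plus maintenance workload.

    Assumes maintenance is carried out in addition to the minimum
    shift presence rather than by the technicians already on shift.
    """
    if min_on_shift < 0 or maintenance_hours_per_year < 0:
        raise ValueError("inputs must be non-negative")
    if productive_hours_per_technician <= 0:
        raise ValueError("productive hours must be positive")
    coverage_hours = min_on_shift * HOURS_PER_YEAR
    total_hours = coverage_hours + maintenance_hours_per_year
    return math.ceil(total_hours / productive_hours_per_technician)


# Hypothetical site: 3 technicians on shift at all times, 4,000 maintenance hours a year.
print(estimate_technicians(min_on_shift=3, maintenance_hours_per_year=4000))
```

A real calculation would also account for contractors, leave, training, and site-specific factors, as Davis indicates.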


Microsoft: Advocate of EOPs

“From what I know of Microsoft, they are a big advocate for emergency operating procedures and correlating operational drills. The properly scripted EOP, used during the performance of a well-developed operational drill may have supported the staff in this effort, and/or perhaps identified the need for more staffing in the event of such an incident.”

Microsoft had emergency operating procedures (EOPs) in place. It has learnt from this incident and amended them. EOPs are where organizations need to start, and organizations should also examine their testing and drill scenarios. A data center’s best protection, says Davis, is a significant EOP library based on the potential incidents that can occur.


He believes that the Microsoft team did their best and suggests that they deserve all the support available as these situations are very stressful. This support should come in the form of all the training, tools, and documentation an organization can provide them. He is confident that Microsoft is considering all of the lessons learned and adjusting their practices accordingly.


As to whether outages could be attributed to staffing levels, it’s entirely possible, but staffing might not have been the sole cause in Microsoft’s case, as Booth believes there was a basic design flaw. He thinks an electrical power sag should have triggered backup generators to provide power to all services to prevent the cooling systems from failing. There should therefore be an improved integrated systems test, in which every system is tested under a range of external emergency events. The test program should include the failure of the chillers and any applicable recovery procedures.

深知社

Carlson Chen

Member of the DKV (DeepKnowledge Volunteer) program

HVAC engineer, Alibaba Cloud

Elite member of the DKV (DeepKnowledge Volunteer) program

