
A deep dive into the Microsoft Sydney (Australia East) outage

Posted by tuzhanbei2010, 2024-03-28



Staffing levels: are data centers at risk of unnecessary outages?

January 16, 2024 

Translator’s Note

The primary goal of infrastructure operations is safety and stability.

As the automation of critical data center infrastructure keeps improving, the causes of any incident should not be reduced to a single factor. Viewed backwards along the accident causation chain emphasized by Heinrich’s Law, an incident is not an isolated event but the cumulative result of a series of preceding events. With the benefit of hindsight: operations management spans management processes and systems, people, facilities, and platform tooling, so shouldn’t we look for flaws or gaps in each of these related areas? Perhaps the only real difference is how much weight each one carries.

It has to be said that staffing is a core element in achieving operational safety goals. The factors that shape staffing include the Tier level, shift requirements, the size and configuration of the data center, the complexity of infrastructure operations, the maintenance workload, staff overtime, and the service support provided to the business in the data halls. In general, though, a higher infrastructure uptime target, that is, a higher Tier level, means higher staffing requirements. Arriving at that number takes logic and calculation, and putting the right number of qualified operations staff in place is key to success.


With increasing data center automation, it’s only natural for clients to want assurance that their data will be available as close to 100 percent of the time as possible, and to ask whether enough data center staff are available to achieve a high level of uptime. They also want to know that when a potential outage occurs, there are enough technicians on duty or available to restore services as soon as possible.

Microsoft suffered an outage on 30th August 2023 in its Australia East region in Sydney (launched in 2014 and located in New South Wales), lasting 46 hours. The company says it began at 10.30 UTC that day.

Customers experienced issues with accessing or using Azure, Microsoft 365, and Power Platform services. It was triggered by a utility power sag at 08.41 UTC and impacted one of the three Availability Zones of the region.

Microsoft explains: “This power sag tripped a subset of the cooling system chiller units offline and, while working to restore cooling, temperatures in the data center increased to levels above operational thresholds. We powered down a small subset of selected compute and storage scale units, both to lower temperatures and to prevent damage to hardware.”

Despite this, the vast majority of services were recovered by 22.40 UTC, but Microsoft wasn’t able to complete a full mitigation until 20.00 UTC on 3rd September 2023. The company says this was because some services experienced a prolonged impact, “predominantly as a result of dependencies on recovering subsets of Storage, SQL Database, and/or Cosmos DB services.”

Voltage sag cause

The utility voltage sag was caused, according to the company, by a lightning strike on electrical infrastructure situated 18 miles from the impacted Availability Zone of the Australia East region. They add: “The voltage sag caused cooling system chillers for multiple data centers to shut down. While some chillers automatically restarted, 13 failed to restart and required manual intervention. To do so, the onsite team accessed the data center rooftop facilities, where the chillers are located, and proceeded to sequentially restart chillers moving from one data center to the next.”

What was the impact?

“By the time the team reached the final five chillers requiring a manual restart, the water inside the pump system for these chillers (chilled water loop) had reached temperatures that were too high to allow them to be restarted. In this scenario, the restart is inhibited by a self-protection mechanism that acts to prevent damage to the chiller that would occur by processing water at the elevated temperatures. The five chillers that could not be restarted supported cooling for the two adjacent data halls which were impacted in this incident.”
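
The self-protection behavior Microsoft describes can be pictured as a simple interlock: a restart request is refused whenever the chilled water loop is already above the chiller’s safe restart temperature. A minimal sketch of that check, with a hypothetical threshold value (Microsoft has not published the actual setpoint):

```python
# Illustrative sketch of a chiller restart interlock; the 18 °C limit and
# function names are hypothetical, not Microsoft's actual setpoints.

MAX_SAFE_LOOP_TEMP_C = 18.0  # assumed self-protection threshold

def can_restart(chilled_water_loop_temp_c: float) -> bool:
    """Return True only if the loop is cool enough to restart safely."""
    return chilled_water_loop_temp_c <= MAX_SAFE_LOOP_TEMP_C

print(can_restart(12.0))  # True  - loop still cool, restart allowed
print(can_restart(24.0))  # False - loop too warm, interlock blocks the restart
```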

Microsoft says the two impacted data halls require at least four chillers to be operational. The cooling capacity before the voltage sag consisted of seven chillers, with five of them in operation and two on standby. The company says that some networking, compute, and storage infrastructure began to shut down automatically as data hall temperatures increased. This temperature increase impacted service availability. However, the onsite data center team had to begin a remote shutdown of any remaining networking, compute, and storage infrastructure at 11.34 UTC to protect data durability, infrastructure health, and to address the thermal runaway.
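
The arithmetic behind the shortfall is straightforward: the two halls were served by seven chillers (five duty plus two standby) and needed at least four running, but five of the seven could not be restarted. A small worked check using only the figures quoted in the article:

```python
# Cooling redundancy check for the two impacted data halls,
# using the figures quoted in the article.

total_chillers = 7        # 5 duty + 2 standby
minimum_required = 4      # at least four chillers needed for the two halls
failed_to_restart = 5     # chillers blocked by the self-protection interlock

available = total_chillers - failed_to_restart
print(f"available={available}, required={minimum_required}")
print("sufficient cooling" if available >= minimum_required
      else "cooling shortfall -> risk of thermal runaway")
```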

Subsequently, the chilled water loop returned to a safe temperature, allowing the chillers to be restarted. The shutdown nevertheless meant a further loss of infrastructure and a further reduction in service availability for this Availability Zone. The chillers were eventually brought back online at 12.12 UTC, and data hall temperatures returned to operational thresholds by 13.30 UTC. Power was then restored to the affected infrastructure, and a phased process of bringing it back online began.

Microsoft adds that this permitted its team to restore all power to infrastructure by 15.10 UTC, and once the power was restored all compute scale units were returned to operation. This allowed Azure services to recover. However, some services still experienced issues with coming back online.

In the post-incident review, staffing was considered an issue. So, it’s only natural to ask why that was the case, and to consider what could have been done better. It’s not about lambasting the company itself. Even the best-laid plans to prevent outages can go wrong, and across the industry, there is a shortage of data center talent. So, by examining case studies such as this one, there is an opportunity to establish best practices.

Staffing review

Amongst the many mitigations, Microsoft says it increased its technician staffing levels at the data center “to be prepared to execute manual restart procedures of our chillers prior to the change to the Chiller Management System to prevent restart failures.” The night team was temporarily increased from three to seven technicians so that they could properly understand the underlying issues and put appropriate mitigations in place. Microsoft nevertheless believes that staffing levels at the time “would have been sufficient to prevent impact if a ‘load based’ chiller restart sequence had been followed, which we have since implemented.”

It adds: “Data center staffing levels published in the Preliminary PIR only accounted for ‘critical environment’ staff onsite. This did not characterize our total data center staffing levels accurately. To alleviate this misconception, we made a change to the preliminary public PIR posted on the Status History page.”

Yet in a deep-dive ‘Azure Incident Retrospective: VVTQ-J98’, Michael Hughes, VP of APAC datacenter operations at Microsoft, responded to comments that more staff had been onsite than the company originally said were present. It was also suggested that the real fix wasn’t necessarily to have more people onsite, but rather a mode-based sequence in the emergency operating procedures (EOPs), which may not change staffing levels.

Hughes explains: “The three that came out in the report just relate to people who are available to reset the chillers. There were people in their operation staff onsite, and there were also people in the operations center. So that information was incorrect, but you’re right.” He asks us to put ourselves in the moment: 20 chillers hit by three voltage sags, all in an erroneous state. Thirteen of them then required a manual restart, which meant deploying manpower across a very large site.

“You’ve got to run out onto the roof of the building to go and manually reset the chiller, and you’re on the clock,” he adds. With chillers impacted and temperatures rising, staff had to scramble across the site to try to reset the chillers. They didn’t quite get to the pod in time, leading to the thermal runaway. The answer, in terms of optimization, is to go to the highest-load data centers – those with the highest thermal load and the most racks operating – and recover cooling there first.
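
The optimization Hughes describes is essentially a dispatch-ordering problem: with a limited number of technicians and temperatures rising, manual restarts should be attempted first where the thermal load, and therefore the rate of temperature rise, is highest. A minimal sketch of that ordering, using made-up hall names and loads rather than anything from the incident:

```python
# Hypothetical illustration of a "load based" restart order:
# send technicians to the halls with the highest thermal load first.

halls = [
    {"name": "Hall A", "thermal_load_kw": 1200, "chillers_down": 2},
    {"name": "Hall B", "thermal_load_kw": 3400, "chillers_down": 3},
    {"name": "Hall C", "thermal_load_kw": 2100, "chillers_down": 1},
]

# Sort descending by thermal load so the hottest halls are visited first.
restart_order = sorted(halls, key=lambda h: h["thermal_load_kw"], reverse=True)

for hall in restart_order:
    print(f'{hall["name"]}: {hall["thermal_load_kw"]} kW, '
          f'{hall["chillers_down"]} chiller(s) to restart')
```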

So, the focus was to recover the chillers with the highest thermal load. This amounts to a tweak to how Microsoft’s EOP is deployed, and to what the system is supposed to do, which he says should have been taken care of by the software. The auto-restart should have happened, and Hughes argues that there shouldn’t have had to be any manual intervention. This has now been fixed. He believes that “you never want to deploy humans to fix problems if you get software to do it for you.” This led to an update of the chiller management system to stop the incident from occurring again.
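
Hughes’s “software should do it” point corresponds to moving the restart logic into the chiller management system: after a sag, keep retrying the restart automatically while the self-protection condition allows it, rather than waiting for someone to reach the roof. The sketch below is a hedged illustration of that idea; the retry cadence, threshold, and helper callables are assumptions, not Microsoft’s published design:

```python
import time

MAX_SAFE_LOOP_TEMP_C = 18.0   # assumed interlock threshold
RETRY_INTERVAL_S = 30         # assumed retry cadence

def auto_restart(chiller_id: str, read_loop_temp, attempt_restart,
                 max_attempts: int = 10) -> bool:
    """Retry a chiller restart automatically after a power sag.

    read_loop_temp() -> float and attempt_restart() -> bool are stand-ins
    for the controls integration; both are hypothetical.
    """
    for attempt in range(1, max_attempts + 1):
        # Only try while the self-protection interlock would allow a restart.
        if read_loop_temp() <= MAX_SAFE_LOOP_TEMP_C and attempt_restart():
            print(f"{chiller_id}: restarted on attempt {attempt}")
            return True
        time.sleep(RETRY_INTERVAL_S)
    print(f"{chiller_id}: auto-restart failed, escalate to on-site technicians")
    return False
```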

Industry issue and risk

Ron Davis, vice president of digital infrastructure operations at the Uptime Institute, adds that it’s important to point out that these issues and the risks associated with them exist beyond the Microsoft event. “I have been involved in this sort of incident, when a power event occurred and redundant equipment failed to rotate in, and the chilled water temperature quickly increased to a level that prohibited any associated chiller(s) from starting,” he comments before adding:

“This happens. And it can potentially happen to any organization. Data center operations are critical. From a facilities standpoint, uptime and availability is a primary mission for data centers, to keep them up and running.” Then there is the issue of why the industry is experiencing a staffing shortage. He says the industry is maturing from an equipment, systems, and infrastructure perspective. Even remote monitoring and data center automation are getting better. Yet there is still a heavy reliance on the presence and activities of critical operating technicians – especially during an emergency response as outlined in the Microsoft case.

Davis adds: “At Uptime, we have been doing operational assessments for over a decade, including those related to our Management and Operations stamp of approval, and our Tier Certification of Operational Sustainability. During those assessments, we weigh staffing and organization quite highly.”

Optimal staffing levels

As for whether there were sufficient staff onsite during the Microsoft outage, and what the optimal number of staff present should be, John Booth, Managing Director of Carbon3IT Ltd and Chair of the Energy Efficiency Group of the Data Centre Alliance, says it very much depends on the design and scale of the data center, as well as on the level of automation for monitoring and maintenance. Data centers are also often reliant on outsourced personnel for specific maintenance and emergency tasks, typically on a four-hour response. Beyond this, he suggests more information would be needed to determine whether seven staff were sufficient, but admits that three members of staff are usually the norm for a night shift, “with perhaps more during the day depending on the rate of churn of equipment.”

Davis adds that there is no reliable rule of thumb because each and every organization and site is different. However, there are generally accepted staff calculation techniques that can determine the right staffing levels for a particular data center site. As for the Microsoft incident, he’d need to formally do the calculations to decide whether 3 or 7 technicians were sufficient. It’s otherwise just a guess.

He adds: “I am sure Microsoft has gone through this; any well-developed operating programs must perform these calculations. This is something we look for during our assessments: have they done the staff calculations that are necessary? Some of the factors to include in the calculations are shift presence requirements – what is the number of technicians required to be on-site at all times, in order to do system checks and perform emergency response? Another key consideration is site equipment, systems, and infrastructure: what maintenance hours are required for associated planned, corrective, and other maintenance? Any staffing calculation considers all of these factors and more, including in-house resources and contractors as well.”
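
Davis’s two main inputs, round-the-clock shift presence and the annual maintenance workload, are enough to sketch the kind of calculation he describes. Every number below (productive hours per technician, maintenance hours, minimum shift headcount) is an illustrative assumption rather than Uptime Institute guidance:

```python
# Illustrative staffing calculation combining shift coverage and maintenance load.
# All numbers are assumptions made for the sake of the example.

HOURS_PER_YEAR = 24 * 365                 # the site must be covered continuously
PRODUCTIVE_HOURS_PER_TECH = 1700          # per technician per year, after leave/training
MIN_TECHS_ON_SHIFT = 3                    # required presence at all times
ANNUAL_MAINTENANCE_HOURS = 6000           # planned + corrective + other maintenance

shift_coverage_ftes = MIN_TECHS_ON_SHIFT * HOURS_PER_YEAR / PRODUCTIVE_HOURS_PER_TECH
maintenance_ftes = ANNUAL_MAINTENANCE_HOURS / PRODUCTIVE_HOURS_PER_TECH
total_ftes = shift_coverage_ftes + maintenance_ftes

print(f"shift coverage: {shift_coverage_ftes:.1f} FTE")
print(f"maintenance workload: {maintenance_ftes:.1f} FTE")
print(f"indicative headcount: {total_ftes:.1f} FTE (before contractors)")
```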


Microsoft: Advocate of EOPs

“From what I know of Microsoft, they are a big advocate for emergency operating procedures and correlating operational drills. The properly scripted EOP, used during the performance of a well-developed operational drill, may have supported the staff in this effort, and/or perhaps identified the need for more staffing in the event of such an incident.”

Microsoft had emergency operating procedures (EOPs) in place. They have learnt from this incident and amended their EOPs. They are where organizations need to start, and they should examine testing and drill scenarios. A data center’s best protection is, says Davis, a significant EOP library, based on potential incidents that can occur.
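
An EOP library of the kind Davis describes is, in practice, a catalog of scenario-specific procedures that can be drilled and audited. The entry below is a hypothetical illustration of what one such scenario might capture; the field names and values are not Microsoft’s or Uptime’s format:

```python
# Hypothetical shape of one entry in an EOP library; all field names and
# values are illustrative, not taken from Microsoft or Uptime documentation.

eop_voltage_sag = {
    "scenario": "Utility voltage sag trips chillers offline",
    "trigger": "Multiple chillers fail to auto-restart after a power event",
    "steps": [
        "Confirm which chillers restarted automatically via the chiller management system",
        "Prioritize manual restarts by data hall thermal load (highest first)",
        "Verify the chilled water loop is within the restart interlock limit before restarting",
        "Escalate to load shedding if hall temperatures approach operational thresholds",
    ],
    "minimum_staff": 7,           # illustrative, not a documented requirement
    "drill_frequency_months": 6,  # illustrative drill cadence
}

print(f'{eop_voltage_sag["scenario"]}: {len(eop_voltage_sag["steps"])} steps')
```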

He believes that the Microsoft team did their best and suggests that they deserve all the support available as these situations are very stressful. This support should come in the form of all the training, tools, and documentation an organization can provide them. He is confident that Microsoft is considering all of the lessons learned and adjusting their practices accordingly.

As to whether staffing levels could be attributed to outages, it’s entirely possible, but that might not have been the sole cause in Microsoft’s case as Booth believes there was a basic design flaw. He thinks an electrical power sag should have triggered backup generators to provide power to all services to prevent the cooling systems from failing. There should therefore be an improved integrated systems test, which is where you test every system under a range of external emergency events. The test program should include the failure of the chillers and any applicable recovery procedures.
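
Booth’s suggestion amounts to extending the integrated systems test matrix so that chiller restart failures are exercised alongside the power events that can cause them. A hypothetical fragment of such a matrix, with illustrative expected responses rather than a published test plan:

```python
# Hypothetical integrated systems test (IST) scenarios of the kind Booth suggests;
# the expected responses are illustrative, not a published test plan.

ist_scenarios = [
    ("Utility voltage sag", "Generators pick up load; chillers ride through or auto-restart"),
    ("Full utility outage", "UPS bridges to generators; cooling continues uninterrupted"),
    ("Chillers fail to auto-restart", "Chiller management system retries; EOP manual restart drill"),
    ("Chilled water loop over temperature", "Restart interlock verified; load shedding procedure exercised"),
]

for event, expected in ist_scenarios:
    print(f"{event:35s} -> {expected}")
```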



DeepKnowledge (深知社)


Translated by:

Carlson Chen

Member of the DKV (DeepKnowledge Volunteer) program

Proofread by:

贾梦檩

HVAC Engineer, Alibaba Cloud

Elite member of the DKV (DeepKnowledge Volunteer) program
