[DKV] Essential Reading for the Data Center Industry: Next-Generation Resiliency (Part 2)

yi321yi 2019-03-06

Next-Generation Resiliency

FOCUS | SEPTEMBER 2017

By Andy Lawrence, Executive Director, Uptime Institute & 451 Research, and Todd Traver, Vice President, IT Optimization and Strategy, Uptime Institute

Continued from the previous part: Essential Reading for the Data Center Industry: Next-Generation Resiliency (Part 1)

None of these challenges are remotely new, and many systems for distributing data and locking and unlocking databases were developed in the 1980s. (Early papers by engineers at IBM and Tandem, among others, are still available. Influential relational database pioneer Ted Codd published rules for distributed database management systems in the 1980s.) However, cloud providers that have huge amounts of data in multiple locations, and that offer in- and out-of-region replication and backup, now have to deal with these issues on an altogether new scale.

Professor Eric Brewer of the University of California, Berkeley (now VP of Infrastructure at Google) identified a key issue. His theorem (see Figure 2) states that it is not possible to design a distributed system that guarantees both availability and complete integrity in the face of a network partition or the loss of a node.

CAP theorem, also called Brewer’s theorem, states that it is impossible for a distributed computer system to simultaneously guarantee all three of the following attributes:

· Consistency: Every read receives the most recent write or an error.

· Availability: Every request receives a response, though without a guarantee that it contains the most recent version of the information.

· Partition tolerance: The system continues to operate despite arbitrary partitioning due to network failures.

Figure 2: CAP Theorem
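
The trade-off in the CAP theorem can be illustrated with a toy two-replica store. This is a minimal sketch, not a real database: the class names `Replica` and `TwoNodeStore` and the mode labels "CP" and "AP" are hypothetical, chosen only for illustration. When the link between replicas is cut, the "CP" store refuses requests (an error instead of a possibly wrong answer), while the "AP" store keeps answering but may return an older value.

```python
class Replica:
    """One copy of a single stored value."""
    def __init__(self):
        self.value = None
        self.version = 0


class TwoNodeStore:
    """Toy store with replicas 'a' and 'b' (hypothetical, for illustration)."""
    def __init__(self, mode):
        assert mode in ("CP", "AP")
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False  # True when the replicas cannot talk

    def _pick(self, node):
        return (self.a, self.b) if node == "a" else (self.b, self.a)

    def write(self, node, value):
        local, remote = self._pick(node)
        if self.partitioned:
            if self.mode == "CP":
                # Consistency first: refuse rather than let copies diverge.
                raise RuntimeError("unavailable: peer replica unreachable")
            # Availability first: accept locally, copies now differ.
            local.value, local.version = value, local.version + 1
            return
        new_version = max(local.version, remote.version) + 1
        for replica in (local, remote):
            replica.value, replica.version = value, new_version

    def read(self, node):
        local, _ = self._pick(node)
        if self.partitioned and self.mode == "CP":
            # Cannot prove this is the most recent write, so return an error.
            raise RuntimeError("unavailable: peer replica unreachable")
        return local.value  # in AP mode this may be stale


ap = TwoNodeStore("AP")
ap.write("a", "x")
ap.partitioned = True
ap.write("a", "y")
print(ap.read("a"), ap.read("b"))  # the two replicas now disagree
```

In this sketch the partition is a simple flag; in practice, detecting a partition is itself uncertain, which is part of why the network matters so much to the choice.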

This theorem is important when it comes to resiliency planning using more than one active site. Organizations typically place a very high value on database accuracy, but availability is also critical for many applications, especially transactional, customer-facing ones. This rule shows that, by moving to a distributed environment, a company may have to prioritize one guarantee over the other. Brewer’s theorem also points to the critical importance of the network, which, if it is highly available at all times, can reduce if not eliminate the need for that choice. This explains why hyperscale operators such as Google have invested so heavily in intra-data center fiber and other networking equipment to ensure high availability and capacity.

Base, Acid and Databases

Until recently, the organizations that most used distributed resiliency were those for which even a small outage could be catastrophic. This group (investment banks, for example) writes all data to two data centers simultaneously (synchronous replication). While one set of data may act as the master, the second is a real-time copy, and if there is a failure, traffic is switched to the second site. There is no danger of an integrity issue, because the software only allows writes to one live master. Suppliers of databases and storage systems and software, such as IBM, HP, Hitachi, Oracle, EMC and others, have long engineered systems for this high-spending category.
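
The single-master, synchronous pattern described here can be sketched in a few lines. This is an illustrative model under simplifying assumptions, not any vendor's implementation: `Site` and `SyncPair` are hypothetical names, a "site" is just an in-memory list, and a write is acknowledged only when every required site is reachable.

```python
class Site:
    """A data center holding an ordered log of records (in memory here)."""
    def __init__(self, name):
        self.name = name
        self.log = []
        self.up = True


class SyncPair:
    """One live write master plus a real-time copy at a second site."""
    def __init__(self, master, standby):
        self.master = master
        self.standby = standby  # None after failover (running unprotected)

    def write(self, record):
        targets = [self.master]
        if self.standby is not None:
            targets.append(self.standby)
        # Acknowledge only if every required site can take the record,
        # so the copy never lags behind an acknowledged write.
        if not all(site.up for site in targets):
            raise ConnectionError("write refused: a required site is unreachable")
        for site in targets:
            site.log.append(record)
        return "ack"

    def fail_over(self):
        """Promote the copy; there is still exactly one live master."""
        if self.standby is None:
            raise RuntimeError("no standby site to promote")
        self.master, self.standby = self.standby, None


pair = SyncPair(Site("primary"), Site("secondary"))
pair.write("trade-1")
pair.write("trade-2")
pair.master.up = False   # the primary site is lost
pair.fail_over()         # traffic switches to the second site
print(pair.master.name, pair.master.log)
```

Because only one site accepts writes at any moment, the promoted site already holds every acknowledged record. The cost is that a write must wait for both sites.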

Systems that allow no compromise on integrity are sometimes called ACID systems, to denote Atomicity (each transaction is all or nothing), Consistency (transactions complete according to all valid rules), Isolation (each part of the transaction is isolated from others, as if performed sequentially) and Durability (the transaction is permanent). ACID favors consistency over all else. When ACID databases work together, or if a single database is spread across multiple locations, protocols and processes ensure agreement between multiple endpoints before a transaction can go ahead. Recent advances in so-called NewSQL databases, including Google’s Spanner, replicate this on a distributed, wide scale, with some limited trade-offs.
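
The report does not name the protocols used to reach agreement between endpoints. One widely taught example is two-phase commit, sketched below in simplified form. The names `Participant` and `two_phase_commit` are hypothetical, and the sketch ignores timeouts, coordinator failure and durable logging, all of which a real implementation would need.

```python
class Participant:
    """One database endpoint taking part in a distributed transaction."""
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulates a local rule check or lock
        self.staged = None
        self.committed = []

    def prepare(self, txn):
        # Phase 1: stage the change and vote yes or no.
        if self.can_commit:
            self.staged = txn
        return self.can_commit

    def commit(self):
        # Phase 2 (all voted yes): make the staged change permanent.
        self.committed.append(self.staged)
        self.staged = None

    def abort(self):
        # Phase 2 (someone voted no): discard the staged change.
        self.staged = None


def two_phase_commit(txn, participants):
    """Apply txn everywhere or nowhere; return True if it committed."""
    votes = [p.prepare(txn) for p in participants]
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


london, tokyo = Participant("london"), Participant("tokyo")
print(two_phase_commit("debit-42", [london, tokyo]))
```

The all-or-nothing outcome is what gives atomicity across sites; the waiting for every vote is the availability cost the text refers to.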

In recent years, with the aid of lower-cost, homogeneous and virtualized architectures, it has become much easier (and cheaper) to replicate IT environments in several active data centers in different locations. This has led to the development of architectures that temporarily (usually momentarily) sacrifice integrity for availability if there is a contention issue. Processes are put in place to resolve any conflicts, in some cases reversing one of two transactions that may have happened independently of each other.

These database design architectures are known as BASE, to denote the characteristics of Basically Available, Soft State and Eventual Consistency. These architectures, supported by modern open source NoSQL databases such as MongoDB and Apache’s CouchDB, incorporate mechanisms for allowing and then resolving conflicting transactions.
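
To make "allowing and then resolving conflicting transactions" concrete, the sketch below reconciles two sites that accepted writes independently. It uses a last-writer-wins rule, which is only one simple policy and is an assumption here; it is not a description of how MongoDB or CouchDB resolve conflicts. The names `Write` and `reconcile` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    key: str
    value: str
    timestamp: float  # assumes reasonably synchronized clocks
    site: str         # used only to break exact timestamp ties


def reconcile(site_a, site_b):
    """Merge two per-site maps of key -> Write.

    Returns (merged, reversed_writes): the converged state, plus the
    losing writes that must be reversed or reported to their users.
    """
    merged, reversed_writes = {}, []
    for key in site_a.keys() | site_b.keys():
        wa, wb = site_a.get(key), site_b.get(key)
        if wa is None or wb is None or wa == wb:
            merged[key] = wa or wb  # no conflict for this key
            continue
        # Conflict: the later write wins, the earlier one is reversed.
        winner, loser = sorted(
            [wa, wb], key=lambda w: (w.timestamp, w.site), reverse=True
        )
        merged[key] = winner
        reversed_writes.append(loser)
    return merged, reversed_writes


east = {"seat-12A": Write("seat-12A", "alice", 100.0, "east")}
west = {"seat-12A": Write("seat-12A", "bob", 100.5, "west")}
state, undone = reconcile(east, west)
print(state["seat-12A"].value, [w.value for w in undone])
```

Both sites stayed available and both users were told their booking succeeded; consistency arrives only after reconciliation, when one booking is reversed.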

The use of BASE architectures is now very common, especially in cloud environments, and effectively tolerates failures. But there are classes of application for which it is unsuitable, such as trading systems or control situations where eventual resolution or reversible transactions are not acceptable. Even so, given that the conflicts may often be rare and easily resolved, this architecture is now being widely adopted, reducing costs and enabling more use of distributed architectures to improve resiliency.

BASE architectures rely very heavily on fast, reliable networks. The longer the latency, the more likely it is that conflicts between reads and writes from different users will occur. While these will mostly be resolved easily, too many conflicts could cause problems with clients or control systems in real-time networks. Some Internet of Things (IoT) applications will not sit comfortably on cloud platforms that use BASE architectures.
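
The link between latency and conflicts can be given a rough number. The sketch below assumes, purely for illustration, that competing writes to the same record arrive at random (a Poisson process) and that a conflict occurs whenever another write lands before the first has finished replicating. Under that assumption, the chance of a conflict is 1 - exp(-rate × delay). The function name and the sample rate are hypothetical.

```python
import math


def conflict_probability(writes_per_second, replication_delay_s):
    """Chance that a competing write arrives within the replication delay.

    Assumes Poisson arrivals of competing writes to the same record.
    """
    return 1.0 - math.exp(-writes_per_second * replication_delay_s)


# Same record, one competing write every two seconds on average.
for delay_ms in (1, 10, 100):
    p = conflict_probability(0.5, delay_ms / 1000)
    print(f"{delay_ms:>4} ms replication delay -> {p:.3%} conflict chance")
```

For small values the probability grows almost in proportion to the delay, which is why a tenfold increase in latency means roughly ten times as many conflicts to resolve.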

Types of Distributed Architecture

As we have seen, differing business requirements, including legacy investments, will influence the degree to which newer, distributed systems and databases can be used; similarly, the business requirements and the design of the existing systems will, to some extent, point toward certain resiliency architectures. We see the models in Figure 3 being used for resiliency, with the cloud-based models being markedly different from the earlier ones.

Figure 3: Types of Distributed Architecture

Single-Site Availability

This is the traditional setup, with high levels of redundancy at the infrastructure level, including facilities and basic IT. With sufficient redundancy and planned design, operations can continue in spite of planned (concurrent maintainability), and in some cases unplanned, facilities failure. At the IT level, resilience is further assured by internal replication (e.g., clusters), so that loads may be replicated elsewhere and data/applications/configurations backed up to an offsite DR.

Linked Site Resiliency

This describes two or more lower-tier data centers within a campus, region or zone using a dedicated network to achieve a higher level of availability than is possible at any individual site, typically within synchronous replication distance. (This means that the two data centers are near enough to each other and to customers that they are always synchronized. This distance will depend on the applications, but is usually less than 50 miles.) In order to achieve the same or higher level of facility availability as a high-availability single-site data center, linked sites may double up and share some less-critical infrastructure with nearby in-zone data centers. This assumes resilient and sufficient network capacity with predictable and independent pathways.
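
A rough calculation shows why synchronous replication is tied to distance. The sketch below assumes light travels through optical fiber at about 200,000 km/s (roughly two-thirds of its speed in a vacuum) and ignores switching and storage delays; `route_factor` is a hypothetical parameter for fiber paths that are longer than the straight-line distance.

```python
KM_PER_MILE = 1.609344
FIBER_KM_PER_S = 200_000  # assumption: about 2/3 of the speed of light


def round_trip_ms(distance_miles, route_factor=1.0):
    """Minimum network round trip between two sites, in milliseconds."""
    km = distance_miles * KM_PER_MILE * route_factor
    return 2 * km / FIBER_KM_PER_S * 1000


for miles in (10, 50, 500):
    print(f"{miles:>4} miles -> {round_trip_ms(miles):.2f} ms per round trip")
```

At 50 miles the floor is roughly 0.8 ms per round trip, and every synchronous write must wait at least that long for the second site to confirm. At ten times the distance the wait is ten times longer, which many transactional applications cannot absorb.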

In this configuration, concurrent maintainability (downtime at one site does not disrupt service) is possible as long as there is sufficient capacity, and processes are in place, to support full operations at either site. At the IT level, this setup can be used to support either synchronous (fault-tolerant automated failover to the second site) or asynchronous (a second copy of applications, data and files is kept at the second site to pick up the load) replication.
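
The practical difference with asynchronous replication is what happens to records that were acknowledged but had not yet reached the second site when the first one failed. The following sketch models that gap; `AsyncPair` and its methods are hypothetical names, and shipping is triggered by hand rather than by a background process.

```python
from collections import deque


class AsyncPair:
    """Primary site that ships records to a second site after acknowledging."""
    def __init__(self):
        self.primary_log = []
        self.secondary_log = []
        self.in_flight = deque()  # acknowledged but not yet shipped

    def write(self, record):
        # Acknowledge as soon as the primary holds the record.
        self.primary_log.append(record)
        self.in_flight.append(record)
        return "ack"

    def ship(self, max_records):
        """Send up to max_records queued records to the second site."""
        for _ in range(min(max_records, len(self.in_flight))):
            self.secondary_log.append(self.in_flight.popleft())

    def fail_over(self):
        """Primary is lost: return the acknowledged records that never arrived."""
        lost = list(self.in_flight)
        self.in_flight.clear()
        return lost


pair = AsyncPair()
for n in range(1, 6):
    pair.write(f"order-{n}")
pair.ship(3)             # replication is running behind
print(pair.fail_over())  # records the second site never saw
```

Writes return faster because they do not wait on the network, but the second site picks up the load from a slightly older copy.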

Distributed Site Resiliency

This term describes two or more independent sites, in or out of region or globally distributed (cloud or not), using shared internet/VPN networks to provide resiliency through multiple asynchronously connected instances. This can produce very high availability but can result in some (usually minor) loss of integrity between instances if outages occur.
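
The availability gain from multiple sites can be estimated with simple arithmetic. The sketch assumes that sites fail independently, that any single surviving site can carry the full load, and that failover is instantaneous; none of these is guaranteed in practice, so the result is an upper bound. The function names are hypothetical.

```python
def combined_availability(site_availability, sites):
    """Probability that at least one of several independent sites is up."""
    return 1 - (1 - site_availability) ** sites


def downtime_minutes_per_year(availability):
    return (1 - availability) * 365 * 24 * 60


for sites in (1, 2, 3):
    a = combined_availability(0.99, sites)
    print(f"{sites} site(s): {a:.6f} available, "
          f"{downtime_minutes_per_year(a):.1f} minutes down per year")
```

Two sites that are each 99% available give 99.99% under these assumptions. Shared dependencies, such as a common network provider or a common software fault, break the independence assumption and reduce the real figure.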

At the IT level, distributed site resiliency is the architecture that underpins most DR services, and especially the modern cloud iteration, DR as a service (DRaaS). Improved network capacity, software tools, database synchronization protocols and, critically, homogeneous IT infrastructure running virtualized workloads have now made this option far more practical, flexible and economically feasible both for active/active operations and for backup and recovery. As more distributed management technologies are added, distributed site resiliency can support or blur into cloud-based resiliency.

Cloud-Based Resiliency

This term describes resiliency provided by distributing virtualized applications, instances and/or containers with associated data across multiple data centers, using middleware, orchestration and distributed databases, under the control of a comprehensive and distributed control system. These systems will enable service or design choices to be made between, for example, absolute database integrity or immediate availability. Effectively, cloud-based resiliency moves the resiliency up to the IT level. Any facility resilience achieved through redundancy provides added security, but may not prove essential. It does, however, assume that there is sufficient capacity in place, including the network, which is critical if loads are shifted from place to place. Developers do not need to concern themselves with location or infrastructure; this architecture is primarily suited for stateless or cloud-native applications.
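
The dependence on spare capacity can be shown with a toy scheduler that moves stateless instances off a failed site. This is a sketch only: real orchestration systems weigh many more factors, and `reschedule`, the site names and the slot-based capacity model are all hypothetical.

```python
def reschedule(placement, capacity, failed_site):
    """Move instances off a failed site onto the surviving sites.

    placement: site name -> list of instance names
    capacity:  site name -> maximum number of instances
    Raises RuntimeError if the survivors cannot absorb the load.
    """
    survivors = {
        site: list(instances)
        for site, instances in placement.items()
        if site != failed_site
    }
    if not survivors:
        raise RuntimeError("no surviving site")

    def free_slots(site):
        return capacity[site] - len(survivors[site])

    for instance in placement.get(failed_site, []):
        target = max(survivors, key=free_slots)  # emptiest survivor
        if free_slots(target) <= 0:
            raise RuntimeError("insufficient spare capacity for failed site")
        survivors[target].append(instance)
    return survivors


placement = {"east": ["web-1", "web-2"], "west": ["web-3"], "north": ["web-4"]}
capacity = {"east": 3, "west": 3, "north": 3}
print(reschedule(placement, capacity, "east"))
```

The instances carry no local state, so they can restart anywhere. The move succeeds only because each surviving site was running below its limit, which is the capacity assumption the text describes.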

Clearly, each type of resiliency architecture described above fulfills different purposes and has a different profile in terms of objectives, cost, level of availability and technical maturity. Cloud-based resiliency is the newest, and currently the most expensive; it may provide good total cost of ownership, but effectively can only be achieved at scale and with considerable capital. The types are not mutually exclusive, at least at the facilities level.

For CIOs setting out to develop appropriate resiliency strategies, this is a challenging period because engineering control is being eroded, to be replaced with a more nuanced and strategic approach where good assessments are needed.

With cloud services and architectures now part of the mix, or even the totality, the CIO must determine which type (or types) of resiliency is most appropriate for each type of application and data, based on business needs and technical risk, and then architect the best combination of IT infrastructure. This will span data center resiliency, applications, databases and networking, and must take into account organizational structure, processes, tools and automation. From all this, the organization must then deliver comprehensive and consistent applications that meet and exceed customer expectations for service availability and resiliency.

(End of article)

Translation: