
Title: [Original] The Datacenter as a Computer - Translation

Author: trybestying    Time: 2012-6-20 18:07
Title: [Original] The Datacenter as a Computer - Translation

Preface: Recently, for work, I needed to study materials on cloud computing and datacenter construction, and found that most of them are in English with no translation available. So I decided to translate this book into Chinese as I read it, for colleagues to consult in the future. I will translate and post the chapters one after another. I am not a professional translator and my English is limited, so if anything is translated poorly, please point it out. Thank you.

The Datacenter as a Computer
An Introduction to the Design of
Warehouse-Scale Machines

Contents
[Table of contents attached as images in the original post]



Chapter 1
Introduction
The ARPANET is about to turn forty, and the World Wide Web is approaching its 20th anniversary. Yet the Internet technologies that were largely sparked by these two remarkable milestones continue to transform industries and our culture today and show no signs of slowing down. More recently, the emergence of such popular Internet services as Web-based email, search, and social networks, plus the increased worldwide availability of high-speed connectivity, have accelerated a trend toward server-side or "cloud" computing.
Increasingly, computing and storage are moving from PC-like clients to large Internet services. While early Internet services were mostly informational, today many Web applications offer services that previously resided in the client, including email, photo and video storage, and office applications. The shift toward server-side computing is driven primarily not only by the need for user experience improvements, such as ease of management (no configuration or backups needed) and ubiquity of access (a browser is all you need), but also by the advantages it offers to vendors. Software as a service allows faster application development because it is simpler for software vendors to make changes and improvements. Instead of updating many millions of clients (with a myriad of peculiar hardware and software configurations), vendors need only coordinate improvements and fixes inside their datacenters and can restrict their hardware deployment to a few well-tested configurations. Moreover, datacenter economics allow many application services to run at a low cost per user. For example, servers may be shared among thousands of active users (and many more inactive ones), resulting in better utilization. Similarly, the computation itself may become cheaper in a shared service (e.g., an email attachment received by multiple users can be stored once rather than many times). Finally, servers and storage in a datacenter can be easier to manage than the desktop or laptop equivalent because they are under the control of a single, knowledgeable entity.
Some workloads require so much computing capability that they are a more natural fit for a massive computing infrastructure than for client-side computing. Search services (Web, images, etc.) are a prime example of this class of workloads, but applications such as language translation can also run more effectively on large shared computing installations because of their reliance on massive-scale language models.
The trend toward server-side computing and the exploding popularity of Internet services has created a new class of computing systems that we have named warehouse-scale computers, or WSCs. The name is meant to call attention to the most distinguishing feature of these machines: the massive scale of their software infrastructure, data repositories, and hardware platform. This perspective is a departure from a view of the computing problem that implicitly assumes a model where one program runs in a single machine. In warehouse-scale computing, the program is an Internet service, which may consist of tens or more individual programs that interact to implement complex end-user services such as email, search, or maps. These programs might be implemented and maintained by different teams of engineers, perhaps even across organizational, geographic, and company boundaries (e.g., as is the case with mashups).
The computing platform required to run such large-scale services bears little resemblance to a pizza-box server or even the refrigerator-sized high-end multiprocessors that reigned in the last decade. The hardware for such a platform consists of thousands of individual computing nodes with their corresponding networking and storage subsystems, power distribution and conditioning equipment, and extensive cooling systems. The enclosure for these systems is in fact a building structure and often indistinguishable from a large warehouse.




Author: trybestying    Time: 2012-6-21 13:00

1.1 WAREHOUSE-SCALE COMPUTERS
Had scale been the only distinguishing feature of these systems, we might simply refer to them as datacenters. Datacenters are buildings where multiple servers and communication gear are co-located because of their common environmental requirements and physical security needs, and for ease of maintenance. In that sense, a WSC could be considered a type of datacenter. Traditional datacenters, however, typically host a large number of relatively small- or medium-sized applications, each running on a dedicated hardware infrastructure that is de-coupled and protected from other systems in the same facility. Those datacenters host hardware and software for multiple organizational units or even different companies. Different computing systems within such a datacenter often have little in common in terms of hardware, software, or maintenance infrastructure, and tend not to communicate with each other at all.
WSCs currently power the services offered by companies such as Google, Amazon, Yahoo, and Microsoft's online services division. They differ significantly from traditional datacenters: they belong to a single organization, use a relatively homogeneous hardware and system software platform, and share a common systems management layer. Often much of the application, middleware, and system software is built in-house, compared to the predominance of third-party software running in conventional datacenters. Most importantly, WSCs run a smaller number of very large applications (or Internet services), and the common resource management infrastructure allows significant deployment flexibility. The requirements of homogeneity, single-organization control, and enhanced focus on cost efficiency motivate designers to take new approaches in constructing and operating these systems.
Internet services must achieve high availability, typically aiming for at least 99.99% uptime (about an hour of downtime per year). Achieving fault-free operation on a large collection of hardware and system software is hard and is made more difficult by the large number of servers involved. Although it might be theoretically possible to prevent hardware failures in a collection of 10,000 servers, it would surely be extremely expensive. Consequently, WSC workloads must be designed to gracefully tolerate large numbers of component faults with little or no impact on service-level performance and availability.
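As a quick sanity check, a back-of-the-envelope calculation (a minimal Python sketch, using only the 99.99% figure quoted above) confirms that four nines of uptime correspond to roughly an hour of downtime per year:

[code]
# Downtime implied by an availability target (simple arithmetic,
# using only the 99.99% figure quoted above).
HOURS_PER_YEAR = 365.25 * 24  # ~8766 hours

def downtime_hours_per_year(availability: float) -> float:
    """Hours per year a service may be down at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR

print(downtime_hours_per_year(0.9999))  # ~0.88 hours, about an hour a year
[/code]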

1.2 EMPHASIS ON COST EFFICIENCY
Building and operating a large computing platform is expensive, and the quality of a service may depend on the aggregate processing and storage capacity available, further driving costs up and requiring a focus on cost efficiency. For example, in information retrieval systems such as Web search, the growth of computing needs is driven by three main factors:
• Increased service popularity that translates into higher request loads.
• The size of the problem keeps growing—the Web is growing by millions of pages per day, which increases the cost of building and serving a Web index.
• Even if the throughput and data repository could be held constant, the competitive nature of this market continuously drives innovations to improve the quality of results retrieved and the frequency with which the index is updated. Although some quality improvements can be achieved by smarter algorithms alone, most substantial improvements demand additional computing resources for every request. For example, in a search system that also considers synonyms of the search terms in a query, retrieving results is substantially more expensive—either the search needs to retrieve documents that match a more complex query that includes the synonyms, or the synonyms of a term need to be replicated in the index data structure for each term.
The relentless demand for more computing capabilities makes cost efficiency a primary metric of interest in the design of WSCs. Cost efficiency must be defined broadly to account for all the significant components of cost, including hosting-facility capital and operational expenses (which include power provisioning and energy costs), hardware, software, management personnel, and repairs.
1.3 NOT JUST A COLLECTION OF SERVERS
Our central point is that the datacenters powering many of today's successful Internet services are no longer simply a miscellaneous collection of machines co-located in a facility and wired up together. The software running on these systems, such as Gmail or Web search services, executes at a scale far beyond a single machine or a single rack: it runs on no smaller a unit than clusters of hundreds to thousands of individual servers. Therefore, the machine, the computer, is this large cluster or aggregation of servers itself and needs to be considered as a single computing unit.
The technical challenges of designing WSCs are no less worthy of the expertise of computer systems architects than any other class of machines. First, they are a new class of large-scale machines driven by a new and rapidly evolving set of workloads. Their size alone makes them difficult to experiment with or simulate efficiently; therefore, system designers must develop new techniques to guide design decisions. Fault behavior and power and energy considerations have a more significant impact in the design of WSCs, perhaps more so than in other smaller-scale computing platforms. Finally, WSCs have an additional layer of complexity beyond systems consisting of individual servers or small groups of servers; WSCs introduce a significant new challenge to programmer productivity, a challenge perhaps greater than programming multicore systems. This additional complexity arises indirectly from the larger scale of the application domain and manifests itself as a deeper and less homogeneous storage hierarchy (discussed later in this chapter), higher fault rates (Chapter 7), and possibly higher performance variability (Chapter 2).
The objectives of this book are to introduce readers to this new design space, describe some of the requirements and characteristics of WSCs, highlight some of the important challenges unique to this space, and share some of our experience designing, programming, and operating them within Google. We have been in the fortunate position of being both designers of WSCs, as well as customers and programmers of the platform, which has provided us an unusual opportunity to evaluate design decisions throughout the lifetime of a product. We hope that we will succeed in relaying our enthusiasm for this area as an exciting new target worthy of the attention of the general research and technical communities.

Author: trybestying    Time: 2012-6-25 12:05
1.4 ONE DATACENTER VS. SEVERAL DATACENTERS
In this book, we define the computer to be architected as a datacenter despite the fact that Internet services may involve multiple datacenters located far apart. Multiple datacenters are sometimes used as complete replicas of the same service, with replication being used mostly for reducing user latency and improving serving throughput (a typical example is a Web search service). In those cases, a given user query tends to be fully processed within one datacenter, and our machine definition seems appropriate.
However, in cases where a user query may involve computation across multiple datacenters, our single-datacenter focus is a less obvious fit. Typical examples are services that deal with nonvolatile user data updates and, therefore, require multiple copies for disaster tolerance reasons. For such computations, a set of datacenters might be the more appropriate system. But we have chosen to think of the multi-datacenter scenario as more analogous to a network of computers. This is in part to limit the scope of this lecture, but is mainly because the huge gap in connectivity quality between intra- and inter-datacenter communications causes programmers to view such systems as separate computational resources. As the software development environment for this class of applications evolves, or if the connectivity gap narrows significantly in the future, we may need to adjust our choice of machine boundaries.
1.5 WHY WSCs MIGHT MATTER TO YOU
As described so far, WSCs might be considered a niche area because their sheer size and cost render them unaffordable by all but a few large Internet companies. Unsurprisingly, we do not believe this to be true. We believe the problems that today's large Internet services face will soon be meaningful to a much larger constituency because many organizations will soon be able to afford similarly sized computers at a much lower cost. Even today, the attractive economics of low-end server-class computing platforms puts clusters of hundreds of nodes within the reach of a relatively broad range of corporations and research institutions. When combined with the trends toward large numbers of processor cores on a single die, a single rack of servers may soon have as many or more hardware threads than many of today's datacenters. For example, a rack with 40 servers, each with four 8-core dual-threaded CPUs, would contain more than two thousand hardware threads. Such systems will arguably be affordable to a very large number of organizations within just a few years, while exhibiting some of the scale, architectural organization, and fault behavior of today's WSCs. Therefore, we believe that our experience building these unique systems will be useful in understanding the design issues and programming challenges for those potentially ubiquitous next-generation machines.

Author: eMe    Time: 2012-6-25 15:03
Watching with interest.

Author: trybestying    Time: 2012-6-25 15:54

1.6 ARCHITECTURAL OVERVIEW OF WSCs
The hardware implementation of a WSC will differ significantly from one installation to the next. Even within a single organization such as Google, systems deployed in different years use different basic elements, reflecting the hardware improvements provided by the industry. However, the architectural organization of these systems has been relatively stable over the last few years. Therefore, it is useful to describe this general architecture at a high level as it sets the background for subsequent discussions.

FIGURE 1.1: Typical elements in warehouse-scale systems: 1U server (left), 7′ rack with Ethernet switch (middle), and diagram of a small cluster with a cluster-level Ethernet switch/router (right).
Figure 1.1 depicts some of the more popular building blocks for WSCs. A set of low-end servers, typically in a 1U or blade enclosure format, are mounted within a rack and interconnected using a local Ethernet switch. These rack-level switches, which can use 1- or 10-Gbps links, have a number of uplink connections to one or more cluster-level (or datacenter-level) Ethernet switches. This second-level switching domain can potentially span more than ten thousand individual servers.
1.6.1 Storage
Disk drives are connected directly to each individual server and managed by a global distributed file system (such as Google's GFS [31]), or they can be part of Network Attached Storage (NAS) devices that are directly connected to the cluster-level switching fabric. A NAS tends to be a simpler solution to deploy initially because it pushes the responsibility for data management and integrity to a NAS appliance vendor. In contrast, using the collection of disks directly attached to server nodes requires a fault-tolerant file system at the cluster level. This is difficult to implement but can lower hardware costs (the disks leverage the existing server enclosure) and networking fabric utilization (each server network port is effectively dynamically shared between the computing tasks and the file system). The replication model between these two approaches is also fundamentally different. A NAS provides extra reliability through replication or error-correction capabilities within each appliance, whereas systems like GFS implement replication across different machines and consequently will use more networking bandwidth to complete write operations. However, GFS-like systems are able to keep data available even after the loss of an entire server enclosure or rack and may allow higher aggregate read bandwidth because the same data can be sourced from multiple replicas. Trading off higher write overheads for lower cost, higher availability, and increased read bandwidth was the right solution for many of Google's workloads. An additional advantage of having disks co-located with compute servers is that it enables distributed system software to exploit data locality. For the remainder of this book, we will therefore implicitly assume a model with distributed disks directly connected to all servers.
Some WSCs, including Google's, deploy desktop-class disk drives instead of enterprise-grade disks because of the substantial cost differential between the two. Because the data are nearly always replicated in some distributed fashion (as in GFS), this mitigates the possibly higher fault rates of desktop disks. Moreover, because field reliability of disk drives tends to deviate significantly from the manufacturer's specifications, the reliability edge of enterprise drives is not clearly established. For example, Elerath and Shah [24] point out that several factors can affect disk reliability more substantially than manufacturing process and design.
1.6.2 Networking Fabric
Choosing a networking fabric for WSCs involves a trade-off between speed, scale, and cost. As of this writing, 1-Gbps Ethernet switches with up to 48 ports are essentially a commodity component, costing less than $30/Gbps per server to connect a single rack. As a result, bandwidth within a rack of servers tends to have a homogeneous profile. However, network switches with high port counts, which are needed to tie together WSC clusters, have a much different price structure and are more than ten times more expensive (per 1-Gbps port) than commodity switches. In other words, a switch that has 10 times the bi-section bandwidth costs about 100 times as much. As a result of this cost discontinuity, the networking fabric of WSCs is often organized as the two-level hierarchy depicted in Figure 1.1. Commodity switches in each rack provide a fraction of their bi-section bandwidth for inter-rack communication through a handful of uplinks to the more costly cluster-level switches. For example, a rack with 40 servers, each with a 1-Gbps port, might have between four and eight 1-Gbps uplinks to the cluster-level switch, corresponding to an oversubscription factor between 5 and 10 for communication across racks. In such a network, programmers must be aware of the relatively scarce cluster-level bandwidth resources and try to exploit rack-level networking locality, complicating software development and possibly impacting resource utilization.
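To see where the oversubscription factor comes from, here is a small illustrative calculation (a sketch assuming the 40-server rack described above; the function name is ours, not from any tool):

[code]
# Rack-level oversubscription: aggregate server bandwidth vs. uplink bandwidth.
# Assumes the example above: 40 servers with 1-Gbps ports and 4-8 uplinks.
def oversubscription(servers: int, server_gbps: float,
                     uplinks: int, uplink_gbps: float) -> float:
    """Ratio of aggregate server bandwidth to rack uplink bandwidth."""
    return (servers * server_gbps) / (uplinks * uplink_gbps)

print(oversubscription(40, 1.0, 8, 1.0))  # 5.0  with eight uplinks
print(oversubscription(40, 1.0, 4, 1.0))  # 10.0 with four uplinks
[/code]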
Alternatively, one can remove some of the cluster-level networking bottlenecks by spending more money on the interconnect fabric. For example, Infiniband interconnects typically scale to a few thousand ports but can cost $500-$2,000 per port. Similarly, some networking vendors are starting to provide larger-scale Ethernet fabrics, but again at a cost of at least hundreds of dollars per server. Alternatively, lower-cost fabrics can be formed from commodity Ethernet switches by building "fat tree" Clos networks [1]. How much to spend on networking vs. spending the equivalent amount on buying more servers or storage is an application-specific question that has no single correct answer. However, for now, we will assume that intra-rack connectivity is often cheaper than inter-rack connectivity.
1.6.3 Storage Hierarchy
Figure 1.2 shows a programmer's view of the storage hierarchy of a typical WSC. A server consists of a number of processor sockets, each with a multicore CPU and its internal cache hierarchy, local shared and coherent DRAM, and a number of directly attached disk drives. The DRAM and disk resources within the rack are accessible through the first-level rack switches (assuming some sort of remote procedure call API to them), and all resources in all racks are accessible via the cluster-level switch.
FIGURE 1.2: Storage hierarchy of a WSC.

FIGURE 1.3: Latency, bandwidth, and capacity of a WSC.
1.6.4 Quantifying Latency, Bandwidth, and Capacity
Figure 1.3 attempts to quantify the latency, bandwidth, and capacity characteristics of a WSC. For illustration we assume a system with 2,000 servers, each with 8 GB of DRAM and four 1-TB disk drives. Each group of 40 servers is connected through a 1-Gbps link to a rack-level switch that has an additional eight 1-Gbps ports used for connecting the rack to the cluster-level switch (an oversubscription factor of 5). Network latency numbers assume a socket-based TCP-IP transport, and networking bandwidth values assume that each server behind an oversubscribed set of uplinks is using its fair share of the available cluster-level bandwidth. We assume the rack- and cluster-level switches themselves are not internally oversubscribed. For disks, we show typical commodity disk drive (SATA) latencies and transfer rates. The graph shows the relative latency, bandwidth, and capacity of each resource pool. For example, the bandwidth available from local disks is 200 MB/s, whereas the bandwidth from off-rack disks is just 25 MB/s via the shared rack uplinks. On the other hand, total disk storage in the cluster is almost ten million times larger than local DRAM. A large application that requires many more servers than can fit on a single rack must deal effectively with these large discrepancies in latency, bandwidth, and capacity. These discrepancies are much larger than those seen on a single machine, making it more difficult to program a WSC.
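The 25-MB/s off-rack disk figure can be rederived from the stated assumptions; the sketch below simply divides the shared uplink bandwidth fairly among the servers in a rack:

[code]
# Fair-share bandwidth per server through oversubscribed rack uplinks,
# using the example system above: 40 servers/rack, eight 1-Gbps uplinks.
GBPS_TO_MB_PER_S = 1000 / 8      # 1 Gbps = 125 MB/s

uplink_gbps = 8 * 1.0            # eight 1-Gbps uplinks per rack
servers_per_rack = 40

share_mbs = uplink_gbps / servers_per_rack * GBPS_TO_MB_PER_S
print(share_mbs)                 # 25.0 MB/s, vs. ~200 MB/s from a local disk
[/code]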
A key challenge for architects of WSCs is to smooth out these discrepancies in a cost-efficient manner. Conversely, a key challenge for software architects is to build cluster infrastructure and services that hide most of this complexity from application developers.
1.6.5 Power Usage
Energy and power usage are also important concerns in the design of WSCs because, as discussed in more detail in Chapter 5, energy-related costs have become an important component of the total cost of ownership of this class of systems. Figure 1.4 provides some insight into how energy is used in modern IT equipment by breaking down the peak power usage of one generation of WSCs deployed at Google in 2007, categorized by main component group.
Although this breakdown can vary significantly depending on how systems are configured for a given workload domain, the graph indicates that CPUs can no longer be the sole focus of energy efficiency improvements because no one subsystem dominates the overall energy usage profile. Chapter 5 also discusses how overheads in power delivery and cooling can significantly increase the actual energy usage in WSCs.
FIGURE 1.4: Approximate distribution of peak power usage by hardware subsystem in one of Google's datacenters (circa 2007).
1.6.6 Handling Failures
The sheer scale of WSCs requires that Internet services software tolerate relatively high component fault rates. Disk drives, for example, can exhibit annualized failure rates higher than 4% [65,76]. Different deployments have reported between 1.2 and 16 average server-level restarts per year. With such high component failure rates, an application running across thousands of machines may need to react to failure conditions on an hourly basis. We expand on this topic further in Chapter 2, which describes the application domain, and Chapter 7, which deals with fault statistics.
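To see why "hourly" is the right order of magnitude, a rough calculation with the restart figures quoted above is enough (a sketch, not measured data):

[code]
# Expected cluster-wide server restarts, given per-server annual restart rates.
# Uses the figures above: 1.2-16 restarts/server/year in a 2,000-server WSC.
HOURS_PER_YEAR = 8766

for restarts_per_server_year in (1.2, 16):
    events_per_hour = 2000 * restarts_per_server_year / HOURS_PER_YEAR
    print(f"{restarts_per_server_year:>4} restarts/server/yr -> "
          f"{events_per_hour:.2f} server restarts/hour cluster-wide")
# ~0.27 to ~3.65 restarts every hour across the cluster
[/code]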


Author: trybestying    Time: 2012-6-25 16:09
eMe posted on 2012-6-25 15:03:
Watching with interest.

Thank you! The original text is rather hard to parse, so the translation takes real effort. If anything is off, please do point it out.
Author: 班玛康乐    Time: 2012-6-25 22:09
Thanks for all your hard work, OP!
Author: trybestying    Time: 2012-6-26 08:40
班玛康乐 posted on 2012-6-25 22:09:
Thanks for all your hard work, OP!

Haha, thanks for the support! I hope the translation turns out to be useful to everyone!
Author: trybestying    Time: 2012-6-26 12:41

Chapter 2
Workloads and Software Infrastructure
The applications that run on warehouse-scale computers (WSCs) dominate many system design trade-off decisions. This chapter outlines some of the distinguishing characteristics of software that runs in large Internet services and the system software and tools needed for a complete computing platform. Here is some terminology that defines the different software layers in a typical WSC deployment:
• Platform-level software: the common firmware, kernel, operating system distribution, and libraries expected to be present in all individual servers to abstract the hardware of a single machine and provide basic server-level services.
• Cluster-level infrastructure: the collection of distributed systems software that manages resources and provides services at the cluster level; ultimately, we consider these services as an operating system for a datacenter. Examples are distributed file systems, schedulers, remote procedure call (RPC) layers, as well as programming models that simplify the usage of resources at the scale of datacenters, such as MapReduce [19], Dryad [47], Hadoop [42], Sawzall [64], BigTable [13], Dynamo [20], and Chubby [7].
• Application-level software: software that implements a specific service. It is often useful to further divide application-level software into online services and offline computations because those tend to have different requirements. Examples of online services are Google search, Gmail, and Google Maps. Offline computations are typically used in large-scale data analysis or as part of the pipeline that generates the data used in online services; for example, building an index of the Web or processing satellite images to create map tiles for the online service.
2.1 DATACENTER VS. DESKTOP
Software development in Internet services differs from the traditional desktop/server model in many ways:
• Ample parallelism: Typical Internet services exhibit a large amount of parallelism stemming from both data- and request-level parallelism. Usually, the problem is not to find parallelism but to manage and efficiently harness the explicit parallelism that is inherent in the application. Data parallelism arises from the large data sets of relatively independent records that need processing, such as collections of billions of Web pages or billions of log lines. These very large data sets often require significant computation for each parallel (sub)task, which in turn helps hide or tolerate communication and synchronization overheads. Similarly, request-level parallelism stems from the hundreds or thousands of requests per second that popular Internet services receive. These requests rarely involve read-write sharing of data or synchronization across requests. For example, search requests are essentially independent and deal with a mostly read-only database; therefore, the computation can be easily partitioned both within a request and across different requests. Similarly, whereas Web email transactions do modify user data, requests from different users are essentially independent from each other, creating natural units of data partitioning and concurrency.
• Workload churn: Users of Internet services are isolated from the service's implementation details by relatively well-defined and stable high-level APIs (e.g., simple URLs), making it much easier to deploy new software quickly. Key pieces of Google's services have release cycles on the order of a couple of weeks, compared to months or years for desktop software products. Google's front-end Web server binaries, for example, are released on a weekly cycle, with nearly a thousand independent code changes checked in by hundreds of developers; the core of Google's search services has been reimplemented nearly from scratch every 2 to 3 years. This environment creates significant incentives for rapid product innovation but makes it hard for a system designer to extract useful benchmarks even from established applications. Moreover, because Internet services are still a relatively new field, new products and services frequently emerge, and their success with users directly affects the resulting workload mix in the datacenter. For example, video services such as YouTube have flourished in relatively short periods and may present a very different set of requirements from the existing large consumers of computing cycles in the datacenter, potentially affecting the optimal design point of WSCs in unexpected ways. A beneficial side effect of this aggressive software deployment environment is that hardware architects are not necessarily burdened with having to provide good performance for immutable pieces of code. Instead, architects can consider the possibility of significant software rewrites to take advantage of new hardware capabilities or devices.
• Platform homogeneity: The datacenter is generally a more homogeneous environment than the desktop as a target platform for software development. Large Internet services operations typically deploy a small number of hardware and system software configurations at any given time. Significant heterogeneity arises primarily from the incentives to deploy more cost-efficient components that become available over time. Homogeneity within a platform generation simplifies cluster-level scheduling and load balancing and reduces the maintenance burden for platform software (kernels, drivers, etc.). Similarly, homogeneity can allow more efficient supply chains and more efficient repair processes because automatic and manual repairs benefit from having more experience with fewer types of systems. In contrast, software for desktop systems can make few assumptions about the hardware or software platform it is deployed on, and its complexity and performance characteristics may suffer from the need to support thousands or even millions of hardware and system software configurations.
• Fault-free operation: Because Internet service applications run on clusters of thousands of machines, each of them not dramatically more reliable than PC-class hardware, the multiplicative effect of individual failure rates means that some type of fault is expected every few hours or less (more details are provided in Chapter 6; see also the probability sketch after this list). As a result, although it may be reasonable for desktop-class software to assume a fault-free hardware operation for months or years, this is not true for datacenter-level services; Internet services need to work in an environment where faults are part of daily life. Ideally, the cluster-level system software should provide a layer that hides most of that complexity from application-level software, although that goal may be difficult to accomplish for all types of applications.
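The probability sketch below makes the multiplicative effect concrete; note that the per-machine fault rate used here is an illustrative assumption, not a figure from the text:

[code]
# Probability that at least one machine in a cluster faults in a given hour.
# The per-machine hourly fault rate is an illustrative assumption.
def p_any_fault(n_machines: int, p_machine_per_hour: float) -> float:
    """P(at least one fault per hour) = 1 - P(no machine faults)."""
    return 1.0 - (1.0 - p_machine_per_hour) ** n_machines

p = 1 / 2900.0                # assume one fault every ~4 months per machine
print(p_any_fault(1, p))      # ~0.0003: a single machine looks very reliable
print(p_any_fault(2000, p))   # ~0.50: a 2,000-node cluster faults about hourly
[/code]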
Although the plentiful thread-level parallelism and a more homogeneous computing platform help reduce software development complexity in Internet services compared to desktop systems, the scale, the need to operate under hardware failures, and the speed of workload churn have the opposite effect.

Author: trybestying    Time: 2012-6-28 17:24

2.2 PERFORMANCE AND AVAILABILITY TOOLBOX
Some basic programming concepts tend to occur often in both infrastructure and application levels because of their wide applicability in achieving high performance or high availability in large-scale deployments. The following table describes some of the most prevalent concepts; each entry notes whether the technique primarily improves performance, availability, or both.




• Replication (performance: yes; availability: yes). Data replication is a powerful technique because it can improve both performance and availability. It is particularly powerful when the replicated data are not often modified, because replication makes updates more complex.

• Sharding/partitioning (performance: yes; availability: yes). Splitting a data set into smaller fragments (shards) and distributing them across a large number of machines. Operations on the data set are dispatched to some or all of the machines hosting shards, and results are coalesced by the client. The sharding policy can vary depending on space constraints and performance considerations. Sharding also helps availability because recovery of small data fragments can be done faster than recovery of larger ones. (A minimal sketch of this scatter-gather pattern follows the table.)

• Load balancing (performance: yes). In large-scale services, service-level performance often depends on the slowest responder out of hundreds or thousands of servers. Reducing response-time variance is therefore critical. In a sharded service, load balancing can be achieved by biasing the sharding policy to equalize the amount of work per server. That policy may need to be informed by the expected mix of requests or by the computing capabilities of different servers. Note that even homogeneous machines can offer different performance characteristics to a load-balancing client if multiple applications are sharing a subset of the load-balanced servers. In a replicated service, the load-balancing agent can dynamically adjust the load by selecting which servers to dispatch a new request to. It may still be difficult to approach perfect load balancing because the amount of work required by different types of requests is not always constant or predictable.

• Health checking and watchdog timers (availability: yes). In a large-scale system, failures often are manifested as slow or unresponsive behavior from a given server. In this environment, no operation can rely on a given server to respond in order to make forward progress. Moreover, it is critical to quickly determine that a server is too slow or unreachable and steer new requests away from it. Remote procedure calls must set well-informed time-out values to abort long-running requests, and infrastructure-level software may need to continually check connection-level responsiveness of communicating servers and take appropriate action when needed.

• Integrity checks (availability: yes). In some cases, besides unresponsiveness, faults are manifested as data corruption. Although corruption may be rarer, it does occur, and often in ways that underlying hardware or software checks do not catch (e.g., there are known issues with the error coverage of some networking CRC checks). Extra software checks can mitigate these problems by changing the underlying encoding or adding more powerful redundant integrity checks.

• Application-specific compression (performance: yes). Often a large portion of the equipment costs in modern datacenters is in the various storage layers. For services with very high throughput requirements, it is critical to fit as much of the working set as possible in DRAM; this makes compression techniques very important because the extra CPU overhead of decompressing is still orders of magnitude lower than the penalties involved in going to disks. Although generic compression algorithms can do quite well on average, application-level compression schemes that are aware of the data encoding and distribution of values can achieve significantly superior compression factors or better decompression speeds.

• Eventual consistency (performance: yes; availability: yes). Often, keeping multiple replicas up to date using the traditional guarantees offered by a database management system significantly increases complexity, hurts performance, and reduces availability of distributed applications [90]. Fortunately, large classes of applications have more relaxed requirements and can tolerate inconsistent views for limited periods, provided that the system eventually returns to a stable consistent state.
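As promised in the sharding entry, here is a minimal scatter-gather sketch (all names are hypothetical and the stores are in-process dictionaries; real systems add the replication, health checking, and timeouts described in the other entries):

[code]
# Minimal sharding sketch: hash keys onto shards, fan lookups out to the
# shards that hold them, and coalesce results at the client.
from hashlib import md5

NUM_SHARDS = 8
shards = {i: {} for i in range(NUM_SHARDS)}   # shard id -> key/value store

def shard_for(key: str) -> int:
    """Deterministically map a key to a shard."""
    return int(md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

def put(key: str, value: str) -> None:
    shards[shard_for(key)][key] = value

def get_many(keys: list[str]) -> dict[str, str]:
    """Scatter lookups to the relevant shards, then gather the results."""
    results: dict[str, str] = {}
    for key in keys:                          # in a WSC these are parallel RPCs
        results[key] = shards[shard_for(key)].get(key, "<missing>")
    return results

put("www.example.com", "indexed page data")
print(get_many(["www.example.com", "www.other.org"]))
[/code]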



Author: trybestying    Time: 2012-7-18 14:53
Title: 2.3 CLUSTER-LEVEL INFRASTRUCTURE SOFTWARE

Large parallel applications can also improve response time by using redundant computation techniques. Several situations can cause a given subtask of a large parallel job to run much more slowly than its siblings, whether because of performance interference from other workloads or because of software/hardware faults. Redundant computation is not as widely deployed as other techniques because of the obvious overheads involved.
However, there are situations in which the completion of a large job is being held up by the execution of a very small percentage of its subtasks. One such example is the straggler problem described in the MapReduce paper [19]: a single slow task can determine the response time of a huge parallel job. MapReduce's strategy is to identify such situations toward the end of a job and speculatively launch redundant copies of only the slow tasks. This strategy increases resource usage by a few percentage points while reducing the completion time of a parallel computation by more than 30%.
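A minimal sketch of that speculative-execution idea follows (names and thresholds are our own assumptions; the real MapReduce scheduler is considerably more sophisticated):

[code]
# Straggler mitigation sketch: near the end of a job, flag the slowest
# in-flight tasks so backup copies can be launched; whichever copy of a
# task finishes first wins. Purely illustrative.
def pick_backup_candidates(progress: dict[str, float],
                           near_done: float = 0.9) -> list[str]:
    """Once most tasks are finished, return the unfinished stragglers."""
    finished = sum(1 for p in progress.values() if p >= 1.0)
    if finished / len(progress) < near_done:
        return []                     # job is not near completion yet
    return [task for task, p in progress.items() if p < 1.0]

progress = {f"task{i}": 1.0 for i in range(19)}
progress["task19"] = 0.3              # one straggler holds up the whole job
print(pick_backup_candidates(progress))   # ['task19'] gets a redundant copy
[/code]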
2.3 CLUSTER-LEVEL INFRASTRUCTURE SOFTWARE
Just as an operating system layer manages a computer's resources and provides basic services, a system composed of thousands of computers, networking, and storage likewise needs a software layer that provides analogous functionality at this larger scale. We call this layer the cluster-level infrastructure. The paragraphs that follow describe four broad groups of infrastructure software in this layer.
2.3.1 Resource Management
Resource management is the most indispensable component of the cluster-level infrastructure layer. It controls the mapping of user tasks to hardware resources, whether by manually or statically assigning groups of servers to users or jobs, or through higher-level abstractions that automate resource allocation and allow resource sharing at a finer level of granularity.
2.3.2 Hardware Abstraction and Other Basic Services
Large-scale parallel applications need a set of basic services, such as reliable distributed storage, message passing, and cluster-level synchronization. Implementing this kind of functionality correctly, with high performance and high availability, is complex in large clusters. It is wise to avoid re-implementing such tricky code for each application and instead create reusable modules or services. GFS and Chubby at Google and Dynamo at Amazon are good examples of reliable storage and lock services developed for large clusters.
2.3.3 Deployment and Maintenance
In large-scale systems, many tasks that are manual processes in a small deployment require substantial automation infrastructure to operate efficiently: for example, distribution and configuration of software images, monitoring of service performance and quality, and triaging alarms for operators in emergency situations. Microsoft's Autopilot system offers an example design for this class of functionality in Windows Live datacenters: it monitors overall hardware health and performs careful monitoring, automated diagnostics, and automated repair workflows. Google's System Health infrastructure (Pinheiro et al.) is a typical example of the software infrastructure needed for efficient health management. Performance debugging and optimization also need specialized solutions; the X-Trace system developed at UC Berkeley is a good example.
2.3.4 Programming Frameworks
Infrastructure software can also be built to hide the complexity of the underlying cluster hardware from all application software. MapReduce [19], BigTable [13], and Dynamo [20] are good examples of such infrastructure software: by automatically handling data partitioning, distribution, and fault tolerance, they dramatically improve programmer productivity in their respective domains.
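To give a flavor of what such frameworks hide, here is a toy word count in the MapReduce style, run sequentially in one process (a sketch only; the framework's real value is distributing the map and reduce phases over thousands of machines and masking faults):

[code]
# Toy MapReduce-style word count; distribution and fault tolerance elided.
from collections import defaultdict

def map_phase(doc: str):
    """Emit (word, 1) pairs, as a MapReduce mapper would."""
    for word in doc.split():
        yield (word, 1)

def reduce_phase(pairs) -> dict[str, int]:
    """Sum the counts for each word, as a MapReduce reducer would."""
    counts: dict[str, int] = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the datacenter as a computer", "the computer as a warehouse"]
pairs = [kv for doc in docs for kv in map_phase(doc)]
print(reduce_phase(pairs))   # {'the': 2, 'datacenter': 1, 'as': 2, ...}
[/code]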

Author: trybestying    Time: 2012-7-19 09:19
2.4 Application-Level Software
2.4.1 Workload Examples
There are two broad classes of workloads: online services and offline computations.
2.4.2 Online Service: Web Search
2.4.3 Offline Computation: Scholar Article Similarity

2.5 A Monitoring Infrastructure
2.5.1 Service-Level Dashboards
These let operators quickly identify service-level problems, but lack detailed information about the problem itself.
2.5.2 Performance Debugging Tools
Distributed-system tracing tools fall into two broad classes: (1) black-box monitoring systems, such as WAP5 and the Sherlock system; and (2) application/middleware instrumentation systems, such as Pip, Magpie, and X-Trace. Google's Dapper is an annotation-based tracing tool.
Both service-level dashboards and performance debugging tools measure only application-level health and performance. Because they are designed to tolerate hardware faults, they will miss a large number of low-level hardware problems.
2.5.3 Platform-Level Monitoring
Tools that continuously check and monitor the health of the computing platform are needed to understand and analyze hardware and software failures. Chapter 6 describes the monitoring tools Google uses in more detail.
2.6 Buy vs. Build
Traditional IT relies heavily on third-party software components. Google, however, develops its application-specific logic and much of its cluster-level infrastructure software in-house; its platform-level software is built from third-party components, but open-source ones that can be modified as needed. As a result, more of Google's entire software stack is under the control of its application developers. This approach adds a large amount of software development and maintenance work, but yields major benefits in flexibility and cost efficiency. Besides, no software vendor could manage and maintain datacenters at Google's scale; implementing the software stack around its own business characteristics keeps Google's software flexible and efficient.

Author: trybestying    Time: 2012-7-20 17:27
Chapter 3: Hardware Building Blocks
The main building blocks are the server hardware, the networking fabric, and the storage hierarchy components. This chapter focuses on the choice of server hardware.
3.1 Cost-Efficient Hardware
Clusters of low-end servers have become the preferred building block of WSCs for many reasons, chief among them cost efficiency.
3.1.1 Parallel Application Performance
Low-end servers.
3.1.2 How Low-End Can You Go?
1) Lowering hardware cost raises development effort, because most applications must be explicitly parallelized or further optimized.
2) Networking requirements also grow with a larger number of smaller systems, increasing networking delay and networking cost. The cost of the fabric interconnecting low-end servers may well offset the cost advantage of the cheaper CPUs.
3) Smaller servers may also lead to lower utilization.
4) Even highly parallel algorithms are sometimes less efficient when the computation and data are partitioned into smaller pieces.
In general, a lower-end server building block must have a healthy cost-efficiency advantage over a higher-end alternative to be competitive.
3.1.3 Balanced Designs
Computer architects are trained to find the right combination of building blocks to solve a WSC's performance and capacity problems. Keep three important considerations in mind:
1. Smart programmers may be able to restructure their algorithms to better match a cheaper design alternative, but only within reason; programs must not become excessively complex.
2. The most cost-efficient and balanced hardware configuration is likely the one that matches the resource requirements of a combination of workloads, rather than fitting each workload perfectly.
3. Fungible resources tend to be used more efficiently. Provided a WSC has reasonable connectivity, effort should go into building software systems that can flexibly use resources on remote servers. This affects balance decisions in many ways.
The right design point depends not only on the high-level structure of the workload itself; data size and service popularity also play an important role.

Author: trybestying    Time: 2012-7-20 17:29
Chapter 4: Datacenter Basics
A datacenter is, in essence, a very large device that consumes electricity and produces heat, and additional cooling systems are needed to remove that heat. A large share of a datacenter's construction cost goes into the power delivery and cooling systems it requires; for large datacenters, typical construction costs run $10-20 per watt, though they vary considerably with size, location, and design.
4.1 DATACENTER TIER CLASSIFICATIONS
Overall datacenter designs are often classified into Tiers I-IV:
• Tier I: a single path for power and cooling distribution, without redundant components.
• Tier II: adds redundant components to the design (N + 1), improving availability.
• Tier III: multiple power and cooling distribution paths, but with only one active path. It has redundant components and is concurrently maintainable; that is, it provides redundancy even during maintenance, usually via an N + 2 setup.
• Tier IV: two active power and cooling distribution paths, redundant components in each path, and tolerance of any single equipment failure without impacting the load.
These classifications are not 100% precise. Most commercial datacenters fall somewhere between Tiers III and IV, choosing a balance point between construction cost and reliability. Real-world datacenter reliability is also strongly influenced by the quality of the organization running the datacenter, not just by the design. Typical availability estimates used in the industry are 99.7% for Tier II datacenters and 99.98% and 99.995% for Tiers III and IV, respectively (see the downtime calculation below).
Datacenter sizes vary widely. Two thirds of US servers are housed in datacenters smaller than 5,000 sq ft (about 450 square meters) and with less than 1 MW of critical power. Most large datacenters are built to host servers from multiple companies (often called co-location datacenters, or "colos") and can support a critical load of 10-20 MW. Few datacenters today exceed 30 MW of critical capacity.
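The downtime calculation referenced above is simple arithmetic over the quoted availability percentages:

[code]
# Yearly downtime implied by the industry availability estimates above.
HOURS_PER_YEAR = 8766

for tier, availability in (("II", 0.997), ("III", 0.9998), ("IV", 0.99995)):
    down_hours = (1 - availability) * HOURS_PER_YEAR
    print(f"Tier {tier:>3}: {down_hours:6.2f} hours/year of downtime")
# Tier  II:  26.30 hours/year
# Tier III:   1.75 hours/year
# Tier  IV:   0.44 hours/year
[/code]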

Author: trybestying    Time: 2012-7-21 21:57

4.2 Datacenter Energy Systems
Figure 4.1 shows the main components of a typical datacenter. Power enters the building from an outside transformer, typically located in the utility substation; this part of the power system is usually called "medium voltage" (typically 10-20 kV), distinguishing it from the long-distance high-voltage lines (60-400 kV) and the "low-voltage" internal distribution (110-600 V). The medium-voltage lines terminate at the primary switchgear, which includes breakers to protect against power faults and transformers to step the voltage down to 400-600 V. The low-voltage power then flows into the uninterruptible power supply (UPS) systems; when utility power fails, a set of diesel generators supplies power to the UPS.
FIGURE 4.1: The main components of a typical datacenter (image courtesy of DLB Associates [23]).
4.2.1 UPS Systems
A UPS system has three functions:
First, it contains a transfer switch that chooses the active power input. The UPS keeps the power supply uninterrupted: when utility power fails, it senses the loss automatically, and the generators can pick up the rated load within 10-15 seconds.
Second, the UPS contains batteries or flywheel energy storage to bridge the gap between utility and backup power, via an AC-DC-AC double conversion: while utility power is present, the batteries store energy as DC; when power fails, the stored energy is delivered back out as AC.
Finally, the UPS conditions the incoming power, removing voltage spikes or sags and harmonic distortion in the feed. This conditioning is accomplished naturally by the double-conversion steps.
Because UPS batteries take up a lot of space, the UPS is typically housed in a separate UPS room rather than on the datacenter floor. Typical UPS sizes range from hundreds of kilowatts up to 2 MW.
4.2.2 Power Distribution Units
The UPS output feeds power distribution units (PDUs), similar to the breaker panel in a home, located on the datacenter floor. A PDU breaks the higher-voltage feed (200-480 V) into the many 110- or 220-V circuits that feed the servers. Each circuit is protected by its own breaker, so that a short on a server or power supply trips only that one circuit rather than the whole PDU, let alone the whole UPS. A typical PDU handles a load of 75-225 kW; a typical circuit handles 20 or 30 A at 110-220 V, a maximum of about 6 kW. PDUs often provide additional redundancy by accepting two independent power sources and being able to switch between them without delay, so that even the failure of one UPS does not interrupt server power.
4.3 DATACENTER COOLING SYSTEMS
The area under a datacenter's raised floor is often used to route power cables to the racks, but its primary purpose is to distribute cool air to the server racks.
4.3.1 CRAC Units
Cooling mostly uses precision air-conditioning (CRAC) units with under-floor cold-air delivery and heat exchange, deployed in a cold-aisle/hot-aisle arrangement.
4.3.2 Free Cooling
Newer datacenters use cooling towers to "free"-pre-cool the condenser water loop before it reaches the chillers. Free cooling is not really free, but it is far more energy-efficient than cooling with the chillers. Glycol-based radiators can be used as an alternative.
4.3.3 Air Flow Considerations
There are many ways to improve air flow. For example, newer datacenters have started to physically separate the hot aisles from the rest of the room to eliminate recirculation and to optimize the return path to the CRACs. In this setup the entire room is filled with cool air (because all hot exhaust is confined to separate plenums or duct work), so all servers in a rack receive intake air at the same temperature.
4.3.4 In-Rack Cooling
Typically, an in-rack cooler adds an air-to-water heat exchanger at the back of the rack to remove the heat in the servers' hot exhaust, reducing the load on the CRACs; some solutions replace the CRACs entirely. The main drawback is that chilled water must be brought to every rack, greatly increasing plumbing cost and raising concerns about water leaks.
4.3.5 Container-Based Datacenters
Container-based datacenters go one step beyond in-rack cooling, using water-based cooling; a container integrates the power distribution components along with the racks, at a higher density than a traditional datacenter. Google has built and operated container-based datacenters since 2005, although the idea can be traced back to a Google patent application in 2003. Such datacenters achieve very high energy-efficiency ratings, and Microsoft has also announced that its new datacenters will rely heavily on containers.

Author: trybestying    Time: 2012-7-25 11:51

Chapter 5: Energy and Power Efficiency
The energy efficiency of datacenters is attracting more and more attention, and a growing number of energy-saving techniques are being developed and applied to every aspect of the datacenter. This chapter discusses topics related to datacenter energy efficiency.
5.1 Datacenter Energy Efficiency
Efficiency = Computation / Total Energy = (1/PUE) × (1/SPUE) × (Computation / Total Energy to Electronic Components)
EQUATION 5.1: Breaking an energy efficiency metric into three components: a facility term (a), a server energy conversion term (b), and the efficiency of the electronic components in performing the computation per se (c).

PUE values have trended toward better levels in recent years, thanks to sustained attention to energy efficiency and to measures such as evaporative cooling towers and more efficient air flow that eliminate unnecessary energy conversion losses.
5.1.1 Sources of Efficiency Losses in Datacenters
The UPS is one of the main sources of efficiency loss, followed by the cooling system.
5.1.2 Improving the Energy Efficiency of Datacenters
Raise the cold-aisle temperature from 20°C to 25-27°C, manage the hot aisles carefully, and use high-speed flywheels to reduce losses in the UPS and power distribution systems.
Google reached a PUE of 1.24 in 2008. Compared with other datacenters, the main differences are:
• Careful air flow handling: the hot air exhausted by servers is not allowed to mix with cold air, and the path to the cooling coils is kept very short, avoiding the energy cost of moving hot or cold air over long distances.
• Elevated cold-aisle temperatures: the cold aisles are kept at about 27°C rather than 18-20°C, which makes it much easier to cool the datacenter efficiently.
• Use of free cooling: several cooling towers dissipate heat by evaporating water, greatly reducing the need to run chillers. In most moderate climates, the cooling towers can eliminate the majority of chiller operation; Google's datacenter in Belgium even eliminates chillers entirely, running on free cooling 100% of the time.
• Per-server 12-V DC UPS: each server contains a minimal battery UPS; these batteries supply power at 99.99% efficiency, raising the efficiency of the power infrastructure from roughly 90% to nearly 99%.

All datacenters can adopt the techniques above to reach a PUE of 1.35-1.45.

Server SPUE values are typically between 1.6 and 1.8; good designs can get below 1.2 (see the combined-overhead calculation below).
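The combined-overhead calculation referenced above multiplies the facility and server terms of Equation 5.1 (a sketch using only the figures quoted in this section):

[code]
# Total energy overhead = PUE * SPUE (terms (a) and (b) of Equation 5.1).
def total_overhead(pue: float, spue: float) -> float:
    """Watts drawn from the grid per watt reaching the electronic components."""
    return pue * spue

print(total_overhead(1.45, 1.8))   # ~2.6x: achievable PUE, typical server
print(total_overhead(1.24, 1.2))   # ~1.5x: Google-like PUE, efficient server
[/code]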
5.2 Measuring the Efficiency of Computing
Benchmarks such as JouleSort and SPECpower_ssj2008.
5.2.1 Some Useful Benchmarks
SPECpower_ssj2008 measures overall server performance with a standard Java (JDK) workload and relates the work done to the power consumed across eleven load segments, yielding a work-per-energy ratio, much like a price/performance metric. The methodology takes a server's maximum workload as the 100% mark, defines a segment at each 10% step down in workload, and compares energy consumption within each segment.
5.2.2 Load vs. Efficiency
Systems such as GFS spread traffic into lower levels of activity across all machines.
The Tickless kernel project provides another example of building and maintaining energy-efficient idle resources.

Author: trybestying    Time: 2012-7-25 11:56

Sections 5.3-5.8: translation pending.

Chapter 6: Modeling Costs
Datacenter costs fall into two broad classes: capital (construction) costs and operating (maintenance) costs:
TCO = datacenter depreciation + datacenter opex + server depreciation + server opex

6.1 Capital Costs
Datacenter construction costs depend greatly on design, size, location, and build speed. Adding reliability and redundancy makes a datacenter more expensive. Typically, about 80% of total construction cost comes from power and cooling, and the remaining 20% from the general building and site construction.
Both very small and very large datacenters tend to be more expensive per watt (the former because fixed costs cannot be amortized, the latter because very large centers require additional infrastructure such as electrical substations).
As a general rule, many large datacenters cost $12-15/W to build (per watt of critical load, i.e., the peak power that can be delivered to IT equipment); smaller ones cost more. For example, a datacenter with 20 MW of generators in a 2N configuration may deliver only 6 MW of critical load power (plus 4 MW to power the chillers). So if it cost $120 million to build, it costs $20/W, not $6/W. Costs are often expressed in dollars per square foot, but that metric is not very useful, and industry experts tend to avoid it.

A datacenter's monthly depreciation (or amortization) cost depends on the duration over which the initial construction cost is amortized (related to its useful lifetime) and the assumed interest rate. Typically, datacenters are depreciated over periods of 10-15 years. Under US accounting rules, straight-line depreciation is common: the asset loses a fixed amount of value each month. For example, a $15/W datacenter depreciated over 12 years has a depreciation cost of $0.10/W per month. If the construction were financed with a loan at 8% interest, the corresponding monthly interest payments add another $0.06/W, for a total of $0.16/W per month. Interest rates vary over time, but many companies pay 10-13%.
Server depreciation works the same way (except over a shorter period, typically 3-4 years), expressed in $/W using the server's actual peak power draw. For example, a $4,000 server with an actual peak power consumption of 500 W costs $8/W. Depreciated over 4 years, the server costs $0.17/W per month; at 8% annual interest, financing adds another $0.03/W per month, for a total of $0.20/W per month, roughly comparable to the per-watt cost of the datacenter itself.
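The per-watt figures above can be reproduced with straight-line depreciation plus the interest portion of a standard amortized loan (a sketch; the text's rough arithmetic is consistent with this treatment):

[code]
# Reproducing the $/W-month figures above: straight-line depreciation plus
# the interest portion of a standard amortized loan at 8%/year.
def monthly_costs(cost_per_watt: float, years: int, rate: float = 0.08):
    r, n = rate / 12, years * 12
    depreciation = cost_per_watt / n                     # straight line
    payment = cost_per_watt * r / (1 - (1 + r) ** -n)    # amortized payment
    return depreciation, payment - depreciation          # (depr, interest)

print(monthly_costs(15.0, 12))       # (~$0.10, ~$0.06)/W-month: datacenter
print(monthly_costs(4000 / 500, 4))  # (~$0.17, ~$0.03)/W-month: server
[/code]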
6.2 Operational Costs
Datacenter operating costs are harder to characterize because they depend heavily on operating standards (how many security guards are on duty at a time, how often generators are tested and serviced) as well as on datacenter size. Costs are also affected by physical location (climate, taxes, salary levels) and by the design and age of the facility. For simplicity, we express these costs in $/W per month (covering security, maintenance, and power distribution). In the United States, a typical multi-MW datacenter has operating costs of $0.02-0.08/W per month, excluding the actual cost of electricity.
Servers likewise have operating costs. Besides the cost of running the facility itself, hardware maintenance and repair, as well as electricity, must be considered. Server maintenance costs vary widely depending on server type and maintenance standards (e.g., a 4-hour response time vs. two business days).
Also, in traditional IT environments, a large fraction of operating cost lies in the applications: software licenses and the cost of system administrators, database administrators, network engineers, and so on. We exclude such application costs here, because we focus on the operating cost of the physical infrastructure, and because application costs vary greatly from one environment to another.
6.3 Case Studies
Numerous case studies show that, over the long run, datacenter facility costs (which scale with power consumption) account for a large share of total cost; the purchase price of the servers matters less, and energy matters more.
Software performance and server utilization are just as important.

Author: trybestying    Time: 2012-7-26 17:49
Title: Chapter 7: Dealing with Failures and Repairs

Chapter 7: Dealing with Failures and Repairs
Determining the appropriate level of reliability is fundamentally a trade-off between the cost of failures (including repairs) and the cost of preventing them. The cost of a traditional server's failure is considered very high, so designers go to great lengths to provide more reliable hardware, adding redundant power supplies, fans, error-correcting codes (ECC), RAID disks, and so on. Many traditional enterprise applications were not designed to survive frequent hardware faults, and it is hard for them to do so when faults occur.
At the scale of WSCs, hardware cannot realistically be made "reliable enough." WSC applications must instead handle server failures in software, either with code in the application itself or through functionality provided by middleware, such as a virtual machine provisioning system that restarts a failed VM on a spare node. Hamilton offers some enlightening commentary on writing software for this environment, based on experience designing and operating several large services (MSN and Windows Live).
7.1 Implications of Software-Based Fault Tolerance
As much as possible, one should try to implement a fault-tolerant software infrastructure layer, so that application-level software is shielded from most of the complexity of faults in the layers below.
GFS is a useful example for storage systems: data updates (which must communicate with multiple systems to update all replicas) add networking overhead, but aggregate read bandwidth improves because clients can fetch data from multiple endpoints.
Modern DRAM systems are a good example of how strong error correction can be provided at very low additional hardware cost; Google's next-generation servers will use ECC DRAM.
7.2 Categorizing Faults
7.2.1 Fault Severity
A rough classification of service-level failures:
• Corrupted: committed data are lost or corrupted and cannot be regenerated
• Unreachable: the service is down or otherwise unreachable by users
• Degraded: the service is available but in some degraded mode
• Masked: faults occur but are completely hidden from users by fault-tolerant software/hardware mechanisms
When a fault cannot be masked, gracefully degraded service (an approach proposed by Brewer) is widely adopted in the design of cluster-level software; Internet search and email services are good examples.
Even if an Internet service were perfectly reliable, users would on average perceive no better than about 99.0% availability, because availability is limited by the Internet itself.
A standard measure of service availability is yield (also due to Brewer): the number of satisfied service requests divided by the total number of service requests.
In short, near-perfect reliability is not a universal requirement for Internet services.
7.2.2 Causes of Service-Level Faults
Oppenheimer et al. conclude that faults caused by operator actions or by misconfiguration are the most common; outright hardware-related failure events (server or networking) account for 10-25%.
The distribution of fault causes observed at Google is shown below:
[Figure: causes of service-level disruptions observed at Google]
7.3 Machine-Level Failures
7.3.1 What Causes Machines to Crash?
Software is the more common culprit in machine crashes; among hardware, memory and disks are the usual suspects:
• DRAM soft errors
• Disk errors
It is worth noting that a key property of well-designed fault-tolerant software is the ability to survive individual faults whether they are caused by hardware or by software errors.
7.3.2 Predicting Faults
To be economically competitive, a fault prediction model must offer sufficiently high accuracy. Pinheiro et al. describe Google's attempt to build predictive models for disk failures based on the disk-drive health parameters available through the SMART (Self-Monitoring, Analysis and Reporting Technology) standard. They conclude that such models are unlikely to predict most failures and are relatively imprecise for the failures they do predict. Our general experience is that only a small subset of failure classes can be predicted with high enough accuracy to produce useful operational models for WSCs.
7.4 Repairs
An efficient repair process is critical to the overall cost efficiency of WSCs. The figure below shows the architecture of Google's System Health monitoring infrastructure.
[Figure: Google's System Health monitoring infrastructure]
7.5 Tolerating Faults, Not Hiding Them

Author: 黑白人生-Alex    Time: 2012-11-19 10:29
OP, I really admire you. I've actually been translating this piece myself, but my skill is quite limited, so I've only dared to keep my version for my own reading. Keep it up; I'd love to discuss it with you sometime.



