Original article: https://lwn.net/Articles/252125/

What every programmer should know about memory

Part 2: CPU Caches

[Editor’s note: This is the second installment in Ulrich Drepper’s “What every programmer should know about memory” document. Those who have not read the first part will likely want to start there. This is good stuff, and we once again thank Ulrich for allowing us to publish it.]

CPUs are today much more sophisticated than they were only 25 years ago. In those days, the frequency of the CPU core was at a level equivalent to that of the memory bus. Memory access was only a bit slower than register access. But this changed dramatically in the early 90s, when CPU designers increased the frequency of the CPU core but the frequency of the memory bus and the performance of RAM chips did not increase proportionally. This is not due to the fact that faster RAM could not be built, as explained in the previous section. It is possible but it is not economical. RAM as fast as current CPU cores is orders of magnitude more expensive than any dynamic RAM.

If the choice is between a machine with very little, very fast RAM and a machine with a lot of relatively fast RAM, the second will always win given a working set size which exceeds the small RAM size and the cost of accessing secondary storage media such as hard drives. The problem here is the speed of secondary storage, usually hard disks, which must be used to hold the swapped out part of the working set. Accessing those disks is orders of magnitude slower than even DRAM access.

Fortunately it does not have to be an all-or-nothing decision. A computer can have a small amount of high-speed SRAM in addition to the large amount of DRAM. One possible implementation would be to dedicate a certain area of the address space of the processor as containing the SRAM and the rest the DRAM. The task of the operating system would then be to optimally distribute data to make use of the SRAM. Basically, the SRAM serves in this situation as an extension of the register set of the processor.

While this is a possible implementation, it is not viable. Ignoring the problem of mapping the physical resources of such SRAM-backed memory to the virtual address spaces of the processes (which by itself is terribly hard) this approach would require each process to administer in software the allocation of this memory region. The size of the memory region can vary from processor to processor (i.e., processors have different amounts of the expensive SRAM-backed memory). Each module which makes up part of a program will claim its share of the fast memory, which introduces additional costs through synchronization requirements. In short, the gains of having fast memory would be eaten up completely by the overhead of administering the resources.

So, instead of putting the SRAM under the control of the OS or user, it becomes a resource which is transparently used and administered by the processors. In this mode, SRAM is used to make temporary copies of (to cache, in other words) data in main memory which is likely to be used soon by the processor. This is possible because program code and data have temporal and spatial locality. This means that, over short periods of time, there is a good chance that the same code or data gets reused. For code this means that there are most likely loops in the code so that the same code gets executed over and over again (the perfect case for spatial locality). Data accesses are also ideally limited to small regions. Even if the memory used over short time periods is not close together there is a high chance that the same data will be reused before long (temporal locality). For code this means, for instance, that in a loop a function call is made and that function is located elsewhere in the address space. The function may be distant in memory, but calls to that function will be close in time. For data it means that the total amount of memory used at one time (the working set size) is ideally limited but the memory used, as a result of the random access nature of RAM, is not close together. Realizing that locality exists is key to the concept of CPU caches as we use them today.
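
As a rough illustration of both kinds of locality, consider the following C sketch; the array size, iteration counts, and function names are arbitrary values chosen for this example, not taken from the text:

    #include <stddef.h>

    #define N 1024

    static double data[N];

    double sum_once(void)
    {
        double sum = 0.0;
        /* Spatial locality: consecutive iterations touch neighboring
           memory, so a single 64-byte cache line serves 8 doubles. */
        for (size_t i = 0; i < N; ++i)
            sum += data[i];
        return sum;
    }

    double sum_many(int rounds)
    {
        double sum = 0.0;
        /* Temporal locality: the same elements (and the code of the
           loop body itself) are reused on every round, so after the
           first pass they are likely to be served from the cache. */
        for (int r = 0; r < rounds; ++r)
            for (size_t i = 0; i < N; ++i)
                sum += data[i];
        return sum;
    }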

A simple computation can show how effective caches can theoretically be. Assume access to main memory takes 200 cycles and access to the cache memory takes 15 cycles. Then code using 100 data elements 100 times each will spend 2,000,000 cycles on memory operations if there is no cache and only 168,500 cycles if all data can be cached. That is an improvement of 91.5%.
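
For the skeptical reader, the numbers can be reproduced directly; the sketch below merely restates the assumptions made above (200 cycles per main memory access, 15 per cache access, and one compulsory miss per element):

    #include <stdio.h>

    int main(void)
    {
        long mem = 200, cache = 15;  /* cycles per access, as assumed above */
        long elems = 100, uses = 100;

        long uncached = elems * uses * mem;               /* 2,000,000 cycles */
        /* With a cache, the first access to each element still goes to
           main memory; the remaining 99 accesses hit the cache. */
        long cached = elems * (mem + (uses - 1) * cache); /* 168,500 cycles */

        printf("%ld vs %ld cycles\n", uncached, cached);
        return 0;
    }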

The size of the SRAM used for caches is many times smaller than the main memory. In the author’s experience with workstations with CPU caches the cache size has always been around 1/1000th of the size of the main memory (today: 4MB cache and 4GB main memory). This alone does not constitute a problem. If the size of the working set (the set of data currently worked on) is smaller than the cache size it does not matter. But computers do not have large main memories for no reason. The working set is bound to be larger than the cache. This is especially true for systems running multiple processes where the size of the working set is the sum of the sizes of all the individual processes and the kernel.
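
On Linux with glibc the cache sizes can be queried at run time; the sketch below is glibc-specific (the _SC_LEVEL* constants are an extension, not part of POSIX, and may return 0 where the information is unavailable):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* glibc extension; values are in bytes. */
        printf("L1d: %ld\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
        printf("L2:  %ld\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
        printf("L3:  %ld\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
        return 0;
    }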

What is needed to deal with the limited size of the cache is a set of good strategies to determine what should be cached at any given time. Since not all data of the working set is used at exactly the same time we can use techniques to temporarily replace some data in the cache with other data. And maybe this can be done before the data is actually needed. This prefetching would remove some of the costs of accessing main memory since it happens asynchronously with respect to the execution of the program. All these techniques and more can be used to make the cache appear bigger than it actually is. We will discuss them in Section 3.3. Once all these techniques are exploited it is up to the programmer to help the processor. How this can be done will be discussed in Section 6.
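
As a foretaste of Section 6, the following sketch shows what explicit prefetching can look like using the GCC builtin __builtin_prefetch; the prefetch distance of eight elements is an arbitrary illustrative choice, and on modern processors the hardware prefetcher often makes such hints unnecessary:

    #include <stddef.h>

    void scale(double *a, size_t n, double factor)
    {
        for (size_t i = 0; i < n; ++i) {
            /* Request a[i+8] ahead of time: second argument 1 = the
               data will be written, third argument 0 = low temporal
               locality (no need to keep it cached afterwards). */
            if (i + 8 < n)
                __builtin_prefetch(&a[i + 8], 1, 0);
            a[i] *= factor;
        }
    }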

3.1 CPU Caches in the Big Picture
Before diving into technical details of the implementation of CPU caches some readers might find it useful to first see in some more detail how caches fit into the “big picture” of a modern computer system.

Figure 3.1: Minimum Cache Configuration
Figure 3.1 shows the minimum cache configuration. It corresponds to the architecture which could be found in early systems which deployed CPU caches. The CPU core is no longer directly connected to the main memory. {In even earlier systems the cache was attached to the system bus just like the CPU and the main memory. This was more a hack than a real solution.} All loads and stores have to go through the cache. The connection between the CPU core and the cache is a special, fast connection. In a simplified representation, the main memory and the cache are connected to the system bus which can also be used for communication with other components of the system. We introduced the system bus as “FSB” which is the name in use today; see Section 2.2. In this section we ignore the Northbridge; it is assumed to be present to facilitate the communication of the CPU(s) with the main memory.

Even though computers for the last several decades have used the von Neumann architecture, experience has shown that it is of advantage to separate the caches used for code and for data. Intel has used separate code and data caches since 1993 and never looked back. The memory regions needed for code and data are pretty much independent of each other, which is why independent caches work better. In recent years another advantage emerged: the instruction decoding step for the most common processors is slow; caching decoded instructions can speed up the execution, especially when the pipeline is empty due to incorrectly predicted or impossible-to-predict branches.

Soon after the introduction of the cache, the system got more complicated. The speed difference between the cache and the main memory increased again, to a point that another level of cache was added, bigger and slower than the first-level cache. Only increasing the size of the first-level cache was not an option for economical reasons. Today, there are even machines with three levels of cache in regular use. A system with such a processor looks like Figure 3.2. With the increase in the number of cores in a single CPU the number of cache levels might increase in the future even more.

Figure 3.2: Processor with Level 3 Cache

Figure 3.2 shows three levels of cache and introduces the nomenclature we will use in the remainder of the document. L1d is the level 1 data cache, L1i the level 1 instruction cache, etc. Note that this is a schematic; the data flow in reality need not pass through any of the higher-level caches on the way from the core to the main memory. CPU designers have a lot of freedom designing the interfaces of the caches. For programmers these design choices are invisible.

In addition we have processors which have multiple cores and each core can have multiple “threads”. The difference between a core and a thread is that separate cores have separate copies of (almost {Early multi-core processors even had separate 2nd level caches and no 3rd level cache.}) all the hardware resources. The cores can run completely independently unless they are using the same resources—e.g., the connections to the outside—at the same time. Threads, on the other hand, share almost all of the processor’s resources. Intel’s implementation of threads has only separate registers for the threads, and even that is limited: some registers are shared. The complete picture for a modern CPU therefore looks like Figure 3.3.

Figure 3.3: Multi processor, multi-core, multi-thread
In this figure we have two processors, each with two cores, each of which has two threads. The threads share the Level 1 caches. The cores (shaded in the darker gray) have individual Level 1 caches. All cores of the CPU share the higher-level caches. The two processors (the two big boxes shaded in the lighter gray) of course do not share any caches. All this will be important, especially when we are discussing the cache effects on multi-process and multi-thread applications.
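
On Linux the topology and cache sharing shown in Figure 3.3 can be inspected through sysfs; the sketch below is Linux-specific, and the assumption that cache index 0 is L1d and index 2 is L2 holds on common x86 systems but is not guaranteed:

    #include <stdio.h>

    static void show(const char *path)
    {
        char buf[128];
        FILE *f = fopen(path, "r");
        if (f == NULL)
            return;
        if (fgets(buf, sizeof(buf), f) != NULL)
            printf("%s: %s", path, buf);
        fclose(f);
    }

    int main(void)
    {
        /* Hardware threads sharing a core (and hence its L1 caches). */
        show("/sys/devices/system/cpu/cpu0/topology/thread_siblings_list");
        /* Which CPUs share cpu0's caches at each level. */
        show("/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list");
        show("/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list");
        return 0;
    }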

3.2 Cache Operation at High Level
To understand the costs and savings of using a cache we have to combine the knowledge about the machine architecture and RAM technology from Section 2 with the structure of caches described in the previous section.

By default all data read or written by the CPU cores is stored in the cache. There are memory regions which cannot be cached but this is something only the OS implementers have to be concerned about; it is not visible to the application programmer. There are also instructions which allow the programmer to deliberately bypass certain caches. This will be discussed in Section 6.
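
To give a flavor of the cache-bypassing instructions discussed in Section 6, here is a sketch using the SSE2 non-temporal store intrinsic; it assumes an x86 compiler with SSE2 support, and the function name and interface are illustrative:

    #include <stddef.h>
    #include <emmintrin.h>

    /* Zero a buffer without polluting the caches; n is the number
       of 16-byte units and dst must be 16-byte aligned. */
    void clear_nt(__m128i *dst, size_t n)
    {
        const __m128i zero = _mm_setzero_si128();
        for (size_t i = 0; i < n; ++i)
            _mm_stream_si128(&dst[i], zero);  /* store, bypassing the cache */
        _mm_sfence();  /* order the non-temporal stores */
    }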

If the CPU needs a data word the caches are searched first. Obviously, the cache cannot contain the content of the entire main memory (otherwise we would need no cache), but since all memory addresses are cacheable, each cache entry is tagged using the address of the data word in the main memory. This way a request to read or write to an address can search the caches for a matching tag. The address in this context can be either the virtual or physical address, varying based on the cache implementation.

Since the tag requires space in addition to the actual memory, it is inefficient to choose a word as the granularity of the cache. For a 32-bit word on an x86 machine the tag itself might need 32 bits or more. Furthermore, since spatial locality is one of the principles on which caches are based, it would be bad to not take this into account. Since neighboring memory is likely to be used together it should also be loaded into the cache together. Remember also what we learned in Section 2.2.1: RAM modules are much more effective if they can transport many data words in a row without a new CAS or even RAS signal. So the entries stored in the caches are not single words but, instead, “lines” of several contiguous words. In early caches these lines were 32 bytes long; now the norm is 64 bytes. If the memory bus is 64 bits wide this means 8 transfers per cache line. DDR supports this transport mode efficiently.
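
To make this concrete: with 64-byte lines the low 6 bits of an address select the byte within a line, and the remaining bits identify the line itself (the cache implementation details in Section 3.3 refine this split further). A minimal sketch, with an arbitrary example address:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uintptr_t addr = 0x12345678;              /* arbitrary example address */
        uintptr_t offset = addr & 63;             /* byte within the 64-byte line */
        uintptr_t line   = addr & ~(uintptr_t)63; /* address of the line; this is
                                                     what the tag identifies */
        printf("line %#lx, offset %lu\n",
               (unsigned long)line, (unsigned long)offset);
        return 0;
    }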