Memory Hierarchy¶
Learning Objectives: Organizing computer system memory hierarchically and implementing virtual memory.
In the first lecture, we introduced the general concept of a computer's memory hierarchy. At the top of this hierarchy are the processor's internal memory areas, such as registers, caches (and the stack). The closer the memory is to the processor, the faster it is, but the less of it there is, and the more expensive it is (per bit). The goal is to store data as close to the processor as possible to maximize performance.
In modern processors, the delays at different levels of the hierarchy are distributed approximately as follows:
| Memory Type | Location | Latency (~clock cycles) |
| --- | --- | --- |
| 1. CPU Registers | CPU | 0 |
| 2. TLB Registers | CPU | 0 |
| 3. L1 Cache: memory blocks (64-bit) | CPU | 4 |
| 4. L2 Cache: memory blocks (64-bit) | CPU | 10 |
| 5. L3 Cache: memory blocks (64-bit) | CPU/External | 50 |
| 6. Virtual Memory | RAM | 200 |
| 7. I/O Buffers (device caches) | RAM | 200 |
| 8. Hard Drive | Disk Controller | 100,000 |
It is noteworthy that even within caches, there are order-of-magnitude differences in latency. This is because, in modern processors, the L1 and L2 caches are typically integrated into the processor, while the L3 cache is external to the processor. This is also reflected in the size of the caches, ranging from (tens of) kilobytes to a few megabytes (L3). Typically, embedded systems use caches smaller than 1KB to keep microcontroller costs low.
Principle of Locality¶
In this course, memory addressing has so far been either direct or indirect: either the program specifies a memory address directly, or the processor calculates the address before accessing main memory.
Now, when examining memory access patterns, it has been observed that memory accesses adhere to the principle of locality (temporal and spatial locality). This means that, at any given time, a program only needs a small part of the memory allocated to it, and memory accesses tend to target these areas. This locality can be observed in two dimensions:
- Temporal locality: the same memory addresses are accessed repeatedly within a short period of time.
- Spatial locality: memory accesses often target addresses that are close to one another.
-> It is therefore beneficial to keep only the most frequently accessed parts of memory in the cache.
When a program makes a memory access, the memory hierarchy is checked to see if the desired data is already in the cache. Once the data is found, an entire block of data is moved through the hierarchy closer to the processor. Since different levels of the hierarchy have varying amounts of memory, the block size decreases, and its content becomes more specific in line with the locality principles. Moving a block closer to the processor implements both temporal locality (keeping the most recent data close) and spatial locality (accessing nearby addresses).
Example: Code and data for loop structures would ideally be entirely in a nearby cache!
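As a minimal illustration in C (a sketch, not from the course material), the loop below exhibits both forms of locality: the counter and accumulator are reused on every iteration (temporal locality), and the array elements are read from consecutive addresses (spatial locality), so one fetched cache block serves several iterations.

```c
#include <stddef.h>

/* Summing an array: 'sum' and 'i' are reused on every iteration
 * (temporal locality), while a[i] walks through consecutive addresses
 * (spatial locality), so a fetched cache block typically serves
 * several subsequent iterations. */
long sum_array(const int *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```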
Hit Rate¶
To organize memory hierarchically, computational parameters can be used. A common metric is the hit rate (denoted h), which is the ratio of memory accesses that hit the cache to the total number of memory accesses. For example, a hit rate h = 95% means that 95% of memory accesses find the desired data in the cache, while 5% must fetch it from lower levels of the hierarchy. A higher hit rate means fewer slow memory accesses and faster program execution.

From the hit rate h, the average memory access time (AMAT) can be calculated as

AMAT = Tcycle + (1 - h) * Tmiss

where Tcycle is the clock cycle time and Tmiss is the (maximum) fetch time from memory. For example, if the clock cycle is 1ns and the memory read time is 20ns with a hit rate of 95%:

AMAT = 1ns + (1 - 0.95) * 20ns = 2ns
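The same calculation can be written as a small helper function (a sketch; the function name and the use of nanoseconds are illustrative only):

```c
/* Average memory access time: every access pays the cache access time,
 * and the fraction of misses additionally pays the miss penalty. */
double amat_ns(double t_cycle_ns, double t_miss_ns, double hit_rate)
{
    return t_cycle_ns + (1.0 - hit_rate) * t_miss_ns;
}

/* amat_ns(1.0, 20.0, 0.95) evaluates to 2.0, matching the example above. */
```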
Memory Addressing¶
Previously, we mentioned that the processor places a memory address on the address bus, and the corresponding word is fetched from memory. This is called linear memory addressing, where the address is used as is.
In modern computers, memory addressing often uses paging. The idea is that paging allows addressing a larger memory space than the address width of the memory bus alone permits. Memory is therefore divided into fixed-size pages (at most the size of the directly addressable space). Unused or additional address bits indicate which page contains the data. These bits are supplied using registers and dedicated memory locations and are recorded in a page table. More on page tables shortly.
Example: In x86-based computers, memory is segmented into blocks with specific purposes. A set of segment registers (CS-GS, as shown in the introductory material) holds the memory addresses indicating where each block begins. For instance, the CS segment (code segment) holds program code, and the DS segment (data segment) holds program data.
A paged memory address consists of three parts:
- Page directory (location in memory)
- Page table number/address
- Word address within the page.
Here, address resolution works as follows: first, the directory entry provides the address of the page table; then the page table entry gives the page's base address; finally, the specific memory location on the page is accessed. All these blocks can be scattered across physical memory.
In the figure, a 32-bit (linear) address is divided such that the page directory index uses 10 bits, the page's location in the page table uses another 10 bits, and the word's location on the page uses 12 bits. The total memory size would then be 2^32 = 4GB, with 2^20 = 1,048,576 pages, and each page would be 2^12 = 4096B in size.

(The processor's control logic could already infer from the components of the memory address whether the data is in the cache, as caches might, for instance, hold entire tables/blocks.)
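As a sketch of the 10 + 10 + 12 bit split described above (the structure and helper names are illustrative, not taken from any real MMU):

```c
#include <stdint.h>

/* Split a 32-bit linear address into the 10-bit page-directory index,
 * the 10-bit page-table index and the 12-bit offset within the page,
 * as in the example above. */
typedef struct {
    uint32_t dir_index;    /* bits 31..22: entry in the page directory */
    uint32_t table_index;  /* bits 21..12: entry in the page table */
    uint32_t page_offset;  /* bits 11..0:  word/byte within the page */
} paged_address;

paged_address split_address(uint32_t linear)
{
    paged_address a;
    a.dir_index   = (linear >> 22) & 0x3FF;
    a.table_index = (linear >> 12) & 0x3FF;
    a.page_offset =  linear        & 0xFFF;
    return a;
}
```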
Cache Organization¶
When handling memory, we aim to optimize the principles of locality, which is reflected in how the memory hierarchy is organized. Since the cache is always smaller in size than the level below it in the hierarchy, cache memory locations must be shared among accesses.
The figure illustrates the parts of a memory address from the perspective of cache organization (as presented in the course textbook). The rule is that caches are organized into sets, each containing a specified number of blocks, which in turn consist of a specified number of words.
- Tag: The block address/id, when the set contains 2^t blocks. A block is the smallest memory region fetched from main memory.
- Set: A group of memory blocks, when there are 2^s sets.
- Block Offset (index): The position of a word within a block, where a block contains 2^b words.
Example: In the 32-bit memory address illustrated above, the Set + Tag could be 10 + 10 bits, and the Block Offset 12 bits.
This structure of a memory address proves useful when using caches. The following explains methods of allocating cache memory locations among accesses.
Direct Mapping¶
The simplest addressing method is direct mapping, where the cache has 2^s sets and each set has only one cache block/memory location. In this case, the Tag bits identify which memory block is currently stored in that cache location.
The cache location's Valid bit indicates whether the cache block is in use. (Otherwise, default values of 0 for the memory address and data in the location could be misinterpreted as values fetched into the cache.)
Example: With two Tag bits, four different blocks can be distinguished. In the addresses below, the Set bits are always 000, so all four accesses map to the same cache location; when the cache is updated, only the Tag (and the block offset) changes.
| Tag | Set | Block |
| --- | --- | --- |
| 00 | 000 | 110 |
| 01 | 000 | 101 |
| 10 | 000 | 000 |
| 11 | 000 | 001 |
The advantage of direct mapping is that the set does not need to be searched to locate the block (see below); it can be directly calculated from the address. The downside, however, is that memory locations are shared among blocks, potentially increasing memory fetches and leaving some cache locations underutilized.
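A minimal sketch of a direct-mapped lookup, assuming the Tag/Set/Block-offset split described above (the sizes, names and word-addressed interface are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS    1024u   /* 2^s sets, one block per set */
#define BLOCK_WORDS 16u     /* 2^b words per block */

typedef struct {
    bool     valid;              /* is this cache location in use? */
    uint32_t tag;                /* which block currently occupies the set */
    uint32_t words[BLOCK_WORDS]; /* cached data */
} cache_line;

static cache_line cache[NUM_SETS];

/* Direct mapping: the set is computed straight from the (word) address,
 * so only one cache location needs to be checked. */
bool lookup(uint32_t word_addr, uint32_t *word_out)
{
    uint32_t offset = word_addr % BLOCK_WORDS;
    uint32_t set    = (word_addr / BLOCK_WORDS) % NUM_SETS;
    uint32_t tag    = (word_addr / BLOCK_WORDS) / NUM_SETS;

    if (cache[set].valid && cache[set].tag == tag) {
        *word_out = cache[set].words[offset];
        return true;    /* hit */
    }
    return false;       /* miss: the block must be fetched from the next level */
}
```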
Associative Cache¶
In an associative cache, a set has multiple cache locations for blocks, and a block can be placed in any of these locations.
Such a cache is generally called an E-way set-associative cache; the name is based on the number of blocks per set, E. In the figure above, there are two blocks per set, so E = 2, making this a 2-way set-associative cache.

Now, to retrieve data from the cache, all cache locations in the set must be searched to locate the block. In real implementations, there could be up to 1024 blocks.
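Compared to the direct-mapped sketch above, the lookup differs only in that all E ways of the selected set must be examined (again an illustrative sketch, not a real implementation):

```c
#include <stdbool.h>
#include <stdint.h>

#define E 2   /* blocks per set: a 2-way set-associative cache */

typedef struct {
    bool     valid;
    uint32_t tag;
} way_t;

typedef struct {
    way_t ways[E];   /* E possible locations for a block */
} cache_set;

/* All E ways of the set are searched for a matching, valid tag. */
bool set_lookup(const cache_set *set, uint32_t tag, int *way_out)
{
    for (int w = 0; w < E; w++) {
        if (set->ways[w].valid && set->ways[w].tag == tag) {
            *way_out = w;   /* hit in way w */
            return true;
        }
    }
    return false;           /* miss: a victim way must be chosen */
}
```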
Fully Associative Cache¶
Direct mapping can be thought of as a 1-way set-associative cache because there is only one possible cache location per set.
The other extreme is the fully associative cache, where a single set contains all cache locations. In such a cache, there are no Set bits in the memory address—only Tag bits. During memory access, every cache location must be searched to find the desired block.
One reason this method is still used today is that searches over such a memory can be optimized, for example in software with search algorithms. More on this shortly under virtual memory.
Cache Updates¶
If the desired data is not found in the cache, a cache miss occurs. The data must then be fetched from the next level of cache or, eventually, from main memory.
Fetching data updates the cache, which can be done in various ways:
- In direct mapping, the targeted block is always overwritten.
- In associative caches, updating is more complex. Selecting which cache block to overwrite aims to optimize the principle of locality. The methods include:
* Empty block: If there are free blocks in the set, they are filled first with new data.
* Random: A block is selected for overwriting randomly.
* Least Frequently Used (LFU): The block with the fewest accesses is overwritten. This requires bookkeeping.
* Least Recently Used (LRU): The block that has not been accessed for the longest time is overwritten. This also requires bookkeeping (see the sketch after this list).
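A minimal sketch of LRU victim selection, assuming each way stores a last-access timestamp (the bookkeeping scheme is illustrative; real caches typically use cheaper approximations):

```c
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4   /* blocks per set in this illustration */

typedef struct {
    bool     valid;
    uint64_t last_access;   /* bookkeeping: time of the latest access */
} way_state;

/* Pick the way to overwrite: an empty way first,
 * otherwise the least recently used one. */
int choose_victim(const way_state set[WAYS])
{
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid)
            return w;                                    /* empty block: fill it first */
        if (set[w].last_access < set[victim].last_access)
            victim = w;                                  /* older access -> better victim */
    }
    return victim;
}
```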
Before fetching new data into the cache, the existing data in the location (if any) may need to be written back to main memory because the program may have updated it. To manage this, cache locations have several bits that indicate the status of the location. The Valid bit above is one example, but there could also be bits indicating changes to the content or the freshness of the data.
To write updated data from the cache to lower levels of the hierarchy, two methods are used (a small sketch follows the list below):
- Write-through: Data written to the cache is simultaneously written through all levels of the hierarchy.
* The advantage is simple implementation.
* The disadvantage is significantly longer write times. Writing to mass storage (e.g., hard drives) can be up to 100,000 times slower (although buffers can mitigate this for mass storage).
- Write-back: Data is written back to memory only when the corresponding cache block is replaced (evicted).
* The advantage is faster cache writes, and updating multiple data items in a block requires only one write operation to memory.
* However, the contents of the cache and the lower levels of the memory hierarchy can temporarily differ.
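A hedged sketch contrasting the two write policies (the memory_write helper, the single-word line and the dirty flag are illustrative only):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    bool     dirty;   /* write-back: block differs from the lower level */
    uint32_t tag;
    uint32_t data;    /* one word stands in for a whole block here */
} line_t;

/* Assumed to exist: a (slow) write to the next level of the hierarchy. */
void memory_write(uint32_t addr, uint32_t value);

/* Write-through: the cache and the lower level are updated together. */
void write_through(line_t *line, uint32_t addr, uint32_t value)
{
    line->data = value;
    memory_write(addr, value);   /* every write also goes to memory */
}

/* Write-back: only the cache is updated; the block is written to memory
 * later, when it is evicted, if the dirty flag is set. */
void write_back(line_t *line, uint32_t value)
{
    line->data  = value;
    line->dirty = true;          /* remember to write the block back on eviction */
}
```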
Implementations¶
In modern PCs, the two upper levels of the cache hierarchy, L1 and L2, are typically integrated into the processor, while the L3 cache is built using separate chips. The L3 cache is significantly larger than the higher levels.
The figure shows the general cache structure of Intel's Itanium processors.
In Intel's Xeon processor architectures, the L2 cache has its own parallel high-speed memory bus alongside the system bus. Since the L1 cache is integrated into the processor, the L2 cache is also made faster.
Virtual Memory¶
In modern computers, programs do not actually see where their memory is physically located, and the memory area can be larger than the part of main memory allocated to the program. This kind of memory space is called virtual memory. Virtual memory consists of physical main memory and external mass storage, such as a hard drive, organized using paging and caches so that programs perceive the memory space as contiguous.
Virtual memory uses virtual memory addresses, which are first translated into physical memory addresses before the memory is accessed. The operating system's services manage virtual memory addresses and perform memory operations, or processors may have a dedicated MMU (Memory Management Unit) as part of bus control to handle memory operations and manage the memory space.
The advantages include:
- Virtual memory allows programs to access a larger memory area than what is physically available in main memory.
- Naturally, virtual memory adheres to the principles of locality but on a much larger scale.
- Main memory can be shared among running programs based on their resource requirements.
- This is helpful because programs generally need only a small portion of their allocated memory space at a given time.
Memory Addressing¶
Virtual memory addressing is generally a slow operation because data must be fetched via caches and, in the worst case, from mass storage. Memory management can be done collaboratively by hardware and firmware, as the default assumption is that memory fetches are slow, leaving enough time for complex optimization operations!
Virtual memory is divided into pages, much like main memory. The organization of virtual memory is fully associative, meaning that a page can be placed anywhere in main memory. Translating a virtual memory address to a physical one is done using an address translation mechanism, where the upper bits of the address identify the target page in physical memory.
Example: Assume a 32-bit virtual memory address where the offset within a page is 12 bits, meaning the page size is 2^12 B = 4KB. The remaining 20 bits of the address indicate the memory page (i.e., set + tag). Thus, the total virtual memory size would be 2^20 * 4KB = 4GB. The physical part of the virtual memory could consist of, for example, 2^18 pages, i.e. 2^18 * 4KB = 1GB, in memory chips or mass storage.

Fast Memory Addressing¶
The speed of inherently slow virtual memory accesses improves when the addressed memory is already present in the caches or main memory. According to the principle of locality, frequently used virtual-to-physical address translations should be stored in advance in a table called the page table.
However, even when using page tables, the problem persists that each memory access requires three lookups: one for the page directory, one for the page table, and one for the physical memory.
Virtual memory address translation can be accelerated by storing previously translated addresses in a register or a cache called the TLB (Translation Lookaside Buffer). When the translated address is found in the TLB, the other cache lookups are unnecessary.
Since the TLB is a cache, it must have a Valid bit to indicate the state of the data in the memory location. Additionally, a D bit (dirty bit) indicates whether the corresponding virtual memory page has been modified by the program, in which case it must be written back before a new page is loaded in its place.
The translation process for virtual-to-physical memory using the TLB is as follows:
- Check whether the translation of the virtual memory page is found in the TLB register.
- If the translation is found, retrieve the data from the specified physical memory address.
- If the translation is not in the TLB but is in the page table, update the TLB during the lookup.
- If the page is not in the page table either, the page must be fetched via a page fault exception from mass storage.
Entries are replaced according to the LRU principle: the TLB and the page tables keep reference bits as bookkeeping to indicate whether and when a page was last accessed.
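A minimal sketch of the lookup flow above (the types and helper functions are hypothetical; in real systems the MMU performs this in hardware):

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64

typedef struct {
    bool     valid;   /* entry in use */
    bool     dirty;   /* page modified since it was loaded */
    uint32_t vpn;     /* virtual page number */
    uint32_t ppn;     /* physical page number */
} tlb_entry;

static tlb_entry tlb[TLB_ENTRIES];

/* Assumed to exist: page-table walk and page-fault handling. */
bool page_table_lookup(uint32_t vpn, uint32_t *ppn);
uint32_t handle_page_fault(uint32_t vpn);   /* fetch the page from mass storage */

uint32_t translate(uint32_t vaddr)
{
    uint32_t vpn    = vaddr >> 12;     /* 4KB pages: upper 20 bits */
    uint32_t offset = vaddr & 0xFFF;
    uint32_t ppn;

    /* 1. Check whether the translation is already in the TLB. */
    for (int i = 0; i < TLB_ENTRIES; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].ppn << 12) | offset;

    /* 2. Not in the TLB: consult the page table and update the TLB. */
    if (!page_table_lookup(vpn, &ppn))
        ppn = handle_page_fault(vpn);  /* 3. not in the page table: page fault */

    tlb[0] = (tlb_entry){ .valid = true, .dirty = false,
                          .vpn = vpn, .ppn = ppn };   /* replacement policy omitted */
    return (ppn << 12) | offset;
}
```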
For writes, virtual memory uses the write-back method due to the slowness of mass storage, meaning the data is kept in a buffer and written to physical memory in bulk.
Memory Technologies¶
Several different memory technologies are currently used to implement the various levels of the memory hierarchy.
- SRAM (Static Random Access Memory)
- The memory access time is fixed and very close to the clock cycle time.
- Memory cells (i.e., bits) consist of a 6-8 transistor flip-flop that retains the bit value as long as the device is powered.
- DRAM (Dynamic Random Access Memory)
- DRAM memory cells consist of a transistor and a capacitor. DRAM is therefore cheaper and the memory chips are denser.
- The value of a bit is retained using the capacitor's charge, but since the charge dissipates over time, the cells must be refreshed at regular intervals. Refreshing involves reading the bit value into a buffer and then writing it back. This process is time-consuming for each bit in large memories, so refreshing is performed row-by-row in memory cells.
- Refreshing actually speeds up DRAM usage because the bits can now be read from the buffer instead of being fetched directly from memory.
- This is also beneficial for locality since adjacent bits can be read at once.
- SDRAM (Synchronous DRAM): Here, the processor and memory chip are synchronized with a separate clock, and data transfer can occur in bursts, meaning larger amounts of data are transferred at once.
- Flash: Based on EEPROM technology.
- Flash memory cells have a maximum number of writes they can endure, e.g., 100,000.
- Often used as program memory in embedded systems.
- Hard Drives:
- Seek time: Typically 3-13ms.
- Data read and write speeds depend on factors like rotational speed.
- Transfer rates are usually 100-200MB/s but can reach up to 750MB/s with an internal cache (as of 2012).
Implementations¶
Traditionally, the microchips of computers are placed side by side on a circuit board, making the board literally two-dimensional. A consequence of this is that the connections between microchips are often long, which, in modern microchip technologies, can already act as a limiting factor.
In memory chips, the newer HBM technology (High-Bandwidth Memory) stacks memory chips vertically so that connections run "vertically" through the memory chips. This arrangement saves circuit board space and achieves a wider memory bus, allowing for lower clock speeds.
Additional bibliography¶
Please refer to the course book Bryant & O'Hallaron, Computer Systems: A Programmer's Perspective, 3rd edition. Chapters 6 and 9.
Conclusion¶
"We are ... forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible."
- Burks, Goldstine & von Neumann (1946)
While the idea of caches was introduced with the first modern computers in the 1940s, memory management is a highly complex operation in modern processors, involving various solutions distributed across the operating system, firmware, and hardware.