Cache overview

@MJ · 7 min read
Created Date · Aug 21, 2024 12:08
Last Update · Oct 20, 2024 18:10

contents: 0-1. CA Intro

Memory hierarchy

Locality

  • Temporal locality

    - Items accessed recently are likely to be accessed again soon

  • Spatial locality

    - Items near those accessed recently are likely to be accessed soon
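
Both kinds of locality show up in even a simple loop. A minimal Python sketch (illustrative only, not from the notes):

```python
data = list(range(1024))   # elements stored contiguously in memory

total = 0
for x in data:
    # Temporal locality: `total` is accessed again on every iteration.
    # Spatial locality: each `data` element sits right next to the one
    # accessed on the previous iteration.
    total += x
```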


Memory structure

  • SRAM (cache memory attached to CPU)

    - Fastest (0.5ns ~ 2.5ns)

    - most expensive

    - smallest

  • DRAM (main memory)

    - Faster (50ns ~ 70ns)

    - more expensive

    - smaller

  • Disk (HDD, SSD)

    - Slowest (5ms ~ 20ms)

    - cheapest

    - largest


Use hierarchy

  • Copy recently accessed (and nearby) items from disk to smaller DRAM memory

  • Copy more recently accessed (and nearby) items from DRAM to smaller SRAM memory


Term

  • Block

    - The minimum unit of information

    - It can be either present or not present in a cache

  • Hit

    - Accessed data is present

    - Hit ratio: # of hits / # of accesses

  • Miss

    - Accessed data is absent

    - Block is copied from lower level (Additional time taken)

    - Miss ratio: # of misses / # of accesses (= 1 - hit ratio)


Direct mapped cache

Each memory location can be mapped directly to exactly one location in the cache

(figure: direct-mapped cache mapping)

  • Cache address = (Block address) modulo (# of blocks in cache)

  • The number of blocks in a cache is a power of 2 (e.g., 2, 4, 8, 16, 32, ...)

  • The cache address is determined by the low-order bits of block address

  • Tags contain the address information of the data (the high-order bits of the address)

  • To avoid using meaningless information, add a valid bit for each cache block

    - Valid bit = 1 (the cache block contains valid information)

    - Valid bit = 0 (the cache block contains invalid information)

    - Initially, the valid bits of all cache blocks are set to 0
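
To make the mapping concrete, here is a minimal Python sketch of a direct-mapped lookup with a valid bit and a tag compare (the cache size and the addresses are made-up values for illustration):

```python
NUM_BLOCKS = 8                        # must be a power of 2

valid = [0] * NUM_BLOCKS              # all valid bits start at 0
tags = [None] * NUM_BLOCKS

def access(block_addr):
    index = block_addr % NUM_BLOCKS   # low-order bits of the block address
    tag = block_addr // NUM_BLOCKS    # high-order bits of the block address
    if valid[index] and tags[index] == tag:
        return "hit"
    valid[index] = 1                  # miss: copy the block from the lower
    tags[index] = tag                 # level and record its tag
    return "miss"

# 1, 9, and 17 all map to index 1, so they keep evicting each other:
print([access(a) for a in (1, 9, 1, 17, 1)])   # all misses
```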


Cache address

  • 32-bit addresses

  • Num of cache blocks: 2^n blocks (the lowest n bits of the block address are used for the index)

  • Block size: 2^m words (2^{m+2} bytes)

    - m bits are used for the word within the block, 2 bits are used for the byte within the word
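
As a sketch, the bit slicing can be written out directly (the function name and the example address are illustrative):

```python
def split_address(addr, n, m):
    """Split a 32-bit byte address for a cache with 2^n blocks of 2^m words."""
    byte_offset = addr & 0b11                    # 2 bits: byte within word
    word_offset = (addr >> 2) & ((1 << m) - 1)   # m bits: word within block
    index = (addr >> (2 + m)) & ((1 << n) - 1)   # n bits: cache index
    tag = addr >> (2 + m + n)                    # remaining high-order bits
    return tag, index, word_offset, byte_offset

print(split_address(0x12345678, n=10, m=2))
```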

(figure: 32-bit address split into tag, index, word offset, and byte offset)

Cache size

(figure: cache size calculation)

Cache size

= Cache table size

= Num of cache blocks × (valid bit length + tag length + block size (data length))

  (with 32-bit addresses and the fields above: tag length = 32 - (n + m + 2))

Practice 1

  • 32-bit addresses

  • Num of cache blocks: 2^{10} blocks

  • Block size: 2^0 words (2^2 bytes)

Cache size

= 2^{10} × (1 + (32 - (10 + 0 + 2)) + 32)

= 2^{10} × 53 bits



Practice 2

  • 32-bit addresses

  • Num of cache blocks: 64 blocks

  • Block size: 4 words

Cache size

= 2^6 × (1 + (32 - (6 + 2 + 2)) + 128)

= 2^6 × 151 bits
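
A small helper that encodes the formula, checked against both practices (assuming 32-bit addresses and 32-bit words; the function name is mine):

```python
def cache_size_bits(n, m, addr_bits=32):
    num_blocks = 2 ** n
    data_bits = (2 ** m) * 32             # block size in bits (32-bit words)
    tag_bits = addr_bits - (n + m + 2)    # address bits left for the tag
    return num_blocks * (1 + tag_bits + data_bits)   # 1 = valid bit

print(cache_size_bits(n=10, m=0))   # Practice 1: 2^10 * 53 = 54272
print(cache_size_bits(n=6, m=2))    # Practice 2: 2^6 * 151 = 9664
```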


More about

  • If we increase the size of blocks, this may help reduce miss rate due to spatial locality

  • But, larger blocks -> a smaller number of cache blocks -> more competition for each block -> increased miss rate

  • Larger blocks also increase the miss penalty (the time for copying from the lower level)


Handling cache misses

On a cache hit, the CPU proceeds normally (requiring 1 clock cycle).

But on a cache miss, the control unit of the CPU:

  • Step 1: stalls the CPU pipeline

  • Step 2: copies a block from the next level of hierarchy (e.g., memory)

  • Step 3: resumes the stalled task

    - Restarts instruction fetch (IF stage) if the cache miss happened when fetching an instruction from the instruction memory

    - Completes data access (MEM stage) if the cache miss happened when loading data from the data memory
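
In a toy timing model, these steps collapse to "add the miss penalty, then redo the access" (MISS_PENALTY and the structure here are assumptions for illustration):

```python
MISS_PENALTY = 100   # assumed cycles to copy a block from the next level
cache = {}           # index -> tag; a toy stand-in for the valid/tag arrays

def timed_access(block_addr, num_blocks=8):
    index, tag = block_addr % num_blocks, block_addr // num_blocks
    if cache.get(index) == tag:
        return 1                  # hit: completes in the normal cycle
    cache[index] = tag            # miss: stall, copy the block,
    return 1 + MISS_PENALTY       # then redo the stalled fetch/access
```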


Handling writes

When will the newly-updated data in the cache be written to the lower-level memory (e.g., main memory)?

  • Write-through: Update both cache and lower-level memory at the same time

  • Write-back: Just update cache

    - Keep track of which blocks are dirty (modified)

    - When a dirty block is replaced, write it back to the lower-level memory

  • Write-through

    - Pros: consistency between cache and memory is guaranteed; easy to implement

    - Cons: slow write speed

  • Write-back

    - Pros: fast write speed (if there is no replacement)

    - Cons: consistency between cache and memory is not guaranteed; complex to implement
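
A toy sketch contrasting the two policies with a dirty bit (single-level model; the names are mine, not a real API):

```python
memory = {}     # lower-level memory: addr -> value
cache = {}      # assume the block is already present in the cache
dirty = set()   # write-back bookkeeping

def write_through(addr, value):
    cache[addr] = value
    memory[addr] = value     # update the lower level at the same time (slow)

def write_back(addr, value):
    cache[addr] = value
    dirty.add(addr)          # just mark the block dirty (fast)

def on_replace(addr):
    if addr in dirty:        # a dirty block is written back only when replaced
        memory[addr] = cache[addr]
        dirty.discard(addr)
    cache.pop(addr, None)
```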

Write-through with write buffer

To reduce the delay of write operations on the write-through method,
we can use a write-buffer (much faster to access than memory)

  • A write buffer holds data waiting to be written to lower-level memory

    - Step 1: Update both cache and write buffer

    - Step 2: The processor continues the program execution without waiting

        - The memory takes the updated data from the write buffer

  • If the buffer is full, the processor must stall until there is an empty position

The write buffer can also be used to improve the performance of write-back

  • When a dirty block is replaced, it can be put into the write buffer instead of being written to the lower-level memory immediately
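
A minimal FIFO write-buffer sketch (the size and the stall behavior are assumptions for illustration):

```python
from collections import deque

BUFFER_SIZE = 4
write_buffer = deque()

def buffered_write(addr, value, cache, memory):
    if len(write_buffer) == BUFFER_SIZE:
        drain_one(memory)                  # buffer full: the processor stalls
    cache[addr] = value                    # step 1: update cache and buffer
    write_buffer.append((addr, value))     # step 2: continue without waiting

def drain_one(memory):
    addr, value = write_buffer.popleft()   # memory takes data from the buffer
    memory[addr] = value
```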

Handling write misses

If the requested block is not in the cache, a write miss occurs

  • Write-allocate

    - First, fetch the block to cache

    - And then, handle the write operation

  • Write around

    - Just update the portion of the block in lower-level memory, but do not put it in the cache

    - It is good when we need to initialize memory space (the written data will not be read again soon, so caching it has no benefit)
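
The two write-miss policies as a sketch (toy model, illustrative names):

```python
def write_allocate(addr, value, cache, memory):
    if addr not in cache:
        cache[addr] = memory.get(addr)   # first fetch the block into the cache
    cache[addr] = value                  # then handle the write

def write_around(addr, value, cache, memory):
    if addr in cache:
        cache[addr] = value
    else:
        memory[addr] = value             # update memory only; skip the cache
```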


Real-world example

Embedded MIPS processor with 12-stage pipeline

  • Split cache: I-cache (for instructions) and D-cache (for data)

  • Each 16KB: 256 blocks × 16 words per block (256 × 16 words × 4 bytes/word = 16KB)

(figure: split I-cache and D-cache organization)


Cache performance

CPU time = # of clock cycles × clock period

= (CPU execution clock cycles + memory-stall clock cycles) × clock period

  • CPU execution clock cycles

    - The clock cycles that CPU spends executing the program

    - Cache hit time is included

  • Memory-stall clock cycles

    - The clock cycles that CPU spends waiting for the memory access

    - Mainly from cache misses


Memory-stall clock cycles

Simplifying assumption: the read and write miss penalties are the same

Num of memory accesses × Miss rate × Miss penalty

= Num of misses × Miss penalty

In MIPS, we have two different caches: instruction (I-cache) and data (D-cache)

Num of memory accesses × Miss rate × Miss penalty

= Num of instruction memory accesses × I-cache miss rate × I-cache miss penalty

+ Num of data memory accesses × D-cache miss rate × D-cache miss penalty
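
The same formula as a helper, plugged with the numbers from Practice 1 below (the instruction count I is an assumed placeholder):

```python
def memory_stall_cycles(i_accesses, i_miss_rate,
                        d_accesses, d_miss_rate, miss_penalty):
    return (i_accesses * i_miss_rate * miss_penalty
            + d_accesses * d_miss_rate * miss_penalty)

I = 1_000_000   # assumed instruction count
# Every instruction is fetched; 36% of instructions also access data.
print(memory_stall_cycles(I, 0.02, 0.36 * I, 0.04, 100) / I)   # 3.44 per instr.
```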


Practice 1

  • Base CPI (on cache hit) = 2

  • Instruction-cache miss rate = 2%

  • Data-cache miss rate = 4%

  • Miss penalty = 100 cycles

  • Load & stores are 36% of instructions


Miss-stall clock cycles (when there are I instructions)

  • For instructions: I × 0.02 × 100 = 2 × I

  • For data: I × 0.36 × 0.04 × 100 = 1.44 × I


CPU time

  • Actual CPU time: (2 × I + 2 × I + 1.44 × I) × clock period = 5.44 × I × clock period

  • Ideal CPU time (no cache misses = perfect cache): 2 × I × clock period


\frac{Ideal\;performance}{Actual\;performance} = \frac{Actual\;CPU\;time}{Ideal\;CPU\;time} = \frac{I \times actual\;CPI \times clock\;period}{I \times ideal\;CPI \times clock\;period} = \frac{actual\;CPI}{ideal\;CPI} = \frac{5.44}{2} = 2.72

    "The performance with the perfect cache is better by 2.72"


Practice 2

Suppose the processor is made faster, but the memory system is not

  • Base CPI (on cache hit) = 2 -> 1

  • Instruction-cache miss rate = 2%

  • Data-cache miss rate = 4%

  • Miss penalty = 100 cycles

  • Load & stores are 36% of instructions

             Memory-stall clock cycles   Memory-stall CPI
For I-cache  2 × I                       2
For D-cache  1.44 × I                    1.44

        CPU time        CPI
Actual  4.44 × I × CP   4.44
Ideal   1 × I × CP      1

    "The performance with the perfect cache is better by 4.44"

    But the absolute gap between the actual and ideal CPI stays the same at 3.44 (5.44 - 2 = 4.44 - 1), so cache misses cost relatively more as the processor gets faster
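
Reproducing both practices in a few lines makes this visible: the stall CPI is fixed at 3.44, so it dominates more as the base CPI shrinks:

```python
stall_cpi = 0.02 * 100 + 0.36 * 0.04 * 100     # 2 + 1.44 = 3.44

for base_cpi in (2, 1):                        # Practice 1, then Practice 2
    actual_cpi = base_cpi + stall_cpi
    print(actual_cpi, actual_cpi / base_cpi)   # 5.44 2.72, then 4.44 4.44
```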