Improving cache performance

contents: 0-1. CA Intro

Associative caches

Support more flexible placement of blocks 01 11.jpg

Fully associative cache

A cache structure in which a block can be placed in any location in the cache
Requires all entries to be searched

N-way set associative cache

A cache that has N locations where each block can be placed
Requires N entries to be searched
The set containing a memory block
(Block address) modulo (num of sets in the cache)
Direct-mapped cache = 1-way set associative cache
Full-associative cache = M-way set associative cache
(where M is num of cache blocks in the cache)

Associative cache address

32-addresses
num of cache blocks: $2^n$ blocks
num of sets in the cache: $2^s$ sets (each set contains $2^{n-s}$ blocks: $2^{n-s}$ -way set associative cache)
Block size: $2^m$ words ( $2^{m+2}$ bytes)

02 11.jpg

Implementation of 4-way set associative cache

03 9.jpg

How much associativity

Increasing associativity makes flexibility & complexity increase.

We have to find moderate amount of set size.

Block replacement policy

In set associative cache, if we have to make room for another block, one of blocks in a cache set must be removed

Find non-valid entries and replace one of them
If there are no non-valid entries, then choose one of valid entries in the set based on

- Least-recently used (LRU)

- Random

Multi-level caches

Use primary (level-1, L-1) and secondary (level-2, L-2) caches

L-1 cache: placed close to (attached to) CPU

- Small, but very fast

L-2 cache: placed a little bit far from CPU

- Accessed on L-1 cache misses

- Larger and slower than L-1 cache, but still faster than main memory

- Main memory is accessed on L-2 cache misses

Practice 1

CPU base CPI = 1
L-1 miss rate = 2%
Main memory access time = 100ns
Clock period = 0.25ns

CPI of a single-level cache system = CPU base CPI + memory stall CPI

= CPU base CPI + L-1 miss rate * main memory access clock cycles

= CPU base CPI + L-1 miss rate * main memory access time / clock period

= $1 + 0.02 * (100 * 10^{-9}) / (0.25 * 10^{-9})$

= $1 + 8 = 9$

Practice 2

CPU base CPI = 1
L-1 miss rate = 2%
Global miss rate = 0.5% (Both L-1 & L-2 are missed)
L-2 access time = 5ns
Main memory access time = 100ns
Clock period = 0.25ns

CPI of a 2-level cache system = CPU base CPI + memory stall CPI

= CPU base CPI + L-1 miss rate * L-1 miss penalty + global miss rate * L-2 miss penalty

= CPU base CPI + L-1 miss rate * L-2 access time / clock period + global miss rate * main memory access time / clock period

= $1 + 0.02 * (5 * 10^{-9}) / (0.25 * 10^{-9}) + 0.005 * (100 * 10^{-9}) / (0.25 * 10^{-9})$

= $1 + 0.4 + 2 = 3.4$

Considerations

Designing L-1 cache

Focus on minimizing hit time -> to yield a short clock period
(the MEM stage time & IF stage time are determined based on the L-1 cache hit time)

Designing L-2 cache

Focus on minimizing miss rate -> to avoid main memory

Therefore, in multilevel caches,

L-1 cache and its blocks size is smaller than a single-level cache

- to reduce hit time and miss penalty

L-2 cache is larger than a single-level cache

- to reduce miss rate