Notes for the IIIS Computer Architecture (《计算机系统结构》) course, Spring 2022, taught by Prof. Mingyu Gao (高鸣宇).
01: basic concepts
Von Neumann Architecture: separate processor/memory
Code Compilation & Translation: High-level(*.c) -> Assembly(*.s) -> Machine object(*.o) -> executable
Processors: realize the ISA (RISC vs. CISC)
Registers
Memory: Byte-Oriented Memory Accessing
- Big-endian & little-endian (most significant byte at lower / higher address, respectively)
Memory mapping: Stack (for procedure calls; the bottom sits at a high address, growing downward), Heap (dynamic data)
02: Metrics
Performance: latency(time for one task) & throughput(# tasks per unit time)
- Performance = (Execution time)$^{-1}$
- Speedup of X over Y = Perf$_X$ / Perf$_Y$.
- Execution time = #Cycles $\times$ CCT(clock cycle time)
Normally, we can use #Cycles to approximate exe. time.
CPI: cycles per instruction
- #Cycles = IC $\times$ CPI, IC = instruction count in one program
- Execution time = IC $\times$ CPI $\times$ CCT
- Average CPI = $\sum_i (\text{IC}_i / \text{IC}) \times \text{CPI}_i$, weighting each instruction class $i$ by its share of the instruction count.
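- Example (illustrative numbers): IC $= 10^9$, average CPI $= 2$, 1 GHz clock (CCT $= 1$ ns) $\Rightarrow$ execution time $= 10^9 \times 2 \times 1\,\text{ns} = 2$ s.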
Impact of memory, i.e., data accesses, can be reflected in CPI. Roofline model: attainable performance is bounded by peak compute and by memory bandwidth $\times$ arithmetic intensity.
Means of metrics: Amean for absolutes; Hmean for rates; Gmean for ratios
Power & energy efficiency
Scaling:
- Strong scaling: speedup on N processors with fixed total workload size
- Weak scaling: speedup on N processors with fixed per-processor workload size
- Limitation: imbalanced load; Amdahl's Law: if an optimization accelerates a fraction $f$ of a program by a factor of $S$, then speedup = $1/(1-f+f/S)$.
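- Worked example (illustrative numbers): $f = 0.8$, $S = 4$ $\Rightarrow$ speedup $= 1/(0.2 + 0.8/4) = 2.5$; even as $S \to \infty$, the speedup is capped at $1/(1-f) = 5$.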
03 RISC-V ISA and Assembly
RISC-V system state: PC, 32 registers `x0`~`x31` (with aliases), memory from 0 to MSIZE-1.
Special registers: `ra`, `tp`, `sp`, `gp`.
Some details:
- 12-bit signed offset in `lw` and `sw` instructions;
- 12-bit immediate in most other instructions;
- 20-bit immediate in `lui` and `auipc` for loading a whole immediate.
04 RISC-V encoding
Six instruction formats:
- R-format: register-register operations
- I-format: register-immediate operations, loads, jalr (12-bit imm, 11:0)
- S-format: stores (12-bit imm, 11:0)
- B-format: branches (12-bit imm, 12:1)
- J-format: jumps (20-bit imm, 20:1)
- U-format: upper immediate instructions (20-bit imm, 31:12)
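To make the field layout concrete, here is a minimal C sketch (illustrative, not from the lecture) that slices an R-format instruction word into its fields, following the bit positions above:

```c
#include <stdint.h>
#include <stdio.h>

/* Slice a 32-bit RISC-V instruction word into its R-format fields.
   0x00b50533 encodes "add a0, a0, a1" (a0 = x10, a1 = x11). */
int main(void) {
    uint32_t inst = 0x00b50533;
    unsigned opcode =  inst        & 0x7f;  /* bits 6:0   */
    unsigned rd     = (inst >> 7)  & 0x1f;  /* bits 11:7  */
    unsigned funct3 = (inst >> 12) & 0x07;  /* bits 14:12 */
    unsigned rs1    = (inst >> 15) & 0x1f;  /* bits 19:15 */
    unsigned rs2    = (inst >> 20) & 0x1f;  /* bits 24:20 */
    unsigned funct7 = (inst >> 25) & 0x7f;  /* bits 31:25 */
    printf("opcode=0x%x rd=x%u funct3=%u rs1=x%u rs2=x%u funct7=0x%x\n",
           opcode, rd, funct3, rs1, rs2, funct7);
    return 0;  /* prints: opcode=0x33 rd=x10 funct3=0 rs1=x10 rs2=x11 funct7=0x0 */
}
```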
ISA for parallelism
- SIMD: data-level parallelism
- Example: Intel x86 SSE and AVX Extensions
- SIMD intrinsics: use C functions instead of inline assembly to call SSE/AVX instructions (see the sketch after this list)
- MIMD: thread-level parallelism
- process and thread
- Shared-memory multi-core processors
- ISA: Atomic instructions (A extension)
- Another possibility: using locks
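As a concrete illustration of intrinsics (a minimal sketch, not the course's example), adding two float arrays four lanes at a time with SSE; `_mm_add_ps` performs four single-precision adds in one instruction:

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* Add two float arrays 4 elements at a time using 128-bit SSE registers. */
int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float c[8];
    for (int i = 0; i < 8; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);   /* load 4 floats (unaligned ok) */
        __m128 vb = _mm_loadu_ps(&b[i]);
        __m128 vc = _mm_add_ps(va, vb);    /* one instruction, 4 adds */
        _mm_storeu_ps(&c[i], vc);
    }
    for (int i = 0; i < 8; i++) printf("%.0f ", c[i]);
    printf("\n");
    return 0;
}
```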
05 Decoding & processor
Simple format of RISC-V $\to$ simple instruction decoding
Multi-level decoding: the opcode field selects the format, then funct3/funct7 select the exact operation
Datapath & control bits
Structure & critical path:
Pipelining: 5 stages IF/ID/EX/MEM/WB
06 Avoiding hazards and stalls
Stalls: a pipeline “bubble”, not moving forward in that cycle
- How to stall: insert `nop` instructions, or let the hardware stall as needed.
- Causes of stalls: hazards
Hazards:
- Structural hazard: A required resource is busy and cannot be used in this cycle
- different read/write ports; half-cycle register access
- cancel stage bypassing
- separate memories (IM, DM)
- let multi-stage instructions go
- Data hazard: must wait for previous dependent instructions to produce/consume data (typically RAW)
- Solution: forwarding
- EX, MEM, WB $\to$ EX, MEM
- forwarding control logic
- exchange the order of instructions to avoid stalls
- Control hazard: Next PC depends on previous instruction
- Resolving branch/jump earlier (in ID stage)
- Furthermore: predict the next PC (usually, PC+4)
- In case of misprediction, we need to nullify the operations and restart as soon as possible.
- Additional forwarding to ID for branches
07 Advanced Techniques
Branch prediction
Key idea of hardware-based prediction: predict based on recent history
Predicting direction:
- A hardware branch history table (BHT) records each branch’s history
- Each branch maintains a saturating counter to predict the next direction
- Counter +1 for an actually taken branch, -1 for not-taken
- 1-bit states: 0: predict not-taken, 1: predict taken
- 2-bit states: 00: strongly not-taken, 01: weakly not-taken, 10: weakly taken, 11: strongly taken
- Simple 2-bit BHT already achieves > 90% accuracy for most programs
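A tiny simulation of one 2-bit saturating counter (illustrative C sketch with a made-up outcome pattern) shows the hysteresis: a loop branch taken 7 of every 8 times mispredicts only once per period after warm-up:

```c
#include <stdio.h>

/* 2-bit saturating counter: 0,1 predict not-taken; 2,3 predict taken.
   Saturates at 0 and 3 instead of wrapping around. */
static int predict(int ctr) { return ctr >= 2; }          /* 1 = taken */
static int update(int ctr, int taken) {
    if (taken) return ctr < 3 ? ctr + 1 : 3;
    else       return ctr > 0 ? ctr - 1 : 0;
}

int main(void) {
    /* A loop branch: taken 7 times, then not-taken once, repeated. */
    int outcomes[16] = {1,1,1,1,1,1,1,0, 1,1,1,1,1,1,1,0};
    int ctr = 0, correct = 0;
    for (int i = 0; i < 16; i++) {
        if (predict(ctr) == outcomes[i]) correct++;
        ctr = update(ctr, outcomes[i]);
    }
    printf("accuracy: %d/16\n", correct);  /* prints 12/16: two warm-up
                                              misses, then one per period */
    return 0;
}
```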
Predicting target: Branch target buffer (BTB)
- Record previously executed branches and their most recent target addresses
- Such a hardware table is implemented using SRAM
- Address is branch instruction address, i.e., current PC
- Data is the most recent target, i.e., next PC
Overall workflow
- At IF stage, if BHT predicts to be not taken, then PC = PC + 4
- Else, use current PC to index into BTB
- If there is an entry, then PC = target address in BTB, else PC = PC + 4
Superscalar and Out-of-Order
ILP: Instruction-Level Parallelism = # instructions / critical path length
Superscalar execution: multiple instructions per cycle
Out-of-Order (OoO) Execution
- Use diversified functional units to finish instructions ASAP to avoid blocking
- Use inter-stage buffers for decoupling stages and reordering instructions
Stages:
- Fetch
- Decode: also read registers
- Dispatch/rename/allocate: resolve dependencies and prepare for OoO execution
- Issue/schedule: Determine whether instruction is ready to execute
- Execute: actual execution of instructions, e.g., ALU, memory access
- Commit/retire/write-back: Update architectural states (registers or memory)
- Each stage can be one or more cycles
08 Cache Design
Idea: Locality (Temporal & Spatial) in programs, both data and instructions; Faster access of a small subset. Cache: Implemented as a processor component (SRAM on chip), managed by hardware, multi-level.
- Miss Handling: stall the pipeline, fetch (and possibly replace) a block. $\text{CPI penalty} = \text{Miss rate} \times \text{Miss penalty}$. $\text{AMAT} = \text{Hit latency} + \text{Miss rate} \times \text{Miss penalty}$.
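- Sanity check with made-up numbers: hit latency 1 cycle, miss rate 5%, miss penalty 100 cycles $\Rightarrow$ AMAT $= 1 + 0.05 \times 100 = 6$ cycles.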
- Organization:
- Fully-associative: Tag = block address = byte address / block size (remove offset bits); high hit latency
- Direct-mapped: (typical) index = lower bits of block address; high miss rate
- $N$-way Set-Associative: each block can go to one of $N$ (associativity) entries in the cache (a set):
\(\text{Set Index} = (\text{Block Address}) \bmod (\text{\# Sets in cache}) = \text{lower bits of block address}.\)
- Extra components: valid bits, dirty bits (if necessary), replacement policy.
- Physical Organization & accessing
- Tag, index, offset
- Use index bits to locate a set
- For each entry in the set: Check if valid bit is 1, Compare if tag matches.
- If one such entry is found: hit; return data, using the offset to select bytes within the block
- If not found: miss; fetch the block from memory, return data, and replace
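A minimal C sketch of the tag/index/offset split, assuming an illustrative geometry of 64-byte blocks and 128 sets (6 offset bits, 7 index bits):

```c
#include <stdint.h>
#include <stdio.h>

/* Decompose a byte address for an assumed cache geometry:
   64-byte blocks, 128 sets (6 offset bits, 7 index bits). */
#define BLOCK_SIZE 64
#define NUM_SETS   128

int main(void) {
    uint64_t addr   = 0x12345678;
    uint64_t offset = addr % BLOCK_SIZE;   /* byte within block  */
    uint64_t block  = addr / BLOCK_SIZE;   /* block address      */
    uint64_t index  = block % NUM_SETS;    /* which set          */
    uint64_t tag    = block / NUM_SETS;    /* remaining tag bits */
    printf("tag=0x%llx index=%llu offset=%llu\n",
           (unsigned long long)tag, (unsigned long long)index,
           (unsigned long long)offset);
    return 0;
}
```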
Types (3C) of misses: Compulsory/cold, Capacity, Conflict
Thrashing: sequentially scanning a working set larger than the cache multiple times, so blocks are evicted before reuse due to insufficient capacity.
Replacement policies: random, LRU, LFU
- Implementation bits
- Goal: Belady’s optimal policy: evict the block used again furthest in the future. Match the program’s locality characteristics.
Write policies:
- Write-through: write to both the cache (if present) and memory; write-allocate vs. no-write-allocate.
- Write-back: write only to the cache; needs a dirty bit per block, written back at eviction; fetch the block if not in the cache.
Write buffer: a FIFO buffer between cache and memory; absorbs small write bursts; may also serve read misses to data still in the buffer.
Multi-level Caches: AMAT = L1 hit latency $+$ L1 miss rate $\times$ L2 AMAT $= \cdots$ (worked numbers below).
- Typical Structure: L1-D, L1-I, L2 unified per core, L3 shared last-level, Main memory
- Inclusive vs Exclusive vs Non-inclusive
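- Worked example (illustrative numbers): L1 hit 1 cycle, L1 miss rate 10%, L2 hit 10 cycles, L2 miss rate 20%, memory 100 cycles $\Rightarrow$ AMAT $= 1 + 0.1 \times (10 + 0.2 \times 100) = 4$ cycles.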
Private & Shared Caches: large caches are usually banked for lower hit latency and more parallelism.
Shared cache banking:
- Dynamic block placement: a block is preferably placed in the bank closest to the core that accesses it most, and may be migrated between banks; at any time a block can still be in only one bank; minimizes physical distance
Private cache coherence: a read access should return the most recently written value
- Cache coherence protocol: single-writer, multiple-reader (SWMR)
- MSI protocol: [Modified-Shared-Invalid]
- Extend the valid bit to represent 3 states
- Each private cache needs to be told the actions of others
- Snooping: All caches broadcast their misses through the interconnect
- Directory-based: Use a directory to track the state of each block
Non-Blocking Caches: hit-under-miss, miss-under-miss. Miss status holding register (MSHR): keeps track of an outstanding cache miss and the pending loads & stores referring to that block.
- Fields of an MSHR: valid bit, block address, multiple pending load/store entries (valid, type & format, offset, dst/src register)
- Operations
Prefetching: predict and fetch data into the cache before the processor requests it. Reduces cold misses; costs higher bandwidth consumption; requires OoO and non-blocking caches.
- Software prefetch instructions / hardware prefetchers (see the sketch after this list).
- Stream prefetching, strided prefetching (PC-based stride detection)
- Metrics: Accuracy, Coverage, Timeliness
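A sketch of the software side, using the GCC/Clang `__builtin_prefetch` builtin (the prefetch distance of 16 elements is an illustrative tuning knob, not a prescribed value):

```c
#include <stddef.h>

/* Software prefetching: hint the element a fixed stride ahead of use,
   so the memory access overlaps with computation on earlier elements. */
long sum(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 3);  /* read, high locality */
        s += a[i];
    }
    return s;
}
```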
Code-level optimization: loop reordering, loop blocking, false sharing
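For loop blocking, a standard tiling sketch (sizes `N` and `B` are illustrative; assumes `c` is zero-initialized): each $B \times B$ tile of `a`, `b`, and `c` is reused from the cache before being evicted.

```c
#define N 512
#define B 32   /* tile size, chosen so three B x B tiles fit in cache */

/* Blocked matrix multiply: the three outer loops walk over tiles,
   the three inner loops compute within a tile for cache reuse. */
void matmul_blocked(const double a[N][N], const double b[N][N],
                    double c[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; i++)
                    for (int k = kk; k < kk + B; k++)
                        for (int j = jj; j < jj + B; j++)
                            c[i][j] += a[i][k] * b[k][j];
}
```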
09 Main memory: DRAM
Hierarchy: CPU - Channel - Rank - Chip - Bank
Physical address $\to$ <channel x, rank y, bank z>
- Channels are fully independent, accessed with separate buses
- Burst transfer: double data rate, transferring on both clock edges.
- NUMA: non-uniform memory access, faster to access local channels.
- Ranks in one channel share I/O, but with independent access
- Each access goes to one rank by “ChipSelect” signal;
- Different ranks in one channel share command/address/data lines (I/O)
- Fully synchronous access across all chips in a rank, providing a wide data bus from narrow chip pins
- Chips in one rank share command/address lines, but separate data lines. Fixed granularity.
- Banks in one chip share I/O, but with independent access.
- Banks operate independently, but share command/address/data buses; support concurrent accesses.
DRAM Bank: $2^n$ rows, $2^m$ columns with $2^m \times w$ bitlines, $w$-bit I/O width, $2^m \times w$-bit row buffer.
- Overfetch: accessing one cell must activate the entire row into the row buffer; row hit vs. row miss.
- Operations: ACTIVATE (open a row), READ, WRITE, PRECHARGE (close the row, clear the row buffer), REFRESH
Memory Controller: inside the processor chip; decides the address mapping, translates requests into DRAM commands, and manages the row buffer and DRAM refresh.
- DRAM access latency: CPU$\to$memory-controller transfer time + controller latency + DRAM bank latency (combinations of $t_{RCD}$ (row-to-column command delay), $t_{CAS}$ (column access strobe), $t_{RP}$ (precharge), depending on the row buffer state) + DRAM data transfer time + memory-controller$\to$CPU transfer time.
- DRAM address mapping: coarse-grained partitioning. Column bits are in the lower part for spatial locality; bank bits are in the lower part for parallelism (see the sketch below).
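One possible bit-level mapping as a C sketch (the field widths here are made up, not the course's parameters):

```c
#include <stdint.h>

/* Illustrative physical-to-DRAM address mapping: column bits lowest
   for spatial locality, then bank bits for parallelism, then
   rank/channel, with the remaining high bits selecting the row. */
typedef struct { unsigned channel, rank, bank, row, column; } dram_addr_t;

dram_addr_t dram_map(uint64_t paddr) {
    dram_addr_t d;
    d.column  =  paddr        & 0x3FF;   /* 10 column bits (lowest) */
    d.bank    = (paddr >> 10) & 0x7;     /* 3 bank bits             */
    d.rank    = (paddr >> 13) & 0x1;     /* 1 rank bit              */
    d.channel = (paddr >> 14) & 0x3;     /* 2 channel bits          */
    d.row     =  paddr >> 16;            /* remaining bits = row    */
    return d;
}
```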
Row Buffer Management: Open-page policy vs Closed-page policy
DRAM Scheduling Policy: FCFS (first come, first served); FR-FCFS: row hits first, then oldest first.
DRAM Refresh: compensates for capacitor charge leakage. Refresh = periodic ACTIVATE + PRECHARGE of each row
10 Virtual memory
Virtual addresses are mapped to the physical address space in units of pages. Translation: the page table. Each process has its own page table, managed by the OS and used by the hardware; it is stored in memory, with its base address in a special register.
One entry (PTE) per virtual page: Valid bit, access rights, physical page address (frame number).
Usage: memory management, memory protection
DRAM as a data cache for disk/SSD: only the most frequently used data are stored in DRAM (a swapped-out page’s PTE records its location on disk)
- Page fault: accessing a swapped-out page (not in DRAM)
- Demand paging: on a page fault, the page needs to be paged in; the OS page-fault handler moves the data from storage into memory (extremely slow)
- Synonym: a process may use different virtual addresses for the same physical address (e.g. copy-on-write)
- Homonym: different processes can use the same virtual address, translated to different physical addresses
- Page sharing: different processes can share a page by mapping it to the same physical address.
Multi-level page tables: solving the problem of page table size.
- Only the top level is stored in memory, remaining part can be stored in memory or disk
- or unallocated if there’s no PTE in this range.
- Example: a radix tree; unallocated subtrees need not be stored (see the walk sketch below).
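A minimal C sketch of a two-level walk in this radix-tree style, assuming an illustrative 10+10+12 bit split of a 32-bit virtual address:

```c
#include <stdint.h>

/* Two-level page table walk (simplified sketch: 10 + 10 + 12 bit split
   of a 32-bit virtual address). Unallocated second-level tables are
   simply NULL subtrees, saving the space a flat table would need. */
#define PAGE_SHIFT 12
#define LEVEL_BITS 10
#define LEVEL_SIZE (1u << LEVEL_BITS)

typedef struct { uint32_t frame; int valid; } pte_t;
typedef struct { pte_t *tables[LEVEL_SIZE]; } page_dir_t;

/* Returns the physical address, or -1 on a page fault. */
int64_t translate(const page_dir_t *dir, uint32_t vaddr) {
    uint32_t l1  = vaddr >> (PAGE_SHIFT + LEVEL_BITS);       /* top 10 bits */
    uint32_t l2  = (vaddr >> PAGE_SHIFT) & (LEVEL_SIZE - 1); /* next 10     */
    uint32_t off = vaddr & ((1u << PAGE_SHIFT) - 1);         /* low 12      */
    pte_t *table = dir->tables[l1];
    if (!table || !table[l2].valid)
        return -1;                     /* page fault: OS handler takes over */
    return ((int64_t)table[l2].frame << PAGE_SHIFT) | off;
}
```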
Translation look-aside buffer (TLB): locality in accesses $\to$ locality in translations
- A hardware cache containing PTEs (for page table).
- TLB entry Organization: a PTE, valid bit, tag, LRU info, …
- Miss handling: page table walk: access the PTE in memory, then refill the TLB (possibly triggering a page fault).
TLB Context switching: when multiple processes share a processor
- Choice 1: flush all TLB entries
- Choice 2: add a process ID tag to each entry, so entries from different processes coexist
Limited TLB reach: multi-level TLBs, larger or multiple page sizes.
Cache index & tag: virtual or physical?
- Physical cache: access caches and DRAM after the translation.
- Virtually indexed, physically tagged: translation & cache access in parallel, cache index from page offset only, and check tag using (translated) full physical address.
- Virtual cache: homonym and synonym issues.
11 Disks & SSDs
Magnetic Hard Disks: long-term, non-volatile, large, cheap, but slow:
$t(\text{Queuing delay}) + t(\text{Seek time}) + t(\text{Rotational delay, half a revolution on average}) + t(\text{Data transfer})$
The seek and rotational terms dominate for small amounts of data; this demands locality and better scheduling.
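For instance (typical illustrative numbers): a 7200 RPM disk completes a revolution in 8.33 ms, so the average rotational delay is about 4.2 ms; with a 4 ms seek, a small random read costs roughly 8 ms before any data is transferred.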
Solid State Drive (SSD/Flash): an array of flash chips; faster but more expensive; life limited
Must erase a whole block before writing; erasing is slow and each block endures only a limited number of erase cycles; no in-place modification.
FTL: flash translation layer; maintains the logical-to-physical address mapping so that pages within a block are written sequentially. May perform wear leveling to avoid hotspots.
12 I/O
Challenges: diversity ($\to$ standardized interfaces), slow & unpredictable behavior ($\to$ new access approaches)
Software interfaces: register-based and memory-mapped I/O
Register interface: use registers to read data, write data, and exchange command/status information
Memory-Mapped I/O: part of physical address space assigned to each I/O device, protected by VM.
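A minimal bare-metal C sketch of memory-mapped I/O (the device and register addresses are hypothetical): `volatile` keeps the compiler from caching device-register accesses in CPU registers.

```c
#include <stdint.h>

/* Hypothetical UART device registers mapped at fixed physical addresses.
   Loads/stores to these addresses reach the device, not DRAM. */
#define UART_STATUS ((volatile uint32_t *)0x10000000u)
#define UART_DATA   ((volatile uint32_t *)0x10000004u)
#define TX_READY    0x1u

void uart_putc(char c) {
    while ((*UART_STATUS & TX_READY) == 0)
        ;                      /* poll the status register */
    *UART_DATA = (uint32_t)c;  /* write the data register  */
}
```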
Standardizing I/O interface: provide uniform interface, allow device drivers for different devices.
Types: Block/Stream, Blocking/Nonblocking/Asynchronous
Hardware Designs: buses and switched interconnects
- Bus: a set of parallel common lines connecting multiple devices; time-multiplexed.
- Accessing the bus: master-slave arrangement; multiple masters are supported by alternating masters (arbitration/scheduling)
- Bus transactions: the master specifies <command, address, data for write>, then slaves respond
- Typical buses: processor-memory bus, processor-processor bus, I/O bus
- Synchronous bus (short, fixed clock rate, must adapt to device rates) vs. asynchronous bus (wide range of devices, no clock skew, complex handshaking protocol)
- Serial bus: avoids data skew between parallel wires; one bit at a time at high frequency; needs Ser/Des devices
- Pipelined & split-transaction buses
- Switched Point-to-point Links: distributed. The switcher handles routing and arbitration on each link.
I/O Notification: infrequent, unexpected events. Polling vs Interrupts.
- Polling: the I/O device places information in a status register, and the OS periodically checks that register (as in the `uart_putc` sketch above);
- Interrupts: whenever an I/O device needs attention, it interrupts the processor. The OS finds the reason in specified places and handles it (save state, handle, restore state and return).
Large Data Transfer: Direct Memory Access (DMA); no need to bother the processor.
The CPU sends a data-transfer task <device and operation, src/dst address, length>. The DMA engine accepts data-transfer descriptors in a queue, arbitrates for the bus, performs the task, and notifies the processor when done (interrupt); a hypothetical descriptor layout is sketched after the issues below.
- Issues: memory pinning, to prevent the OS from swapping out pages mid-transfer;
- Non-contiguous virtual memory $\to$ issue requests in units of pages, or let the DMA engine use virtual addresses;
- Cache coherence: flush the cache / force write-back before I/O, or search the cache for copies.
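A hypothetical descriptor layout in C (field names and widths are assumptions, matching the <device and operation, src/dst address, length> tuple above):

```c
#include <stdint.h>

/* Hypothetical DMA transfer descriptor. The CPU fills one in,
   enqueues it for the DMA engine, and takes an interrupt on completion. */
typedef struct {
    uint16_t device_id;
    uint16_t op;        /* e.g. 0 = device-to-memory, 1 = memory-to-device */
    uint64_t src_addr;  /* physical address (page pinned by the OS) */
    uint64_t dst_addr;
    uint32_t length;    /* bytes */
    uint32_t done;      /* set by the DMA engine when finished */
} dma_desc_t;
```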
13 Multi-threading
Hardware multithreading: support multiple threads on a single core; latency is tolerated rather than avoided. Context switch: requires fast context switching supported by hardware. Types: conventional, coarse-grained, fine-grained, simultaneous (different threads use different functional units in the same cycle)
Some hardware structures are replicated (e.g. PC, registers, page-table pointer), while others are shared (e.g. ALU, caches, TLB)
Distributed Clusters: Network of independent, loosely coupled computers (nodes), connected using I/O. Suitable for independent tasks or very large datasets
Communication: message passing. Each processor has its own physical address space (unlike shared memory), so hardware or software is needed to send/receive messages (and synchronize). MPI (Message Passing Interface): a standard library for message passing
14 GPU & Accelerator
Idea: use the massive computational capacity and memory bandwidth of GPUs to accelerate general compute kernels.
GPU thread model: single program, multiple data (SPMD); single instruction, multiple threads (SIMT)
- Each thread has its own local memory
- Parallel threads are packed into blocks: access to per-block shared memory; synchronize threads using barriers
- Blocks are organized into grids: access to global memory; synchronize via separate kernel launches
Instruction & Thread Scheduling: scheduled in the unit of warps with multiple threads
- Threads in one warp execute in lockstep with same instructions.
- Once one warp is stalled, we can switch to another warp; all warps keep their states on chip for fast switch.
- Branch divergence: tracked by hardware internally with stack-like structure; dynamic warp grouping
- Thread synchronization: barrier synchronization within a thread block (negligible latency, but limits parallelism); kernel relaunch at grid level; atomic operations to L2/global memory.
Streaming Multiprocessor (SM): one thread block is mapped to one SM.
Microarchitecture: SIMT frontend, SIMD backend
- Frontend: fine-grained multi-threading
- Datapath: multiple SIMD functional units
- Large register file (holding the state of all threads)
GPU Memory Hierarchy: Each SM uses a unified address space.
- 96kB capacity split between L1 and shared memory
- L1 Cache: managed by hardware
- Shared memory: explicitly controlled, managed by software (Important for performance)
- Global L2 cache and DRAM
Limitation: registers and shared memory per SM are limited, which limits parallelism.
Accessing off-chip memory: large off-chip bandwidth, allow long latency by switching between warps, and hardware tries to maximize aligned and coalesced accesses.
GPU in System: as a device hosted by CPU, connected via PCIe bus. Work as accelerator or co-processor.
Accelerators: customized for one algorithm/one application domain, efficient but not flexible.
Memory bottleneck: minimize memory accesses and match compute performance with memory bandwidth: line buffers & double buffering
Managing Accelerators: As an I/O device. Provide input, initiate computation, wait until done or error.
Acceleration should make up for extra data transfer time. Is it worth it?
Reconfigurable Architectures: spatial architectures (e.g. FPGA)
- Flexible, cheaper than custom hardware, more efficient than processors (perf/W); also used for circuit emulation
- Less efficient than custom hardware, less programmable than processors, slow programming toolchain
Domain-Specific Hardware: Similar computations within a domain. Map various operators onto same hardware.