Architecture

Trend

  • Moore’s Law
  • Dark silicon

Metrics

  • How to measure performance:
    • Latency or response time
    • Bandwidth or throughput: improves by at least the square of the improvement in latency
    • Benchmark suite: a set of representative programs that are likely relevant to the user
    • Averaging:
      • Arithmetic mean: execution times
      • Harmonic mean: throughputs (rates)
      • Geometric mean: speedups (ratios)
  • Processor performance:
    • clock cycle time T
    • cycles per instruction (CPI; its reciprocal, IPC, is instructions per cycle), which depends on the instruction set in question
    • Instruction count(IC)
    • CPU time: \(\text{CPU time} = IC \times CPI \times T\)
  • comparison
    • Speedup/slowdown
    • percentage increase in performance
    • reduction in execution time
  • locality/parallelism
  • Amdahl’s Law: speeding up only a fraction \(f\) of execution by a factor \(s\) gives diminishing returns: \(Speedup = \frac{1}{(1-f)+f/s}\) (see the sketch after this list)
  • Power&Energy
    • power (\(P = VI\))
    • Energy (\(E = P \times T\), where \(P = P_{sta} + P_{dyn}\))
    • Static - constant power draw
      • \(P_{sta} = V \times I_{sta}\)
    • dynamic - consumed at every switching event due to charging and discharging load capacitances
      • \(P_{dyn} = A C V^2 f = A V \left(\frac{C V}{T}\right)\), where \(A\) is the activity factor, \(C\) the switched capacitance, and \(f = 1/T\) the frequency
    • CPU power is the rate of heat generated
    • excessive peak power may result in burning the chip
    • Opportunistic power gating
  • Cost of integrated Circuit
    • Cost of die \(\frac{\text{wafer cost}}{\text{dies per wafer} * \text{die yield}}\)
    • Yield of die \(\frac{\text{wafer yield}}{(1+\text{defect per unit area} * \text{die area})^N}\)
    • \(N\): process-complexity factor specified by the chip manufacturer
  • Dependability
    • System reliability and availability
    • Mean time to Failure MTTF
    • Mean time to Repair MTTR
    • System availability = \(\frac{MTTF}{MTTF+MTTR}\)
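
A minimal C sketch of the CPU-time and Amdahl’s-Law formulas above; the instruction count, CPI, clock period, and improved fraction are illustrative numbers, not measurements:
#include <stdio.h>

int main(void) {
    /* CPU time = IC * CPI * T (illustrative numbers) */
    double IC  = 1e9;      /* instruction count               */
    double CPI = 1.5;      /* cycles per instruction          */
    double T   = 0.5e-9;   /* clock period: 0.5 ns (2 GHz)    */
    printf("CPU time = %.3f s\n", IC * CPI * T);

    /* Amdahl's Law: Speedup = 1 / ((1 - f) + f / s) */
    double f = 0.8;        /* fraction of execution improved  */
    double s = 4.0;        /* speedup of that fraction        */
    printf("Speedup = %.2fx (upper bound %.2fx as s grows)\n",
           1.0 / ((1.0 - f) + f / s), 1.0 / (1.0 - f));
    return 0;
}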

Instruction set architecture

  • interfacing contract between hardware and software
  • Defined
    • functional operation of units
    • How to use each functional unit
  • Does not define
    • how functional unit implemented
    • Execution time of operation
    • energy consumption
  • Programmer’s perspective
    • internal machine states: program counter - next instruction to be executed
    • operations
    • addressing modes
  • Operand location

  • CISC
    • may reduce IC, increase CPI, and increase CT
    • CPU time may increase
    • complex operations
    • variable length
    • limited registers
  • RISC
    • simple operations
    • fixed length
    • limited instruction formats
  • Memory addressing
    • Register
    • Immediate
    • Displacement
    • Register Indirect
    • indexed
    • direct
    • memory indirect @
    • auto decrement/increment
    • scaled
  • Instruction format
    • MIPS: I-type (load, store, conditional branch), R-type (ALU), J-type (jump)
    • Fields: Opcode : RS, RT : RD, ShAmt, Funct
  • Processing
    • Instruction fetch
    • Instruction decode
    • Register read
    • Execute instructions
    • Memory access
    • Register write back
  • Example: Single-cycle RISC architecture
  • Pipelining
    • Reusing idle resources: each processing step finishes in a fraction of a cycle
    • Improving throughput at the expense of latency
    • Delay (ideal case): \(D = T + n\delta\) for \(n\) instructions, where \(T\) is the latency of one instruction through all stages and \(\delta\) is the per-stage delay
    • Throughput: \(IPS = n/D\), approaching \(1/\delta\) for large \(n\) (see the sketch at the end of this list)
  • Example: 5-stage MIPS pipeline
    • Instruction Fetch: read instruction from memory (I-Cache)
      • Program counter (PC)
      • Compute NPC
      • Update pipeline register
    • Instruction Decode
      • generate control signal
      • read source operands from register file
      • update pipeline register
        • send operand and immediate value to next stage
        • pass control signal and NPC to next stage
    • Execution Stage
      • Perform ALU operation
        • result of ALU
        • Compute branch target(assuming branch taken): Target = NPC + immediate
      • update pipeline registers
        • Control signals, branch target, ALU results, and destination
    • Memory Access
      • Redirect the PC to the branch target if the branch is taken
      • Load/Store
    • Register Write Back
  • Pipeline Hazards
    • Structural hazards - multiple instructions compete for the same resources
      • unified memory for Ins/Dat
      • register file with shared read/write access port - subcycles
    • Data hazards - a dependent instruction cannot proceed because it needs a value
      • wait or arrange other Ins in between
      • forwarding - bypass the register file and route produced values directly to dependent instructions
    • Control hazards - the next instruction cannot be fetched because the outcome of an earlier branch is unknown
      • wait or arrange other Ins in between
      • predict the branch outcome: assume the branch is taken or not taken, may need to cancel wrong path
      • jump and function call can be resolved in the decode stage [1]
  • Multicycle instructions
    • FP operation typically need more time
    • Not all ALU operation complete in one cycle
    • Structural hazards: potential multiple RF writes
    • Data hazards:
      • more read-after-write hazards
      • potential write-after-read hazards
      • potential write-after-write hazards
    • Imprecise exception
      • instructions do not necessarily complete in program order
      • state of the processor must be kept updated with respect to the program order
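
A minimal C sketch of the ideal pipeline delay and throughput formulas above, assuming a 5-stage pipeline with a 1 ns per-stage delay (both numbers are illustrative):
#include <stdio.h>

int main(void) {
    /* Ideal k-stage pipeline: the first instruction takes T = k * delta,
       each later one completes delta after the previous, so for n
       instructions D ~= T + n*delta and throughput IPS = n / D.        */
    int    k     = 5;      /* pipeline stages (illustrative) */
    double delta = 1e-9;   /* per-stage delay: 1 ns          */
    double T     = k * delta;
    for (long n = 1; n <= 1000000; n *= 100) {
        double D = T + n * delta;
        printf("n = %7ld  D = %.3e s  IPS = %.3e\n", n, D, n / D);
    }
    return 0;
}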

ILP: Instruction level parallelism

  • Stalls slow down the pipeline; even when fully stalled, the CPU still pays the latch-latency penalty
  • ILP [2]
    • increase overlap among instructions in the pipeline
    • decrease stalls
  • Potential overlap among instructions
    • property of the program dataflow
    • influenced by compiler
    • ILP: \(\frac{NumInstructions}{NumCycles}\)
    • IPC is the exploited ILP
  • Exploitation method:
    • Dynamic scheduling in hardware
    • static scheduling in software(compiler)

Big picture

  • Goal: improving performance \((IPC\times F)/IC\)
    • software: ILP and IC
    • hardware: IPC
    • Increase IPC: improve ILP and exploit more ILP
    • Increase F: deeper pipeline and faster technology
  • Key performance limitation
    • number of instructions fetched per second is limited
    • deeper fetch and wider fetch

Compiler-based techniques

  • Loop book-keeping overhead
  • Diverse impact of stall cycles on performance
  • branch delay slots: an instruction of the loop moved to after the branch and executed to completion

Loop optimization

  • Re-ordering and changing immediate values
  • Reducing loop overhead by unrolling (see the sketch below)
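
A sketch of unrolling by a factor of four; the array, its size, and the assumption that the trip count is a multiple of four are all made up for the example:
/* Illustrative only: summing an array with and without 4x unrolling. */
#include <stdio.h>

#define N 1024
static double a[N];

double sum_rolled(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)       /* one branch + index update per element */
        s += a[i];
    return s;
}

double sum_unrolled(void) {
    double s = 0.0;
    for (int i = 0; i < N; i += 4) {  /* one branch + index update per 4 elements */
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    return s;                          /* assumes N is a multiple of 4 */
}

int main(void) {
    for (int i = 0; i < N; i++) a[i] = i;
    printf("%f %f\n", sum_rolled(), sum_unrolled());
    return 0;
}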

  • Scalar pipelines
    • Upper bound on throughput \(IPC<=1\)
    • Unified pipeline for all functional units: underutilized resources
    • Inefficient freeze policy: stall cycle delays all the following cycles
    • Pipeline hazards: stall cycles result in limited throughput
  • Superscalar pipelines
    • Separate integer and floating point pipelines \(IPC<=2\)
      • instruction packet using VLIW (very long instruction word)
      • inst. packet has one fp. and one int. slots
  • Software pipelining: mix-and-match instructions from different iterations

Control Flow

  • Impact of branches
  • Program’s basic blocks:
    • only one entry and one exit point per basic block
  • branches
    • conditional vs. unconditional: check conditions, jump call return
    • target address: absolute, relative
  • penalty due to unknown outcome (direction and target): addressed by branch prediction
    • profiling the program
    • predict based on common cases
  • Goal of branch prediction
    • avoid stall cycles in fetch stage
    • static prediction
    • dynamic prediction
    • prediction accuracy: \(correct/total\)
  • Compute slowdown for misprediction
  • Dynamic branch prediction:
    • prediction logic: direction and target address
    • outcome validation and training: outcome is computed regardless of prediction
    • recovery from misprediction: nullify the effect of instructions on the wrong path
    • One-bit branch predictor: keep track of and use the outcome of the last executed branch
      • accuracy limits: a single predictor may be shared by multiple branches; two mispredictions per loop (one entering, one exiting)

Branch Predictors

  • goal avoiding stall cycles caused by branches
  • solution: static or dynamic branch predictor
    • prediction
    • validation and training
    • recovery from misprediction
  • performance is influenced by the frequency of branches \(b\), prediction accuracy\(a\) and misprediction cost \(c\)
    • \(Speedup = \frac{1+bc}{1+(1-a)bc}\)
    • Note that in the baseline every branch is assumed to be a misprediction
  • Bimodal branch predictors
    • one-bit branch predictor
      • keep track of and use the outcome of the last branch
      • shared predictor
      • two mispredictions per loop
    • Two-bit branch predictor
      • increment if taken, decrement if not taken; predict taken in the upper half of the range (see the sketch after this list)
      • one misprediction on loop exit
  • using multiple counters
    • Decode history table(DHT): reduced HW with aliasing
      • least significant bits are used to select a counter
    • Branch history table(BHT): precisely tracking branches
      • Most significant bits are used as tags(tags + address)
    • Combined BHT and DHT
      • BHT is used on a hit
      • DHT is used/updated on a miss
  • Correlating branch predictor
    • Executed branches of a program stream may be correlated
    • Global History register: an r-bit shift register that maintains outcome history
    • GHR is merged with PC bits to choose a counter(XOR)
  • A local predictor may work well for some programs while a global predictor works better for others
    • could include both to be safe
  • Dedicated predictor per branch
    • Program counter is used for assigning predictors to branches
  • Capturing correlation among branches
    • Shift register is used to track history
  • Predicting branch direction is not enough
    • which instruction should be fetched if the branch is taken?
  • Storing the target instruction can eliminate fetching
    • Extra hardware is required
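
A sketch combining the two-bit saturating counters and GHR-XOR (gshare-style) indexing described above; the table size, history length, and initial counter value are illustrative choices:
#include <stdint.h>
#include <stdio.h>

#define ENTRIES 1024               /* 2^10 two-bit counters             */
static uint8_t  counters[ENTRIES]; /* 0..3; >= 2 means "predict taken"  */
static uint32_t ghr;               /* global history register           */

static uint32_t index_of(uint32_t pc) { return (pc ^ ghr) & (ENTRIES - 1); }

static int predict(uint32_t pc) { return counters[index_of(pc)] >= 2; }

static void train(uint32_t pc, int taken) {
    uint32_t i = index_of(pc);
    if (taken  && counters[i] < 3) counters[i]++;      /* saturating up   */
    if (!taken && counters[i] > 0) counters[i]--;      /* saturating down */
    ghr = ((ghr << 1) | (taken ? 1u : 0u)) & (ENTRIES - 1);
}

int main(void) {
    for (int i = 0; i < ENTRIES; i++) counters[i] = 2;  /* weakly taken */

    /* A loop branch taken 9 times then not taken, repeated 3 times:
       with history-based indexing only the first exit mispredicts.   */
    int mispredicts = 0;
    for (int rep = 0; rep < 3; rep++)
        for (int i = 0; i < 10; i++) {
            int taken = (i < 9);
            if (predict(0x400123) != taken) mispredicts++;
            train(0x400123, taken);
        }
    printf("mispredictions: %d\n", mispredicts);
    return 0;
}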

Branch target buffer

  • Store tags and target addresses for each branch
  • entries only for taken branches; send out the predicted PC to minimize stalls

Dynamic scheduling

  • Instruction scheduling can remove unnecessary stall cycles in the execution/memory stage
    • static scheduling: complex compiler, unable to resolve all data hazards(no access to runtime details)
    • dynamic scheduling: completely done in hardware
  • Creating an instruction schedule based on runtime information
    • hardware managed instruction reordering
    • instructions are executed in data flow order
    • ID is split into:
      • Issue: ensure no structural hazard
      • Read operands: ensure no data hazards
  • Register renaming:
    • Eliminating WAR and WAW hazards
    • change the mapping between architectural registers and physical storage locations
    • allocate a free physical location for each new destination register; find the most recently allocated location for each source register [3]
  • dynamic issue: issue instructions out of program order while maintaining data flow

Tomasulo algorithm:

  • dispatch instructions to functional units: use reservation stations(RS) - buffer the operands of instructions waiting to issue
  • execute an instruction as soon as all of its operands are ready: watch the common data bus(CDB)
  • remove false(anti- and output-) data dependence: rename destination register to RS name

  • Reservation station (RS) entry (see the struct sketch after this list):
    • busy - indicates that this reservation station and its accompanying functional unit are occupied
    • op
    • vj
    • vk
    • qj - tag of the reservation station that will produce the corresponding source operand; 0 means the value is available or not needed
    • qk
    • addr - immediate field or effective address for load or store
  • Qi field of the register file: the tag (number) of the reservation station whose result should be stored into this register
  • Reorder buffer (ROB) entry: additional buffering, used in the same way as the reservation stations, that supplies values until instructions commit
    • valid
    • result
    • exception
    • pc
  • three-step Tomasulo algorithm
    • Issue: take an instruction from the instruction queue
      • if there are free RS without structural hazards, rename and read/send operands or RS names
    • Execute: operate on operands when ready
      • if all operands are ready, execute; otherwise watch the CDB
    • Write result: update the register values
      • write the result through CDB to all waiting RS and register file, release RS entry
  • Summary:
    • Data hazards
      • RAW is handled by forwarding over CDB
      • WAR and WAW are removed by RS-based renaming
    • Structural hazards
      • Multiple FUs may be accessing CDB simultaneously: delay conflicting instructions at issue and RS
    • Precise exception handling
      • not possible because of out-of-order writeback to the register file: maintain the destination value in the ROB instead
  • Four-step Tomasulo algorithm
    • Issue(dispatch): if RS and ROB slots are free; read/rename operands
    • Execution: execute operation as soon as the operand values are ready
    • write result: send result to ROB and reservation stations via CDB
    • commit(retire): update register file for the head of ROB
      • write back head of the ROB
      • branch prediction correct: commit normally
      • branch prediction incorrect: flush the ROB and restart execution at the correct successor of the branch
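
A sketch of the bookkeeping structures listed above (RS entry, register-file Qi field, ROB entry) as C structs; field names follow the notes, while the types, widths, and array sizes are illustrative assumptions:
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     busy;    /* this RS and its functional unit are occupied       */
    uint8_t  op;      /* operation to perform on the source operands        */
    int64_t  vj, vk;  /* source operand values (valid once qj/qk are 0)     */
    uint8_t  qj, qk;  /* tags of the RSs producing vj/vk; 0 = ready/unused  */
    int64_t  addr;    /* immediate or effective address for loads/stores    */
} ReservationStation;

typedef struct {
    uint8_t qi;       /* tag of the RS whose result will update this
                         register; 0 = the register value is up to date     */
    int64_t value;
} RegisterEntry;

typedef struct {      /* reorder buffer entry (four-step Tomasulo)          */
    bool     valid;   /* result has been produced (write-result done)       */
    int64_t  result;
    bool     exception;
    uint64_t pc;      /* needed to restart precisely after an exception     */
} ROBEntry;

int main(void) {
    ReservationStation rs[6]    = {0};  /* e.g. 3 add + 3 mul stations      */
    RegisterEntry      regs[32] = {0};
    ROBEntry           rob[16]  = {0};
    (void)rs; (void)regs; (void)rob;
    return 0;
}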

Hardware speculation

  • out-of-order execute/complete
  • in-order fetch/decode and commit
  • distributed reservation stations: in-order(for each functional unit) issue/dispatch
  • out of order issue/dispatch to functional units
    • two-step wakeup-and-select logic
  • Register renaming (see the sketch after this list)
    • Register alias table (RAT) for fast lookup
    • Free register list for fast register renaming
    • Proceed only if there is free space in IQ, ROB and Free List
  • Branch prediction and recovery: squash all mispredicted entries - commit speculative instructions iff prediction was correct
  • Double-RAT architecture: use two RATs and only one enlarged register file
    • use the retirement RAT when committing entries from the ROB
    • free entries when there is no reference left(Issue Queue, IQ and RAT)
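
A minimal sketch of RAT plus free-list renaming as described above; register counts and the example instructions are made up:
/* Map architectural to physical registers, allocating a free physical
   register for each new destination. Sizes are illustrative.          */
#include <stdio.h>

#define ARCH_REGS 32
#define PHYS_REGS 64

static int rat[ARCH_REGS];                 /* arch reg -> most recent phys reg */
static int free_list[PHYS_REGS];
static int free_top;

static int alloc_phys(void) {              /* pop from the free list */
    return free_top > 0 ? free_list[--free_top] : -1;  /* -1: stall at issue */
}

/* Rename one instruction "rd = rs op rt": sources read the current
   mapping, the destination gets a fresh physical register.          */
static void rename(int rd, int rs, int rt) {
    int ps = rat[rs], pt = rat[rt];
    int pd = alloc_phys();
    if (pd < 0) { printf("stall: no free physical register\n"); return; }
    rat[rd] = pd;                           /* later readers of rd see pd */
    printf("p%d = p%d op p%d\n", pd, ps, pt);
}

int main(void) {
    for (int i = 0; i < ARCH_REGS; i++) rat[i] = i;          /* identity map */
    free_top = 0;
    for (int p = PHYS_REGS - 1; p >= ARCH_REGS; p--) free_list[free_top++] = p;

    rename(1, 2, 3);   /* r1 = r2 op r3                                  */
    rename(4, 1, 1);   /* r4 = r1 op r1 - reads the renamed r1           */
    rename(1, 5, 6);   /* WAW on r1 removed: gets another phys register  */
    return 0;
}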

Out-of-order load and store

  • IQ with MA units:
    • dedicated queue only for load/store instructions, the LSQ
    • two steps:
      • compute the effective address when register is available
      • send the request to memory if there are no memory hazards
  • load: only loads that are not following any stores with unknown addresses can be issued (RAW)
  • store: issue only when there are no unresolved instructions before it (WAW and WAR)
  • can we predict memory dependence?
    • issue/execute load instructions even if they are following unresolved stores

Memory hierarchy

Memory hierarchy design

  • basics
    • why
    • techs
    • locality
  • caches
    • block, index, tag, address, cache policies
    • Performance metric(AMAT) and optimization techniques
  • Virtual memory
    • paging, address translation, TLB
  • Memory hierarchy
    • Burks, Goldstine, and von Neumann: greater capacity less quickly accessible
  • The memory wall
    • the processor-memory performance gap has grown by over 50% per year
  • Memory technology
    • RAM technology
      • access time ~same for all locations(not so true anymore)
      • static RAM(SRAM)
        • typically used for caches
        • 6T/bit; fast, but low density, high power, expensive
        • cell structure: 6 transistors; maintains data while powered on
      • Dynamic RAM(DRAM)
        • typically used for main memory
        • 1T/bit; inexpensive, high density, low power - but slow
        • cell structure 1T-1C: needs refresh regularly to preserve data
  • Cache hierarchy:
    • software: scratchpad
    • hardware: caches
  • Principle of locality
    • memory references exhibit localized accesses
    • Spatial: probability of access to \(A+\delta\) at time \(t+\epsilon\) highest when \(\delta \rightarrow 0\)
    • temporal: probability of access to \(A\) at time \(t+\delta\) highest when \(\delta \rightarrow 0\)
  • Cache terminology
    • Block(cache line): unit of data access
    • Hit: accessed data found at current level, Miss
    • Hit rate, hit time: time to access data on a hit
    • Miss rate, miss penalty
  • Cache performance:
    • Average Memory Access Time(AMAT): \[AMAT=r_ht_h+r_m(t_h+t_p) = t_h + r_mt_p\]
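
A tiny worked AMAT example in C using the formula above; the hit time, miss rate, and miss penalty are illustrative numbers:
#include <stdio.h>

/* AMAT = t_h + r_m * t_p */
int main(void) {
    double t_h = 1.0;    /* hit time: 1 cycle        */
    double r_m = 0.05;   /* miss rate: 5%            */
    double t_p = 100.0;  /* miss penalty: 100 cycles */
    printf("AMAT = %.1f cycles\n", t_h + r_m * t_p);  /* 1 + 0.05*100 = 6 */
    return 0;
}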

Cache Architecture

Addressing and lookup

  • Addressing:
    • Main memory address
    • Simplest: direct-mapped cache (indexed with the middle bits)
    • address split into tag, index, and byte offset; a valid flag (V) indicates whether the contents are meaningful (see the sketch after this list)
    • data and tag are always accessed
  • Cache size only includes the data array
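
A sketch of splitting an address into tag, index, and byte offset for a direct-mapped cache; the 64-byte block and 256-set geometry is an illustrative assumption:
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 6     /* 64-byte blocks      */
#define INDEX_BITS  8     /* 256 sets (mid bits) */

int main(void) {
    uint32_t addr   = 0x12345678;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}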

Cache optimization

  • How to improve cache performance
    • Reduce hit time: memory tech, critical access path
    • improve hit rate: size, associativity, placement/replacement policies
    • Reduce miss penalty: multi level caches, data prefetching
  • Set associative caches
    • Improve cache hit rate by allowing a memory location to be placed in more than one cache block
    • N-way set associative cache, fully associative
    • for a fixed capacity, higher associativity typically leads to higher hit rates
  • n-Way
    • index into cache sets
    • multiple tag comparisons
    • multiple data reads
    • Special: direct mapped, fully associative
  • Cache miss classifications
    • cold (compulsory): even an ideal cache will have them
      • larger blocks, prefetching
    • Capacity: due to cache’s size limit
      • larger cache
    • Conflict: due to less associativity
      • larger cache
      • more associativity

Cache optimization

  • cache replacement policies
    • which block to replace on a miss?
    • Ideal replacement (Belady’s algorithm)
      • replace the block accessed farthest in the future
    • Least recently used (LRU) (see the sketch after this list)
      • replace the block accessed farthest in the past
    • Most recently used (MRU)
      • replace the block accessed nearest in the past
    • Random replacement
  • Cache write policies
    • write vs. read
      • Data and tag are accessed for both read and write
      • only for writes does the data array need to be updated
      • miss: write no allocate, write allocate
      • hit: write back, write through
    • Write back:
      • on a write access write to cache only
      • write cache block to memory only when replaced from cache
      • dramatically decreases bus bandwidth usage
      • keep a bit(dirty bit) per cache block
    • Write through
      • write to both cache and memory(or next level) through use of a write buffer
      • improved miss penalty [4]
      • more reliable because two copies are maintained
      • will stall the processor to allow memory to catch up in case the write buffer fills up
    • Write allocate
      • allocate a cache line for the new data and replace old line
    • Write no allocate
      • do not allocate space in the cache for the data
      • only really makes sense in systems with write buffers
  • Reducing miss penalty
    • inevitable misses: serve as quickly as possible
    • Multilevel caches
    • read misses priority over writes
    • sub-block placement
    • critical word first
  • Victim cache
    • reduce conflict misses: larger capacity and more associativity
    • Associativity is expensive: complex hardware, longer hit time, more energy
    • Observation: conflict misses do not occur in all sets, increase associativity on the fly
    • Small fully associative cache, on eviction move the victim block here
  • Cache inclusion
    • how to reduce the number of accesses that miss in all cache levels?
    • should a block be allocated in all levels?
      • yes: inclusive, no: non-inclusive/exclusive
    • Modern processors:
      • L3: inclusive of L1 and L2
      • L2: non-inclusive of L1 (large victim cache)
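
A sketch of LRU victim selection for a single 4-way set, using full timestamps for clarity (real designs typically use pseudo-LRU bits instead):
#include <stdint.h>
#include <stdio.h>

#define WAYS 4

static uint64_t last_use[WAYS];
static uint64_t now;

static void touch(int way) { last_use[way] = ++now; }   /* on hit or fill */

static int lru_victim(void) {
    int victim = 0;
    for (int w = 1; w < WAYS; w++)                       /* oldest timestamp */
        if (last_use[w] < last_use[victim]) victim = w;
    return victim;
}

int main(void) {
    touch(0); touch(1); touch(2); touch(3);
    touch(1);                               /* way 1 becomes most recent   */
    printf("evict way %d\n", lru_victim()); /* way 0: least recently used  */
    return 0;
}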

Virtual Memory

  • Memory hierarchy: lower levels provide greater capacity but longer time
    • will a program fit in main memory/what if running multiple programs
  • virtual memory: use main memory as a “cache” for secondary memory
    • allows efficient and safe sharing of the physical main memory among multiple programs
  • virtual memory systems
    • provides illusion of very large memory
      • address space of each program larger than the physical main memory
    • Memory management unit (MMU)
      • between main and secondary mem.
      • address translation
        • virtual address space to physical address space
    • every virtual address is translated to a physical address with the help of hardware
      • where, what to do on table miss(page fault), cost
  • Address translation cost
    • page walk: look up the physical address in the page table
  • multi-level page table
    • the virtual (logical) address space is broken down into multiple pages
  • Translation lookaside buffer
    • exploit locality to reduce address translation time: keep recent translations in a buffer (see the sketch after this list)
    • fully associative/set associative/direct
    • typically faster than cache access: TLBs are smaller(typically not more than 128 to 256 entries)
  • Content addressable memory(CAM) based TLB
    • CAM: data in, address out; the matching entry selects a RAM entry holding the physical page number, dirty bits, and status
  • physically indexed caches:
    • increased critical path from sequential TLB and cache access
  • virtually indexed caches
    • parallel lookup and compare between tags from TLB and Tag Array
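
A sketch of a small fully associative TLB in front of a single-level page table; page size, TLB size, the round-robin refill, and the page-table contents are illustrative assumptions:
#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS   12                 /* 4 KB pages */
#define TLB_ENTRIES 8
#define NUM_PAGES   256

typedef struct { uint32_t vpn, ppn; int valid; } TLBEntry;

static TLBEntry tlb[TLB_ENTRIES];
static uint32_t page_table[NUM_PAGES]; /* vpn -> ppn (single level here)  */
static int      next_victim;           /* simple round-robin replacement  */

static uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    for (int i = 0; i < TLB_ENTRIES; i++)            /* TLB lookup          */
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].ppn << PAGE_BITS) | offset;

    uint32_t ppn = page_table[vpn];                  /* TLB miss: page walk */
    tlb[next_victim] = (TLBEntry){ vpn, ppn, 1 };    /* refill the TLB      */
    next_victim = (next_victim + 1) % TLB_ENTRIES;
    return (ppn << PAGE_BITS) | offset;
}

int main(void) {
    for (uint32_t v = 0; v < NUM_PAGES; v++) page_table[v] = NUM_PAGES - 1 - v;
    printf("0x%x -> 0x%x\n", 0x1234u, translate(0x1234u));  /* miss, then fill */
    printf("0x%x -> 0x%x\n", 0x1abcu, translate(0x1abcu));  /* hit             */
    return 0;
}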

Main memory systems

  • SRAM for caches
    • fast and leaky, 6 transistors per bit, CMOS process
    • static volatile, retain data when powered on
  • DRAM for main memory
    • slow and dense, 1 transistor per bit, DRAM process
    • dynamic volatile, periodic refreshing is required to retain data
  • DRAM is organized as rows × columns
  • DRAM load is destructive: discharging capacitors
  • Row access strobe(RAS) and column access strobe(CAS)
  • all reads and writes are performed through DRAM row buffer(RB)
      • Row buffer holds a single row of the array: typically 8 KB
    • entire row moved but only a block is accessed
    • row buffer access possibilities
        • row buffer hit: no need for precharge and activate (20)
        • row buffer miss: an activate is needed: empty row (40) / row conflict (6)
  • DRAM refresh
    • current leakage: both ways
    • DRAM requires the cells’ contents to be read and written periodically
      • Burst refresh: all at once
      • distributed refresh: a group at a time

DRAM controller

  • DRAM operations:
    • Precharge bitlines to prepare subarray for activating a wordline
    • Activate a row by connecting DRAM cells to the bitline and start sensing
    • Read the contents of a data block from the row buffer
    • Write new contents for data block into the row buffer
    • Refresh DRAM cells: can be done through a precharge followed by an activate
  • DRAM control
    • external controller, may be integrated on CPU(modern)
  • Basic timings:
    • \(t_{CAS/CL}\): column access strobe (RD->DATA)
    • \(t_{RAS}\): row active strobe (ACT->PRE)
    • \(t_{RP}\): row precharge (PRE->ACT)
    • \(t_{RC}\): row cycle (ACT->PRE->ACT)
    • \(t_{RCD}\): row to column delay(ACT->RD/WT)
  • Access time (see the sketch at the end of this list):
    • Row hit: \(t_{CAS}\)
    • Row empty: \(t_{RCD} + t_{CAS}\)
    • Row conflict: \(t_{RP} + t_{RCD} + t_{CAS}\)
  • Memory channels: fully parallel access
    • separate data, control, and address buses
    • Goal: keep data bus fully utilized
  • DRAM organization:
    • Independently accessed through dedicated data, address, and command buses
    • Physically broken down into DIMM(dual in-line memory modules)
    • logically divided into ranks, a collection of DRAM chips responding to the same memory request
  • Memory controller: connects CPU and DRAM
    • receives requests after LLC misses
    • DRAM refresh management, command scheduling, Row-buffer management policies, address mapping schemes
  • Refresh management: recently accessed rows need not be refreshed
  • Command scheduling:
    • write buffering: writes can wait until reads are done
    • Controller queues DRAM commands: per-bank queues, reordering ops meant for same bank
    • Policies:
      • First-come-first-served: oldest request first
      • First-Ready first-come-first-served
        • prioritize column accesses over row changes
        • skip over older conflicting requests
        • find row hits (among queued requests)
  • Row-buffer management policies
    • open-page policy: good for lots of locality
    • close-page policy: good for random access
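
A sketch computing access latency for the three row-buffer cases from the timing parameters above; the nanosecond values are illustrative, not from any datasheet:
#include <stdio.h>

enum row_state { ROW_HIT, ROW_EMPTY, ROW_CONFLICT };

int main(void) {
    double tCAS = 15.0, tRCD = 15.0, tRP = 15.0;  /* ns, illustrative        */
    double lat[3];
    lat[ROW_HIT]      = tCAS;                 /* column access only          */
    lat[ROW_EMPTY]    = tRCD + tCAS;          /* activate, then column       */
    lat[ROW_CONFLICT] = tRP + tRCD + tCAS;    /* precharge, activate, column */
    printf("hit %.0f ns, empty %.0f ns, conflict %.0f ns\n",
           lat[ROW_HIT], lat[ROW_EMPTY], lat[ROW_CONFLICT]);
    return 0;
}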

DRAM Control

  • Tasks:
    • Refresh management
    • Address mapping
    • Request scheduling
    • Power management
    • Error detection/correction
  • Address mapping
    • Memory request: Type:Address:Data
    • Row:Channel:Rank:Bank:Column (see the sketch after this list)
  • Command scheduling
    • Write buffering: wait till reads are done
    • Controller queues DRAM commands: per-bank, reordering ops
    • Policies: FCFS FR-FCFS(First-Ready)
      • Prioritize column accesses over row changes
      • Skip over older conflicting requests
      • Find row hits: if no conflicts with in-progress request
  • Error detection/correction
    • Reasons: leakage, alpha particles, hard errors
    • How: parity bits, ECC
    • Single-error correction, Double-Error Detection(SECDED)
  • Programmable memory controller
    • Multiple performance objectives
    • app-specific optimizations
    • patches and in-field updates
  • Judicious division of labor between specialized hardware and firmware
    • Request and transaction processing in firmware
    • configurable timing validation in hardware
  • RISC ISA: metadata(app hints, control flow), address(control flow, address mapping)
    • Queue management with instruction flags: R/T for request and transaction
    • Two five-stage pipelines and one configurable timing validation circuit(request processor, transaction processor)
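
A sketch of decoding a physical address into the Row:Channel:Rank:Bank:Column fields mentioned above; all field widths are illustrative, and real controllers often interleave or XOR bits to spread traffic across channels and banks:
#include <stdint.h>
#include <stdio.h>

#define COL_BITS  10
#define BANK_BITS 3
#define RANK_BITS 1
#define CH_BITS   2

int main(void) {
    uint64_t addr = 0x3FACE5678ull;
    uint64_t col  = addr & ((1u << COL_BITS) - 1);
    uint64_t bank = (addr >> COL_BITS) & ((1u << BANK_BITS) - 1);
    uint64_t rank = (addr >> (COL_BITS + BANK_BITS)) & ((1u << RANK_BITS) - 1);
    uint64_t ch   = (addr >> (COL_BITS + BANK_BITS + RANK_BITS)) & ((1u << CH_BITS) - 1);
    uint64_t row  = addr >> (COL_BITS + BANK_BITS + RANK_BITS + CH_BITS);
    printf("row=%llu ch=%llu rank=%llu bank=%llu col=%llu\n",
           (unsigned long long)row, (unsigned long long)ch,
           (unsigned long long)rank, (unsigned long long)bank,
           (unsigned long long)col);
    return 0;
}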

Data level parallelism

  • IPC hardly achieves more than 2
  • DLP: data level parallelism
    • Vector processors(Cray machines), SIMD(Intel MMX), and GPUs(NVIDIA)
  • TLP: thread level parallelism, multiprocessors and hardware multithreading
  • RLP: request level parallelism: data centers

Vector processors

  • Improves throughput rather than latency
  • Vector register file
    • 2D register array
    • number of elements per register determines the max vector length
  • Vector functional units
    • single opcode for multiple units
    • Integer/floating point/load/store
  • Single instruction defines multiple operations
    • lower instruction fetch/decode/issue cost
  • Operations executed in parallel
    • no dependences within a vector instruction: simple hardware
  • Predictable memory access pattern
    • improve performance by prefetching
    • simple memory scheduling policy
    • multi banking may be used for improving bandwidth
  • Fixed length - narrow is common as wide is not efficient
  • Variable length - vector length register with MVL(maximum VL)
  • Conditional execution
    • by predication: mask (flag) vectors with single-bit elements; a vector compare produces the flags, and the mask controls the next vector operation
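
A scalar C sketch of masked (predicated) vector execution: a compare produces a flag vector, and the mask controls which lanes the next operation updates; the vector length and data are illustrative:
#include <stdio.h>

#define VL 8

int main(void) {
    double a[VL] = {1, -2, 3, -4, 5, -6, 7, -8};
    double b[VL];
    int    mask[VL];

    for (int i = 0; i < VL; i++) b[i] = a[i];           /* unmasked copy            */
    for (int i = 0; i < VL; i++) mask[i] = (a[i] < 0);  /* vector compare -> flags  */
    for (int i = 0; i < VL; i++)
        if (mask[i]) b[i] = -a[i];                      /* masked vector operation  */

    for (int i = 0; i < VL; i++) printf("%g ", b[i]);
    printf("\n");
    return 0;
}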

Graphics processing unit

  • Flynn’s taxonomy: data vs. instruction streams
    • SISD (single instruction, single data): uniprocessors
    • SIMD (single instruction, multiple data): vector processors
    • MISD (multiple instruction, single data): systolic arrays
    • MIMD (multiple instruction, multiple data): multiprocessors

  • GPU: initially developed as graphics accelerator
    • receives geometry information from the CPU and provides a picture
    • Host interface -> Vertex processing -> Triangle setup -> Pixel processing -> Memory interface
  • Host interface: communication bridge
    • receives commands from the CPU and pulls geometry info from system memory
    • outputs a stream of vertices in object space with all their associated information
  • Vertex processing: object space to screen space
  • Pixel processing: rasterize triangle to pixels, texture mapping and math operations
  • Programming GPUs: programs are executed for every vertex or every fragment; fully customizable geometry and shading effects
  • Memory interface
    • framebuffer(fragment color from previous stage)
    • before the final writes, some fragments are rejected by z-buffer, stencil, and alpha tests
    • z and color are compressed to reduce framebuffer bandwidth(not size)
  • General-purpose GPUs(GPGPUs)
    • heterogeneous system
    • simple in-order pipelines that rely on thread-level parallelism to hide long latencies
    • many registers(~1K) per pipeline to support many active warps
  • Single instruction multiple threads SIMT
    • thread blocks
    • warps
    • in-order pipelines
  • CPU
    • optimized for low-latency access to cached data sets
    • control logic for out-of-order and speculative execution
  • GPU
    • optimized for data-parallel throughput computation
    • architecture tolerant of memory latency
    • more transistors dedicated to computation
  • Compiling CUDA
    • call nvcc
    • CPU code
    • Parallel Thread eXecution (PTX)
      • virtual machine and ISA
    • PTX to device-specific binary object
  • Memory hierarchy
    • Throughput-oriented main memory
    • GDDR: wide channel(256-bit), lower clock rate than DDR(double data rate-transfer data on rise and fall)
    • 1.5MB shared L2
    • 48KB read-only data cache
      • compiler controlled
    • Wide buses

Thread level parallelism

  • Thread: single sequential flow of control within a program(instructions + state)
    • Register state: Thread context
  • Single threaded vs multithreaded
  • Multitasking: the OS loads the context of a new thread while the old thread’s context is written back to memory
  • explicitly multi-threaded programs
  • parallel languages and libraries
  • Architectures for exploiting thread-level parallelism
    • hardware multithreading: multiple threads run on the same processor pipeline
      • multithreading levels
        • coarse grained CGMT
        • fine grained FGMT
        • simultaneous SMT
    • multiprocessing: different threads run on different processors
      • two types:
        • symmetric multiprocessors(SMP): single cpu per chip
        • Chip multiprocessors (CMP): multiple cpu per chip
  • CPU idle due to latencies
    • utilize idle resources
    • challenge: energy and performance costs of context switching
  • CGMT:
    • runs a thread until a costly stall, e.g. an LLC miss
    • another thread starts during the first thread’s stall (pipeline fill time of several cycles)
    • at a given time only one thread is in the pipeline
    • problems: hardware support needed (PC and registers for each thread); does not help with short stalls
    • vs superscalar
  • FGMT:
    • two or more threads interleave instructions round-robin, skipping stalled threads
      • hardware support: separate PC and register file, alternating pattern
      • naturally hides delays but does not make full use of multi-issue architecture
  • SMT:
    • instruction from multiple threads issued on same cycle: register renaming and dynamic scheduling of multi-issue architecture
    • more hardware support: per-thread register file and PC, temporary results, and distribution
    • Maximizes utilization of execution units
  • SMP: multiple CPU share same memory
    • OS view: identical CPUs with equal access to main memory
    • each cpu has its own power distribution and cooling
  • Chip multiprocessor: SMP on a single chip(with cores)
    • shared higher-level caches; lower latency and improved bandwidth at the last level
    • not necessarily homogeneous cores
  • CMP vs SMP
    • lower cost: single memory interface, single socket
    • less off-chip communication: lower power and energy consumption, improved AMAT
    • use additional transistors made available: more cores/more complicated pipelines
  • Efficiency: ideally n cores -> nx performance
    • if the goal is to provide the same performance as a uniprocessor, frequency can be 1/n

Parallel memory architecture

  • Amdahl’s law for theoretical speedup
  • Shared memory vs message passing
    • multiple threads shared mem vs explicit communication through interconnection
    • easy to program vs simple hardware
  • UMA, NUMA
  • shared network, p2p network
    • low latency low bandwidth simple control
    • high latency high bandwidth, complex control
  • shared memory challenges
    • memory consistency: program order
    • cache coherence: cache should be transparent to programmer
  • cache coherence
    • multiple copies become inconsistent when one is written
    • simple solution is to propagate writes from one core to others
    • key op: update/invalidate sent to all or a subset of the cores
      • software based management: flush dirty blocks to memory, invalidate: make all the blocks invalid
      • hardware based management: update or invalidate other copies on every write or send the data to everyone
    • invalidation based protocol is better
  • Snoopy protocol
    • relying on a broadcast infrastructure among caches
    • every cache monitors (snoops) the traffic on the shared medium to keep the states of its cache blocks up to date
    • Simple version:
      • write-through and write no-allocate with multiple readers allowed and writes invalidate replicas
      • Simple state machine for each cache unit: one-bit valid flag FSM
      • processor actions: load, store, evict (V->I) [5]
      • bus traffic: BusRd, BusWr (note that the data is also on this bus)
      • nomenclature: action taken/action on bus

Shared memory system

  • Inconsistent vs consistent data
  • Snooping with writeback policy:
    • writes are not propagated to memory until eviction (cached data may differ from main memory)
    • solution: owner of the most recently updated replica
      • only one owner; only the owner can update; readers can share the data but must gain ownership to write
      • three states MSI: invalid(no replica), shared(read-only), modified(writable and only valid copy)
      • processor actions: load,store,evict
      • bus messages: Busrd, BusRdX(read exclusive), BusInv, BusWB, BusReply
    • MESI: illinois protocol
      • enter exclusive first then shared, so no invalidation traffic on write-hits in the E state
      • lower overheads in sequential applications
      • more complex protocol, longer memory latency due to the protocol
  • Alternatives to snoopy protocols
    • not scalable: shared bus bandwidth is limited, every node broadcasts message and monitors the bus
    • limit the traffic using directory structure: keep track of sharers of each block
  • memory operation reordering: the memory consistency model specifies constraints on the order in which memory operations must appear to be performed
    • Dekker’s algorithm: mutex critical region, any time only one process allowed:
lock: A = 1;
if (B != 0) {
    A = 0;
    goto lock;
}
// region
A = 0;
  • Sequential consistency
    • within a thread, program order is preserved and each instruction is atomic
    • instructions from different threads can be interleaved arbitrarily
  • relaxed consistency model
    • real processors: not all instructions need to be executed in program order
    • a fence instruction can be used to enforce ordering among memory instructions
    • use a fence to protect lock_acquire (see the sketch after this list)
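
A sketch of the ordering that a Dekker-style lock_acquire needs, written with C11 atomics; under a relaxed model the fence keeps the store to A ordered before the load of B (the split into per-thread functions is only illustrative):
#include <stdatomic.h>

atomic_int A = 0, B = 0;   /* the two flags from the code above */

void lock_acquire_thread0(void) {
    for (;;) {
        atomic_store_explicit(&A, 1, memory_order_relaxed);
        /* Fence: make the store to A visible before the load of B;
           without it a relaxed model may reorder them and both
           threads could enter the critical region.                */
        atomic_thread_fence(memory_order_seq_cst);
        if (atomic_load_explicit(&B, memory_order_relaxed) == 0)
            return;                                /* lock acquired */
        atomic_store_explicit(&A, 0, memory_order_relaxed);
        /* retry */
    }
}

void lock_release_thread0(void) {
    atomic_store_explicit(&A, 0, memory_order_release);
}

int main(void) {
    lock_acquire_thread0();
    /* critical region */
    lock_release_thread0();
    return 0;
}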

  [1] ?

  [2] what is potential ILP

  [3] parallel with SSA?

  [4] ?

  [5] shouldn’t need this