Architecture
Trend
- Moore’s Law
- Dark silicon
Metrics
- How to measure performance:
- Latency or response time
- Bandwidth or throughput - tends to improve at least as the square of the improvement in latency
- Benchmark suite: a set of representative programs that are likely relevant to the user
- Average:
- Time: Arithmetic mean
- Harmonic Mean: throughput
- Geometric Mean: speedup
- Processor performance:
- clock cycle time T
- cycles per instruction (CPI; its reciprocal is IPC, instructions per cycle) - depends on the instruction set in question
- Instruction count(IC)
- CPU time \( = IC \times CPI \times T\)
- comparison
- Speedup/slowdown
- percentage increase in performance
- reduction in execution time
- locality/parallelism
- Amdahl’s Law: speeding up only a fraction of the execution gives diminishing overall returns (a small sketch of this and the CPU-time formula follows below)
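A minimal C sketch of the CPU-time formula and Amdahl’s Law; the instruction count, CPI, clock period, and the 80%/10x split in `main` are assumed example values, not figures from the course.

```c
#include <stdio.h>

/* CPU time = instruction count x cycles per instruction x clock cycle time */
static double cpu_time(double ic, double cpi, double t) {
    return ic * cpi * t;
}

/* Amdahl's Law: overall speedup when a fraction f of execution is sped up by s */
static double amdahl_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void) {
    /* assumed example values: 1e9 instructions, CPI 1.5, 1 GHz clock */
    double ic = 1e9, cpi = 1.5, t = 1e-9;
    printf("CPU time = %.3f s\n", cpu_time(ic, cpi, t));
    /* speeding up 80% of the program by 10x gives well under 10x overall */
    printf("Amdahl speedup = %.2fx\n", amdahl_speedup(0.8, 10.0));
    return 0;
}
```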
- Power&Energy
- power (\(P = VI\))
- Energy \(E = P \times T\), where total power \(P = P_{sta} + P_{dyn}\)
- Static - constant power draw
- \(P_{sta} = V \times I_{sta}\)
- dynamic - consumed at every switching event due to charging and discharging load capacitances
- \(P_{dyn} = A C V^2 f\): activity factor \(A\), switched capacitance \(C\), supply voltage \(V\), clock frequency \(f\); equivalently \(P = AV\left(\frac{CV}{T}\right)\) with \(T = 1/f\) (see the sketch after this list)
- CPU power is the rate of heat generated
- excessive peak power may result in burning the chip
- Opportunistic power gating
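A small sketch of the dynamic-power formula above; the activity factor, capacitance, voltage, and frequency are assumed example values.

```c
#include <stdio.h>

/* Dynamic power: P_dyn = activity factor * switched capacitance * V^2 * f */
static double dynamic_power(double activity, double cap_farads,
                            double voltage, double freq_hz) {
    return activity * cap_farads * voltage * voltage * freq_hz;
}

int main(void) {
    /* assumed example values: A = 0.1, C = 1 nF, V = 1.0 V, f = 2 GHz */
    double p = dynamic_power(0.1, 1e-9, 1.0, 2e9);
    printf("P_dyn = %.3f W\n", p);   /* note the quadratic dependence on V */
    return 0;
}
```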
- Cost of integrated Circuit
- Cost of die \(\frac{\text{wafer cost}}{\text{dies per wafer} * \text{die yield}}\)
- Yield of die \(\frac{\text{wafer yield}}{(1+\text{defect per unit area} * \text{die area})^N}\)
- \(N\): process-complexity factor specified by the chip manufacturer (a small sketch of the cost and yield formulas follows)
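A small sketch of the die-yield and die-cost formulas; the wafer cost, defect density, dies per wafer, and process-complexity factor \(N\) are made-up example numbers.

```c
#include <stdio.h>
#include <math.h>

/* Die yield = wafer yield / (1 + defects_per_area * die_area)^N */
static double die_yield(double wafer_yield, double defects_per_cm2,
                        double die_area_cm2, double n) {
    return wafer_yield / pow(1.0 + defects_per_cm2 * die_area_cm2, n);
}

/* Die cost = wafer cost / (dies per wafer * die yield) */
static double die_cost(double wafer_cost, double dies_per_wafer, double yield) {
    return wafer_cost / (dies_per_wafer * yield);
}

int main(void) {
    /* assumed example values, not a real process */
    double y = die_yield(1.0, 0.1, 1.0, 11.5);   /* N given by the manufacturer */
    printf("die yield = %.3f, die cost = $%.2f\n",
           y, die_cost(5000.0, 100.0, y));
    return 0;
}
```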
- Dependability
- System reliability and availability
- Mean time to Failure MTTF
- Mean time to Repair MTTR
- System availability = \(\frac{MTTF}{MTTF+MTTR}\)
Instruction set architecture
- interfacing contract between hardware and software
- Defines
- functional operation of units
- How to use each functional unit
- Does not define
- how functional unit implemented
- Execution time of operation
- energy consumption
- Programmer’s perspective
- internal machine states: program counter - next instruction to be executed
- operations
- addressing modes
Operand location
- CISC
- may reduce IC, increase CPI, and increase CT
- CPU time may increase
- complex operations
- variable length
- limited registers
- RISC
- simple operations
- fixed length
- limited instruction formats
- Memory addressing
- Register
- Immediate
- Displacement
- Register Indirect
- indexed
- direct
- memory indirect @
- auto decrement/increment
- scaled
- Instruction format
- MIPS: I-type (load/store, conditional branch), R-type (ALU), J-type (jump)
- Fields: opcode | rs | rt | rd | shamt | funct (R-type)
- Processing
- Instruction fetch
- Instruction decode
- Register read
- Execute instructions
- Memory access
- Register write back
- Example: Single-cycle RISC architecture
- Pipelining
- Reusing idle resources: each processing step finishes in a fraction of a cycle
- Improving throughput at the expense of latency
- Per-instruction delay: \(D=T+n\delta\) - ideal case (unpipelined delay \(T\), \(n\) stages, latch overhead \(\delta\) per stage)
- Throughput: \(IPS = n/D\) (see the sketch below)
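A small sketch of the ideal-pipeline formulas above; the unpipelined delay \(T\) and latch overhead \(\delta\) are assumed values.

```c
#include <stdio.h>

int main(void) {
    /* assumed values: unpipelined delay T = 10 ns, latch overhead delta = 0.2 ns */
    double T = 10e-9, delta = 0.2e-9;
    for (int n = 1; n <= 16; n *= 2) {           /* n pipeline stages */
        double D   = T + n * delta;              /* latency of one instruction */
        double ips = n / D;                      /* ideal throughput (no stalls) */
        printf("n=%2d  latency=%5.2f ns  throughput=%7.1f M instr/s\n",
               n, D * 1e9, ips / 1e6);
    }
    return 0;   /* deeper pipelines raise throughput but also raise latency */
}
```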
- Example: 5-stage MIPS pipeline
- Instruction Fetch: read instruction from memory (I-Cache)
- Program counter (PC)
- Compute NPC
- Update pipeline register
- Instruction Decode
- generate control signal
- read source operands from register file
- update pipeline register
- send operand and immediate value to next stage
- pass control signal and NPC to next stage
- Execution Stage
- Perform ALU operation
- result of ALU
- Compute branch target(assuming branch taken): Target = NPC + immediate
- update pipeline registers
- Control signals, branch target, ALU results, and destination
- Memory Access
- Resolve the branch: select the next PC (NPC or branch target)
- Load/Store
- Register Write Back
- Pipeline Hazards
- Structural hazards - multiple instructions compete for the same resources
- unified memory for Ins/Dat
- register file with shared read/write access port - subcycles
- Data hazards - a dependent instruction cannot proceed because it needs a value
- stall, or have the compiler place independent instructions in between
- forwarding - produce values to bypass register file to be used directly
- Control hazards - the next instruction cannot be fetched because the outcome of an earlier branch is unknown
- stall, or have the compiler place independent instructions in between
- predict the branch outcome: assume the branch is taken or not taken, may need to cancel wrong path
- jumps and function calls can be resolved in the decode stage
- Multicycle instructions
- FP operation typically need more time
- Not all ALU operation complete in one cycle
- Structural hazards: potential multiple RF writes
- Data hazards:
- more read-after-write hazards
- potential write-after-read hazards
- potential write-after-write hazards
- Imprecise exception
- instructions do not necessarily complete in program order
- state of the processor must be kept updated with respect to the program order
ILP: Instruction level parallelism
- Stalls slow down the pipeline; when fully stalled, the CPU still pays the penalty for latch latency
- ILP
- increase overlap among instructions in the pipeline
- decrease stalls
- Potential overlap among instructions
- property of the program dataflow
- influenced by compiler
- ILP: \(\frac{NumInstructions}{NumCycles}\)
- IPC is the exploited ILP
- Exploitation method:
- Dynamic scheduling in hardware
- static scheduling in software(compiler)
Big picture
- Goal: improving performance \((IPC\times F)/IC\)
- software: ILP and IC
- hardware: IPC
- Increase IPC: improve ILP and exploit more ILP
- Increase F: deeper pipeline and faster technology
- Key performance limitation
- number of instructions fetched per second is limited
- deeper fetch and wider fetch
Compiler-based techniques
- Loop book-keeping overhead
- Diverse impact of stall cycles on performance
- branch delay slots: an instruction of the loop moved to after the branch and executed to completion
Loop optimization
- Re-ordering and changing immediate values
Reducing loop overhead by unrolling
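A minimal C illustration of unrolling: the unrolled version pays the loop overhead (branch and index update) once per four elements and needs a clean-up loop for the leftover iterations. The function and unroll factor are arbitrary, and real compilers perform this transformation themselves.

```c
/* Original loop: one branch and one index update per element. */
void scale(float *a, int n, float k) {
    for (int i = 0; i < n; i++)
        a[i] *= k;
}

/* Unrolled by 4: one branch and one index update per 4 elements,
 * which also exposes more independent work to schedule. */
void scale_unrolled(float *a, int n, float k) {
    int i = 0;
    for (; i + 3 < n; i += 4) {
        a[i]     *= k;
        a[i + 1] *= k;
        a[i + 2] *= k;
        a[i + 3] *= k;
    }
    for (; i < n; i++)   /* clean-up loop for leftover iterations */
        a[i] *= k;
}
```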
- Scalar pipelines
- Upper bound on throughput \(IPC<=1\)
- Unified pipeline for all functional units: underutilized resources
- Inefficient freeze policy: stall cycle delays all the following cycles
- Pipeline hazards: stall cycles result in limited throughput
- Superscalar pipelines
- Separate integer and floating point pipelines \(IPC<=2\)
- instruction packets using VLIW (very long instruction word)
- each packet has one FP slot and one integer slot
Software pipelining: mix-and-match instructions from different iterations
Control Flow
- Impact of branches
- Program’s basic blocks:
- only one entry and one exit point per basic block
- branches
- conditional vs. unconditional: check conditions, jump call return
- target address: absolute, relative
- penalty due to unknown outcome (direction and target): mitigated by branch prediction
- profiling the program
- predict based on common cases
- Goal of branch prediction
- avoid stall cycles in fetch stage
- static prediction
- dynamic prediction
- prediction accuracy: \(correct/total\)
- Compute slowdown for misprediction
- Dynamic branch prediction:
- prediction logic: direction and target address
- outcome validation and training: outcome is computed regardless of prediction
- recovery from misprediction: nullify the effect of instructions on the wrong path
- One-bit branch predictor: keep track of and use the outcome of the last executed branch
- accuracy limited: a single predictor may be shared by multiple branches; two mispredictions per loop (one entering, one exiting)
Branch Predictors
- goal avoiding stall cycles caused by branches
- solution: static or dynamic branch predictor
- prediction
- validation and training
- recovery from misprediction
- performance is influenced by the frequency of branches \(b\), prediction accuracy\(a\) and misprediction cost \(c\)
- \(Speedup = \frac{1+bc}{1+(1-a)bc}\)
- Note: the baseline assumes every branch is a misprediction (see the sketch below)
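A small sketch of the speedup formula above; the 20% branch frequency and 3-cycle misprediction cost are assumed example values.

```c
#include <stdio.h>

/* Speedup of branch prediction over an always-mispredict baseline:
 * (1 + b*c) / (1 + (1-a)*b*c)                                      */
static double bp_speedup(double b, double a, double c) {
    return (1.0 + b * c) / (1.0 + (1.0 - a) * b * c);
}

int main(void) {
    /* assumed: 20% of instructions are branches, 3-cycle misprediction cost */
    double b = 0.2, c = 3.0;
    for (double a = 0.5; a <= 1.0001; a += 0.1)
        printf("accuracy %.0f%% -> speedup %.3fx\n", a * 100, bp_speedup(b, a, c));
    return 0;
}
```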
- Bimodal branch predictors
- one-bit branch predictor
- keep track of and use the outcome of the last execution of the branch
- shared predictor
- two mispredictions per loop
- Two-bit branch predictor
- increment if taken, decrement if not taken; predict from the upper/lower half of the counter range
- one misprediction on loop exit
- using multiple counters
- Decode history table(DHT): reduced HW with aliasing
- least significant PC bits are used to select a counter (see the sketch after this list)
- Branch history table(BHT): precisely tracking branches
- Most significant bits are used as tags(tags + address)
- Combined BHT and DHT
- BHT is used on a hit
- DHT is used/updated on a miss
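A hedged sketch of a DHT-style bimodal predictor: a table of 2-bit saturating counters indexed by the low PC bits, so unrelated branches may alias. The table size and index function are assumptions for illustration.

```c
#include <stdint.h>
#include <stdbool.h>

#define DHT_ENTRIES 1024                 /* must be a power of two */

static uint8_t counters[DHT_ENTRIES];    /* 2-bit counters, values 0..3 */

static unsigned dht_index(uint32_t pc) {
    return (pc >> 2) & (DHT_ENTRIES - 1);   /* drop byte offset, keep low index bits */
}

/* Predict taken when the counter is in the upper half (2 or 3). */
bool predict(uint32_t pc) {
    return counters[dht_index(pc)] >= 2;
}

/* Training: increment on taken, decrement on not taken, saturating at 0 and 3. */
void train(uint32_t pc, bool taken) {
    uint8_t *c = &counters[dht_index(pc)];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}
```

A gshare-style correlating predictor would XOR a global history register into the same index before selecting a counter.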
- Correlating branch predictor
- Executed branches of a program stream may be correlated
- Global History register: an r-bit shift register that maintains outcome history
- GHR is merged with PC bits to choose a counter(XOR)
- A local predictor may work well for some programs while a global predictor works better for others
- could include both to be safe
- Dedicated predictor per branch
- Program counter is used for assigning predictors to branches
- Capturing correlation among branches
- Shift register is used to track history
- Predicting branch direction is not enough
- which instruction to fetch if the branch is taken (target)
- Storing the target instruction can eliminate fetching
- Extra hardware is required
Branch target buffer
- Store tags and target addresses for each branch
- only for taken branches; send out the predicted PC to minimize stalls
Dynamic scheduling
- Instruction scheduling can remove unnecessary stall cycles in the execution/memory stage
- static scheduling: complex compiler, unable to resolve all data hazards(no access to runtime details)
- dynamic scheduling: completely done in hardware
- Creating an instruction schedule based on runtime information
- hardware managed instruction reordering
- instructions are executed in data flow order
- ID is split into:
- Issue: ensure no structural hazard
- Read operands: ensure no data hazards
- Register renaming:
- Eliminating WAR and WAW hazards
- change the mapping between architectural registers and physical storage locations
- allocate a free physical location for each new destination register; source operands use the most recently allocated location for that register
- dynamic issue: issue dynamic instructions out of program order while maintaining data flow
Tomasulo algorithm:
- dispatch instructions to functional units: use reservation stations(RS) - buffer the operands of instructions waiting to issue
- execute an instruction as soon as all of its operands are ready: watch the common data bus(CDB)
- remove false (anti- and output-) data dependences: rename the destination register to the RS name
- Reservation station(RS) entry:
- busy - indicates that this reservation station and its accompanying functional units are occupied
- op
- vj
- vk
- qj - tags referring to the reservation stations that will produce the corresponding source operand; 0 means available or not needed
- qk
- addr - immediate field or effective address for load or store
- Qi of register file: the number of reservation station that contains the operation whose result should be stored into this register
- Reorder Buffer (ROB) entry: additional storage, used much like the reservation stations, that supplies values until instructions commit (see the struct sketch below)
- valid
- result
- exception
- pc
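A hedged C sketch of the per-entry bookkeeping described above (reservation-station entry, ROB entry, and the register file’s Qi field); the field names follow the lists above, while the types and widths are assumptions.

```c
#include <stdint.h>
#include <stdbool.h>

/* One reservation-station entry (tag 0 means "value available / not needed"). */
struct rs_entry {
    bool     busy;        /* entry and its functional unit are occupied       */
    uint8_t  op;          /* operation to perform                             */
    int64_t  vj, vk;      /* source operand values, valid when qj/qk == 0     */
    uint8_t  qj, qk;      /* RS tags that will produce the missing operands   */
    int64_t  addr;        /* immediate, or effective address for load/store   */
};

/* One reorder-buffer entry: holds the result until the instruction commits.  */
struct rob_entry {
    bool     valid;       /* result has been produced (written over the CDB)  */
    int64_t  result;
    bool     exception;   /* deferred until commit to keep exceptions precise */
    uint64_t pc;
};

/* Per architectural register: the RS whose result will be written here,
 * or 0 if the register file already holds the latest value.                  */
struct reg_status { uint8_t qi; };
```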
- three-step Tomasulo algorithm
- Issue: take an instruction from the instruction queue
- if there are free RS without structural hazards, rename and read/send operands or RS names
- Execute: operate on operands when ready
- if all operands are ready, or watch CDB
- Write result: update the register values
- write the result through CDB to all waiting RS and register file, release RS entry
- Summary:
- Data hazards
- RAW is handled by forwarding over CDB
- WAR and WAW are removed by RS-based renaming
- Structural hazards
- Multiple FUs may be accessing CDB simultaneously: delay conflicting instructions at issue and RS
- Precise exeception handling
- not possible with out-of-order writeback to the register file; solution: keep the destination value in the ROB until commit
- Four-step Tomasulo algorithm
- Issue(dispatch): if RS and ROB slots are free; read/rename operands
- Execution: execute operation as soon as the operand values are ready
- write result: send result to ROB and reservation stations via CDB
- commit(retire): update register file for the head of ROB
- write back head of the ROB
- branch prediction: correct
- branch prediction incorrect: flush ROB and restart execution at the correct successor of the branch
Hardware speculation
- out-of-order execute/complete
- in-order fetch/decode and commit
- distributed reservation stations: in-order(for each functional unit) issue/dispatch
- out of order issue/dispatch to functional units
- two step wakeup and select logic
- Register renaming
- Register aliasing table for fast lookup(RAT)
- Free register list for fast register renaming
- Proceed only if there is free space in IQ, ROB and Free List
- Branch prediction and recovery: squash all mispredicted entries - commit speculative instructions iff prediction was correct
- Dual-RAT architecture: use two RATs and a single enlarged register file
- use the retirement RAT when committing entries from the ROB
- free entries when there is no reference left(Issue Queue, IQ and RAT)
Out-of-order load and store
- IQ with MA units:
- dedicated queue only for load/store instructions (LSQ)
- two steps:
- compute the effective address when register is available
- send the request to memory if there is no memory hazards
- loads: only loads not following any unresolved store can be issued (RAW)
- stores: issued only when no earlier instructions remain (WAW and WAR)
- can we predict memory dependence?
- issue/execute load instructions even if they are following unresolved stores
Memory hierarchy
Memory hierarchy design
- basics
- why
- techs
- locality
- caches
- block, index, tag, address, cache policies
- Performance metric(AMAT) and optimization techniques
- Virtual memory
- paging, address translation, TLB
- Memory hierarchy
- Burks, Goldstine, and von Neumann: greater capacity less quickly accessible
- The memory wall
- processor-memory performance gap increased over 50% per year
- Memory technology
- RAM technology
- access time ~same for all locations(not so true anymore)
- static RAM(SRAM)
- typically used for caches
- 6T/bit; fast, but low density, high power, expensive
- cell structure: 6 transistors per cell; maintains data while powered on
- Dynamic RAM(DRAM)
- typically used for main memory
- 1T/bit; inexpensive, high density, low power - but slow
- cell structure 1T-1C: needs refresh regularly to preserve data
- Cache hierarchy:
- software: scratchpad
- hardware: caches
- Principle of locality
- memory references exhibit localized accesses
- Spatial: if address \(A\) is accessed at time \(t\), the probability of accessing \(A+\delta\) soon after is highest as \(\delta \rightarrow 0\)
- Temporal: the probability of accessing \(A\) again at time \(t+\epsilon\) is highest as \(\epsilon \rightarrow 0\)
- Cache terminology
- Block(cache line): unit of data access
- Hit: accessed data found at current level, Miss
- Hit rate, hit time: time to access data on a hit
- Miss rate, miss penalty
- Cache performance:
- Average Memory Access Time(AMAT): \[AMAT=r_ht_h+r_m(t_h+t_p) = t_h + r_mt_p\]
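A one-function sketch of the AMAT formula above; the hit time, miss rate, and miss penalty in `main` are assumed example numbers.

```c
#include <stdio.h>

/* AMAT = hit time + miss rate * miss penalty */
static double amat(double t_hit, double miss_rate, double miss_penalty) {
    return t_hit + miss_rate * miss_penalty;
}

int main(void) {
    /* assumed: 1 ns hit time, 5% miss rate, 60 ns miss penalty */
    printf("AMAT = %.2f ns\n", amat(1.0, 0.05, 60.0));
    return 0;
}
```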
Cache Architecture
Addressing and lookup
- Addressing:
- Main memory address
- Simplest: direct-mapped cache (mid-bits as index; see the sketch after this list)
- address split into tag | index | byte offset; a valid flag (v) indicates whether the block content is meaningful
- data and tag are always accessed
- Cache size only includes data
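A minimal sketch of the tag | index | byte-offset split for a direct-mapped cache; the 64-byte block and 256-set geometry (and the address itself) are assumptions for illustration.

```c
#include <stdio.h>
#include <stdint.h>

#define BLOCK_BYTES 64          /* assumed block (cache line) size        */
#define NUM_SETS    256         /* assumed number of sets (direct-mapped) */
#define OFFSET_BITS 6           /* log2(BLOCK_BYTES)                      */
#define INDEX_BITS  8           /* log2(NUM_SETS)                         */

int main(void) {
    uint32_t addr   = 0x12345678;                              /* example address    */
    uint32_t offset = addr & (BLOCK_BYTES - 1);                /* byte within block  */
    uint32_t index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1);  /* selects the set    */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);      /* compared on lookup */
    printf("tag=0x%x index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}
```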
Cache optimization
- How to improve cache performance
- Reduce hit time: memory tech, critical access path
- improve hit rate: size, associativity, placement/replacement policies
- Reduce miss penalty: multi level caches, data prefetching
- Set associative caches
- Improve cache hit rate by allowing a memory location to be placed in more than one cache block
- N-way set associative cache, fully associative
- for a fixed capacity, higher associativity typically leads to higher hit rates
- n-Way
- index into cache sets
- multiple tag comparisons
- multiple data reads
- Special: direct mapped, fully associative
- Cache miss classifications
- cold(compulsory): even ideal cache will have it
- larger blocks, prefetching
- Capacity: due to cache’s size limit
- larger cache
- Conflict: due to less associativity
- larger cache
- more associativity
Cache optimization
- cache replacement policies
- which block to replace on a miss?
- Ideal replacement (Belady’s algorithm)
- replace the block accessed farthest in the future
- Least recently used: LRU
- replace the block accessed farthest in the past
- Most recently used: MRU
- replace the block accessed nearest in the past
- Random replacement
- Cache write policies
- write vs. read
- Data and tag are accessed for both read and write
- only on a write does the data array need to be updated
- miss: write no allocate, write allocate
- hit: write back, write through
- Write back:
- on a write access write to cache only
- write cache block to memory only when replaced from cache
- dramatically decreases bus bandwidth usage
- keep a bit(dirty bit) per cache block
- Write through
- write to both cache and memory(or next level) through use of a write buffer
- improved miss penalty
- more reliable because of maintaining two copies
- the processor stalls to let memory catch up in case the write buffer fills up
- Write allocate
- allocate a cache line for the new data and replace old line
- Write no allocate
- do not allocate space in the cache for the data
- only really makes sense in systems with write buffers
- Reducing miss penalty
- inevitable misses: serve as quickly as possible
- Multilevel caches
- read misses priority over writes
- sub-block placement
- critical word first
- Victim cache
- reduce conflict misses: larger capacity and more associativity
- Associativity is expensive: complex hardware, longer hit time, more energy
- Observation: conflict misses do not occur in all sets, increase associativity on the fly
- Small fully associative cache, on eviction move the victim block here
- Cache inclusion
- how to reduce the number of accesses that miss in all cache levels?
- should a block be allocated in all levels?
- yes: inclusive, no: non-inclusive/exclusive
- Modern processors:
- L3: inclusive of L1 and L2
- L2: non-inclusive of L1 (large victim cache)
Virtual Memory
- Memory hierarchy: lower levels provide greater capacity but longer time
- will a program fit in main memory/what if running multiple programs
- virtual memory: use main memory as a “cache” for secondary memory
- allow efficient and safe sharing the physical main memory among multiple programs
- virtual memory systems
- provides illusion of very large memory
- address space of each program larger than the physical main memory
- Memory management unit (MMU)
- between main and secondary mem.
- address translation
- virtual address space to physical address space
- every virtual address is translated to a physical address with the help of hardware
- where, what to do on table miss(page fault), cost
- Address translation cost
- page walk: look up the physical address in the page table
- multi-level page table
- the virtual (logical) address space is broken down into multiple pages
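A minimal sketch of a two-level translation split, assuming a 32-bit virtual address, 4 KB pages, and a 10/10/12-bit split (roughly x86-style); the address itself is arbitrary.

```c
#include <stdio.h>
#include <stdint.h>

/* Assumed 32-bit virtual address, 4 KB pages, two-level page table:
 *   [ level-1 index : 10 | level-2 index : 10 | page offset : 12 ]   */
int main(void) {
    uint32_t vaddr  = 0xC0ABC123;       /* example virtual address               */
    uint32_t offset = vaddr & 0xFFF;    /* byte within the page                  */
    uint32_t vpn    = vaddr >> 12;      /* virtual page number (TLB lookup key)  */
    uint32_t l2     = vpn & 0x3FF;      /* index into a level-2 table            */
    uint32_t l1     = vpn >> 10;        /* index into the level-1 (root) table   */
    printf("vpn=0x%x (l1=%u, l2=%u), offset=0x%x\n",
           (unsigned)vpn, (unsigned)l1, (unsigned)l2, (unsigned)offset);
    /* On a TLB hit, the vpn -> physical page mapping is found without the walk. */
    return 0;
}
```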
- Translation lookaside buffer
- exploit locality to reduce address translation time: keep the translation in a buffer
- fully associative/set associative/direct
- typically faster than cache access: TLBs are smaller(typically not more than 128 to 256 entries)
- Content addressable memory(CAM) based TLB
- the virtual page number is matched in the CAM (data in, match address out); the match indexes a RAM holding the physical page number, dirty bit, and status bits
- physically indexed caches:
- increased critical path from sequential TLB and cache access
- virtually indexed caches
- parallel lookup and compare between tags from TLB and Tag Array
Main memory systems
- SRAM for caches
- fast and leaky, 6 transistors per bit, CMOS process
- static volatile, retain data when powered on
- DRAM for main memory
- slow and dense, 1 transistor per bit, DRAM process
- dynamic volatile, periodic refreshing is required to retain data
- DRAM is organized as rows × columns
- DRAM load is destructive: discharging capacitors
- Row access strobe(RAS) and column access strobe(CAS)
- all reads and writes are performed through DRAM row buffer(RB)
- Row buffer holds a single row of the array: typical 8KB
- entire row moved but only a block is accessed
- row buffer access possibilities
- row buffer hit: no need for precharge or activate (~20)
- row buffer miss: an activate is needed for an empty row (~40); a row conflict additionally needs a precharge (slowest; see access times below)
- DRAM refresh
- current leakage: charge leaks in both directions (a cell can drift toward 0 or 1)
- DRAM requires the cells’ contents to be read and written periodically
- Burst refresh: all at once
- distributed refresh: a group at a time
DRAM controller
- DRAM operations:
- Precharge bitlines to prepare subarray for activating a wordline
- Activate a row by connecting DRAM cells to the bitline and start sensing
- Read the contents of a data block from the row buffer
- Write new contents for data block into the row buffer
- Refresh DRAM cells: can be done through a precharge followed by an activate
- DRAM control
- external controller, may be integrated on CPU(modern)
- Basic timings:
- \(t_{CAS/CL}\): column access strobe (RD->DATA)
- \(t_{RAS}\): row active strobe (ACT->PRE)
- \(t_{RP}\): row precharge (PRE->ACT)
- \(t_{RC}\): row cycle (ACT->PRE->ACT)
- \(t_{RCD}\): row to column delay(ACT->RD/WT)
- Access time:
- Row hit: \(t_{CAS}\)
- Row empty: \(t_{RCD} + t_{CAS}\)
- Row conflict: \(t_{RP} + t_{RCD} + t_{CAS}\)
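A tiny sketch of the three access-time cases above; the 15 ns values for \(t_{CAS}\), \(t_{RCD}\), and \(t_{RP}\) are assumed round numbers, not a specific part’s datasheet timings.

```c
#include <stdio.h>

/* Assumed timings (ns): tCAS = 15, tRCD = 15, tRP = 15 */
static const double tCAS = 15.0, tRCD = 15.0, tRP = 15.0;

int main(void) {
    printf("row hit:      %.0f ns\n", tCAS);               /* data already in row buffer */
    printf("row empty:    %.0f ns\n", tRCD + tCAS);        /* activate, then read        */
    printf("row conflict: %.0f ns\n", tRP + tRCD + tCAS);  /* precharge, activate, read  */
    return 0;
}
```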
- Memory channels: fully parallel access
- separate data, control, and address buses
- Goal: keep data bus fully utilized
- DRAM organization:
- Independently accessed through dedicated data, address, and command buses
- Physically broken down into DIMM(dual in-line memory modules)
- logically divided into ranks, a collection of DRAM chips responding to the same memory request
- Memory controller: connects CPU and DRAM
- receives requests after LLC misses
- DRAM refresh management, command scheduling, Row-buffer management policies, address mapping schemes
- Refresh management: recently accessed rows need not be refreshed
- Command scheduling:
- write buffering: writes can wait until reads are done
- Controller queues DRAM commands: per-bank queues, reordering ops meant for same bank
- Policies:
- First-come-first-served: oldest request first
- First-Ready first-come-first-served
- prioritize column accesses (row hits) over row changes
- skip over older conflicting requests
- find row hits(on queued request)
- Row-buffer management policies
- open-page policy: good for lots of locality
- close-page policy: good for random access
DRAM Control
- Tasks:
- Refresh management
- Address mapping
- Request scheduling
- Power management
- Error detection/correction
- Address mapping
- Memory request: Type:Address:Data
- Row:Channel:Rank:Bank:Column
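A hedged sketch of splitting a physical address into the Row:Channel:Rank:Bank:Column fields above; the field widths and the 6-bit block offset are made-up, and real controllers often permute or hash these bits.

```c
#include <stdio.h>
#include <stdint.h>

/* Assumed field widths (LSB to MSB): 6-bit block offset, 7-bit column,
 * 2-bit bank, 1-bit rank, 1-bit channel, remaining bits = row.          */
int main(void) {
    uint64_t paddr = 0x3FAB1240;                  /* example physical address    */
    uint64_t a = paddr >> 6;                      /* drop the cache-block offset */
    unsigned column  = a & 0x7F;  a >>= 7;
    unsigned bank    = a & 0x3;   a >>= 2;
    unsigned rank    = a & 0x1;   a >>= 1;
    unsigned channel = a & 0x1;   a >>= 1;
    unsigned row     = (unsigned)a;
    printf("row=%u channel=%u rank=%u bank=%u column=%u\n",
           row, channel, rank, bank, column);
    return 0;
}
```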
- Command scheduling
- Write buffering: wait till reads are done
- Controller queues DRAM commands: per-bank, reordering ops
- Policies: FCFS FR-FCFS(First-Ready)
- Prioritize column accesses over row changes
- Skip over older conflicting requests
- Find row hits: if no conflicts with in-progress request
- Error detection/correction
- Reason: leakage, alpha particles, hard error
- How: parity bits, ECC
- Single-error correction, Double-Error Detection(SECDED)
- Programmable memory controller
- Multiple performance objectives
- app-specific optimizations
- patches and in-field updates
- Judicious division of labor between specialized hardware and firmware
- Request and transaction processing in firmware
- configurable timing validation in hardware
- RISC ISA: metadata(app hints, control flow), address(control flow, address mapping)
- Queue management with instruction flags: R/T for request and transaction
- Two five-stage pipelines and one configurable timing validation circuit(request processor, transaction processor)
Data level parallelism
- IPC hardly achieves more than 2
- DLP: data level parallelism
- Vector processors(Cray machines), SIMD(Intel MMX), and GPUs(NVIDIA)
- TLP: thread level parallelism, multiprocessors and hardware multithreading
- RLP: request level parallelism: data centers
Vector processors
- Improves throughput rather than latency
- Vector register file
- 2D register array
- number of elements per register determines the max vector length
- Vector functional units
- single opcode for multiple units
- Integer/floating point/load/store
- Single instruction defines multiple operations
- lower instruction fetch/decode/issue cost
- Operations executed in parallel
- no dependences within a vector instruction, so simpler hardware
- Predictable memory access pattern
- improve performance by prefetching
- simple memory scheduling policy
- multi banking may be used for improving bandwidth
- Fixed length - narrow is common as wide is not efficient
- Variable length - vector length register with MVL(maximum VL)
- Conditional execution
- by predication: mask (flag) vectors with single-bit elements; a vector compare produces the flags; the mask controls the next vector operation (see the sketch below)
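A scalar C loop illustrating the semantics of masked (predicated) vector execution: a vector compare produces the flag vector, and the masked operation updates only the enabled elements. The vector length and values are arbitrary.

```c
#include <stdio.h>

#define VL 8                           /* assumed vector length */

int main(void) {
    double a[VL] = {1, -2, 3, -4, 5, -6, 7, -8};
    double b[VL] = {0};
    int mask[VL];                      /* one-bit-per-element flag vector */

    for (int i = 0; i < VL; i++)       /* "vector compare" produces the mask */
        mask[i] = (a[i] > 0.0);

    for (int i = 0; i < VL; i++)       /* masked vector op: only enabled lanes change */
        if (mask[i])
            b[i] = a[i] * 2.0;         /* disabled lanes keep their old value */

    for (int i = 0; i < VL; i++)
        printf("%g ", b[i]);
    printf("\n");
    return 0;
}
```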
Graphics processing unit
- Flynn’s taxonomy: Data vs instruction streams
|                        | Single instruction stream | Multiple instruction streams |
|------------------------|---------------------------|------------------------------|
| Single data stream     | SISD (uniprocessors)      | MISD (systolic arrays)       |
| Multiple data streams  | SIMD (vector processors)  | MIMD (multiprocessors)       |
- GPU: initially developed as graphics accelerator
- receives geometry information from the CPU and provides a picture
- Host interface -> Vertex processing -> Triangle setup -> Pixel processing -> Memory interface
- Host interface: communication bridge
- receives commands from the CPU and pulls geometry information from system memory
- outputs a stream of vertices in object space with all their associated information
- Vertex processing: object space to screen space
- Pixel processing: rasterize triangle to pixels, texture mapping and math operations
- Programming GPUs: shader programs are executed for every vertex or every fragment; fully customizable geometry and shading effects
- Memory interface
- framebuffer(fragment color from previous stage)
- before the final write, some fragments are rejected by z-buffer, stencil, and alpha tests
- z and color are compressed to reduce framebuffer bandwidth (not size)
- General-purpose GPUs(GPGPUs)
- heterogeneous system
- simple in-order pipelines that rely on thread-level parallelism to hide long latencies
- many registers(~1K) per pipeline to support many active warps
- Single instruction multiple threads SIMT
- thread blocks
- warps
- in-order pipelines
- CPU
- optimized for low-latency access to cached data sets
- control logic for out-of-order and speculative execution
- GPU
- optimized for data-parallel throughput computation
- architecture tolerant of memory latency
- more transistors dedicated to computation
- Compiling CUDA
- call nvcc
- CPU code
- Parallel Thread eXecution (PTX)
- virtual machine and ISA
- PTX to device-specific binary object
- Memory hierarchy
- Throughput-oriented main memory
- GDDR: wide channel(256-bit), lower clock rate than DDR(double data rate-transfer data on rise and fall)
- 1.5MB shared L2
- 48KB read-only data cache
- compiler controlled
- Wide buses
Thread level parallelism
- Thread: single sequential flow of control within a program(instructions + state)
- Register state: Thread context
- Single threaded vs multithreaded
- Multitasking: the OS loads the context of a new thread while the old thread’s context is saved back to memory
- explicitly multi-threaded programs
- parallel languages and libraries
- Architectures for exploiting thread-level parallelism
- hardware multithreading: multiple threads run on the same processor pipeline
- multithreading levels
- coarse grained CGMT
- fine grained FGMT
- simultaneous SMT
- multiprocessing: different threads run on different processors
- two types:
- symmetric multiprocessors(SMP): single cpu per chip
- Chip multiprocessors (CMP): multiple cpu per chip
- hardware multithreading: multiple threads run on the same processor pipeline
- CPU idle due to latencies
- utilize idle resources
- challenge: energy and performance costs of context switching
- CGMT:
- runs until a costly stall, e.g., an LLC miss
- another thread starts during first’s stall(pipeline fill time several cycles)
- at a given time only one thread is in the pipeline
- problem: needs hardware support (PC and registers for each thread); cannot hide short stalls
- vs superscalar
- FGMT:
- two or more threads interleave instructions: RR. Skip stalled threads
- hardware support: separate PC and register file, alternating pattern
- naturally hides delays but does not make full use of multi-issue architecture
- SMT:
- instruction from multiple threads issued on same cycle: register renaming and dynamic scheduling of multi-issue architecture
- more hardware support: per-thread RF and PC, temporary results, and instruction distribution
- Maximizes utilization of execution units
- SMP: multiple CPU share same memory
- OS view: identical CPUs, each with equal access to main memory
- each cpu has its own power distribution and cooling
- Chip multiprocessor: SMP on a single chip(with cores)
- shared higher-level (last-level) caches: lower latency and improved bandwidth
- not necessarily homogeneous cores
- CMP vs SMP
- lower costs single memory interface, single socket
- less off-chip communication: lower power and energy consumption, improved AMAT
- use additional transistors made available: more cores/more complicated pipelines
- Efficiency: ideally n cores -> nx performance
- if goal: provide the same performance as uniprocessor, frequency can be 1/n
Parallel memory architecture
- Amdahl’s law for theoretical speedup
- Shared memory vs message passing
- multiple threads shared mem vs explicit communication through interconnection
- easy to program vs simple hardware
- UMA, NUMA
- shared network, p2p network
- low latency low bandwidth simple control
- high latency high bandwidth, complex control
- shared memory challenges
- memory consistency: program order
- cache coherence: cache should be transparent to programmer
- cache coherence
- multiple copies become inconsistent when one core writes
- simple solution is to propagate writes from one core to others
- key op: update/invalidate sent to all or a subset of the cores
- software based management: flush dirty blocks to memory, invalidate: make all the blocks invalid
- hardware based management: update or invalidate other copies on every write or send the data to everyone
- invalidation based protocol is better
- Snoopy protocol
- relying on a broadcast infrastructure among caches
- every cache monitors(snoop) the traffic on the shared media to keep the states of the cache block up to date
- Simple version:
- write-through and write no-allocate with multiple readers allowed and writes invalidate replicas
- Simple state machine for each cache unit: one-bit valid flag FSM
- processor actions: load, store, evict (V->I)
- bus traffic: BusRd, BusWr (note that the data is also on this bus)
- nomenclature: action taken/action on bus
Shared memory system
- Inconsistent vs consistent data
- Snooping with writeback policy:
- writes are not propagated to memory until eviction (cached data may differ from main memory)
- solution: owner of the most recently updated replica
- only one owner, only owner can update, reader can share the data but have to gain ownership to write
- three states (MSI): invalid (no replica), shared (read-only), modified (writable and the only valid copy) - see the sketch after this list
- processor actions: load,store,evict
- bus messages: Busrd, BusRdX(read exclusive), BusInv, BusWB, BusReply
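A hedged sketch of the MSI transitions for a single cache block, split into local processor actions and snooped bus messages; data movement, the bus itself, and the write-back path are omitted.

```c
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } msi_state;

/* Transition on a local processor access (the cache issues BusRd on a read
 * miss, or BusRdX/BusInv to gain ownership before writing).                 */
msi_state on_processor(msi_state s, int is_write) {
    if (is_write) return MODIFIED;        /* I->M via BusRdX, S->M via BusInv */
    return (s == INVALID) ? SHARED : s;   /* read miss fetches a shared copy  */
}

/* Transition when another core's request is snooped on the bus. */
msi_state on_snoop(msi_state s, int other_is_write) {
    if (other_is_write) return INVALID;   /* another core wants ownership          */
    if (s == MODIFIED)  return SHARED;    /* supply data (BusReply/WB) and downgrade */
    return s;
}

int main(void) {
    msi_state s = INVALID;
    s = on_processor(s, 0);  printf("after local read:   %d (S)\n", (int)s);
    s = on_processor(s, 1);  printf("after local write:  %d (M)\n", (int)s);
    s = on_snoop(s, 0);      printf("after remote read:  %d (S)\n", (int)s);
    s = on_snoop(s, 1);      printf("after remote write: %d (I)\n", (int)s);
    return 0;
}
```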
- MESI: illinois protocol
- enter exclusive first then shared, so no invalidation traffic on write-hits in the E state
- lower overheads in sequential applications
- more complex protocol, longer memory latency due to the protocol
- Alternatives to snoopy protocols
- not scalable: shared bus bandwidth is limited, every node broadcasts message and monitors the bus
- limit the traffic using directory structure: keep track of sharers of each block
- memory operation reordering: the memory consistency model specifies constraints on the order in which memory operations must appear to be performed
- Dekker’s algorithm: mutual exclusion for a critical region; at any time only one process is allowed in (flag-based sketch for one thread, the other thread swaps A and B):

    volatile int A = 0, B = 0;   /* shared intent flags, initially 0 */
    lock: A = 1;                 /* announce intent to enter */
    if (B != 0) {                /* the other thread also wants in? */
        A = 0;                   /* back off and retry */
        goto lock;
    }
    /* critical region */
    A = 0;                       /* release */
- Sequential consistency
- within each thread, program order is preserved; each memory operation appears atomic
- instructions from different threads can be interleaved arbitrarily
- relaxed consistency model
- real processors: not all instructions need to be executed in program order
- fence instruction can be used to enforce ordering among memory instructions
- use a fence to protect lock_acquire (see the sketch below)
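A hedged sketch, using C11 atomics, of the fence placement for the flag-based lock above: the store of this thread’s flag must become visible before the load of the other thread’s flag, which a relaxed memory model does not guarantee without the fence.

```c
#include <stdatomic.h>

atomic_int A = 0, B = 0;        /* intent flags for thread 0 and thread 1 */

/* Thread 0's acquire (thread 1 is symmetric with A and B swapped). */
void lock_acquire_thread0(void) {
    for (;;) {
        atomic_store_explicit(&A, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* order the A=1 store before the B load */
        if (atomic_load_explicit(&B, memory_order_relaxed) == 0)
            return;                                  /* got the lock */
        atomic_store_explicit(&A, 0, memory_order_relaxed);  /* back off and retry */
    }
}

void lock_release_thread0(void) {
    atomic_store_explicit(&A, 0, memory_order_release);
}
```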