Architecture

Trend

  • Moore’s Law
  • Dark silicon

Metrics

  • How to measure performance:
    • Latency or response time
    • Bandwidth or throughput: improves by at least the square of the improvement in latency
    • Benchmark suite: a set of representative programs that are likely relevant to the user
    • Averaging:
      • Arithmetic mean: execution times
      • Harmonic mean: throughputs (rates)
      • Geometric mean: speedups (ratios)
  • Processor performance:
    • clock cycle time T
    • cycles per instruction (CPI; its reciprocal, IPC, is instructions per cycle), which depends on the instruction set in question
    • Instruction count(IC)
    • CPU time: \(\text{CPU time} = IC \times CPI \times T\)
  • comparison
    • Speedup/slowdown
    • percentage increase in performance
    • reduction in execution time
  • locality/parallelism
  • Amdahl’s Law: speeding up only a fraction \(f\) of execution by a factor \(s\) gives diminishing returns: \(Speedup = \frac{1}{(1-f)+f/s}\) (see the sketch after this list)
  • Power&Energy
    • power (\(P = VI\))
    • Energy (\(E = P \times T\), where \(P = P_{sta} + P_{dyn}\))
    • Static - constant power draw
      • \(P_{sta} = V \times I_{sta}\)
    • dynamic - consumed at every switching event due to charging and discharging load capacitances
      • \(P_{dyn} = A C V^2 f = A V \left(\frac{C V}{T}\right)\), where \(A\) is the activity factor, \(C\) the switched capacitance, and \(f = 1/T\) the frequency
    • CPU power is the rate of heat generated
    • excessive peak power may result in burning the chip
    • Opportunistic power gating
  • Cost of integrated Circuit
    • Cost of die \(\frac{\text{wafer cost}}{\text{dies per wafer} * \text{die yield}}\)
    • Yield of die \(\frac{\text{wafer yield}}{(1+\text{defect per unit area} * \text{die area})^N}\)
    • \(N\): process-complexity factor specified by the chip manufacturer
  • Dependability
    • System reliability and availability
    • Mean time to Failure MTTF
    • Mean time to Repair MTTR
    • System availability = \(\frac{MTTF}{MTTF+MTTR}\)
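
A minimal C sketch of the CPU-time and Amdahl’s-Law formulas above; the instruction count, CPI, clock period, and improved fraction are illustrative numbers, not measurements:
#include <stdio.h>

int main(void) {
    /* CPU time = IC * CPI * T (illustrative numbers) */
    double IC  = 1e9;      /* instruction count               */
    double CPI = 1.5;      /* cycles per instruction          */
    double T   = 0.5e-9;   /* clock period: 0.5 ns (2 GHz)    */
    printf("CPU time = %.3f s\n", IC * CPI * T);

    /* Amdahl's Law: Speedup = 1 / ((1 - f) + f / s) */
    double f = 0.8;        /* fraction of execution improved  */
    double s = 4.0;        /* speedup of that fraction        */
    printf("Speedup = %.2fx (upper bound %.2fx as s grows)\n",
           1.0 / ((1.0 - f) + f / s), 1.0 / (1.0 - f));
    return 0;
}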

Instruction set architecture

  • interfacing contract between hardware and software
  • Defined
    • functional operation of units
    • How to use each functional unit
  • Does not define
    • how functional unit implemented
    • Execution time of operation
    • energy consumption
  • Programmer’s perspective
    • internal machine states: program counter - next instruction to be executed
    • operations
    • addressing modes
  • Operand location

  • CISC
    • may reduce IC, increase CPI, and increase CT
    • CPU time may increase
    • complex operations
    • variable length
    • limited registers
  • RISC
    • simple operations
    • fixed length
    • limited instruction formats
  • Memory addressing
    • Register
    • Immediate
    • Displacement
    • Register Indirect
    • indexed
    • direct
    • memory indirect @
    • auto decrement/increment
    • scaled
  • Instruction format
    • MIPS: I-type (load, store, conditional branch), R-type (ALU), J-type (jump)
    • Fields: Opcode : RS, RT : RD, ShAmt, Funct
  • Processing
    • Instruction fetch
    • Instruction decode
    • Register read
    • Execute instructions
    • Memory access
    • Register write back
  • Example: Single-cycle RISC architecture
  • Pipelining
    • Reusing idle resources: each processing step finishes in a fraction of a cycle
    • Improving throughput at the expense of latency
    • Delay (ideal case): \(D = T + n\delta\) for \(n\) instructions, where \(T\) is the latency of one instruction through all stages and \(\delta\) is the per-stage delay
    • Throughput: \(IPS = n/D\), approaching \(1/\delta\) for large \(n\) (see the sketch at the end of this list)
  • Example: 5-stage MIPS pipeline
    • Instruction Fetch: read instruction from memory (I-Cache)
      • Program counter (PC)
      • Compute NPC
      • Update pipeline register
    • Instruction Decode
      • generate control signal
      • read source operands from register file
      • update pipeline register
        • send operand and immediate value to next stage
        • pass control signal and NPC to next stage
    • Execution Stage
      • Perform ALU operation
        • result of ALU
        • Compute branch target(assuming branch taken): Target = NPC + immediate
      • update pipeline registers
        • Control signals, branch target, ALU results, and destination
    • Memory Access
      • Redirect the PC to the branch target if the branch is taken
      • Load/Store
    • Register Write Back
  • Pipeline Hazards
    • Structural hazards - multiple instructions compete for the same resources
      • unified memory for Ins/Dat
      • register file with shared read/write access port - subcycles
    • Data hazards - a dependent instruction cannot proceed because it needs a value
      • wait or arrange other Ins in between
      • forwarding - bypass the register file and route produced values directly to dependent instructions
    • Control hazards - the next instruction cannot be fetched because the outcome of an earlier branch is unknown
      • wait or arrange other Ins in between
      • predict the branch outcome: assume the branch is taken or not taken, may need to cancel wrong path
      • jump and function call can be resolved in the decode stage [1]
  • Multicycle instructions
    • FP operation typically need more time
    • Not all ALU operation complete in one cycle
    • Structural hazards: potential multiple RF writes
    • Data hazards:
      • more read-after-write hazards
      • potential write-after-read hazards
      • potential write-after-write hazards
    • Imprecise exception
      • instructions do not necessarily complete in program order
      • state of the processor must be kept updated with respect to the program order
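
A minimal C sketch of the ideal pipeline delay and throughput formulas above, assuming a 5-stage pipeline with a 1 ns per-stage delay (both numbers are illustrative):
#include <stdio.h>

int main(void) {
    /* Ideal k-stage pipeline: the first instruction takes T = k * delta,
       each later one completes delta after the previous, so for n
       instructions D ~= T + n*delta and throughput IPS = n / D.        */
    int    k     = 5;      /* pipeline stages (illustrative) */
    double delta = 1e-9;   /* per-stage delay: 1 ns          */
    double T     = k * delta;
    for (long n = 1; n <= 1000000; n *= 100) {
        double D = T + n * delta;
        printf("n = %7ld  D = %.3e s  IPS = %.3e\n", n, D, n / D);
    }
    return 0;
}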

ILP: Instruction level parallelism

  • Stalls slow down the pipeline; even when fully stalled, the CPU still pays the latch-latency penalty
  • ILP [2]
    • increase overlap among instructions in the pipeline
    • decrease stalls
  • Potential overlap among instructions
    • property of the program dataflow
    • influenced by compiler
    • ILP: \(\frac{NumInstructions}{NumCycles}\)
    • IPC is the exploited ILP
  • Exploitation method:
    • Dynamic scheduling in hardware
    • static scheduling in software(compiler)

Big picture

  • Goal: improving performance \((IPC\times F)/IC\)
    • software: ILP and IC
    • hardware: IPC
    • Increase IPC: improve ILP and exploit more ILP
    • Increase F: deeper pipeline and faster technology
  • Key performance limitation
    • number of instructions fetched per second is limited
    • deeper fetch and wider fetch

Compiler-based techniques

  • Loop book-keeping overhead
  • Diverse impact of stall cycles on performance
  • branch delay slots: an instruction of the loop moved to after the branch and executed to completion

Loop optimization

  • Re-ordering and changing immediate values
  • Reducing loop overhead by unrolling (see the sketch below)
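
A sketch of unrolling by a factor of four; the array, its size, and the assumption that the trip count is a multiple of four are all made up for the example:
/* Illustrative only: summing an array with and without 4x unrolling. */
#include <stdio.h>

#define N 1024
static double a[N];

double sum_rolled(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)       /* one branch + index update per element */
        s += a[i];
    return s;
}

double sum_unrolled(void) {
    double s = 0.0;
    for (int i = 0; i < N; i += 4) {  /* one branch + index update per 4 elements */
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    return s;                          /* assumes N is a multiple of 4 */
}

int main(void) {
    for (int i = 0; i < N; i++) a[i] = i;
    printf("%f %f\n", sum_rolled(), sum_unrolled());
    return 0;
}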

  • Scalar pipelines
    • Upper bound on throughput \(IPC<=1\)
    • Unified pipeline for all functional units: underutilized resources
    • Inefficient freeze policy: stall cycle delays all the following cycles
    • Pipeline hazards: stall cycles result in limited throughput
  • Superscalar pipelines
    • Separate integer and floating point pipelines \(IPC<=2\)
      • instruction packet using VLIW (very long instruction word)
      • inst. packet has one fp. and one int. slots
  • Software pipelining: mix-and-match instructions from different iterations

Control Flow

  • Impact of branches
  • Program’s basic blocks:
    • only one entry and one exit point per basic block
  • branches
    • conditional vs. unconditional: check conditions, jump call return
    • target address: absolute, relative
  • penalty due to unknown outcome (direction and target): addressed by branch prediction
    • profiling the program
    • predict based on common cases
  • Goal of branch prediction
    • avoid stall cycles in fetch stage
    • static prediction
    • dynamic prediction
    • prediction accuracy: \(correct/total\)
  • Compute slowdown for misprediction
  • Dynamic branch prediction:
    • prediction logic: direction and target address
    • outcome validation and training: outcome is computed regardless of prediction
    • recovery from misprediction: nullify the effect of instructions on the wrong path
    • One-bit branch predictor: keep track of and use the outcome of the last executed branch
      • accuracy limits: a single predictor may be shared by multiple branches; two mispredictions per loop (one entering, one exiting)

Branch Predictors

  • goal avoiding stall cycles caused by branches
  • solution: static or dynamic branch predictor
    • prediction
    • validation and training
    • recovery from misprediction
  • performance is influenced by the frequency of branches \(b\), prediction accuracy\(a\) and misprediction cost \(c\)
    • \(Speedup = \frac{1+bc}{1+(1-a)bc}\)
    • Note that in the baseline every branch is assumed to be a misprediction
  • Bimodal branch predictors
    • one-bit branch predictor
      • keep track of and use the outcome of the last branch
      • shared predictor
      • two mispredictions per loop
    • Two-bit branch predictor
      • increment if taken, decrement if not taken; predict taken in the upper half of the range (see the sketch after this list)
      • one misprediction on loop exit
  • using multiple counters
    • Decode history table(DHT): reduced HW with aliasing
      • least significant bits are used to select a counter
    • Branch history table(BHT): precisely tracking branches
      • Most significant bits are used as tags(tags + address)
    • Combined BHT and DHT
      • BHT is used on a hit
      • DHT is used/updated on a miss
  • Correlating branch predictor
    • Executed branches of a program stream may be correlated
    • Global History register: an r-bit shift register that maintains outcome history
    • GHR is merged with PC bits to choose a counter(XOR)
  • A local predictor may work well for some programs while a global predictor works better for others
    • could include both to be safe
  • Dedicated predictor per branch
    • Program counter is used for assigning predictors to branches
  • Capturing correlation among branches
    • Shift register is used to track history
  • Predicting branch direction is not enough
    • which instruction should be fetched if the branch is taken?
  • Storing the target instruction can eliminate fetching
    • Extra hardware is required
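
A sketch combining the two-bit saturating counters and GHR-XOR (gshare-style) indexing described above; the table size, history length, and initial counter value are illustrative choices:
#include <stdint.h>
#include <stdio.h>

#define ENTRIES 1024               /* 2^10 two-bit counters             */
static uint8_t  counters[ENTRIES]; /* 0..3; >= 2 means "predict taken"  */
static uint32_t ghr;               /* global history register           */

static uint32_t index_of(uint32_t pc) { return (pc ^ ghr) & (ENTRIES - 1); }

static int predict(uint32_t pc) { return counters[index_of(pc)] >= 2; }

static void train(uint32_t pc, int taken) {
    uint32_t i = index_of(pc);
    if (taken  && counters[i] < 3) counters[i]++;      /* saturating up   */
    if (!taken && counters[i] > 0) counters[i]--;      /* saturating down */
    ghr = ((ghr << 1) | (taken ? 1u : 0u)) & (ENTRIES - 1);
}

int main(void) {
    for (int i = 0; i < ENTRIES; i++) counters[i] = 2;  /* weakly taken */

    /* A loop branch taken 9 times then not taken, repeated 3 times:
       with history-based indexing only the first exit mispredicts.   */
    int mispredicts = 0;
    for (int rep = 0; rep < 3; rep++)
        for (int i = 0; i < 10; i++) {
            int taken = (i < 9);
            if (predict(0x400123) != taken) mispredicts++;
            train(0x400123, taken);
        }
    printf("mispredictions: %d\n", mispredicts);
    return 0;
}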

Branch target buffer

  • Store tags and target addresses for each branch
  • entries only for taken branches; send out the predicted PC to minimize stalls

Dynamic scheduling

  • Instruction scheduling can remove unnecessary stall cycles in the execution/memory stage
    • static scheduling: complex compiler, unable to resolve all data hazards(no access to runtime details)
    • dynamic scheduling: completely done in hardware
  • Creating an instruction schedule based on runtime information
    • hardware managed instruction reordering
    • instructions are executed in data flow order
    • ID is split into:
      • Issue: ensure no structural hazard
      • Read operands: ensure no data hazards
  • Register renaming:
    • Eliminating WAR and WAW hazards
    • change the mapping between architectural registers and physical storage locations
    • allocate a free physical location for each new destination register; find the most recently allocated location for each source register [3]
  • dynamic issue: issue instructions out of program order while maintaining data flow

Tomasulo algorithm:

  • dispatch instructions to functional units: use reservation stations(RS) - buffer the operands of instructions waiting to issue
  • execute an instruction as soon as all of its operands are ready: watch the common data bus(CDB)
  • remove false(anti- and output-) data dependence: rename destination register to RS name

  • Reservation station (RS) entry (see the struct sketch after this list):
    • busy - indicates that this reservation station and its accompanying functional unit are occupied
    • op
    • vj
    • vk
    • qj - tag of the reservation station that will produce the corresponding source operand; 0 means the value is available or not needed
    • qk
    • addr - immediate field or effective address for load or store
  • Qi field of the register file: the tag (number) of the reservation station whose result should be stored into this register
  • Reorder buffer (ROB) entry: additional buffering, used in the same way as the reservation stations, that supplies values until instructions commit
    • valid
    • result
    • exception
    • pc
  • three-step Tomasulo algorithm
    • Issue: take an instruction from the instruction queue
      • if there are free RS without structural hazards, rename and read/send operands or RS names
    • Execute: operate on operands when ready
      • if all operands are ready, execute; otherwise watch the CDB
    • Write result: update the register values
      • write the result through CDB to all waiting RS and register file, release RS entry
  • Summary:
    • Data hazards
      • RAW is handled by forwarding over CDB
      • WAR and WAW are removed by RS-based renaming
    • Structural hazards
      • Multiple FUs may be accessing CDB simultaneously: delay conflicting instructions at issue and RS
    • Precise exception handling
      • not possible because of out-of-order writeback to the register file: maintain the destination value in the ROB instead
  • Four-step Tomasulo algorithm
    • Issue(dispatch): if RS and ROB slots are free; read/rename operands
    • Execution: execute operation as soon as the operand values are ready
    • write result: send result to ROB and reservation stations via CDB
    • commit(retire): update register file for the head of ROB
      • write back head of the ROB
      • branch prediction correct: commit normally
      • branch prediction incorrect: flush the ROB and restart execution at the correct successor of the branch
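
A sketch of the bookkeeping structures listed above (RS entry, register-file Qi field, ROB entry) as C structs; field names follow the notes, while the types, widths, and array sizes are illustrative assumptions:
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     busy;    /* this RS and its functional unit are occupied       */
    uint8_t  op;      /* operation to perform on the source operands        */
    int64_t  vj, vk;  /* source operand values (valid once qj/qk are 0)     */
    uint8_t  qj, qk;  /* tags of the RSs producing vj/vk; 0 = ready/unused  */
    int64_t  addr;    /* immediate or effective address for loads/stores    */
} ReservationStation;

typedef struct {
    uint8_t qi;       /* tag of the RS whose result will update this
                         register; 0 = the register value is up to date     */
    int64_t value;
} RegisterEntry;

typedef struct {      /* reorder buffer entry (four-step Tomasulo)          */
    bool     valid;   /* result has been produced (write-result done)       */
    int64_t  result;
    bool     exception;
    uint64_t pc;      /* needed to restart precisely after an exception     */
} ROBEntry;

int main(void) {
    ReservationStation rs[6]    = {0};  /* e.g. 3 add + 3 mul stations      */
    RegisterEntry      regs[32] = {0};
    ROBEntry           rob[16]  = {0};
    (void)rs; (void)regs; (void)rob;
    return 0;
}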

Hardware speculation

  • out-of-order execute/complete
  • in-order fetch/decode and commit
  • distributed reservation stations: in-order(for each functional unit) issue/dispatch
  • out of order issue/dispatch to functional units
    • two-step wakeup-and-select logic
  • Register renaming (see the sketch after this list)
    • Register alias table (RAT) for fast lookup
    • Free register list for fast register renaming
    • Proceed only if there is free space in IQ, ROB and Free List
  • Branch prediction and recovery: squash all mispredicted entries - commit speculative instructions iff prediction was correct
  • Double-RAT architecture: use two RATs and only one enlarged register file
    • use the retirement RAT when committing entries from the ROB
    • free entries when there is no reference left(Issue Queue, IQ and RAT)
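
A minimal sketch of RAT plus free-list renaming as described above; register counts and the example instructions are made up:
/* Map architectural to physical registers, allocating a free physical
   register for each new destination. Sizes are illustrative.          */
#include <stdio.h>

#define ARCH_REGS 32
#define PHYS_REGS 64

static int rat[ARCH_REGS];                 /* arch reg -> most recent phys reg */
static int free_list[PHYS_REGS];
static int free_top;

static int alloc_phys(void) {              /* pop from the free list */
    return free_top > 0 ? free_list[--free_top] : -1;  /* -1: stall at issue */
}

/* Rename one instruction "rd = rs op rt": sources read the current
   mapping, the destination gets a fresh physical register.          */
static void rename(int rd, int rs, int rt) {
    int ps = rat[rs], pt = rat[rt];
    int pd = alloc_phys();
    if (pd < 0) { printf("stall: no free physical register\n"); return; }
    rat[rd] = pd;                           /* later readers of rd see pd */
    printf("p%d = p%d op p%d\n", pd, ps, pt);
}

int main(void) {
    for (int i = 0; i < ARCH_REGS; i++) rat[i] = i;          /* identity map */
    free_top = 0;
    for (int p = PHYS_REGS - 1; p >= ARCH_REGS; p--) free_list[free_top++] = p;

    rename(1, 2, 3);   /* r1 = r2 op r3                                  */
    rename(4, 1, 1);   /* r4 = r1 op r1 - reads the renamed r1           */
    rename(1, 5, 6);   /* WAW on r1 removed: gets another phys register  */
    return 0;
}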

Out-of-order load and store

  • IQ with MA units:
    • dedicated queue only for load/store instructions, the LSQ
    • two steps:
      • compute the effective address when register is available
      • send the request to memory if there are no memory hazards
  • load: only loads that are not following any stores with unknown addresses can be issued (RAW)
  • store: issue only when there are no unresolved instructions before it (WAW and WAR)
  • can we predict memory dependence?
    • issue/execute load instructions even if they are following unresolved stores

Memory hierarchy

Memory hierarchy design

  • basics
    • why
    • techs
    • locality
  • caches
    • block, index, tag, address, cache policies
    • Performance metric(AMAT) and optimization techniques
  • Virtual memory
    • paging, address translation, TLB
  • Memory hierarchy
    • Burks, Goldstine, and von Neumann: greater capacity less quickly accessible
  • The memory wall
    • the processor-memory performance gap has grown by over 50% per year
  • Memory technology
    • RAM technology
      • access time ~same for all locations(not so true anymore)
      • static RAM(SRAM)
        • typically used for caches
        • 6T/bit; fast, but low density, high power, expensive
        • cell structure: 6 transistors; maintains data while powered on
      • Dynamic RAM(DRAM)
        • typically used for main memory
        • 1T/bit; inexpensive, high density, low power - but slow
        • cell structure 1T-1C: needs refresh regularly to preserve data
  • Cache hierarchy:
    • software: scratchpad
    • hardware: caches
  • Principle of locality
    • memory references exhibit localized accesses
    • Spatial: probability of access to \(A+\delta\) at time \(t+\epsilon\) highest when \(\delta \rightarrow 0\)
    • temporal: probability of access to \(A\) at time \(t+\delta\) highest when \(\delta \rightarrow 0\)
  • Cache terminology
    • Block(cache line): unit of data access
    • Hit: accessed data found at current level, Miss
    • Hit rate, hit time: time to access data on a hit
    • Miss rate, miss penalty
  • Cache performance:
    • Average Memory Access Time(AMAT): \[AMAT=r_ht_h+r_m(t_h+t_p) = t_h + r_mt_p\]
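
A tiny worked AMAT example in C using the formula above; the hit time, miss rate, and miss penalty are illustrative numbers:
#include <stdio.h>

/* AMAT = t_h + r_m * t_p */
int main(void) {
    double t_h = 1.0;    /* hit time: 1 cycle        */
    double r_m = 0.05;   /* miss rate: 5%            */
    double t_p = 100.0;  /* miss penalty: 100 cycles */
    printf("AMAT = %.1f cycles\n", t_h + r_m * t_p);  /* 1 + 0.05*100 = 6 */
    return 0;
}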

Cache Architecture

Addressing and lookup

  • Addressing:
    • Main memory address
    • Simplest: direct-mapped cache (indexed with the middle bits)
    • address split into tag, index, and byte offset; a valid flag (V) indicates whether the contents are meaningful (see the sketch after this list)
    • data and tag are always accessed
  • Cache size only includes the data array
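
A sketch of splitting an address into tag, index, and byte offset for a direct-mapped cache; the 64-byte block and 256-set geometry is an illustrative assumption:
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 6     /* 64-byte blocks      */
#define INDEX_BITS  8     /* 256 sets (mid bits) */

int main(void) {
    uint32_t addr   = 0x12345678;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}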

Cache optimization

  • How to improve cache performance
    • Reduce hit time: memory tech, critical access path
    • improve hit rate: size, associativity, placement/replacement policies
    • Reduce miss penalty: multi level caches, data prefetching
  • Set associative caches
    • Improve cache hit rate by allowing a memory location to be placed in more than one cache block
    • N-way set associative cache, fully associative
    • for a fixed capacity, higher associativity typically leads to higher hit rates
  • n-Way
    • index into cache sets
    • multiple tag comparisons
    • multiple data reads
    • Special: direct mapped, fully associative
  • Cache miss classifications
    • cold (compulsory): even an ideal cache will have them
      • larger blocks, prefetching
    • Capacity: due to cache’s size limit
      • larger cache
    • Conflict: due to less associativity
      • larger cache
      • more associativity

Cache optimization

  • cache replacement policies
    • which block to replace on a miss?
    • Ideal replacement (Belady’s algorithm)
      • replace the block accessed farthest in the future
    • Least recently used (LRU) (see the sketch after this list)
      • replace the block accessed farthest in the past
    • Most recently used (MRU)
      • replace the block accessed nearest in the past
    • Random replacement
  • Cache write policies
    • write vs. read
      • Data and tag are accessed for both read and write
      • only for writes does the data array need to be updated
      • miss: write no allocate, write allocate
      • hit: write back, write through
    • Write back:
      • on a write access write to cache only
      • write cache block to memory only when replaced from cache
      • dramatically decreases bus bandwidth usage
      • keep a bit(dirty bit) per cache block
    • Write through
      • write to both cache and memory(or next level) through use of a write buffer
      • improved miss penalty [4]
      • more reliable because two copies are maintained
      • will stall the processor to allow memory to catch up in case the write buffer fills up
    • Write allocate
      • allocate a cache line for the new data and replace old line
    • Write no allocate
      • do not allocate space in the cache for the data
      • only really makes sense in systems with write buffers
  • Reducing miss penalty
    • inevitable misses: serve as quickly as possible
    • Multilevel caches
    • read misses priority over writes
    • sub-block placement
    • critical word first
  • Victim cache
    • reduce conflict misses: larger capacity and more associativity
    • Associativity is expensive: complex hardware, longer hit time, more energy
    • Observation: conflict misses do not occur in all sets, increase associativity on the fly
    • Small fully associative cache, on eviction move the victim block here
  • Cache inclusion
    • how to reduce the number of accesses that miss in all cache levels?
    • should a block be allocated in all levels?
      • yes: inclusive, no: non-inclusive/exclusive
    • Modern processors:
      • L3: inclusive of L1 and L2
      • L2: non-inclusive of L1 (large victim cache)
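
A sketch of LRU victim selection for a single 4-way set, using full timestamps for clarity (real designs typically use pseudo-LRU bits instead):
#include <stdint.h>
#include <stdio.h>

#define WAYS 4

static uint64_t last_use[WAYS];
static uint64_t now;

static void touch(int way) { last_use[way] = ++now; }   /* on hit or fill */

static int lru_victim(void) {
    int victim = 0;
    for (int w = 1; w < WAYS; w++)                       /* oldest timestamp */
        if (last_use[w] < last_use[victim]) victim = w;
    return victim;
}

int main(void) {
    touch(0); touch(1); touch(2); touch(3);
    touch(1);                               /* way 1 becomes most recent   */
    printf("evict way %d\n", lru_victim()); /* way 0: least recently used  */
    return 0;
}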

Virtual Memory

  • Memory hierarchy: lower levels provide greater capacity but longer time
    • will a program fit in main memory/what if running multiple programs
  • virtual memory: use main memory as a “cache” for secondary memory
    • allows efficient and safe sharing of the physical main memory among multiple programs
  • virtual memory systems
    • provides illusion of very large memory
      • address space of each program larger than the physical main memory
    • Memory management unit (MMU)
      • between main and secondary mem.
      • address translation
        • virtual address space to physical address space
    • every virtual address is translated to a physical address with the help of hardware
      • where, what to do on table miss(page fault), cost
  • Address translation cost
    • page walk: look up the physical address in the page table
  • multi-level page table
    • the virtual (logical) address space is broken down into multiple pages
  • Translation lookaside buffer
    • exploit locality to reduce address translation time: keep recent translations in a buffer (see the sketch after this list)
    • fully associative/set associative/direct
    • typically faster than cache access: TLBs are smaller(typically not more than 128 to 256 entries)
  • Content addressable memory(CAM) based TLB
    • CAM: data in, address out; the matching entry selects a RAM entry holding the physical page number, dirty bits, and status
  • physically indexed caches:
    • increased critical path from sequential TLB and cache access
  • virtually indexed caches
    • parallel lookup and compare between tags from TLB and Tag Array
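
A sketch of a small fully associative TLB in front of a single-level page table; page size, TLB size, the round-robin refill, and the page-table contents are illustrative assumptions:
#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS   12                 /* 4 KB pages */
#define TLB_ENTRIES 8
#define NUM_PAGES   256

typedef struct { uint32_t vpn, ppn; int valid; } TLBEntry;

static TLBEntry tlb[TLB_ENTRIES];
static uint32_t page_table[NUM_PAGES]; /* vpn -> ppn (single level here)  */
static int      next_victim;           /* simple round-robin replacement  */

static uint32_t translate(uint32_t vaddr) {
    uint32_t vpn    = vaddr >> PAGE_BITS;
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);
    for (int i = 0; i < TLB_ENTRIES; i++)            /* TLB lookup          */
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].ppn << PAGE_BITS) | offset;

    uint32_t ppn = page_table[vpn];                  /* TLB miss: page walk */
    tlb[next_victim] = (TLBEntry){ vpn, ppn, 1 };    /* refill the TLB      */
    next_victim = (next_victim + 1) % TLB_ENTRIES;
    return (ppn << PAGE_BITS) | offset;
}

int main(void) {
    for (uint32_t v = 0; v < NUM_PAGES; v++) page_table[v] = NUM_PAGES - 1 - v;
    printf("0x%x -> 0x%x\n", 0x1234u, translate(0x1234u));  /* miss, then fill */
    printf("0x%x -> 0x%x\n", 0x1abcu, translate(0x1abcu));  /* hit             */
    return 0;
}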

Main memory systems

  • SRAM for caches
    • fast and leaky, 6 transistors per bit, CMOS process
    • static volatile, retain data when powered on
  • DRAM for main memory
    • slow and dense, 1 transistor per bit, DRAM process
    • dynamic volatile, periodic refreshing is required to retain data
  • DRAM is organized as rows × columns
  • DRAM load is destructive: discharging capacitors
  • Row access strobe(RAS) and column access strobe(CAS)
  • all reads and writes are performed through DRAM row buffer(RB)
      • Row buffer holds a single row of the array: typically 8 KB
    • entire row moved but only a block is accessed
    • row buffer access possibilities
        • row buffer hit: no need for precharge and activate (20)
        • row buffer miss: an activate is needed: empty row (40) / row conflict (6)
  • DRAM refresh
    • current leakage: both ways
    • DRAM requires the cells’ contents to be read and written periodically
      • Burst refresh: all at once
      • distributed refresh: a group at a time

DRAM controller

  • DRAM operations:
    • Precharge bitlines to prepare subarray for activating a wordline
    • Activate a row by connecting DRAM cells to the bitline and start sensing
    • Read the contents of a data block from the row buffer
    • Write new contents for data block into the row buffer
    • Refresh DRAM cells: can be done through a precharge followed by an activate
  • DRAM control
    • external controller, may be integrated on CPU(modern)
  • Basic timings:
    • \(t_{CAS/CL}\): column access strobe (RD->DATA)
    • \(t_{RAS}\): row active strobe (ACT->PRE)
    • \(t_{RP}\): row precharge (PRE->ACT)
    • \(t_{RC}\): row cycle (ACT->PRE->ACT)
    • \(t_{RCD}\): row to column delay(ACT->RD/WT)
  • Access time (see the sketch at the end of this list):
    • Row hit: \(t_{CAS}\)
    • Row empty: \(t_{RCD} + t_{CAS}\)
    • Row conflict: \(t_{RP} + t_{RCD} + t_{CAS}\)
  • Memory channels: fully parallel access
    • separate data, control, and address buses
    • Goal: keep data bus fully utilized
  • DRAM organization:
    • Independently accessed through dedicated data, address, and command buses
    • Physically broken down into DIMM(dual in-line memory modules)
    • logically divided into ranks, a collection of DRAM chips responding to the same memory request
  • Memory controller: connects CPU and DRAM
    • receives requests after LLC misses
    • DRAM refresh management, command scheduling, Row-buffer management policies, address mapping schemes
  • Refresh management: recently accessed rows need not be refreshed
  • Command scheduling:
    • write buffering: writes can wait until reads are done
    • Controller queues DRAM commands: per-bank queues, reordering ops meant for same bank
    • Policies:
      • First-come-first-served: oldest request first
      • First-Ready first-come-first-served
        • prioritize column accesses over row changes
        • skip over older conflicting requests
        • find row hits (among queued requests)
  • Row-buffer management policies
    • open-page policy: good for lots of locality
    • close-page policy: good for random access
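
A sketch computing access latency for the three row-buffer cases from the timing parameters above; the nanosecond values are illustrative, not from any datasheet:
#include <stdio.h>

enum row_state { ROW_HIT, ROW_EMPTY, ROW_CONFLICT };

int main(void) {
    double tCAS = 15.0, tRCD = 15.0, tRP = 15.0;  /* ns, illustrative        */
    double lat[3];
    lat[ROW_HIT]      = tCAS;                 /* column access only          */
    lat[ROW_EMPTY]    = tRCD + tCAS;          /* activate, then column       */
    lat[ROW_CONFLICT] = tRP + tRCD + tCAS;    /* precharge, activate, column */
    printf("hit %.0f ns, empty %.0f ns, conflict %.0f ns\n",
           lat[ROW_HIT], lat[ROW_EMPTY], lat[ROW_CONFLICT]);
    return 0;
}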

DRAM Control

  • Tasks:
    • Refresh management
    • Address mapping
    • Request scheduling
    • Power management
    • Error detection/correction
  • Address mapping
    • Memory request: Type:Address:Data
    • Row:Channel:Rank:Bank:Column (see the sketch after this list)
  • Command scheduling
    • Write buffering: wait till reads are done
    • Controller queues DRAM commands: per-bank, reordering ops
    • Policies: FCFS FR-FCFS(First-Ready)
      • Prioritize column accesses over row changes
      • Skip over older conflicting requests
      • Find row hits: if no conflicts with in-progress request
  • Error detection/correction
    • Reasons: leakage, alpha particles, hard errors
    • How: parity bits, ECC
    • Single-error correction, Double-Error Detection(SECDED)
  • Programmable memory controller
    • Multiple performance objectives
    • app-specific optimizations
    • patches and in-field updates
  • Judicious division of labor between specialized hardware and firmware
    • Request and transaction processing in firmware
    • configurable timing validation in hardware
  • RISC ISA: metadata(app hints, control flow), address(control flow, address mapping)
    • Queue management with instruction flags: R/T for request and transaction
    • Two five-stage pipelines and one configurable timing validation circuit(request processor, transaction processor)
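
A sketch of decoding a physical address into the Row:Channel:Rank:Bank:Column fields mentioned above; all field widths are illustrative, and real controllers often interleave or XOR bits to spread traffic across channels and banks:
#include <stdint.h>
#include <stdio.h>

#define COL_BITS  10
#define BANK_BITS 3
#define RANK_BITS 1
#define CH_BITS   2

int main(void) {
    uint64_t addr = 0x3FACE5678ull;
    uint64_t col  = addr & ((1u << COL_BITS) - 1);
    uint64_t bank = (addr >> COL_BITS) & ((1u << BANK_BITS) - 1);
    uint64_t rank = (addr >> (COL_BITS + BANK_BITS)) & ((1u << RANK_BITS) - 1);
    uint64_t ch   = (addr >> (COL_BITS + BANK_BITS + RANK_BITS)) & ((1u << CH_BITS) - 1);
    uint64_t row  = addr >> (COL_BITS + BANK_BITS + RANK_BITS + CH_BITS);
    printf("row=%llu ch=%llu rank=%llu bank=%llu col=%llu\n",
           (unsigned long long)row, (unsigned long long)ch,
           (unsigned long long)rank, (unsigned long long)bank,
           (unsigned long long)col);
    return 0;
}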

Data level parallelism

  • IPC hardly achieves more than 2
  • DLP: data level parallelism
    • Vector processors(Cray machines), SIMD(Intel MMX), and GPUs(NVIDIA)
  • TLP: thread level parallelism, multiprocessors and hardware multithreading
  • RLP: request level parallelism: data centers

Vector processors

  • Improves throughput rather than latency
  • Vector register file
    • 2D register array
    • number of elements per register determines the max vector length
  • Vector functional units
    • single opcode for multiple units
    • Integer/floating point/load/store
  • Single instruction defines multiple operations
    • lower instruction fetch/decode/issue cost
  • Operations executed in parallel
    • no dependences within a vector instruction: simple hardware
  • Predictable memory access pattern
    • improve performance by prefetching
    • simple memory scheduling policy
    • multi banking may be used for improving bandwidth
  • Fixed length - narrow is common as wide is not efficient
  • Variable length - vector length register with MVL(maximum VL)
  • Conditional execution
    • by predication: mask (flag) vectors with single-bit elements; a vector compare produces the flags, and the mask controls the next vector operation
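
A scalar C sketch of masked (predicated) vector execution: a compare produces a flag vector, and the mask controls which lanes the next operation updates; the vector length and data are illustrative:
#include <stdio.h>

#define VL 8

int main(void) {
    double a[VL] = {1, -2, 3, -4, 5, -6, 7, -8};
    double b[VL];
    int    mask[VL];

    for (int i = 0; i < VL; i++) b[i] = a[i];           /* unmasked copy            */
    for (int i = 0; i < VL; i++) mask[i] = (a[i] < 0);  /* vector compare -> flags  */
    for (int i = 0; i < VL; i++)
        if (mask[i]) b[i] = -a[i];                      /* masked vector operation  */

    for (int i = 0; i < VL; i++) printf("%g ", b[i]);
    printf("\n");
    return 0;
}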

Graphics processing unit

  • Flynn’s taxonomy: data vs. instruction streams
    • SISD (single instruction, single data): uniprocessors
    • SIMD (single instruction, multiple data): vector processors
    • MISD (multiple instruction, single data): systolic arrays
    • MIMD (multiple instruction, multiple data): multiprocessors

  • GPU: initially developed as graphics accelerator
    • receives geometry information from the CPU and provides a picture
    • Host interface -> Vertex processing -> Triangle setup -> Pixel processing -> Memory interface
  • Host interface: communication bridge
    • receives commands from the CPU and pulls geometry info from system memory
    • outputs a stream of vertices in object space with all their associated information
  • Vertex processing: object space to screen space
  • Pixel processing: rasterize triangle to pixels, texture mapping and math operations
  • Programming GPUs: programs are executed for every vertex or every fragment; fully customizable geometry and shading effects
  • Memory interface
    • framebuffer(fragment color from previous stage)
    • before the final writes, some fragments are rejected by z-buffer, stencil, and alpha tests
    • z and color are compressed to reduce framebuffer bandwidth(not size)
  • General-purpose GPUs(GPGPUs)
    • heterogeneous system
    • simple in-order pipelines that rely on thread-level parallelism to hide long latencies
    • many registers(~1K) per pipeline to support many active warps
  • Single instruction multiple threads SIMT
    • thread blocks
    • warps
    • in-order pipelines
  • CPU
    • optimized for low-latency access to cached data sets
    • control logic for out-of-order and speculative execution
  • GPU
    • optimized for data-parallel throughput computation
    • architecture tolerant of memory latency
    • more transistors dedicated to computation
  • Compiling CUDA
    • call nvcc
    • CPU code
    • Parallel Thread eXecution (PTX)
      • virtual machine and ISA
    • PTX to device-specific binary object
  • Memory hierarchy
    • Throughput-oriented main memory
    • GDDR: wide channel(256-bit), lower clock rate than DDR(double data rate-transfer data on rise and fall)
    • 1.5MB shared L2
    • 48KB read-only data cache
      • compiler controlled
    • Wide buses

Thread level parallelism

  • Thread: single sequential flow of control within a program(instructions + state)
    • Register state: Thread context
  • Single threaded vs multithreaded
  • Multitasking: the OS loads the context of a new thread while the old thread’s context is written back to memory
  • explicitly multi-threaded programs
  • parallel languages and libraries
  • Architectures for exploiting thread-level parallelism
    • hardware multithreading: multiple threads run on the same processor pipeline
      • multithreading levels
        • coarse grained CGMT
        • fine grained FGMT
        • simultaneous SMT
    • multiprocessing: different threads run on different processors
      • two types:
        • symmetric multiprocessors(SMP): single cpu per chip
        • Chip multiprocessors (CMP): multiple cpu per chip
  • CPU idle due to latencies
    • utilize idle resources
    • challenge: energy and performance costs of context switching
  • CGMT:
    • runs a thread until a costly stall, e.g. an LLC miss
    • another thread starts during the first thread’s stall (pipeline fill time of several cycles)
    • at a given time only one thread is in the pipeline
    • problems: hardware support needed (PC and registers for each thread); does not help with short stalls
    • vs superscalar
  • FGMT:
    • two or more threads interleave instructions round-robin, skipping stalled threads
      • hardware support: separate PC and register file, alternating pattern
      • naturally hides delays but does not make full use of multi-issue architecture
  • SMT:
    • instruction from multiple threads issued on same cycle: register renaming and dynamic scheduling of multi-issue architecture
    • more hardware support: per-thread register file and PC, temporary results, and distribution
    • Maximizes utilization of execution units
  • SMP: multiple CPU share same memory
    • OS view: identical CPUs with equal access to main memory
    • each cpu has its own power distribution and cooling
  • Chip multiprocessor: SMP on a single chip(with cores)
    • shared higher-level caches; lower latency and improved bandwidth at the last level
    • not necessarily homogeneous cores
  • CMP vs SMP
    • lower cost: single memory interface, single socket
    • less off-chip communication: lower power and energy consumption, improved AMAT
    • use additional transistors made available: more cores/more complicated pipelines
  • Efficiency: ideally n cores -> nx performance
    • if the goal is to provide the same performance as a uniprocessor, frequency can be 1/n

Parallel memory architecture

  • Amdahl’s law for theoretical speedup
  • Shared memory vs message passing
    • multiple threads shared mem vs explicit communication through interconnection
    • easy to program vs simple hardware
  • UMA, NUMA
  • shared network, p2p network
    • low latency low bandwidth simple control
    • high latency high bandwidth, complex control
  • shared memory challenges
    • memory consistency: program order
    • cache coherence: cache should be transparent to programmer
  • cache coherence
    • multiple copies become inconsistent when one is written
    • simple solution is to propagate writes from one core to others
    • key op: update/invalidate sent to all or a subset of the cores
      • software based management: flush dirty blocks to memory, invalidate: make all the blocks invalid
      • hardware based management: update or invalidate other copies on every write or send the data to everyone
    • invalidation based protocol is better
  • Snoopy protocol
    • relying on a broadcast infrastructure among caches
    • every cache monitors (snoops) the traffic on the shared medium to keep the states of its cache blocks up to date
    • Simple version:
      • write-through and write no-allocate with multiple readers allowed and writes invalidate replicas
      • Simple state machine for each cache unit: one-bit valid flag FSM
      • processor actions: load, store, evict (V->I) [5]
      • bus traffic: BusRd, BusWr (note that the data is also on this bus)
      • nomenclature: action taken/action on bus

Shared memory system

  • Inconsistent vs consistent data
  • Snooping with writeback policy:
    • writes are not propagated to memory until eviction (cached data may differ from main memory)
    • solution: owner of the most recently updated replica
      • only one owner; only the owner can update; readers can share the data but must gain ownership to write
      • three states MSI: invalid(no replica), shared(read-only), modified(writable and only valid copy)
      • processor actions: load,store,evict
      • bus messages: Busrd, BusRdX(read exclusive), BusInv, BusWB, BusReply
    • MESI: illinois protocol
      • enter exclusive first then shared, so no invalidation traffic on write-hits in the E state
      • lower overheads in sequential applications
      • more complex protocol, longer memory latency due to the protocol
  • Alternatives to snoopy protocols
    • not scalable: shared bus bandwidth is limited, every node broadcasts message and monitors the bus
    • limit the traffic using directory structure: keep track of sharers of each block
  • memory operation reordering: the memory consistency model specifies constraints on the order in which memory operations must appear to be performed
    • Dekker’s algorithm: mutex critical region, any time only one process allowed:
lock: A = 1;
if (B != 0) {
    A = 0;
    goto lock;
}
// region
A = 0;
  • Sequential consistency
    • within a thread, program order is preserved and each instruction is atomic
    • instructions from different threads can be interleaved arbitrarily
  • relaxed consistency model
    • real processors: not all instructions need to be executed in program order
    • a fence instruction can be used to enforce ordering among memory instructions
    • use a fence to protect lock_acquire (see the sketch after this list)
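
A sketch of the ordering that a Dekker-style lock_acquire needs, written with C11 atomics; under a relaxed model the fence keeps the store to A ordered before the load of B (the split into per-thread functions is only illustrative):
#include <stdatomic.h>

atomic_int A = 0, B = 0;   /* the two flags from the code above */

void lock_acquire_thread0(void) {
    for (;;) {
        atomic_store_explicit(&A, 1, memory_order_relaxed);
        /* Fence: make the store to A visible before the load of B;
           without it a relaxed model may reorder them and both
           threads could enter the critical region.                */
        atomic_thread_fence(memory_order_seq_cst);
        if (atomic_load_explicit(&B, memory_order_relaxed) == 0)
            return;                                /* lock acquired */
        atomic_store_explicit(&A, 0, memory_order_relaxed);
        /* retry */
    }
}

void lock_release_thread0(void) {
    atomic_store_explicit(&A, 0, memory_order_release);
}

int main(void) {
    lock_acquire_thread0();
    /* critical region */
    lock_release_thread0();
    return 0;
}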

  [1] ?

  [2] what is potential ILP

  [3] parallel with SSA?

  [4] ?

  [5] shouldn’t need this