Parallel programming
Introduction
CUDA benefits for HPC
- Exposes the architecture's execution model
- Must be able to achieve high performance
- Portable to different platforms (CUDA is not; OpenCL is)
- Easy migration path for existing applications
How to program CUDA (see the sketch after this list)
- Kernels
- thread program
- How to launch thread program
- Where to place data
- synchronization across threads
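A minimal end-to-end sketch of these pieces (a hypothetical vector-add kernel; all names are illustrative): the kernel is the thread program, the <<<blocks, threads>>> launch configures it, cudaMalloc/cudaMemcpy decide where data is placed, and cudaDeviceSynchronize makes the host wait for the GPU.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Thread program (kernel): each thread handles one element.
__global__ void vadd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Where to place data: copy host arrays into device (global) memory.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // How to launch the thread program: a grid of blocks, each block of threads.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vadd<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();                  // host waits for the kernel to finish

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```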
Programming model
Three types of functions
- __host__ (runs on the CPU, called from CPU code)
- __global__ (kernel: called from the CPU, runs on the GPU)
- __device__ (runs on the GPU, callable only from GPU code)
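As a quick sketch of the three qualifiers:

```cuda
__host__   void on_cpu() { }                    // runs on the CPU, called from CPU code
__global__ void kernel_entry() { }              // kernel: launched from the CPU, runs on the GPU
__device__ float on_gpu(float x) { return x; }  // runs on the GPU, callable only from GPU code
```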
Memory model
- global
- shared
Parallel model
- Two levels
- Block level : Streaming multiprocessors (SMs) that share global memory
- thread level : Streaming processors(cores)
Execution Model
- Parallel programming model
- software technology for expressing parallel algorithms
- defines semantics of software running on the hardware
- Parallel execution model
- abstraction of hardware execution in a class of architectures
- How to structure and name data and instructions
- utilization
- SIMD execution of warpsize=M threads
- Multithreaded execution
- Each block is mapped to a single streaming multiprocessor (SM)
SIMD
SIMT (single instruction, multiple threads) = SIMD + multithreading
- SIMD : a single thread of control
- MIMD : multiple threads of control
- SPMD : multiple threads of control, but each processor executes the same code
SIMD
- Motivation
- data-parallel computations map well to these architectures
- Conserve control units and simplify coordination
- No unified model for SIMD
Multithreading
- All threads execute the same kernel program
- the programmer declares blocks
- threads in the same block can be synchronized;
threads in different blocks cannot cooperate (except via global synchronization)
- units
- Load operands
- Compute
- Store result
- Thread-level parallelism
- single thread -> underutilization of the function units
- Multithreading
- Multiprocessing
- Multithreading
- quick context switching
- each thread has its own PC (program counter)
- shared function units and cache
Scheduling instructions for SIMD, multithreaded multiprocessors
- CUDA
- large register file
- each thread is assigned a register window (the entire thread block's registers must not exceed capacity)
- shared memory usage must not exceed capacity
- SM warp scheduling
- Zero-overhead warp scheduling; a warp is scheduled as a whole
- Warp scheduling
- fetch instruction
- operand scoreboarding (a warp is ready when all its operands are available)
- select the next warp to execute
- multiple blocks can be resident on an SM, as long as total limits are not exceeded
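A hedged sketch of checking how many blocks can be resident per SM for a given launch, using the runtime occupancy query (the kernel and the 256-thread block size are illustrative assumptions):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void kernel_entry(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    x[i] *= 2.0f;
}

int main() {
    int blocksPerSM = 0;
    // How many 256-thread blocks can be resident per SM, limited by
    // registers per thread, shared memory per block, and hardware limits.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel_entry,
                                                  /*blockSize=*/256,
                                                  /*dynamicSMemSize=*/0);
    printf("resident blocks per SM: %d\n", blocksPerSM);
    return 0;
}
```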
Synchronization
How to determine if it is correct
- comparing results
- debugger
Prevent race conditions
- at least one access is a write
- Race condition: result of an execution depends on the timing of two or more events
- Data dependence: the order of operations must be preserved, may refer to same location, and one is write
dependences matter for parallelization (safety) and locality optimization (speed)
Data dependence
- true dependence
- anti-dependence
- output dependence
- input dependence(locality)
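Tiny loop examples of each kind (host-style code; array and variable names are illustrative):

```cuda
// True (flow) dependence: iteration i writes a[i], iteration i+1 reads it.
for (int i = 1; i < n; i++) a[i] = a[i-1] + 1.0f;

// Anti-dependence: a[i+1] is read before a later iteration writes it.
for (int i = 0; i < n-1; i++) a[i] = a[i+1] * 2.0f;

// Output dependence: every iteration writes the same location s; order matters.
for (int i = 0; i < n; i++) { s = a[i]; b[i] = s; }

// Input dependence: two reads of the same location; no ordering constraint,
// but it matters for locality and reuse.
for (int i = 0; i < n; i++) c[i] = a[0] + a[i];
```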
A dependence from iteration i to i' exists if:
- at least one is write
- may refer to the same element
- i before i’
any reordering transformation that preserves every dependence in a program preserves the meaning of that program
To run a loop in parallel, no dependence may cross its iteration boundary; otherwise make the dependence internal to the thread program (or keep it in the host program), or add synchronization
- memory-based
- control-based
- mutex: not available in CUDA
- atomicity: involves the memory controller; subject to contention
- barrier : only block-level sync : __syncthreads();
- Memory fence
- __threadfence_block() waits until all global and shared memory accesses made by the calling thread before the call are visible to all threads in the block
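A short sketch combining atomics and the block barrier in one hypothetical per-block histogram kernel; __threadfence_block() would only be needed to order memory accesses between threads without a full execution barrier.

```cuda
__global__ void block_histogram(const int *keys, int n, int *global_hist) {
    __shared__ int local_hist[256];

    // Cooperative initialization of shared memory.
    for (int b = threadIdx.x; b < 256; b += blockDim.x) local_hist[b] = 0;
    __syncthreads();                       // barrier: all bins zeroed before counting

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&local_hist[keys[i] & 255], 1);   // atomic avoids a race on the bin
    __syncthreads();                       // barrier: counting done before merging

    // Merge the block's histogram into the global one.
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        atomicAdd(&global_hist[b], local_hist[b]);
}
```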
Locality and data placement
- GPU Memory
- device / global memory, ~100 cycles
- shared memory: on the SM, for threads in a block; <10 cycles, 48KB
- data cache: <10 cycles, 16KB; its size can be traded against shared memory
- constant memory: may be cached in the constant cache, 1~100 cycles
- register file
- texture/surface memory (marked, accessed through special paths)
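How the main spaces are named in CUDA C, as a sketch (the coefficient array and the kernel are illustrative; __constant__ data would be set from the host with cudaMemcpyToSymbol):

```cuda
__constant__ float coeff[16];          // constant memory, cached in the constant cache

__global__ void spaces_demo(const float *in, float *out, int n) {
    __shared__ float tile[256];        // shared memory: on-chip, visible to the whole block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = (i < n) ? in[i] : 0.0f;  // 'x' lives in a register; 'in' is global memory
    tile[threadIdx.x] = x;
    __syncthreads();
    if (i < n) out[i] = coeff[0] * tile[threadIdx.x];
}
```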
- Data reuse
- same or nearby data is accessed multiple times
- Data locality
- same or nearby data is accessed from fast memory
- Reorder computation
- reorder data layout
Spatial reuse
- C: row major order, every row is stored adjacently
- Fortran: col major
Tiling
- strip mining & permuting
- better cache utilization
- Matrix-vector multiply
- Matrix multiplication
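For the matrix multiplication case, the canonical kernel strip-mines the k loop into tiles, stages a tile of A and B in shared memory, and reuses every staged element TILE times from fast memory. A sketch assuming square n×n row-major matrices, n a multiple of TILE, and a TILE×TILE thread block:

```cuda
#define TILE 16

// Launch with dim3 block(TILE, TILE) and dim3 grid(n / TILE, n / TILE).
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n; t += TILE) {            // strip-mined k loop
        As[threadIdx.y][threadIdx.x] = A[row * n + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * n + col];
        __syncthreads();                           // tile fully loaded
        for (int k = 0; k < TILE; k++)             // each staged element reused TILE times
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                           // done with this tile
    }
    C[row * n + col] = acc;
}
```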
Target of Memory Hierarchy Optimization
- Reduce memory latency
- memory bandwidth
- manage overhead
- scheduling unit
- execution unit
- memory unit
memory alignment
128-byte alignment
Using shared memory to coalesce the accesses (see the transpose sketch below)
memory banks in shared memory; irregular (padded) tile dimensions can fix bank conflicts
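The classic illustration is a matrix transpose staged through a shared tile: global reads and writes stay coalesced because each warp touches consecutive addresses, and padding the tile by one column removes the bank conflicts of the column-wise read. A sketch assuming a 32×32 thread block and n a multiple of 32:

```cuda
#define TDIM 32

// Launch with dim3 block(TDIM, TDIM) and dim3 grid(n / TDIM, n / TDIM).
__global__ void transpose(const float *in, float *out, int n) {
    // +1 column of padding: column accesses no longer hit the same bank.
    __shared__ float tile[TDIM][TDIM + 1];

    int x = blockIdx.x * TDIM + threadIdx.x;
    int y = blockIdx.y * TDIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced read (consecutive x)
    __syncthreads();

    // Swap block coordinates; threads still write consecutive global addresses.
    x = blockIdx.y * TDIM + threadIdx.x;
    y = blockIdx.x * TDIM + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // column read from the padded tile
}
```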
dense array-based computation
Irregular computations
Sparse Matrix-times Vector multiplication
general sparse matrix representation:
- primarily only represent the nonzero data values
- Auxiliary data structures describe placement of nonzeros in the “dense matrix”
- partitioning
- access patterns
parallel reduction
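A block-level tree reduction in shared memory, as a sketch (blockDim.x is assumed to be 256; a second pass or atomicAdd over the per-block partial sums finishes the job):

```cuda
__global__ void block_sum(const float *in, float *block_sums, int n) {
    __shared__ float sdata[256];                 // one slot per thread (blockDim.x == 256)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = sdata[0];       // one partial sum per block
}
```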
- Compressed sparse row representation
- coordinate representation
- Blocked Compressed sparse row
- ELLPACK
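For example, CSR keeps the nonzero values plus two auxiliary arrays (column indices and row-start offsets); a one-thread-per-row SpMV sketch:

```cuda
// y = A * x for a CSR matrix with num_rows rows.
// val[]    : nonzero values
// col[]    : column index of each nonzero
// rowptr[] : rowptr[r]..rowptr[r+1] is the range of nonzeros in row r
__global__ void spmv_csr(int num_rows, const int *rowptr, const int *col,
                         const float *val, const float *x, float *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0.0f;
        for (int k = rowptr[row]; k < rowptr[row + 1]; k++)
            dot += val[k] * x[col[k]];           // irregular, data-dependent access to x
        y[row] = dot;
    }
}
```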
Scalar expansion
Stencil
- Scalar replacement
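Scalar replacement, sketched on a 1D three-point stencil where each thread sweeps a strip: the reused neighbor values live in registers and are rotated instead of being reloaded every iteration (the strip length and boundary handling are simplified assumptions).

```cuda
// Each thread processes a contiguous strip of 'len' interior points.
__global__ void stencil_strip(const float *in, float *out, int n, int len) {
    int start = (blockIdx.x * blockDim.x + threadIdx.x) * len + 1;
    if (start + len > n - 1) return;          // skip incomplete strips in this sketch

    // Scalar replacement: the three live values are held in registers.
    float left = in[start - 1], mid = in[start], right;
    for (int i = start; i < start + len; i++) {
        right = in[i + 1];                    // only one new load per iteration
        out[i] = 0.25f * left + 0.5f * mid + 0.25f * right;
        left = mid; mid = right;              // rotate registers instead of reloading
    }
}
```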
floating point
- historically single precision
- movement to double precision; prior to GTX and Tesla, only single precision was supported
- only use double precision when needed
- Floating point representation
- Double & single precision performance varies a lot (single is significantly faster)
function units
- Hardware intrinsics, implemented in special function units, are faster but less precise than the software implementations
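For example, the single-precision transcendental intrinsics run on the special function units; a sketch contrasting them with the slower, more accurate library calls:

```cuda
__global__ void transcendentals(const float *x, float *fast_out, float *accurate_out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        fast_out[i]     = __sinf(x[i]) * __expf(x[i]);  // hardware intrinsics (SFU): fast, reduced precision
        accurate_out[i] = sinf(x[i]) * expf(x[i]);      // library versions: slower, more accurate
    }
}
// (Compiling with nvcc -use_fast_math maps the library calls onto the intrinsic forms globally.)
```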
parallel sort
Key issues
- Data movement requires significant memory bandwidth
- Managing global list may require global synchronization
- very little computation, memory bound
Hybrid sorting algorithms
- different strategies for different numbers of elements
- sort sublists in shared memory
- use global memory to create pivots
- histogram constructed in global memory to identify pivots
- Bucket sort into separate buckets
- MergeSort within each bucket
- vector sort of float4 entries
- vector sort of pairs of float4s
parallel radix sort
- count
- prefix sum
- reorder
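A hedged sketch of one 8-bit-digit pass: a counting kernel builds the histogram with atomics, an exclusive prefix sum over the 256 bins (done on the host here for brevity) gives each bucket's starting offset, and a scatter kernel reorders the keys. The atomic scatter is not stable, so a full multi-pass LSD radix sort would need per-block ranking instead.

```cuda
#include <cuda_runtime.h>

// Phase 1: count how many keys fall into each of the 256 buckets for this digit.
__global__ void count_digits(const unsigned *keys, int n, int shift, unsigned *hist) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&hist[(keys[i] >> shift) & 0xFF], 1u);
}

// Phase 3: scatter each key into its bucket region (offsets come from the prefix sum).
// atomicAdd hands out positions within a bucket, so order inside a bucket is not
// preserved; that is fine for a single bucket split, not for multi-pass LSD sort.
__global__ void scatter(const unsigned *keys, unsigned *out, int n, int shift,
                        unsigned *offsets) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned bucket = (keys[i] >> shift) & 0xFF;
        unsigned pos = atomicAdd(&offsets[bucket], 1u);
        out[pos] = keys[i];
    }
}

// Phase 2 (host, for brevity): exclusive prefix sum over the 256 bucket counts,
// giving each bucket's starting offset in the output array.
void exclusive_scan_256(const unsigned *hist, unsigned *offsets) {
    unsigned running = 0;
    for (int b = 0; b < 256; b++) {
        offsets[b] = running;
        running += hist[b];
    }
}
```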