Parallel programming
Introduction
CUDA benefits for HPC
- Exposes the architecture's execution model
- Must be able to achieve high performance
- Portable to different platforms (CUDA is not; OpenCL is)
- Easy migration path for existing applications
How to program CUDA (see the sketch after this list)
- Kernels
- thread program
- How to launch thread program
- Where to place data
- synchronization across threads
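A minimal end-to-end sketch of these pieces (a hypothetical vector-add kernel; all names are illustrative): the kernel is the thread program, the <<<blocks, threads>>> launch configures it, cudaMalloc/cudaMemcpy decide where data is placed, and cudaDeviceSynchronize makes the host wait for the GPU.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Thread program (kernel): each thread handles one element.
__global__ void vadd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Where to place data: copy host arrays into device (global) memory.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // How to launch the thread program: a grid of blocks, each block of threads.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vadd<<<blocks, threads>>>(d_a, d_b, d_c, n);
    cudaDeviceSynchronize();                  // host waits for the kernel to finish

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```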
Programming model
Three types of functions
- __host__ (runs on the CPU, called from CPU code)
- __global__ (kernel: called from the CPU, runs on the GPU)
- __device__ (runs on the GPU, callable only from GPU code)
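As a quick sketch of the three qualifiers:

```cuda
__host__   void on_cpu() { }                    // runs on the CPU, called from CPU code
__global__ void kernel_entry() { }              // kernel: launched from the CPU, runs on the GPU
__device__ float on_gpu(float x) { return x; }  // runs on the GPU, callable only from GPU code
```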
Memory model
- global
- shared
Parallel model
- Two levels
- Block level : Streaming multiprocessors (SMs) that share global memory
- thread level : Streaming processors(cores)
Execution Model
- Parallel programming model
- software technology for expressing parallel algorithms
- defines semantics of software running on the hardware
- Parallel execution model
- abstraction of hardware execution in a class of architectures
- How to structure and name data and instructions
- utilization
- SIMD execution of warpsize=M threads
- Multithreaded execution
- Each block is mapped to a single streaming multiprocessor (SM)
SIMD
SIMT (single instruction, multiple threads) = SIMD + multithreading
- SIMD : a single thread of control
- MIMD : multiple threads of control
- SPMD : multiple threads of control, but each processor executes the same code
SIMD
- Motivation
- data-parallel computations map well to these architectures
- Conserve control units and simplify coordination
- No unified model for SIMD
Multithreading
- All threads execute the same kernel program
- the programmer declares blocks
- threads in the same block can be synchronized;
threads in different blocks cannot cooperate (except via global synchronization)
- units
- Load operands
- Compute
- Store result
- Thread-level parallelism
- single thread -> underutilization of the function units
- Multithreading
- Multiprocessing
- Multithreading
- quick context switching
- each thread has its own PC (program counter)
- shared function units and cache
Scheduling instructions for SIMD, multithreaded multiprocessors
- CUDA
- large register file
- each thread is assigned a register window (the entire thread block's registers must not exceed capacity)
- shared memory usage must not exceed capacity
- SM warp scheduling
- Zero-overhead warp scheduling; a warp is scheduled as a whole
- Warp scheduling
- fetch instruction
- operand scoreboarding (a warp is ready when all its operands are available)
- select the next warp to execute
- multiple blocks can be resident on an SM, as long as total limits are not exceeded
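A hedged sketch of checking how many blocks can be resident per SM for a given launch, using the runtime occupancy query (the kernel and the 256-thread block size are illustrative assumptions):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void kernel_entry(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    x[i] *= 2.0f;
}

int main() {
    int blocksPerSM = 0;
    // How many 256-thread blocks can be resident per SM, limited by
    // registers per thread, shared memory per block, and hardware limits.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel_entry,
                                                  /*blockSize=*/256,
                                                  /*dynamicSMemSize=*/0);
    printf("resident blocks per SM: %d\n", blocksPerSM);
    return 0;
}
```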
Synchronization
How to determine if it is correct
- comparing results
- debugger
Prevent race conditions
- at least one access is a write
- Race condition: result of an execution depends on the timing of two or more events
- Data dependence: the order of operations must be preserved, may refer to same location, and one is write
dependences matter for parallelization (safety) and locality optimization (speed)
Data dependence
- true dependence
- anti-dependence
- output dependence
- input dependence(locality)
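Tiny loop examples of each kind (host-style code; array and variable names are illustrative):

```cuda
// True (flow) dependence: iteration i writes a[i], iteration i+1 reads it.
for (int i = 1; i < n; i++) a[i] = a[i-1] + 1.0f;

// Anti-dependence: a[i+1] is read before a later iteration writes it.
for (int i = 0; i < n-1; i++) a[i] = a[i+1] * 2.0f;

// Output dependence: every iteration writes the same location s; order matters.
for (int i = 0; i < n; i++) { s = a[i]; b[i] = s; }

// Input dependence: two reads of the same location; no ordering constraint,
// but it matters for locality and reuse.
for (int i = 0; i < n; i++) c[i] = a[0] + a[i];
```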
A dependence from iteration i to i' exists if:
- at least one is write
- may refer to the same element
- i before i’
any reordering transformation that preserves every dependence in a program preserves the meaning of that program
To run a loop in parallel, no dependence may cross its iteration boundary; otherwise make the dependence internal to the thread program (or keep it in the host program), or add synchronization
- memory-based
- control-based
- mutex: not available in CUDA
- atomicity: involves the memory controller; subject to contention
- barrier : only block-level sync : __syncthreads();
- Memory fence
- __threadfence_block() waits until all global and shared memory accesses made by the calling thread before the call are visible to all threads in the block
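A short sketch combining atomics and the block barrier in one hypothetical per-block histogram kernel; __threadfence_block() would only be needed to order memory accesses between threads without a full execution barrier.

```cuda
__global__ void block_histogram(const int *keys, int n, int *global_hist) {
    __shared__ int local_hist[256];

    // Cooperative initialization of shared memory.
    for (int b = threadIdx.x; b < 256; b += blockDim.x) local_hist[b] = 0;
    __syncthreads();                       // barrier: all bins zeroed before counting

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&local_hist[keys[i] & 255], 1);   // atomic avoids a race on the bin
    __syncthreads();                       // barrier: counting done before merging

    // Merge the block's histogram into the global one.
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        atomicAdd(&global_hist[b], local_hist[b]);
}
```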
Locality and data placement
- GPU Memory
- device / global memory, ~100 cycles
- shared memory: on the SM, for threads in a block; <10 cycles, 48KB
- data cache: <10 cycles, 16KB; its size can be traded against shared memory
- constant memory: may be cached in the constant cache, 1~100 cycles
- register file
- texture/surface memory (marked, accessed through special paths)
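How the main spaces are named in CUDA C, as a sketch (the coefficient array and the kernel are illustrative; __constant__ data would be set from the host with cudaMemcpyToSymbol):

```cuda
__constant__ float coeff[16];          // constant memory, cached in the constant cache

__global__ void spaces_demo(const float *in, float *out, int n) {
    __shared__ float tile[256];        // shared memory: on-chip, visible to the whole block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = (i < n) ? in[i] : 0.0f;  // 'x' lives in a register; 'in' is global memory
    tile[threadIdx.x] = x;
    __syncthreads();
    if (i < n) out[i] = coeff[0] * tile[threadIdx.x];
}
```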
- Data reuse
- same or nearby data is accessed multiple times
- Data locality
- same or nearby data is accessed from fast memory
- Reorder computation
- reorder data layout
Spatial reuse
- C: row major order, every row is stored adjacently
- Fortran: col major
Tiling
- strip mining & permuting
- better cache utilization
- Matrix-vector multiply
- Matrix multiplication
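For the matrix multiplication case, the canonical kernel strip-mines the k loop into tiles, stages a tile of A and B in shared memory, and reuses every staged element TILE times from fast memory. A sketch assuming square n×n row-major matrices, n a multiple of TILE, and a TILE×TILE thread block:

```cuda
#define TILE 16

// Launch with dim3 block(TILE, TILE) and dim3 grid(n / TILE, n / TILE).
__global__ void matmul_tiled(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n; t += TILE) {            // strip-mined k loop
        As[threadIdx.y][threadIdx.x] = A[row * n + (t + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t + threadIdx.y) * n + col];
        __syncthreads();                           // tile fully loaded
        for (int k = 0; k < TILE; k++)             // each staged element reused TILE times
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                           // done with this tile
    }
    C[row * n + col] = acc;
}
```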
Target of Memory Hierarchy Optimization
- Reduce memory latency
- memory bandwidth
- manage overhead
- scheduling unit
- execution unit
- memory unit
memory alignment
128-byte alignment
Using shared memory to coalesce the accesses (see the transpose sketch below)
memory banks in shared memory; irregular (padded) tile dimensions can fix bank conflicts
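The classic illustration is a matrix transpose staged through a shared tile: global reads and writes stay coalesced because each warp touches consecutive addresses, and padding the tile by one column removes the bank conflicts of the column-wise read. A sketch assuming a 32×32 thread block and n a multiple of 32:

```cuda
#define TDIM 32

// Launch with dim3 block(TDIM, TDIM) and dim3 grid(n / TDIM, n / TDIM).
__global__ void transpose(const float *in, float *out, int n) {
    // +1 column of padding: column accesses no longer hit the same bank.
    __shared__ float tile[TDIM][TDIM + 1];

    int x = blockIdx.x * TDIM + threadIdx.x;
    int y = blockIdx.y * TDIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced read (consecutive x)
    __syncthreads();

    // Swap block coordinates; threads still write consecutive global addresses.
    x = blockIdx.y * TDIM + threadIdx.x;
    y = blockIdx.x * TDIM + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // column read from the padded tile
}
```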
dense array-based computation
Irregular computations
Sparse Matrix-times Vector multiplication
general sparse matrix representation:
- primarily only represent the nonzero data values
- Auxiliary data structures describe placement of nonzeros in the “dense matrix”
- partitioning
- access patterns
parallel reduction
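A block-level tree reduction in shared memory, as a sketch (blockDim.x is assumed to be 256; a second pass or atomicAdd over the per-block partial sums finishes the job):

```cuda
__global__ void block_sum(const float *in, float *block_sums, int n) {
    __shared__ float sdata[256];                 // one slot per thread (blockDim.x == 256)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = sdata[0];       // one partial sum per block
}
```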
- Compressed sparse row representation
- coordinate representation
- Blocked Compressed sparse row
- ELLPACK
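For example, CSR keeps the nonzero values plus two auxiliary arrays (column indices and row-start offsets); a one-thread-per-row SpMV sketch:

```cuda
// y = A * x for a CSR matrix with num_rows rows.
// val[]    : nonzero values
// col[]    : column index of each nonzero
// rowptr[] : rowptr[r]..rowptr[r+1] is the range of nonzeros in row r
__global__ void spmv_csr(int num_rows, const int *rowptr, const int *col,
                         const float *val, const float *x, float *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float dot = 0.0f;
        for (int k = rowptr[row]; k < rowptr[row + 1]; k++)
            dot += val[k] * x[col[k]];           // irregular, data-dependent access to x
        y[row] = dot;
    }
}
```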
Scalar expansion
Stencil
- Scalar replacement
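Scalar replacement, sketched on a 1D three-point stencil where each thread sweeps a strip: the reused neighbor values live in registers and are rotated instead of being reloaded every iteration (the strip length and boundary handling are simplified assumptions).

```cuda
// Each thread processes a contiguous strip of 'len' interior points.
__global__ void stencil_strip(const float *in, float *out, int n, int len) {
    int start = (blockIdx.x * blockDim.x + threadIdx.x) * len + 1;
    if (start + len > n - 1) return;          // skip incomplete strips in this sketch

    // Scalar replacement: the three live values are held in registers.
    float left = in[start - 1], mid = in[start], right;
    for (int i = start; i < start + len; i++) {
        right = in[i + 1];                    // only one new load per iteration
        out[i] = 0.25f * left + 0.5f * mid + 0.25f * right;
        left = mid; mid = right;              // rotate registers instead of reloading
    }
}
```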
floating point
- historically single precision
- movement to double precision; prior to GTX and Tesla, only single precision was supported
- only use double precision when needed
- Floating point representation
- Double & single precision performance varies a lot (single is significantly faster)
function units
- Hardware intrinsics, implemented in special function units, are faster but less precise than the software implementations
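For example, the single-precision transcendental intrinsics run on the special function units; a sketch contrasting them with the slower, more accurate library calls:

```cuda
__global__ void transcendentals(const float *x, float *fast_out, float *accurate_out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        fast_out[i]     = __sinf(x[i]) * __expf(x[i]);  // hardware intrinsics (SFU): fast, reduced precision
        accurate_out[i] = sinf(x[i]) * expf(x[i]);      // library versions: slower, more accurate
    }
}
// (Compiling with nvcc -use_fast_math maps the library calls onto the intrinsic forms globally.)
```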
parallel sort
Key issues
- Data movement requires significant memory bandwidth
- Managing global list may require global synchronization
- very little computation, memory bound
Hybrid sorting algorithms
- different strategies for different numbers of elements
- sort sublists in shared memory
- use global memory to create pivots
- histogram constructed in global memory to identify pivots
- Bucket sort into separate buckets
- MergeSort within each bucket
- vector sort of float4 entries
- vector sort of pairs of float4s
parallel radix sort
- count
- prefix sum
- reorder
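A hedged sketch of one 8-bit-digit pass: a counting kernel builds the histogram with atomics, an exclusive prefix sum over the 256 bins (done on the host here for brevity) gives each bucket's starting offset, and a scatter kernel reorders the keys. The atomic scatter is not stable, so a full multi-pass LSD radix sort would need per-block ranking instead.

```cuda
#include <cuda_runtime.h>

// Phase 1: count how many keys fall into each of the 256 buckets for this digit.
__global__ void count_digits(const unsigned *keys, int n, int shift, unsigned *hist) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&hist[(keys[i] >> shift) & 0xFF], 1u);
}

// Phase 3: scatter each key into its bucket region (offsets come from the prefix sum).
// atomicAdd hands out positions within a bucket, so order inside a bucket is not
// preserved; that is fine for a single bucket split, not for multi-pass LSD sort.
__global__ void scatter(const unsigned *keys, unsigned *out, int n, int shift,
                        unsigned *offsets) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned bucket = (keys[i] >> shift) & 0xFF;
        unsigned pos = atomicAdd(&offsets[bucket], 1u);
        out[pos] = keys[i];
    }
}

// Phase 2 (host, for brevity): exclusive prefix sum over the 256 bucket counts,
// giving each bucket's starting offset in the output array.
void exclusive_scan_256(const unsigned *hist, unsigned *offsets) {
    unsigned running = 0;
    for (int b = 0; b < 256; b++) {
        offsets[b] = running;
        running += hist[b];
    }
}
```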