Parallel programming

Introduction

CUDA benefits for HPC

  • Exposes the architecture's execution model
  • Must be able to achieve high performance
  • Portable to different platforms (CUDA is not; OpenCL is)
  • Easy migration path for existing applications

How to program CUDA

  • Kernels
    • thread program
  • How to launch a thread program
  • Where to place data
  • synchronization across threads

Programming model

Three types of functions

  • Host : runs on the CPU, called from the CPU
  • global : runs on the GPU, called from the CPU (a kernel)
  • device : runs on the GPU, called from GPU code
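A minimal sketch of the three function qualifiers; the function names and kernel body below are illustrative only:

    // __host__   : runs on the CPU, called from the CPU (the default for plain functions)
    __host__ void hostHelper(void) { }

    // __device__ : runs on the GPU, callable only from GPU code
    __device__ float square(float x) { return x * x; }

    // __global__ : runs on the GPU, launched from the CPU (a kernel)
    __global__ void scaleKernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = square(data[i]);
    }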

Memory model

  • global
  • shared

Parallel model

  • Two levels
    • Block level : blocks are mapped to streaming multiprocessors (SMs), which share global memory
    • Thread level : threads run on the streaming processors (cores) within an SM
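A hedged sketch of the two-level launch; scaleKernel, d_data, and the sizes are illustrative, and d_data is assumed to have been allocated with cudaMalloc:

    int n = 1 << 20;                                              // illustrative problem size
    int threadsPerBlock = 256;                                    // thread level: runs on an SM's cores
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // block level: blocks scheduled onto SMs
    scaleKernel<<<numBlocks, threadsPerBlock>>>(d_data, n);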

Execution Model

  • Parallel programming model
    • software technology for expressing parallel algorithms
    • defines semantics of software running on the hardware
  • Parallel execution model
    • abstraction of hardware execution in a class of architectures
    • how to structure and name data and instructions
    • utilization
  • SIMD execution across a warp of M threads (the warp size)
  • Multithreaded execution
  • Each block is mapped to a single streaming multiprocessor (SM)

SIMD

SIMT (Single Instruction, Multiple Threads) = SIMD + multithreading

  • SIMD : a single thread of control
  • MIMD : multiple threads of control
  • SPMD : multiple threads of control, but each processor executes the same code

SIMD

  • Motivation
    • data-parallel computations map well to these architectures
    • Conserve control units and simplify coordination
  • No unified model for SIMD

Multithreading

  • All threads execute the same kernel program
  • the programmer declares blocks
  • threads in the same block can be synchronized
  • threads in different blocks cannot cooperate (except via global synchronization)

  • units
    • Load operands
    • Compute
    • Store result
  • Thread-level parallelism
    • a single thread leads to underutilization of the functional units
    • Multithreading
    • Multiprocessing
  • Multithreading
    • quick context switching
    • each thread has its own PC
    • function units and caches are shared

Scheduling instructions for SIMD, multithreaded multiprocessors

  • CUDA
    • large register file
    • each thread is assigned a window of registers (the whole thread block's registers must not exceed capacity)
    • the block's shared memory must also not exceed capacity
  • SM warp scheduling
    • Zero-overhead warp scheduling; a warp is scheduled as a whole
  • Warp scheduling
    • fetch one instruction per warp
    • check operand readiness via scoreboarding
    • select the next warp to execute
  • multiple blocks can be resident on an SM, as long as total resource limits are not exceeded (see the sketch below)
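A back-of-the-envelope sketch of that resource check; the per-SM limits and per-thread usage numbers below are assumptions for illustration, not values from the source:

    // Assumed per-SM limits (real values depend on the GPU generation).
    const int REGS_PER_SM = 65536;       // register file entries per SM (assumed)
    const int SMEM_PER_SM = 49152;       // bytes of shared memory per SM (assumed)

    int threadsPerBlock = 256;
    int regsPerThread   = 32;            // e.g. reported by nvcc --ptxas-options=-v
    int smemPerBlock    = 4096;          // bytes of shared memory declared by the kernel

    // A block is resident only if the whole block's registers and shared memory fit.
    int maxBlocksByRegs = REGS_PER_SM / (regsPerThread * threadsPerBlock);
    int maxBlocksBySmem = SMEM_PER_SM / smemPerBlock;
    int residentBlocks  = maxBlocksByRegs < maxBlocksBySmem ? maxBlocksByRegs : maxBlocksBySmem;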

Synchronization

How to determine if it is correct

  • comparing results
  • debugger
  • Prevent race conditions

  • at least one access is a write
  • Race condition: the result of an execution depends on the timing of two or more events
  • Data dependence: two operations that may refer to the same location, at least one of which is a write; their order must be preserved
  • dependences constrain parallelization (safety) and locality optimization (speed)

Data dependence

  • true dependence
  • anti-dependence
  • output dependence
  • input dependence (matters only for locality)
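A small illustration of the four kinds, written as plain C statements on an array a (the function and variable names are placeholders):

    void dependenceKinds(float *a, int i, float x, float z) {
        float y;
        a[i] = x;  y = a[i];     // true (flow) dependence: write then read of the same location
        y = a[i];  a[i] = x;     // anti-dependence: read then write
        a[i] = x;  a[i] = z;     // output dependence: write then write
        y = a[i];  z = a[i];     // input dependence: read then read (matters only for locality)
        (void)y;                 // silence unused-variable warnings
    }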

dependence from iteration i to i’

  • at least one access is a write
  • they may refer to the same element
  • i executes before i’

any reordering transformation that preserves every dependence in a program preserves the meaning of that program

A loop can run in parallel if no dependences cross its iteration boundary; otherwise, make the dependence internal to one thread program (or to the host program), or add synchronization (see the sketch below).
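A minimal sketch of the first case: a host loop whose iterations carry no cross-iteration dependence, and the corresponding thread program (names are illustrative):

    // Original host loop: iterations are independent, so no dependence
    // crosses the iteration boundary.
    void vecAddHost(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    // Thread program: the loop body becomes the kernel, and any remaining
    // dependence is internal to a single thread's work.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }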

  • memory-based
  • control-based

  • mutex: not provided in CUDA
  • atomicity: involves the memory controller; can suffer contention
  • barrier: block-level sync only: __syncthreads()
  • Memory fence
    • __threadfence_block() waits until all global and shared memory accesses made by the calling thread before the call are visible to all threads in the block
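A small sketch combining the block-level barrier and atomics; it assumes a launch with 256 threads per block and 256 histogram bins, both illustrative choices:

    __global__ void blockHistogram(const int *in, int *globalHist, int n) {
        __shared__ int localHist[256];               // per-block scratch in shared memory
        int tid = threadIdx.x;
        localHist[tid] = 0;
        __syncthreads();                             // barrier: synchronizes this block only

        int i = blockIdx.x * blockDim.x + tid;
        if (i < n)
            atomicAdd(&localHist[in[i] & 255], 1);   // atomicity handled by the memory system
        __syncthreads();

        atomicAdd(&globalHist[tid], localHist[tid]); // blocks combine results via global atomics
    }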

Locality and data placement

  • GPU Memory
    • device / global memory: latency ~100 cycles
    • shared memory: on the SM, for threads within a block; latency under 10 cycles; 48 KB
    • data cache: latency under 10 cycles; 16 KB; its size can be traded against shared memory
    • constant memory: may be cached in the constant cache; latency 1~100 cycles
    • register file
    • texture / surface memory (explicitly marked; accessed through dedicated mechanisms)
  • Data reuse
    • the same or nearby data is accessed multiple times
  • Data locality
    • the same or nearby data is accessed from fast memory
  • Reorder computation
  • reorder data layout

Spatial reuse

  • C: row-major order; each row is stored contiguously
  • Fortran: column-major order; each column is stored contiguously
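Two small helpers showing the addressing difference (hypothetical names):

    // Row-major (C) vs column-major (Fortran) addressing of element (i, j).
    __device__ float rowMajorAt(const float *A, int ncols, int i, int j) {
        return A[i * ncols + j];   // rows stored contiguously: varying j is stride-1
    }
    __device__ float colMajorAt(const float *A, int nrows, int i, int j) {
        return A[j * nrows + i];   // columns stored contiguously: varying i is stride-1
    }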

tiling

strip mining & loop permutation

better cache utilization

  • Matrix-vector multiply
  • Matrix multiplication
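A hedged sketch of shared-memory tiling for matrix multiplication; it assumes square N x N row-major matrices with N a multiple of the tile size, and a launch with TILE x TILE thread blocks:

    #define TILE 16

    // C = A * B for N x N row-major matrices, N assumed to be a multiple of TILE.
    __global__ void matmulTiled(const float *A, const float *B, float *C, int N) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < N / TILE; t++) {
            // Each thread stages one element of the A and B tiles into shared memory.
            As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();                     // tile fully loaded before use

            for (int k = 0; k < TILE; k++)       // each loaded element is reused TILE times
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();                     // done with this tile before overwriting it
        }
        C[row * N + col] = acc;
    }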

Target of Memory Hierarchy Optimization

  • Reduce memory latency
  • reduce memory bandwidth demand
  • manage overhead

  • scheduling unit
  • execution unit
  • memory unit

memory alignment

128-byte alignment for coalesced global memory accesses

Using shared memory to coalesce the accesses

memory banks in shared memory; irregular (padded) tile dimensions can fix bank conflicts (see the transpose sketch below)
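A sketch showing both ideas on a matrix transpose: the tile in shared memory makes the global loads and stores coalesced, and the extra padding column gives the tile an irregular width so column accesses fall in different banks (kernel name and tile size are illustrative):

    #define TDIM 32

    // Transpose of a height x width row-major matrix via a shared-memory tile.
    __global__ void transposeTiled(const float *in, float *out, int width, int height) {
        __shared__ float tile[TDIM][TDIM + 1];       // +1 padding column breaks bank conflicts

        int x = blockIdx.x * TDIM + threadIdx.x;
        int y = blockIdx.y * TDIM + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced load
        __syncthreads();

        int tx = blockIdx.y * TDIM + threadIdx.x;    // swapped block coordinates
        int ty = blockIdx.x * TDIM + threadIdx.y;
        if (tx < height && ty < width)
            out[ty * height + tx] = tile[threadIdx.x][threadIdx.y];   // coalesced store
    }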

dense array-based computation

Irregular computations

Sparse matrix-vector multiplication (SpMV)

general sparse matrix representation:

  • primarily only represent the nonzero data values
  • Auxiliary data structures describe placement of nonzeros in “dense matrix”

  • partitioning
  • access patterns
  • parallel reduction

  • Compressed Sparse Row (CSR) representation
  • Coordinate (COO) representation
  • Blocked Compressed Sparse Row (BCSR)
  • ELLPACK
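A minimal CSR SpMV sketch with one thread per row (a simple partitioning; names are illustrative, and row-length imbalance across a warp is ignored here):

    // y = A*x for a sparse matrix in CSR form: vals/colIdx hold the nonzeros,
    // rowPtr[row]..rowPtr[row+1] delimits row 'row'.
    __global__ void spmvCsr(const int *rowPtr, const int *colIdx, const float *vals,
                            const float *x, float *y, int numRows) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < numRows) {
            float dot = 0.0f;
            for (int jj = rowPtr[row]; jj < rowPtr[row + 1]; jj++)
                dot += vals[jj] * x[colIdx[jj]];     // gather through the column index
            y[row] = dot;
        }
    }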

  • Scalar expansion

Stencil

  • Scalar replacement
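A small illustration of scalar replacement on a 1-D 3-point stencil: repeated reads of neighboring elements are replaced by values carried in registers across iterations (coefficients and names are illustrative):

    void stencil1d(const float *in, float *out, int n) {
        if (n < 3) return;
        float left = in[0], mid = in[1];
        for (int i = 1; i < n - 1; i++) {
            float right = in[i + 1];                 // only one new load per iteration
            out[i] = 0.25f * left + 0.5f * mid + 0.25f * right;
            left = mid;                              // registers replace repeated array reads
            mid  = right;
        }
    }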

floating point

  • historically single precision
  • movement to double precision; prior to the GTX and Tesla generations, only single precision
  • use double precision only when needed
  • Floating-point representation
  • Double- and single-precision performance differ a lot (single is significantly faster)

function units

  • Hardware intrinsics implemented in the special function units are faster but less precise than the software implementations
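A minimal example contrasting a hardware intrinsic with the standard math routine (kernel name is illustrative):

    __global__ void intrinsicDemo(const float *x, float *fast, float *precise, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            fast[i]    = __sinf(x[i]);   // hardware intrinsic in the special function unit
            precise[i] = sinf(x[i]);     // software implementation: slower, more accurate
        }
    }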

parallel sort

Key issues

  • Data movement requires significant memory bandwidth
  • Managing a global list may require global synchronization
  • very little computation; the problem is memory-bound

Hybrid sorting algorithms

  • different strategies for different numbers of elements
  • sort sublists in shared memory
  • use global memory to create pivots

  • a histogram is constructed in global memory to identify pivots
  • bucket sort the elements into separate buckets
  • merge sort within each bucket
    • vector sort of float4 entries
    • vector sort of pairs of float4s

parallel radix sort

  • count
  • prefix sum
  • reorder
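A serial sketch of one radix-sort pass over an 8-bit digit, written on the host to show the three phases; on the GPU each phase runs in parallel, with the prefix sum done as a parallel scan:

    // One LSD radix pass on the digit selected by 'shift' (0, 8, 16, or 24).
    void radixPass(const unsigned *in, unsigned *out, int n, int shift) {
        int count[256] = {0};

        for (int i = 0; i < n; i++)                  // 1. count: histogram of digit values
            count[(in[i] >> shift) & 255]++;

        int sum = 0;                                 // 2. prefix sum: start offset of each bucket
        for (int d = 0; d < 256; d++) { int c = count[d]; count[d] = sum; sum += c; }

        for (int i = 0; i < n; i++)                  // 3. reorder: scatter keys to final positions
            out[count[(in[i] >> shift) & 255]++] = in[i];
    }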