Sunday, November 9, 2025

atomicAdd in a GPU program (CUDA Fortran)

When a subroutine is called on the GPU, such as

call sum_array<<<numBlocks, blockSize>>>(A_d, n, total_d)

Basically, all individual threads work independently.

To make a sum (or reduction) across blocks or across threads within a block,

one has to use a global device memory variable or a block-shared memory variable.

However, if the read-modify-write operations are done independently by each thread,

there will be a race condition and the result can be wrong.
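
For example, a naive kernel along these lines (the name sum_array_naive and its body are my own illustration, not from the post) lets every thread do its own read-modify-write on the global total, so updates from different threads can overwrite each other:

    ! Hypothetical sketch of the unsafe pattern (race condition)
    attributes(global) subroutine sum_array_naive(A, n, total)
      real(8), device :: A(*)
      integer, value  :: n
      real(8), device :: total
      integer :: i
      i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
      ! Unsafe: each thread reads total, adds A(i), and writes it back.
      ! Two threads can read the same old value, so additions are lost.
      if (i <= n) total = total + A(i)
    end subroutine sum_array_naive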

An atomic operation is a serialized operation across blocks or across threads in a block.

The scope of the operation is determined by the memory location of the variable.

Note that if the sums are partially accumulated in each block, one has to be sure to call atomicAdd from only one thread per block to get the correct global sum, as sketched below.
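
A minimal sketch of such a kernel, written to match the launch shown above (the module name and other details are my own assumptions; double-precision atomicAdd also requires a sufficiently recent GPU). Each block first accumulates a partial sum in a shared-memory variable, then thread 1 of each block adds that partial sum into the global total with a second atomicAdd:

    module sum_mod
      use cudafor
    contains
      attributes(global) subroutine sum_array(A, n, total)
        real(8), device :: A(*)        ! global device memory
        integer, value  :: n
        real(8), device :: total       ! global accumulator, shared by all blocks
        real(8), shared :: blocksum    ! partial sum, shared within this block
        real(8) :: dummy
        integer :: i

        i = (blockIdx%x - 1) * blockDim%x + threadIdx%x

        if (threadIdx%x == 1) blocksum = 0.0d0
        call syncthreads()

        ! Atomic within the block: serialized read-modify-write on shared memory
        if (i <= n) dummy = atomicAdd(blocksum, A(i))
        call syncthreads()

        ! Only one thread per block adds the block's partial sum to the global total
        if (threadIdx%x == 1) dummy = atomicAdd(total, blocksum)
      end subroutine sum_array
    end module sum_mod

The atomicAdd semantics used here are summarized in the notes below.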

    ! Atomic reduction to shared memory
    !    dummy = atomicAdd(variable, value)
    !
    !    atomic (indivisible) operation : read-modify-write as a single hardware instruction,
    !                                     which avoids race conditions (i.e., serialized by the hardware).
    !                                     The variable has to be global in device memory or shared within a block.
    !                                     The variable has to be integer or real (not complex).
    !    The scope of the operation (within a block or across all blocks) is determined by the memory space:
    !       REAL(8),DEVICE or argument = global memory, shared by all blocks
    !       REAL(8),SHARED             = shared memory within the same block
    !       REAL(8)                    = local memory within a thread
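
As a host-side usage sketch (the program and variable names here are my own, chosen to match the launch at the top of the post): declaring total_d with the DEVICE attribute gives the kernel's atomicAdd global scope across all blocks, and assignments to and from it perform the host-device copies.

    program reduce_driver
      use cudafor
      use sum_mod                              ! hypothetical module holding sum_array
      implicit none
      integer, parameter :: n = 1048576
      real(8), allocatable          :: A(:)
      real(8), allocatable, device  :: A_d(:)  ! global device memory
      real(8), device               :: total_d ! global device scalar, visible to all blocks
      real(8) :: total
      integer :: blockSize, numBlocks, istat

      allocate(A(n))
      allocate(A_d(n))
      A       = 1.0d0
      A_d     = A                              ! implicit host-to-device copy
      total_d = 0.0d0                          ! initialize the accumulator on the device

      blockSize = 256
      numBlocks = (n + blockSize - 1) / blockSize
      call sum_array<<<numBlocks, blockSize>>>(A_d, n, total_d)
      istat = cudaDeviceSynchronize()

      total = total_d                          ! implicit device-to-host copy
      print *, 'sum =', total                  ! expect 1048576.0
    end program reduce_driver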
