When a kernel subroutine is launched on the GPU, e.g.
call sum_array<<<numBlocks, blockSize>>>(A_d, n, total_d)
all threads execute independently.
To compute a sum (a reduction) across blocks, or across threads within a block,
one has to use a variable in global device memory or in block-level shared memory.
However, if the read-modify-write operations on that variable are performed independently,
there is a race condition and the result can be wrong.
An atomic operation serializes the update across blocks or across threads within a block.
The scope of the operation is determined by where the variable resides: shared memory gives block scope, global device memory gives grid scope.
Note that if partial sums are first accumulated within each block, be sure to call atomicAdd on the global total from only one thread per block to get the correct global sum.
! Atomic add (reduction) to a shared- or device-memory variable;
! the function returns the old value, so assign it to a dummy:
! dummy = atomicAdd(variable, value)
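Putting the pieces together, a minimal CUDA Fortran sketch of such a kernel might look as follows (the body of sum_array is assumed here; only the launch line appears above). Each block first reduces its elements into a shared-memory partial sum, then thread 1 alone adds that partial sum to the global total, so the global variable receives exactly one atomic update per block:

```fortran
attributes(global) subroutine sum_array(A, n, total)
  real :: A(*)             ! input array in device memory
  integer, value :: n      ! number of elements
  real :: total            ! global device-memory accumulator (total_d)
  real, shared :: partial  ! block-scope partial sum
  integer :: i
  real :: dummy

  i = (blockIdx%x - 1) * blockDim%x + threadIdx%x

  ! Thread 1 initializes the block's shared partial sum.
  if (threadIdx%x == 1) partial = 0.0
  call syncthreads()

  ! Every thread atomically adds its element to the shared partial sum;
  ! the atomic is needed because all threads in the block update it.
  if (i <= n) dummy = atomicAdd(partial, A(i))
  call syncthreads()

  ! Only thread 1 pushes the block's partial sum into the global total,
  ! avoiding one global atomic per thread.
  if (threadIdx%x == 1) dummy = atomicAdd(total, partial)
end subroutine sum_array
```

The atomicAdd on the shared variable serializes updates only within the block, while the single atomicAdd per block on total serializes across blocks; this two-level scheme is much cheaper than having every thread issue a global atomic.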