## GPU-accelerated vs. CPU-only Applications





the CPU **DATA CPU** a=np.arange(n) a=do\_work(a) Time



...and all work is performed serially on

...and all work is performed serially on the CPU **DATA CPU** a=do\_work(a) verify(a) a=np.arange(n) Time



both host and device memory. **DATA GPU** a=np.arange(n) Time



In accelerated applications there is













# CUDA Thread Hierarchy











CUDA can process thousands of threads in parallel. The sizes are greatly reduced in these images for simplicity.













### Kernels are **launched** with an **execution configuration**













## CUDA-Provided Thread Hierarchy Variables

Inside kernel definitions, CUDAprovided variables describe its executing thread, block, and grid



gridDim.x is the number of blocks in
the grid, in this case 2











Inside a kernel blockDim.x describes the number of threads in a block. In this case 4





















# Coordinating Parallel Threads











0 4

1 | 5

2 6

3 || '

**GPU** 

# do\_work[2, 4](d\_a)















0 | 4

1 || 5

2 | 6

3 ||

7

There is an idiomatic way to calculate this value, however. Recall that each thread has access to the size of its block via blockDim.x





GPU DATA 0 | 4

1 | 5

2 | 6

3 || -

...and the index of its block within the grid via blockIdx.x





GPU DATA 0 | 4

1 || 5

2 | 6

3 || .

...and its own index within its block via threadIdx.x





0 | 4

1 || 5

2 | 6

3 || '

Using these variables, the formula threadIdx.x + blockIdx.x \* blockDim.x will return the thread's unique index in the whole grid, which we can then map to data elements.





0 | 4

0

threadIdx.x | + | blockIdx.x | \* | blockDim.x

0

4

1

5

6

data\_index

2 ||

3

7











0 4

threadIdx.x | + | blockIdx.x | \* | blockDim.x

data\_index







3





0

2

threadIdx.x + blockIdx.x blockDim.x

0

5

data\_index

6

3











0 4

3

threadIdx.x | + | blockIdx.x | \* | blockDim.x

0

4

1 |

5

data\_index

2

6

3

7









0 4

0

threadIdx.x | + | blockIdx.x | \* | blockDim.x

1

4

1 |

5

data\_index

2

6

?

3

7









0 4

1

threadIdx.x + blockIdx.x \* blockDim.x

4

1

5

data\_index

2 |

3

6

7





3





0 4

2

threadIdx.x | + | blockIdx.x | \* | blockDim.x

1

4

1 || 5

5

6

data\_index

2 |

3

7









0

3

threadIdx.x + blockIdx.x blockDim.x

5

data\_index

6

3











As a convenience, Numba provides the `cuda.grid()` function, which will return a thread's unique index in the grid.

blockDim.x

blockIdx.x





# **Grid-Stride Loops**



Often there are more data elements than there are threads in the grid









... or else work is left undone









One way to address this programmatically is with a grid-stride loop

















Numba provides another convenience function for this common calculation:

cuda.gridsize(),

returning the number of threads in the grid

## **GPU**

















With all threads working in parallel using a grid stride loop...

\_

do\_work[2, 4](d\_a)
0 1 2 3 0 1 2 3











With all threads working in this way, all elements are covered with the performance advantage of memory coalescing











With all threads working in this way, all elements are covered with the performance advantage of memory coalescing











With all threads working in this way, all elements are covered with the performance advantage of memory coalescing











With all threads working in this way, all elements are covered with the performance advantage of memory coalescing







