

# Accelerating Applications with CUDA C/C++

## TOPICS

- GPU-accelerated vs. CPU-only Applications
- CUDA Kernel Execution
- Parallel Memory Access
- Appendix: Glossary

### **GPU-accelerated vs. CPU-only Applications**

In **CPU-only applications**, data is allocated on the CPU, and all work (for example `initialize()` in the slide's timeline) is performed on the CPU.









In **GPU-accelerated applications**, data is allocated with **cudaMallocManaged()**, where it can be accessed and worked on by the CPU (for example by `initialize()`), and is automatically migrated to the GPU as needed.
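As a minimal sketch of the accelerated pattern, the program below allocates unified memory with `cudaMallocManaged()`, initializes it on the CPU, lets the GPU modify it, and waits with `cudaDeviceSynchronize()` before reading the result back on the CPU. The kernel name `doubleElements`, the helper `runOnGpu`, and the single-thread launch `<<<1, 1>>>` are assumptions made for illustration; the launch syntax itself is covered under CUDA Kernel Execution.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// CPU function: fills the array. The name follows the slide's timeline.
void initialize(int *a, int n)
{
    for (int i = 0; i < n; ++i) a[i] = i;
}

// Hypothetical GPU function: one thread doubles every element.
// Kept serial on purpose so the example is only about memory.
__global__ void doubleElements(int *a, int n)
{
    for (int i = 0; i < n; ++i) a[i] *= 2;
}

// Returns true if every element was doubled on the GPU.
bool runOnGpu(int n)
{
    int *a = nullptr;
    // Unified memory: the same pointer is valid on the CPU and the GPU.
    if (cudaMallocManaged(&a, n * sizeof(int)) != cudaSuccess) return false;

    initialize(a, n);                // the CPU works on the data first
    doubleElements<<<1, 1>>>(a, n);  // the GPU works on the same allocation

    // Kernel launches are asynchronous: wait before reading on the CPU.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int i = 0; ok && i < n; ++i) ok = (a[i] == 2 * i);
    cudaFree(a);
    return ok;
}

int main()
{
    printf("%s\n", runOnGpu(1000) ? "all elements doubled" : "failed");
    return 0;
}
```

Compiled with `nvcc` and run on a machine with a CUDA-capable GPU, the same pointer `a` is used by both `initialize()` and the kernel, with no explicit copy between CPU and GPU memory.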

NVIDIA DEEP LEARNING INSTITUTE
















### **CUDA Kernel Execution**





GPUs do work in parallel





GPU work is done in a thread





Many threads run in parallel





A collection of threads is a **block** 





There are many blocks





A collection of blocks is a grid





GPU functions are called kernels





Kernels are **launched** with an **execution configuration** 





The execution configuration defines the number of blocks in the grid





... as well as the number of threads in each block





Every block in the grid contains the same number of threads
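The launch syntax can be sketched as follows. This is an illustrative example rather than the course's own code: the `counter` parameter and the helper `countThreads` are assumptions added so the number of threads that ran can be observed from the CPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread that runs the kernel adds 1 to a shared counter.
__global__ void performWork(int *counter)
{
    atomicAdd(counter, 1);  // atomic, so concurrent increments are not lost
}

// Launches performWork with the given execution configuration and
// returns how many threads ran, or -1 on a CUDA error.
int countThreads(int numberOfBlocks, int threadsPerBlock)
{
    int *counter = nullptr;
    if (cudaMallocManaged(&counter, sizeof(int)) != cudaSuccess) return -1;
    *counter = 0;

    // Execution configuration: <<<blocks in the grid, threads in each block>>>
    performWork<<<numberOfBlocks, threadsPerBlock>>>(counter);

    int result = (cudaDeviceSynchronize() == cudaSuccess) ? *counter : -1;
    cudaFree(counter);
    return result;
}

int main()
{
    // 2 blocks of 4 threads each: 2 * 4 = 8 threads run the kernel.
    printf("threads run: %d\n", countThreads(2, 4));
    return 0;
}
```

Because every block has the same number of threads, the total thread count is always the product of the two launch arguments.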





### **CUDA-Provided Thread Hierarchy Variables**

Inside kernel definitions, CUDA-provided variables describe the executing thread, block, and grid





**gridDim.x** is the number of blocks in the grid, in this case **2**







**blockIdx.x** is the index of the current block within the grid, in this case **0** 





**blockIdx.x** is the index of the current block within the grid, in this case 1





Inside a kernel **blockDim.x** describes the number of threads in a block. In this case **4** 







All blocks in a grid contain the same number of threads
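A small sketch can make these variables concrete: each thread prints the four values it sees and checks them against the launch configuration. The kernel name `reportHierarchy`, its parameters, and the helper `countMismatches` are hypothetical names chosen for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every thread prints its hierarchy variables and records a mismatch if
// they disagree with the configuration the host says it launched.
__global__ void reportHierarchy(unsigned int expectedBlocks,
                                unsigned int expectedThreads,
                                int *mismatches)
{
    printf("blockIdx.x=%u gridDim.x=%u threadIdx.x=%u blockDim.x=%u\n",
           blockIdx.x, gridDim.x, threadIdx.x, blockDim.x);

    bool consistent = gridDim.x == expectedBlocks &&
                      blockDim.x == expectedThreads &&
                      blockIdx.x < gridDim.x &&    // block index is 0..gridDim.x-1
                      threadIdx.x < blockDim.x;    // thread index is 0..blockDim.x-1
    if (!consistent) atomicAdd(mismatches, 1);
}

// Returns the number of threads that saw unexpected values, or -1 on error.
int countMismatches(unsigned int blocks, unsigned int threads)
{
    int *mismatches = nullptr;
    if (cudaMallocManaged(&mismatches, sizeof(int)) != cudaSuccess) return -1;
    *mismatches = 0;

    reportHierarchy<<<blocks, threads>>>(blocks, threads, mismatches);

    int result = (cudaDeviceSynchronize() == cudaSuccess) ? *mismatches : -1;
    cudaFree(mismatches);
    return result;
}

int main()
{
    printf("mismatches: %d\n", countMismatches(2, 4));
    return 0;
}
```

With the `<<<2, 4>>>` launch, eight lines are printed, one per thread. The order of those lines is not guaranteed, since the threads run in parallel.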



































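Figure: the two blocks in the grid are indexed 0 and 1, and the four threads within each block are indexed 0 to 3 (given by **threadIdx.x**).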







### **Coordinating Parallel Threads**















Assume data is in a 0-indexed vector

`performWork<<<2, 4>>>()`











Recall that each thread has access to the size of its block via **blockDim.x** 








...and the index of its block within the grid via **blockIdx.x** 











Figure: with `performWork<<<2, 4>>>()`, **threadIdx.x** ranges over 0 to 3 in each block, **blockIdx.x** is 0 or 1, and **blockDim.x** is 4.







Using these variables, the formula `threadIdx.x + blockIdx.x * blockDim.x` maps each thread to one element in the vector
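To see the mapping in action, the sketch below has each of the 8 threads in a `<<<2, 4>>>` launch record its own block and thread index at the element it maps to. The kernel `recordOwner`, the helper `mappingCoversVector`, and the constants are assumptions introduced for this illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

const int kBlocks = 2;
const int kThreadsPerBlock = 4;
const int kElements = kBlocks * kThreadsPerBlock;  // one element per thread

// Each thread writes its identity into the element it is mapped to.
__global__ void recordOwner(int *blockOf, int *threadOf)
{
    int dataIndex = threadIdx.x + blockIdx.x * blockDim.x;
    blockOf[dataIndex] = blockIdx.x;
    threadOf[dataIndex] = threadIdx.x;
}

// Returns true if every element was claimed by the thread the formula predicts.
bool mappingCoversVector(bool print)
{
    int *blockOf = nullptr, *threadOf = nullptr;
    if (cudaMallocManaged(&blockOf, kElements * sizeof(int)) != cudaSuccess)
        return false;
    if (cudaMallocManaged(&threadOf, kElements * sizeof(int)) != cudaSuccess) {
        cudaFree(blockOf);
        return false;
    }
    for (int i = 0; i < kElements; ++i) blockOf[i] = threadOf[i] = -1;  // untouched

    recordOwner<<<kBlocks, kThreadsPerBlock>>>(blockOf, threadOf);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int i = 0; ok && i < kElements; ++i) {
        if (print)
            printf("element %d <- block %d, thread %d\n", i, blockOf[i], threadOf[i]);
        ok = (threadOf[i] + blockOf[i] * kThreadsPerBlock == i);
    }
    cudaFree(blockOf);
    cudaFree(threadOf);
    return ok;
}

int main()
{
    printf("%s\n", mappingCoversVector(true) ? "mapping ok" : "mapping failed");
    return 0;
}
```

For example, thread 1 of block 1 computes `1 + 1 * 4 = 5`, so it owns element 5. No two threads compute the same index.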
























































































### **Grid Size Work Amount Mismatch**
















Attempting to access non-existent elements can result in a runtime error
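A common way to guard against this, sketched here under assumed names, is to have each thread compare its data index with the element count before touching memory. In the example, 8 threads are launched for 5 elements. The `workers` counter and the helper `countWorkingThreads` exist only to make the guard's effect observable.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Doubles a[i] only when i refers to a real element.
__global__ void doubleElements(int *a, int n, int *workers)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {                 // threads with i >= n do nothing
        a[i] *= 2;
        atomicAdd(workers, 1);   // count the threads that did work
    }
}

// Launches 2 blocks of 4 threads over n elements (n must be at most 8 here).
// Returns how many threads passed the guard, or -1 on error or wrong data.
int countWorkingThreads(int n)
{
    int *a = nullptr, *workers = nullptr;
    if (cudaMallocManaged(&a, n * sizeof(int)) != cudaSuccess) return -1;
    if (cudaMallocManaged(&workers, sizeof(int)) != cudaSuccess) {
        cudaFree(a);
        return -1;
    }
    for (int i = 0; i < n; ++i) a[i] = i;
    *workers = 0;

    doubleElements<<<2, 4>>>(a, n, workers);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int i = 0; ok && i < n; ++i) ok = (a[i] == 2 * i);
    int result = ok ? *workers : -1;
    cudaFree(a);
    cudaFree(workers);
    return result;
}

int main()
{
    printf("threads that did work: %d of 8\n", countWorkingThreads(5));
    return 0;
}
```

Without the `if (i < n)` check, threads 5, 6, and 7 would write past the end of the 5-element allocation.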
























































### **Grid-Stride Loops**



Often there are more data elements than there are threads in the grid











In such scenarios, threads cannot each work on only one element, or some elements would be left unprocessed















One way to address this programmatically is with a grid-stride loop









In a grid-stride loop, the thread's first element is calculated as usual, with `threadIdx.x + blockIdx.x * blockDim.x`





The thread then strides forward by the number of threads in the grid (`blockDim.x * gridDim.x`), in this case 8





It continues in this way until its data index is greater than or equal to the number of data elements





With all threads working in this way, all elements are covered
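The loop described above can be sketched as follows, using the same `<<<2, 4>>>` launch over more elements than threads. The kernel name `doubleElements` and the helper `doubleAll` are hypothetical names for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride loop: each thread handles elements first, first + stride, ...
__global__ void doubleElements(int *a, int n)
{
    int first = threadIdx.x + blockIdx.x * blockDim.x;  // usual starting index
    int stride = blockDim.x * gridDim.x;                // threads in the grid

    for (int i = first; i < n; i += stride) {
        a[i] *= 2;
    }
}

// Doubles n elements with only 8 threads. Returns true if all were covered.
bool doubleAll(int n)
{
    int *a = nullptr;
    if (cudaMallocManaged(&a, n * sizeof(int)) != cudaSuccess) return false;
    for (int i = 0; i < n; ++i) a[i] = i;

    doubleElements<<<2, 4>>>(a, n);  // stride is 4 * 2 = 8
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    for (int i = 0; ok && i < n; ++i) ok = (a[i] == 2 * i);
    cudaFree(a);
    return ok;
}

int main()
{
    printf("%s\n", doubleAll(20) ? "all 20 elements covered" : "failed");
    return 0;
}
```

With 20 elements, thread 0 of block 0 processes indices 0, 8, and 16, while thread 3 of block 1 processes indices 7 and 15. The loop condition `i < n` also acts as the bounds check, so the same kernel stays correct when there are fewer elements than threads.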














CUDA runs as many blocks in parallel as the GPU hardware supports, for massive parallelization






































## Glossary

- **cudaMallocManaged():** CUDA function to allocate memory accessible by both the CPU and GPUs. Memory allocated this way is called *unified memory* and is automatically migrated between the CPU and GPUs as needed.
- **cudaDeviceSynchronize():** CUDA function that causes the CPU to wait until the GPU has finished working.
- **Kernel:** A CUDA function executed on a GPU.
- **Thread:** The unit of execution for CUDA kernels.
- **Block:** A collection of threads.
- **Grid:** A collection of blocks.
- **Execution configuration:** Special arguments given to CUDA kernels when launched using the `<<<...>>>` syntax. It defines the number of blocks in the grid, as well as the number of threads in each block.
- **gridDim.x:** CUDA variable available inside an executing kernel that gives the number of blocks in the grid.
- **blockDim.x:** CUDA variable available inside an executing kernel that gives the number of threads in the thread's block.
- **blockIdx.x:** CUDA variable available inside an executing kernel that gives the index of the thread's block within the grid.
- **threadIdx.x:** CUDA variable available inside an executing kernel that gives the index of the thread within its block.
- **`threadIdx.x + blockIdx.x * blockDim.x`:** Common CUDA technique to map a thread to a data element.
- **Grid-stride loop:** A technique for assigning a thread more than one data element to work on when there are more elements than threads in the grid. The stride is calculated as `gridDim.x * blockDim.x`, which is the number of threads in the grid.



