# Memory Coalescing

Recall that thread blocks are divided into **warps** of 32 threads



Recall that thread blocks are divided into **warps** of 32 threads



Recall that thread blocks are divided into **warps** of 32 threads







For space on these slides, we will treat just 4 threads as a warp





Data is transferred to and from global device memory in 32-byte segments\*







(\* If the data is in the L1 cache it will be transferred in 128-byte cache lines – see the notebook for details)









For these slides we will treat 4 data elements as one of these fixed-length lines of contiguous memory









The memory subsystem will attempt to minimize the number of lines required to fulfill the read/write requirements of the warp

#### Warp

























## When this occurs, the memory access is fully **coalesced**















## And more of the data being transferred will go unused









# **Row and Column Sum Comparison**

Consider a kernel that stores the sum of each row of a matrix (which here is 4 contiguous data elements) in a result vector

#### Warp









#### Warp

















### Warp





### Warp





### Warp







#### 

Warp





This seems natural, but look at what happens when we consider the parallel execution within the warp









Each thread in the warp is requesting data in a different line of memory







Note that increments to threadIdx.x are mapping to increments in the data along the y axis

## Warp





Which means (in our example) 4 lines of data will need to be loaded, and 75% of the data loaded will be unused

#### Warp







Unfortunately, as each thread iterates over its row, the same uncoalesced pattern continues

## Warp





Unfortunately, as each thread iterates over its row, the same uncoalesced pattern continues

## Warp





Unfortunately, as each thread iterates over its row, the same uncoalesced pattern continues

## Warp





In this example we transferred 16 memory lines, and used 25% of the data for each line transferred

#### Warp







Let's compare a kernel that stores the sum of each **column** of a matrix in a result vector

#### Warp









### Warp

































#### 

Warp

















## Warp





# Warp 12 13 15 14







A useful tip to keep in mind is that increments to threadIdx.x should map to increments in data in the direction of fastest changing index – in this case the x axis





In this example we transferred 4 memory lines (compared to 16), and used 100% of the data for each line transferred (compared to 25%)

### Warp







## Using Shared Memory to Support Coalesced Memory Access

We will examine a matrix transpose to demonstrate how shared memory can be used to promote coalesced data transfers to and from global memory





Here we have a (2,2) grid, with each block containing (2,2) threads as well as (4,4) input and output matrices

| 0  | 1  | 2  | З  |
|----|----|----|----|
| 4  | 5  | 6  | 7  |
| 8  | 9  | 10 | 11 |
| 12 | 13 | 14 | 15 |



Input

Output







For these slides we will define a warp as 2 threads, and a memory segment as 2 data elements wide

| 0  | 1  | 2  | 3  |
|----|----|----|----|
| 4  | 5  | 6  | 7  |
| 8  | 9  | 10 | 11 |
| 12 | 13 | 14 | 15 |



Output







Our goal is to transpose the input by rotating all elements around the diagonal, writing the transposed elements to output







| Grid |  |  |
|------|--|--|
|      |  |  |
|      |  |  |
|      |  |  |
|      |  |  |

A naïve approach is to launch a grid with threads equal to input elements, and to have each thread read 1 element, then write it to output in the transposed location

| x, $y = cuda.grid(2)$ |
|-----------------------|
| out[x][y] = in[y][x]  |

| 0  | 1  | 2  | 3  |
|----|----|----|----|
| 4  | 5  | 6  | 7  |
| 8  | 9  | 10 | 11 |
| 12 | 13 | 14 | 15 |



Output





| Grid |  |  |  |
|------|--|--|--|
|      |  |  |  |
|      |  |  |  |
|      |  |  |  |
|      |  |  |  |

Observing the behavior of a single warp, is it the case that memory reads are coalesced? Let's dig into answering that question

| x, $y = cuda.grid(2)$ |
|-----------------------|
| out[x][y] = in[y][x]  |

| 0  | 1  | 2  | 3  |
|----|----|----|----|
| 4  | 5  | 6  | 7  |
| 8  | 9  | 10 | 11 |
| 12 | 13 | 14 | 15 |



Output







out[x][y] = in[y][x]

| 0  | 1  | 2  | 3  |
|----|----|----|----|
| 4  | 5  | 6  | 7  |
| 8  | 9  | 10 | 11 |
| 12 | 13 | 14 | 15 |



Output







out[x][y] = in[y][x]



| 0  | 1  | 2  | 3  |
|----|----|----|----|
| 4  | 5  | 6  | 7  |
| 8  | 9  | 10 | 11 |
| 12 | 13 | 14 | 15 |

Output







Therefore, it makes sense that reads from input are coalesced

out[x][y] = in[y][x]



Output





| Grid |  |  |
|------|--|--|
|      |  |  |
|      |  |  |
|      |  |  |
|      |  |  |

What about this warp's writes to output, will they be coalesced?

| 0  | 1  | 2  | 3  |
|----|----|----|----|
| 4  | 5  | 6  | 7  |
| 8  | 9  | 10 | 11 |
| 12 | 13 | 14 | 15 |







Output







8

12

Output





y = blockIdx.y \* blockDim.y + threadIdx.y

```
out[x][y] = in[y][x]
```







We can use shared memory to make coalesced reads and writes. Here, each block will allocate a (2,2) shared memory tile

| tile = | <pre>cuda.shared.array(2,2)</pre> |
|--------|-----------------------------------|
|        |                                   |
|        |                                   |
|        |                                   |

| 0  | 1  | 2  | 3  |
|----|----|----|----|
| 4  | 5  | 6  | 7  |
| 8  | 9  | 10 | 11 |
| 12 | 13 | 14 | 15 |



Output





(It is worth reminding that in our slides, to preserve space, 2 threads is a warp length. A real warp is 32 threads)

| tile = | <pre>cuda.shared.array(2,2)</pre> |
|--------|-----------------------------------|
|        |                                   |
|        |                                   |
|        |                                   |

| 0  | 1  | 2  | 3  |
|----|----|----|----|
| 4  | 5  | 6  | 7  |
| 8  | 9  | 10 | 11 |
| 12 | 13 | 14 | 15 |



Output







Output



Now we can make coalesced reads

from input, and write the values to the block's shared memory tile

tile = cuda.shared.array(2,2)

tile[tIdx.y][tIdx.x] = in[y][x]

x, y = cuda.grid(2)





Because each shared memory tile is local to the block (not the grid) we index into it using thread indices, not grid indices

| <pre>tile = cuda.shared.array(2,2)</pre>   |
|--------------------------------------------|
| x, $y = cuda.grid(2)$                      |
| <pre>tile[tIdx.y][tIdx.x] = in[y][x]</pre> |
|                                            |



Output







After synchronizing on all threads in the block, the tile will contain all the data this block needs to begin the writes

| <pre>tile = cuda.shared.array(2,2) x, y = cuda.grid(2)</pre>  |
|---------------------------------------------------------------|
| <pre>tile[tIdx.y][tIdx.x] = in[y][x] cuda.syncthreads()</pre> |







Output

















Output







Output







Output







Output







Output





Output



Output



Output



Output



Output





Output













Output







14

15

13

Output





Output







| 0  | 1  | 2  | 3  |
|----|----|----|----|
| 4  | 5  | 6  | 7  |
| 8  | 9  | 10 | 11 |
| 12 | 13 | 14 | 15 |

| 0 | 4 | 8  | 12 |
|---|---|----|----|
| 1 | 5 | 9  | 13 |
| 2 | 6 | 10 | 14 |
| 3 | 7 | 11 | 15 |

Output



## Shared Memory Bank Conflicts

Shared memory is physically stored in **banks** 



Logical Shared Memory cuda.shared.array(4,4)



in 4 banks



Warp



Actual shared memory is 32 4-byte wide banks. For space in these slides, we will portray shared memory as having 4 banks (**A**, **B**, **C**, **D**) and a warp as having 4 threads

| 0  | 1  | 2  | 3  |
|----|----|----|----|
| 4  | 5  | 6  | 7  |
| 8  | 9  | 10 | 11 |
| 12 | 13 | 14 | 15 |

Logical Shared Memory cuda.shared.array(4,4)









Successive 4-byte words (1 box in these slides) will belong to different banks



Logical Shared Memory cuda.shared.array(4,4)













Memory accesses in the same bank result in the access operations being serialized. We call this a **bank conflict**.

| 0  | 1  | 2  | 3  |
|----|----|----|----|
| 4  | 5  | 6  | 7  |
| 8  | 9  | 10 | 11 |
| 12 | 13 | 14 | 15 |

Logical Shared Memory cuda.shared.array(4,4)







In this scenario, we have a 2-way bank conflict that would require the memory access to be serialized over 2 cycles.

D

3

7

11

15



Here have a 4-way bank conflict that would require the memory access to be serialized over 4 cycles.



Here is a technique we can use to avoid bank conflicts when we know we need to make columnar access to shared memory







First, when we allocate our shared memory tile, we will pad it with an extra column



Logical Shared Memory cuda.shared.array(4,5)







Next, when we write to the tile, we act as if the tile is (4,4) and only write to addresses in the range [0:4][0:4]



Logical Shared Memory cuda.shared.array(4,5)







DEEP LEARNING INVIDIA.





Warp

Logical Shared Memory cuda.shared.array(4,5)







D



Warp

Logical Shared Memory cuda.shared.array(4,5)



С

В

Α



| So if we consider how the array is laid |
|-----------------------------------------|
| out within the memory banks, we see     |
| the following:                          |



Warp

Logical Shared Memory cuda.shared.array(4,5)





| Now when we access a column of       |
|--------------------------------------|
| shared memory, each element resides  |
| in a different bank and there are no |
| bank conflicts                       |



Warp



Logical Shared Memory cuda.shared.array(4,5)















| Warp                                            | Worth mentioning that to use this<br>technique for this example, the only<br>change we had to make to our code<br>was add one extra column to our<br>shared memory allocation |
|-------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| _                                               | A   B   C   D                                                                                                                                                                 |
|                                                 | 0 1 2 3                                                                                                                                                                       |
| 0 1 2 3                                         |                                                                                                                                                                               |
| 4 5 6 7                                         | 7 8 9                                                                                                                                                                         |
| 8 9 10 11                                       | 10 11 12                                                                                                                                                                      |
| 12 13 14 15                                     | 13 14 15                                                                                                                                                                      |
| Logical Shared Memory<br>cuda.shared.array(4,5) | Physical Shared Memory<br>in 4 banks                                                                                                                                          |







```
o x = bId.y*bDim.y + tId.x
o y = bId.x*bDim.x + tId.y
o[o_y][o_x] = tile[tIdx.x][tIdx.y]
```

| 0  | 1  | 2  | 3  |
|----|----|----|----|
| 4  | 5  | 6  | 7  |
| 8  | 9  | 10 | 11 |
| 12 | 13 | 14 | 15 |

12 9 13 10 2 6 14 15 11

Output

**DVIDIA** 

DEEP LEARNING

O.



www.nvidia.com/dli