

### Simulation using MIC co-processor on Helios

### Serhiy Mochalskyy, Roman Hatzky

PRACE PATC Course: Intel MIC Programming Workshop

High Level Support Team, Max-Planck-Institut für Plasmaphysik, Boltzmannstr. 2, D-85748 Garching, Germany



- MIC general architecture
- MIC network performance on the Robin cluster with one IB port
- MIC network performance on the Helios supercomputer with two IB ports
- Host, offload and native computation mode of the test N-Body code
- Micro OpenMP overhead benchmark



#### MIC general architecture




Helios is a computer system dedicated to large-scale, high-performance simulations in fusion science and engineering research.

| CPU              | Intel Xeon E5 processor, Sandy Bridge EP, 2.7 GHz |
|------------------|---------------------------------------------------|
| Nodes            | 4410                                              |
| Peak performance | 1.52 Pflops (70th in the TOP500, June 2016)       |

| MIC   | Knights Corner |
|-------|----------------|
| Nodes | 180            |



| Processor             | Sandy Bridge | Xeon Phi   |
|-----------------------|--------------|------------|
| Number of cores       | 8            | 60(1)      |
| Memory                | 32 GB        | 8–16 GB    |
| Peak performance      | 173 GFlops/s | 1 TFlops/s |
| Memory bandwidth      | 40 GB/s      | 160 GB/s   |
| Instruction execution | Out-of-order | In-order   |

- ~x4 increase in memory bandwidth
- ~x6 increase in peak performance
- ~x30 decrease in memory per core and ~x1.3 decrease in peak performance per core
- In-order execution requires 2–4 threads per core to fill the pipelines
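The ratios in the list above can be checked directly from the comparison table. The short sketch below does that arithmetic; the dictionary and function names are illustrative, and it assumes the 8 GB Xeon Phi variant (the table lists 8 to 16 GB), which is the value that gives the ~x30 figure.

```python
# Figures copied from the Sandy Bridge / Xeon Phi comparison table.
sandy_bridge = {"cores": 8, "memory_gb": 32, "peak_gflops": 173, "bandwidth_gbs": 40}
# Assumption: the 8 GB card; the table gives a range of 8 to 16 GB.
xeon_phi = {"cores": 60, "memory_gb": 8, "peak_gflops": 1000, "bandwidth_gbs": 160}


def per_core(processor, key):
    """Return a per-core value for the given table entry."""
    return processor[key] / processor["cores"]


# Whole-processor gains of the Xeon Phi over Sandy Bridge.
bandwidth_gain = xeon_phi["bandwidth_gbs"] / sandy_bridge["bandwidth_gbs"]
peak_gain = xeon_phi["peak_gflops"] / sandy_bridge["peak_gflops"]

# Per-core losses of the Xeon Phi relative to Sandy Bridge.
memory_per_core_loss = per_core(sandy_bridge, "memory_gb") / per_core(xeon_phi, "memory_gb")
peak_per_core_loss = per_core(sandy_bridge, "peak_gflops") / per_core(xeon_phi, "peak_gflops")

print(f"memory bandwidth gain:   x{bandwidth_gain:.1f}")
print(f"peak performance gain:   x{peak_gain:.1f}")
print(f"memory per core loss:    x{memory_per_core_loss:.1f}")
print(f"peak per core loss:      x{peak_per_core_loss:.1f}")
```

With the 16 GB card the memory-per-core ratio would be half as large, so the ~x30 value should be read as the worst case of the range.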

### **MIC nodes – general architecture**













#### **MIC network performance on the Robin cluster**





### Intel MPI Benchmark suite: Ping-Pong test

Intra-node



#### PCIe+QPI+PCIe+IB+PCIe+QPI+PCIe


#### Intra-node

| Latency (µs) | host0: CPU1 | host0: MIC0 | host0: MIC1 |
|--------------|-------------|-------------|-------------|
| host0: CPU1  |             |             |             |
| host0: MIC0  | 4.90        | 2.73        |             |
| host0: MIC1  | 4.31        | 7.56        | 3.12        |

#### Inter-node

| Latency (µs) | host1: CPU1 | host1: MIC0 | host1: MIC1 |
|--------------|-------------|-------------|-------------|
| host0: CPU1  | 2.20        |             |             |
| host0: MIC0  | 4.71        | 9.04        |             |
| host0: MIC1  | 4.66        | 7.93        | 6.92        |






#### Intra-node

| Bandwidth (MB/s) | host0: CPU1 | host0: MIC0 | host0: MIC1 |
|------------------|-------------|-------------|-------------|
| host0: CPU1      |             |             |             |
| host0: MIC0      | 456         | 2016        |             |
| host0: MIC1      | 1609        | 416         | 2004        |





#### **MIC network performance on the Helios supercomputer**














#### PCIe+QPI+PCIe+IB+PCIe+QPI+PCIe

#### Intra-node

| Latency (µs) | host0: CPU1 | host0: MIC0 | host0: MIC1 |
|--------------|-------------|-------------|-------------|
| host0: CPU1  | 0.31        |             |             |
| host0: MIC0  | 3.29        | 2.70        |             |
| host0: MIC1  | 3.75        | 6.00        | 2.84        |

#### Inter-node

| Latency (µs) | host1: CPU1 | host1: MIC0 | host1: MIC1 |
|--------------|-------------|-------------|-------------|
| host0: CPU1  | 1.24        |             |             |
| host0: MIC0  | 3.80        | 5.97        |             |
| host0: MIC1  | 4.15        | 6.47        | 6.95        |





#### Intra-node

| Bandwidth (MB/s) | host0: CPU1 | host0: MIC0 | host0: MIC1 |
|------------------|-------------|-------------|-------------|
| host0: MIC1      | 480         | 413         | 1984        |
| host0: MIC0      | 1611        | 1928        |             |
| host0: CPU1      | 5061        |             |             |



`$ export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-mcm-1`



#### Inter-node



#### Inter-node optimized DAPL



Mochalskyy Serhiy

Intel MIC Programming Workshop, June 29th, 2016








#### Inter-node

| Bandwidth (MB/s) | host1: CPU0 | host1: CPU1 |
|------------------|-------------|-------------|
| host0: CPU0      | 4987        | 5029        |
| host0: CPU1      | 5075        | 5058        |












#### Inter-node

| Latency (µs) | host1: CPU0 | host1: CPU1 |
|--------------|-------------|-------------|
| host0: CPU0  | 1.27        | 1.23        |
| host0: CPU1  | 1.28        | 1.25        |


### MIC network performance on the Helios supercomputer – new DAPL provider in dat.conf





#### DAT: Direct Access Transport

#### /etc/dat.conf for mic0

- `ofa-v2-mcm-1 u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_0 1" ""`
- `ofa-v2-mlx4_0-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_0 1" ""`

#### /etc/dat.conf for mic1

- `ofa-v2-mcm-1 u2.0 nonthreadsafe default libdaplomcm.so.2 dapl.2.0 "mlx4_1 1" ""`
- `ofa-v2-mlx4_1-1u u2.0 nonthreadsafe default libdaploucm.so.2 dapl.2.0 "mlx4_1 1" ""`

#### Inter-node new dat.conf

| Bandwidth (MB/s) | host1: CPU1 | host1: MIC0 | host1: MIC1 |
|------------------|-------------|-------------|-------------|
| host0: MIC1      |             | 3345        | 3330        |
| host0: MIC0      |             | 3340        | 3338        |
| host0: CPU1      |             |             |             |





Memory bandwidth (MB/s):

- Host0mic0–Host1mic0: 1069.35, 3383.44, 3393.16
- Host0mic1–Host1mic1: 1685.87, 3346.75, 3355.48
- Mixed:
  - Host0mic0–Host1mic0: 1186.84
  - Host0mic1–Host1mic1: 1632.26

Intel Manycore Platform Software Stack (IMPSS) v 3.6.1

**Open Fabrics Enterprise** Distribution (OFED) v 3.18

~3550 MB/s




## Host, offload and native computation mode test using N-Body code



#### Execution time (s)

| Number of cores | Intel Sandy Bridge | Intel Xeon Phi (offload) | Intel Xeon Phi (native) |
|-----------------|--------------------|--------------------------|-------------------------|
| 1                  | 55                       | 130.61                      | 126.60                     |
| 2                  | 28                       | 66.47                       | 62.69                      |
| 4                  | 14                       | 33.78                       | 30.75                      |
| 8                  | 7                        | 18.78                       | 15.86                      |
| 16                 | 3.5                      | 12.02                       | 9.97                       |
| 32                 |                          | 7.44                        | 4.72                       |
| 64                 |                          | 6.19                        | 3.46                       |
| 128                |                          | 4.09                        | 1.59                       |
| 236                |                          | 3.96                        | 1.39                       |
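To make the scaling in the table easier to compare, the sketch below derives speedup and parallel efficiency from the listed execution times. The timings are copied from the table; the function names and the choice of the single-core run of each column as the baseline are assumptions made for this illustration, not part of the original benchmark.

```python
# Execution times in seconds, keyed by the value in the table's first column.
sandy_bridge = {1: 55, 2: 28, 4: 14, 8: 7, 16: 3.5}
phi_offload = {1: 130.61, 2: 66.47, 4: 33.78, 8: 18.78, 16: 12.02,
               32: 7.44, 64: 6.19, 128: 4.09, 236: 3.96}
phi_native = {1: 126.60, 2: 62.69, 4: 30.75, 8: 15.86, 16: 9.97,
              32: 4.72, 64: 3.46, 128: 1.59, 236: 1.39}


def speedup(times):
    """Speedup relative to the n=1 run of the same column."""
    base = times[1]
    return {n: base / t for n, t in times.items()}


def efficiency(times):
    """Parallel efficiency: speedup divided by n."""
    return {n: s / n for n, s in speedup(times).items()}


for name, times in (("Sandy Bridge", sandy_bridge),
                    ("Xeon Phi (offload)", phi_offload),
                    ("Xeon Phi (native)", phi_native)):
    n_max = max(times)
    print(f"{name}: speedup {speedup(times)[n_max]:.1f} at n={n_max}, "
          f"efficiency {efficiency(times)[n_max]:.2f}")

# Ratio of the fastest host run to the fastest native Xeon Phi run.
best_host = min(sandy_bridge.values())
best_native = min(phi_native.values())
print(f"best host time / best native time: {best_host / best_native:.2f}")
```

Because each column is normalised to its own single-core time, the speedups describe scaling within one mode; the last line is the only cross-platform comparison.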



The MIC network performance measurements on the Robin cluster were made by M. Haefele.

#### Micro OpenMP overhead benchmark



#### "OpenMP reduction" overhead







In a real simulation, the overhead time can be equal to the computational kernel time.
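As a rough illustration of what that statement implies, the sketch below computes the share of a parallel region's wall time that is spent in the kernel itself. The function name and the timings are hypothetical, and the model simply assumes that overhead and kernel time add up without overlapping.

```python
def kernel_fraction(kernel_time, overhead_time):
    """Fraction of the wall time spent in the computational kernel.

    Assumes the wall time is kernel_time + overhead_time (no overlap).
    """
    return kernel_time / (kernel_time + overhead_time)


# Hypothetical timings in milliseconds, not measured values.
equal_case = kernel_fraction(kernel_time=1.0, overhead_time=1.0)
small_overhead_case = kernel_fraction(kernel_time=1.0, overhead_time=0.25)

print(f"overhead equal to kernel time: {equal_case:.0%} of the time is useful work")
print(f"overhead a quarter of kernel:  {small_overhead_case:.0%} of the time is useful work")
```

Under this simple model, an overhead equal to the kernel time means that only half of the wall time is spent on useful computation.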



#### "OpenMP firstprivate" overhead using different array sizes with 240 threads

#### Helios MIC native mode





20 cores on Ivy-Bridge (Hydra)





### Thank you for your attention