

# INTRO TO FPGA FLOWS OVERVIEW

Bill Jenkins

Intel Programmable Solutions Group

#### Agenda

- 9:00 am Welcome
- 9:15 am Introduction to FPGAs
- 9:45 am FPGA Programming models: RTL
- 10:15 am FPGA Programming models: HLS
- 11:00 am Lab 1 HLS Flow
- 11:45 am Lunch
- 12:30 pm FPGA Programming models: OpenCL
- 1:00 pm High Performance Data Flow Concepts

- 1:30 pm Lab 2 OpenCL Flow
- 2:15 pm Introduction to DSP Builder
- 3:00 pm Introduction to Acceleration Stack
- 4:00 pm Lab 3 Acceleration Stack
- 4:30 pm Curriculum & University Program Coordination





# INTRODUCTION



#### **Typical HPC Workloads**



\* Source: https://comp-physics-lincoln.org/2013/01/17/molecular-dynamics-simulations-of-amphiphilic-macromolecules-at-interfaces/



#### Fast Evolution of Technology

We now have the compute to solve these problems in near real-time









#### The Urgency of Parallel Computing

> "If engineers keep building processors the way we do now, CPUs will get even faster but they'll require so much power that they won't be usable."
>
> Patrick Gelsinger, former Intel Chief Technology Officer, February 7, 2001

Source: http://www.cnn.com/2001/tech/ptech/02/07/hot.chips.idg/







## **Challenges Scaling Systems to Higher Performance**



Need to think about Compute Offload as well as Ingress/Egress Processing



Accelerators can increase Performance at lower TCO for targeted workloads

Intel estimates; bubble size is relative CPU intensity



#### **The Intel Vision**

#### Heterogeneous Systems:

 Span from CPU to GPU to FPGA to dedicated devices with consistent programming models, languages, and tools





### **Heterogeneous Computing Systems**

Modern systems contain more than one kind of processor

- Applications exhibit different behaviors:
  - Control intensive (Searching, parsing, etc...)
  - Data intensive (Image processing, data mining, etc...)
  - Compute intensive (Iterative methods, financial modeling, etc...)
- Gain performance by using specialized capabilities of different types of processors



#### Separation of Concerns

Two groups of developers:

- Domain experts concerned with getting a result
  - Host application developers leverage optimized libraries
- Tuning experts concerned with performance
  - Typical FPGA developers that create optimized libraries

Intel<sup>®</sup> Math Kernel Library is a simple example of raising the level of abstraction to the math operations

- Domain experts focus on formulating their problems
- Tuning experts focus on vectorization and parallelization





# **INTRODUCTION TO FPGAS**

## **FPGA Enabled Performance and Agility**



FPGAs enhance CPU-based processing by accelerating algorithms and minimizing bottlenecks

# FPGAs Provide Flexibility to Control the Data path





## **FPGA** Architecture

Field Programmable Gate Array (FPGA)

- Millions of logic elements
- Thousands of embedded memory blocks
- Thousands of DSP blocks
- Programmable interconnect
- High speed transceivers
- Various built-in hardened IP

Used to create Custom Hardware!





#### FPGA Architecture: Flexible Interconnect

Basic Elements are surrounded with a flexible interconnect

#### FPGA Architecture: Flexible Interconnect

Wider <u>custom</u> operations are implemented by configuring and interconnecting Basic Elements
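
As a rough software analogy only (it does not describe how any particular device implements arithmetic), the sketch below composes a wider operation from a repeated basic element: a one-bit full adder is chained eight times through its carry to form an 8-bit adder. The names `BitSum`, `full_add`, and `add8` are hypothetical and chosen purely for illustration.

```cpp
#include <cassert>
#include <cstdint>

// One "basic element": a 1-bit full adder (hypothetical name, illustration only).
struct BitSum {
    bool sum;    // sum bit
    bool carry;  // carry out
};

BitSum full_add(bool a, bool b, bool carry_in) {
    BitSum r;
    r.sum = a ^ b ^ carry_in;
    r.carry = (a && b) || (carry_in && (a ^ b));
    return r;
}

// A wider operation built by chaining eight basic elements through their carries.
// Returns the 8-bit sum; the final carry out is written to *carry_out.
std::uint8_t add8(std::uint8_t a, std::uint8_t b, bool* carry_out) {
    assert(carry_out != nullptr);
    std::uint8_t result = 0;
    bool carry = false;
    for (int bit = 0; bit < 8; ++bit) {
        BitSum s = full_add((a >> bit) & 1, (b >> bit) & 1, carry);
        if (s.sum) result = static_cast<std::uint8_t>(result | (1u << bit));
        carry = s.carry;
    }
    *carry_out = carry;
    return result;
}
```

In software the loop runs one bit at a time; in a custom data-path the eight elements would exist side by side, connected through the interconnect.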

#### FPGA Architecture: Custom Operations Using Basic Elements







#### FPGA Architecture: Floating Point Multiplier/Adder Blocks



#### **DSP Blocks**

Thousands of DSP Blocks in Modern FPGAs

- Configurable to support multiple features
  - Variable precision fixed-point multipliers
  - Adders with accumulation register
  - Internal coefficient register bank
  - Rounding
  - Pre-adder to form tap-delay line for filters
  - Single precision floating point multiplication, addition, accumulation
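
To make the pre-adder and multiply-accumulate features concrete, here is a minimal software sketch of a symmetric 4-tap FIR filter. Because the coefficients mirror each other, the two samples that share a coefficient are added first (the pre-adder step) and multiplied once, which halves the number of multiplications. The function name `fir4_symmetric`, the tap count, and the integer widths are assumptions made for illustration and are not tied to any specific DSP block configuration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Symmetric 4-tap FIR: coefficients are {c0, c1, c1, c0}.
// Samples before the start of the input are treated as zero.
std::vector<std::int32_t> fir4_symmetric(const std::vector<std::int16_t>& x,
                                         std::int16_t c0, std::int16_t c1) {
    std::vector<std::int32_t> y(x.size());
    std::int16_t delay[4] = {0, 0, 0, 0};  // tap-delay line, newest sample first
    for (std::size_t n = 0; n < x.size(); ++n) {
        // Shift the delay line and insert the new sample.
        delay[3] = delay[2];
        delay[2] = delay[1];
        delay[1] = delay[0];
        delay[0] = x[n];
        // Pre-adder: add the two samples that share a coefficient...
        std::int32_t outer = std::int32_t(delay[0]) + delay[3];
        std::int32_t inner = std::int32_t(delay[1]) + delay[2];
        // ...then multiply once per pair and accumulate.
        std::int32_t acc = 0;
        acc += c0 * outer;
        acc += c1 * inner;
        y[n] = acc;
    }
    return y;
}
```

Feeding a unit impulse through the filter returns the coefficient sequence itself, which is a quick way to check the tap ordering.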



#### FPGA Architecture: Configurable Routing

Blocks are connected into a custom data-path that matches your application.

#### FPGA Architecture: Configurable IO

The **Custom data-path** can be connected directly to **custom or standard IO interfaces** for inline data processing

#### FPGA I/Os and Interfaces

FPGAs have flexible IO features to support many IO and interface standards

- Hardened Memory Controllers
  - Available interfaces to off-chip memory such as HBM, HMC, DDR SDRAM, QDR SRAM, etc.
- High-Speed Transceivers
- PCIe\* Hard IP
- Phase Lock Loops



## Intel<sup>®</sup> FPGA Product Portfolio

Wide range of FPGA products for a wide range of applications

- MAX<sup>®</sup>: Non-volatile, low-cost, single chip small form
- Cyclone (FPGA • SoC): Low-power, cost-sensitive performance
- Arria (FPGA • SoC): Midrange, cost, power, performance balance
- Stratix<sup>®</sup>: High-performance, state-of-the-art

#### Product features differ across families

Logic density, embedded memory, DSP blocks, transceiver speeds, IP features, process technology, etc.











#### Sequential Architecture vs. Dataflow Architecture

[Figure: Sequential CPU Architecture vs. FPGA Dataflow Architecture; axes: Resources and Time; operations shown include load and store]


#### Custom Data-Path on the FPGA Matches Your Algorithm!




#### Advantages of Custom Hardware with FPGAs

- Custom hardware!
- Efficient processing
- Fine-grained parallelism
- Low power
- Flexible silicon
- Ability to reconfigure
- Fast time-to-market
- Many available I/O standards







# FPGA PROGRAMMING MODEL

#### FPGA Development and Programming Tools



Verilog, VHDL and the Intel® FPGA SDK for OpenCL are currently supported by the Acceleration Stack. High Level Synthesis can be used manually by following app note

#### Traditional FPGA Design Entry

Circuits described using Hardware Description Languages (HDL) such as VHDL or Verilog

A designer must describe the behavior of the algorithm to create a low-level digital circuit

Logic, Registers, Memories, State Machines, etc.

Design times range from several months to even years!









### Intel<sup>®</sup> Quartus<sup>®</sup> Prime Design Software





### Intel<sup>®</sup> Quartus<sup>®</sup> Prime Design Software Projects

Description

- Collection of related design files & libraries
- Must have a designated top-level entity
- Target a single device
- Store settings in the software settings file (.qsf)
- Compiled netlist information stored in **qdb** folder in project directory

Create new projects with New Project Wizard

Can be created using Tcl scripts



### Intel<sup>®</sup> FPGA Design Store

Download complete example design templates for specific development kits

Design examples include design files, device programming files, and software code as required

Install .par files and select as template in New Project Wizard

[Screenshot: FPGA Design Store search page with filters Family: Any, Category: Any, Quartus II Version: 16.0, IP Core: Any]

| Name | Category | Development Kit | Family | Quartus II Version | Vendor | Downloads |
|------|----------|-----------------|--------|--------------------|--------|-----------|
| JPEG Decoder Design Example (OpenCL) | Design Example \ Outside Design Store | Non kit specific Stratix V Design Examples | Stratix V | 16.0.0 | Altera | 0 0 |
| 100Gbps Ethernet PHY only Testbench | Design Example \ Outside Design Store | Non kit specific Stratix V Design Examples | Stratix V | 16.0.2 | Altera | 0 0 |
| Accelerated FIR with Built-In Direct Memory Access Example | Design Example | Cyclone V E FPGA Development Kit | Cyclone V | 16.0.0 | Altera | 81 0 |
| Adapting Digilent PmodCLP LCD to DE10 Lite Development Kit Arduino Shield Header | Design Example | MAX 10 DE10 - Lite | MAX 10 | 16.0.0 | Altera | 49 0 |

https://cloud.altera.com/devstore/platform/




### Chip Planner

#### Graphical view of

- Layout of device resources
- Routing channels between device resources
- Global clock regions

Uses

- View placement of design logic
- View connectivity between resources used in design
- Make placement assignments
- Debugging placement-related issues



### Chip Planner





Tasks window




#### Pin Planner

#### Interactive graphical tool for assigning pins

- Drag & drop pin assignments
- Set pin I/O standards
- Reserve future I/O locations
- Default window panes
  - Package View
  - All Pins list
  - Groups list
  - Tasks window
  - Report window

Assignments menu → Pin Planner, toolbar, or Tasks window





#### Pin Planner Window

[Screenshot: Pin Planner window for the pipemult design (Arria 10, 10AX115S2F45I1SG), with callouts for the Toolbar, Package View, Groups list, Tasks pane, All Pins list, and Report window]



#### The Programmer



#### Tools menu → Programmer





#### **State Machine Editor**

#### Create state machines in GUI

- Manually by adding individual states, transitions, and output actions
- Automatically with State Machine Wizard (Tools menu & toolbar)

Generate state machine HDL code (required)

- VHDL
- Verilog
- SystemVerilog

#### File menu → New or Tasks window, then select State Machine File (.smf)





#### **Platform Designer**

Components in system use different interfaces to communicate (some standard, some non-standard)

Typical system requires significant engineering work to design custom interface logic

Integrating design blocks and intellectual property (IP) is tedious and error-prone





#### **Automatic Interconnect Generation**

- Avoids error-prone integration
- Saves development time with automatic logic & HDL generation
- Enables you to focus on value-add blocks
- Platform Designer improves productivity by automatically generating the system interconnect logic





#### The Platform Designer GUI



#### Access in Tools menu, toolbar, or Tasks window









## **FPGA PROGRAMMING MODEL: HIGH LEVEL SYNTHESIS**

#### Can Also Be Wrapped With Higher Level Flows



### The Software Programmer's View









Programmers develop in mature software environments

- Ideas can easily be expressed in languages such as 'C'
  - Typically start with simple sequential program
  - Use parallel APIs / language extensions to exploit multi core for additional performance
- Compilation times are almost instantaneous
- Immediate feedback
- Rich debugging tools



#### High Level Design is the Bridge Between HW & SW

100x More Software Engineers than Hardware Engineers

Key to wide-spread adoption of FPGA in Datacenter

Debugging software is much faster than hardware

Many functions are easier to specify in software than RTL

Simulation of RTL takes thousands of times longer than software

Design Exploration is much easier and faster in software

We Need to Raise the Level of Abstraction

- Similar to what assembly programmers did with C over 30 years ago
  - (Today) Abstract away FPGA Design with Higher Level Languages
  - (Today) Abstract away FPGA Hardware behind Platforms
  - (Tomorrow) Leverage Pre-Compiled Libraries as Software Services


















Goal: Same performance as hand-coded RTL with 10-15% more resources



#### **HLS Procedure**

Create Component and Testbench in C/C++







`a` is the default output name; the `-o` option can be used to specify a non-default output name


### Simple Example Program: i++ and g++ flow

#### Example Program

```
// test.cpp
#include <stdio.h>

int main() {
  printf("Hello world\n");
  return 0;
}
```

#### Terminal Commands and Outputs

```
$ g++ test.cpp
$ ./a.out
Hello world

$ i++ test.cpp
$ ./a.out
Hello world
$
```

Using the default -march=x86-64



### g++ Compatibility

Intel HLS Compiler is command line compatible with g++

- Similar command-line flags, x86 behavior, and compilation flow
- Changing "g++" to "i++" should just work
  - g++ <flags> <src>
  - i++ <flags> <src>
- x86 behavior should match g++
  - Except for integer promotion (discussed later)
- No source modifications required (for x86 mode)
- Support for GNU Makefiles



### i++ Options : g++ Compatible Options

| Option                 | Description                                                                |
|------------------------|----------------------------------------------------------------------------|
| `-h`                   | Display help information                                                   |
| `-o <name>`            | Specify a non-default output name                                          |
| `-c`                   | Instructs the compiler to generate the object files and not the executable |
| `-march=<arch>`        | Compile for architecture x86-64 (default) or `<FPGA family>`               |
| `-v`                   | Verbose mode                                                               |
| `-g`                   | Generate debug information (default)                                       |
| `-g0`                  | Do not generate debug information                                          |
| `-I <dir>`             | Add to include path                                                        |
| `-D <macro>[=<val>]`   | Define `<macro>` with `<val>` or 1                                         |
| `-L <dir> -l<library>` | Library search directory and library name when linking                     |

Example: `i++ -march=x86-64 myfile.cpp -o myexe`



# i++ Options: FPGA Related Options

| Option                     | Description                                                                                        |
|----------------------------|----------------------------------------------------------------------------------------------------|
| `--component <components>` | Specify a comma-separated list of function names to be synthesized to RTL                          |
| `--clock <clock_spec>`     | Optimizes the RTL for the specified clock frequency or period                                      |
| `-ghdl`                    | Enable full debug visibility and logging of all signals when the verification executable is run    |
| `--quartus-compile`        | Compiles the resulting HDL files using the Intel<sup>®</sup> Quartus<sup>®</sup> Prime software    |
| `--simulator <simulator>`  | Specify the simulator used for verification; "none" skips testbench generation                     |
| `--x86-only`               | Only create the executable for the testbench; no RTL or cosim support                              |
| `--fpga-only`              | Create the FPGA component project with RTL and cosim support; no testbench binary                  |

Example: `i++ -march=<fpga fam> --component mycomp --clock 400MHz myfile.cpp`

Many other optimization options are available; see the Intel HLS Compiler Reference Manual.













### Using printf()

Requires "HLS/stdio.h"

Maps to <stdio.h> when appropriate

Can be included in the testbench or the component

Used with no limitations in the x86 emulation flow

printf statements inside the component ignored for HDL generation

Ignored in the cosimulation flow with an HDL simulator



# Using printf(): Example

#### Example Program

```
// test.cpp
#include "HLS/stdio.h"

void say_hello() {
  printf("Hello from the component\n");
}

int main() {
  printf("Hello from the testbench\n");
  say_hello();
  return 0;
}
```
#### Terminal Commands and output

```
$ i++ test.cpp
$ ./a.out
Hello from the testbench
Hello from the component
$
```

```
$ i++ test.cpp -march=Arria10 \
        --component say_hello
$ ./a.out
Hello from the testbench
$
```



# Debugging Using gdb

i++ integrates well with GNU gdb

- Debug data is generated by default
  - Unlike g++, -g is enabled by default; use -g0 to turn off debug data

-march=x86-64 flow:

Can step through any part of the code (including the component)

-march=<fpga family> flow:

- Can step through testbench code
- gdb does not see the component side execution (that runs in an HDL simulator)



#### gdb Example

#### Example Program

```
// test.cpp
#include "HLS/hls.h"
#include "HLS/stdio.h"

component void say_hello() {
  printf("Hello from the component\n");
}

int main() {
  printf("Hello from the testbench\n");
  say_hello();
  return 0;
}
```

#### Terminal Commands and output

```
$ i++ test.cpp -march=x86-64 -o test-x86
$ gdb ./test-x86
<GDB Command Prompt> (gdb)
```

```
$ i++ test.cpp -march=Arria10 -o test-fpga
$ gdb ./test-fpga
<GDB Command Prompt> (gdb)
```



# Debugging with Valgrind

"Valgrind is an instrumentation framework for building dynamic analysis tools."

- Valgrind tools can detect:
  - Memory leaks
  - Invalid pointer uses
  - Use of uninitialized values
  - Mismatched use of malloc/new vs free/delete
  - Doubly freed memory
- Use to debug component and testbench in the x86 emulation flow





# Simple Valgrind Example

#### Example Program:

```
// test.cpp
 1  #include "HLS/stdio.h"
 2  #include <stdlib.h>
 3
 4  int bin_count(int *bins, int a) {
 5    return ++bins[a];
 6  }
 7
 8  int main() {
 9    int *bins = (int *) malloc(16 * sizeof(int));
10    srand(0);
11    for (int i = 0; i < 256; i++) {
12      int x = rand();
13      int res = bin_count(bins, x);
14      printf("Count val: %d\n", res);
15    }
16    return 0;
17  }
```

#### Terminal Commands and output:

```
$ i++ test.cpp
$ ./a.out
Segmentation Fault
$ valgrind --leak-check=full --show-reachable=yes ./a.out
==9744== Invalid read of size 4
==9744==    at 0x4006B3: bin_count(int*, int) (test.cpp:5)
==9744==    by 0x400723: main (test.cpp:13)
==9744==  Address 0x1b31075dc is not stack'd, malloc'd or (recently) free'd
==9744== Process terminating with default action of signal 11 (SIGSEGV)
==9744==  Access not within mapped region at address 0x1B31075DC
==9744==    at 0x4006B3: bin_count(int*, int) (test.cpp:5)
==9744==    by 0x400723: main (test.cpp:13)
==9744==
==9744== 64 bytes in 1 blocks are still reachable in loss record 1 of 1
==9744==    at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==9744==    by 0x4006ED: main (test.cpp:9)
==9744==
Segmentation fault
```
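
The Valgrind report points at the indexing in `bin_count` (test.cpp:5) and at the 64-byte block that is never freed (test.cpp:9). One possible repair is sketched below, assuming the intent was a 16-bin histogram of random values. The `% NUM_BINS` reduction, the switch to `calloc`, the `free` call, and the helper name `run_histogram` are assumptions added for illustration and are not part of the original slide. Standard headers are used here so the sketch builds with any C++ compiler.

```cpp
#include <cassert>
#include <cstdlib>

const int NUM_BINS = 16;  // matches the 16-element allocation in the example

// Increment the selected bin and return its new count.
int bin_count(int* bins, int a) {
    assert(a >= 0 && a < NUM_BINS);  // catch out-of-range indices early
    return ++bins[a];
}

// Fill the histogram with `samples` random values; returns the total counted,
// or -1 if the allocation fails.
int run_histogram(int samples) {
    // calloc zero-initializes, so counts do not start from uninitialized memory.
    int* bins = static_cast<int*>(std::calloc(NUM_BINS, sizeof(int)));
    if (bins == nullptr) return -1;
    std::srand(0);
    for (int i = 0; i < samples; ++i) {
        int x = std::rand() % NUM_BINS;  // keep the index inside the array
        bin_count(bins, x);
    }
    int total = 0;
    for (int b = 0; b < NUM_BINS; ++b) total += bins[b];
    std::free(bins);  // release the block Valgrind reported as still reachable
    return total;
}
```

Calling `run_histogram(256)` from `main()` mirrors the loop in the slide; every sample lands in a valid bin, so the total equals the number of samples.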











# Example Component/Testbench Source

```
#include "HLS/hls.h"
#include "assert.h"
#include "HLS/stdio.h"
#include "stdlib.h"

component int accelerate(int a, int b) {
  return a + b;
}

int main() {
  srand(0);
  for (int i = 0; i < 10; ++i) {
    int x = rand() % 10;
    int y = rand() % 10;
    int z = accelerate(x, y);
    printf("%d + %d = %d\n", x, y, z);
    assert(z == x + y);
  }
  return 0;
}
```

Compile with: `i++ -march=<fpga family> --component accelerate mysource.cpp`

- accelerate() becomes an FPGA component
  - Use the --component i++ argument or the component attribute in source
- main() becomes the testbench for component accelerate()



# Translation from C function API to HDL module

All component functions are synthesized to HDL

Each synthesized component is an independent HDL module

Component functions can be declared:

- Using component keyword in source
- Specifying "--component <component\_name>" in the command-line



## Cosimulation

Combines x86 testbench with RTL simulation

HDL code for the component runs in an RTL Simulator

- Verilog
- RTL testbench automatically created from software

main() and everything else called from main() run on x86 as the testbench

Communication using SystemVerilog Direct Programming Interface (DPI)

- Allows C/C++ to interface with SystemVerilog
- Inter-process communication (IPC) library passes testbench input data to the RTL simulator and returns the results to the x86 testbench



# Cosimulation: Verifying HLS IP

The Intel<sup>®</sup> HLS compiler automatically compiles and links C++ testbench with an instance of the component running in an RTL simulator

- To verify RTL behavior of IP, just run the executable generated by the HLS compiler targeting the FPGA architecture
  - Any call to the component function becomes a call to the simulator through DPI









#### Streaming Simulation Behavior

Use enqueue function calls to stream data into the component
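
A minimal sketch of an enqueue-style testbench follows, reusing the `accelerate` component from the earlier example. It assumes the `ihc_hls_enqueue(&result, &component, args...)` and `ihc_hls_component_run_all(component)` calls from `HLS/hls.h`; check the exact signatures against the Intel HLS Compiler Reference Manual before relying on them. The fallback definitions in the `#ifndef` branch are stand-ins written for this sketch (not Intel code) so that it also builds with a plain C++ compiler, where each enqueue simply runs immediately. The helper name `run_streaming_testbench` is hypothetical.

```cpp
#include <cassert>

#if defined(__has_include)
#  if __has_include("HLS/hls.h")
#    include "HLS/hls.h"
#    define HAVE_HLS_HEADER 1
#  endif
#endif

#ifndef HAVE_HLS_HEADER
// Fallback stand-ins (assumptions, NOT the Intel implementation):
// without the HLS header, each enqueue just calls the function right away.
#define component
template <typename Ret, typename Fn, typename... Args>
void ihc_hls_enqueue(Ret* ret, Fn* fn, Args... args) { *ret = fn(args...); }
template <typename Fn>
void ihc_hls_component_run_all(Fn*) {}
#endif

component int accelerate(int a, int b) { return a + b; }

// Queue `n` invocations, then run them all back to back.
// Returns how many results matched the expected sum.
int run_streaming_testbench(int n) {
    assert(n >= 0 && n <= 16);
    int results[16] = {0};
    for (int i = 0; i < n; ++i) {
        // Each call queues one invocation; results[i] receives its return value.
        ihc_hls_enqueue(&results[i], &accelerate, i, 2 * i);
    }
    // Push every queued invocation into the component.
    ihc_hls_component_run_all(accelerate);
    int correct = 0;
    for (int i = 0; i < n; ++i) {
        if (results[i] == 3 * i) ++correct;
    }
    return correct;
}
```

Queuing the invocations first and running them together lets the simulator feed the component back to back, which is what exposes its streaming (pipelined) behavior in the waveform.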




### Viewing Component Waveforms

- Compile design with i++ -ghdl flag
  - Enable full visibility and logging of all HDL signals in simulation
- After cosimulation execution, waveform available at a.prj/verification/vsim.wlf
- Examine with the ModelSim GUI:
  - vsim a.prj/verification/vsim.wlf

# Viewing Waveforms in ModelSim

[Screenshot: ModelSim GUI displaying the component waveforms]
I 🚯 🛛 🕹 🕹                                                                            | 1                                                                                                                                                                                                                                                          | 3•••€• 3•                                                     | Search:   | <b>v</b> <i>(</i> ) (0, ( | » <u> </u> |
| 🖉 vsim - Default 😑 💷 🖃 🗷                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          | 😒 Objects 
🕬 🛨 🗗 🗙                                                                              | Wave - Default                                                                                                                                                                                                                                             |                                                               |           |                           |            |
| ▼ Instance                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        | ▼N; 1 € ● 
72883 ps → ſ ▶                                                                       |                                                                                                                                                                                                                                                            | Msqs                                                          |           |                           |            |
| tb clock_reset_inst component_dpi_controlle concatenate_component Locate Iler_inst Component_ent_dpi ient_dpi image of the set | <ul> <li>↓ busy</li> <li>↓ clock</li> <li>↓ resetn</li> <li>↓ done</li> <li>↓ stall</li> </ul> | <ul> <li>/tb/mymult_inst/clives/tb/mymult_inst/clives/tb/mymult_inst/st</li> <li>/tb/mymult_inst/st</li> <li>/tb/mymult_inst/st</li> <li>/tb/mymult_inst/st</li> <li>/tb/mymult_inst/st</li> <li>/tb/mymult_inst/st</li> <li>/tb/mymult_inst/st</li> </ul> | setn St1<br>art St0<br>isy St0<br>0<br>0<br>turn 0<br>one St0 |           |                           |            |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | to Wavef  
                                                                                     | orm                                                                                                                                                                                                                                                        |                                                               |           |                           |            |







#### Main HTML Report

The Intel<sup>®</sup> HLS Compiler automatically generates an HTML report that analyzes various aspects of your function, including area, loop structure, memory usage, and system data flow

Located at a.prj/reports/report.html





## **HTML Report: Summary**

**Overall compile statistics**

- FPGA Resource Utilization
- Compile Warnings
- Quartus<sup>®</sup> fitter results
  - Available after Quartus compilation
- etc.

| Project Name                                 | ./fpga/add_ex                                                  |     |      |      |
|----------------------------------------------|----------------------------------------------------------------|-----|------|------|
| Target Family, Device                        | Arria 10, 10AX115U1F45I1SG                                     |     |      |      |
| i++ Version                                  | 17.1.0 Build 240                                               |     |      |      |
| Quartus Version                              | 17.1.0 Build 240                                               |     |      |      |
| Command                                      | i++ -march=Arria10 --component add add_ex.cpp -o ./fpga/add_e… |     |      |      |
| Reports Generated At                         | Tue Oct 31 10:18:13 2017                                       |     |      |      |
| **Quartus Fit Clock Summary**                |                                                                |     |      |      |
|                                              | 1x clock fmax                                                  |     |      |      |
| Frequency (MHz)                              | 612.75                                                         |     |      |      |
| **Quartus Fit Resource Utilization Summary** |                                                                |     |      |      |
|                                              | ALMs                                                           | FFs | RAMs | DSPs |
| add                                          | 18                                                             | 3   | 0    | 0    |
| **Estimated Resource Usage**                 |                                                                |     |      |      |
|                                              | ALUTs                                                          | FFs | RAMs | DSPs |



#### **HTML Report: Loops**

Serial loop execution hinders function dataflow circuit performance

- Use Loop Analysis report to see if and how each loop is optimized
  - Helps identify component pipeline bottlenecks





# Loop Unrolling

Loop unrolling: Replicate hardware to execute multiple loop iterations at once

- Simple loops unrolled by the compiler automatically
- User may use #pragma unroll to control loop unrolling
- Loop must not have dependency from iteration to iteration
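
The replication idea can be pictured in plain C++. The sketch below is not HLS code: the function names are hypothetical, and the hand-unrolled loop only imitates what `#pragma unroll` asks the compiler to do in hardware. It uses an element-wise add because no iteration depends on another.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rolled form: one addition per loop iteration. In an HLS component,
// placing "#pragma unroll" above this loop requests replicated adders.
std::vector<int> add_rolled(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    std::vector<int> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] + b[i];
    }
    return c;
}

// Hand-unrolled by 4: each pass performs four independent additions,
// the software picture of four adders working side by side.
std::vector<int> add_unrolled_by_4(const std::vector<int>& a, const std::vector<int>& b) {
    assert(a.size() == b.size());
    assert(a.size() % 4 == 0);  // sketch only: no remainder loop
    std::vector<int> c(a.size());
    for (std::size_t i = 0; i < a.size(); i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
    return c;
}
```

Both functions return the same result; the unrolled version simply exposes four operations per pass that hardware could execute at once.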





# Loop Pipelining

Loop pipelining: Launch loop iterations as soon as dependency is resolved

- Initiation interval (II): number of cycles between the launch of successive loop iterations
  - II=1 is optimally pipelined
    - No dependency, or dependencies can be resolved in 1 cycle
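
A rough way to see why II matters is to count cycles. The sketch below assumes an idealized pipeline with no stalls, where every iteration takes the same `latency` and a new iteration launches every `ii` cycles; the function names and the model itself are illustrative assumptions, not compiler output.

```cpp
#include <cassert>
#include <cstdint>

// Idealized pipelined loop: the first iteration finishes after `latency`
// cycles, and each later iteration finishes `ii` cycles after the previous one.
std::uint64_t pipelined_cycles(std::uint64_t iterations, std::uint64_t latency, std::uint64_t ii) {
    assert(ii >= 1 && latency >= 1);
    if (iterations == 0) {
        return 0;
    }
    return latency + ii * (iterations - 1);
}

// Serial execution: an iteration starts only after the previous one completes.
std::uint64_t serial_cycles(std::uint64_t iterations, std::uint64_t latency) {
    return iterations * latency;
}
```

Under this model, 100 iterations with a latency of 4 cycles take 103 cycles at II=1 versus 400 cycles when run serially, and an II equal to the latency gives no benefit over serial execution.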



Pipelined Execution of





#### HTML Report: Area Analysis

View detailed estimated resource consumption by system or source line

- Analyze data control overhead
- View memory implementation
- Shows resource usage
  - ALUTs
  - FFs
  - RAMs
  - DSPs
- Identifies inefficient uses

|                                 | ALUTs | FFs  | RAMs | DSPs | Details      |
|---------------------------------|-------|------|------|------|--------------|
| Variable: 'j' (example.cpp:11)  | 25    | 133  | 0    | 0    | • Implemente |
| example.cpp:12 (a_buf)          | 33    | 1152 | 16   | 0    | • Memory sys |
| example.cpp:13 (b_buf)          |       | 9    | 64   | 0    | • Memory sys |
| No Source Line                  | \$53  | 1168 | 0    | 0    |              |
| example.cpp:14                  | 37    | 51   | 1    | 0    |              |
| example.cpp:15                  | 94    | 111  | 0    | 0    |              |
| state                           | 60    | 87   | 0    | 0    |              |
| store                           | 34    | 24   | 0    | 0    |              |
| example.cpp:16                  | 94    | 111  | 0    | 0    |              |
| example.cpp:22                  | 1038  | 784  | 0    | 0    |              |
| example.cpp:20                  | 14    | 28   |      | 0    |              |
| example.cpp:25                  | 34    | 111  | 0    | 0    |              |

| Line | example.cpp                                                                 |
|------|-----------------------------------------------------------------------------|
| 1    | `#include "HLS/hls.h"`                                                      |
| 2    | `#include "stdio.h"`                                                        |
| 3    | `#include "stdlib.h"`                                                       |
| 4    |                                                                             |
| 5    | `typedef altera::stream_in<int> my_operand;`                                |
| 6    | `typedef altera::stream_out<int> my_result;`                                |
| 7    |                                                                             |
| 8    | `component void vec_add_kernel(my_operand &a, my_operand &b, my_result &c)` |
| 9    | `{`                                                                         |
| 10   | `int i;`                                                                    |
| 11   | `int j;`                                                                    |
| 12   | `int a_buf[32][32];`                                                        |
| 13   | `int b_buf[32][32];`                                                        |
| 14   |                                                                             |
| 15   | `a_buf[i / 32][i % 32] = a.read();`                                         |
| 16   | `b_buf[i / 32][i % 32] = b.read();`                                         |
| 17   | `}`                                                                         |
| 18   |                                                                             |
| 19   | `for (j = 0; j < 1024 * 32; j++) {`                                         |
| 20   | `#pragma unroll`                                                            |
| 21   | `for (i = 0; i < 32; i++) {`                                                |
| 22   | `b_buf[j % 32][i] += a_buf[i][j % 32];`                                     |
| 23   | `}`                                                                         |
| 24   | `}`                                                                         |
| 25   | `for (i = 0; i < 32 * 32; i++) {`                                           |
| 26   | `c.write(b_buf[i / 32][i % 32]);`                                           |
| 27   | `}`                                                                         |
| 28   | `}`                                                                         |
| 29   |                                                                             |
| 30   | `int main() {`                                                              |
| 31   | `my_operand a, b;`                                                          |
| 32   | `my_result c, d;`                                                           |
| 33   |                                                                             |
| 34   | `unsigned long long start = altera_hls_get_sim_time();`                     |



# HTML Report: Component Viewer

Displays abstracted netlist of the HW implementation

- View data flow pipeline
  - See loads and stores
  - Interfaces including stream reads and writes
  - Memory structure
  - Loop structure
  - Possible performance bottlenecks
    - Unpipelined loops are colored light red
    - Stallable points are red



Mouse over node to see tooltip and details.



## **HTML Report: Memory Viewer**

Displays local memory implementation and accesses

- Visualize memory architecture
  - Banks, widths, replication, etc
- Visualize load-store units (LSUs)
  - Stall-free?
  - Arbitration
  - Red indicates stallable



#### **HTML Report: Verification Statistics**

Reports execution statistics from the testbench run; available after the component has been simulated (that is, after the testbench executable has been run)

- Number and type of component invocations
- Latency of component
- Dynamic initiation interval (II) of component
- Data rates of streams

Measurements are based on the latest execution of the testbench

| Verification Statistics                           |             |                          |                     |                   |
|---------------------------------------------------|-------------|--------------------------|---------------------|-------------------|
|                                                   | Invocations | Latency<br>(min,max,avg) | II<br>(min,max,avg) | Details           |
| dut (Unknown location)                            | 101         | 4,4,4                    | 1,1,1               | Click for details |
| Explicit component invocations (Unknown location) | 1           | 4,4,4                    | n/a,n/a,n/a         |                   |
| Enqueued component invocations (Unknown location) | 100         | 4,4,4                    | 1,1,1               |                   |
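
To make the min/max/avg columns concrete, the sketch below derives them from per-invocation start and done cycle counts. It is an illustrative assumption about how such figures can be computed, not the report's actual implementation; the struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

struct MinMaxAvg {
    long min;
    long max;
    double avg;
};

// Minimum, maximum, and arithmetic mean of a non-empty list of cycle counts.
MinMaxAvg summarize(const std::vector<long>& v) {
    assert(!v.empty());
    const auto mm = std::minmax_element(v.begin(), v.end());
    const double avg = std::accumulate(v.begin(), v.end(), 0.0) / static_cast<double>(v.size());
    return {*mm.first, *mm.second, avg};
}

// Latency of an invocation: cycles from its start to its done.
MinMaxAvg latency_stats(const std::vector<long>& start, const std::vector<long>& done) {
    assert(start.size() == done.size());
    std::vector<long> latency(start.size());
    for (std::size_t i = 0; i < start.size(); ++i) {
        latency[i] = done[i] - start[i];
    }
    return summarize(latency);
}

// Dynamic II: cycles between the starts of consecutive invocations.
MinMaxAvg ii_stats(const std::vector<long>& start) {
    assert(start.size() >= 2);
    std::vector<long> gaps;
    for (std::size_t i = 1; i < start.size(); ++i) {
        gaps.push_back(start[i] - start[i - 1]);
    }
    return summarize(gaps);
}
```

A single explicit invocation has no neighbour to measure against, which is consistent with an II of n/a for that row, while back-to-back enqueued invocations can show II values as low as 1.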





# Quartus<sup>®</sup> Generated QoR Metrics for IP

Use Intel<sup>®</sup> Quartus<sup>®</sup> Prime software to generate quality-of-result reports

- i++ creates the Quartus project in a.prj/quartus
- To generate QoR data (final resource utilization, fmax)
  - Run quartus\_sh --flow compile quartus\_compile
  - Or use the i++ --quartus-compile option
- Results are part of the HTML report
  - a.prj/reports/report.html
  - Summary page

| Quartus Fit Clock Summary                    |               |     |      |      |
|----------------------------------------------|---------------|-----|------|------|
|                                              | 1x clock fmax |     |      |      |
| Frequency (MHz)                              | 612.75        |     |      |      |
| **Quartus Fit Resource Utilization Summary** |               |     |      |      |
|                                              | ALMs          | FFs | RAMs | DSPs |
| mycomp                                       | 18            | 3   | 0    | 0    |
| **Estimated Resource Usage**                 |               |     |      |      |
| Component Name                               | ALUTs         | FFs | RAMs | DSPs |
| mycomp                                       | 38            | 2   | 0    | 0    |



# Intel® Quartus® Software Integration

- a.prj/components directory contains all the files needed for integration
- One subdirectory for each component
  - Portable, can be moved to a different location if desired
- Two use scenarios
  1. Instantiate in HDL
  2. Add IP to a Platform Designer system



#### **HDL** Instantiation

Add components to the Intel<sup>®</sup> Quartus<sup>®</sup> project

- <component>.qsys for Standard Edition
- <component>.ip for Pro Edition

Instantiate component module in your design

Use the template a.prj/components/<component>/<component>\_inst.v

```
add add_inst (
  // Interface: clock (clock end)
  .clock      ( ), // 1-bit clk input
  // Interface: reset (reset end)
  .resetn     ( ), // 1-bit reset_n input
  // Interface: call (conduit sink)
  .start      ( ), // 1-bit valid input
  .busy       ( ), // 1-bit stall output
  // Interface: return (conduit source)
  .done       ( ), // 1-bit valid output
  .stall      ( ), // 1-bit stall input
  // Interface: returndata (conduit source)
  .returndata ( ), // 32-bit data output
  // Interface: a (conduit sink)
  .a          ( ), // 32-bit data input
  // Interface: b (conduit sink)
  .b          ( )  // 32-bit data input
);
```



# **Platform Designer System Integration Tool**



**Catalog of available IP**

- Interface protocols
- Memory
- DSP
- Embedded
- Bridges
- Custom Components
- Custom Systems

Accelerate development





Simplify integration

Automate integration tasks




## Platform Designer Integration

Platform Designer component generated for each component:

- For Platform Designer (Standard Edition): a.prj/components/<component>/<component>.qsys
- For Platform Designer (Pro Edition): a.prj/components/<component>/<component>.ip

In Platform Designer, instantiate component from the IP Catalog in the HLS project directory

- Add IP directory to IP Catalog Search Locations
  - May use a.prj/components/\*\*/
- Can be stitched with other user IP or Intel<sup>®</sup> Quartus<sup>®</sup> IP with compatible interfaces

See tutorials under tutorials/usability



# Platform Designer HLS Component Example

#### Example

Cascaded low-pass filter and high-pass filter

[Figure: Platform Designer System Contents view of system top, with a clock bridge, a reset bridge, and two HLS components (hpf\_internal and lpf\_internal) connected through conduit interfaces]



### HLS-Backed Components

- Generic component can be used in place of actual IP core
- Choose Implementation Type: HLS
- Specify HLS source files
- Compile Component
- Run Cosim
- Display HTML report

[Figure: Component Instantiation dialog for generic\_component\_0, with Implementation Type set to HLS (other choices: IP, HDL, Blackbox)]

| Field                   | Value                                                                    |
|-------------------------|--------------------------------------------------------------------------|
| HLS files               | mult.cpp                                                                 |
| HDL entity name         | mymult                                                                   |
| HDL compilation library | mymult                                                                   |
| IP file                 | /home/student/fpga\_trn/hls\_i/mult/mult.prj/components/mymult/mymult.ip |







# **FPGA PROGRAMMING MODEL: OpenCL**





## OpenCL

Hardware-agnostic compute language

- Invented by Apple
- Specification donated to the Khronos Group in 2008
- Now maintained by the Khronos Group

OpenCL C and C++

What does OpenCL<sup>™</sup> give us?

- Industry standard programming model
- Functional portability across platforms
- Well thought out specification







### Heterogeneous Platform Model





### OpenCL Use Model: Abstracting the FPGA away




## **OpenCL Host Program**

Pure software written in standard C/C++ languages

Communicates with the accelerator devices via an API which abstracts the communication between the host processor and the kernels

main()

{

read\_data\_from\_file( ... );

manipulate\_data( ... );

clEnqueueWriteBuffer( ... );

clEnqueueNDRangeKernel(..., sum, ...);

clEnqueueReadBuffer( ... );

display\_result( ... );

}



## **OpenCL** Kernels

#### Kernel: Data-parallel function

- Defines many parallel threads
- Each thread has an identifier specified by "get\_global\_id"
- Contains keyword extensions to specify parallelism and memory hierarchy

Executed by an OpenCL device

- CPU, GPU, FPGA
- Code portable NOT performance portable
  - Between FPGAs it is!

\_\_kernel void sum(\_\_global float \*a, \_\_global float \*b, \_\_global float \*answer)

{

int xid = get\_global\_id(0);

answer[xid] = a[xid] + b[xid];

}
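To make the thread model concrete, the following sketch emulates an NDRange launch of `sum` in plain C. A loop stands in for the OpenCL runtime and hands each work-item its index, where a real kernel would call `get_global_id(0)`. The helper names `sum_work_item` and `launch_sum` are hypothetical and exist only for this illustration; they are not part of OpenCL.

```c
#include <assert.h>
#include <stddef.h>

/* Body of the sum kernel for ONE work-item. In OpenCL C, xid would come
 * from get_global_id(0); here the emulated runtime passes it in. */
void sum_work_item(size_t xid, const float *a, const float *b, float *answer)
{
    answer[xid] = a[xid] + b[xid];
}

/* Hypothetical stand-in for an NDRange launch: runs one work-item per
 * element. Work-items are independent, so their order does not matter. */
void launch_sum(size_t global_size, const float *a, const float *b,
                float *answer)
{
    assert(a != NULL && b != NULL && answer != NULL);
    for (size_t xid = 0; xid < global_size; ++xid)
        sum_work_item(xid, a, b, answer);
}
```

Because no work-item reads another work-item's output, a device is free to run them in any order or all at once, which is what makes the kernel data-parallel.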



### Software Engineer's View of an OpenCL System



Device contains compute engines that run the kernel

Host talks to global memory through OpenCL routines

Global memory is large, fast, and likes to burst

Local memory is small, fast, and supports random access



### FPGA OpenCL Architecture



Modest external memory bandwidth

Extremely high internal memory bandwidth

Highly customizable compute cores



## Start with a Reference Platform (1/2)



### Start with a Reference Platform (2/2)

### Host and accelerator in same package: SoC











### **Compiling Kernel**

Run the Altera Offline Compiler (aoc) at a command prompt

- aoc --board <board> <Kernel.cl>
- Run aoc --list-boards to see all available boards

aoc performs system integration to generate the kernel hardware system, then invokes the Quartus Prime software to compile the design

/mydesigns/matrixMult\$ aoc matrixMul.c

aoc: Selected target board bittware\_s5

Estimated Resource Usage Summary:

| Resource                  | Usage |
|---------------------------|-------|
| Logic utilization         | 52%   |
| Dedicated logic registers | 23%   |
| Memory blocks             | 31%   |
| DSP blocks                | 54%   |



### Executing the kernel: clCreateProgramWithBinary













### Printf

Can use printf within kernel on FPGA

Adds some memory traffic overhead

In the emulator, printf runs on IA (the host CPU)

Useful for fast debug iterations







### **Optimization Report**

Intel FPGA SDK for OpenCL provides a static report to identify performance bottlenecks when writing single-threaded kernels

Use -c to stop after generating the reports

- aoc -c <kernel.cl>
- Report is in: <kernel>/reports/report.html











### **Dynamic Profiler**

Intel FPGA SDK for OpenCL enables users to get runtime information about their kernel performance

#### Bottlenecks, bandwidth, saturation, pipeline occupancy

3. Run the profiler GUI: aocl report <aocx> <profile.mon>





**Performance Stats** 





# **HIGH PERFORMANCE DATA FLOW**

# Execution of Threads on FPGA – Naïve Approach

Threads can be executed on *replicated* pipelines in the FPGA






- Throughput = 1 thread per cycle
- Area inefficient





- Attempt to create a deeply pipelined implementation of the kernel
- On each clock cycle, attempt to send in a new thread










A better method takes advantage of *pipeline parallelism*

- Throughput = 1 thread per cycle
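The throughput claim can be checked with a small idealized model, sketched here in C. It assumes a pipeline that never stalls and accepts one thread every cycle; the function names and the example depth are hypothetical and not taken from the slides.

```c
#include <assert.h>

/* Cycles to finish n_threads when a new thread enters a depth-stage
 * pipeline every cycle: the first result appears after `depth` cycles,
 * then one more result completes on each following cycle. */
unsigned long pipelined_cycles(unsigned long depth, unsigned long n_threads)
{
    assert(depth > 0);
    if (n_threads == 0)
        return 0;
    return depth + n_threads - 1;
}

/* Baseline for comparison: each thread runs start to finish before the
 * next one begins, so no two threads overlap. */
unsigned long sequential_cycles(unsigned long depth, unsigned long n_threads)
{
    assert(depth > 0);
    return depth * n_threads;
}
```

With a 5-stage pipeline and 1000 threads, the pipelined version needs 1004 cycles against 5000 for the non-overlapped baseline, so for large thread counts the cost approaches one cycle per thread using a single copy of the hardware.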







# SINGLE THREADED OPTIMIZATIONS

### **OpenCL on Intel FPGAs**

Main assumption made in the previous OpenCL programming model

- Data-level parallelism exists in the kernel program

Not all applications are well suited to this assumption

- Some applications do not map well to data-parallel paradigms

Data-parallel workloads are the only workloads that GPUs support



### **Data-Parallel Execution**

On the FPGA, we use the idea of pipeline parallelism to achieve acceleration



Threads can execute in an embarrassingly parallel manner



# Data-Parallel Execution - Drawbacks

Difficult to express programs which have partial dependencies during execution



Would require complicated hardware and new language semantics to describe the desired behavior



# Solution: Tasks and Loop-Pipelining

Allow users to express programs as a single thread

Pipeline parallelism is still leveraged to efficiently execute loops in Intel's FPGA OpenCL





# Loop Carried Dependencies

Loop-carried dependencies are dependencies where one iteration of the loop depends upon the results of another iteration of the loop

```
kernel void state_machine(ulong n)
{
  t_state_vector state = initial_state();
  for (ulong i = 0; i < n; i++) {
    state = next_state(state);  // depends on state from the previous iteration
    uint y = process(state);
    write_channel_altera(OUTPUT, y);
  }
}
```

The variable state in iteration 1 depends on the value from iteration 0. Similarly, iteration 2 depends on the value from iteration 1, etc.
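The same dependency can be reproduced in plain C. This is a minimal sketch in which `next_state` and `process` are made-up stand-ins (the slide does not define them), the initial state is assumed to be 1, and an output array replaces the channel write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state update: any function of the previous state works. */
uint32_t next_state(uint32_t state) { return 2u * state + 1u; }

/* Hypothetical per-iteration computation on the current state. */
uint32_t process(uint32_t state) { return state % 10u; }

/* Plain-C analogue of the state_machine kernel: out[i] replaces the
 * channel write. Iteration i cannot call next_state() until iteration
 * i-1 has produced `state`: that is the loop-carried dependency. */
void state_machine(size_t n, uint32_t *out)
{
    assert(n == 0 || out != NULL);
    uint32_t state = 1u; /* stands in for initial_state() */
    for (size_t i = 0; i < n; ++i) {
        state = next_state(state);
        out[i] = process(state);
    }
}
```

Only `state` crosses iterations. The call to `process` and the output write depend on the current iteration alone, so they can overlap with later iterations once the next state is known.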



# Loop Carried Dependencies

To achieve acceleration, we can *pipeline* each iteration of a loop containing loop carried dependencies

- Analyze any dependencies between iterations
- Schedule these operations
- Launch the next iteration as soon as possible

```
kernel void state_machine(ulong n)
{
   t_state_vector state = initial_state();
   for (ulong i = 0; i < n; i++) {
      state = next_state(state);  // next iteration can launch once this is ready
      uint y = process(state);
      write_channel_altera(OUTPUT, y);
   }
}
```
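A common first-order way to reason about such a schedule, offered here as an assumption rather than something stated on the slide, is the initiation interval (II): the number of cycles between the launch of successive iterations. The sketch below estimates the total cycle count; the function name and parameters are hypothetical.

```c
#include <assert.h>

/* Estimated cycles for n iterations of a pipelined loop.
 * latency: cycles for one iteration from start to finish.
 * ii:      initiation interval, cycles between successive iteration
 *          launches (ii == 1 means a new iteration starts every cycle). */
unsigned long pipelined_loop_cycles(unsigned long latency, unsigned long ii,
                                    unsigned long n)
{
    assert(latency > 0 && ii > 0);
    if (n == 0)
        return 0;
    return latency + ii * (n - 1);
}
```

For 100 iterations of a 10-cycle loop body, II = 1 gives 109 cycles, II = 3 gives 307, and II = 10 (no overlap at all) gives 1000. The shorter the chain of operations that produces the loop-carried value, the smaller the II the scheduler can reach.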







# Parallel Threads vs. Loop Pipelining

### So what's the difference?



Loop Pipelining enables Pipeline Parallelism \*AND\* the communication of state information between iterations.



# Image Filter

```
const int filterH[3][3] = { {-1,0,1}, {-2,0,2}, {-1,0,1} };
const int filterV[3][3] = { {-1,-2,-1}, {0,0,0}, {1,2,1} };

unsigned char rows[2 * WIDTH + 3]; // Pixel buffer of 2 rows and 3 extra pixels

int count = 0;
while (count != iterations) {
  // Each cycle, shift a new pixel into the buffer.
  // Unrolling this loop allows the compiler to infer a shift register.
  #pragma unroll
  for (int i = WIDTH * 2 + 2; i > 0; --i) {
    rows[i] = rows[i - 1];
  }
  rows[0] = data_in[count]; // Shift image data (from DDR) into one end

  int accumH = 0, accumV = 0;
  for (unsigned y = 0; y < TILE_SIZE; y++) {
    for (unsigned x = 0; x < TILE_SIZE; x++) {
      int pixel = rows[y * WIDTH + x];
      accumH += pixel * filterH[y][x];
      accumV += pixel * filterV[y][x];
    }
  }
  int accum = accumH * accumH + accumV * accumV;
  unsigned char out_val = (accum > (threshold * threshold)) ? 255 : 0;
  data_out[count++] = out_val; // Output pixel (to DDR)
}
```
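To see why the shift register works, the sketch below runs the same sliding window in plain C and compares it against a reference that reads the image directly. The image size, the function names, and the choice to treat pixels before the start of the stream as zero are all assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define WIDTH 8   /* image width in pixels (illustrative value) */
#define HEIGHT 6  /* image height in pixels (illustrative value) */

static const int filterH[3][3] = { {-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1} };
static const int filterV[3][3] = { {-1, -2, -1}, {0, 0, 0}, {1, 2, 1} };

/* Streaming version: one pixel enters the shift register per iteration. */
void sobel_stream(const unsigned char *in, unsigned char *out, int threshold)
{
    unsigned char rows[2 * WIDTH + 3] = {0};
    assert(in != NULL && out != NULL);
    for (size_t count = 0; count < (size_t)WIDTH * HEIGHT; ++count) {
        for (int i = 2 * WIDTH + 2; i > 0; --i)
            rows[i] = rows[i - 1];
        rows[0] = in[count];
        int accumH = 0, accumV = 0;
        for (int y = 0; y < 3; ++y)
            for (int x = 0; x < 3; ++x) {
                int pixel = rows[y * WIDTH + x];
                accumH += pixel * filterH[y][x];
                accumV += pixel * filterV[y][x];
            }
        out[count] = (accumH * accumH + accumV * accumV
                      > threshold * threshold) ? 255 : 0;
    }
}

/* Reference version: tap (y, x) of the shift register holds the pixel
 * that arrived y*WIDTH + x iterations ago, so read it from the image
 * directly. Pixels before the start of the stream read as zero. */
void sobel_direct(const unsigned char *in, unsigned char *out, int threshold)
{
    assert(in != NULL && out != NULL);
    for (long c = 0; c < (long)WIDTH * HEIGHT; ++c) {
        int accumH = 0, accumV = 0;
        for (int y = 0; y < 3; ++y)
            for (int x = 0; x < 3; ++x) {
                long src = c - (y * WIDTH + x);
                int pixel = (src >= 0) ? in[src] : 0;
                accumH += pixel * filterH[y][x];
                accumV += pixel * filterV[y][x];
            }
        out[c] = (accumH * accumH + accumV * accumV
                  > threshold * threshold) ? 255 : 0;
    }
}
```

The streaming version reads each input pixel from memory exactly once, while the reference touches up to nine locations per output. That difference is the memory bandwidth the shift register saves on the FPGA.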





# CHANNELS

Harnessing Dataflow to Reduce Memory Bandwidth

# Data Movement in GPUs

Data is moved from the host over PCI Express

Instructions and data are constantly sent back and forth between host cache, host memory, and GPU memory

- Requires buffering larger data sets before passing to GPU to be processed
- Significant latency penalty
- Requires high memory and host bandwidth
- Requires sequential execution of kernels



# Altera\_Channels Extension

An FPGA has programmable routing

Can't we just send data across wires between kernels?

Advantages:

- Reduce memory bandwidth
- Lower latency through fine-grained synchronization between kernels
- Reduce complexity (wires are trivial compared to memory access)
  - Lower cost, lower area, higher performance
- Enable modular dataflow design through small kernels exchanging data
- Different workgroup sizes and degrees of parallelism in connected modules





# Data Movement in FPGAs

FPGA allows for result reuse between instructions

Ingress/Egress to custom functions 100% flexible

Multiple memory banks of various types directly off FPGA

- Algorithms can be architected to minimize buffering to external memory or host memory
- Multiple optional memory banks can be used to allow simultaneous access





# Example: Multi-Stage Pipeline

An algorithm may be divided into multiple kernels:

- Modular design patterns
- Partition the algorithm into kernels with different sizes and dimensions
- Algorithm may naturally split into both single-threaded <u>and</u> NDRange kernels

Generating random data for a Monte Carlo simulation:

```
// Single-threaded kernel
kernel void rng(int seed) {
    int r = seed;
    while (true) {
        r = rand(r);
        write_channel_altera(RAND, r);
    }
}
```
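The behaviour of a channel between two kernels can be approximated in software as a fixed-depth FIFO. The sketch below is an emulation under assumptions: the depth, the type `channel_t`, the generator `rand_step`, and the scheduling helper `run_round` are all hypothetical, and on the FPGA the producer and consumer run concurrently instead of taking turns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CHANNEL_DEPTH 4 /* illustrative FIFO depth */

/* Software stand-in for a channel: a fixed-depth FIFO between kernels. */
typedef struct {
    uint32_t data[CHANNEL_DEPTH];
    size_t head;  /* index of the oldest element */
    size_t count; /* number of elements currently stored */
} channel_t;

/* Non-blocking write: returns false (producer must stall) when full. */
bool channel_write(channel_t *ch, uint32_t value)
{
    assert(ch != NULL);
    if (ch->count == CHANNEL_DEPTH)
        return false;
    ch->data[(ch->head + ch->count) % CHANNEL_DEPTH] = value;
    ch->count++;
    return true;
}

/* Non-blocking read: returns false (consumer must stall) when empty. */
bool channel_read(channel_t *ch, uint32_t *value)
{
    assert(ch != NULL && value != NULL);
    if (ch->count == 0)
        return false;
    *value = ch->data[ch->head];
    ch->head = (ch->head + 1) % CHANNEL_DEPTH;
    ch->count--;
    return true;
}

/* Hypothetical generator step (a linear congruential update). */
uint32_t rand_step(uint32_t r) { return r * 1664525u + 1013904223u; }

/* One scheduling round: the producer fills the channel, then the consumer
 * drains it into out[]. Returns how many values were transferred. */
size_t run_round(channel_t *ch, uint32_t *r, uint32_t *out, size_t max_out)
{
    size_t n = 0;
    uint32_t next = rand_step(*r);
    while (channel_write(ch, next)) {
        *r = next;              /* commit state only after a successful write */
        next = rand_step(*r);
    }
    uint32_t v;
    while (n < max_out && channel_read(ch, &v))
        out[n++] = v;
    return n;
}
```

The random values never pass through global memory: the producer stalls when the FIFO is full and the consumer stalls when it is empty, which is the fine-grained synchronization the channels extension provides.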















# An Even Closer Look: FPGA Custom Architectures

Kernel Replication with num\_compute\_units using OpenCL

- Step #1: Design an efficient kernel



#### Kernel Replication With Intel<sup>®</sup> FPGA SDK for OpenCL

Attribute to specify a 1-dim or 2-dim array of kernels

Add API to identify the kernel in the array

\_\_attribute\_\_((num\_compute\_units(4,4)))

kernel void PE() {

...

row = get\_compute\_id(0);

col = get\_compute\_id(1);

...

}

Processing elements (task-based kernels)

Compile-time constants allow the compiler to specialize each PE

[Figure: 4 x 4 array of PEs]



# Kernel Replication With Intel® FPGA SDK for OpenCL





# Matrix Multiply in OpenCL

Every PE / feeder is a kernel

Communication via OpenCL channels

Data-flow network model

Software control:

- Compute unit granularity
- Spatial Locality
- Interconnect topology
- Data movement
- Caching
- Banking
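The data-flow structure can be emulated sequentially in C. This sketch assumes a 4 x 4 PE array and models each channel transfer as a function argument; the names `pe_t`, `pe_step`, and `pe_array_matmul` are hypothetical, and it is not the actual Intel design, in which all PEs run concurrently and exchange data over channels.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 4  /* PE array rows, matching num_compute_units(4,4) */
#define COLS 4  /* PE array columns */

/* One processing element: keeps a running dot product for its (row, col)
 * output. a_val and b_val model the values arriving over channels. */
typedef struct { float acc; } pe_t;

void pe_step(pe_t *pe, float a_val, float b_val)
{
    assert(pe != NULL);
    pe->acc += a_val * b_val;
}

/* Emulates the array computing C = A * B for a ROWS x depth matrix A and
 * a depth x COLS matrix B, both row-major. At each step k, feeders supply
 * A[row][k] to every PE in that row and B[k][col] to every PE in that
 * column. */
void pe_array_matmul(const float *a, const float *b, float *c, int depth)
{
    pe_t pe[ROWS][COLS] = {{{0.0f}}};
    assert(a != NULL && b != NULL && c != NULL && depth >= 0);
    for (int k = 0; k < depth; ++k)
        for (int row = 0; row < ROWS; ++row)
            for (int col = 0; col < COLS; ++col)
                pe_step(&pe[row][col], a[row * depth + k], b[k * COLS + col]);
    for (int row = 0; row < ROWS; ++row)
        for (int col = 0; col < COLS; ++col)
            c[row * COLS + col] = pe[row][col].acc;
}
```

Each value of A is used by every PE in its row and each value of B by every PE in its column, so one load from external memory feeds several multiply-accumulates. That reuse is what the interconnect topology and data movement choices in the list control.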

#### Performance: ~1 TFLOPs







# CNN On FPGA

Want to minimize accessing external memory

Want to keep resulting data between layers on the device and between computations

Want to leverage reuse of the hardware between computations

Parallelism in the depth of the kernel window and across output features. Defer complex spatial math to random access memory.

Re-use hardware to compute multiple layers.





## **Efficient Parallel Execution of Convolutions**



- Parallel Convolutions
  - Different filters of the same convolution layer processed in parallel in different processing elements (PEs)
- Vectored Operations
  - Across the depth of feature map
- PE Array geometry can be customized to hyperparameters of given topology

## **Design Exploration with Reduced Precision**

Tradeoff between performance and accuracy

- Reduced precision allows more processing to be done in parallel
- Using smaller Floating Point format does not require retraining of network
- FP11 benefit over using INT8/9
  - No need to retrain, better performance, less accuracy loss

| Format | Layout                                |
|--------|---------------------------------------|
| FP16   | Sign, 5-bit exponent, 10-bit mantissa |
| FP11   | Sign, 5-bit exponent, 5-bit mantissa  |
| FP10   | Sign, 5-bit exponent, 4-bit mantissa  |
| FP9    | Sign, 5-bit exponent, 3-bit mantissa  |
| FP8    | Sign, 5-bit exponent, 2-bit mantissa  |
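The accuracy side of the tradeoff can be explored on a CPU by discarding mantissa bits from ordinary FP32 values. This is a rough sketch under stated assumptions: it truncates instead of rounding, it keeps FP32's 8-bit exponent (so it does not model the reduced range of a 5-bit exponent), and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Keeps only the top `mantissa_bits` of an FP32 value's 23-bit mantissa,
 * truncating toward zero. The 8-bit FP32 exponent is left unchanged, so
 * this models reduced mantissa precision only. */
float truncate_mantissa(float x, int mantissa_bits)
{
    assert(mantissa_bits >= 0 && mantissa_bits <= 23);
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);                /* reinterpret safely */
    uint32_t drop = 23u - (uint32_t)mantissa_bits;
    uint32_t mask = ~((UINT32_C(1) << drop) - 1u); /* clears the low bits */
    bits &= mask;
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

For example, keeping a 5-bit mantissa turns 3.1415927 into 3.125, and keeping 10 bits gives 3.140625. Passing network weights and activations through such a function is one way to estimate accuracy loss before committing to a narrower datapath.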

# **OPENCL FLOW**

Lab 2



# **FPGA PROGRAMMING MODEL:**

DSP Builder Advanced Blockset

# The MathWorks\* Design Environment

#### MATLAB\*

- High-level technical computing language
  - Simple C-like language
  - Efficient with vectors and matrices
  - Built-in mathematical functions
- Interactive environment for algorithm development
  - 2D/3D graphing tool for data visualization

#### Simulink\*

- Hierarchical block diagram design & simulation tool
- Digital, analog/mixed signal & event driven
- Visualize signals
- Integrated with MATLAB\*





# DSP Builder for Intel® FPGAs

Enables MathWorks\* Simulink for Intel FPGA design

Device optimized Simulink\* DSP Blockset

- Key Features:
  - High-Level Design Exploration
  - HW-in-the-Loop verification
  - IP Generation for Intel<sup>®</sup> Quartus SW / Platform Designer





# **FPGA Design Flow - Traditional**





# FPGA Design Flow – DSP Builder for Intel® FPGAs





# **Core Technologies**

- IP (ready made) library
  - Multi-rate, multi-channel filters
  - Waveform synthesis (NCO/DDS/Mixers)
- Custom IP creation using primitive library
  - Vectorization
  - Zero latency
  - Scheduled
  - Aligned RTL generation
- System integration
  - Platform Designer
  - Processor Integration

- Automatic pipelining
- Automatic folding and resource sharing
- Multichannel designs with automatic vectorization
- Avalon<sup>®</sup> Memory-Mapped and Streaming Interfaces
- Design exploration across device families
- High-performance floating-point designs
- System-in-the-Loop accelerated simulation



# Advanced Blockset - High Performance DSP IP

Over 150 device optimized DSP building blocks for Intel<sup>®</sup> FPGAs

- DSP building blocks
- Interfaces
- IP library blocks
- Primitives library blocks
  - Math and Basic blocks
- Vector and Complex data types







Design Configuration



Primitives

Utilities



# **Build Custom FFTs from FFT Element Library**

- Quickly build DSP designs using Complete FFT IP Functions from the FFT Library
- Build custom radix-2<sup>2</sup> FFTs using blocks from the FFT Element Library
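For readers new to FFT structure, the basic operation that such element blocks implement in hardware is the radix-2 butterfly. The C sketch below shows a generic decimation-in-time butterfly on a small complex struct. It is a textbook illustration only, not the DSP Builder implementation, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double re, im; } cplx;

cplx cplx_add(cplx a, cplx b) { return (cplx){a.re + b.re, a.im + b.im}; }
cplx cplx_sub(cplx a, cplx b) { return (cplx){a.re - b.re, a.im - b.im}; }
cplx cplx_mul(cplx a, cplx b)
{
    return (cplx){a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re};
}

/* Radix-2 decimation-in-time butterfly:
 *   out0 = a + w*b,  out1 = a - w*b
 * where w is the twiddle factor for this butterfly. */
void butterfly_radix2(cplx a, cplx b, cplx w, cplx *out0, cplx *out1)
{
    assert(out0 != NULL && out1 != NULL);
    cplx t = cplx_mul(w, b);
    *out0 = cplx_add(a, t);
    *out1 = cplx_sub(a, t);
}
```

With a twiddle factor of 1 the butterfly is exactly a 2-point DFT (sum and difference of its inputs). Larger transforms chain log2(N) stages of these butterflies with different twiddle factors.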

[Figure: FFT signal-flow graph built from radix-2 butterflies with twiddle factors, inputs x(0) through x(7)]

| FFT IP Library     | FFT Element Library        |
|--------------------|----------------------------|
| FFT                | Pruning and Twiddle        |
| FFT\_float         | Bit vector combine         |
| VFFT               | Butterfly Unit             |
| VFFT\_float        | Choose Bits                |
| BitReverseCoreC    | Dual Twiddle Memory        |
| VariableBitReverse | Edge Detect                |
|                    | Floating-Point Twiddle Gen |
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |

**Crossover Switch** 


# Filter and Waveform Synthesis Library

DSP Builder includes a comprehensive waveform IP library

- Automatic resource sharing based on sample rate
- Support for super sample rate architectures



| IP    | Implementations                                                                                           |
|-------|-----------------------------------------------------------------------------------------------------------|
| FIR   | Half-band, L-Band, Symmetric, Decimating, Fractional Rate, Interpolation, Single-Rate, Super Sample Rate |
| CIC   | Decimating, Interpolating, Super Sample Rate                                                              |
| Mixer | Complex, Real, Super Sample Rate                                                                          |
| NCO   | Super Sample Rate, Multi-bank                                                                             |

# Library is Technology Independent

- Target device using a **Device** block
- Same model generates optimized RTL for each FPGA and speed grade





# Datapath Optimization for Performance

### Automatic Timing Driven Synthesis of Model

- Based on specified device and clock frequency

| Optimization             | Description                                |
|--------------------------|--------------------------------------------|
| Pipelining               | Inserts registers to improve Fmax          |
| Algorithmic Retiming     | Moves registers to balance pipelining      |
| Bit Growth Management    | Manages bit growth for fixed-point designs |
| Multi-rate Optimizations | Optimizes hardware based on sample rate    |

Example global settings (Clock tab): Clock Signal Name: clk, Clock Frequency (MHz): 240, Clock Margin (MHz): 0

[Figure: retiming and bit growth, before and after optimization]

## **Custom IP Generation**





### ALU Design Folding Improves Area Efficiency

Optimizes hardware usage for low-throughput designs

- Arranges one of each resource in a central arithmetic logic unit (ALU) fashion
- Folding factor = clock rate / data rate
- Performed when Folding factor > 500
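
The folding rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic on the slide; the function names and the example rates are assumptions, not part of DSP Builder.

```python
# Threshold taken from the slide: folding is performed when the factor exceeds 500.
FOLDING_THRESHOLD = 500


def folding_factor(clock_rate_hz: float, data_rate_hz: float) -> float:
    """Folding factor = clock rate / data rate, as defined on the slide."""
    if clock_rate_hz <= 0 or data_rate_hz <= 0:
        raise ValueError("rates must be positive")
    return clock_rate_hz / data_rate_hz


def uses_alu_folding(clock_rate_hz: float, data_rate_hz: float) -> bool:
    """True when the folding factor is above the slide's threshold."""
    return folding_factor(clock_rate_hz, data_rate_hz) > FOLDING_THRESHOLD


# Hypothetical design points (not from the presentation).
slow_sensor = folding_factor(200e6, 100e3)   # 200 MHz clock, 100 kSPS data
fast_stream = folding_factor(240e6, 1e6)     # 240 MHz clock, 1 MSPS data
print(slow_sensor, uses_alu_folding(200e6, 100e3))
print(fast_stream, uses_alu_folding(240e6, 1e6))
```

With these assumed rates, only the first design point has enough spare clock cycles per sample to qualify for ALU folding.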









# TDM Design: Trade-Off Example

49-tap Symmetric Single Rate FIR Filter on Stratix 10

| Clock Rate (MHz) | Sample Rate (MSPS) | LUT4s | Mults | Memory bits | TDM Factor |
|------------------|--------------------|-------|-------|-------------|------------|
| 72               | 72                 | 898   | 26    | 0           | 1          |
| 144              | 72                 | 1082  | 14    | 0           | 2          |
| 288              | 72                 | 741   | 8     | 0           | 4          |
| 72               | 36                 | 1082  | 14    | 0           | 2          |
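
As a rough cross-check of the table, the sketch below recomputes the TDM factor as clock rate divided by sample rate and compares the multiplier count of each row against the TDM = 1 row. The numbers are copied from the table; the helper name and the integer-multiple check are assumptions made for this illustration.

```python
# (clock MHz, sample MSPS, LUT4s, multipliers, reported TDM factor), from the table above
TABLE = [
    (72, 72, 898, 26, 1),
    (144, 72, 1082, 14, 2),
    (288, 72, 741, 8, 4),
    (72, 36, 1082, 14, 2),
]


def tdm_factor(clock_mhz: int, sample_msps: int) -> int:
    """Clock cycles available per input sample."""
    factor, remainder = divmod(clock_mhz, sample_msps)
    if remainder:
        # Assumption for this sketch: only integer ratios are considered.
        raise ValueError("clock rate must be an integer multiple of the sample rate")
    return factor


baseline_mults = TABLE[0][3]
summary = []
for clock, rate, luts, mults, reported in TABLE:
    factor = tdm_factor(clock, rate)
    saving = round(baseline_mults / mults, 2)  # multiplier reduction vs. TDM = 1
    summary.append((factor, mults, saving))
    print(f"{clock} MHz / {rate} MSPS: TDM {factor} (table says {reported}), "
          f"{mults} mults, {saving}x fewer than baseline")
```

The second and fourth rows have the same ratio of clock rate to sample rate, which is consistent with the table reporting identical resources for them.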



## 2 Antenna DUC Reference Design



Cosine Seq. = cosA1,cosA2

Reference Design Included with DSP Builder

Clock Rate = 179.2 MHz





#### Changing the Design with DSP Builder

- Modifications done in minutes
- Design still looks the same





## Five Design Iterations in < 1 Hour

|                                   | Arria® 10<br>6 channel | Arria 10<br>6 channel | Arria 10<br>12 channel | Stratix® 10<br>6 channel | Stratix 10<br>12 channel |
|-----------------------------------|------------------------|-----------------------|------------------------|--------------------------|--------------------------|
| Requested Clock<br>(MHz)          | 250                    | 450                   | 450                    | 450                      | 450                      |
| Actual Fmax<br>(slow model, 85C)  | 351                    | 458                   | 458                    | 524                      | 484.5                    |
| Multiplier Count<br>(18x18)       | 10                     | 6                     | 10                     | 6                        | 10                       |
| Logic Resources<br>(registers)    | 686                    | 465                   | 818                    | 1267                     | 1863                     |
| Block Memory<br>Resources (kbits) | 0                      | 0                     | 0                      | 0                        | 25.8                     |



### Generates Reusable IP for Platform Designer

- Platform Designer is the System Integration Environment for Intel<sup>®</sup> FPGAs
- DSP Builder designs fully compatible with Platform Designer
- Integrate with other FPGA IPs
  - Processors
  - State machines
  - Streaming interfaces
- Design reuse fully supported





#### Typical Design Flow

Identify system architecture, design filters and choose desired Fmax and device

- Set the top-level system parameters in the MATLAB<sup>®</sup> software using the 'params' file (number of channels, performance, etc.)
- Build the system using the Advanced Blockset tool
- Simulate the design using Simulink® and ModelSim® tools
- Target the right FPGA family and compile
- As system design specs change, edit the 'params' file and repeat



#### **Design Flow - Create Model**

Create a new blank model

Select New Model Wizard from DSP Builder menu







Top level of a DSPB-AB design is a testbench

Must include Control and Signals blocks

#### **Design Flow - Synthesizable Model**



Device block marks the top level of the FPGA







#### Design Flow – ModelPrim Blocks




#### Design Flow – Parameterize the Design



#### C-structure-like template

Runs when model is opened or simulation is run



#### Design Flow – Processor Interface

Drop memory and registers in the design

ModelIP blocks have a built-in memory-mapped interface to control registers and coefficient registers



Function Block Parameters: InterpolatingCIC

| Parameter                     | Value                                                   |
|-------------------------------|---------------------------------------------------------|
| Input Rate per Channel (MSPS) | ampleRate * wcdma_mcduc.FIR2Rate * wcdma_mcduc.FIR1Rate |
| Number of Channels            | wcdma_mcduc.ChanCount                                   |
| Number of Stages              | wcdma_mcduc.CICN                                        |
| Interpolation Factor          | wcdma_mcduc.CICRate                                     |
| Differential Delay            | wcdma_mcduc.CICM                                        |
| Final Decimation              | 1                                                       |
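
The dialog above takes its values from a single parameter structure, so changing one field updates every block that refers to it. The actual 'params' file is a MATLAB script; the Python sketch below is only an analogue of that idea. The field `sample_rate_msps` is a hypothetical name (the corresponding expression is cut off in the dialog), the output-rate line assumes the usual relation for an interpolating filter, and all numeric values are illustrative rather than taken from the reference design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DucParams:
    """Python stand-in for a MATLAB params struct such as wcdma_mcduc."""
    sample_rate_msps: float  # hypothetical field name, truncated in the dialog
    chan_count: int          # ChanCount
    fir1_rate: int           # FIR1Rate
    fir2_rate: int           # FIR2Rate
    cic_n: int               # CICN
    cic_rate: int            # CICRate
    cic_m: int               # CICM


def cic_block_parameters(p: DucParams) -> dict:
    """Evaluate the InterpolatingCIC dialog expressions for one parameter set."""
    input_rate = p.sample_rate_msps * p.fir2_rate * p.fir1_rate
    return {
        "input_rate_per_channel_msps": input_rate,
        "number_of_channels": p.chan_count,
        "number_of_stages": p.cic_n,
        "interpolation_factor": p.cic_rate,
        "differential_delay": p.cic_m,
        "final_decimation": 1,
        # Assumption: output rate = input rate * interpolation factor.
        "output_rate_per_channel_msps": input_rate * p.cic_rate,
    }


# Illustrative values only.
params = DucParams(sample_rate_msps=3.84, chan_count=4, fir1_rate=2,
                   fir2_rate=2, cic_n=5, cic_rate=8, cic_m=1)
cic = cic_block_parameters(params)
for name, value in cic.items():
    print(f"{name}: {value}")
```

Editing one field, for example `chan_count`, and re-evaluating mirrors the "edit the 'params' file and repeat" step of the design flow.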



## **Design Flow - Running Simulink Simulation**

Creates files in location specified by Control block

- VHDL Code
- Timing constraints file (.sdc)
- DSPB-AB subsystem Quartus® IP file



#### **Design Flow - Documentation Generation**

Get accurate resource utilization of all modules right after simulation, without place & route

- DSP Builder > Resource Usage
- DSP Builder > View Address Map

Shows the memory map for the lab_dsbdm model.

| Nodes                                            | Register bits | Reset | Word address |
|--------------------------------------------------|---------------|-------|--------------|
| FIR Coefficient Registers                        |               |       |              |
| lab_dsbdm/DSB_Demod_Chip/DSB_Demod/PreDe         |               |       |              |
| FIR Coefficient Register 0                       | 15..0 (16)    | -1524 | 0x40         |
| FIR Coefficient Register 1                       | 15..0 (16)    | 688   | 0x41         |
| FIR Coefficient Register 2                       | 15..0 (16)    | 686   | 0x42         |
| FIR Coefficient Register 3                       | 15..0 (16)    | 718   | 0x43         |
| FIR Coefficient Register 4                       | 15..0 (16)    | 722   | 0x44         |
| FIR Coefficient Register 5                       | 15..0 (16)    | 639   | 0x45         |
| FIR Coefficient Register 6                       | 15..0 (16)    | 434   | 0x46         |
| FIR Coefficient Register 7                       | 15..0 (16)    | 73    | 0x47         |
| FIR Coefficient Register 8                       | 15..0 (16)    | -434  | 0x48         |
| FIR Coefficient Register 9                       | 15..0 (16)    | -1065 | 0x49         |
| FIR Coefficient Register 10                      | 15..0 (16)    | -1776 | 0x4A         |
| FIR Coefficient Register 11                      | 15..0 (16)    | -2499 | 0x4B         |
| FIR Coefficient Register 12                      | 15..0 (16)    | -3166 | 0x4C         |
| FIR Coefficient Register 13                      | 15..0 (16)    | -3704 | 0x4D         |
| FIR Coefficient Register 14                      | 15..0 (16)    | -4052 | 0x4E         |
| FIR Coefficient Register 15                      | 15..0 (16)    | 28594 | 0x4F         |
| lab_dsbdm/DSB_Demod_Chip/DSB_Demod/PreDe         |               |       |              |
| lab_dsbdm/DSB_Demod_Chip/DSB_Demod/Single        |               |       |              |
| NCO Phase Increment Registers (sin Inversion@MSB |               |       |              |
| lab_dsbdm/DSB_Demod_Chip/DSB_Demod/NCO (         |               |       |              |
Shows the resource usage of the lab_dsbdm model.

| Blocks                               | Type         | LUT4s | Mults | Memory bits |
|--------------------------------------|--------------|-------|-------|-------------|
| lab_dsbdm                            |              | 9237  | 108   | 1644        |
| DSB_Demod_Chip                       |              | 9237  | 108   | 16448       |
| DSB_Demod                            | SYNTH        | 9221  | 108   | 16448       |
| NCO                                  | NCO          | 1153  | 8     | 1644        |
| PreDetectionHPF                      | FIRS         | 2035  | 32    | (           |
| PreDetectionLPF                      | FIRS         | 2035  | 32    | 1           |
| Scale                                | SCALE        | 459   | 0     | (           |
| Scale1                               | SCALE        | 489   | 0     | 1           |
| Scale2                               | SCALE        | 459   | 0     |             |
| Scale3                               | SCALE        | 399   | 0     |             |
| SingleRateFIR                        | FIRS         | 2035  | 32    |             |
| Sync_Mix                             | SYNTH        | 125   | 4     |             |
| DualMem_0                            | DUALMEM      | 28    | 0     |             |
| DualMem_1                            | DUALMEM      | 28    | 0     |             |
| Mult_0                               | MULT         | 0     | 2     |             |
| Mult_1                               | MULT         | 0     | 2     |             |
| Id_ChannelIn_dV_s_to                 | DELAY        | 4     | 0     |             |
| Id_ChannelIn_dIn_0_to                | DELAY        | 16    | 0     |             |
| Id_ChannelIn_dIn_1_to                | DELAY        | 16    | 0     |             |
| Id_ChannelIn_dC_s_to                 | DUALMEM      | 24    | 0     |             |
| Id_ChannelIn_dC_s_to                 | COUNTER      | 2     | 0     |             |
| Id_ChannelIn_dC_s_to                 | REG          | 2     | 0     |             |
| Id_ChannelIn_dC_s_to                 | CONSTANT     | 0     | 0     |             |
| Id_ChannelIn_dC_s_to                 | REG          | 1     | 0     |             |
| Id_ChannelIn_dC_s_to                 | REG          | 1     | 0     |             |
| Id_ChannelIn_dC_s_to                 | LOGICAL      | 0     | 0     |             |
| Id_ChannelIn_dC_s_to                 | LOGICAL      | 1     | 0     |             |
| Id_ChannelIn_dC_s_to                 | LOGICAL      | 1     | 0     | 1           |
| Id_ChannelIn_dC_s_to                 | LOGICAL      | 1     | 0     | 1           |
| busReadSelector                      | SELECTOR     | 32    | 0     |             |
| busReadSelector                      | SELECTOR     | 16    | 0     | 1           |


#### **Design Verification**





### Design Flow – System Integration

[Screenshot: Qsys system with clock source clk_0 and the DSP Builder chip subsystem, which exposes clock and reset inputs, a conduit, and an Avalon Memory Mapped Slave bus with its own bus clock and reset]

- Add the subsystem directory to the Qsys IP search path (from the Qsys Tools menu)
- Add the subsystem from the Component pick list





## **ACCELERATION STACK FOR XEON WITH FPGA**



\* Other names and brands may be claimed as the property of others.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos



## ACCELERATION STACK FOR INTEL® XEON® CPU WITH FPGAS: COMPREHENSIVE ARCHITECTURE FOR DATA CENTER DEPLOYMENTS

**Rack-Level Solutions** 

**User Applications** 

Industry Standard Software Frameworks

**Acceleration Libraries** 

Intel Developer Tools (Intel Parallel Studio XE, Intel FPGA SDK for OpenCL™, Intel Quartus® Prime)

Acceleration Environment (Intel Acceleration Engine with OPAE Technology, FPGA Interface Manager (FIM))

**OS & Virtualization Environment** 

Intel<sup>®</sup> Hardware

#### **Faster Time to Revenue**

- Fully validated Intel® board
- Standardized frameworks and high-level compilers
- Partner-developed workload accelerators

#### **Simplified Management**

- Supported in VMware vSphere\* 6.7 Update 1\*
- Rack management and orchestration framework integration

#### Broad Ecosystem Support

- Upstreaming FPGA drivers to Linux\* kernel
- Qualified by industry-leading server OEMs
- Partnering with IP partners, OSVs, ISVs, SIs, and VARs

\* Demonstrated at VMworld Las Vegas - August 28-30, 2018






## Server Virtualization for the Acceleration Stack with VMware



Out-of-the-box support from VMware for Intel Arria 10 PAC and Acceleration Stack in upcoming vSphere 6.7 U1

Server virtualization enables customers to deploy FPGA workload acceleration with lower total cost of ownership



#### Migrating FPGA-Accelerated Workload with vMotion\*

Image inference workload

CPU + FPGA



Server 1

1. Run Application on Bare Metal

\# Unoptimized, proof-of-concept code. Not part of a shipping product. See supplementary slide for system configuration details.





### Components of Acceleration Stack: Overview













#### Nearly Transparent Software Application Use Model

Object model: Properties, Handle, Object

- Discover / search resource
- Acquire ownership of resource
- Map AFU registers to user space
- Allocate / define shared memory space
- Start / stop computation on AFU and wait for result
- Deallocate shared memory
- Unmap MMIO
- Relinquish ownership
- Reconfigure AFU
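
The use model above pairs each setup step with a matching teardown step. The Python mock below only illustrates that pairing and the reverse-order cleanup. `MockAccelerator` and its methods are hypothetical and are not the OPAE API. In the OPAE C library the corresponding steps are provided by functions such as `fpgaEnumerate`, `fpgaOpen`, `fpgaMapMMIO`, `fpgaPrepareBuffer`, `fpgaReleaseBuffer`, `fpgaUnmapMMIO`, and `fpgaClose`.

```python
from contextlib import ExitStack, contextmanager


class MockAccelerator:
    """Hypothetical stand-in for an accelerator; it only records the call order."""

    def __init__(self):
        self.log = []

    @contextmanager
    def step(self, setup: str, teardown: str):
        """Record a setup step now and its teardown step when the scope exits."""
        self.log.append(setup)
        try:
            yield
        finally:
            self.log.append(teardown)


def run_job(acc: MockAccelerator) -> list:
    """Walk through the use model; ExitStack unwinds in reverse order."""
    acc.log.append("discover resource")
    with ExitStack() as stack:
        stack.enter_context(acc.step("acquire ownership", "relinquish ownership"))
        stack.enter_context(acc.step("map MMIO", "unmap MMIO"))
        stack.enter_context(
            acc.step("allocate shared memory", "deallocate shared memory"))
        acc.log.append("start computation and wait for result")
    return acc.log


job_log = run_job(MockAccelerator())
for entry in job_log:
    print(entry)
```

Because the teardown steps are registered as the resources are acquired, they still run, in reverse order, if the computation step raises an exception.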







#### Acquire and Release Accelerator Resource









#### Management and Reconfiguration













### Growing the Xeon+FPGA Ecosystem





Portfolio of Accelerator Solutions developed by Intel and third-party technologists to expedite application development and deployment



## **DEVELOPER COMMUNITY**

Enabling software developers access via:

- Intel Builder programs
- AI Academy
- Intel Developer Zone (IDZ)
- Rocketboards.org



## UNIVERSITIES

Reaching over 200,000 students per year with FPGA publications, workshops and hands-on research labs

Committed to Open Source vision



#### **ISV PARTNERS**

Expanding the reach for system vendors with platforms and ready-to-use application workloads.







## **INTEL PAC TOP SOLUTIONS FOR DATA CENTER ACCELERATION**

| reniac                | swarm64    |                  | Accelered Computing    | MEGH                         | Levyx                   | napatech®                   |
|-----------------------|------------|------------------|------------------------|------------------------------|-------------------------|-----------------------------|
| Cassandra             | PostgreSQL | Genomics GATK    | JPEG2Lepton, JPEG2Webp | Big Data Streaming Analytics | Financial Black Scholes | Network Security/Monitoring |
| 96% latency reduction | ½ TCO      | 2.5X performance | 3-4X performance       | 5X performance               | 8X performance          | 3X performance              |

## CASE STUDY: 8X SPARK SQL / KAFKA PERFORMANCE INCREASE



Customer Application: Big Data Applications running on Spark/Kafka Platforms

Current solution: Run Spark/SQL on a cluster of CPUs

Challenge: For many applications in FinServ, genomics, intelligence agencies, etc., Spark performance does not meet customers' SLA requirements, especially for delay-sensitive streaming workloads

Solution Value Proposition

- Performance: Accelerate Spark SQL/Kafka by 8x
- Ease of Use: Zero Code Change
- Scalability: Hardware Agnostic
- Lower TCO

## **CASE STUDY: 5X RISK ANALYTICS PERFORMANCE INCREASE**



Customer Application: Risk Management acceleration framework (financial back-testing)

Current solution: Deploy a cluster of CPUs or GPUs with complex data access

Challenge: Traditional risk management methods are compute-intensive, time-consuming applications (10+ hours for financial back-testing)

Solution Value Proposition

- \>5x Performance Improvement
- Perform Risk and Pricing Calculations Simultaneously
- Abstraction: Integrated Solution with Apache Spark, SSD Access and FPGA Implementation







#### AF Simulation Environment (ASE) enables seamless portability to real HW

- Allows fast verification of OPAE software together with AF RTL without HW
  - SW Application loads ASE library and connects to RTL simulation
- For execution on HW, application loads Runtime library and RTL is compiled by Intel® Quartus into FPGA bitstream



#### FPGA Components of Acceleration Stack



\* Could be other interfaces in the future (e.g. UPI)

\*\* Stratix 10 PAC Card



## AFU Development Flow Using OPAE SDK

AFU requests the ccip\_std\_afu top level interface classes

\$OPAE\_PLATFORM\_ROOT/hw/samples/hello\_afu/hw/rtl/hello\_afu.json

AFU RTL files implementing accelerated function

\$OPAE\_PLATFORM\_ROOT/hw/samples/hello\_afu/hw/rtl/afu.sv

List all source files and platform configuration file

\$OPAE\_PLATFORM\_ROOT/hw/samples/hello\_afu/hw/rtl/filelist.txt

In terminal window, enter these commands:

- cd \$OPAE\_PLATFORM\_ROOT/hw/samples/hello\_afu
- afu\_sim\_setup --source hw/rtl/filelist.txt build\_sim





## AFU Development Flow Using OPAE SDK

Compile AFU and platform simulation models and start simulation server process

- cd build\_sim
- make
- make sim

In a second terminal window, compile the host application and start the client process

- export ASE\_WORKDIR=\$OPAE\_PLATFORM\_ROOT/hw/samples/hello\_afu/build\_sim/work
- cd \$OPAE\_PLATFORM\_ROOT/hw/samples/hello\_afu/sw
- make clean
- make USE\_ASE=1
- ./hello\_afu





## AFU Simulation Environment (ASE)

Hardware/software co-simulation environment for Intel Xeon FPGA development

Uses simulator Direct Programming Interface (DPI) for HW/SW connectivity

- Not cycle accurate (used for functional correctness)
- Converts SW API to CCI transactions

Provides transactional model for the Core Cache Interface (CCI-P) protocol and memory model for the FPGA-attached local memory

Validates compliance to

- CCI-P protocol specification
- Avalon® Memory Mapped (Avalon-MM) Interface Specification
- Open Programmable Acceleration Engine



#### Simulation Complete

|                                                                                                                                         | # [SIM] 1 ADDED /umas.187070359034322                                                                                                    |
|-----------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------|
|                                                                                                                                         |                                                                                                                                          |
| [APP] Issuing Soft Reset                                                                                                                | # [SIM] Request to deallocate "/umas.187070359034322"                                                                                    |
| [APP] MMIO Read : tid = 0x002, offset = 0x0 | # [SIM] 1 REMOVED /umas.187070359034322 |
| [APP] MMIO Read Resp : tid = 0x002, data = 100001000000000                                                                              | # [SIM] Request to deallocate "/mmio.187070359034322"                                                                                    |
| AFU DFH REG = 100001000000000                                                                                                           | # [SIM] 9 REMOVED /mmio.187070359034322                                                                                                  |
| [APP] MMIO Read : tid = 0x003, offset = 0x8                                                                                             | # [SIM] ASE recognized a SW simkill (see ase.cfg) Simulator will EXIT                                                                    |
| [APP] MMIO Read Resp : tid = 0x003, data = 9722d43375b61c66 | # [SIM] SIM-C Exiting event socket server@/tmp/ase_event_server_187070359034322 |
| AFU ID LO = 9722d43375b61c66                                                                                                            | # [SIM] Closing message queue and unlinking                                                                                              |
| [APP] MMIO Read : tid = 0x004, offset = 0x10                                                                                            | # [SIM] Unlinking Shared memory regions                                                                                                  |
| [APP] MMIO Read Resp : tid = 0x004, data = 850adcc26ceb4b22 | # [SIM] Session code file removed |
| AFU ID HI = 850adcc26ceb4b22 | # [SIM] Removing message queues and buffer handles |
| [APP] MMIO Read : tid = 0x005, offset = 0x18                                                                                            | # [SIM] Cleaning session files                                                                                                           |
| [APP] MMIO Read Resp : tid = 0x005, data = 0                                                                                            | # [SIM] Simulation generated log files                                                                                                   |
| AFU NEXT = 00000000 | # [SIM] Transactions file \$ASE_WORKDIR/ccip_transactions.tsv |
| [APP] MMIO Read : tid = 0x006, offset = 0x20 | # [SIM] Workspaces info \$ASE_WORKDIR/workspace_info.log |
| [APP] MMIO Read Resp : tid = 0x006, data = 0 | # [SIM] ASE seed \$ASE_WORKDIR/ase_seed.txt |
| AFU RESERVED = 00000000 | # [SIM] |
| [APP] MMIO Read : tid = 0x007, offset = 0x80 | # [SIM] Tests run => 1 |
| [APP] MMIO Read Resp : tid = 0x007, data = 0 | # [SIM] |
| Reading Scratch Register (Byte Offset=00000080) = 00000000                                                                              | # [SIM] Sending kill command                                                                                                             |
| MMIO Write to Scratch Register (Byte Offset=00000080) = 123456789abcdef | # [SIM] Simulation kill command received |
| [APP] MMIO Write : tid = 0x008, offset = 0x80, data = 0x123456789abcdef                                                                 | #                                                                                                                                        |
| [APP] MMIO Read : tid = 0x009, offset = 0x80                                                                                            | # Transaction count   VA VL0 VH0 VH1   MCL-1 MCL-2 MCL-4                                                                                 |
| [APP] MMIO Read Resp : tid = 0x009, data = 123456789abcdef                                                                              |                                                                                                                                          |
| Reading Scratch Register (Byte Offset=00000080) = 123456789abcdef | # MMIOWrReq 2 |
| Setting Scratch Register (Byte Offset=00000080) = 00000000 | # MMIORdReq 10 |
| [APP] MMIO Write : tid = 0x00a, offset = 0x80, data = 0x0 | # MMIORdRsp 10 |
| [APP] MMIO Read : tid = 0x00b, offset = 0x80 | # IntrReq 0 |
| [APP] MMIO Read Resp : tid = 0x00b, data = 0 | # IntrResp 0 |
| Reading Scratch Register (Byte Offset=00000080) = 00000000 | # RdReq 0 0 0 0 0 0 0 0 |
| Done Running Test | # RdResp 0 0 0 0 0 |
| [APP] Deinitializing simulation session | # WrReq 0 0 0 0 0 0 0 0 |
| [APP] Closing Watcher threads                                                                                                           | # WrResp 0 0 0 0 0 0 0 0 0                                                                                                               |
| [APP] Deallocating UMAS                                                                                                                 | # WrFence 0 0 0 0 0 0                                                                                                                    |
| [APP] Deallocating memory /umas.187070359034322                                                                                         | # WrFenRsp 0 0 0 0 0 0                                                                                                                   |
|                                                                                                                                         |                                                                                                                                          |
| [APP] SUCCESS | # |
| [APP] Deallocating MMIO map | # ** Note: \$finish : /home/student/fpga_trn/AccelStack_Workshop/hello_afu/build_sim/rtl/cci |
| [APP] Deallocating memory /mmio.187070359034322 | 4) |
| [APP] SUCCESS | # Time: 21620047500 ps Iteration: 2 Instance: /ase_top/ccip_emulator |
| [APP] Deallocate all buffers | # End time: 12:34:40 on Aug 21,2018, Elapsed time: 0:28:57 |

#### AFU Simulator Window (server)

#### Application SW Window (client)
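The client log above reads the AFU ID registers and then exercises a scratch register with a write, a read-back, and a clear. A minimal sketch of that sequence in C follows, assuming a mock register file in place of real hardware or the simulator. The names `mock_afu_t`, `mock_afu_init`, `mmio_read64`, `mmio_write64`, and `scratch_test` are hypothetical and are not OPAE API calls; only the offsets and values come from the log.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Byte offsets taken from the log above. */
#define AFU_ID_LO   0x08
#define AFU_ID_HI   0x10
#define AFU_SCRATCH 0x80

#define MOCK_NUM_REGS 32   /* 32 x 8 bytes = 256 bytes of register space (assumed size) */

/* Hypothetical stand-in for the AFU's MMIO space (not an OPAE type). */
typedef struct {
    uint64_t regs[MOCK_NUM_REGS];
} mock_afu_t;

/* Initialise the mock with the AFU ID values seen in the log. */
void mock_afu_init(mock_afu_t *afu)
{
    memset(afu, 0, sizeof *afu);
    afu->regs[AFU_ID_LO / 8] = 0x9722d43375b61c66ULL;
    afu->regs[AFU_ID_HI / 8] = 0x850adcc26ceb4b22ULL;
}

/* 64-bit read at an 8-byte-aligned byte offset. */
uint64_t mmio_read64(const mock_afu_t *afu, uint32_t offset)
{
    assert(offset % 8 == 0 && offset / 8 < MOCK_NUM_REGS);
    return afu->regs[offset / 8];
}

/* 64-bit write at an 8-byte-aligned byte offset. */
void mmio_write64(mock_afu_t *afu, uint32_t offset, uint64_t value)
{
    assert(offset % 8 == 0 && offset / 8 < MOCK_NUM_REGS);
    afu->regs[offset / 8] = value;
}

/* Write a pattern to the scratch register, read it back, then clear it.
 * Returns 1 if the read-back matched and the register reads 0 after the clear. */
int scratch_test(mock_afu_t *afu, uint64_t pattern)
{
    mmio_write64(afu, AFU_SCRATCH, pattern);
    int ok = (mmio_read64(afu, AFU_SCRATCH) == pattern);
    mmio_write64(afu, AFU_SCRATCH, 0);
    return ok && mmio_read64(afu, AFU_SCRATCH) == 0;
}
```

Calling `scratch_test(&afu, 0x123456789abcdefULL)` on an initialised mock mirrors the write, read, and clear steps that appear as tid 0x008 through 0x00b in the log.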


## AFU Development Flow Using OPAE SDK

#### Generate the AF build environment:

- cd \$OPAE\_PLATFORM\_ROOT/hw/samples/hello\_afu
- afu\_synth\_setup --source hw/rtl/filelist.txt build\_synth

#### Generate the AF

- cd build\_synth
- \$OPAE\_PLATFORM\_ROOT/bin/run.sh





### Using the Quartus GUI

Compiling the AFU uses a command-line-driven partial reconfiguration (PR) compilation flow

The flow builds the PR-region AF as a .gbs file to be loaded onto the OPAE hardware platform

Can use the Quartus GUI for the following types of work:

- Viewing compilation reports
- Interactive Timing Analysis
- Adding SignalTap instances and nodes



## Acceleration Stack Demo

Lab 3












### Programmable Acceleration Cards (PAC)

#### Intel<sup>®</sup> Arria<sup>®</sup> 10 Accelerator Card



**Broadest Deployment at Lowest Power**

- 40G, PCIe\* Gen3 x8
- ½ length, ½ height, single-slot PCIe card
- Lowest power: 66 W TDP

#### Intel Stratix<sup>®</sup> 10 Accelerator Card

**Highest Performance and Throughput**

- 2x 100G, PCIe Gen3 x16
- ¾ length, full height, dual-slot PCIe card
- Up to 225 W maximum

# INTEL® FPGA ACCELERATION HUB

A new collection of software, firmware, and tools that allows all developers to leverage the power of Intel® FPGAs.

#### Intel<sup>®</sup> portal for all things related to FPGA acceleration

- Acceleration Stack for Intel<sup>®</sup> Xeon<sup>®</sup> with FPGAs
- FPGA Acceleration Platforms
- Acceleration Solutions & Ecosystem
- Knowledge Center
- FPGA as a Service
- 01.org\*



# NEXT STEPS

#### Follow-On Courses

Introduction to Cloud Computing

Introduction to High Performance Computing (HPC)

Introduction to Apache<sup>™</sup> Hadoop

Introduction to Apache Spark™

Introduction to Kafka™

Introduction to Intel® FPGAs for Software Developers

Introduction to the Acceleration Stack for Intel® Xeon® CPU with FPGA

Application Development on the Acceleration Stack for Intel® Xeon® CPU with FPGAs

Building RTL Workloads for the Acceleration Stack for Intel® Xeon® CPU with FPGAs

OpenCL<sup>™</sup> Development with the Acceleration Stack for Intel® Xeon® CPU with FPGA

Intel FPGA OpenCL Trainings and HLS Trainings

https://www.intel.com/content/www/us/en/programmable/support/training/overview.html



### **Teaching Resources**

University-focused content & curriculum

- Semester-long laboratory exercises for hands-on learning with solutions
- Tutorials and online workshops for self-study on key use cases
- Free library of IP common for student projects
- Example designs and sample projects
- Easy-to-use, powerful software tools
  - Quartus Prime CAD Environment
  - ModelSim
  - Intel FPGA Monitor Program for assembly & C development
  - Intel<sup>®</sup> SDK for OpenCL<sup>™</sup> Applications
  - Intel OpenVINO<sup>™</sup> toolkit (Visual Inference & Neural Network Optimization)



### **Teaching Resources (cont.)**

#### Hardware designed for education

- 4 different FPGA kits with a variety of peripherals to match project needs
- Compact designs with robust shielding to provide longevity
- Reduced academic prices (range: \$55-\$275)
- Donations available in some circumstances

#### Support

- Total access to all developer resources
  - Documentation
  - Design examples
  - Support forum
  - Virtual or on-demand trainings



#### DE-Series Development Boards









- **DE10-Standard**: Cyclone V FPGA + SoC, \$259
- **DE1-SOC**: Cyclone V FPGA + SoC, \$175
- **DE10-Nano**: Cyclone V FPGA + SoC, \$99
- **DE10-Lite**: Max 10 FPGA, \$55

Visit our <u>website</u> for full specs on these boards.

See the full catalog of Intel FPGA boards & kits at <u>www.terasic.com</u>.



|                                                 | Beginner FPGA Dev Kit | FPGA+SoC Academic Dev Kit |                                                    | Full-Featured<br>Academic Dev Kit                  |
|-------------------------------------------------|-----------------------|---------------------------|----------------------------------------------------|----------------------------------------------------|
| Dev Kit                                         | INTEL DE10-LITE       | INTEL DE10-NANO           | INTEL DE1-SOC                                      | INTEL DE10-STANDARD                                |
| Academic Price                                  | \$55                  | \$99                      | \$175                                              | \$259                                              |
| FPGA                                            | Max <sup>®</sup> 10   | Cyclone® V                | Cyclone® V                                         | Cyclone® V                                         |
| Logic Elements                                  | 50,000                | 110,000                   | 85,000                                             | 110,000                                            |
| ARM Cortex-A9 Dual-Core<br>System-on-Chip (SoC) | ×                     | 800 MHz                   | 925 MHz                                            | 925 MHz                                            |
| Memory                                          | 64 MB SDRAM           | 1 GB DDR3 SDRAM (HPS)     | 1 GB DDR3 SDRAM (HPS), 64 MB<br>SDRAM (FPGA)       | 1 GB DDR3 SDRAM (HPS),<br>64 MB SDRAM (FPGA)       |
| PLLs                                            | 4                     | 9                         | 9                                                  | 9                                                  |
| GPIO Count                                      | 500                   | 469                       | 469                                                | 469                                                |
| 7 Segment Displays                              | 6                     | ×                         | 6                                                  | 6                                                  |
| Switches                                        | 10                    | 4                         | 10                                                 | 10                                                 |
| Buttons                                         | 2                     | 2                         | 4                                                  | 4                                                  |
| LEDs                                            | 10                    | 8                         | 10                                                 | 10                                                 |
| Clocks                                          | (2x) 50 MHz           | (3x) 50 MHz               | (4x) 50 MHz                                        | (4x) 50 MHz                                        |
| GPIO Headers                                    | 40-pin header         | (2x) 40-pin header        | (2x) 40-pin header                                 | 40-pin header                                      |
| Video Out                                       | VGA 12-bit DAC        | HDMI                      | VGA 24-bit DAC                                     | VGA 24-bit DAC                                     |
| ADC Channels                                    | ×                     | 8                         | 8 + programmable voltage range                     | 8 + programmable voltage range                     |
| Video In                                        | ×                     | ×                         | NTSC, PAL, Multi-format                            | NTSC, PAL, Multi-format                            |
| Audio In/Out                                    | ×                     | ×                         | Line In/Out, Microphone In (24 bit<br>Audio CODEC) | Line In/Out, Microphone In<br>(24 bit Audio CODEC) |
| Ethernet                                        | ×                     | Gigabit                   | 10/100/1000 Ethernet (x1)                          | 10/100/1000 Ethernet (x1)                          |
| USB OTG                                         | ×                     | 1x USB OTG                | 2x USB 2.0 (Type A)                                | 2x USB 2.0 (Type A)                                |
| LCD                                             | ×                     | ×                         | ×                                                  | 128x64 backlit                                     |
| Micro SD Card Support                           | ×                     | ✓                         | ✓                                                  | ✓                                                  |
| Accelerometer                                   |                       | ✓                         | ✓                                                  | ✓                                                  |
| PS/2 Mouse/Keyboard Port                        | ×                     | ×                         | ✓                                                  | ✓                                                  |
| Infrared                                        | ×                     | ×                         | ✓                                                  | ✓                                                  |
| HSMC Header                                     | ×                     | ×                         | ×                                                  | ✓                                                  |
| Arduino Header                                  | ✓                     | ✓                         | ×                                                  | ×                                                  |

### Undergrad Lab Exercise Suites: Digital Logic

First digital hardware course in EE, CompEng or CS curriculum

Traditionally introduced sophomore year

Offered in VHDL or Verilog

- Lab 1 Switches, Lights, and Multiplexers
- Lab 2 Numbers and Displays
- Lab 3 Latches, Flip-flops, and Registers
- Lab 4 Counters
- Lab 5 Timers and Real-Time Clock
- Lab 6 Adders, Subtractors, and Multipliers

- Lab 7 Finite State Machines
- Lab 8 Memory Blocks
- Lab 9 A Simple Processor
- Lab 10 An Enhanced Processor
- Lab 11 Implementing Algorithms in Hardware
- Lab 12 Basic Digital Signal Processing

### Undergrad Lab Exercise Suites: Comp Organization

Typically second hardware course in EE, CompEng or CS curriculum

Introduction to microprocessors & assembly language programming

Uses the ARM processor (on SoC kits) or the Nios II soft processor

Intel FPGA Monitor Program for compiling & debugging assembly & C code

- Lab 1 - Using an ARM Cortex-A9 System or Nios II System
- Lab 2 - Using Logic Instructions with the ARM Processor
- Lab 3 - Subroutines and Stacks
- Lab 4 - Input/Output in an Embedded System
- Lab 5 - Using Interrupts with Assembly Code
- Lab 6 - Using C code with the ARM Processor
- Lab 7 - Using Interrupts with C code
- Lab 8 - Introduction to Graphics and Animation

## Intel FPGA MONITOR PROGRAM

Design environment used to compile, assemble, download & debug programs for ARM\* Cortex\* A9 processor in Intel's Cyclone® V SoC FPGA devices

- Compile programs, specified in assembly language or C, and download the resulting machine code into the hardware system
- Display the machine code stored in memory
- Run the ARM processor, either continuously or by single-stepping instructions
- Modify the contents of processor registers
- Modify the contents of memory, as well as memory-mapped registers in I/O devices
- Set breakpoints that stop the execution of a program at a specified address, or when certain conditions are met

Clean and simple UX

Tutorials at fpgauniversity.intel.com

Download independently or as part of University Program Installer (always free!)



### Undergrad Lab Exercise Suites: Embedded Systems

Typically third hardware course in EE, CompEng or CS curriculum

Combines hardware and software

Introduction to embedded Linux

Lab 1 - Getting Started with Linux

Lab 2 - Developing Linux Programs that Communicate with the FPGA

Lab 3 - Character Device Drivers

Lab 4 - Using Character Device Drivers

Lab 5 - Using ASCII Graphics for Animation

Lab 6 - Introduction to Graphics and Animation

Lab 7 - Using the ADXL345 Accelerometer

Lab 8 - Audio and an Introduction to Multithreaded Applications



#### Lab Exercise Suites: Machine Learning Basics

Machine Learning on FPGAs

Senior or grad-level course in EE, CompEng, CS or data science curriculum

Teaches how to use the Intel<sup>®</sup> SDK for OpenCL<sup>™</sup> Applications with FPGAs

Basic understanding of AI fundamentals recommended\*

- Lab 1 – Introduction to OpenCL
- Lab 2 – Image Processing
- Lab 3 – Lane Detection for Autonomous Driving
- Lab 4 – Linear Classifier for Handwritten Digits
- Lab 5 – Neural Networks
- Lab 6 – Using the Deep Learning Accelerator Library
- Lab 7 – Integrating OpenCL Accelerators into Existing Software

\*For foundational AI & Machine Learning curriculums, visit our partner program Intel AI Academy



#### AI Academy Course Outline

Runs in Cloud on Arria 10 PAC card

Contains Slides, Lab exercises, and recordings for each class

https://software.intel.com/en-us/ai-academy/students/kits/dl-inference-fpga

- Class 1 - Introduction to FPGAs for deep learning inferencing
- Class 2 - Building a deep learning computer vision application w/ Acceleration
- Class 3 - Introduction to the OpenVINO<sup>™</sup> toolkit
- Class 4 - Introduction to the Deep Learning Accelerator Suite for Intel FPGAs
- Class 5 - Introduction to the Acceleration Stack for Intel Xeon CPU with FPGAs

- Lab 1 - Deploy an application on an Intel CPU using DL framework
- Lab 2 - Deploy an application on an Intel CPU using the OpenVINO toolkit
- Lab 3 - Accelerate the application on an Intel FPGA



#### In-Person Workshops

Throughout the year our technical outreach team visits universities and industry conferences around the world to conduct hands-on workshops that train professors and students on how to use Intel FPGAs for education and research.

#### **Topics:**

Intro to FPGAs and Quartus (4 hrs.)

High-Speed IO (4 hrs.)

Static Timing Analysis of Digital Circuits (4 hrs.)

Simulation & Debug (4 hrs.)

Embedded Linux (4 hrs.)

Embedded Design using Nios II (4 hrs.)

High-level Synthesis (4 hrs.)

Machine Learning Acceleration (4 hrs.)

Modern Applications of FPGAs (1 hr.)

How to Get Hired in the Tech Industry (1 hr.)

Contact us at FPGAUniversity@intel.com to inquire about scheduling a workshop



#### Find Materials:

## FPGAUniversity.INTEL.com


#### Educational Materials

Our educational materials include tutorials, laboratory exercises, IP cores, example computer systems, and software. They are intended for use in courses on digital logic, computer organization, and embedded systems.

Available Materials:

- Tutorials
- Laboratory Exercises
- IP Cores
- Computer Systems
- Software
- External Links



#### Membership:

## FPGAUniversity.INTEL.com


#### Membership Overview

The Intel® FPGA University Program offers donations of licenses for software and intellectual property (IP), and donations of FPGA hardware. To submit a donation request, you must be registered as a member of the Intel FPGA University Program. Membership is available to faculty and staff of universities and colleges. To enroll in the program or to make requests, click on the links to the online forms below.

Students are ineligible to be members. But students can download and use Quartus<sup>®</sup> Prime Lite Edition software free of charge, and can purchase boards at the academic price from Terasic\* Technologies.

#### **Online forms:**

- Enrollment Request: Become a member
- License Request: Free academic licenses for software and IP
- Hardware Request: Board and device donations
- Purchase Request: Boards and devices at academic prices
- My Account: View your profile and request history



#### Contact the University Team

#### Rebecca Nevin

Outreach Manager, Intel FPGA University Program

<u>rebecca.l.nevin@intel.com</u>



#### Larry Landis

Senior Manager, New User Experience Group

lawrence.landis@intel.com





# BACKUP

**GPU Comparison**

# How do GPUs Deal With Fine-Grained Data Sharing?

Some GPU techniques involve implicit SIMT synchronization



FPGA threads aren't warp-locked, so implicit sync doesn't make sense

FPGAs do exactly what you ask them to do the way you code it



# An Even Closer Look: CUDA Execution Model



|                          | FERMI<br>GF100<br>SM | FERMI<br>GF104<br>SM | KEPLER<br>GK104<br>SMX | KEPLER<br>GK110<br>SMX | MAXWELL<br>GM107<br>SMM |
|--------------------------|----------------------|----------------------|------------------------|------------------------|-------------------------|
| Compute Capability       | 2.0                  | 2.1                  | 3.0                    | 3.5                    | 5.0                     |
| Shared Memory/SM         | 48KB                 | 48KB                 | 48KB                   | 48KB                   | 64KB                    |
| 32-bit Registers/SM      | 32768                | 32768                | 64K                    | 64K                    | 64K                     |
| Max Threads/Thread Block | 1024                 | 1024                 | 1024                   | 1024                   | 1024                    |
| Max Thread Blocks/SM     | 8                    | 8                    | 16                     | 16                     | 32                      |
| Max Threads/SM           | 1536                 | 1536                 | 2048                   | 2048                   | 2048                    |
| Threads/Warp             | 32                   | 32                   | 32                     | 32                     | 32                      |
| Max Warps/SM             | 48                   | 48                   | 64                     | 64                     | 64                      |
| Max Registers/Thread     | 63                   | 63                   | 63                     | 255                    | 255                     |
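The limits in the table are related by simple arithmetic: Max Warps/SM equals Max Threads/SM divided by Threads/Warp (1536 / 32 = 48 for Fermi, 2048 / 32 = 64 for Kepler and Maxwell). The sketch below, in C, applies the same limits to estimate how many warps can be resident on one SM for a given thread-block size. `sm_limits_t` and `resident_warps` are hypothetical names for illustration, and the estimate deliberately ignores register and shared-memory pressure, either of which can lower the result further.

```c
#include <assert.h>

/* Per-SM limits as listed in the table above (hypothetical struct name). */
typedef struct {
    int max_threads_per_sm;
    int max_blocks_per_sm;
    int threads_per_warp;
} sm_limits_t;

/* Warps resident on one SM for a given thread-block size,
 * considering only the thread-count and block-count limits. */
int resident_warps(sm_limits_t sm, int threads_per_block)
{
    assert(threads_per_block > 0 && sm.threads_per_warp > 0);

    /* A block occupies whole warps, so round up. */
    int warps_per_block =
        (threads_per_block + sm.threads_per_warp - 1) / sm.threads_per_warp;
    int max_warps = sm.max_threads_per_sm / sm.threads_per_warp;

    int blocks = max_warps / warps_per_block;   /* thread-count limit */
    if (blocks > sm.max_blocks_per_sm)
        blocks = sm.max_blocks_per_sm;          /* block-count limit */

    return blocks * warps_per_block;
}
```

With 64-thread blocks, the Fermi limits give 8 blocks of 2 warps, or 16 of 48 possible warps, while the Maxwell limits give 32 blocks of 2 warps, or 64 of 64. Under these assumptions, small blocks leave a Fermi SM underfilled because the block-count limit is reached first.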

[Figure: CUDA execution model. The CUDA scheduler assigns thread blocks to SMs. Block diagram of a Maxwell GM107 SMM showing warp schedulers, dispatch units, register file, cores, LD/ST units, SFUs, texture cache, and 64 KB shared memory / L1 cache.]



### **FPGA Execution Model**





## Divergent Control Flow on GPU

#### **Single instruction**

- Work items run in lockstep, even when they take different branches
- Divergent branches are serialized
- Major performance factor

#### GPU uses SIMT pipeline to save area on control logic



**CPUs offer branch prediction** 

<pre>
for (i = 0; i &lt; N; i++) {
    if (x[i] &lt; y[i])
        foo();
    else
        bar();
}
</pre>
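A minimal C sketch of why divergence costs time on a SIMT pipeline. It assumes a simplified model: a lock-stepped group of lanes pays for every branch body that at least one lane takes. The name `simt_branch_cycles` and the per-branch cycle counts are hypothetical and only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Cycles one lock-stepped group needs for a single if/else.
 * Simplified model (an assumption of this sketch): lanes that agree run
 * together, and the two branch bodies are serialized when lanes disagree. */
unsigned simt_branch_cycles(const int *x, const int *y, size_t lanes,
                            unsigned foo_cycles, unsigned bar_cycles)
{
    int any_taken = 0;      /* some lane wants foo() */
    int any_not_taken = 0;  /* some lane wants bar() */
    unsigned cycles = 0;

    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < lanes; i++) {
        if (x[i] < y[i])
            any_taken = 1;
        else
            any_not_taken = 1;
    }

    /* Each branch body that is needed by any lane is executed once. */
    if (any_taken)
        cycles += foo_cycles;
    if (any_not_taken)
        cycles += bar_cycles;
    return cycles;
}
```

With four lanes that all satisfy `x[i] < y[i]`, the group pays only for `foo()`. As soon as one lane disagrees, it pays for `foo()` plus `bar()`.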












## **External Memory Dynamic Coalescing**

For CPUs and GPUs, the cache and memory controller handle coalescing

For FPGAs, we create dynamic coalescing hardware matched to the characteristics of the specific memory the device is connected to

- Re-order memory accesses at runtime to exploit data locality
- DDR is extremely inefficient at random access
- Access with row bursts whenever possible
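The effect of re-ordering can be sketched in C by counting DRAM row activations for an access stream before and after it is grouped by row. This is a software model only. The row size `ROW_BYTES` is a hypothetical value, and a full sort stands in for the bounded runtime re-ordering that real coalescing hardware would perform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define ROW_BYTES 1024u  /* hypothetical DRAM row size, for illustration */

/* qsort comparator for unsigned long addresses. */
static int cmp_addr(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x > y) - (x < y);
}

/* Count how often the open row changes while serving addr[0..n-1] in order. */
size_t row_activations(const unsigned long *addr, size_t n)
{
    size_t count = 0;
    unsigned long open_row = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned long row = addr[i] / ROW_BYTES;
        if (i == 0 || row != open_row) {
            count++;           /* a new row must be opened */
            open_row = row;
        }
    }
    return count;
}

/* Re-order the accesses in place so same-row accesses are adjacent,
 * then count activations. Sorting is a stand-in for hardware re-ordering. */
size_t row_activations_reordered(unsigned long *addr, size_t n)
{
    qsort(addr, n, sizeof addr[0], cmp_addr);
    return row_activations(addr, n);
}
```

For a stream that alternates between two rows, such as `0, 4096, 8, 4104, 16, 4112`, the in-order count is six activations and the re-ordered count is two.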





# **On-chip FPGA Memory**

"Local" memory uses on-chip block RAM resources

- Very high bandwidth (8 TB/s)
- Random access in 2 cycles
- Limited capacity

The memory system is customized to your application

- Huge value proposition over fixed-architecture accelerators

Banking configuration (number of banks, width), and interconnect all customized for your kernel

- Automatically optimized to eliminate or minimize access contention

Key idea: Let the compiler minimize bank contention

- If your code is optimized for another architecture (e.g. array[tid + 1] to avoid bank collisions), undo the fixed-architecture workarounds
- Such workarounds can prevent the compiler from inferring the optimal memory structure



### **FPGA Local Memory**



Split memory into logical banks

- An N-bank configuration can handle N requests per clock cycle, as long as each request addresses a different bank
- Manipulate memory addresses so that parallel threads are likely to access different banks, which reduces collisions
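A small C model of the banking rule above, assuming word-interleaved banks (bank = word address modulo the number of banks) and one request per bank per cycle. The names `service_cycles` and `column_access_cycles`, and the `MAX_BANKS` limit, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BANKS 64  /* arbitrary upper bound for this sketch */

/* Cycles needed to serve n parallel requests: the largest number of
 * requests that land on the same bank (each bank serves one per cycle). */
unsigned service_cycles(const unsigned *word_addr, size_t n, unsigned num_banks)
{
    unsigned hits[MAX_BANKS] = {0};
    unsigned worst = 0;

    assert(num_banks > 0 && num_banks <= MAX_BANKS);
    for (size_t i = 0; i < n; i++) {
        unsigned bank = word_addr[i] % num_banks;
        if (++hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}

/* Each thread t reads column 0 of row t in a 2-D array whose rows are
 * row_pitch words apart; returns the cycles needed for that access. */
unsigned column_access_cycles(unsigned threads, unsigned row_pitch,
                              unsigned num_banks)
{
    unsigned addr[MAX_BANKS];

    assert(threads <= MAX_BANKS);
    for (unsigned t = 0; t < threads; t++)
        addr[t] = t * row_pitch;
    return service_cycles(addr, threads, num_banks);
}
```

With 8 threads and 8 banks, a row pitch of 8 words sends every request to bank 0 and takes 8 cycles, while a pitch of 9 spreads the requests over all banks and takes 1 cycle. The pitch change only illustrates how address mapping affects contention. On the FPGA, the earlier advice to let the compiler choose the banking still applies.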



## Local Memory Attributes

Annotations added to local memory variables to improve throughput or reduce area

Banking control:

- numbanks
- bankwidth

Port control:

- numreadports/numwriteports
- singlepump/doublepump



### numbanks(N) and bankwidth(N) memory attribute

#### What does it do?

Specifies the banking geometry for your local memory system

A bank = single independent memory system

#### What is it for?

Can be used to optimize LSU-to-memory connectivity in an effort to boost performance

Banking should be set up to maximize "stall-free" accesses
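To see how a banking geometry maps accesses to banks, here is a C sketch under one assumption: banks are interleaved in units of the bank width, so bank = (byte offset / bank width) mod number of banks. The helper names `bank_of` and `lmem_bank` are hypothetical. The attribute spelling in the comment is recalled from the Intel FPGA SDK for OpenCL documentation and should be checked there.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bank that a byte offset falls in, assuming banks are interleaved in
 * units of bank_width_bytes (an assumption of this sketch). */
unsigned bank_of(size_t byte_offset, unsigned num_banks,
                 unsigned bank_width_bytes)
{
    assert(num_banks > 0 && bank_width_bytes > 0);
    return (unsigned)((byte_offset / bank_width_bytes) % num_banks);
}

/* Bank holding element [row][col] of a 32-bit array with 4 columns,
 * i.e. a model of something like (syntax to be verified):
 *   local int __attribute__((numbanks(8), bankwidth(16))) lmem[8][4];
 */
unsigned lmem_bank(unsigned row, unsigned col, unsigned num_banks,
                   unsigned bank_width_bytes)
{
    size_t byte_offset = ((size_t)row * 4 + col) * sizeof(int32_t);
    return bank_of(byte_offset, num_banks, bank_width_bytes);
}
```

With 8 banks that are 16 bytes wide, each 4-element row sits in its own bank, so parallel accesses to different rows do not collide in this model. With 4-byte banks, `lmem[0][1]` and `lmem[2][1]` both map to bank 1 and would contend.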












### numreadports/numwriteports and singlepump/doublepump memory attribute

#### What does it do?

numreadports/numwriteports: specifies the number of read and write ports in the local memory system

singlepump/doublepump: specifies the pumping of the local memory system (1x or 2x clock)

#### What is it for?

Controls the number of memory blocks used to implement the local memory system
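A rough C sketch of how port count and pumping could translate into memory block replicas. It rests on assumptions not stated in the slides: a physical block offers 2 ports when single-pumped and 4 when double-pumped, every replica must receive all writes, and the remaining ports serve reads. The function `replicas_needed` is hypothetical, and the actual compiler may decide differently.

```c
#include <assert.h>

/* Replicas needed so every read has a port, under the assumptions above.
 * Returns 0 when the writes alone use up all ports of a block, which this
 * simple model cannot satisfy. */
unsigned replicas_needed(unsigned reads, unsigned writes, int double_pumped)
{
    unsigned ports = double_pumped ? 4u : 2u;  /* assumed ports per block */
    unsigned reads_per_replica;

    assert(reads > 0);
    if (writes >= ports)
        return 0;  /* no port left for reads in this model */

    /* Every replica sees all writes; leftover ports are used for reads. */
    reads_per_replica = ports - writes;
    return (reads + reads_per_replica - 1) / reads_per_replica;  /* ceiling */
}
```

In this model, three reads and one write need three single-pumped replicas but only one double-pumped replica. That is the area-versus-clocking trade-off these attributes expose.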






