#author("2020-06-22T04:54:21+00:00","default:osana","osana")
[[OpenFC - an Open FPGA Cluster Toolkit]]
As described in [[OpenFC Architecture]], custom Stream PEs can be written in C++ with Vivado HLS. This page describes how to design and implement a custom Stream PE (SPE).
* Input, Output and Timing considerations [#a8fd9723]
- Stream ports
-- For all FPGA cards, the base design has 2 router ports to the SPE. This means an SPE can have up to 2 input and 2 output stream ports.
-- Each port is 64 bits wide, and this width cannot be changed.
- Data frame headers
-- The SPE always receives the "Length" field of the data frame header before the payload.
-- The SPE doesn't see the routing header of the incoming data frame.
-- The SPE must transmit the "Length" field before transmitting any output frame payload. If there are remaining routing header words, the router prepends them to the Length field.
-- The SPE can add (or extend) routing header words by sending them before the Length field.
- Currently, the Router and SPE are driven at 250MHz. In future versions, the frequency may be changed for faster FPGA cards.
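The word order described above can be sketched in plain C++ as follows. This is only an illustration, not part of OpenFC: the helper names make_spe_frame() and with_routing() are hypothetical, routing words are treated as opaque 64-bit values, and Length is assumed to count 64-bit payload words, as it does in the examples on this page.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: the word sequence an SPE sends for one output frame.
// Assumption: Length counts 64-bit payload words.
std::vector<uint64_t> make_spe_frame(const std::vector<uint64_t>& payload) {
    std::vector<uint64_t> frame;
    frame.reserve(payload.size() + 1);
    frame.push_back(static_cast<uint64_t>(payload.size())); // Length comes first
    frame.insert(frame.end(), payload.begin(), payload.end()); // then the payload
    return frame;
}

// Hypothetical helper: routing header words added by the SPE are sent
// before the Length field. Their encoding is not modeled here.
std::vector<uint64_t> with_routing(const std::vector<uint64_t>& routing,
                                   const std::vector<uint64_t>& frame) {
    std::vector<uint64_t> out(routing);
    out.insert(out.end(), frame.begin(), frame.end());
    return out;
}
```

For a 3-word payload, make_spe_frame() yields 4 words with the value 3 in front; with_routing() then places any extra routing words ahead of that Length word.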
* Writing, Synthesizing and Wrapping a simple SPE [#qd96d4c7]
Sample code for SPEs is found in src/examples. Here is the code of "vec-accum.cc".
#geshi(c++,number){{
#include <stdint.h>
#include "hls_stream.h"
void vec_accum(hls::stream<uint64_t>& in, hls::stream<uint64_t>& out){
#pragma HLS INTERFACE axis register both port=in
#pragma HLS INTERFACE axis register both port=out
uint64_t len, sum=0;
len = in.read();
for(uint64_t i=0; i<len; i++) sum += in.read();
out.write(1); // output length is always 1
out.write(sum);
}
}}
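Before synthesis, it can help to check the expected behavior in plain C++. The following is a minimal sketch that does not need the Vivado HLS headers: WordStream is a hypothetical stand-in for hls::stream<uint64_t> that only offers read() and write(), and vec_accum_model() repeats the logic of vec_accum() above.

```cpp
#include <cstdint>
#include <queue>

// Hypothetical stand-in for hls::stream<uint64_t>.
// Assumption: only read() and write() are needed for this check.
struct WordStream {
    std::queue<uint64_t> q;
    uint64_t read() { uint64_t v = q.front(); q.pop(); return v; }
    void write(uint64_t v) { q.push(v); }
};

// Same logic as vec_accum(), without the HLS pragmas.
void vec_accum_model(WordStream& in, WordStream& out) {
    uint64_t len = in.read();      // Length field arrives first
    uint64_t sum = 0;
    for (uint64_t i = 0; i < len; i++) sum += in.read();
    out.write(1);                  // output Length is always 1
    out.write(sum);                // single payload word
}
```

Feeding the words 3, 1, 2, 3 (Length followed by three payload words) produces the output words 1 and 6.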
To synthesize this code as an SPE for the KC705 card:
+ Launch Vivado HLS and create a new project.
+ Add the file as source code and choose "vec_accum()" as the top-level function.
+ Select XC7K325T-FFG900-2 as the target device (if your target card has a different FPGA, choose the matching part) and set the target period to 4.0, because the SPE runs at 250MHz. Setting a larger uncertainty, such as 0.8, is also recommended.
+ Run C synthesis and Export RTL to get the IP core.
Once instantiated, the resulting module interface should look like this:
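The numbers in the third step can be checked with a little arithmetic. The sketch below assumes both values are in nanoseconds (the Vivado HLS default unit) and that the uncertainty is subtracted from the target period to give the delay budget used for scheduling; the function names are hypothetical.

```cpp
// Clock period in nanoseconds for a frequency given in MHz.
constexpr double period_ns(double mhz) { return 1000.0 / mhz; }

// Hypothetical helper: delay budget left after subtracting the
// clock uncertainty margin (assumed to be in nanoseconds).
constexpr double budget_ns(double period, double uncertainty) {
    return period - uncertainty;
}
```

With these definitions, period_ns(250.0) gives the 4.0 target period, and budget_ns(4.0, 0.8) leaves roughly 3.2 ns for logic in each cycle.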
#geshi(verilog,number){{
module vec_accum_0
(
.ap_clk (),
.ap_rst_n(),
.ap_start(),
.ap_done (),
.ap_idle (),
.ap_ready(),
.in_V_TDATA (),
.in_V_TVALID(),
.in_V_TREADY(),
.out_V_TDATA (),
.out_V_TVALID(),
.out_V_TREADY()
);
}}
#ref("remove_pe.png",right,around);
This fits into src/pe-base/wrappers/axis-1r1w.v. To make it work:
+ Set up the base project following [[Quickstart with Xilinx KC705]] (use a different .tcl file if your target isn't KC705)
+ In the module hierarchy, remove or disable "pe : pe (pe-pass.v)"
+ Press Alt+a (or File -> Add Sources), then add src/pe-base/wrappers/axis-1r1w.v to your project
+ Find your HLS-generated SPE core in the IP catalog and instantiate it
-- This procedure is described in the [[Xilinx Vivado HLS tutorial>https://www.xilinx.com/support/documentation/sw_manuals/xilinx2019_1/ug871-vivado-high-level-synthesis-tutorial.pdf]]
+ Change "CHANGE_ME" in axis-1r1w.v to your SPE instance name
+ Now you can synthesize and generate the bitstream. Write the generated bitstream to your card and reboot the FPGA host; you will then be able to access the SPE :)
#clear
* Handling non-integer data types [#ied87c99]
The stream interface is 64 bits wide, and the example above handles unsigned 64-bit integers. However, data is not always integer, and there is often a need for non-64-bit data types such as float, char or even structs. This section gives a brief programming guide for types other than uint64_t.
** 64bit data types: adding double-precision vectors [#r11642d6]
The first example is double, the 64-bit floating-point type. In this case, there are 2 major problems.
- The payload arrives as double, but the length header is a 64-bit integer, so we have to handle both data types in a single data stream.
- Vivado HLS allows only basic data types for the top-level module interface (we can't use struct or union).
So, the input/output streams are declared as double, and the length header is read through a union named "ud_t" in the example. This kind of type conversion is useful for 64-bit data types.
vecadd-double.cc:
#geshi(c++,number){{
#include <stdint.h>
#include "hls_stream.h"
typedef hls::stream<double> my_str;
typedef union {
uint64_t u;
double d;
} ud_t;
void vecadd_double(my_str& in1, my_str& in2, my_str& out1){
#pragma HLS INTERFACE axis register both port=in1
#pragma HLS INTERFACE axis register both port=in2
#pragma HLS INTERFACE axis register both port=out1
ud_t len;
len.d = in1.read();
in2.read(); // discard in2's length: in1 and in2 must have exactly the same length
out1.write(len.d); // output length = input length;
for(uint64_t i=0; i<len.u; i++){
#pragma HLS PIPELINE
double a, b, x;
a = in1.read(); b = in2.read();
x = a+b;
out1.write(x);
}
}
}}
Note: a host code example will come soon.
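The bit-level reinterpretation that the ud_t union performs can also be tried in plain C++. The following is a minimal sketch and not the OpenFC host API: it uses std::memcpy instead of a union, and make_double_frame() is a hypothetical helper that lays out a frame for a stream of doubles, with the integer Length carried in the first word bit for bit.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits");

// Reinterpret 64 bits as a double, and back, without changing any bit.
double bits_to_double(uint64_t u) { double d; std::memcpy(&d, &u, sizeof d); return d; }
uint64_t double_to_bits(double d) { uint64_t u; std::memcpy(&u, &d, sizeof u); return u; }

// Hypothetical helper: Length (an integer) followed by the double payload,
// all stored as 64-bit words of type double.
std::vector<double> make_double_frame(const std::vector<double>& payload) {
    std::vector<double> frame;
    frame.reserve(payload.size() + 1);
    frame.push_back(bits_to_double(static_cast<uint64_t>(payload.size())));
    frame.insert(frame.end(), payload.begin(), payload.end());
    return frame;
}
```

Reading the first word back with double_to_bits() recovers the integer length, which is what the SPE does with len.u after assigning len.d.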
** Non-64bit data types: adding single-precision vectors [#x3e2d4e9]
The next example adds vectors of float, the 32-bit single-precision type. In this case, each word of the input/output streams carries 2 values, so 2 independent scalar additions are executed in every loop iteration. To split one 64-bit word into two 32-bit words, the arbitrary precision types of the Vivado HLS C++ library are quite convenient.
For example, the type "ap_uint<64>" is an unsigned 64-bit integer (the bit width is not limited to multiples of 8: other widths are possible too). A useful feature of the ap_uint class is the range() method, which gives access to a specific bit range of a variable. The integer <-> floating-point conversion works the same way as in the previous double-precision example, which makes this code somewhat trickier than the previous one.
vecadd-float.cc:
#geshi(c++,number){{
#include <stdint.h>
#include "hls_stream.h"
#include "ap_int.h"
typedef hls::stream<ap_uint<64> > my_str;
typedef union {
uint32_t u;
float f;
} uf_t;
void vecadd_float(my_str& in1, my_str& in2, my_str& out1){
#pragma HLS INTERFACE axis register both port=in1
#pragma HLS INTERFACE axis register both port=in2
#pragma HLS INTERFACE axis register both port=out1
uint64_t len;
len = in1.read();
in2.read(); // discard in2's length: in1 and in2 must have exactly the same length
out1.write(len); // output length = input length;
for(uint64_t i=0; i<len; i++){
#pragma HLS PIPELINE
ap_uint<64> au, bu, xu;
uf_t a[2], b[2], x[2];
au = in1.read();
bu = in2.read();
a[0].u = au.range(31,0); b[0].u = bu.range(31,0);
a[1].u = au.range(63,32); b[1].u = bu.range(63,32);
x[0].f = a[0].f + b[0].f;
x[1].f = a[1].f + b[1].f;
xu.range(31,0) = x[0].u; xu.range(63,32) = x[1].u;
out1.write(xu);
}
}
}}
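The packing used by vecadd_float() can be modeled in plain C++ with shifts and masks, for example to prepare or check test data. This is a sketch under assumptions: the function names are hypothetical, std::memcpy replaces the uf_t union, and the layout follows the code above (bits 31..0 hold element 0, bits 63..32 hold element 1).

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret a float as its 32-bit pattern, and back.
uint32_t float_to_bits(float f) { uint32_t u; std::memcpy(&u, &f, sizeof u); return u; }
float bits_to_float(uint32_t u) { float f; std::memcpy(&f, &u, sizeof f); return f; }

// Pack two floats into one 64-bit stream word: lo in bits 31..0, hi in bits 63..32.
uint64_t pack2(float lo, float hi) {
    return (static_cast<uint64_t>(float_to_bits(hi)) << 32) | float_to_bits(lo);
}
float unpack_lo(uint64_t w) { return bits_to_float(static_cast<uint32_t>(w & 0xFFFFFFFFu)); }
float unpack_hi(uint64_t w) { return bits_to_float(static_cast<uint32_t>(w >> 32)); }

// Software model of one loop iteration: two independent float additions.
uint64_t add_packed(uint64_t a, uint64_t b) {
    return pack2(unpack_lo(a) + unpack_lo(b), unpack_hi(a) + unpack_hi(b));
}
```

Note that with this layout the Length field counts 64-bit words, so a vector of N floats occupies N/2 payload words.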