Diff of Building custom Stream PE with Vivado HLS

#author("2020-06-22T04:54:21+00:00","default:osana","osana")
[[OpenFC - an Open FPGA Cluster Toolkit]]

As described in [[OpenFC Architecture]], custom Stream PEs can be described in C++ with Vivado HLS. This page describes how to design and implement a custom Stream PE (SPE).

* Input, Output and Timing considerations  [#a8fd9723]

- Stream ports
-- For all FPGA cards, the base design has 2 router ports connected to the SPE. This means an SPE can have up to 2 input and 2 output stream ports.
-- Each port is 64 bits wide, and this width cannot be changed.
- Data frame headers
-- The SPE always receives the "Length" field of the data frame header before the payload.
-- The SPE doesn't see the routing header of the incoming data frame.
-- The SPE must transmit the "Length" field before transmitting any output frame payload. If there are remaining routing header words, the router prepends them to the Length field.
-- The SPE can add routing header words (or extend the routing header) by sending them before the Length field.
- Currently, the router and the SPE are driven at 250MHz. In future versions, the frequency may be changed for faster FPGA cards.
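
The frame layout described above can be modelled in plain C++ without any HLS headers. This is only an illustrative sketch: the names Frame, make_frame() and prepend_route() are hypothetical, the Length field is assumed to count 64-bit payload words (as the examples on this page do), and routing header words are treated as opaque values because their encoding is not covered here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Frame = std::vector<uint64_t>;  // one element per 64-bit stream word

// Frame as the SPE sees it: the Length word first, then the payload.
// Assumption: Length counts 64-bit payload words.
Frame make_frame(const std::vector<uint64_t>& payload) {
  Frame f;
  f.reserve(payload.size() + 1);
  f.push_back(payload.size());
  f.insert(f.end(), payload.begin(), payload.end());
  return f;
}

// An SPE may extend the route by sending extra words before Length.
// The values in route_words are opaque placeholders in this sketch.
Frame prepend_route(const std::vector<uint64_t>& route_words,
                    const Frame& frame) {
  assert(!frame.empty());  // a frame carries at least its Length word
  Frame f(route_words);
  f.insert(f.end(), frame.begin(), frame.end());
  return f;
}
```

With a 3-word payload, make_frame() returns 4 words, and prepend_route() with one routing word returns 5 words with the Length word in the second position.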

* Writing, Synthesizing and Wrapping a simple SPE [#qd96d4c7]

Sample code for SPEs can be found in src/examples. Here is the code of "vec-accum.cc".

#geshi(c++,number){{
#include <stdint.h>
#include "hls_stream.h"

void vec_accum(hls::stream<uint64_t>& in, hls::stream<uint64_t>& out){
#pragma HLS INTERFACE axis register both port=in
#pragma HLS INTERFACE axis register both port=out

  uint64_t len, sum=0;
  len = in.read();

  for(uint64_t i=0; i<len; i++) sum += in.read();

  out.write(1);  // output length is always 1
  out.write(sum);
}
}}
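
To check the expected frame behaviour on a PC before synthesis, the same logic can be written against a standard container. The following is a minimal sketch, not part of the OpenFC sources: Stream and vec_accum_model() are hypothetical names, and std::deque merely stands in for hls::stream.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

using Stream = std::deque<uint64_t>;  // stand-in for hls::stream<uint64_t>

// Software model of vec_accum(): consume one frame, emit one frame.
void vec_accum_model(Stream& in, Stream& out) {
  assert(!in.empty());       // the Length word must be present
  uint64_t len = in.front();
  in.pop_front();
  assert(in.size() >= len);  // the whole payload must be available

  uint64_t sum = 0;
  for (uint64_t i = 0; i < len; i++) {
    sum += in.front();
    in.pop_front();
  }

  out.push_back(1);    // output length is always 1
  out.push_back(sum);  // single payload word: the accumulated sum
}
```

For an input frame of 3, 10, 20, 30 the model leaves the two words 1 and 60 in the output stream.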

To synthesize this code as an SPE for the KC705 card,
+ Launch Vivado HLS and create a new project.
+ Add the file as source code and choose "vec_accum()" as the top-level function.
+ Select XC7K325T-FFG900-2 as the target device (if your target card has a different FPGA, choose the right one), and set the target period to 4.0 because the SPE runs at 250MHz (I also recommend setting a larger uncertainty, such as 0.8).
+ Run C synthesis and Export RTL, then you'll get the IP core.
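
The period and uncertainty values in the steps above follow from simple arithmetic, sketched here with the hypothetical helpers period_ns() and budget_ns(). The sketch assumes that the tool schedules logic against the clock period minus the uncertainty, so a larger uncertainty leaves more margin for place and route.

```cpp
#include <cassert>

// Clock period in nanoseconds for a clock frequency given in MHz.
double period_ns(double freq_mhz) {
  assert(freq_mhz > 0.0);
  return 1000.0 / freq_mhz;
}

// Delay budget per clock cycle after subtracting the uncertainty margin.
double budget_ns(double period, double uncertainty) {
  assert(uncertainty < period);
  return period - uncertainty;
}
```

For 250MHz, period_ns() gives 4.0 ns, and an uncertainty of 0.8 leaves a budget of about 3.2 ns per cycle.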

The instantiation template of the HLS-generated module should look like this:

#geshi(verilog,number){{
   module vec_accum_0 
     (
      .ap_clk  (),
      .ap_rst_n(),
      .ap_start(),
      .ap_done (),
      .ap_idle (),
      .ap_ready(),

      .in_V_TDATA (),
      .in_V_TVALID(),
      .in_V_TREADY(),
      .out_V_TDATA (),
      .out_V_TVALID(),
      .out_V_TREADY()
      );
}}

#ref("remove_pe.png",right,around);

This fits into src/pe-base/wrappers/axis-1r1w.v. To make it work,
+ Set up the base project following [[Quickstart with Xilinx KC705]] (use a different .tcl file if your target isn't KC705)
+ In the module hierarchy, remove or disable "pe : pe (pe-pass.v)"
+ Press Alt+a (or File -> Add Sources), then add src/pe-base/wrappers/axis-1r1w.v to your project
+ Find your HLS-generated SPE core in the IP catalog and instantiate it
-- This procedure is described in [[Xilinx Vivado HLS tutorial>https://www.xilinx.com/support/documentation/sw_manuals/xilinx2019_1/ug871-vivado-high-level-synthesis-tutorial.pdf]]
+ Change "CHANGE_ME" in axis-1r1w.v to your SPE instance name
+ Now you can synthesize and generate the bitstream. Write the generated bitstream to your card and reboot the FPGA host, then you'll be able to access the SPE :)

#clear
* Handling non-integer data types [#ied87c99]

The stream interface is 64 bits wide, and the example above handles unsigned 64bit integers. However, data is not always integer, and there is often a need for non-64bit data types such as float, char or even structs. This section is a brief programming guide for types other than uint64_t.

** 64bit data types: adding double-precision vectors [#r11642d6]

The first example is double, the 64bit floating-point type. In this case, there are 2 major problems.
- The payload is double, but the length header is a 64bit integer. So we have to handle both data types in a single data stream.
- Vivado HLS allows only basic data types for the top-level module interface (we can't use a struct or union)

So, the input/output streams are declared as double, and the length header is read through a union named "ud_t" in the example. This method of type conversion is useful for 64bit data types.

vecadd-double.cc:
#geshi(c++,number){{
#include <stdint.h>
#include "hls_stream.h"

typedef hls::stream<double> my_str;
typedef union {
  uint64_t u;
  double d;
} ud_t;

void vecadd_double(my_str& in1, my_str& in2, my_str& out1){
#pragma HLS INTERFACE axis register both port=in1
#pragma HLS INTERFACE axis register both port=in2
#pragma HLS INTERFACE axis register both port=out1

  ud_t len;
  len.d = in1.read();
  in2.read(); // in1 and in2 must have exactly the same length
  out1.write(len.d);  // output length = input length;

  for(uint64_t i=0; i<len.u; i++){
#pragma HLS PIPELINE
    double a, b, x;
    a = in1.read(); b = in2.read();
    x = a+b;
    out1.write(x);
  }
}
}}
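
The key point of the union is that the length word is reinterpreted bit for bit, not converted numerically. The sketch below shows the same reinterpretation in plain C++ using memcpy. The helper names are hypothetical and this is not the OpenFC host code; it only illustrates what value travels on a stream declared as double.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(uint64_t), "double must be 64 bits");

// Reinterpret the 64 bits of an integer as a double without changing them.
double bits_to_double(uint64_t u) {
  double d;
  std::memcpy(&d, &u, sizeof d);
  return d;
}

// Inverse direction: recover the raw 64 bits carried by a double.
uint64_t double_to_bits(double d) {
  uint64_t u;
  std::memcpy(&u, &d, sizeof u);
  return u;
}

// Length word as it would appear on a stream of doubles.
double encode_length(uint64_t len) {
  double d = bits_to_double(len);
  assert(double_to_bits(d) == len);  // the bit pattern must be preserved
  return d;
}
```

Note that encode_length(3) is not the value 3.0: it is the double whose bit pattern is the integer 3, which is what len.u in vecadd_double() reads back.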

Note: host code example will come soon.

** Non-64bit data types: adding single-precision vectors [#x3e2d4e9]

The next example adds vectors of 32bit single-precision float. In this case, each word of the input/output streams carries 2 values, so 2 independent scalar additions are executed in every loop iteration. To split one 64bit word into two 32bit words, the arbitrary precision types of the Vivado HLS C++ library are quite convenient.

For example, type "ap_uint<64>" means an unsigned 64bit integer (the bit width is not limited to multiples of 8: any width is possible). An interesting feature of the ap_uint class is the range() method, which gives access to a specific bit range of a variable. The integer <-> floating point conversion is the same as in the previous double-precision example, so this code is somewhat trickier than the previous one.

vecadd-float.cc:
#geshi(c++,number){{
#include <stdint.h>
#include "hls_stream.h"
#include "ap_int.h"

typedef hls::stream<ap_uint<64> > my_str;
typedef union {
  uint32_t u;
  float f;
} uf_t;

void vecadd_float(my_str& in1, my_str& in2, my_str& out1){
#pragma HLS INTERFACE axis register both port=in1
#pragma HLS INTERFACE axis register both port=in2
#pragma HLS INTERFACE axis register both port=out1

  uint64_t len;
  len = in1.read();
  in2.read(); // in1 and in2 must have exactly the same length
  out1.write(len);  // output length = input length;

  for(uint64_t i=0; i<len; i++){
#pragma HLS PIPELINE
    ap_uint<64> au, bu, xu;
    uf_t a[2], b[2], x[2];

    au = in1.read();
    bu = in2.read();

    a[0].u = au.range(31,0);  b[0].u = bu.range(31,0);
    a[1].u = au.range(63,32); b[1].u = bu.range(63,32);

    x[0].f = a[0].f + b[0].f;
    x[1].f = a[1].f + b[1].f;

    xu.range(31,0) = x[0].u;  xu.range(63,32) = x[1].u;

    out1.write(xu);
  }
}
}}
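
The word layout used by vecadd_float() can be reproduced in plain C++ with shifts and memcpy, which is handy for preparing or checking data without the HLS library. This is a hedged sketch with hypothetical helper names. It follows the bit ranges of the example above (element 0 in bits 31..0, element 1 in bits 63..32); how a vector with an odd number of elements should be padded is not specified on this page, so words_for_floats() only computes the word count.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

uint32_t float_bits(float f) {
  uint32_t u;
  std::memcpy(&u, &f, sizeof u);
  return u;
}

float bits_float(uint32_t u) {
  float f;
  std::memcpy(&f, &u, sizeof f);
  return f;
}

// Element 0 goes to bits 31..0, element 1 to bits 63..32.
uint64_t pack2(float lo, float hi) {
  return static_cast<uint64_t>(float_bits(lo)) |
         (static_cast<uint64_t>(float_bits(hi)) << 32);
}

float unpack_lo(uint64_t w) {
  return bits_float(static_cast<uint32_t>(w & 0xFFFFFFFFu));
}

float unpack_hi(uint64_t w) {
  return bits_float(static_cast<uint32_t>(w >> 32));
}

// Software model of one loop iteration of vecadd_float().
uint64_t add_packed(uint64_t a, uint64_t b) {
  return pack2(unpack_lo(a) + unpack_lo(b), unpack_hi(a) + unpack_hi(b));
}

// Number of 64-bit words needed to carry n floats (rounded up).
uint64_t words_for_floats(uint64_t n) {
  assert(n != UINT64_MAX);  // n + 1 would overflow
  return (n + 1) / 2;
}
```

For example, adding the packed pairs (1.5, 2.5) and (0.5, 0.5) with add_packed() gives a word that unpacks to 2.0 and 3.0.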