#author("2019-09-03T08:57:12+00:00","default:osana","osana")
[[OpenFC - an Open FPGA Cluster framework]]
* System overview [#u714fb67]
#ref(overview.png,around,right);
OpenFC provides:
- PCIe DMA (host-FPGA communication)
- Serial transceivers (FPGA-FPGA communication)
- ICAP (internal configuration access port, for loading/unloading Stream PEs)
- and a router that combines all of the above with user-designed Stream PEs
Users of OpenFC can design and load their own Stream PE accelerator modules, written in HLS or, of course, RTL. Multiple FPGAs can be connected together to enable large-scale accelerated computing.
On the host, simple APIs are provided so that user programs can communicate with the Stream PEs on the FPGA(s).
#clear
* Routing basics [#h178ccc1]
#ref(routing.png,around,right);
To communicate with Stream PEs, the host program transmits one or more data frames via PCIe DMA. Each data frame is routed according to its routing header, which is typically generated by the host program. A Stream PE consumes the payload part of the frame and generates a new payload containing the result of its computation.
A data frame is a stream of 64-bit words, composed of the following three parts (a concrete sketch follows the list):
+ Routing header: one or more words of the form 64'h0100_0000_xxxx_xxxx, where xxxx_xxxx is the destination port number on the router.
-- On arrival at a router, the first word of the routing header is "consumed" to choose the destination port. To describe a route of multiple hops, multiple routing header words are arranged in sequence. A routing header of any length is allowed.
-- Typically, the host program prepares the routing header all the way to the final destination (i.e., back to the host itself). Alternatively, Stream PEs can append routing header words for adaptive/flexible routing.
+ Length: the number of 64-bit words that follow as the payload. The minimum length is 1 word; the maximum is 2^32-1 words.
+ The payload.
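For illustration, here is a minimal sketch of a complete frame laid out in host memory. The ROUTING_HEADER constant is derived here from the 64'h0100_0000_xxxx_xxxx encoding above, and the port numbers are arbitrary examples; in real host code the constant is presumably provided by the OpenFC headers.
#geshi(c,number){{
#include <stdint.h>

// Assumed here from the encoding above: the upper 32 bits 0x0100_0000
// tag a word as a routing header; the lower 32 bits hold the port number.
#define ROUTING_HEADER 0x0100000000000000ULL

// A frame with a two-hop route and a 3-word payload.
uint64_t frame[] = {
    ROUTING_HEADER | 1, // hop 1: router port 1 (e.g., a Stream PE)
    ROUTING_HEADER | 6, // hop 2: router port 6 (e.g., back to PCIe)
    3,                  // length: 3 payload words follow
    10, 20, 30          // payload
};
}}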
** Simple send/receive example [#sf2a3208]
The following host C code generates a data frame and sends it to a Stream PE. The Stream PE output is routed back to PCIe, so the host receives the computation result.
#geshi(c,number){{
// Buffer allocation: vec_len 64-bit words each
// (vec_len, HEADER_MAX and handles are assumed to be set up elsewhere)
uint64_t* out = (uint64_t*)buf_alloc(vec_len*8);
uint64_t* in  = (uint64_t*)buf_alloc(vec_len*8);
// Header: route to port 1 (Stream PE), then port 6 (PCIe), then the length
uint64_t header[HEADER_MAX];
header[0] = ROUTING_HEADER | 1; // hop 1: Stream PE
header[1] = ROUTING_HEADER | 6; // hop 2: PCIe (back to the host)
header[2] = vec_len;            // payload length in 64-bit words
buf_set_header((uint64_t*)out, header, 3);
// Generate payload
for(int i=0; i<vec_len; i++) out[i] = i;
// Send the frame, then receive the Stream PE output
buf_send_async(handles.fd_o1, (uint64_t*)out, vec_len*8);
buf_recv(handles.fd_i, (uint64_t*)in, vec_len*8);
}}
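In this example, the first header word routes the frame to router port 1, where the Stream PE consumes the length and payload and produces its output; the remaining header word then routes the resulting frame to port 6, i.e., back to PCIe, where buf_recv on the host picks it up.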
** Simple 64-bit integer vector accumulate example with Vivado HLS [#m7a60976]
The following Vivado HLS C++ code implements a Stream PE that computes the sum of a 64-bit integer vector. Note that the routing header is not delivered to the Stream PE: the PE sees only the length word and the payload.
For details about hls::stream or the #pragma HLS directives in this code, please refer to [[Xilinx UG902: High-level synthesis>https://www.xilinx.com/support/documentation/sw_manuals/xilinx2019_1/ug902-vivado-high-level-synthesis.pdf]]. For more on writing custom Stream PEs, proceed to [[Building custom Stream PE with Vivado HLS]].
#geshi(c,number){{
#include <hls_stream.h>
#include <stdint.h>

void vec_accum(hls::stream<uint64_t>& in, hls::stream<uint64_t>& out){
#pragma HLS INTERFACE axis register both port=in
#pragma HLS INTERFACE axis register both port=out
  uint64_t len, sum=0;
  len = in.read();                                // first word: payload length
  for(uint64_t i=0; i<len; i++) sum += in.read(); // accumulate the payload
  out.write(1);   // output length is always 1
  out.write(sum); // output payload: the vector sum
}
}}
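Before synthesis, such a PE can be checked with Vivado HLS C simulation. The following testbench is a minimal sketch (the vector length and values are arbitrary); it assumes only the frame format described above, i.e., a length word followed by the payload:
#geshi(c,number){{
#include <hls_stream.h>
#include <stdint.h>
#include <stdio.h>

void vec_accum(hls::stream<uint64_t>& in, hls::stream<uint64_t>& out);

int main(){
  hls::stream<uint64_t> in, out;
  const uint64_t len = 16;

  in.write(len);                             // length word
  for(uint64_t i=0; i<len; i++) in.write(i); // payload: 0, 1, ..., 15

  vec_accum(in, out);

  uint64_t out_len = out.read();             // expected: 1
  uint64_t sum     = out.read();             // expected: 0+1+...+15 = 120
  printf("out_len=%llu sum=%llu\n",
         (unsigned long long)out_len, (unsigned long long)sum);
  return (out_len == 1 && sum == 120) ? 0 : 1;
}
}}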