
Modern Computer Systems Lab 2 Guide by Sang-Woo Jun
This guide covers the installation and setup of an ECP5 FPGA with Bluespec and Yosys for Lab 2 of Modern Computer Systems. It includes instructions on building and running the accelerator model, details on directory structure, application parameters, and the architecture of the accelerator. The guide emphasizes parallel computation organization for optimal performance and provides insight into the PE architecture in a simplified view.
CS250B: Modern Computer Systems
Lab 2 Guide
Sang-Woo Jun

Getting started
Install the VM with Yosys and Bluespec preinstalled, following BluespecYosysVirtualbox.pdf
Start the VirtualBox VM
Clone https://github.com/sangwoojun/ulx3s_bsv.git
The files for Lab 2 are located in ulx3s_bsv/projects/nn_fc/
Using the Lattice ECP5 FPGA
o Open-source support, very low power (<1 W)
o But also small size, slow speed
o Good fit for IoT devices?

Building and running for the first time
cd to projects/nn_fc
Type "make bsim", and then "make runsim", to build and run the simulated model of the accelerator
You will see this output. What does it mean?
o Total of 2048 element output
o Each element takes about 3,213,952 / 256 = 12,554 cycles
o Is this good? Can we make it better?
o Calculation results are close enough!

Basic directory structure
nn_fc/ includes many files, but the ones we care about are
o NnFc.bsv: Bluespec file implementing the accelerator
o cpp/nn_fc.cpp: C++ file communicating with the accelerator
You will only need to edit these two files!

The application
[Figure: Input matrix (input_cnt x input_dim) x Weight matrix (input_dim x output_dim) = Output matrix (input_cnt x output_dim); input row 0 produces output row 0]
input_cnt = 64, input_dim = 1024, output_dim = 64, defined in cpp/main.cpp
Assume these values are fixed for Lab 2!

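The matrix product on this slide can be written as a plain software reference, which is handy for checking that the accelerator's results are "close enough". This is a minimal sketch, not the code in cpp/main.cpp: the function name fc_reference and the row-major storage layout are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Reference fully connected layer: output = input x weights.
// Assumed layouts (hypothetical, not taken from the lab code):
//   input:   input_cnt rows of input_dim values, row-major
//   weights: input_dim rows of output_dim values, row-major
//   output:  input_cnt rows of output_dim values, row-major
std::vector<float> fc_reference(const std::vector<float>& input,
                                const std::vector<float>& weights,
                                std::size_t input_cnt,
                                std::size_t input_dim,
                                std::size_t output_dim) {
    std::vector<float> output(input_cnt * output_dim, 0.0f);
    for (std::size_t i = 0; i < input_cnt; ++i) {
        for (std::size_t o = 0; o < output_dim; ++o) {
            // One output element is a dot product of an input row
            // with a weight column, i.e. input_dim MAC operations.
            float sum = 0.0f;
            for (std::size_t k = 0; k < input_dim; ++k) {
                sum += input[i * input_dim + k] * weights[k * output_dim + o];
            }
            output[i * output_dim + o] = sum;
        }
    }
    return output;
}
```

With the lab's fixed sizes this would be called as fc_reference(input, weights, 64, 1024, 64). Floating-point sums accumulated in a different order on the FPGA may differ slightly, so compare with a tolerance rather than exact equality.
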
Accelerator architecture
mkNnFc, in NnFc.bsv
[Figure: block diagram of mkNnFc with the following labels]
o Weights (method Action weightIn): weight matrix held in SDRAM, distributed round robin to PE0, PE1, PE2, ...
o Input from host (method Action dataIn): broadcast (replicated) to every PE
o Output to host (method ActionValue resultGet): gathered from the PEs
Initially there are 8 PEs
Disclaimer: Design favors ease of understanding over performance

Computation organization
For 8 PEs, 8 columns are sent to completion
And then the next 8 columns, and so on (see nn_fc() in cpp/nn_fc.cpp)
[Figure: Input matrix x Weight matrix = Output matrix, with consecutive weight columns assigned to PE0, PE1, PE2, ...]
Should be parallel!

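The column ordering described on this slide can be sketched as a small schedule generator. This is an illustration only: the names ColumnAssignment and column_schedule are hypothetical, and the assumption that column c goes to PE (c mod 8) follows from the round-robin label on the architecture slide rather than from the actual nn_fc() code.

```cpp
#include <cstddef>
#include <vector>

// One entry per weight column: which group of columns ("round") it is in,
// and which PE processes it. All names here are hypothetical.
struct ColumnAssignment {
    std::size_t round;
    std::size_t pe;
    std::size_t column;
};

// Columns are handed out pe_count at a time; a round runs to completion
// before the next pe_count columns are started.
std::vector<ColumnAssignment> column_schedule(std::size_t output_dim,
                                              std::size_t pe_count) {
    std::vector<ColumnAssignment> schedule;
    if (pe_count == 0) return schedule;  // nothing can be scheduled
    for (std::size_t base = 0; base < output_dim; base += pe_count) {
        for (std::size_t pe = 0; pe < pe_count && base + pe < output_dim; ++pe) {
            schedule.push_back({base / pe_count, pe, base + pe});
        }
    }
    return schedule;
}
```

For output_dim = 64 and 8 PEs this yields 8 rounds of 8 columns each.
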
PE architecture
mkMacPe, in NnFc.bsv
Simplified view! Look at the code to understand better
[Figure: PE datapath with the following labels]
o putInput: Float value, Bit#(8) input_idx
o putWeight: Float value
o Mult, Add
o partialSumIdxQ: Bit#(8) input_idx, Bit#(8) output_idx, Bit#(16) row
o addForwardQ, partialSumQ
o Is column done?
o resultGet: Bit#(8) input_idx, Bit#(8) output_idx, Float result
Do you notice a performance issue?
One column is processed at a time, and the next add can only start after the previous one is done!

Looking at performance again
12,554 cycles per 1024 elements in a column
12+ cycles per MAC! This is not pipelined at all
o Sum of Mult and Add latency is around this much
No parallelism across PEs!
o We want this to go to ~1/8 (because we have 8 PEs)
[Simulation output, same as before]
o Total of 2048 element output
o Each element takes about 3,213,952 / 256 = 12,554 cycles
o Calculation results are close enough!

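The arithmetic behind "12+ cycles per MAC" can be reproduced in a few lines. This is only a worked calculation with hypothetical names (CycleStats, cycle_stats); the divisor 256 is taken directly from the slide, and reading it as 2048 outputs split over 8 PEs is an assumption.

```cpp
// Hypothetical helper that restates the slide's arithmetic.
struct CycleStats {
    double cycles_per_element;  // cycles spent per output element
    double cycles_per_mac;      // cycles spent per multiply-accumulate
};

// total_cycles:  cycle count reported by the simulation
// element_count: divisor used on the slide (256; presumably 2048 outputs
//                divided over 8 PEs, which is an assumption)
// input_dim:     MACs needed per output element (1024 in this lab)
CycleStats cycle_stats(double total_cycles, double element_count,
                       double input_dim) {
    CycleStats stats;
    stats.cycles_per_element = total_cycles / element_count;
    stats.cycles_per_mac = stats.cycles_per_element / input_dim;
    return stats;
}
```

Calling cycle_stats(3213952, 256, 1024) gives 12,554.5 cycles per element and roughly 12.26 cycles per MAC, matching the numbers on the slide. The same helper can be reused after each optimization to track progress.
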
How do we solve this?
I could ask you to figure it out, but...
We need to work on multiple rows, multiple columns, or both!
o Working on multiple rows: multiple input rows processed at once
o Working on multiple columns: multiple weight columns processed at once
Which one do we start with?
o Spoilers: We need to do both to get peak performance

How do we solve this?
I could ask you to figure it out, but...
SDRAM is slow on our embedded platform!
o src/Sdram.bsv has the controller implementation
o 5+ cycles to read 16 bits, 10+ cycles to read one Float!
o Counting req/resp rules, 12+ cycles per computation sounds about right
SDRAM is where weights are stored
o There is not enough bandwidth to load and work on multiple weights at once
o Let's first start working on multiple inputs at once

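Why reusing each weight across several input rows helps can be seen with back-of-the-envelope arithmetic. The sketch below is a simplification under stated assumptions: it takes the slide's "10+ cycles to read one Float" as exactly 10, ignores every other pipeline cost, and the function name sdram_cycles_per_mac is hypothetical.

```cpp
// Amortized SDRAM read cost per MAC when one Float weight is read once
// and then applied to `rows_in_flight` independent input rows.
// This models only the SDRAM read, nothing else in the pipeline.
double sdram_cycles_per_mac(double cycles_per_float_read,
                            unsigned rows_in_flight) {
    if (rows_in_flight == 0) {
        rows_in_flight = 1;  // treat "no reuse" as one row per weight
    }
    return cycles_per_float_read / rows_in_flight;
}
```

With one row per weight the read costs about 10 cycles per MAC; with 8 rows per weight the same read is spread over 8 MACs, or about 1.25 cycles each. Because the slide gives 10+ as a lower bound, these figures are lower bounds too.
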
Working on multiple inputs
Input order must change, to send more rows interleaved
[Figure: Input matrix x Weight matrix = Output matrix, shown for PE0; PE1 and onwards are not shown, but should be doing the same]
Per PE, each weight value must be applied N times, for N parallel input rows

Guide in Bluespec
Let's say we process 8 rows at once
weightInQ must be replicated 8 times
o Keep track of the weight to replicate, and how many times

Guide in Bluespec
In rule enqMac
o partialSumIdxQ needs input_idx, output_idx, row
o input_idx is given with the input, row is increased by one every MAC, and output_idx is increased by PeWays (PE count) every time row exceeds inputDim
Multiple inputs for the same output_idx now arrive as a burst
o row must increase every 8 MACs, and output_idx is also increased every 8 MACs

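The counter bookkeeping for bursts can be prototyped in software before touching the Bluespec rule. The following C++ model is one possible reading of the slide, not the contents of enqMac: the struct MacIndexCounter, its field names, and the choice to start output_idx at the PE's own index are all assumptions made for illustration.

```cpp
#include <cstdint>

// Software model of per-PE index bookkeeping when each weight is applied
// to a burst of `burst_len` interleaved input rows (8 on the slides).
// Hypothetical model; the real rule in NnFc.bsv may differ.
struct MacIndexCounter {
    uint32_t burst_len;   // parallel input rows sharing one weight
    uint32_t input_dim;   // weights per column (1024 in this lab)
    uint32_t pe_ways;     // number of PEs (PeWays)
    uint32_t burst;       // position inside the current burst
    uint32_t row;         // index of the current weight within the column
    uint32_t output_idx;  // column currently processed by this PE

    MacIndexCounter(uint32_t burst_len_, uint32_t input_dim_,
                    uint32_t pe_ways_, uint32_t pe_id)
        : burst_len(burst_len_), input_dim(input_dim_), pe_ways(pe_ways_),
          burst(0), row(0), output_idx(pe_id) {}  // assumed starting column

    // Call once per MAC issued.
    void step() {
        if (++burst < burst_len) return;  // same weight, next input row
        burst = 0;
        if (++row < input_dim) return;    // next weight in the same column
        row = 0;
        output_idx += pe_ways;            // column done: move to next round
    }
};
```

In this model row advances once per 8 MACs, and output_idx advances by the PE count once per 8 x 1024 MACs, when a whole column has been applied to all 8 interleaved rows.
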
Guide in Bluespec
At the beginning of each column, we needed to enq zero to the adder
o This is the initial partial sum
o Having only a single add request in the pipeline means it is not pipelined!
We can now have more requests in the pipeline, since multiple independent rows are processed at once

Guide in driving software
cpp/nn_fc.cpp
The hardware now assumes 8 input rows will arrive interleaved
input_cnt/8 row chunks; for each chunk, send 8 rows at once (kk)

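The interleaved input order can be sketched as a function that turns a row-major input matrix into the stream the hardware would expect. This is a sketch under assumptions, not the loop in cpp/nn_fc.cpp: the name interleave_rows and the row-major layout are hypothetical, input_cnt is assumed to be a multiple of the interleave width, and only a single pass over the input is shown. Using kk for the innermost loop mirrors the slide's hint.

```cpp
#include <cstddef>
#include <vector>

// Reorders a row-major input matrix so that `ways` rows are interleaved:
// element k of rows r, r+1, ..., r+ways-1 is sent before element k+1
// of any of them. Assumes input_cnt is a multiple of ways.
std::vector<float> interleave_rows(const std::vector<float>& input,
                                   std::size_t input_cnt,
                                   std::size_t input_dim,
                                   std::size_t ways) {
    std::vector<float> stream;
    if (ways == 0) return stream;  // no valid interleaving
    stream.reserve(input_cnt * input_dim);
    for (std::size_t chunk = 0; chunk < input_cnt / ways; ++chunk) {
        for (std::size_t k = 0; k < input_dim; ++k) {
            for (std::size_t kk = 0; kk < ways; ++kk) {
                std::size_t row = chunk * ways + kk;
                stream.push_back(input[row * input_dim + k]);
            }
        }
    }
    return stream;
}
```

For the lab's sizes this would be interleave_rows(input, 64, 1024, 8), giving 8 chunks of 8 rows each. In the real driver each value would be sent to the accelerator instead of being collected into a vector.
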
How has performance improved?
1.53 cycles per computation!
Much better, but not quite 1/8 cycles yet

How about wallclock time?
Instead of "make bsim", try "make | tee build.log"
o Takes a few minutes, creates build.log, build/hardware.bit, among others
o hardware.bit is the file programmed to the FPGA
[Screenshot of build.log with annotations: on-chip mem, CLB utilization, multiplier utilization]
Where is the critical path?
Max clock frequency of the core: 62.53 MHz
The 25 MHz clock is used only for interfaces. Ignore for now

What to do, what to submit
What to do
o Apply the changes introduced above. Measure performance in terms of cycles
o Evaluate: Can this go further, with a larger number of parallel rows?
o Evaluate: If not, what do you think is the limiting factor? Hint: src/Uart.bsv. mkUart_bsim is the module that transports data from host software to the FPGA
o Evaluate: What limits clock speed?
What to submit: 3 files!
o NnFc.bsv and nn_fc.cpp, with a best-effort implementation
o A report file with short answers to the evaluate questions above
Due: 2020-06-10