
Exploring Many-Task Emulator (MTE) for Simulation in Plural Architecture
Dive into the realm of Many-Task Emulator (MTE) and its application in the simulation of plural architecture. Discover the functionalities of MTE in handling multiple tasks, the setup process, task segmentation, and more. Uncover the intricacies of creating task graphs, managing various task types, and prioritizing high-impact tasks within this innovative emulation environment.
The Plural Architecture: Simulation using Many-Task Emulator (MTE)
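Simulation on laptop using MTE
  - Task graph: {A, then B || C}
  - MTE emulator (P=1)
      - Issues tasks based on dependencies
      - Reconstructs time line
  [Figure: time lines from start to end. The emulator runs A, B and C one at a time and records tA, tB and tC; the real Plural execution runs A (tA), then B (tB) and C (tC) in parallel.]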
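The time-line reconstruction described on this slide can be sketched as follows. This is only an illustration, not MTE's actual algorithm: the `struct task` layout and the function name `reconstruct_timeline` are hypothetical, and the sketch assumes enough cores for all ready tasks to run at once and no scheduling overhead.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPS 4

/* Hypothetical task record (not an MTE type): a duration measured during
 * the serial emulated run, plus the indices of the tasks it depends on. */
struct task {
    long   duration;          /* cycles recorded for this task */
    size_t ndeps;             /* number of predecessors */
    size_t deps[MAX_DEPS];    /* indices of tasks that must finish first */
};

/* Tasks must be listed in topological order (each dependency index is
 * smaller than the task's own index). Fills finish[] with the reconstructed
 * finish time of every task and returns the overall end time. */
long reconstruct_timeline(const struct task *tasks, size_t n, long *finish)
{
    long makespan = 0;
    for (size_t i = 0; i < n; i++) {
        long start = 0;
        assert(tasks[i].ndeps <= MAX_DEPS);
        for (size_t d = 0; d < tasks[i].ndeps; d++) {
            size_t dep = tasks[i].deps[d];
            assert(dep < i);
            if (finish[dep] > start)
                start = finish[dep];   /* wait for the slowest predecessor */
        }
        finish[i] = start + tasks[i].duration;
        if (finish[i] > makespan)
            makespan = finish[i];
    }
    return makespan;
}
```

For the graph {A, then B || C} with tA = 10, tB = 30 and tC = 20, the serial run takes 60 cycles, while the reconstructed parallel end time is tA + max(tB, tC) = 40.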
MTE (Many-Task Emulator)
  Plural Manycore (e.g. RC64):
    - e.g. 64 processors
    - Parallel
    - e.g. DSP
    - Instruction cache (I$)
    - Data cache (D$)
    - Local Memory (LM)
    - Hardware Scheduler
    - Shared memory (e.g. 4MB)
    - I/O
    - Accelerators
    - Host control & monitor
  MTE (entries in the order they appear on the slide):
    - programmable, from 1 to any number
    - Emulated serial execution
    - X86
    - None. May be emulated by user SW
    - Emulated, managed by MTE
    - Unlimited. May be emulated by user SW
    - Emulated
    - None. May be emulated by user SW
    - None
Tasks
  Regular task
    - Sequential, single instance
    - Returns 1/0 (true/false) token (may be ignored)
  Duplicable task
    - Sequential, many concurrent instances
    - Quota set/changed by program (or set in task.map)
    - Instance number available to instance code
  Dummy task
    - Unallocated, useful for token algebra
  task.map: file specifying the task graph
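To illustrate how a duplicable task might use its instance number, here is a minimal sketch in which each instance sums its own slice of an array. The names `sum_chunk_func`, `run_all_instances`, `QUOTA` and `DATA_LEN` are hypothetical, and the serial loop is only a stand-in for the scheduler, which on MTE would issue the instances itself.

```c
#include <assert.h>

#define DATA_LEN 1000
#define QUOTA    8               /* hypothetical number of instances */

static int  data[DATA_LEN];
static long partial[QUOTA];      /* one result slot per instance */

/* Duplicable-style task body: the instance number selects the slice. */
void sum_chunk_func(unsigned int instance)
{
    assert(instance < QUOTA);
    unsigned int chunk = (DATA_LEN + QUOTA - 1) / QUOTA;  /* ceiling division */
    unsigned int begin = instance * chunk;
    unsigned int end   = begin + chunk;
    if (end > DATA_LEN)
        end = DATA_LEN;
    long sum = 0;
    for (unsigned int i = begin; i < end; i++)
        sum += data[i];
    partial[instance] = sum;     /* instances never write the same slot */
}

/* Serial stand-in for the scheduler: run every instance, then combine. */
long run_all_instances(void)
{
    long total = 0;
    for (unsigned int inst = 0; inst < QUOTA; inst++)
        sum_chunk_func(inst);
    for (unsigned int inst = 0; inst < QUOTA; inst++)
        total += partial[inst];
    return total;
}
```

Because every instance writes only its own `partial` slot, the instances do not depend on one another and could run in any order or concurrently.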
Other tasks
  High priority task
    - Pre-empts other tasks on a core
    - For handling I/O etc.
    - On termination, send software event interrupt to scheduler
Task graph segments (in task.map)
  Task declarations:
    duplicable TASKNAME [QUOTA]
    regular TASKNAME
    dummy TASKNAME
  [Figure: example graph segments that connect duplicable, regular and dummy tasks (dummy TASKNAME-1, dummy TASKNAME-2, regular TASKNAME) through OR, AND and OR-AND dependencies; some edges are labeled False.]
Setting up MTE
  - In the BIOS, make sure that Intel/AMD Virtualization Technology is enabled (there may be one or two options; possibly everything starting with V)
  - Install Oracle VM VirtualBox from https://www.virtualbox.org/wiki/Downloads
      - Also download the extension (if the installer does not offer to do so)
      - Virtual Box Manager (VBM): file-->preferences-->extensions (add package button), click on the obvious item, install
      - After starting the Virtual Box Manager, you may need to disable Display 3D acceleration (on the VBM home page) if you get such a warning during login
  - Get the 4GB virtual machine file MTE-RC-ubuntu-20161103.OVA from this link (https://technionmail-my.sharepoint.com/personal/ran_technion_ac_il/_layouts/15/guestaccess.aspx?guestaccesstoken=%2bPodu8tTL3%2bey82NJJDqdaW2RuScRtyuP4siMKZIi8g%3d&docid=0705a53c88b064fed81322dbc3ae389d3&rev=1)
  - Import the VM into VirtualBox
      - VBM: file-->import appliance --> select MTE-RC-ubuntu-20161103.OVA, Import
  - Set up sharing with your Windows host file system (HFS)
      - VBM: Settings (button), Shared Folders, Add button (+), select your directory (can repeat for many directories), check Auto-mount
  - Start the VM
      - VBM: Start (green arrow button)
  - Login
      - User: ramon-users
      - Password: ramon
  - Start eclipse
New project in MTE
  - Eclipse Project Explorer (EPE): right click, new project; select wizard C/C++, C/C++ project, NEXT
  - Enter project name; select Project type: Makefile: Empty project; select Toolchains: Linux GCC; Finish
  - Select the new project, right click, Import, General: File System, NEXT
  - Either:
      - Browse to /usr/local/ramon-chips/examples/template_emulator_project (or pulldown)
      - Select Makefile, task.map, source/source.c, Finish
  - Or:
      - Browse to a HFS archive
      - Select Makefile, task.map[*], source/*.c, *.h, Finish
Execute a project
  - EPE (Eclipse Project Explorer), select project, right click, Close Unrelated Projects
  - EPE, select project, right click, Clean Project
  - EPE, select project, right click, Build Project
  - Watch the Console for errors and warnings
  - Run button (>) or right click, Run As
  - EPE, select project, right click, Refresh (F5)
  - Peruse rc_utilization.csv and rc64.log
Simple do-nothing example
  Task graph (trailing comments give each task's cycle count):
    dummy d()                        // 0
    regular A(d || cnt==false)       // 10
    duplicable B(A) 2000             // 15
    duplicable C(B) 2500             // 20
    duplicable D(A) 2600             // 25
    duplicable E(C && D) 2300        // 30
    regular cnt(E)                   // 5
    regular F(cnt==true)             // 35
  [Figure: task graph. d starts A; A enables B and D; B enables C; C and D enable E; E enables cnt; cnt returns to A on false and enables F on true.]
  source.c:
    #include <stdio.h>

    int round_counter = 0;

    int A_func (void)
    {
        set_current_task_time_cycles(10);
        printf("start parallel\n");
        return 1;   /* regular tasks return a 1/0 token */
    }

    void B_func (unsigned int instance) { set_current_task_time_cycles(15); }
    void C_func (unsigned int instance) { set_current_task_time_cycles(20); }
    void D_func (unsigned int instance) { set_current_task_time_cycles(25); }
    void E_func (unsigned int instance) { set_current_task_time_cycles(30); }

    int F_func (void)
    {
        set_current_task_time_cycles(35);
        printf("end parallel\n");
        return 1;
    }

    /* Returns false (0) for the first three rounds and true (1) on the fourth. */
    int cnt_func(void)
    {
        set_current_task_time_cycles(5);
        round_counter++;
        if (round_counter < 4) { return 0; } else { return 1; }
    }
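As a sanity check on this example, the total serial cycle count can be worked out by hand from the quotas and per-task cycle settings on the slide. The sketch below does that arithmetic; the helper name `simple_serial_cycles` and the `struct stage` type are made up for illustration and are not part of MTE.

```c
#include <assert.h>
#include <stddef.h>

/* One entry per task that runs in every round: instance count and cycles. */
struct stage {
    long instances;
    long cycles;
};

/* Serial (one-core) cycle count for the do-nothing example, assuming the
 * only cost is what each task passes to set_current_task_time_cycles(). */
long simple_serial_cycles(int rounds)
{
    /* A, B, C, D, E, cnt: quotas and cycle counts taken from the slide. */
    static const struct stage per_round[] = {
        {1, 10}, {2000, 15}, {2500, 20}, {2600, 25}, {2300, 30}, {1, 5}
    };
    const long f_cycles = 35;    /* F runs once, after the last round */
    long round_total = 0;

    assert(rounds > 0);
    for (size_t i = 0; i < sizeof per_round / sizeof per_round[0]; i++)
        round_total += per_round[i].instances * per_round[i].cycles;
    return rounds * round_total + f_cycles;
}
```

Each round costs 214,015 cycles, so four rounds plus F give 856,095 cycles, which agrees with the one-core Tp reported in the speedup table for this example.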
Simple: utilization chart
  [Figure: utilization chart showing 4 rounds of tasks A, B, D, C, E, F.]
Changing number of processing cores
  A command line argument to MTE: -cores=NUMBER
Simple: Speedup & Efficiency on 1-1024 cores
  - Pull down the Run As menu
  - Select Run Configurations
  - Go to the (x)= Arguments tab
  - Type cores=256 or any P
  - Rerun, refresh and record the new Tp

    P      Tp        SU    EFF
    1      856,095   1     1.00
    2      428,096   2     1.00
    4      214,096   4     1.00
    8      107,136   8     1.00
    16     53,641    16    1.00
    32     26,881    32    1.00
    64     13,496    63    0.99
    128    6,816     126   0.98
    256    3,456     248   0.97
    512    1,836     466   0.91
    1024   1,021     838   0.82
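The SU and EFF columns follow the usual definitions, speedup SU = T1 / Tp and efficiency EFF = SU / P, where T1 is the one-core time. A small sketch for recomputing them from recorded Tp values is shown below; the function names are illustrative only.

```c
#include <assert.h>

/* Speedup: one-core time divided by the time on P cores. */
double speedup(long t1, long tp)
{
    assert(t1 > 0 && tp > 0);
    return (double)t1 / (double)tp;
}

/* Efficiency: speedup per core, 1.0 being ideal. */
double efficiency(long t1, long tp, int cores)
{
    assert(cores > 0);
    return speedup(t1, tp) / (double)cores;
}
```

With T1 = 856,095 and Tp = 1,021 on 1024 cores, the speedup is about 838 and the efficiency about 0.82, as in the last row of the table.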
Matrix Multiplication (N^2 tasks)
  Each task computes one element: $C_{i,k} = \sum_{m} A_{i,m} B_{m,k}$
  task.map:
    #define MSIZE 100
    #define MMSIZE 10000
    regular program_start()
    duplicable mm(program_start) MMSIZE
    regular program_end(mm)
  source.c:
    #include <stdio.h>

    #define MSIZE 100
    float A[MSIZE][MSIZE], B[MSIZE][MSIZE], C[MSIZE][MSIZE];

    int program_start_func (void)
    {
        /* read / generate input matrices */
        return 1;
    }

    /* One instance per output element; id runs from 0 to MSIZE*MSIZE-1. */
    void mm_func(unsigned int id)
    {
        int i, k, m;
        float sum = 0;
        i = id % MSIZE;   /* row index */
        k = id / MSIZE;   /* column index */
        for (m = 0; m < MSIZE; m++)
            sum += A[i][m] * B[m][k];
        C[i][k] = sum;
    }

    int program_end_func (void)
    {
        printf("finished mm\n");
        return 1;
    }
Force my own estimated run times
    #include <stdio.h>
    #include <stdlib.h>

    #define MSIZE 100
    #define MUL_TIME 1
    #define ADD_TIME 1
    #define LDST_TIME 5
    #define DIV_TIME 5

    float A[MSIZE][MSIZE], B[MSIZE][MSIZE], C[MSIZE][MSIZE];

    int program_start_func (void)
    {
        /* read / generate input matrices */
        set_current_task_time_cycles(10);
        return 1;
    }

    void mm_func(unsigned int id)
    {
        int i, k, m;
        float sum = 0;
        int runTime = 0;   /* estimated cycles for this instance */
        i = id % MSIZE;
        k = id / MSIZE;
        for (m = 0; m < MSIZE; m++) {
            sum += A[i][m] * B[m][k];
            /* estimated cost of one loop iteration */
            runTime += MUL_TIME*5 + ADD_TIME*3 + LDST_TIME*0 + DIV_TIME*0;
        }
        C[i][k] = sum;
        /* estimated cost of index computation and the final store */
        runTime += MUL_TIME*5 + ADD_TIME*4 + LDST_TIME*1 + DIV_TIME*1;
        set_current_task_time_cycles(runTime);
    }

    int program_end_func (void)
    {
        printf("finished mm\n");
        set_current_task_time_cycles(10);
        return 1;
    }
Matrix Multiplication: works well

    P      Tp          SU    Eff
    1      8,190,021   1     1.00
    2      4,095,021   2     1.00
    4      2,047,521   4     1.00
    8      1,023,771   8     1.00
    16     511,896     16    1.00
    32     256,368     32    1.00
    64     128,604     64    1.00
    128    64,722      127   0.99
    256    32,781      250   0.98
    512    16,401      499   0.98
    1024   8,211       997   0.97

  Why is SU(1024) still less than 1024?
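One plausible reading of the question on this slide is load imbalance: 10,000 equal tasks do not divide evenly over 1024 cores, so the busiest core runs ceil(10000/1024) = 10 tasks rather than the ideal 9.77. The sketch below is a simple model built on that assumption, not a description of how MTE computes Tp. The per-task cost of 819 cycles follows from the estimated run times in the previous slide's code (100 iterations of 8 cycles plus 19), while the fixed 21 cycles is inferred from the table (10 each for program_start and program_end, plus one cycle whose origin the slides do not explain).

```c
#include <assert.h>

/* Modelled time on `cores` cores for `ntasks` identical tasks:
 * the busiest core runs ceil(ntasks / cores) tasks back to back, and
 * `fixed_cycles` covers the serial work outside the parallel phase. */
long model_tp(long ntasks, long task_cycles, long fixed_cycles, long cores)
{
    assert(ntasks > 0 && cores > 0);
    long rounds = (ntasks + cores - 1) / cores;   /* ceiling division */
    return rounds * task_cycles + fixed_cycles;
}
```

Under these assumptions `model_tp(10000, 819, 21, 1024)` gives 8,211 and `model_tp(10000, 819, 21, 128)` gives 64,722, matching the table. The speedup at 1024 cores is then bounded by roughly 10000/10 = 1000, and the fixed serial cycles bring it down to about 997.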
Matrix Multiplication with only N=100 tasks

    P      Tp          SU    Eff
    1      8,140,021   1     1.00
    2      4,070,021   2     1.00
    4      2,035,021   4     1.00
    8      1,058,221   8     0.96
    16     569,821     14    0.89
    32     325,621     25    0.78
    64     162,821     50    0.78
    128    81,421      100   0.78
    256    81,421      100   0.39
    512    81,421      100   0.20
    1024   81,421      100   0.10
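The transcript does not include the code for the N=100 variant. A minimal sketch of what a coarser decomposition could look like, assuming one task instance per output row, is given below; the function name `mm_row_func` is hypothetical.

```c
#include <assert.h>

#define MSIZE 100

static float A[MSIZE][MSIZE], B[MSIZE][MSIZE], C[MSIZE][MSIZE];

/* Hypothetical coarse-grained task body: one instance computes a whole
 * row of C, so only MSIZE instances exist instead of MSIZE*MSIZE. */
void mm_row_func(unsigned int row)
{
    assert(row < MSIZE);
    for (int k = 0; k < MSIZE; k++) {
        float sum = 0;
        for (int m = 0; m < MSIZE; m++)
            sum += A[row][m] * B[m][k];
        C[row][k] = sum;
    }
}
```

With only 100 instances, at most 100 cores can be busy at once, which is consistent with Tp staying at 81,421 and SU staying at 100 from 128 cores upward in the table.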