Comparison-Free Sorting Algorithm on CPUs: Performance Insights

a comparison free sorting algorithm on cpus n.w
1 / 15
Embed
Share

Explore a comparison-free sorting algorithm designed for CPUs, focusing on principles, examples, key factors, and performance simulations in single-threaded and multi-threaded environments. Learn about optimizing memory locality, reducing computations, and improving spatial and temporal locality for efficient sorting. Dive into execution time comparisons, input data transformations, and the impact of memory locality on performance.

  • CPU Sorting Algorithm
  • Performance Analysis
  • Memory Locality
  • Single-threaded Execution
  • Multi-threaded Simulation

Uploaded on | 1 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. A COMPARISON-FREE SORTING ALGORITHM ON CPUs Saleh Abdel-hafeez, Jordan (JUST) Ann Gordon-Ross, USA (UF) Samer AbuBaker, Jordan (JUST)

  2. Highlights Principle Example Potential Key Factors CPU Simulation Single Threaded (no Parallelism) C-Code (Memory Locality) Execution Time Simulations Multi-threaded (Parallelism) C-Code (Atomic and Semaphore Vs. Memory) Execution Time Simulations Conclusions

  3. Principle Example 13 INPUT BUS N=4 Numbers K= 2-bit 32 01 20 Input Buffers One-Hot Matrix (4 X 4) Binary Matrix (4 X 1) 0 1 0 0 2 0 0 1 0 0 3 1 0 0 0 0 0 1 0 1 Transposed Matrix (4 X 4) Binary Matrix (4 X 1) Sort Matrix (4 X 1) 0 0 1 0 2 3 0 2 0 0 1 0 3 1 0 0 1 0 0 1 0 0 1 0

  4. Potential Key Factors Two Representations Binary One-Hot N=2K Computations less Memory Transpose Memory Mapping Idea Reduce the size of One-Hot (NxN) to NX1 Improve Locality (Spatial and Temporal)

  5. CPU Single Thread Input Data vs. Index array 2 1 3 1 0 1 2 3 4. for i = 0 to n do 5. TM[BS[i]] = TM[BS[i]] + 1 6. endfor Input data becomes the Index Index vs. Repeated 0 2 1 1 4. for i = 0 to n do 5. if (TM[i]>0){ 6. S[j]=i; TM[i]--;j++;i--;} 7. endfor 0 1 2 3 1 1 2 3 New sorting index New Index vs. Input data 0 1 2 3

  6. Loop1 Time vs. Loop2 Time (MEMORY LOCALITY) 10 100% 9 90% 8 80% 70% 7 60% Execution Time (Sec.) 6 50% 40% 5 30% 4 20% 10% 3 0% 7 8 10 12 14 16 18 20 22 24 26 28 30 2 LOOP1 LOOP2 1 0 7 8 10 12 14 16 18 2^K 20 22 24 26 28 30 LOOP1 LOOP2

  7. Dependent Less on Input Distribution 100% 90% 80% 70% Execution Time Percentage 60% 50% 40% 30% 20% 10% 0% 7 8 10 12 14 16 18 20 22 24 26 28 29 Size of the 2^(n) Random Revers Nearly Sorted Few Unique

  8. CPU Single thread (Time Simulation) 140000000 1 Millions 0.9 120000000 0.8 execution time in microsecond 0.7 100000000 0.6 Execution Time in Microsecond 0.5 0.4 80000000 0.3 0.2 60000000 0.1 0 40000000 16 18 20 22 size of 2^(n) 20000000 Free-comparison quick merge radix 0 7 8 10 12 14 16 18 20 22 24 26 28 29 Size of 2^(n) Comaprison-Free Quick Merge Radix

  9. CPU Single Thread Significant The Fastest Minor Effect on Data Type Distribution One Dimensional Memory Less Computations Easy to work with Less Energy & Power 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% 7 8 10 12 14 16 18 20 22 24 26 28 29 Comparison-Free Quicksort3 7 8 10 12 14 16 18 20 22 24 26 28 29 Free- compari son 1668940 6 10 41 145 584 2317 6839 31414 69519 418684 1828644 7654605 4 3366248 6894229 quick 15 30 140 602 2673 11409 47064 148004 456904 1842128 7859271 9 9

  10. CPU Multiple Threads (8-Threads & 4-Core) 3 7 5 6 4 2 1 3 2 1 9 8 THR1 1 1 THR2 1 2 3 4 5 6 7 8 9 C11=3 THR3 0 1 0 C11=2 0 0 0 1 0 1 0 2 3 4 5 6 7 8 9 C11=1 1 0 0 0 0 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 1 1 1 2 3 4 5 6 7 8 9 0-3 4-7 C12=3 C12=1 C12=0 8-10 BARRIER BOUND C13=0 C13=0 C13=2 0 1 2 3 4 5 6 7 8 9 [ ] [ ] [ ] [ ] [ ] [ ] THR1 =C11 + C21 + C31 = 6 [ ] [ ] [ ] [ ] THR2 =C11 + C21 + C31+ C12+C22+C32 = 10 [ ] [ ] THR2 =C11 + C21 + C31+ C12+ C22 + C32 + C13 + C23 + C33 = 12 No Need

  11. CPU Multiple threaded (TIME) 100% 90% 2500 80% 70% 2000 60% 50% Execution Time (micro-second) 40% 1500 30% 20% 10% 1000 0% 7 8 10 12 14 16 18 20 22 24 26 28 30 32 34 Parallel1 Parallel2 Single 500 0 7 8 10 12 14 16 2^(K) Parallel1 Parallel2 Single

  12. Execution Time vs. Data Sizes 7 8 10 12 14 16 18 20 22 24 26 28 30 32 34 900000000 8-thread Non- thread 345 333 363 386 1070 2085 7658 17309 58822 234639 1084792 4411107 11969863 32481103 88139858 6 10 41 145 584 2317 6839 31414 69519 418684 1828644 7654605 60934070 2.22E+08 8.12E+08 800000000 700000000 100% 90% Execution Time (Micro-Seconed) 600000000 80% 70% 500000000 60% 50% 400000000 40% 300000000 30% 20% 200000000 10% 0% 100000000 7 8 10 12 14 16 18 20 22 24 26 28 32 34 36 Single Paralell1 Parallel2 Atomic 0 20 22 24 26 28 32 34 36 2^(K)

  13. Memory Usage 9E+11 100% 8E+11 90% 80% 7E+11 70% 6E+11 60% 50% 5E+11 40% 4E+11 30% 20% 3E+11 10% 2E+11 0% 7 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 1E+11 Single Parallel-memory Parallel-Atomic 0 7 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 single Parallel-Memory Parallel-atomic

  14. Comparison with Parallel Sorting Algorithms Avoid Mutual Exclusive (Memory Blocked) Use More Memory for threaded Use Atomic for less memory Execution Time (Second) 14 20 24 26 1.08 2.23 0.33 0.27 Comparison-Free [1]-2011-Bitonic-Sort-CPU&GPU [2]-2010-Intel (Radix) CPU [3]-2009-Invidia (Radix) GPU 0.00107/0.0005 0.002 0.235 0.0012 0.0075 0.008 0.076 0.025 0.081 0.031 1.97 0.12

  15. CONCLUSION The Design is novel and is not an incremental of other hybrid sorting algorithms (Future Work); the C-Code is clear and is available Comparison-free: Single-Threaded The fastest for data sizes < 216 Comparison-free: Multi-threaded CPU (Simple 4-Core) fastest at data 220 CPU (Advance Multi-Core) need to investigate GPU (Simple and Advance) need to investigate Use less memory, and expecting less energy

More Related Content