
High Performance Computing Course Overview
This course on high performance computing delves into parallel processing concepts, parallel programming principles, and developing efficient algorithms in distributed environments. It covers topics such as parallel computing platforms, algorithm design, memory systems, and programming models. The course objectives include enhancing problem-solving abilities using HPC and studying algorithmic examples in concurrent and parallel settings. Through in-depth analysis and practical implementations, students will learn to optimize performance, handle scientific computations, and improve program efficiency. Taught by Prof. Vijay More at MET's IOE, BKC, Adgaon Nashik.
Presentation Transcript
High Performance Computing (410450)
Subject Teacher: Prof. Vijay More
Examination Scheme: In-Semester Assessment: 30; End-Semester Assessment: 70
Course Objectives
1. To develop problem-solving abilities using HPC
2. To develop time- and space-efficient algorithms
3. To study algorithmic examples in distributed, concurrent, and parallel environments
Course Outcomes
1. To transform algorithms in the computational area into efficient program code for modern computer architectures
2. To write, organize, and handle programs for scientific computations
3. To create a presentation on the use of tools for performance optimization and debugging
4. To present an analysis of code with respect to performance and suggest performance improvements
5. To present test cases that solve problems in multi-core, distributed, and concurrent/parallel environments, and to implement them
Unit 1: Parallel Processing Concepts
Introduction to Parallel Computing: Motivating Parallelism, Scope of Parallel Computing, Organization and Contents of the Text
Parallel Programming Platforms: Implicit Parallelism: Trends in Microprocessor Architectures, Limitations of Memory System Performance, Dichotomy of Parallel Computing Platforms, Physical Organization of Parallel Platforms, Communication Costs in Parallel Machines
Levels of parallelism (instruction, transaction, task, thread, memory, function)
Models (SIMD, MIMD, SIMT, SPMD, Dataflow Models, Demand-driven Computation)
Architectures: N-wide superscalar architectures, multi-core, multi-threaded
Unit 2: Parallel Programming
Principles of Parallel Algorithm Design: Preliminaries, Decomposition Techniques, Characteristics of Tasks and Interactions, Mapping Techniques for Load Balancing, Methods for Containing Interaction Overheads, Parallel Algorithm Models
Processor Architecture, Interconnect, Communication, Memory Organization, and Programming Models in high performance computing; architecture examples: IBM Cell BE, Nvidia Tesla GPU, Intel Larrabee microarchitecture, and Intel Nehalem microarchitecture
Memory hierarchy and transaction-specific memory design, Thread Organization
Unit 3: Fundamental Design Issues in HPC
Programming Using the Message-Passing Paradigm: Principles of Message-Passing Programming, The Building Blocks: Send and Receive Operations, MPI: the Message Passing Interface, Topology and Embedding, Overlapping Communication with Computation, Collective Communication and Computation Operations, One-Dimensional Matrix-Vector Multiplication, Single-Source Shortest Path, Sample Sort, Groups and Communicators, Two-Dimensional Matrix-Vector Multiplication
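As a concrete illustration of the send and receive building blocks listed above, here is a minimal MPI sketch in C (assuming an MPI implementation such as MPICH or Open MPI, run with at least two processes; the payload value is illustrative only):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);                   /* initialize the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* rank of this process */
    if (rank == 0) {
        value = 42;                           /* illustrative payload */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* blocking send to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);          /* blocking receive from rank 0 */
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();                           /* shut down MPI */
    return 0;
}

A typical build-and-run sequence would be mpicc send_recv.c -o send_recv followed by mpirun -np 2 ./send_recv.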
Unit 4: Synchronization and Related Algorithms
Synchronization: Scheduling, Job Allocation, Job Partitioning, Dependency Analysis, Mapping Parallel Algorithms onto Parallel Architectures, Performance Analysis of Parallel Algorithms
Programming Shared Address Space Platforms: Thread Basics, Why Threads?, The POSIX Thread API, Thread Basics: Creation and Termination, Synchronization Primitives in Pthreads, Controlling Thread and Synchronization Attributes, Thread Cancellation, Composite Synchronization Constructs, Tips for Designing Asynchronous Programs, OpenMP: a Standard for Directive-Based Parallel Programming
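A minimal sketch of the Pthreads topics named above (thread creation, termination, and a mutex as a synchronization primitive); the thread count and loop bound are arbitrary illustrations:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);    /* synchronization primitive protecting the shared counter */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;                      /* thread termination */
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);   /* thread creation */
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);                    /* wait for termination */
    printf("counter = %ld\n", counter);                /* expected NTHREADS x 100000 */
    return 0;
}

Compiled with gcc -pthread, the mutex ensures the final count is exactly NTHREADS x 100000; OpenMP expresses the same pattern more compactly with directives such as #pragma omp parallel and #pragma omp critical.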
Unit 5: Advanced Tools, Techniques and Applications
Bandwidth Limitations, Latency Limitations, Latency Hiding/Tolerating Techniques and their Limitations
Dense Matrix Algorithms: Matrix-Vector Multiplication, Matrix-Matrix Multiplication
Sorting: Issues, Sorting on Parallel Computers, Sorting Networks, Bubble Sort and its Variants, Quicksort, Bucket and Sample Sort, Shared-Address-Space Parallel Formulation
Single-Source Shortest Paths: Distributed-Memory Formulation
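One possible shared-address-space sketch of the matrix-vector multiplication listed in this unit, using OpenMP in C (the matrix size and initialization are placeholders):

#include <stdio.h>

#define N 1024
static double A[N][N], x[N], y[N];

int main(void) {
    /* placeholder initialization */
    for (int i = 0; i < N; i++) {
        x[i] = 1.0;
        for (int j = 0; j < N; j++)
            A[i][j] = 1.0;
    }

    /* each row of y is an independent task: rows are distributed over threads */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            sum += A[i][j] * x[j];
        y[i] = sum;
    }

    printf("y[0] = %f\n", y[0]);   /* expect 1024.0 with this initialization */
    return 0;
}

Each row of y depends only on A and x, so rows can be distributed across threads without synchronization; compile with gcc -fopenmp.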
Unit 6: HPC-Enabled Advanced Technologies
Search Algorithms for Discrete Optimization Problems: Search Overhead Factor, Parallel Depth-First Search, Parallel Best-First Search
Introduction to (block diagrams only, if any): Petascale Computing, Optics in Parallel Computing, Quantum Computers
Recent developments in Nanotechnology and its impact on HPC
Power-aware Processing Techniques in HPC
Text Books:
1. Kai Hwang, "Advanced Computer Architecture: Parallelism, Scalability, Programmability", McGraw-Hill, 1993.
2. David Culler and Jaswinder Pal Singh, "Parallel Computer Architecture: A Hardware/Software Approach", Morgan Kaufmann, 1999.
Reference Books:
1. Kai Hwang, "Scalable Parallel Computing", McGraw-Hill, 1998.
2. George S. Almasi and Alan Gottlieb, "Highly Parallel Computing", The Benjamin/Cummings Publishing Co., Inc.
3. William James Dally and Brian Towles, "Principles and Practices of Interconnection Networks", Morgan Kaufmann, 2004.
4. Hubert Nguyen (Ed.), "GPU Gems 3" (Chapter 29 to Chapter 41).
5. Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, "Introduction to Parallel Computing", 2nd edition, Addison-Wesley, 2003.
6. David A. Bader (Ed.), "Petascale Computing: Algorithms and Applications", Chapman & Hall/CRC Computational Science Series, 2007.
7. BoS content: books, course notes, digital content, and blogs developed by the BoS for bridging gaps in the syllabus, problem-solving approaches, and advances in the course.
Introduction to Parallel Computing
1. Motivating Parallelism
2. Scope of Parallel Computing
Motivating Parallelism
The role of parallelism in accelerating computing speeds has been recognized for several decades. Its role in providing a range of datapaths and increased access to storage elements has been significant in commercial applications.
The scalable performance and lower cost of parallel platforms are reflected in the wide variety of applications. The processor (CPU) is the active part of the computer, which does all the work of data manipulation and decision making.
The datapath is the hardware that performs all the required data-processing operations, for example the ALU, registers, and internal buses. Control is the hardware that tells the datapath what to do, in terms of switching, operation selection, data movement between ALU components, etc.
Developing parallel hardware and software has traditionally been time and effort intensive. If one is to view this in the context of rapidly improving uniprocessor speeds, one is tempted to question the need for parallel computing.
There are some unmistakable trends in hardware design, which indicate that uniprocessor (or implicitly parallel) architectures may not be able to sustain the rate of realizable performance increments in the future.
This is the result of a number of fundamental physical and computational limitations. The materialization of standardized parallel programming environments, libraries, and hardware has significantly reduced the time to (parallel) solution.
The Computational Power Argument
Moore's law, as stated in 1965: "The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years." That means that by 1975, the number of components per integrated circuit for minimum cost would be 65,000.
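As a rough check of that projection, assuming Moore's 1965 starting point of roughly 2^6 = 64 components for the minimum-cost circuit: ten annual doublings give 64 x 2^10 = 65,536, i.e. about 65,000 components by 1975.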
Gordon Moore was at Fairchild R&D in 1962. Moore attributed this doubling rate to the exponential behavior of die sizes, finer minimum dimensions, and "circuit and device cleverness".
In 1975, he revised this law as follows: "There is no room left to squeeze anything out by being clever. Going forward from here we have to depend on the two size factors: bigger dies and finer dimensions." He revised his rate of circuit-complexity doubling to 18 months and projected forward from 1975 at this reduced rate.
A die, in the context of integrated circuits, is a small block of semiconducting material on which a given functional circuit is fabricated. By 2004, clock frequencies had gotten fast enough (around 3 GHz) that any further increases would have caused the chips to melt from the heat they generated. So while manufacturers continued to increase the number of transistors per chip, they no longer increased the clock frequencies. Instead, they started putting multiple processor cores on the chip.
How does one translate transistors into useful OPS (operations per second)? The logical alternative is to rely on parallelism, both implicit and explicit. Most serial processors rely extensively on implicit parallelism.
Implicit parallelism is a characteristic of a programming language that allows a compiler or interpreter to automatically exploit the parallelism inherent in the computations expressed by some of the language's constructs. A pure implicitly parallel language does not need special directives, operators, or functions to enable parallel execution.
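As an illustrative contrast (not from the slides), the first routine below leaves it to an optimizing compiler (e.g. gcc with -O3, which enables auto-vectorization) to exploit the independent iterations implicitly, while the second states the parallelism explicitly with an OpenMP directive:

/* Implicit: no directives; an auto-vectorizing or auto-parallelizing
   compiler may exploit the independent iterations on its own. */
void axpy_implicit(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* Explicit: the programmer requests parallel execution with a directive. */
void axpy_explicit(int n, double a, const double *x, double *y) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}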
The Memory/Disk Speed Argument
While clock rates of high-end processors have increased at roughly 40% per year over the past decade, DRAM access times have only improved at roughly 10% per year over this interval. This mismatch in speed causes significant performance bottlenecks.
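To see how quickly such a mismatch compounds, taking those rates at face value: over ten years, processor clock rates grow by about 1.4^10 (roughly 29x) while DRAM access times improve by only about 1.1^10 (roughly 2.6x), so the processor-memory gap widens by roughly a factor of 11 in a decade.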
Parallel platforms provide increased bandwidth to the memory system. Parallel platforms also provide higher aggregate caches.
Principles of locality of data reference and bulk access, which guide parallel algorithm design, also apply to memory optimization. Some of the fastest-growing applications of parallel computing utilize not their raw computational speed but rather their ability to pump data to memory and disk faster.
Locality of Reference Property
A particular portion of the memory address space is accessed frequently by a program during its execution in any time window, e.g. the innermost loop. There are three dimensions of the locality property:
1. Temporal locality
2. Spatial locality
3. Sequential locality
1. Temporal Locality
Items referenced recently are likely to be referenced again in the near future, e.g. by loops, stacks, temporary variables, or subroutines. Once a loop is entered (or a subroutine is called), a small code segment is referenced many times repeatedly. This temporal locality is clustered in recently used areas.
2. Spatial Locality
This is the tendency of a process to access items whose addresses are near one another; e.g. tables and arrays involve access to areas that are clustered together. Program segments containing subroutines and macros are kept together in the neighbourhood of memory space.
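A small C sketch (not from the slides) of how spatial locality affects performance: both loops sum the same row-major N x N array, but the first walks memory contiguously while the second strides across rows, so the first typically runs noticeably faster on cached memory hierarchies:

#include <stdio.h>

#define N 2048
static double a[N][N];          /* stored row by row (row-major order) */

int main(void) {
    double sum = 0.0;

    /* Good spatial locality: consecutive accesses fall in the same cache lines. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Poor spatial locality: each access jumps N * sizeof(double) bytes ahead. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%f\n", sum);
    return 0;
}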
3. Sequential Locality
Execution of instructions in a program follows a sequential order unless out-of-order instructions are encountered. The ratio of in-order to out-of-order execution is generally 5 to 1.
The Data Communication Argument
As the network evolves, the vision of the Internet as one large computing platform has emerged. This view is exploited by applications such as SETI@home and Folding@home. In many other applications (typically databases and data mining) the volume of data is such that it cannot be moved; any analyses on this data must be performed over the network using parallel techniques.
The Search for Extraterrestrial Intelligence (SETI) is the collective name for a number of activities undertaken to search for intelligent extraterrestrial life; SETI@home is the distributed computing project that supports this search. Folding@home is a distributed computing project for disease research that simulates protein folding, computational drug design, and other types of molecular dynamics. Both projects use the idle processing resources of thousands of personal computers owned by volunteers who have installed the software on their systems.
Scope of Parallel Computing Applications
Parallelism finds applications in very diverse domains for different motivating reasons, ranging from improved application performance to cost considerations.
Applications in Engineering and Design
Design of airfoils (optimizing lift, drag, stability), internal combustion engines (optimizing charge distribution, burn), high-speed circuits (layouts for delays and capacitive and inductive effects), and structures (optimizing structural integrity, design parameters, cost, etc.). Design and simulation of micro- and nano-scale systems. Process optimization and operations research.
Scientific Applications
Functional and structural characterization of genes and proteins. Advances in computational physics and chemistry have enabled the exploration of new materials, a better understanding of chemical pathways, and more efficient processes.
Applications in astrophysics have explored the evolution of galaxies, thermonuclear processes, and the analysis of extremely large datasets from telescopes. Weather modeling, mineral prospecting, flood prediction, etc., are other important applications. Bioinformatics and astrophysics also present some of the most challenging problems with respect to analyzing extremely large datasets.
Commercial Applications
Some of the largest parallel computers power Wall Street: risk analysis, portfolio management, and automated trading. Data mining and analysis for optimizing business and marketing decisions. Large-scale servers (mail and web servers) are often implemented using parallel platforms.
Applications such as information retrieval and search are typically powered by large clusters. Cloud computing inherently runs on distributed systems processed by parallel systems over a widespread network. Big Data is also an emerging technology that integrates various parallel and distributed systems over the network.
Computer-aided engineering (CAE): Automotive design and testing, transportation, structural and mechanical design
Chemical engineering: Process and molecular design
Digital content creation (DCC) and distribution: Computer-aided graphics in film and media
Economics/financial: Wall Street risk analysis, portfolio management, automated trading
Electronic design and automation (EDA): Electronic component design and verification
Geosciences and geo-engineering: Oil and gas exploration and reservoir modeling
Mechanical design and drafting: 2D and 3D design and verification, mechanical modeling
Defence and energy: Nuclear stewardship, basic and applied research
Government labs: Basic and applied research
University/academic: Basic and applied research
Weather forecasting: Near-term and climate/earth modeling
Applications in Computer Systems
Network intrusion detection, cryptography, and multiparty computations are some of the core users of parallel computing techniques. Embedded systems increasingly rely on distributed control algorithms.
A modern automobile consists of a number of processors communicating to perform complex tasks for optimizing handling and performance. Conventional structured peer-to-peer networks impose overlay networks and utilize algorithms directly from parallel computing.
Organization and Contents of this Course
As per the SPPU syllabus.