Understanding Basic Computer Architecture and Performance

Explore the foundations of computer systems, from hardware design to software optimization, to assess and improve computing performance efficiently. Learn about different classes of computers and their characteristics. Dive into areas like processor design, memory management, and system architectures to enhance your understanding of computer technology.


Presentation Transcript


  1. COSC121: Computer Systems. Basic Architecture and Performance. Jeremy Bolton, PhD, Assistant Teaching Professor. Constructed using materials from: Patt and Patel, Introduction to Computing Systems (2nd ed.); Patterson and Hennessy, Computer Organization and Design (4th ed.). A special thanks to Eric Roberts and Mary Jane Irwin.

  2. Notes. Project 3 assigned, due soon. Read PH.1 (Patterson and Hennessy, Chapter 1).

  3. Outline. Review: From Electron to Programming Languages. Basic Computer Architecture. How can we improve our computing model (with respect to efficiency)?
     Performance measures: speed, cost, power consumption, memory.
     Hardware is fast, software is slow: CISC vs. RISC.
     Parallelism: ILP, DLP, MLP.
     Going to memory is slow: caching schemes.
     Interrupt-based communication: polling can be slow.
     More and more parallelism: multi-threading, multi-cores, distributed computing, and more.

  4. This week our journey takes us through the system stack: application (browser), operating system (Win, Linux), compiler, assembler, software drivers, instruction set architecture, processor, memory, I/O system, datapath & control, digital design, circuit design, and transistors. (The figure maps COSC 121: Computer Systems, COSC 255: Operating Systems, and COSC 120: Computer Hardware onto different layers of this stack.)

  5. Assessing the Performance of a Computer. Consider: computers are defined by, and limited by, their hardware and ISAs, and there are many different computing models.

  6. Classes of Computers. Personal computers: designed to deliver good performance to a single user at low cost, usually executing 3rd-party software, and usually incorporating a graphics display, a keyboard, and a mouse. Servers: used to run larger programs for multiple, simultaneous users, typically accessed only via a network, and placing a greater emphasis on dependability and (often) security. Supercomputers: a high-performance, high-cost class of servers with hundreds to thousands of processors, terabytes of memory, and petabytes of storage, used for high-end scientific and engineering applications. Embedded computers (processors): a computer inside another device, used for running one predetermined application.

  7. How do we define better: Performance Metrics. Purchasing perspective: given a collection of machines, which has the best performance? the least cost? the best cost/performance? Design perspective: faced with design options, which has the best performance improvement? the least cost? the best cost/performance? Both require (1) a basis for comparison and (2) a metric for evaluation. Our goal is to understand what factors in the architecture contribute to overall system performance, and the relative importance (and cost) of these factors.

  8. Throughput versus Response Time. Response time (execution time): the time between the start and the completion of a task; important to individual users. Throughput (bandwidth): the total amount of work done in a given time; important to data center managers. We will need different performance metrics, as well as a different set of benchmark applications, for embedded and desktop computers (which are more focused on response time) versus servers (which are more focused on throughput).

  9. Defining (Speed) Performance. To maximize performance, we need to minimize execution time: performance_X = 1 / execution_time_X. If X is n times faster than Y, then performance_X / performance_Y = execution_time_Y / execution_time_X = n. Decreasing response time almost always improves throughput.

  10. Relative Performance Example. If computer A runs a program in 10 seconds and computer B runs the same program in 15 seconds, how much faster is A than B? We know that A is n times faster than B if performance_A / performance_B = execution_time_B / execution_time_A = n. The performance ratio is 15 / 10 = 1.5, so A is 1.5 times faster than B.
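
As a quick check of the arithmetic above, here is a minimal Python sketch (the speedup helper and its argument names are ours, not from the slides):

```python
def speedup(exec_time_slow, exec_time_fast):
    """A is n times faster than B when execution_time_B / execution_time_A = n."""
    return exec_time_slow / exec_time_fast

# Computer A runs the program in 10 s, computer B in 15 s.
print(speedup(15, 10))  # 1.5 -> A is 1.5 times faster than B
```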

  11. Review: Machine Clock Rate. Clock rate (clock cycles per second, in MHz or GHz) is the inverse of clock cycle time (clock period): CC = 1 / CR. For one clock period: 10 nsec clock cycle => 100 MHz clock rate; 5 nsec => 200 MHz; 2 nsec => 500 MHz; 1 nsec (10^-9 sec) => 1 GHz (10^9 Hz); 500 psec => 2 GHz; 250 psec => 4 GHz; 200 psec => 5 GHz.
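
A small illustrative sketch of the cycle-time/clock-rate conversions above (the clock_rate_hz helper is ours):

```python
def clock_rate_hz(cycle_time_s):
    """Clock rate is the inverse of clock cycle time: CR = 1 / CC."""
    return 1.0 / cycle_time_s

print(clock_rate_hz(10e-9))    # 1e8  -> 100 MHz
print(clock_rate_hz(1e-9))     # 1e9  -> 1 GHz
print(clock_rate_hz(250e-12))  # 4e9  -> 4 GHz
```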

  12. Performance Factors. CPU execution time (CPU time): the time the CPU spends working on a task; it does not include time waiting for I/O or running other programs. CPU execution time for a program = # CPU clock cycles for a program x clock cycle time, or equivalently CPU execution time for a program = # CPU clock cycles for a program / clock rate. We can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program.

  13. Improving Performance Example. A program runs on computer A with a 2 GHz clock in 10 seconds. What clock rate must computer B run at to run this program in 6 seconds? Assume (unfortunately) that, to accomplish this, computer B will require 1.2 times as many clock cycles as computer A to run the program. First we need the number of clock cycles on A: CPU clock cycles_A = CPU time_A x clock rate_A = 10 sec x 2 x 10^9 cycles/sec = 20 x 10^9 cycles. Given that B will require 1.2 times as many cycles and must execute in 6 s, we can solve for its clock rate: clock rate_B = (1.2 x 20 x 10^9 cycles) / 6 seconds = 4 GHz.
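
The same calculation as a short Python sketch (variable names are ours, chosen for readability):

```python
# Step 1: clock cycles on A = CPU time_A x clock rate_A
cpu_time_a   = 10.0    # seconds
clock_rate_a = 2e9     # 2 GHz
cycles_a = cpu_time_a * clock_rate_a      # 20e9 cycles

# Step 2: B needs 1.2x as many cycles and must finish in 6 s
cycles_b   = 1.2 * cycles_a               # 24e9 cycles
cpu_time_b = 6.0                          # seconds
clock_rate_b = cycles_b / cpu_time_b      # 4e9 Hz
print(clock_rate_b / 1e9, "GHz")          # 4.0 GHz
```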

  14. THE Performance Equation. Our basic performance equation is then: CPU time = Instruction_count x CPI x clock_cycle, or CPU time = (Instruction_count x CPI) / clock_rate. These equations separate the three key factors that affect performance. We can measure the CPU execution time by running the program; the clock rate is usually given; we can measure the overall instruction count by using profilers/simulators without knowing all of the implementation details; CPI varies by instruction type and ISA implementation, for which we must know the implementation details.
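
A minimal sketch of the performance equation as code; the cpu_time helper and the one-billion-instruction program are ours, used only for illustration:

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time = Instruction_count x CPI / clock_rate."""
    return instruction_count * cpi / clock_rate_hz

# Hypothetical program: 1 billion instructions, CPI of 2.0, 2 GHz clock.
print(cpu_time(1e9, 2.0, 2e9))  # 1.0 second
```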

  15. Using the Performance Equation: Example. Computers A and B implement the same ISA. Computer A has a clock cycle time of 250 ps and an effective CPI of 2.0 for some program; computer B has a clock cycle time of 500 ps and an effective CPI of 1.2 for the same program. Which computer is faster, and by how much? Each computer executes the same number of instructions, I, so CPU time_A = I x 2.0 x 250 ps = 500 x I ps and CPU time_B = I x 1.2 x 500 ps = 600 x I ps. Clearly A is faster, by the ratio of execution times: performance_A / performance_B = execution_time_B / execution_time_A = (600 x I ps) / (500 x I ps) = 1.2.
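
The example as a short sketch; I is left as 1 since it cancels in the ratio (variable names are ours):

```python
I = 1  # same instruction count on both machines; it cancels in the ratio
cpu_time_a = I * 2.0 * 250  # ps -> 500 x I ps
cpu_time_b = I * 1.2 * 500  # ps -> 600 x I ps
print(cpu_time_b / cpu_time_a)  # 1.2 -> A is 1.2 times faster than B
```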

  16. Determinants of CPU Performance. CPU time = Instruction_count x CPI x clock_cycle

                              Instruction_count   CPI   clock_cycle
      Algorithm                       X            X
      Programming language            X            X
      Compiler                        X            X
      ISA                             X            X         X

  17. Clock Cycles per Instruction. Not all instructions take the same amount of time to execute. One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction: # CPU clock cycles for a program = # instructions for a program x average clock cycles per instruction. Clock cycles per instruction (CPI): the average number of clock cycles each instruction takes to execute; a way to compare two different implementations of the same ISA. (Example CPI per instruction class: class A = 1, class B = 2, class C = 3.)

  18. Effective (Average) CPI. Computing the overall effective CPI is done by looking at the different types of instructions, their individual cycle counts, and averaging: overall effective CPI = sum over i = 1 to n of (CPI_i x IC_i), where IC_i is the count (percentage) of the number of instructions of class i executed, CPI_i is the (average) number of clock cycles per instruction for that instruction class, and n is the number of instruction classes. The overall effective CPI varies by instruction mix, a measure of the dynamic frequency of instructions across one or many programs.

  19. A Simple Example.

      Op       Freq   CPI_i   Freq x CPI_i
                              (base)  (load = 2)  (branch = 1)  (2 ALU at once)
      ALU      50%     1        .5       .5           .5             .25
      Load     20%     5       1.0       .4          1.0            1.0
      Store    10%     3        .3       .3           .3             .3
      Branch   20%     2        .4       .4           .2             .4
      Effective CPI =          2.2      1.6          2.0            1.95

      How much faster would the machine be if a better data cache reduced the average load time to 2 cycles? CPU time_new = 1.6 x IC x CC, so 2.2/1.6 means 37.5% faster. How does this compare with using branch prediction to shave a cycle off the branch time? CPU time_new = 2.0 x IC x CC, so 2.2/2.0 means 10% faster. What if two ALU instructions could be executed at once? CPU time_new = 1.95 x IC x CC, so 2.2/1.95 means 12.8% faster.
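
A sketch reproducing the table and the what-if questions above; the dictionary layout, the effective_cpi helper, and modelling two ALU ops at once as an effective ALU CPI of 0.5 are our choices, not from the slides:

```python
# Instruction mix: class -> (frequency, CPI)
base = {"ALU": (0.50, 1), "Load": (0.20, 5), "Store": (0.10, 3), "Branch": (0.20, 2)}

def effective_cpi(mix):
    """Overall effective CPI = sum over classes of freq_i x CPI_i."""
    return sum(freq * cpi for freq, cpi in mix.values())

cpi_base        = effective_cpi(base)                           # 2.2
cpi_fast_load   = effective_cpi({**base, "Load": (0.20, 2)})    # 1.6
cpi_fast_branch = effective_cpi({**base, "Branch": (0.20, 1)})  # 2.0
cpi_dual_alu    = effective_cpi({**base, "ALU": (0.50, 0.5)})   # 1.95

print(cpi_base / cpi_fast_load)    # 1.375 -> 37.5% faster with the better data cache
print(cpi_base / cpi_fast_branch)  # 1.1   -> 10% faster with branch prediction
print(cpi_base / cpi_dual_alu)     # ~1.128 -> 12.8% faster with two ALU ops at once
```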

  20. Workloads and Benchmarks. Benchmarks: a set of programs that form a workload, specifically chosen to measure performance. SPEC (System Performance Evaluation Cooperative) creates standard sets of benchmarks, starting with SPEC89. The latest is SPEC CPU2006, which consists of 12 integer benchmarks (CINT2006) and 17 floating-point benchmarks (CFP2006). www.spec.org. There are also benchmark collections for power workloads (SPECpower_ssj2008), mail workloads (SPECmail2008), multimedia workloads (MediaBench), and others.

  21. Other Performance Metrics. Power consumption, especially in the embedded market where battery life is important. For power-limited applications, the most important metric is energy efficiency.

  22. Power Trends. In CMOS IC technology, power = capacitive load x voltage^2 x frequency. (Figure: over the period shown, clock frequency grew by roughly 1000x while supply voltage dropped from 5 V to 1 V, and power grew by roughly 30x.)

  23. The Power Issue: Relatively Speaking. (Figure from Fred Pollack, Intel Corp., Micro32 conference keynote, 1999.)

  24. Reducing Power. Suppose a new CPU has 85% of the capacitive load of the old CPU, with a 15% voltage reduction and a 15% frequency reduction. Then P_new / P_old = (C_old x 0.85) x (V_old x 0.85)^2 x (F_old x 0.85) / (C_old x V_old^2 x F_old) = 0.85^4 = 0.52. The power wall: we can't reduce voltage further, and we can't remove more heat. How else can we improve performance?
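
The power-ratio arithmetic above as a short sketch; variable names are ours, and dynamic power is modelled as C x V^2 x F as on the previous slide:

```python
# 85% capacitive load, 15% voltage reduction, 15% frequency reduction
c_scale, v_scale, f_scale = 0.85, 0.85, 0.85
power_ratio = c_scale * v_scale**2 * f_scale  # = 0.85**4
print(power_ratio)  # ~0.522 -> the new CPU dissipates roughly half the power of the old one
```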

  25. Parallelism for Improved Performance. Given physical constraints on clock speeds (power) and transistor size (is Moore's Law coming to an end?), most current improvements have come from parallelism, in both hardware and software.

  26. Parallelism. Classes of parallelism in applications: Data-Level Parallelism (DLP) and Task-Level Parallelism (TLP). Classes of architectural parallelism: Instruction-Level Parallelism (ILP), vector architectures / graphics processing units (GPUs), Thread-Level Parallelism, and Request-Level Parallelism.

  27. Instruction-Level Parallelism. You have already seen parallelism in action! Pipelining allows multiple instructions to be executed simultaneously (recall the F, D, EA, OP, EX, S phases of the instruction cycle). Let's briefly investigate another ISA: MIPS.

  28. MIPS Pipeline. Five stages, one step per stage: 1. IF: instruction fetch from memory; 2. ID: instruction decode & register read; 3. EX: execute operation or calculate address; 4. MEM: access memory operand; 5. WB: write result back to register.

  29. Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. For four loads: speedup = 8/3.5 = 2.3. Non-stop (many loads): speedup = 2n / (0.5n + 1.5), which approaches 4 = the number of stages.
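
A sketch of the laundry-analogy speedup formula above, assuming four 0.5-hour stages per load (the laundry_speedup helper is ours):

```python
def laundry_speedup(n_loads, n_stages=4, stage_time_hours=0.5):
    """Speedup of pipelined over unpipelined laundry."""
    unpipelined = n_loads * n_stages * stage_time_hours                          # 2n hours
    pipelined = n_stages * stage_time_hours + (n_loads - 1) * stage_time_hours   # 1.5 + 0.5n hours
    return unpipelined / pipelined

print(laundry_speedup(4))       # 8 / 3.5 = ~2.3
print(laundry_speedup(10_000))  # ~4: approaches the number of stages for long runs
```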

  30. Why Pipeline? For Performance! Once the pipeline is full, one instruction is completed every cycle, so CPI = 1. (Figure: instructions 0 through 4 flowing through the IM, Reg, ALU, DM, Reg stages over successive clock cycles; the first few cycles are the time to fill the pipeline.)

  31. Can Pipelining Get Us Into Trouble? Yes: pipeline hazards. Structural hazards: an attempt to use the same resource by two different instructions at the same time. Data hazards: an attempt to use data before it is ready; an instruction's source operand(s) are produced by a prior instruction still in the pipeline. Control hazards: an attempt to make a decision about program control flow before the condition has been evaluated and the new PC target address calculated (branch and jump instructions, exceptions). We can usually resolve hazards by waiting; the pipeline control must detect the hazard and take action to resolve it.

  32. A Single Memory Would Be a Structural Hazard. (Figure: with a single memory, an lw reads its data from memory in the same cycle that a later instruction is being fetched from memory.) Fix with separate instruction and data memories (I$ and D$).

  33. How About Register File Access? Fix the register file access hazard by doing reads in the second half of the cycle and writes in the first half. (Figure: add $1,... followed later by add $2,$1,...; one clock edge controls the loading of the pipeline state registers, the other controls register writing.)

  34. Register Usage Can Cause Data Hazards. Dependencies backward in time cause hazards. (Figure: add $1,... followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5 — instructions that read $1 before the add has written it back suffer a read-before-write data hazard.)

  36. Loads Can Cause Data Hazards. Dependencies backward in time cause hazards. (Figure: lw $1,4($2) followed by sub $4,$1,$5; and $6,$1,$7; or $8,$1,$9; xor $4,$1,$5 — the instruction that needs $1 immediately after the load suffers a load-use data hazard.)

  37. Branch Instructions Cause Control Hazards. Dependencies backward in time cause hazards. (Figure: a beq followed by an lw and two more instructions — they are fetched before the branch outcome is known.)

  38. Even with all the hazards, pipelining is worth the overhead! We will discuss the different hazards in more detail.

  39. Pipeline Performance. Assume the stage times are 100 ps for register read or write and 200 ps for the other stages. Compare the pipelined datapath with the single-cycle datapath.

      Instr      Instr fetch   Register read   ALU op   Memory access   Register write   Total time
      lw           200 ps         100 ps       200 ps      200 ps          100 ps          800 ps
      sw           200 ps         100 ps       200 ps      200 ps                          700 ps
      R-format     200 ps         100 ps       200 ps                      100 ps          600 ps
      beq          200 ps         100 ps       200 ps                                      500 ps
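
A sketch comparing the two datapaths above for a long instruction stream; the instruction count of one million is our made-up illustration, and variable names are ours:

```python
n = 1_000_000        # instructions executed
tc_single    = 800   # ps: single-cycle clock set by the slowest instruction (lw)
tc_pipelined = 200   # ps: pipelined clock set by the slowest stage

time_single    = n * tc_single            # each instruction takes one long cycle
time_pipelined = (n + 4) * tc_pipelined   # plus 4 extra cycles to fill the 5-stage pipe
print(time_single / time_pipelined)       # ~4.0: approaches 800/200 for long programs
```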

  40. Pipeline Performance. (Figure: the single-cycle datapath has Tc = 800 ps; the pipelined datapath has Tc = 200 ps.)

  41. Pipeline Speedup. If all stages are balanced (i.e., all take the same time): time between instructions (pipelined) = time between instructions (non-pipelined) / number of stages. If the stages are not balanced, the speedup is less. The speedup is due to increased throughput; latency (the time for each instruction) does not decrease.

  42. Pipelining and ISA Design. The MIPS ISA was designed for pipelining. All instructions are 32 bits: easier to fetch and decode in one cycle (cf. x86: 1- to 17-byte instructions). Few and regular instruction formats: can decode and read registers in one step. Load/store addressing: can calculate the address in the 3rd stage and access memory in the 4th stage. Alignment of memory operands: memory access takes only one cycle.

  43. Current Trends in Architecture Benefits reaped from Instruction-Level parallelism (ILP) have largely been maxed out. New models for performance: Data-level parallelism (DLP) Thread-level parallelism (TLP) These require explicit restructuring of the application

  44. Summary. All modern-day processors use pipelining. Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload. Potential speedup: a CPI of 1 and a fast CC. The pipeline rate is limited by the slowest pipeline stage, and unbalanced pipe stages make for inefficiencies. The time to fill the pipeline and the time to drain it can limit speedup for deep pipelines and short code runs. We must detect and resolve hazards; stalling negatively affects CPI (it makes CPI greater than the ideal of 1).

  45. Future Topics (time permitting) Parallel Computing Distributed and Cloud Computing GPUs and Vectored Architectures

  46. Appendix

  47. CPU Time. CPU Time = CPU Clock Cycles x Clock Cycle Time = CPU Clock Cycles / Clock Rate. Performance is improved by reducing the number of clock cycles or by increasing the clock rate.

  48. CPU Time Example. Computer A: 2 GHz clock, 10 s CPU time. Designing computer B: aim for 6 s CPU time; it can use a faster clock, but this causes 1.2x as many clock cycles. How fast must computer B's clock be? Clock Cycles_B = 1.2 x Clock Cycles_A, and Clock Rate_B = Clock Cycles_B / CPU Time_B. Clock Cycles_A = CPU Time_A x Clock Rate_A = 10 s x 2 GHz = 20 x 10^9. Clock Rate_B = (1.2 x 20 x 10^9) / 6 s = (24 x 10^9) / 6 s = 4 GHz.

  49. Instruction Count and CPI. Clock Cycles = Instruction Count x Cycles per Instruction (CPI). CPU Time = Instruction Count x CPI x Clock Cycle Time = Instruction Count x CPI / Clock Rate. The instruction count for a program is determined by the program, the ISA, and the compiler. The average cycles per instruction is determined by the CPU hardware. If different instructions have different CPIs, the average CPI is affected by the instruction mix.

  50. CPI Example. Computer A: cycle time = 250 ps, CPI = 2.0. Computer B: cycle time = 500 ps, CPI = 1.2. Same ISA. Which is faster, and by how much? CPU Time_A = Instruction Count x CPI_A x Cycle Time_A = I x 2.0 x 250 ps = I x 500 ps. CPU Time_B = Instruction Count x CPI_B x Cycle Time_B = I x 1.2 x 500 ps = I x 600 ps. A is faster, by CPU Time_B / CPU Time_A = (I x 600 ps) / (I x 500 ps) = 1.2.
