Floating-Point Representation and Operations

Explore the world of floating-point numbers, IEEE standards, normalized representation, and the intricacies of floating-point operations. Learn about round-off errors, special values, and the limitations of floating-point arithmetic compared to real numbers.

Uploaded by mani_5 on Apr 08, 2025

Presentation Transcript

Lecture 17 Floating point

What is floating point?
- A representation: 2.5732 × 10^22, NaN
- Single, double, extended precision (80 bits)
- A set of operations: +, -, *, /, rem
- Comparison: <, =, >
- Conversions between different formats, binary to decimal
- Exception handling
- Language and library support
Scott B. Baden / CSE 160 / Wi '16

IEEE floating-point standard P754
- Universally accepted standard for representing and using floating point, AKA P754
- W. Kahan received the Turing Award in 1989 for the design of the IEEE floating-point standard
- Revised in 2008
- Introduces special values to represent infinities, signed zeroes, and undefined values ("Not a Number"): 1/0 = +∞, 0/0 = NaN
- Minimal standards for how numbers are represented
- Numerical exceptions

Representation
- Normalized representation: ±1.d…d × 2^exp
- Macheps = machine epsilon = ε = 2^(-#significand bits): the relative error in each operation
- OV = overflow threshold = largest number
- UN = underflow threshold = smallest normalized number
- Zero: significand and exponent = 0

Format           # bits  # significand bits  macheps           # exponent bits  exponent range
Single           32      23+1                2^-24 (~10^-7)    8                2^-126 to 2^127 (~10^±38)
Double           64      52+1                2^-53 (~10^-16)   11               2^-1022 to 2^1023 (~10^±308)
Double extended  80      64                  2^-64 (~10^-19)   15               2^-16382 to 2^16383 (~10^±4932)

(Credit: Jim Demmel)
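The quantities in the table can be read off in C++ through std::numeric_limits. The sketch below is illustrative only; the helper names (unit_roundoff, overflow_threshold, underflow_threshold) are made up for this example. One point worth noting: std::numeric_limits<T>::epsilon() is the gap between 1 and the next representable number (2^-23 for float), which is twice the macheps value listed in the table, so the sketch halves it.

```cpp
#include <cmath>    // std::ldexp, used in the usage check below
#include <limits>

// "macheps" as used in the table: half the gap between 1 and the next
// representable number. epsilon() returns the full gap, so halve it.
template <typename T>
constexpr T unit_roundoff() {
    return std::numeric_limits<T>::epsilon() / 2;
}

// OV: the largest finite value of the format.
template <typename T>
constexpr T overflow_threshold() {
    return std::numeric_limits<T>::max();
}

// UN: the smallest positive normalized value of the format.
template <typename T>
constexpr T underflow_threshold() {
    return std::numeric_limits<T>::min();
}
```

For example, unit_roundoff<float>() equals std::ldexp(1.0f, -24) and underflow_threshold<float>() equals std::ldexp(1.0f, -126), matching the Single row.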

What happens in a floating point operation?
- Round to the nearest representable floating-point number that corresponds to the exact value (correct rounding)
  - 3.75 + .425 = 4.175 ≈ 4.18 (3 significant digits, round toward even)
- When there are ties, round to the nearest value with the lowest-order bit = 0 (rounding toward nearest even)
- Applies to +, -, *, /, rem and to format conversion
- Error formula: $\mathrm{fl}(a \,\mathrm{op}\, b) = (a \,\mathrm{op}\, b)(1 + \delta)$, where op is one of +, -, *, / and $|\delta| \le \varepsilon$ = machine epsilon; assumes no overflow, underflow, or divide by zero
- Example:
  $\mathrm{fl}(x_1 + x_2 + x_3) = [(x_1 + x_2)(1 + \delta_1) + x_3](1 + \delta_2)$
  $= x_1(1 + \delta_1)(1 + \delta_2) + x_2(1 + \delta_1)(1 + \delta_2) + x_3(1 + \delta_2)$
  $\equiv x_1(1 + e_1) + x_2(1 + e_2) + x_3(1 + e_3)$, where $|e_i| \lesssim 2\varepsilon$
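Round-to-nearest-even can be observed on a format conversion, which the slide lists among the operations that are correctly rounded. The following is a small sketch, assuming the default rounding mode and IEEE single and double formats; to_single and halfway_point are hypothetical helper names. Floats near 1 are spaced 2^-23 apart, so 1 + k·2^-24 with odd k is an exact tie between two neighbouring floats.

```cpp
#include <cmath>

// Convert a double to float. Under the default rounding mode
// (round to nearest, ties to even) the conversion is correctly rounded.
float to_single(double x) {
    return static_cast<float>(x);
}

// 1 + k * 2^-24, computed exactly in double. For odd k the value lies
// exactly halfway between two neighbouring floats.
double halfway_point(int k) {
    return 1.0 + k * std::ldexp(1.0, -24);
}
```

With k = 1 the tie is between 1 and 1 + 2^-23, and the result is 1 (lowest-order significand bit 0). With k = 3 the tie is between 1 + 2^-23 and 1 + 2^-22, and the result is 1 + 2^-22 for the same reason.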

Floating point numbers are not real numbers!
- Floating-point arithmetic does not satisfy the axioms of real arithmetic
- Trichotomy is not the only property of real arithmetic that does not hold for floats, nor even the most important
  - Addition is not associative
  - The distributive law does not hold
  - There are floating-point numbers without inverses
- It is not possible to specify a fixed-size arithmetic type that satisfies all of the properties of real arithmetic that we learned in school. The P754 committee decided to bend or break some of them, guided by some simple principles:
  - When we can, we match the behavior of real arithmetic
  - When we can't, we try to make the violations as predictable and as easy to diagnose as possible (e.g. underflow, infinities, etc.)

Consequences
- Floating point arithmetic is not associative: (x + y) + z ≠ x + (y + z)
- Example: a = 1.0..0e+38, b = -1.0..0e+38, c = 1.0
  - (a+b)+c: 1.00000000000000e+00
  - a+(b+c): 0.00000000000000e+00
- When we add a list of numbers on multiple cores, we can get different answers
  - Can be confusing if we have a race condition
  - Can even depend on the compiler
- The distributive law doesn't always hold: when y ≈ z, x*y - x*z ≠ x*(y-z)
  - In Matlab v15, let x = 1e38, y = 1.0+1.0e-12, z = 1.0-1.0e-12
  - x*y - x*z = 2.0000e+26, x*(y-z) = 2.0001e+26
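The associativity example is easy to reproduce. The sketch below uses double precision and two hypothetical helpers, sum_left and sum_right, that differ only in grouping. With a = 1e38, b = -1e38, c = 1.0, the sum b + c rounds back to b because 1.0 is far smaller than the spacing between doubles near 1e38.

```cpp
// The same three values, summed with the two possible groupings.
double sum_left(double a, double b, double c) {
    return (a + b) + c;   // a + b cancels exactly, then c is added
}

double sum_right(double a, double b, double c) {
    return a + (b + c);   // c is absorbed into b before the cancellation
}
```

Here sum_left(1e38, -1e38, 1.0) gives 1.0 while sum_right(1e38, -1e38, 1.0) gives 0.0, which is why a parallel reduction that regroups terms can change the answer.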

What is the most accurate way to sum the list of signed numbers (1e-10, 1e10, -1e10) when we have only 8 significant digits?
A. Just add the numbers in the given order
B. Sort the numbers numerically, then add in sorted order
C. Sort the numbers numerically, then pick off the largest of values coming from either end, repeating the process until done
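The three strategies can be compared directly. The sketch below is one possible reading of the options, not the lecture's own code: it uses float (about 7 significant digits) as a stand-in for the 8-digit arithmetic in the question, and it interprets option C as "after sorting, repeatedly add whichever end of the remaining range has the larger magnitude". All function names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Option A: add in the given order.
float sum_in_order(const std::vector<float>& v) {
    float sum = 0.0f;
    for (float x : v) sum += x;
    return sum;
}

// Option B: sort numerically (ascending), then add.
float sum_sorted(std::vector<float> v) {
    std::sort(v.begin(), v.end());
    return sum_in_order(v);
}

// Option C (assumed interpretation): sort, then repeatedly take the value
// of larger magnitude from either end, so large terms of opposite sign
// cancel before the small terms are added.
float sum_largest_magnitude_first(std::vector<float> v) {
    std::sort(v.begin(), v.end());
    float sum = 0.0f;
    std::size_t lo = 0, hi = v.size();   // remaining values are v[lo, hi)
    while (lo < hi) {
        if (std::fabs(v[lo]) >= std::fabs(v[hi - 1])) {
            sum += v[lo++];
        } else {
            sum += v[--hi];
        }
    }
    return sum;
}
```

For the list (1e-10, 1e10, -1e10), options A and B both lose the small term and return 0, while this reading of option C cancels the two large terms first and returns 1e-10.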

Exceptions
- An exception occurs when the result of a floating point operation is not a real number, or too extreme to represent accurately: 1/0, √-1
- P754 floating point exceptions aren't the same as C++11 exceptions
- The exception need not be disastrous (i.e. program failure)
  - Continue; tolerate the exception, repair the error
- Examples: 1.0e-30 + 1.0e+30 (inexact), 1e38*1e38 (overflow in single precision)
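In C and C++ the P754 exception flags are visible through <cfenv>, and testing them involves no C++ exception at all. The following is a minimal sketch, assuming an IEEE-conforming platform whose compiler does not optimize the arithmetic away (the volatile variables are there to force the operations to happen at run time). The helper names are made up for this example.

```cpp
#include <cfenv>

// Multiply two floats and report whether the overflow flag was raised.
bool multiply_overflows(float x, float y) {
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile float vx = x, vy = y;     // volatile: keep the multiply at run time
    volatile float product = vx * vy;  // sets the sticky flags as a side effect
    (void)product;
    return std::fetestexcept(FE_OVERFLOW) != 0;
}

// Add two floats and report whether the result had to be rounded.
bool add_is_inexact(float x, float y) {
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile float vx = x, vy = y;
    volatile float sum = vx + vy;
    (void)sum;
    return std::fetestexcept(FE_INEXACT) != 0;
}
```

With the slide's operands, multiply_overflows(1e38f, 1e38f) and add_is_inexact(1.0e-30f, 1.0e+30f) both report true, and the program simply continues.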

An example
- Graph the function f(x) = sin(x) / x
- f(0) = 1
- But we get a singularity at x = 0: 1/x = ∞
- This is an "accident" in how we represent the function (W. Kahan)
- We catch the exception (divide by 0)
- Substitute the value f(0) = 1
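A simple way to express the "substitute f(0) = 1" repair is sketched below. This is an assumption about how one might code it rather than the lecture's implementation: instead of installing a trap handler, it lets the division proceed and inspects the result. At x = 0 the quotient sin(0)/0 is 0/0, which produces NaN. The function name sinc is chosen for illustration.

```cpp
#include <cmath>

// f(x) = sin(x) / x with the removable singularity at x = 0 repaired.
double sinc(double x) {
    double q = std::sin(x) / x;   // at x = 0 this is 0/0, which yields NaN
    if (std::isnan(q) && x == 0.0) {
        return 1.0;               // substitute the limit value f(0) = 1
    }
    return q;
}
```

The common path pays only for a cheap test after the division, and the special case is handled where it occurs.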

Which of these expressions will generate an exception?
A. √-1
B. 0/0
C. Log(-1)
D. A and B
E. All of A, B, C

Why is it important to handle exceptions properly?
- Crash of Air France flight #447 in the mid-Atlantic: http://www.cs.berkeley.edu/~wkahan/15June12.pdf
- Flight #447 encountered a violent thunderstorm at 35,000 feet, and super-cooled moisture clogged the probes measuring airspeed
- The autopilot couldn't handle the situation and relinquished control to the pilots
- It displayed the message INVALID DATA without explaining why
- Without knowing what was going wrong, the pilots were unable to correct the situation in time
- The aircraft stalled, crashing into the ocean 3 minutes later
- At 20,000 feet the ice melted on the probes, but the pilots didn't know this, so they couldn't know which instruments to trust or distrust

Infinities ±∞
- Infinities extend the range of mathematical operators
  - 5 + ∞, 10*∞, ∞*∞
  - No exception: the result is exact
- How do we get infinity?
  - When the exact finite result is too large to represent accurately. Example: 2*OV [recall: OV = largest representable number]
  - We also get an Overflow exception
- Divide by zero exception
  - Return ∞ = 1/0

What is the value of the expression -1/(-∞)?
A. -0
B. +0
C. ∞
D. -∞
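The behaviour of infinities can be checked with a few lines of C++. The sketch below assumes IEEE arithmetic (std::numeric_limits<double>::is_iec559); the C++ standard itself leaves division by zero undefined, so the divide helper relies on that assumption. The names kInfinity, twice_largest and divide are hypothetical.

```cpp
#include <cmath>
#include <limits>

constexpr double kInfinity = std::numeric_limits<double>::infinity();

// 2 * OV: the exact result is finite but too large to represent.
double twice_largest() {
    volatile double ov = std::numeric_limits<double>::max();  // force run-time evaluation
    return 2.0 * ov;
}

// Plain division, so that x/0 and x/infinity can be tried out.
// Relies on IEEE semantics for a zero denominator.
double divide(double numerator, double denominator) {
    return numerator / denominator;
}
```

Under IEEE rules, 5.0 + kInfinity is kInfinity, twice_largest() is +∞, divide(1.0, 0.0) is +∞, and divide(-1.0, -kInfinity) is +0, which std::signbit can confirm.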

Signed zeroes
- We get a signed zero when the result is too small to be represented
- Example with 32-bit single precision:
  - a = -1 / 1000000000000000000.0 == -0
  - b = 1 / a
- Because a is a float, b will be -∞, but the correct value is -1000000000000000000.0
- (In single precision the quotient underflows to -0 only when its magnitude falls below about 1e-45, so the divisor must be far larger than the 1e18 written here.)

If we don't have signed zeroes, for which value(s) of x will the following equality not hold true: 1/(1/x) = x
A. -∞ and +∞
B. +0 and -0
C. -1 and 1
D. A & B
E. A & C
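How a signed zero preserves the sign through a reciprocal is sketched below in single precision. This is an illustration under the assumption of IEEE arithmetic with gradual underflow enabled; underflow_product and reciprocal are hypothetical helpers. The operands -1e-30 and 1e-30 are chosen so that the exact product, -1e-60, lies far below the smallest denormalized float.

```cpp
#include <cmath>

// Multiply two floats whose exact product may be too small to represent.
// The result then underflows to zero but keeps the sign of the product.
float underflow_product(float x, float y) {
    volatile float vx = x, vy = y;   // force run-time evaluation
    return vx * vy;
}

// 1/x in single precision; 1/(-0) is -infinity under IEEE rules.
float reciprocal(float x) {
    return 1.0f / x;
}
```

underflow_product(-1e-30f, 1e-30f) compares equal to 0 yet has its sign bit set, so its reciprocal is -∞ rather than +∞. Likewise reciprocal(reciprocal(-INFINITY)) returns -∞, which would be impossible with a single unsigned zero.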

NaN (Not a Number)
- Invalid exception
  - Exact result is not a well-defined real number
  - ∞ - ∞, NaN - 10, NaN < 2?, 0/0, √-1
- We can have a quiet NaN or a signaling NaN
  - Quiet: does not raise an exception, but propagates a distinguished value. E.g. missing data: max(3, NaN) = 3
  - Signaling: generates an exception when accessed. Detects uninitialized data

What is the value of the expression NaN < 2?
A. True
B. False
C. NaN

Why are comparisons with NaN different from other numbers?
- Why does NaN < 3 yield false?
- Because NaN is not ordered with respect to any value
- Clause 5.11, paragraph 2 of the 754-2008 standard: 4 mutually exclusive relations are possible: less than, equal, greater than, and unordered. The last case arises when at least one operand is NaN. Every NaN shall compare unordered with everything, including itself
- See Stephen Canon's entry dated 2/15/09 @ 17.00 on stackoverflow.com/questions/1565164/what-is-the-rationale-for-all-comparisons-returning-false-for-ieee754-nan-values
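The four mutually exclusive relations can be made concrete in a few lines. The sketch below assumes IEEE comparisons (no fast-math style compiler options); is_unordered and kNaN are hypothetical names, while std::isunordered and std::fmax are standard <cmath> functions.

```cpp
#include <cmath>
#include <limits>

constexpr double kNaN = std::numeric_limits<double>::quiet_NaN();

// True when none of <, ==, > holds, i.e. the fourth relation "unordered".
bool is_unordered(double a, double b) {
    return !(a < b) && !(a == b) && !(a > b);
}
```

is_unordered(kNaN, 2.0) and is_unordered(kNaN, kNaN) are both true, in agreement with std::isunordered, whereas is_unordered(1.0, 2.0) is false. std::fmax(3.0, kNaN) returns 3.0, which illustrates the "missing data" convention for quiet NaNs.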

Working with NaNs
A NaN is unusual in that you cannot compare it with anything, including itself!

    #include <cmath>
    #include <iostream>
    using namespace std;

    int main() {
        // 0/0 is an invalid operation; by default it produces a quiet NaN
        float x = 0.0f / 0.0f;
        cout << "0/0 = " << x << endl;
        if (std::isnan(x)) cout << "IsNan\n";
        // A NaN compares unequal to everything, including itself
        if (x != x) cout << "x != x is another way of saying isnan\n";
        return 0;
    }

Output:

    0/0 = nan
    IsNan
    x != x is another way of saying isnan

Summary of representable numbers

Value                  Exponent field    Significand field
±0                     0…0               0…0
Normalized nonzeros    not 0 or all 1s   anything
Denormalized numbers   0…0               nonzero
±∞                     1…1               0…0
NaNs                   1…1               nonzero

- NaNs: signaling and quiet (signaling has the most significant mantissa bit set to 0; quiet has the most significant bit set to 1)
- Often supported as quiet NaNs only
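The encoding rules in this summary can be applied by hand to the bits of a float. The following sketch assumes the 32-bit IEEE single format (1 sign bit, 8 exponent bits, 23 stored significand bits); classify_bits is a hypothetical helper, and in practice std::fpclassify from <cmath> does the same job.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>
#include <string>

static_assert(std::numeric_limits<float>::is_iec559 && sizeof(float) == 4,
              "this sketch assumes IEEE single precision");

// Classify a float from its exponent and significand (fraction) fields.
std::string classify_bits(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);           // reinterpret without aliasing issues
    std::uint32_t exponent = (bits >> 23) & 0xFFu; // 8 exponent bits
    std::uint32_t fraction = bits & 0x7FFFFFu;     // 23 stored significand bits
    if (exponent == 0) {
        return fraction == 0 ? "zero" : "denormal";
    }
    if (exponent == 0xFFu) {
        return fraction == 0 ? "infinity" : "nan";
    }
    return "normal";
}
```

The sign bit is ignored here, so +0 and -0 both classify as "zero" and ±∞ both as "infinity".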

Denormalized numbers
- Let's compute: if (a ≠ b) then x = a/(a-b)
- We should never divide by 0, even if a-b is tiny
- Underflow exception occurs when exact result a-b < underflow threshold UN (too small to represent)
- We return a denormalized number for a-b
  - Relax the restriction that the leading digit is 1: 0.d…d × 2^min_exp
  - Fills in the gap between 0 and UN with a uniform distribution of values
  - Ensures that we never divide by 0
  - Some loss of precision
(Credit: Jim Demmel)
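The guarantee that a ≠ b implies a - b ≠ 0 can be tried out just above the underflow threshold. The sketch below assumes single precision with gradual underflow enabled, meaning no flush-to-zero mode is in effect; the names kUnderflowThreshold, difference and guarded_ratio are hypothetical.

```cpp
#include <cmath>
#include <limits>

// UN for single precision: the smallest positive normalized float.
constexpr float kUnderflowThreshold = std::numeric_limits<float>::min();

// a - b evaluated at run time in single precision.
float difference(float a, float b) {
    volatile float va = a, vb = b;
    return va - vb;
}

// The slide's guarded division: divide only when a != b.
float guarded_ratio(float a, float b) {
    if (a != b) {
        return a / difference(a, b);  // never 0 in the denominator with denormals
    }
    return 0.0f;                      // arbitrary placeholder for the a == b case
}
```

With a = 1.5·UN and b = UN the difference 0.5·UN is below UN, yet it is returned as a nonzero denormalized number, so guarded_ratio yields exactly 3 instead of dividing by zero.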

Extra precision
- The extended precision format provides 80 bits
- Enables us to reduce rounding errors
- Not obligatory, though many vendors support it
- We see extra precision in registers only
- There is a loss of precision when we store to memory (80 → 64 bits)
- Not supported in SSE, scalar values only

When compiler optimizations alter precision
- If we support the 80-bit extended format in registers, then when we store values into memory they will be rounded to the lower-precision format, e.g. 64 bits
- Compilers can keep things in registers, and we may lose referential transparency, depending on the optimization
- Example: round(4.55, 1) should return 4.6, but returns 4.5 with -O1 through -O3

    #include <cmath>

    // Round v to the given number of decimal digits
    double round(double v, double digit) {
        double p = std::pow(10.0, digit);   // scale factor, e.g. 10 for 1 digit
        double t = v * p;                   // may be held in an 80-bit register
        double r = std::floor(t + 0.5);
        return r / p;
    }

- With optimization turned on, p is computed to extra precision; it is not stored as a float (and rounded to 4.55), but lives in a register and is stored as 4.550000190734863
  - t = v*p = 45.5000019073486
  - r = floor(t + 0.5) = 46
  - r/p = 4.6

Exception handling: interface
- P754 standardizes how we handle exceptions
  - Overflow: exact result > OV, too large to represent, returns ±∞
  - Underflow: exact result nonzero and < UN, too small to represent
  - Divide-by-zero: nonzero/0, returns ±∞ = ±1/±0 (signed zeroes)
  - Invalid: 0/0, √-1, log(0), etc.
  - Inexact: there was a rounding error (common)
- Each of the 5 exceptions manipulates 2 flags: should a trap occur?
  - By default we don't trap, but we continue computing: NaN, ∞, denorm
  - If we do trap: enter a trap handler, where we have access to the arguments of the operation that caused the exception
  - Requires precise interrupts, causes problems on a parallel computer, usually not implemented
- We can use exception handling to build faster algorithms
  - Try the faster but riskier algorithm (but denorm can be slow)
  - Rapidly test for accuracy (use exception handling)
  - Substitute the slower, more stable algorithm as needed
  - See Demmel & Li: crd.lbl.gov/~xiaoye
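The "fast but risky first, then check the flags" pattern can be sketched with the vector 2-norm. This is an illustrative example chosen here and not taken from the lecture or from Demmel and Li's work: the fast path sums squares directly, and only if the overflow or underflow flag was raised does it redo the computation with scaling. The function name norm2 is hypothetical, and the sketch assumes the compiler does not move the arithmetic across the flag calls (the volatile accumulator is meant to discourage that).

```cpp
#include <algorithm>
#include <cfenv>
#include <cmath>
#include <vector>

// Euclidean norm: try the cheap formula, fall back to a scaled one
// only when the sticky overflow/underflow flags report trouble.
double norm2(const std::vector<double>& x) {
    std::feclearexcept(FE_OVERFLOW | FE_UNDERFLOW);
    volatile double sum = 0.0;                 // volatile: keep the work between the flag calls
    for (double v : x) sum = sum + v * v;      // fast path: squares may overflow or underflow
    if (!std::fetestexcept(FE_OVERFLOW | FE_UNDERFLOW)) {
        return std::sqrt(sum);
    }
    // Slow, more stable path: scale by the largest magnitude first.
    double scale = 0.0;
    for (double v : x) scale = std::max(scale, std::fabs(v));
    if (scale == 0.0 || std::isinf(scale)) return scale;
    double scaled_sum = 0.0;
    for (double v : x) {
        double r = v / scale;                  // |r| <= 1, so r * r cannot overflow
        scaled_sum += r * r;
    }
    return scale * std::sqrt(scaled_sum);
}
```

For well-scaled input such as {3, 4} only the fast loop runs. For {3·2^600, 4·2^600} the squares overflow, the flag test catches it, and the scaled path returns 5·2^600.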

Fin
