float.h
Similary to limits.h
, some limits must be established on the characteristics
of floating-point arithmetic for a given architecture. These are determined by a
model described in the standard and implemented in float.h
. This won't be a
comprehensive introduction to floating-point arithmetic as there are already
resources for that:
- Single-precision floating-point format - Wikipedia
- Double-precision floating-point format - Wikipedia
- Floating point - Wikipedia
- What Every Computer Scientist Should Know About Floating-Point Arithmetic
The Floating-Point Model
Before talking about the specifics of float.h
, we need to lay some groundwork
for how floating-point numbers are defined and some of the terminology that goes
along with them.
- sign - positive or negative
- radix - the numerical base of exponent representation (binary would have a radix of 2)
- significand/mantissa - the part of the floating-point number consisting of its significant digits
- precision - the number of digits in the significand
With those definitions, we'll construct a few variables that we need to construct the floating-point model:
- \(s\) - sign
- \(b\) - base/radix
- \(e\) - exponent
- \(p\) - precision
- \(f_{k}\) - nonnegative integers less than \(b\) (the significand digits)
The standard states, "A normalized floating-point number \(x ( f_1 > 0 \mbox{ if } x \neq 0 )\) is defined by the following model":
This is the model that we will implement with float.h
and which can be used to
construct a floating-point number. For this implementation, \(b\) will always be
\(2\) since we're dealing with the binary representation of numbers. In addition,
we must have a range for \(e\) by defining \(e_{min}\) and \(e_{max}\). To calculate
these values we use the following equations:
- \(k =\) type width in bits
- \(e_{max} = 2^{k-p-1}\)
- \(e_{min} = 1 - e_{max}\)
Floating Types
float
, double
, and long double
are the floating-point types provided by C.
Since this first version of welibc is for the x86_64 architecture we can turn to
the x86_64 ABI to see how these types are implemented. Page 12 shows that
these types are represented by single-precision, double-precision, and 80-bit
extended precision floating-point types, respectively. It also shows us that
that each of the three floating-point types are to be implemented according to
the IEEE Standard for Floating-Point Arithmetic. I'll be choosing to use the
most recent revision of that standard, IEEE 754-2008.
Now we have some reference documentation to help us implement float.h
. We'll
use this documentation to define the variables we discussed above and then we
can use those to calculate the values we need for float.h
. IEEE 754-2008 will
give us values that define the widths, in bits, for the sign (\(s\)), exponent
(\(e\)), and precision (\(p\)) of each floating-point type. From those variables we
can derive the rest of the information we'll need.
IEEE 754 uses normalized floating-point numbers meaning that each value is represented with a nonzero leading digit. For example, \(2.2573 \times 10^{3}\) is normalized, but \(0.22573 \times 10^{4}\) is not, even though they represent the same quantity. By using normalized numbers we ensure that each value is uniquely represented.
Since we're using normalized numbers and the leading digit will always be nonzero (except for infinity, 0, and Not a Number), then the most significant bit of the significand would always be 1. There is no point in storing a bit if it will always be 1 so IEEE 754 uses that storage space to gain more precision. Since the original most significant bit is no longer stored in the floating type, it is referred to as a "hidden" bit.
The hidden bit has a few implications that we need to keep in mind through calculating our values. First, the field width of the significand is actually one more than the number of bits we use to represent it. We'll look at this below when we define all of the field widths. Second, the exponent field implicitly encodes the hidden bit for us so the computed exponent ranges are really one less than the value we need. I'll make sure to note these as we go along.
Float
The float
type is a single-precision floating-point number represented by
32 bits (\(k = 32\)). The IEEE 754-2008 standard refers to this format as binary32
and tells us that the bits are broken down like so:
- \(s = 1\)
- \(e = 8\)
- \(p = 24\)
See how the total bits here is actually 33? This is due to the hidden bit in the significand; there are physically 23 bits for the significand but we get 24 bits because the most significant bit is always 1 for normalized floating point numbers.
Double
The double
type is a double-precision floating-point number represented
by 64 bits (\(k = 64\)). The IEEE 754-2008 standard refers to this format as
binary64 and breaks down the bits like so:
- \(s = 1\)
- \(e = 11\)
- \(p = 53\)
Again, 53 bits for the significand where there are only 52 available.
Long Double
The long double
type is an extended-precision floating-point number
represented by 80 bits (\(k = 80\)). The IEEE 754-2008 standard refers to this
format as binary64 extended and breaks down the bits like so:
- \(s = 1\)
- \(e = 15\)
- \(p = 64\)
Don't forget that the hidden bit for the significand is what gives us 64 instead of 63 bits for \(p\).
float.h Values and Equations
Now that we have the inputs we need, we can derive each value in float.h
according to the equations given to us in the C standard. For all but two values
in float.h
, there is a separate named version for each floating point type
denoted by the prefixes FLT
, DBL
, and LDBL
, for float
, double
, and
long double
, respectively. We'll look at each value to see what it means and
how to compute it based on the data we've already collected.
FLT_ROUNDS
The rounding mode for floating-point addition. This value is used for all types
rather than having separate versions for each type (even though it is prefixed
with FLT
). The standard defines a set of possible assignments and their
meanings:
- \(-1\) - indeterminate
- \(0\) - toward zero
- \(1\) - to nearest
- \(2\) - toward positive infinity
- \(3\) - toward negative infinity
Your specific application of C may necessitate one of these specific rounding modes. However, what if you don't know? How do you choose one?
Well, we can look at What Every Computer Scientist Should Know About Floating-Point Arithmetic for advice (you did read it, didn't you?). There are two very relevant portions from this paper. The first is titled "Exactly Rounded Operations" in the "Rounding Error" section. The second is in the "IEEE Standard" section titled "Operations". The author suggests that you round to the nearest floating point number using the round to even method.
#define FLT_ROUNDS (1)
Exactly Rounded Operations
The paper states that rounding is pretty straightforward except for halfway cases such as 12.5; should that be 12 or 13? It then introduces the concept of "round to even" which proposes that instead of rounding up for all halfway values, they should be rounded towards the number that is even. This should result in 50% of the occurrences rounding up and 50% rounding down, rather than 100% rounding up. It also references a paper by Reiser and Knuth which backs up the position that round to even should be used.
So, why round to even on halfway values if half of the values are rounded down (0, 1, 2, 3, 4) and half of the values are rounded up (5, 6, 7, 8, 9)? Well, rounding up every time results in drift which is the gradual upward increase in repeated computations. By rounding to even you can eliminate drift since the error from one round up would be cancelled out by a subsequent round down.
IEEE 754-2008 Rounding
Goldberg's paper also references the IEEE standard because it also suggests that round to even be used. If we look at section 4.3, "Rounding-direction attributes", of IEEE 754-2008 we find that it refers to round to even as roundTiesToEven, stating that it should be the default rounding-direction attribute for results in binary formats. Although it also defines 4 other rounding attributes, we'll go with the default of roundTiesToEven which is an attribute of round to nearest.
FLT_RADIX
The radix of the exponent representation, \(b\). As stated before, the radix will always be \(2\) since we are performing binary floating-point arithmetic and the base of a binary numerical system is \(2\).
#define FLT_RADIX (2)
FLT_MANT_DIG
The number of base-FLT_RADIX
digits in the floating-point significand, \(p\).
Alternatively, this is the number of bits used to represent the significand
which is the value of \(p\) from our floating-types discussion above.
#define FLT_MANT_DIG (24)
#define DBL_MANT_DIG (53)
#define LDBL_MANT_DIG (64)
FLT_DIG
The number of decimal digits, \(q\), such that any floating-point number with \(q\) decimal digits can be rounded into a floating-point number with \(p\) radix \(b\) digits and back again without change to the \(q\) decimal digits,
This dictates the maximum number of decimal digits that can be converted to
binary floating-point and back with no changes resulting from conversion. We'll
use \(p\) (the FLT_MANT_DIG
family) and \(b\) (FLT_RADIX
).
First, let's try to simplify this equation a little bit. We can immediately see
that we will always be adding \(0\) to the result of the \(floor\) computation
because our \(b\) (FLT_RADIX
) value is \(2\) which is not a power of \(10\).
Secondly, since \(b\) is constant, \(log_{10}b\) is also constant. Lastly, we just
insert the value of \(p\) for each floating-type and we have our solutions.
FLT_DIG
\(= \lfloor(24-1) \times log_{10}2\rfloor + 0 = \lfloor 23 \times log_{10}2 \rfloor = \lfloor 6.92368... \rfloor = 6\)DBL_DIG
\(= \lfloor(53-1) \times log_{10}2\rfloor + 0 = \lfloor 52 \times log_{10}2 \rfloor = \lfloor 15.65355... \rfloor = 15\)LDBL_DIG
\(= \lfloor(64-1) \times log_{10}2\rfloor + 0 = \lfloor 63 \times log_{10}2 \rfloor = \lfloor 18.96488... \rfloor = 18\)
And we end up with:
#define FLT_DIG (6)
#define DBL_DIG (15)
#define LDBL_DIG (18)
FLT_MAX_EXP
The maximum integer such that FLT_RADIX
raised to that power minus \(1\) is a
representable finite floating-point number, \(e_{max}\). Or, the maximum possible
exponent we can use. To compute this we calculate the highest signed value that
we can store in the number of bits alotted to the exponent field. \(b\) is our
base and \(e\) is the exponent width (in bits) from above:
FLT_MAX_EXP
\(= 2^{8-1} - 1 = 2^{7} - 1 = 128 - 1 = 127\)DBL_MAX_EXP
\(= 2^{11-1} - 1 = 2^{10} - 1 = 1024 - 1 = 1023\)LDBL_MAX_EXP
\(= 2^{15-1} - 1 = 2^{14} - 1 = 16384 - 1 = 16383\)
We can verify these results against IEEE 754-2008 which has some handy tables that just give you the values for \(e_{max}\) (specifically Table 3.5-Binary interchange format parameters). Not to spoil the surprise, but our answers agree with the table.
Now, while these values are correct, they are not what we will actually use. Remember that hidden bit from before? This is where it comes in to play with the exponent, so we need to add 1 to these value to get the actual max exponent.
#define FLT_MAX_EXP (+128)
#define DBL_MAX_EXP (+1024)
#define LDBL_MAX_EXP (+16384)
FLT_MIN_EXP
The minimum negative integer such that FLT_RADIX
raised to that power minus
\(1\) is a normalized floating-point number, \(e_{min}\). Or, the lowest possible
exponent we can use. IEEE 754-2008 gives us the following equation to compute
this value:
It's important to note that the \(e_{max}\) in this equation is the computed value
of \(e_{max}\) from before, not the value of the FLT_EXP_MAX
family of macros.
FLT_MIN_EXP
\(= 1 - 127 = -126\)DBL_MIN_EXP
\(= 1 - 1023 = -1022\)LDBL_MIN_EXP
\(= 1 - 16383 = -16382\)
Again, due to the hidden bit in the significand, we must add one to these values to get the minimum exponent.
#define FLT_MIN_EXP (-125)
#define DBL_MIN_EXP (-1021)
#define LDBL_MIN_EXP (-16381)
FLT_MIN_10_EXP
The minimum negative integer such that \(10\) raised to that power is in the range of normalized floating-point numbers,
This is the minimum exponent we could use in a number written in decimal
scientific notation while still being able to reasonably convert it to a binary
floating-point representation. Here, \(e_{min}\) is the value of the FLT_MIN_EXP
family of macros to account for the hidden bit.
FLT_MIN_10_EXP
\(= \lceil log_{10}2^{-125-1} \rceil = \lceil log_{10}2^{-126} \rceil = \lceil -37.92977... \rceil = -37\)DBL_MIN_10_EXP
\(= \lceil log_{10}2^{-1021-1} \rceil = \lceil log_{10}2^{-1022} \rceil = \lceil -307.65265... \rceil = -307\)LDBL_MIN_10_EXP
\(= \lceil log_{10}2^{-16381-1} \rceil = \lceil log_{10}2^{-16382} \rceil = \lceil -4931.47338... \rceil = -4931\)
Which yields:
#define FLT_MIN_10_EXP (-37)
#define DBL_MIN_10_EXP (-307)
#define LDBL_MIN_10_EXP (-4931)
FLT_MAX_10_EXP
The maximum integer such that \(10\) raised to that power is in the range of representable finite floating-point numbers,
This is the maximum exponent we could use in a number written in decimal
scientific notation while still being able to reasonably convert it to a binary
floating-point representation. Here, \(e_{max}\) is the value of the FLT_MAX_EXP
family of macros to account for the hidden bit.
FLT_MAX_10_EXP
\(= \lfloor log_{10}((1 - 2^{-24}) \times 2^{128})\rfloor = \lfloor log_{10}3.40282... \times 10^{38} \rfloor = \lfloor 38.53183... \rfloor = 38\)DBL_MAX_10_EXP
\(= \lfloor log_{10}((1 - 2^{-53}) \times 2^{1024})\rfloor = \lfloor log_{10}1.79769... \times 10^{308} \rfloor = \lfloor 308.25471... \rfloor = 308\)LDBL_MAX_10_EXP
\(= \lfloor log_{10}((1 - 2^{-64}) \times 2^{16384})\rfloor = \lfloor log_{10}1.18973... \times 10^{4932} \rfloor = \lfloor 4932.07544... \rfloor = 4932\)
Whew!
#define FLT_MAX_10_EXP (+38)
#define DBL_MAX_10_EXP (+308)
#define LDBL_MAX_10_EXP (+4932)
FLT_MAX
The maximum representable floating-point number,
You'll notice that this exact equation is found embedded in the calculation of
FLT_MAX_10_EXP
because it uses the maximum floating-point number to derive
the exponent needed to represent it.
FLT_MAX
\(= (1 - 2^{-24}) \times 2^{128} = 3.40282347... \times 10^{38}\)DBL_MAX
\(= (1 - 2^{-53}) \times 2^{1024} = 1.7976931348623157... \times 10^{308}\)LDBL_MAX
\(= (1 - 2^{-64}) \times 2^{16384} = 1.18973149535723176502... \times 10^{4932}\)
Note that the exponents for each of these match the values we calculated for the
FLT_MAX_10_EXP
family of macros. You should also see that there are an
increasing number of significant figures as we go. This is because each
successively larger floating type can hold more precise values than the last. To
determine how many significant figures we should use you can just count the
number of digits in the largest decimal value that can be stored in the bits
alotted for the significand. This is equal to:
With that equation we arrive at \(8\), \(16\), and \(20\) decimal digits for the
float
, double
, and long double
floating types, respectively.
#define FLT_MAX (3.40282347E+38F)
#define DBL_MAX (1.7976931348623157E+308)
#define LDBL_MAX (1.18973149535723176502E+4932L)
FLT_EPSILON
The difference between \(1\) and the least value greater than \(1\) that is representable in the given floating point type,
In other words, you could store 1.0 as a float, but your floating point type
probably can't accurately represent \(1.00000000000000000000000023\). We need to
define the smallest step you can take when incrementing the floating point type
which is what FLT_EPSILON
provides.
FLT_EPSILON
\(= 2^{1-24} = 2^{-23} = 1.19209290... \times 10^{-7}\)DBL_EPSILON
\(= 2^{1-53} = 2^{-52} = 2.2204460492503131... \times 10^{-16}\)LDBL_EPSILON
\(= 2^{1-64} = 2^{-63} = 1.08420217248550443401... \times 10^{-19}\)
Again, we have used a number of significant digits which corresponds to the precision of the floating type which stores that value.
#define FLT_EPSILON (1.19209290-7F)
#define DBL_EPSILON (2.2204460492503131E-16)
#define LDBL_EPSILON (1.08420217248550443401E-19L)
FLT_MIN
The minimum normalized positive floating-point number,
You'll notice that this exact equation is embedded within the calculation of
the FLT_MIN_10_EXP
family of macros for the same reason as the equation used to
calculate FLT_MAX
(but for the minimum number).
FLT_MIN
\(= 2^{-125-1} = 2^{-126} = 1.17549435... \times 10^{-39}\)DBL_MIN
\(= 2^{-1021-1} = 2^{-1022} = 2.2250738585072014... \times 10^{-308}\)LDBL_MIN
\(= 2^{-16381-1} = 2^{-16382} = 3.36210314311209350626... \times 10^{-4932}\)
Those are some tiny numbers!
#define FLT_MIN (1.17549435E-38F)
#define DBL_MIN (2.2250738585072014E-308)
#define LDBL_MIN (3.36210314311209350626E-4932L)
Conclusion
You may have noticed an F
appended to each FLT_
value expressed in
floating-point notation. This is a floating suffix and is a mechanism used by
the C language to indicate that the prepended floating constant is to be
interpreted as a float
type. Similarly, the L
floating suffix is used to
denote a value to be interpreted as a long double
type. Floating constants
with no floating suffix are interpreted as a double
floating type by default.
Lastly, if you paid attention to the values we used vs. the values that
wolframalpha got then you'll notice that we used the "round to even" method,
when necessary, to achieve a proper result.
Our full implementation looks like so:
#ifndef _FLOAT_H
#define _FLOAT_H
#define FLT_ROUNDS (1)
#define FLT_RADIX (2)
#define FLT_MANT_DIG (24)
#define DBL_MANT_DIG (53)
#define LDBL_MANT_DIG (64)
#define FLT_DIG (6)
#define DBL_DIG (15)
#define LDBL_DIG (18)
#define FLT_MIN_EXP (-125)
#define DBL_MIN_EXP (-1021)
#define LDBL_MIN_EXP (-16381)
#define FLT_MIN_10_EXP (-37)
#define DBL_MIN_10_EXP (-307)
#define LDBL_MIN_10_EXP (-4931)
#define FLT_MAX_EXP (+128)
#define DBL_MAX_EXP (+1024)
#define LDBL_MAX_EXP (+16384)
#define FLT_MAX_10_EXP (+38)
#define DBL_MAX_10_EXP (+308)
#define LDBL_MAX_10_EXP (+4932)
#define FLT_MAX (3.40282347E+38F)
#define DBL_MAX (1.7976931348623157E+308)
#define LDBL_MAX (1.18973149535723176502E+4932L)
#define FLT_EPSILON (1.19209290-7F)
#define DBL_EPSILON (2.2204460492503131E-16)
#define LDBL_EPSILON (1.08420217248550443401E-19L)
#define FLT_MIN (1.17549435E-38F)
#define DBL_MIN (2.2250738585072014E-308)
#define LDBL_MIN (3.36210314311209350626E-4932L)
#endif /* _FLOAT_H */