Similarly to limits.h, some limits must be established on the characteristics of floating-point arithmetic for a given architecture. These are determined by a model described in the standard and implemented in float.h. This won't be a comprehensive introduction to floating-point arithmetic, as there are already good resources for that, such as Goldberg's What Every Computer Scientist Should Know About Floating-Point Arithmetic.

The Floating-Point Model

Before talking about the specifics of float.h, we need to lay some groundwork for how floating-point numbers are defined and some of the terminology that goes along with them.

  • sign - positive or negative
  • radix - the numerical base of exponent representation (binary would have a radix of 2)
  • significand/mantissa - the part of the floating-point number consisting of its significant digits
  • precision - the number of digits in the significand

With those definitions in hand, we'll define a few variables that we need to construct the floating-point model:

  • \(s\) - sign
  • \(b\) - base/radix
  • \(e\) - exponent
  • \(p\) - precision
  • \(f_{k}\) - nonnegative integers less than \(b\) (the significand digits)

The standard states, "A normalized floating-point number \(x ( f_1 > 0 \mbox{ if } x \neq 0 )\) is defined by the following model":

$$ x = s \times b^{e} \times \sum\limits_{k=1}^p f_{k} \times b^{-k} \mbox{ , } e_{min} \leq e \leq e_{max} $$

This is the model that we will implement with float.h and which can be used to construct a floating-point number. For this implementation, \(b\) will always be \(2\) since we're dealing with the binary representation of numbers. In addition, we must have a range for \(e\) by defining \(e_{min}\) and \(e_{max}\). To calculate these values we use the following equations (a quick check for the float type follows the list):

  • \(k =\) type width in bits
  • \(e_{max} = 2^{k-p-1}\)
  • \(e_{min} = 1 - e_{max}\)
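As a quick check, take the float type we'll define below, where \(k = 32\) and \(p = 24\):

$$ e_{max} = 2^{32-24-1} = 2^{7} = 128 $$

This matches the FLT_MAX_EXP value we'll derive later on.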

Floating Types

float, double, and long double are the floating-point types provided by C. Since this first version of welibc is for the x86_64 architecture, we can turn to the x86_64 ABI to see how these types are implemented. Page 12 shows that these types are represented by single-precision, double-precision, and 80-bit extended-precision floating-point types, respectively. It also shows us that each of the three floating-point types is to be implemented according to the IEEE Standard for Floating-Point Arithmetic. I'll be choosing to use the most recent revision of that standard, IEEE 754-2008.

Now we have some reference documentation to help us implement float.h. We'll use this documentation to define the variables we discussed above and then we can use those to calculate the values we need for float.h. IEEE 754-2008 will give us values that define the widths, in bits, for the sign (\(s\)), exponent (\(e\)), and precision (\(p\)) of each floating-point type. From those variables we can derive the rest of the information we'll need.

IEEE 754 uses normalized floating-point numbers meaning that each value is represented with a nonzero leading digit. For example, \(2.2573 \times 10^{3}\) is normalized, but \(0.22573 \times 10^{4}\) is not, even though they represent the same quantity. By using normalized numbers we ensure that each value is uniquely represented.

Since we're using normalized numbers and the leading digit will always be nonzero (except for infinity, 0, and Not a Number), then the most significant bit of the significand would always be 1. There is no point in storing a bit if it will always be 1 so IEEE 754 uses that storage space to gain more precision. Since the original most significant bit is no longer stored in the floating type, it is referred to as a "hidden" bit.

The hidden bit has a few implications that we need to keep in mind through calculating our values. First, the field width of the significand is actually one more than the number of bits we use to represent it. We'll look at this below when we define all of the field widths. Second, the exponent field implicitly encodes the hidden bit for us so the computed exponent ranges are really one less than the value we need. I'll make sure to note these as we go along.

Float

The float type is a single-precision floating-point number represented by 32 bits (\(k = 32\)). The IEEE 754-2008 standard refers to this format as binary32 and tells us that the bits are broken down like so:

  • \(s = 1\)
  • \(e = 8\)
  • \(p = 24\)

See how the total here is actually 33 bits? This is due to the hidden bit in the significand; there are physically 23 bits for the significand, but we get 24 bits because the most significant bit is always 1 for normalized floating-point numbers.
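To make the layout concrete, here's a minimal sketch that pulls apart the bits of a float on x86_64 (it assumes unsigned int is 32 bits wide, as the ABI specifies). For 1.0F you should see a sign of 0, a biased exponent of 0x7F, and an all-zero stored significand, with the hidden bit implied:

#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = 1.0F; /* stored as 0x3F800000 */
    unsigned int bits;

    /* Copy the object representation of the float into an integer
     * so we can examine the 1/8/23 bit fields. */
    memcpy(&bits, &f, sizeof bits);

    printf("sign:        %u\n",     bits >> 31);           /* 0 */
    printf("exponent:    0x%02X\n", (bits >> 23) & 0xFFU); /* 0x7F */
    printf("significand: 0x%06X\n", bits & 0x7FFFFFU);     /* 0x000000 */

    return 0;
}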

Double

The double type is a double-precision floating-point number represented by 64 bits (\(k = 64\)). The IEEE 754-2008 standard refers to this format as binary64 and breaks down the bits like so:

  • \(s = 1\)
  • \(e = 11\)
  • \(p = 53\)

Again, we get 53 bits for the significand where only 52 are physically stored.

Long Double

The long double type is an extended-precision floating-point number represented by 80 bits (\(k = 80\)). The IEEE 754-2008 standard refers to this format as binary64 extended and breaks down the bits like so:

  • \(s = 1\)
  • \(e = 15\)
  • \(p = 64\)

Don't forget that the hidden bit for the significand is what gives us 64 instead of 63 bits for \(p\).

float.h Values and Equations

Now that we have the inputs we need, we can derive each value in float.h according to the equations given to us in the C standard. For all but two values in float.h, there is a separate named version for each floating-point type, denoted by the prefixes FLT, DBL, and LDBL for float, double, and long double, respectively. We'll look at each value to see what it means and how to compute it based on the data we've already collected.

FLT_ROUNDS

The rounding mode for floating-point addition. This value is used for all types rather than having separate versions for each type (even though it is prefixed with FLT). The standard defines a set of possible assignments and their meanings:

  • \(-1\) - indeterminate
  • \(0\) - toward zero
  • \(1\) - to nearest
  • \(2\) - toward positive infinity
  • \(3\) - toward negative infinity

Your particular application may necessitate one of these specific rounding modes. However, what if you don't know? How do you choose one?

Well, we can look at What Every Computer Scientist Should Know About Floating-Point Arithmetic for advice (you did read it, didn't you?). There are two very relevant portions from this paper. The first is titled "Exactly Rounded Operations" in the "Rounding Error" section. The second is in the "IEEE Standard" section titled "Operations". The author suggests that you round to the nearest floating point number using the round to even method.

#define FLT_ROUNDS  (1)

Exactly Rounded Operations

The paper states that rounding is pretty straightforward except for halfway cases such as 12.5; should that be 12 or 13? It then introduces the concept of "round to even" which proposes that instead of rounding up for all halfway values, they should be rounded towards the number that is even. This should result in 50% of the occurrences rounding up and 50% rounding down, rather than 100% rounding up. It also references a paper by Reiser and Knuth which backs up the position that round to even should be used.

So, why round to even on halfway values if half of the final digits (0, 1, 2, 3, 4) already round down and half (5, 6, 7, 8, 9) round up? Well, rounding up every halfway case results in drift: a gradual upward bias that accumulates over repeated computations. By rounding ties to even you can eliminate drift, since the error from one round up tends to be cancelled out by a subsequent round down. It also references a paper by Reiser and Knuth which backs up the position that round to even should be used.
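As a small demonstration of the effect (this is not part of float.h, and it leans on the host's math library), rint() rounds to an integer using the current rounding direction, which defaults to round-to-nearest-even on x86_64:

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Halfway cases round to the even neighbor rather than always up. */
    printf("rint(0.5) = %.1f\n", rint(0.5)); /* 0.0 */
    printf("rint(1.5) = %.1f\n", rint(1.5)); /* 2.0 */
    printf("rint(2.5) = %.1f\n", rint(2.5)); /* 2.0 */

    return 0;
}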

IEEE 754-2008 Rounding

Goldberg's paper also references the IEEE standard because it also suggests that round to even be used. If we look at section 4.3, "Rounding-direction attributes", of IEEE 754-2008 we find that it refers to round to even as roundTiesToEven, stating that it should be the default rounding-direction attribute for results in binary formats. Although it also defines 4 other rounding attributes, we'll go with the default of roundTiesToEven which is an attribute of round to nearest.

FLT_RADIX

The radix of the exponent representation, \(b\). As stated before, the radix will always be \(2\) since we are performing binary floating-point arithmetic and the base of a binary numerical system is \(2\).

#define FLT_RADIX   (2)

FLT_MANT_DIG

The number of base-FLT_RADIX digits in the floating-point significand, \(p\). Alternatively, this is the number of bits used to represent the significand which is the value of \(p\) from our floating-types discussion above.

#define FLT_MANT_DIG    (24)
#define DBL_MANT_DIG    (53)
#define LDBL_MANT_DIG   (64)

FLT_DIG

The number of decimal digits, \(q\), such that any floating-point number with \(q\) decimal digits can be rounded into a floating-point number with \(p\) radix \(b\) digits and back again without change to the \(q\) decimal digits,

$$ \lfloor(p-1) \times log_{10}b \rfloor + \begin{cases} 1 & \text{if } b \text{ is a power of 10}\\ 0 & \text{otherwise} \end{cases} $$

This dictates the maximum number of decimal digits that can be converted to binary floating-point and back with no changes resulting from conversion. We'll use \(p\) (the FLT_MANT_DIG family) and \(b\) (FLT_RADIX).

First, let's try to simplify this equation a little bit. We can immediately see that we will always be adding \(0\) to the result of the \(floor\) computation because our \(b\) (FLT_RADIX) value is \(2\), which is not a power of \(10\). Secondly, since \(b\) is constant, \(log_{10}b\) is also constant. Lastly, we just insert the value of \(p\) for each floating type and we have our solutions.
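If you'd rather not do the arithmetic by hand, here's a small sketch that evaluates the simplified expression for each precision (compile with -lm):

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Precisions for float, double, and long double. */
    int p[3] = { 24, 53, 64 };
    int i;

    for (i = 0; i < 3; i++) {
        /* floor((p - 1) * log10(2)), the FLT_DIG family. */
        printf("p = %2d -> %d decimal digits\n",
               p[i], (int)floor((p[i] - 1) * log10(2.0)));
    }

    return 0;
}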

And we end up with:

#define FLT_DIG     (6)
#define DBL_DIG     (15)
#define LDBL_DIG    (18)

FLT_MAX_EXP

The maximum integer such that FLT_RADIX raised to that power minus \(1\) is a representable finite floating-point number, \(e_{max}\). Or, the maximum possible exponent we can use. To compute this we calculate the highest signed value that we can store in the number of bits allotted to the exponent field. \(b\) is our base and \(e\) is the exponent width (in bits) from above:

$$ b^{e-1} - 1 $$

We can verify these results against IEEE 754-2008 which has some handy tables that just give you the values for \(e_{max}\) (specifically Table 3.5-Binary interchange format parameters). Not to spoil the surprise, but our answers agree with the table.

Now, while these values are correct, they are not what we will actually use. Remember that hidden bit from before? This is where it comes into play with the exponent, so we need to add 1 to these values to get the actual max exponent.
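Working that through for each type (with exponent widths of 8, 11, and 15 bits):

$$ 2^{8-1} - 1 = 127 \rightarrow 128 \mbox{ , } 2^{11-1} - 1 = 1023 \rightarrow 1024 \mbox{ , } 2^{15-1} - 1 = 16383 \rightarrow 16384 $$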

#define FLT_MAX_EXP     (+128)
#define DBL_MAX_EXP     (+1024)
#define LDBL_MAX_EXP    (+16384)

FLT_MIN_EXP

The minimum negative integer such that FLT_RADIX raised to that power minus \(1\) is a normalized floating-point number, \(e_{min}\). Or, the lowest possible exponent we can use. IEEE 754-2008 gives us the following equation to compute this value:

$$ e_{min} = 1 - e_{max} $$

It's important to note that the \(e_{max}\) in this equation is the computed value of \(e_{max}\) from before, not the value of the FLT_MAX_EXP family of macros.

Again, due to the hidden bit in the significand, we must add one to these values to get the minimum exponent.
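Working that through with the computed \(e_{max}\) values and then adding one:

$$ 1 - 127 = -126 \rightarrow -125 \mbox{ , } 1 - 1023 = -1022 \rightarrow -1021 \mbox{ , } 1 - 16383 = -16382 \rightarrow -16381 $$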

#define FLT_MIN_EXP     (-125)
#define DBL_MIN_EXP     (-1021)
#define LDBL_MIN_EXP    (-16381)

FLT_MIN_10_EXP

The minimum negative integer such that \(10\) raised to that power is in the range of normalized floating-point numbers,

$$ \lceil log_{10}b^{e_{min}-1} \rceil $$

This is the minimum exponent we could use in a number written in decimal scientific notation while still being able to reasonably convert it to a binary floating-point representation. Here, \(e_{min}\) is the value of the FLT_MIN_EXP family of macros to account for the hidden bit.
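For example, for float, \(e_{min} = -125\), so:

$$ \lceil log_{10}2^{-125-1} \rceil = \lceil -126 \times log_{10}2 \rceil = \lceil -37.93 \rceil = -37 $$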

Which yields:

#define FLT_MIN_10_EXP  (-37)
#define DBL_MIN_10_EXP  (-307)
#define LDBL_MIN_10_EXP (-4931)

FLT_MAX_10_EXP

The maximum integer such that \(10\) raised to that power is in the range of representable finite floating-point numbers,

$$ \lfloor log_{10}((1 - b^{-p}) \times b^{e_{max}}) \rfloor $$

This is the maximum exponent we could use in a number written in decimal scientific notation while still being able to reasonably convert it to a binary floating-point representation. Here, \(e_{max}\) is the value of the FLT_MAX_EXP family of macros to account for the hidden bit.
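For example, for float, \(p = 24\) and \(e_{max} = 128\), so:

$$ \lfloor log_{10}((1 - 2^{-24}) \times 2^{128}) \rfloor = \lfloor 38.53 \rfloor = 38 $$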

Whew!

#define FLT_MAX_10_EXP  (+38)
#define DBL_MAX_10_EXP  (+308)
#define LDBL_MAX_10_EXP (+4932)

FLT_MAX

The maximum representable floating-point number,

$$ (1 - b^{-p}) \times b^{e_{max}} $$

You'll notice that this exact equation is found embedded in the calculation of FLT_MAX_10_EXP because it uses the maximum floating-point number to derive the exponent needed to represent it.

Note that the exponents for each of these match the values we calculated for the FLT_MAX_10_EXP family of macros. You should also see that the number of significant figures increases as we go. This is because each successively larger floating type can hold more precise values than the last. To determine how many significant figures we should use, you can just count the number of digits in the largest decimal value that can be stored in the bits allotted for the significand. This is equal to:

$$ \lceil log_{10}2^{p} \rceil $$

With that equation we arrive at \(8\), \(16\), and \(20\) decimal digits for the float, double, and long double floating types, respectively.
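For example, for float:

$$ (1 - 2^{-24}) \times 2^{128} \approx 3.40282347 \times 10^{38} $$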

#define FLT_MAX     (3.40282347E+38F)
#define DBL_MAX     (1.7976931348623157E+308)
#define LDBL_MAX    (1.18973149535723176502E+4932L)

FLT_EPSILON

The difference between \(1\) and the least value greater than \(1\) that is representable in the given floating point type,

$$ b^{1-p} $$

In other words, you could store 1.0 as a float, but your floating-point type can't accurately represent \(1.00000000000000000000000023\). We need to define the smallest step you can take upward from \(1\) in the given floating-point type, which is what FLT_EPSILON provides.

Again, we have used a number of significant digits that corresponds to the precision of the floating type storing the value.
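Here's a quick way to convince yourself of what FLT_EPSILON means (a sketch using the final header values): adding FLT_EPSILON to 1.0F lands on the next representable float, while adding anything much smaller gets rounded away.

#include <float.h>
#include <stdio.h>

int main(void)
{
    /* FLT_EPSILON is the gap between 1 and the next float up. */
    printf("%d\n", 1.0F + FLT_EPSILON != 1.0F);     /* 1: moved to the next float */
    printf("%d\n", 1.0F + FLT_EPSILON / 4 != 1.0F); /* 0: rounded back to 1 */

    return 0;
}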

#define FLT_EPSILON     (1.19209290E-7F)
#define DBL_EPSILON     (2.2204460492503131E-16)
#define LDBL_EPSILON    (1.08420217248550443401E-19L)

FLT_MIN

The minimum normalized positive floating-point number,

$$ b^{e_{min}-1} $$

You'll notice that this exact equation is embedded within the calculation of the FLT_MIN_10_EXP family of macros for the same reason as the equation used to calculate FLT_MAX (but for the minimum number).
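For example, for float, \(e_{min} = -125\) (the FLT_MIN_EXP value), so:

$$ 2^{-125-1} = 2^{-126} \approx 1.17549435 \times 10^{-38} $$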

Those are some tiny numbers!

#define FLT_MIN     (1.17549435E-38F)
#define DBL_MIN     (2.2250738585072014E-308)
#define LDBL_MIN    (3.36210314311209350626E-4932L)

Conclusion

You may have noticed an F appended to each FLT_ value expressed in floating-point notation. This is a floating suffix, a mechanism used by the C language to indicate that the preceding floating constant is to be interpreted as a float type. Similarly, the L floating suffix is used to denote a value to be interpreted as a long double type. Floating constants with no floating suffix are interpreted as a double floating type by default. Lastly, if you paid attention to the values we used vs. the values that WolframAlpha got, then you'll notice that we used the "round to even" method, when necessary, to achieve a proper result.
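To illustrate, each constant below has exactly the type its suffix (or lack of one) dictates:

float       f  = 3.40282347E+38F;               /* F suffix: float */
double      d  = 1.7976931348623157E+308;       /* no suffix: double */
long double ld = 1.18973149535723176502E+4932L; /* L suffix: long double */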

Our full implementation looks like so:

#ifndef _FLOAT_H
#define _FLOAT_H

#define FLT_ROUNDS      (1)

#define FLT_RADIX       (2)

#define FLT_MANT_DIG    (24)
#define DBL_MANT_DIG    (53)
#define LDBL_MANT_DIG   (64)

#define FLT_DIG         (6)
#define DBL_DIG         (15)
#define LDBL_DIG        (18)

#define FLT_MIN_EXP     (-125)
#define DBL_MIN_EXP     (-1021)
#define LDBL_MIN_EXP    (-16381)

#define FLT_MIN_10_EXP  (-37)
#define DBL_MIN_10_EXP  (-307)
#define LDBL_MIN_10_EXP (-4931)

#define FLT_MAX_EXP     (+128)
#define DBL_MAX_EXP     (+1024)
#define LDBL_MAX_EXP    (+16384)

#define FLT_MAX_10_EXP  (+38)
#define DBL_MAX_10_EXP  (+308)
#define LDBL_MAX_10_EXP (+4932)

#define FLT_MAX         (3.40282347E+38F)
#define DBL_MAX         (1.7976931348623157E+308)
#define LDBL_MAX        (1.18973149535723176502E+4932L)

#define FLT_EPSILON     (1.19209290E-7F)
#define DBL_EPSILON     (2.2204460492503131E-16)
#define LDBL_EPSILON    (1.08420217248550443401E-19L)

#define FLT_MIN         (1.17549435E-38F)
#define DBL_MIN         (2.2250738585072014E-308)
#define LDBL_MIN        (3.36210314311209350626E-4932L)

#endif /* _FLOAT_H */