September 15, 1998
TO: Sun Microsystems
FROM: Numerics Working Group (http://math.nist.gov/javanumerics/)
Java Grande Forum (http://www.javagrande.org/)
RE: Response to "Proposal for Extension of Java Floating Point Semantics,
Revision 1, May 1998"
PREFACE
The high efficiency necessary for large-scale numerical applications requires
aggressive exploitation of the unique facilities of floating-point hardware. At
the other extreme, some computations require very high predictability, even at
the expense of performance. The majority of users lie somewhere in between:
they want reasonably fast floating-point performance, and don't want to be
surprised when computed results misbehave unpredictably. Each of these
constituencies must be addressed if Java is to be widely adopted for
numerical computing.
The Java Grande Forum Numerics Working Group (JGNWG) is evaluating the Java
Language and its environment for such applications. It has set a number of
constraints to work under:
* Relatively small, but visible, changes to Java and the JVM can be made.
Upwards compatibility should be maintained, as well as a worthwhile
amount of backwards compatibility.
* Proposed changes should allow better execution speed on common
floating-point hardware.
* While improved performance is important, predictability must be
maintained.
This response is excerpted from a larger report on Java being prepared by
the JGNWG, which will be released later this month. Those who have contributed
to the drafting of this report are listed at the end of this memo.
INTRODUCTION
Recently, Sun released for public comment a Proposal for Extension of Java
Floating Point Semantics, Revision 1 [PEJ] (abbreviated in this document as
PEJFPS). PEJFPS is primarily targeted at improving Java's floating point
performance on the x86 line of processors. (No explicit license is granted
(yet) to use the fused mac (multiply-accumulate) instruction, which would
benefit users of the PowerPC, among other architectures.)
Assiduously implementing Java's current strict floating point semantics on
the x86 using previously published techniques is very expensive, potentially
more than an order of magnitude slower than slightly different semantics
[Gol]. A less expensive technique developed recently [Gol98] will be
discussed later. PEJFPS grants partial access to the double extended
floating point format found on the x86 in order to overcome the speed
problem. However, the reckless license to use or not to use double extended
granted by PEJFPS destroys Java's predictability (see recent submissions to
the numeric-interest mailing list, http://www.validgh.com/java/).
Java has been billed as providing "write once, run anywhere" program
development. For both theoretical and practical reasons, Java programs are
not nearly so portable nor reproducible as programmers would naively expect.
However, by exercising discipline (using single-threaded code and default
finalize methods), it is far easier to produce a Java program whose behavior
can be predicted than to produce an analogous C or C++ program with the same
property. Losing Java's predictability would be a significant loss.
Therefore, the JGNWG recommends that PEJFPS not be incorporated into Java.
Instead, JGNWG presents a counter-proposal that works within similar
constraints as PEJFPS but maintains the predictability of the language and
addresses additional numeric programming needs omitted from PEJFPS.
What is the problem on the x86?
x86 processors most naturally operate on 80-bit double extended floating point
values. A precision control word can be set to make the processor round to
single or double precision. However, even when rounding to a reduced precision,
the floating point registers still use the full 15 exponent bits of the double
extended format (instead of the 11 exponent bits for true double and 8 bits for
true float). A store to memory is required to narrow the exponent as well.
Since the register is not changed by the store, for further computation the
stored value must be loaded back from memory. This memory traffic degrades
Java's floating point performance on the x86. Moreover, this technique suffers
from a small discrepancy between operating on true double values and double
values with increased exponent range. Values that would be subnormal doubles
are not subnormal in double with extended exponent range. When such a number is
stored to true double, it can be rounded twice, leading to a difference in the
last bit, about 10^-324. Published techniques to remove this remaining minor
discrepancy can lead to an order of magnitude slowdown, so Java VMs on the x86
generally set the precision control to double precision and allow double
rounding on underflow, at variance with Java's specification [Gol].
The 10X potential performance degradation for exact floating point
conformance on the x86 is largely a theoretical concern since VMs on the x86
in practice use the store-reload technique. PEJFPS aims to eliminate the
smaller 2X to 4X penalty from store-reload. PEJFPS would remove this speed
penalty by allowing the x86's double extended registers to be used at full
precision and range. However, PEJFPS would put too few constraints on when,
where, and whether extended precision is used, leading to unpredictability.
There are two issues for exact reproducibility stemming from the x86's wider
exponent range: maintaining the proper overflow threshold and preserving the
proper gradual underflow behavior. The store-reload technique solves the
former problem but not the latter. Since additions and subtractions
resulting in subnormal values are exact, the underflow threshold is properly
maintained. Using the store-reload technique, double rounding on underflow
can only occur for multiplication and division.
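The discrepancy can be observed directly. The following small Java test (our
construction, not taken from either proposal) multiplies two exactly
representable doubles whose exact product is (1.5 - 2^-54) * 2^-1074. Rounded
once, the product is 2^-1074 (Double.MIN_VALUE); rounded first to 53
significant bits with extended exponent range it becomes 1.5 * 2^-1074, which
a subsequent store to true double rounds (to even) up to 2^-1073:
// Distinguishes single rounding from double rounding on underflow.
public class DoubleRounding {
    // exact power of two for |n| <= 1022; every step stays normal and exact
    static double pow2(int n) {
        double r = 1.0, f = (n < 0) ? 0.5 : 2.0;
        for (int i = (n < 0) ? -n : n; i > 0; i--)
            r *= f;
        return r;
    }
    public static void main(String[] args) {
        double a = 25.0 * pow2(-600);                // exact
        double b = 1080863910568919.0 * pow2(-528);  // exact; 25 * 1080863910568919 == 3*2^53 - 1
        double p = a * b;  // exact product is (3*2^53 - 1) * 2^-1128
        System.out.println(p == Double.MIN_VALUE
            ? "rounded once (strict)"
            : "double rounding on underflow");
    }
}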
Recently, a refinement of the store-reload technique that eliminates the
double rounding problem has been developed [Gol98]. To avoid double rounding
during multiplication, the new technique scales down one of the operands by
2^(Emax double extended - Emax double) where Emax is the largest exponent for
a given floating point format. After this scaling all operations that would
result in subnormals in true double also result in subnormals in double with
extended exponent range. This result is then rescaled back up by the same
quantity; normal results are unaffected and subnormals are properly rounded
once. A store of the product after being scaled enforces the overflow
threshold.
The procedure for division is analogous; the dividend can be scaled down or
the divisor can be scaled up. In both cases, the resulting quotient is
rescaled up to the proper magnitude.
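Concretely, IEEE 754 gives double extended an Emax of 16383 and double an
Emax of 1023, so the scale factor is 2^(16383-1023) = 2^15360. Scaling one
operand down by 2^15360 carries any product below the true double underflow
threshold of 2^-1022 to below 2^-16382, the double extended underflow
threshold, where it is rounded once with exactly the granularity a true
double subnormal would have; the final rescaling by 2^15360 is exact.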
This new technique has many advantages over previous approaches to making
the x86 round to true double exactly:
* The new technique is only marginally more expensive than the currently
used store-reload method. Therefore, exact emulation of true double
only entails a factor of 2 to 4 slowdown instead of a factor of 10.
* No special testing is needed to handle ±0.0, infinities, and NaN.
* Since the scalings up and down are exact, the proper IEEE sticky flags
are set.
* Also due to the exact scalings, the technique works under dynamic
rounding modes.
The JGNWG strongly encourages JVM writers for the x86 to adopt this new
technique.
What capabilities are needed?
Different numeric applications have different needs. Some, like certain
implementations of higher precision arithmetic using standard floating point
formats, depend on strict floating point semantics and could easily break if
"optimized." Other calculations, such as dot product and matrix multiply, are
relatively insensitive to aggressive optimization; meaningful answers result
even when blocking and other answer-changing optimizations are applied. The
vendor-supplied BLAS are heavily optimized for a given architecture; vendors
would not spend considerable resources creating optimized BLAS, sometimes
including hand-coded assembly, if there were not demand for such faster
programs. The needs of the typical Java programmer fall somewhere between these
extremes; there is a desire for floating point computation that is not
unnecessarily slow, but the programmer doesn't want to be surprised when his
computed results misbehave unpredictably.
Since Java is aimed at a wide audience, the JGNWG proposal changes Java's
default floating point semantics to allow somewhat better performance on the
x86 and PowerPC. However, for most programs on most inputs the change in
numerical results will not be noticed. Like PEJFPS, the JGNWG proposal adds
a "strict floating point" declaration to indicate current Java semantics
must be used. JGNWG also includes a second declaration to allow optimizers
to rearrange certain floating point operations as if they were associative.
Associativity enables many useful optimizations, including aggressive code
scheduling and blocking.
PEJFPS would mingle increased speed and increased precision; in widefp contexts
code presumably runs faster and may incidentally use increased precision and
range. Although it may have been intended only for the sake of fast register
spilling, PEJFPS would allow almost arbitrary truncation of results from
extended to base formats. In any case, the programmer is faced with an
unpredictable program, leading to the resurrection of bugs from earlier
systems, like the Sun III (see the recent comp.compilers thread "inlining +
optimization = nuisance bugs" for a contemporary example). JGNWG's proposal
does not mix considerations of speed and precision; rather, as a concession to
the x86, JGNWG allows extended exponent range in some circumstances.
Some architectures, such as the PowerPC, include a fused mac instruction
that multiplies two numbers exactly and then adds a third number, with a
single rounding error at the end. Machines with this instruction run faster
when it is used. Current Java semantics prohibit fused macs from being used.
There are three degrees of fused mac usage to support:
1. do not use fused macs at all,
2. use fused macs if they are fast (i.e. if there is hardware support),
and
3. use fused mac even if it requires (slow) software simulation.
Fused mac must be forbidden in some sensitive calculations. For example,
using fused mac recklessly can lead to inappropriately negative
discriminants in quadratic formula calculations [Kah]. Using a fused mac if
available would give more accurate and faster dot products and matrix
multiplies. Some algorithms require a fused mac to work properly. Mandating
a fused mac is necessary to simulate a fused mac capable machine on one that
isn't. Requiring fused mac to be used is accomplished by an explicit method
call to a new method Math.fmac.
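Both the hazard and the explicit interface can be sketched in Java (the exact
signature of Math.fmac is not specified here; assume fmac(x, y, z) computes
x*y + z with a single rounding):
// Sketch only: Math.fmac is the proposed method, not an existing API.
static double discriminant(double a, double b, double c) {
    // b*b enters the fused operation exactly, but 4*a*c was already
    // rounded; when b*b == 4*a*c after rounding, the result can be a
    // tiny negative number instead of zero, so Math.sqrt of it is NaN [Kah].
    return Math.fmac(b, b, -(4.0 * a * c));
}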
JAVA GRANDE COUNTER PROPOSAL
The JGNWG proposal is based upon an analysis that finds very few of the diverse
floating-point semantics allowed by PEJFPS to be both useful and necessary. On
the other hand, each of the few semantics of the JGNWG proposal is necessary
because dropping one would either
* hobble the performance of a commercially important family of computers,
or
* make common operations like multiplying large matrices unnecessarily slow.
The semantic contexts proposed by the JGNWG are as follows.
1. Java's present semantics. All floating point values are true float and
true double values.
2. The present Java semantics, except that some subexpressions that would
over/underflow in option 1 remain representable; and if underflow
is signaled then some underflowed values may be rounded twice instead
of once, differing from option 1 by around 10^-324. These semantics are
used by default on the x86 to ameliorate some of the performance
implications of exactly implementing Java's present floating point
semantics on that line of processors. (This can be accomplished by
admitting the use of 15-bit exponents in the representation of double
values on the JVM operand stack.)
3. Permission to use fused mac (multiply-accumulate) where available. This
can be used with either of the above.
4. Permission to use associativity with any of the above granted by the
code's author.
There are three kinds of floating point semantics in JGNWG: strict, default,
and "associative." The following table illustrates their effect on current
processors.
                                    Architecture
Modifier               x86                      PowerPC            SPARC
strictfp               Java 1.0 (no double      Java 1.0 (no       Java 1.0
                       rounding on underflow)   fused mac)
default (no modifier)  larger exponent          fused mac can      Java 1.0
                       range allowed            be used
associativefp          many optimizations allowed on all platforms
Strict semantics are indicated by the new strictfp class and method
modifier. Strict semantics are the current Java floating point semantics;
fused mac is not permitted in strict code (unless the compiler can prove the
same answer results). On the x86, using stores to restrict the exponent
range can readily give exactly Java-conforming results for the float format.
Using the new technique described above, exactly Java-conforming results for
the double format can also be implemented at a tolerable cost.
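For instance, an algorithm that depends on every operation rounding to true
double can be protected with the proposed modifier. In the sketch below, the
method body is a standard doubled-precision building block (our example, not
taken from either proposal):
strictfp class Strict {
    // Returns the rounding error of x + y exactly, assuming |x| >= |y|.
    // The recovery depends on each operation rounding to true double,
    // which is what the strictfp modifier guarantees.
    static double addError(double x, double y) {
        double s = x + y;
        double z = s - x;  // the part of y absorbed into s
        return y - z;      // the part of y that was rounded away
    }
}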
Like PEJFPS, JGNWG proposes to modify the default Java floating point
semantics, i.e. those used in the absence of a strictfp or associativefp
declaration. In general, default semantics allow increased exponent range of
anonymous expressions (on the x86) or use of fused mac (on the PowerPC). In
default mode on the x86, anonymous float and double values created during
expression evaluation are allowed to use the larger exponent range of double
extended. If the extended exponent range is used in code with default
semantics, it must be used consistently for all anonymous values (this
implies that anonymous values spilled to memory must be spilled as 80-bit
quantities to preserve the exponent values). All explicit stores must be
respected and only true double and true float can be stored into
programmer-declared variables.
System properties indicate whether fused mac and extended exponent range are
used by a particular VM. It is always permissible to implement the default
semantics as the "strict" Java 1.0 semantics. Therefore, on a SPARC there is
no difference between the two kinds of floating point semantics.
It is possible that a program intended for strict semantics will fail under
the non-strict JGNWG default semantics. To ease detection of such cases,
JVMs should provide a runtime flag to force default semantics to be
implemented as strict.
Finally, JGNWG also includes an associativefp declaration to allow
optimizers to rearrange floating point operations as if they were
associative if the operations would be associative in the absence of
over/underflow and roundoff. Associativity enables many useful
optimizations, including aggressive code scheduling and blocking.
Optimizations may change the numerical outcome of a computation. To give
greater freedom to the optimizer, associativefp also relaxes the precise
exception model of Java; null pointer and index out-of-bounds exceptions will
still be thrown and caught by the same handlers, but the values in variables
might not be in the same state as in an unoptimized version of the code.
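For example (a sketch, assuming associativefp may be applied to methods just
as strictfp can):
// Under associativefp an optimizer may regroup this sum, e.g. into four
// interleaved partial sums after unrolling; the low-order bits of the
// result may then differ from left-to-right evaluation.
static associativefp double sum(double a[]) {
    double s = 0.0;
    for (int i = 0; i < a.length; i++)
        s += a[i];
    return s;
}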
[NOTE: There remains heated discussion within the working group as to
the extent to which compilers should be allowed to optimize codes. There
is universal agreement that allowing the use of associativity should be
permitted as described above. More aggressive types of optimizations
are possible and many users and compiler developers feel that they should
be allowed under some circumstances.]
On machines with fused mac instructions, chained multiply and add/subtract
operations in the source code can be fused at runtime in default mode.
When fusing, a product that is the right-hand argument of an addition is
preferred over the left-hand argument; for example, the
expression
a*b + c*d
is treated as fmac(a, b, c*d) and not fmac(c, d, a*b). Such fusing
operations must occur if fused mac is in use; this prevents programmer
surprises. Otherwise, optimizers could potentially fuse unexpected
expressions or prevent an expression from being fused (e.g., common
subexpression elimination reusing a product that prevents a fused mac). The
arguments to fused mac must be evaluated in left to right order.
When fused mac is being used, the programmer can explicitly store each
intermediate result to locally implement strict semantics.
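For example (a sketch):
static double example(double a, double b, double c) {
    double p = a * b;     // explicit store: p is a true double, rounded once
    double s = p + c;     // cannot be fused with the multiply above
    return s;             // whereas  a*b + c  alone may become fmac(a, b, c)
}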
Examples
The dot product loop
static double dot(double a[], double b[])
{
    double s = 0.0;
    for (int i = 0; i < a.length; i++)
        s += a[i] * b[i];
    return s;
}
can be compiled differently under strict and default semantics. In the
examples that follow, loop tests and other integer calculations are omitted
and are assumed to overlap with the floating point calculations. Under
strict Java 1.0 semantics on the x86, one compilation of the dot product
loop is:
// x86 code for strict dot product loop
// rounding precision set to double
// assume scaling factors are in the register stack
push scaling factor
load a[i] and multiply with scaling factor
load b[i] and multiply with a[i](scaled down)
multiply to rescale product
store product to restrict exponent
reload back restricted product
add product to s
store s to restrict exponent
load back restricted s
increment i
loop
As shown above, the product (a[i] * b[i]) and the sum (s + a[i] * b[i]) both
need to be stored to memory and loaded back in. As shown below, under
default semantics, only the sum needs to be written out to memory, halving
the excess memory traffic and removing two multiplications.
// x86 code for default dot product loop
// rounding precision set to double
load a[i]
load b[i] and multiply with a[i]
// product does not need to be stored/reloaded and scaled
add product to s
store s to restrict exponent
reload s
increment i
loop
A larger number of anonymous values in an expression results in a greater
reduction in the number of excess stores. A VM might also be able to use
trap handlers to achieve faster average execution. For example, trapping on
overflow could remove a load from the loop above, yielding
// x86 code for default dot product loop
// rounding precision set to double
// VM is using an overflow trap handler to remove a load
load a[i]
load b[i] and multiply with a[i]
add product to s
store s to restrict exponent // dummy store, reload if store overflows
// reload of s elided
increment i
loop
This trap handler is used by the compiler and not visible to the
applications programmer. The functionality of this trap handler is simple;
the trap handler just has to reload the stored value.
On the x86, if an expression (and all its subexpressions) neither overflows
nor underflows for a given input, executing the default compilation of the
expression will give the same result as executing the strict
compilation. As when using fused mac, explicitly storing each intermediate
result can be used to implement strict semantics in a limited area. In
default mode, both float and double expressions can use the extended
exponent range. Method arguments and return values from methods must be
strict double or float values to prevent programmers from being ambushed by a
greater exponent range they neither intended nor anticipated.
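A sketch:
// Even in default mode the returned value must be a true double, so a
// caller can never observe the callee's extended exponent range.
static double prod(double x, double y) {
    return x * y;
}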
Cost of strictness
Using the new scaling technique, a matrix multiply with strict semantics can be
a little more than twice as slow as a matrix multiply with default semantics.
In both the loops below, an overflow trap handler is used to remove excess
loads in the common case. The sum variables are already in the stack. For
better instruction scheduling, two elements of the matrix product are
calculated simultaneously:
// x86 code for fast matrix multiply using default semantics
// rounding precision set to double
// VM is using an overflow trap handler
// the loop has approximately an 8 or 9 cycle latency on a Pentium
load b[i]
dup b[i]
load a_k[i] and multiply with b[i]
swap top two stack elements
load a_(k+1)[i] and multiply with b[i]
swap top two stack elements
add with pop a_k[i] * b[i] to sum_k
add with pop a_(k+1)[i] * b[i] to sum_(k+1)
loop
The strict code is similar:
// x86 code for fast matrix multiply using strict semantics
// rounding precision set to double
// VM is using an overflow trap handler
// the loop has approximately a 19 cycle latency on a Pentium
// assume scaling constants are already in the register stack
put scaling constant on the top of the stack
load b[i] and multiply with scaling factor
dup b[i](scaled down)
load a_k[i] and multiply with b[i](scaled down)
swap top two stack elements
load a_(k+1)[i] and multiply with b[i](scaled down)
swap top two stack elements
rescale a_k[i] * b[i](scaled down)
swap
rescale a_(k+1)[i] * b[i](scaled down)
dummy store of a_(k+1)[i] * b[i]
swap
dummy store of a_k[i] * b[i]
add with pop a_k[i] * b[i] to sum_k
add with pop a_(k+1)[i] * b[i] to sum_(k+1)
store sum_k
swap
store sum_(k+1)
swap
loop
Scoping
Whatever new floating point semantics are expressible in Java need to be
expressible in the JVM too. PEJFPS uses spare bits in a method descriptor to
indicate which kind of floating point semantics a method has; JGNWG can use
the same approach. This provides method-level control of floating point
semantics. This would be the coarsest level acceptable to the JGNWG.
It may also be convenient to have a finer granularity block-level control.
While this is not critical to the JGNWG proposal, it should be considered.
Such a declaration is easily added to Java, but it is not immediately apparent
how to encode such information in the JVM. Java compilers can include extra
attributes in a class file ([JVM] § 4.7.1). These attributes can be used to
support things such as improved debugging facilities. JVMs are required to
ignore unknown attributes. Therefore, JGNWG could represent the different
floating point semantics of different portions of code using a table emitted
as an extra class file attribute. Strict semantics is always a permissible
policy under the JGNWG proposal, so in this respect a JGNWG-style class
file would be backwards compatible with existing VMs.
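For reference, such a table would be carried in the generic attribute layout
of [JVM] § 4.7.1:
attribute_info {
    u2 attribute_name_index;
    u4 attribute_length;
    u1 info[attribute_length];
}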
Discussion
The JGNWG proposal allows improved hardware utilization over standard Java
while preserving program predictability. Programmers can test system properties
to determine the VM's behavior.
A reasonable question is: if Java Grande is opposed to PEJFPS due to its
unpredictability, why does its own proposal allow some unpredictability by
default? JGNWG permits much less unpredictability than
PEJFPS and JGNWG has fewer differences between strict and default semantics.
For example, in JGNWG a floating point feature must be used consistently;
fused mac or extended exponent range cannot be used on one call to a method
and not used on the next (something allowed by PEJFPS). On the x86, between
allowing extended exponent range and allowing extra precision, allowing
extended exponent range results in many fewer visible differences between
strict code and default code.
The differences arising from extended exponent range on the x86 are visible
only if the calculation would over/underflow on a machine like the SPARC.
Over/underflow is comparatively rare in practice; therefore the Java Grande
differences would be observed at most rarely. PEJFPS allows extended
precision for intermediate results. Extended precision would almost always
be visible. For example,
// implicit widefp under PEJFPS
// default semantics under Java Grande
static void foo() {
double one = 1.0;
double three = 3.0;
double a;
double b[] = new double[1];
a = one/three;
b[0] = a;
// Results under different proposals
//                            JGNWG          PEJFPS
if(a == b[0]){...}         // always true    true or false?
if(a == (one/three)){...}  // always true    true or false?
}
If one/three is calculated to extended precision and a is treated as an
extended precision value, then a == b[0] will be false under PEJFPS since
arrays are always stored in the base format (32-bit float or 64-bit double).
If a is stored as double precision, a == (one/three) will be false if
(one/three) is calculated to double extended precision. The Java Grande
proposal would always return true for these cases. In short, on the x86 the
cases where the JGNWG proposal allows differences between default and strict
semantics are where overflow or underflow would occur; the cases where
PEJFPS allows differences between default and strict semantics are
(approximately) where an operation is inexact, as most are.
Additional Floating-point Types
The JGNWG proposal thus far does not provide any access to the double extended
format found on the x86. Consistent access to this format is important to
allow good hardware utilization and to ease writing numerical programs; having
access to several more bits of precision than the input data often allows
simpler (and sometimes faster) algorithms to be used. To access double
extended, JGNWG proposes that Java include a third primitive floating point
type called "indigenous." The indigenous floating point type corresponds to the
widest IEEE 754 floating point format that directly executes on the underlying
processor. On the x86, indigenous corresponds to the double extended format; on
most other processors, indigenous corresponds to the double format. (The
indigenous type must be at least as wide as double.) For a particular VM, class
variables indicate whether indigenous is implemented as double or double
extended.
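In source code this might look as follows (a sketch; the surface syntax is
illustrative, not part of this proposal):
// Accumulate in the widest fast format: double extended on the x86,
// double on most other processors.
static double dot(double a[], double b[]) {
    indigenous s = 0.0;
    for (int i = 0; i < a.length; i++)
        s += a[i] * b[i];
    return (double) s;  // one final rounding to double
}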
The float and double floating point types have hardware support. Adding a
double extended type would require costly simulation on most architectures
other than the x86. Having an indigenous type preserves the performance
predictability of a program and keeps a close correspondence between
floating point types in a language and floating point formats on a
processor.
Implementing indigenous at the JVM level can be done either by adding new
JVM instructions or (as an interim measure) using the operator overloading
and lightweight classes described in the full JGNWG report.
Additional Floating-point Expression Evaluation Policies
To better utilize certain processors and to lessen the impact of rounding
errors, it is useful for the programmer to conveniently be able to evaluate a
floating point expression in a wider format. For example, a programmer may want
to evaluate an expression consisting of float variables in double precision.
Pre-ANSI C used this floating point expression evaluation policy exclusively.
JGNWG adds a new declaration, anonymous FloatingPointType, to control the
expression evaluation policy:
* anonymous double gives the original C expression evaluation; all float
values are promoted to double.
* anonymous indigenous promotes float and double values to indigenous.
Using anonymous indigenous makes best use of the x86's double extended
registers.
* anonymous float specifies to use the existing Java expression
evaluation policy.
No JVM changes are required to support anonymous double.
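For example (a sketch; the placement of the modifier is illustrative):
// With anonymous double, the float products below are evaluated in
// double, as in pre-ANSI C; only declared variables keep their formats.
anonymous double static float dot(float a[], float b[]) {
    double s = 0.0;           // declared double: accumulation stays double
    for (int i = 0; i < a.length; i++)
        s += a[i] * b[i];     // float * float evaluated in double
    return (float) s;         // a single rounding to float at the end
}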
PARTICIPANTS IN THE NUMERICS WORKING GROUP
The following individuals contributed to the development of this document at
the Java Grande Forum meetings on May 9-10 and August 6-7 in Palo Alto,
California.
Ronald Boisvert, NIST, Co-chair
John Brophy, Visual Numerics
Sid Chatterjee, University of North Carolina
Dmitri Chiriaev, Sun
Jerome Coonen
Joe Darcy, University of California, Berkeley
David S. Dixon, Mantos Consulting
Geoffrey Fox, Syracuse University
Steve Hague, NAG
Siamak Hassanzadeh, Sun
Lennart Johnsson, University of Houston
William Kahan, University of California, Berkeley
Cleve Moler, The MathWorks
Jose Moreira, IBM
Roldan Pozo, NIST, Co-chair
Kees van Reeuwijk, Technical University, Delft
Keith Seymour, University of Tennessee
Nik Shaylor, Sun
Marc Snir, IBM
Henry Sowizral, Sun
George Thiruvathukal, Loyola University, Chicago
Anne Trefethen, NAG
The following additional individuals also contributed comments which helped
in the development of this document.
Susan Flynn-Hummel, IBM
Sam Midkiff, IBM
REFERENCES
[Dar]
Joseph D. Darcy, Borneo 1.0: Adding IEEE 754 floating point support to
Java, M.S. thesis, University of California, Berkeley, May 1998,
http://www.cs.berkeley.edu/~darcy/Borneo
[Gol]
Roger A. Golliver, First-implementation artifacts in Java(TM)
[Gol98]
Roger A. Golliver, Personal communication, August 1998.
[JLS]
James Gosling, Bill Joy, and Guy Steele, The Java(TM) Language
Specification, Addison-Wesley, 1996. See
http://java.sun.com/docs/books/jls/.
[JVM]
Tim Lindholm and Frank Yellin, The Java(TM) Virtual Machine
Specification, Addison-Wesley, 1997. See
http://java.sun.com/docs/books/vmspec/.
[Kah]
William Kahan, Lecture Notes on the Status of IEEE Standard 754 for
Binary Floating-Point Arithmetic,
http://www.cs.berkeley.edu/~wkahan/ieee754status/ieee754.ps
[PEJ]
Sun Microsystems, Proposal for Extension of Java Floating Point in JDK
1.2, http://java.sun.com/feedback/fp.html