September 15, 1998

TO: Sun Microsystems
FROM: Numerics Working Group (http://math.nist.gov/javanumerics/)
      Java Grande Forum (http://www.javagrande.org/)
RE: Response to "Proposal for Extension of Java Floating Point Semantics, Revision 1, May 1998"

PREFACE

The high efficiency necessary for large-scale numerical applications requires aggressive exploitation of the unique facilities of floating-point hardware. At the other extreme, some computations require very high predictability, even at the expense of performance. The majority of users lie somewhere in between: they want reasonably fast floating-point performance, but they do not want to be surprised when computed results misbehave unpredictably. Each of these constituencies must be addressed if Java is to be widely adopted for numerical computing.

The Java Grande Forum Numerics Working Group (JGNWG) is evaluating the Java language and its environment for such applications. It has set a number of constraints to work under:

* Relatively small, but visible, changes to Java and the JVM can be made. Upwards compatibility should be maintained, as well as a worthwhile amount of backwards compatibility.
* Proposed changes should allow better execution speed on common floating-point hardware.
* While improved performance is important, predictability must be maintained.

This response is excerpted from a larger report on Java being prepared by the JGNWG, which will be released later this month. Those who have contributed to the drafting of this report are listed at the end of this memo.

INTRODUCTION

Recently, Sun released for public comment a Proposal for Extension of Java Floating Point Semantics, Revision 1 [PEJ] (abbreviated in this document as PEJFPS). PEJFPS is primarily targeted at improving Java's floating point performance on the x86 line of processors. (No explicit license is granted (yet) to use the fused mac (multiply-accumulate) instruction, which would benefit users of the PowerPC, among other architectures.)

Assiduously implementing Java's current strict floating point semantics on the x86 using previously published techniques is very expensive, potentially more than an order of magnitude slower than slightly different semantics [Gol]. A less expensive technique developed recently [Gol98] is discussed later in this memo. PEJFPS grants partial access to the double extended floating point format found on the x86 in order to overcome the speed problem. However, the reckless license to use or not to use double extended granted by PEJFPS destroys Java's predictability (see recent submissions to the numeric-interest mailing list, http://www.validgh.com/java/).

Java has been billed as providing "write once, run anywhere" program development. For both theoretical and practical reasons, Java programs are not nearly as portable or reproducible as programmers might naively expect. However, by exercising discipline (for example, using single-threaded code and default finalize methods), it is far easier to produce a Java program whose behavior can be predicted than to produce an analogous C or C++ program with the same property. Losing Java's predictability would be a significant loss. Therefore, the JGNWG recommends that PEJFPS not be incorporated into Java. Instead, the JGNWG presents a counter-proposal that works within constraints similar to those of PEJFPS but maintains the predictability of the language and addresses additional numeric programming needs omitted from PEJFPS.

What is the problem on the x86?
x86 processors most naturally operate on 80-bit double extended floating point values. A precision control word can be set to make the processor round to single or double precision. However, even when rounding to a reduced precision, the floating point registers still use the full 15 exponent bits of the double extended format (instead of the 11 exponent bits of true double and the 8 bits of true float). A store to memory is required to narrow the exponent as well. Since the register is not changed by the store, the stored value must be loaded back from memory before further computation. This memory traffic degrades Java's floating point performance on the x86.

Moreover, this technique suffers from a small discrepancy between operating on true double values and operating on double values with increased exponent range. Values that would be subnormal doubles are not subnormal in double with extended exponent range. When such a number is stored to true double, it can be rounded twice, leading to a difference in the last bit, about 10^-324. Published techniques to remove this remaining minor discrepancy can lead to an order of magnitude slowdown, so Java VMs on the x86 generally set the precision control to double precision and allow double rounding on underflow, at variance with Java's specification [Gol]. The 10X potential performance degradation for exact floating point conformance on the x86 is therefore largely a theoretical concern, since VMs on the x86 in practice use the store-reload technique. PEJFPS aims to eliminate the smaller 2X to 4X penalty from store-reload.

PEJFPS would remove this speed penalty by allowing the x86's double extended registers to be used at full precision and range. However, PEJFPS would put too few constraints on when, where, and whether extended precision is used, leading to unpredictability.

There are two issues for exact reproducibility stemming from the x86's wider exponent range: maintaining the proper overflow threshold and preserving the proper gradual underflow behavior. The store-reload technique solves the former problem but not the latter. Since additions and subtractions resulting in subnormal values are exact, the underflow threshold is properly maintained; using the store-reload technique, double rounding on underflow can only occur for multiplication and division.

Recently, a refinement of the store-reload technique that eliminates the double rounding problem has been developed [Gol98]. To avoid double rounding during multiplication, the new technique scales down one of the operands by 2^(Emax of double extended - Emax of double), where Emax is the largest exponent for a given floating point format. After this scaling, all operations that would produce subnormal results in true double also produce subnormal results in double with extended exponent range. The result is then rescaled back up by the same quantity; normal results are unaffected and subnormal results are properly rounded once. A store of the product after it is rescaled enforces the overflow threshold. The procedure for division is analogous: the dividend can be scaled down or the divisor scaled up, and in both cases the resulting quotient is rescaled up to the proper magnitude.
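The discrepancy that this scaling removes can be made concrete. In the sketch below (the particular operands, written as hexadecimal floating point literals in present-day Java, are our own illustrative choice, not taken from the memo), two ordinary normal doubles have a deeply subnormal product; rounding that product once and rounding it twice give answers that differ by one unit in the last place, about 4.9E-324:

public class DoubleRounding {
    public static void main(String[] args) {
        double a = 0x1.07ffffffffffcp-48;    // 2^-48 * (1 + 2^-5 - 2^-50)
        double b = 0x1.0000000000004p-1022;  // 2^-1022 * (1 + 2^-50)

        // Exact product: 2^-1070 * (1 + 2^-5 + 2^-55 - 2^-100), i.e.
        // (16.5 + 2^-51 - 2^-96) * 2^-1074, a deeply subnormal magnitude.
        double p = a * b;

        // Rounding the exact product once, as Java's strict semantics
        // require, gives the nearest subnormal: 17 * 2^-1074.
        System.out.println(p == 17 * Double.MIN_VALUE);  // true under strict semantics

        // On an x86 with precision control set to double but the 15-bit
        // register exponent still in effect, the product is first rounded
        // to 53 significant bits (16.5 * 2^-1074, representable in the
        // register) and then rounded a second time on the store to memory,
        // where the tie goes to even: 16 * 2^-1074.  The two results differ
        // by 2^-1074, about 4.9E-324.  The scaling described above forces
        // the first rounding onto the subnormal grid, so the subsequent
        // store no longer changes the value.
    }
}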
The new scaling technique has many advantages over previous approaches to making the x86 round to true double exactly:

* The new technique is only marginally more expensive than the currently used store-reload method. Therefore, exact emulation of true double entails only a factor of 2 to 4 slowdown instead of a factor of 10.
* No special testing is needed to handle ±0.0, infinities, and NaN.
* Since the scalings up and down are exact, the proper IEEE sticky flags are set.
* Also due to the exact scalings, the technique works under dynamic rounding modes.

The JGNWG strongly encourages JVM writers for the x86 to adopt this new technique.

What capabilities are needed?

Different numeric applications have different needs. Some, like certain implementations of higher precision arithmetic using standard floating point formats, depend on strict floating point semantics and could easily break if "optimized." Other calculations, such as dot product and matrix multiply, are relatively insensitive to aggressive optimization; meaningful answers result even when blocking and other answer-changing optimizations are applied. The vendor-supplied BLAS are heavily optimized for a given architecture; vendors would not spend considerable resources creating optimized BLAS, sometimes including hand-coded assembly, if there were no demand for such faster programs. The needs of the typical Java programmer fall somewhere between these extremes: floating point computation should not be unnecessarily slow, but the programmer does not want to be surprised when computed results misbehave unpredictably.

Since Java is aimed at a wide audience, the JGNWG proposal changes Java's default floating point semantics to allow somewhat better performance on the x86 and PowerPC. However, for most programs on most inputs the change in numerical results will not be noticed. Like PEJFPS, the JGNWG proposal adds a "strict floating point" declaration to indicate that current Java semantics must be used. JGNWG also includes a second declaration to allow optimizers to rearrange certain floating point operations as if they were associative. Associativity enables many useful optimizations, including aggressive code scheduling and blocking.

PEJFPS would mingle increased speed and increased precision: in widefp contexts code presumably runs faster and may incidentally use increased precision and range. Although it may have been intended only for the sake of fast register spilling, PEJFPS would allow almost arbitrary truncation of results from extended to base formats. In any case, the programmer is faced with an unpredictable program, leading to the resurrection of bugs from earlier systems, like the Sun III (see the recent comp.compilers thread "inlining + optimization = nuisance bugs" for a contemporary example). JGNWG's proposal does not mix considerations of speed and precision; rather, as a concession to the x86, JGNWG allows extended exponent range in some circumstances.

Some architectures, such as the PowerPC, include a fused mac instruction that multiplies two numbers exactly and then adds a third number, with a single rounding error at the end. Machines with this instruction run faster when it is used. Current Java semantics prohibit fused macs from being used. There are three degrees of fused mac usage to support:

1. do not use fused macs at all,
2. use fused macs if they are fast (i.e. if there is hardware support), and
3. use fused mac even if it requires (slow) software simulation.

Fused mac must be forbidden in some sensitive calculations. For example, using fused mac recklessly can lead to inappropriately negative discriminants in quadratic formula calculations [Kah]; the sketch below makes this concrete. Using a fused mac where available would give more accurate and faster dot products and matrix multiplies, and some algorithms require a fused mac to work properly.
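A minimal sketch of the discriminant hazard, written with the Math.fma method of present-day Java (Java 9 and later); Math.fmac, introduced just below, is the spelling proposed in this memo, and the particular value of x is our own illustrative choice. Mathematically b^2 - 4ac is exactly zero here, yet the fused version comes out negative, so a subsequent square root returns NaN. Explicitly storing each product, the escape recommended for sensitive code, restores the unfused result:

public class FusedDiscriminant {
    public static void main(String[] args) {
        // x is chosen so that x*x rounds up: x = 1 + 2^-26 + 2^-28.
        double x = 1.0 + 0x1.0p-26 + 0x1.0p-28;
        double a = x, b = 2.0 * x, c = x;       // b*b equals 4*a*c exactly as real numbers

        // Unfused evaluation: both products round to the same double.
        double plain = b * b - 4.0 * a * c;
        System.out.println(plain);              // 0.0

        // Fused evaluation: b*b is exact inside the fma, while 4*a*c was
        // already rounded (upward), so the difference comes out negative.
        double fused = Math.fma(b, b, -(4.0 * a * c));
        System.out.println(fused);              // about -3.9E-16
        System.out.println(Math.sqrt(fused));   // NaN: an "impossible" negative discriminant

        // Storing each intermediate in a declared variable forces it to
        // true double and prevents any fusing.
        double bb = b * b;
        double ac4 = 4.0 * a * c;
        System.out.println(bb - ac4);           // 0.0 again
    }
}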
Mandating fused mac (the third degree above) is necessary to simulate a fused mac capable machine on one that isn't. Requiring a fused mac to be used is accomplished by an explicit call to a new method, Math.fmac.

JAVA GRANDE COUNTER PROPOSAL

The JGNWG proposal is based upon an analysis that finds very few of the diverse floating-point semantics allowed by PEJFPS to be both useful and necessary. On the other hand, each of the few semantics of the JGNWG proposal is necessary, because dropping any one of them would either

* hobble the performance of a commercially important family of computers, or
* make common operations like multiplying large matrices unnecessarily slow.

The semantic contexts proposed by the JGNWG are as follows.

1. Java's present semantics. All floating point values are true float and true double values.

2. The present Java semantics, except that some subexpressions that would overflow or underflow under option 1 remain representable; and if underflow is signaled, then some underflowed values may be rounded twice instead of once, differing from option 1 by around 10^-324. These semantics are used by default on the x86 to ameliorate some of the performance implications of exactly implementing Java's present floating point semantics on that line of processors. (This can be accomplished by admitting the use of 15-bit exponents in the representation of double values on the JVM operand stack.)

3. Permission to use fused mac (multiply-accumulate) where available. This can be combined with either of the above.

4. Permission to use associativity with any of the above, granted by the code's author.

There are three kinds of floating point semantics in JGNWG: strict, default, and "associative." The following table illustrates their effect on current processors.

                         Architecture
 Modifier                x86                      PowerPC            SPARC

 strictfp                Java 1.0                 Java 1.0           Java 1.0
                         (no double rounding      (no fused mac)
                         on underflow)

 default (no modifier)   larger exponent          fused mac can      Java 1.0
                         range allowed            be used

 associativefp           many optimizations allowed on all platforms

Strict semantics are indicated by the new strictfp class and method modifier. Strict semantics are the current Java floating point semantics; fused mac is not permitted in strict code (unless the compiler can prove the same answer results). On the x86, using stores to restrict the exponent range can readily give exactly Java-conforming results for the float format. Using the new scaling technique described above, exactly Java-conforming results for the double format can also be implemented at a tolerable cost.

Like PEJFPS, JGNWG proposes to modify the default Java floating point semantics, i.e. those used in the absence of a strictfp or associativefp declaration. In general, default semantics allow increased exponent range for anonymous expressions (on the x86) or use of fused mac (on the PowerPC). In default mode on the x86, anonymous float and double values created during expression evaluation are allowed to use the larger exponent range of double extended. If the extended exponent range is used in code with default semantics, it must be used consistently for all anonymous values (this implies that anonymous values spilled to memory must be spilled as 80-bit quantities to preserve the exponent values). All explicit stores must be respected, and only true double and true float values can be stored into programmer-declared variables. System properties indicate whether fused mac and extended exponent range are used by a particular VM; the sketch below shows how a program might combine these declarations and queries.
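A short sketch, in present-day Java, of how a library might combine the proposed declarations and VM queries. The strictfp modifier is the one proposed above (and later adopted by the Java language); the system property names are hypothetical, since this memo does not specify them:

public class Norms {

    // strictfp: this method always gets Java 1.0 floating point semantics,
    // even on an x86 or PowerPC VM whose default semantics differ.
    public static strictfp double strictSumOfSquares(double[] v) {
        double s = 0.0;
        for (int i = 0; i < v.length; i++) {
            s += v[i] * v[i];
        }
        return s;
    }

    // Default semantics: on an x86 the anonymous products may carry the
    // extended exponent range; on a PowerPC they may be fused with the add.
    public static double sumOfSquares(double[] v) {
        double s = 0.0;
        for (int i = 0; i < v.length; i++) {
            s += v[i] * v[i];
        }
        return s;
    }

    public static void main(String[] args) {
        // Hypothetical property names; a VM adopting the proposal would
        // document the actual ones.
        System.out.println("fused mac in use:        "
                + System.getProperty("java.vm.fp.fused.mac", "unknown"));
        System.out.println("extended exponent range: "
                + System.getProperty("java.vm.fp.extended.exponent", "unknown"));
    }
}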
It is always permissible to implement the default semantics as the "strict" Java 1.0 semantics. Therefore, on a SPARC there is no difference between the two kinds of floating point semantics. It is possible that a program intended for strict semantics will fail under the non-strict JGNWG default semantics. To ease detection of such cases, JVMs should provide a runtime flag to force default semantics to be implemented as strict.

Finally, JGNWG also includes an associativefp declaration to allow optimizers to rearrange floating point operations as if they were associative, provided the operations would be associative in the absence of over/underflow and roundoff. Associativity enables many useful optimizations, including aggressive code scheduling and blocking, but such optimizations may change the numerical outcome of a computation. To give greater freedom to the optimizer, associativefp also relaxes the precise exception model of Java: null pointer and index out of bounds exceptions will still be thrown and caught by the same handlers, but the values in variables might not be in the same state as in an unoptimized version of the code.

[NOTE: There remains heated discussion within the working group as to the extent to which compilers should be allowed to optimize codes. There is universal agreement that allowing the use of associativity should be permitted as described above. More aggressive types of optimization are possible, and many users and compiler developers feel that they should be allowed under some circumstances.]

On machines with fused mac instructions, chained multiply and add/subtract operations in the source code can be fused at runtime in default mode. When fusing, a product that is the right-hand argument of an addition is preferred over one that is the left-hand argument; for example, the expression a*b + c*d is treated as fmac(a, b, c*d) and not fmac(c, d, a*b). Such fusings must occur whenever fused mac is in use; this prevents programmer surprises. Otherwise, optimizers could fuse unexpected expressions or prevent an expression from being fused (e.g., common subexpression elimination reusing a product could prevent a fused mac). The arguments to fused mac must be evaluated in left to right order. When fused mac is being used, the programmer can explicitly store each intermediate result to locally implement strict semantics.

Examples

The dot product loop

static double dot(double a[], double b[]) {
    double s = 0.0;
    for (int i = 0; i < a.length; i++)
        s += a[i] * b[i];
    return s;
}

can be compiled differently under strict and default semantics. In the examples that follow, loop tests and other integer calculations are omitted and are assumed to overlap with the floating point calculations. Under strict Java 1.0 semantics on the x86, one compilation of the dot product loop is:

// x86 code for strict dot product loop
// rounding precision set to double
// assume scaling factors are in the register stack
push scaling factor
load a[i] and multiply with scaling factor
load b[i] and multiply with a[i] (scaled down)
multiply to rescale product
store product to restrict exponent
reload back restricted product
add product to s
store s to restrict exponent
load back restricted s
increment i
loop

As shown above, the product (a[i] * b[i]) and the sum (s + a[i] * b[i]) both need to be stored to memory and loaded back in. As shown below, under default semantics, only the sum needs to be written out to memory, halving the excess memory traffic and removing two multiplications.
// x86 code for default dot product loop
// rounding precision set to double
load a[i]
load b[i] and multiply with a[i]
// product does not need to be stored/reloaded and scaled
add product to s
store s to restrict exponent
reload s
increment i
loop

A larger number of anonymous values in an expression results in a greater reduction in the number of excess stores. A VM might also be able to use trap handlers to achieve faster average execution. For example, trapping on overflow could remove a load from the loop above, yielding:

// x86 code for default dot product loop
// rounding precision set to double
// VM is using an overflow trap handler to remove a load
load a[i]
load b[i] and multiply with a[i]
add product to s
store s to restrict exponent   // dummy store; reload only if the store overflows
// reload of s elided
increment i
loop

This trap handler is used by the compiler and is not visible to the applications programmer. Its functionality is simple: the trap handler just has to reload the stored value.

On the x86, if an expression (and all its subexpressions) neither overflows nor underflows for a given input, executing the default compilation of the expression gives the same result as executing the strict compilation. As when using fused mac, explicitly storing each intermediate result can be used to implement strict semantics in a limited area. In default mode, both float and double expressions can use the extended exponent range. Method arguments and return values must be true double or float values, to prevent programmers from being ambushed by a greater exponent range they neither intended nor anticipated.

Cost of strictness

Using the new scaling technique, a matrix multiply with strict semantics can be a little more than twice as slow as a matrix multiply with default semantics. In both loops below, an overflow trap handler is used to remove excess loads in the common case, and the sum variables are already on the register stack. For better instruction scheduling, two elements of the matrix product are calculated simultaneously:

// x86 code for fast matrix multiply using default semantics
// rounding precision set to double
// VM is using an overflow trap handler
// the loop has approximately an 8 or 9 cycle latency on a Pentium
load b[i]
dup b[i]
load a_k[i] and multiply with b[i]
swap top two stack elements
load a_(k+1)[i] and multiply with b[i]
swap top two stack elements
add with pop a_k[i] * b[i] to sum_k
add with pop a_(k+1)[i] * b[i] to sum_(k+1)
loop

The strict code is similar:

// x86 code for fast matrix multiply using strict semantics
// rounding precision set to double
// VM is using an overflow trap handler
// the loop has approximately a 19 cycle latency on a Pentium
// assume scaling constants are already in the register stack
put scaling constant on the top of the stack
load b[i] and multiply with scaling factor
dup b[i] (scaled down)
load a_k[i] and multiply with b[i] (scaled down)
swap top two stack elements
load a_(k+1)[i] and multiply with b[i] (scaled down)
swap top two stack elements
rescale a_k[i] * b[i] (scaled down)
swap
rescale a_(k+1)[i] * b[i] (scaled down)
dummy store of a_(k+1)[i] * b[i]
swap
dummy store of a_k[i] * b[i]
add with pop a_k[i] * b[i] to sum_k
add with pop a_(k+1)[i] * b[i] to sum_(k+1)
store sum_k
swap
store sum_(k+1)
swap
loop

Scoping

Whatever new floating point semantics are expressible in Java need to be expressible in the JVM as well.
PEJFPS uses spare bits in a method descriptor to indicate which kind of floating point semantics a method has; JGNWG can use the same approach. This provides method-level control of floating point semantics, which is the coarsest level acceptable to the JGNWG.

It may also be convenient to have finer-grained, block-level control. While this is not critical to the JGNWG proposal, it should be considered. Such a declaration is easily added to Java, but it is not immediately apparent how to encode such information in the JVM. Java compilers can include extra attributes in a class file ([JVM] § 4.7.1). These attributes can be used to support things such as improved debugging facilities, and JVMs are required to ignore unknown attributes. Therefore, JGNWG could represent the different floating point semantics of different portions of code using a table emitted as an extra class file attribute. Strict semantics are always a permissible policy under the JGNWG proposal, so in this respect a JGNWG-style class file would be backwards compatible with existing VMs.

Discussion

The JGNWG proposal allows improved hardware utilization over standard Java while preserving program predictability. Programmers can test system properties to determine the VM's behavior. A reasonable question is this: if Java Grande is opposed to PEJFPS because of its unpredictability, why does Java Grande's proposal also allow some unpredictability by default? JGNWG permits much less unpredictability than PEJFPS, and JGNWG has fewer differences between strict and default semantics. For example, in JGNWG a floating point feature must be used consistently; fused mac or extended exponent range cannot be used on one call to a method and not used on the next (something allowed by PEJFPS).

On the x86, between allowing extended exponent range and allowing extra precision, allowing extended exponent range results in many fewer visible differences between strict code and default code. The differences arising from extended exponent range on the x86 are visible only if the calculation would over/underflow on a machine like the SPARC. Over/underflow is comparatively rare in practice; therefore the Java Grande differences would be observed at most rarely. PEJFPS allows extended precision for intermediate results, and extended precision would almost always be visible. For example:

// implicit widefp under PEJFPS
// default semantics under Java Grande
static void foo() {
    double one = 1.0;
    double three = 3.0;
    double a;
    double b[] = new double[1];

    a = one/three;
    b[0] = a;
                                  // Results under different proposals
                                  //    JGNWG          PEJFPS
    if (a == b[0]) {...}          //    always true    true or false?
    if (a == (one/three)) {...}   //    always true    true or false?
}

If one/three is calculated to extended precision and a is treated as an extended precision value, then a == b[0] will be false under PEJFPS, since arrays are always stored in the base format (32-bit float or 64-bit double). If a is stored as double precision, a == (one/three) will be false if (one/three) is calculated to double extended precision. The Java Grande proposal would always return true for these cases.

In short, on the x86 the cases where the JGNWG proposal allows differences between default and strict semantics are those where overflow or underflow would occur; the cases where PEJFPS allows differences between default and strict semantics are (approximately) those where an operation is inexact, as most are.
Additional Floating-point Types

The JGNWG proposal thus far does not provide any access to the double extended format found on the x86. Consistent access to this format is important to allow good hardware utilization and to ease the writing of numerical programs; having access to several more bits of precision than the input data often allows simpler (and sometimes faster) algorithms to be used. To access double extended, JGNWG proposes that Java include a third primitive floating point type called "indigenous." The indigenous floating point type corresponds to the widest IEEE 754 floating point format that executes directly on the underlying processor. On the x86, indigenous corresponds to the double extended format; on most other processors, indigenous corresponds to the double format. (The indigenous type must be at least as wide as double.) For a particular VM, class variables indicate whether indigenous is implemented as double or double extended.

The float and double floating point types have hardware support. Adding a double extended type would require costly simulation on most architectures other than the x86. Having an indigenous type preserves the performance predictability of a program and keeps a close correspondence between floating point types in the language and floating point formats on a processor. Implementing indigenous at the JVM level can be done either by adding new JVM instructions or (as an interim measure) by using the operator overloading and lightweight classes described in the full JGNWG report.

Additional Floating-point Expression Evaluation Policies

To better utilize certain processors and to lessen the impact of rounding errors, it is useful for the programmer to be able to conveniently evaluate a floating point expression in a wider format. For example, a programmer may want to evaluate an expression consisting of float variables in double precision. Pre-ANSI C used this floating point expression evaluation policy exclusively. JGNWG adds a new declaration, anonymous FloatingPointType, to control the expression evaluation policy:

* anonymous double gives the original C expression evaluation policy; all float values are promoted to double.
* anonymous indigenous promotes float and double values to indigenous. Using anonymous indigenous makes best use of the x86's double extended registers.
* anonymous float specifies the existing Java expression evaluation policy.

No JVM changes are required to support anonymous double; a sketch of its effect follows.
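A minimal sketch of what the anonymous double policy would do for a float expression. The declaration itself has no spelling in current Java, so the widening that anonymous double would make implicit is written out here with explicit casts (the class and method names are ours):

public class AnonymousDoubleSketch {
    // Manual equivalent of "anonymous double" applied to a float dot product:
    // every float operand is promoted to double, the whole expression is
    // evaluated in double, and a single narrowing occurs only at the explicit
    // store back to float.
    public static float dot(float[] a, float[] b) {
        double s = 0.0;                          // accumulate in double
        for (int i = 0; i < a.length; i++) {
            s += (double) a[i] * (double) b[i];  // each product formed in double
        }
        return (float) s;                        // one rounding back to float
    }
}

Under anonymous indigenous the same expression would instead be carried in the x86's double extended registers; like the indigenous type itself, that policy cannot be expressed in current Java without the proposed extensions.

PARTICIPANTS IN THE NUMERICS WORKING GROUP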
The following individuals contributed to the development of this document at the Java Grande Forum meetings on May 9-10 and August 6-7 in Palo Alto, California.

Ronald Boisvert, NIST, Co-chair
John Brophy, Visual Numerics
Sid Chatterjee, University of North Carolina
Dmitri Chiriaev, Sun
Jerome Coonen
Joe Darcy, University of California, Berkeley
David S. Dixon, Mantos Consulting
Geoffrey Fox, University of Syracuse
Steve Hague, NAG
Siamak Hassanzadeh, Sun
Lennart Johnsson, University of Houston
William Kahan, University of California, Berkeley
Cleve Moler, The MathWorks
Jose Moreira, IBM
Roldan Pozo, NIST, Co-chair
Kees van Reevwijk, Technical University, Delft
Keith Seymour, University of Tennessee
Nik Shaylor, Sun
Marc Snir, IBM
Henry Sowizral, Sun
George Thiruvathukal, Loyola University, Chicago
Anne Trefethen, NAG

The following additional individuals also contributed comments which helped in the development of this document.

Susan Flynn-Hummel, IBM
Sam Midkiff, IBM

REFERENCES

[Dar] Joseph D. Darcy, Borneo 1.0: Adding IEEE 754 floating point support to Java, M.S. thesis, University of California, Berkeley, May 1998, http://www.cs.berkeley.edu/~darcy/Borneo

[Gol] Roger A. Golliver, First-implementation artifacts in Java™.

[Gol98] Roger A. Golliver, personal communication, August 1998.

[JLS] James Gosling, Bill Joy, and Guy Steele, The Java™ Language Specification, Addison-Wesley, 1996. See http://java.sun.com/docs/books/jls/.

[JVM] Tim Lindholm and Frank Yellin, The Java™ Virtual Machine Specification, Addison-Wesley, 1997. See http://java.sun.com/docs/books/vmspec/.

[Kah] William Kahan, Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic, http://www.cs.berkeley.edu/~wkahan/ieee754status/ieee754.ps

[PEJ] Sun Microsystems, Proposal for Extension of Java Floating Point in JDK 1.2, http://java.sun.com/feedback/fp.html