The NIST Sparse BLAS (v. 0.9)

Performance Studies


Preliminary Performance Results:

Performance measurements are given for simple matrix-vector and matrix-matrix multiplies for several sparsity patterns. The measurements reflect compiler optimization (-O3 and loop unrolling) only; no hand tuning was applied. Typical performance for problems with large blocksize and multiple right-hand sides is about 17 Mflops on a Sun Sparc 20 and about 27 Mflops on an IBM RS6000 Model 590. The "Lite" interface provides no measurable performance gain, except for some very small problems. The blocked schemes (BSR and VBR) begin to pay off once the blocksize exceeds roughly 5 to 10; for smaller blocksizes, the point-wise (CSR) scheme is more efficient.
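To make the distinction between the point-wise and blocked schemes concrete, here is a minimal sketch of a matrix-vector multiply in each format. It is written in Python for readability and is not the Sparse BLAS implementation; the function names and array layouts (`row_ptr`, `col_idx`, row-major blocks) are assumptions chosen for illustration. The BSR version does a small dense multiply per stored block, which is where larger blocksizes can amortize the indexing overhead.

```python
def csr_matvec(row_ptr, col_idx, vals, x):
    """y = A*x with A stored point-wise in CSR (compressed sparse row)."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # One indirect lookup into x per stored nonzero.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y


def bsr_matvec(brow_ptr, bcol_idx, blocks, b, x):
    """y = A*x with A stored as dense b-by-b blocks in BSR (block sparse row).

    Each entry of `blocks` is a row-major list of b*b values (an assumed layout).
    """
    n_brows = len(brow_ptr) - 1
    y = [0.0] * (n_brows * b)
    for bi in range(n_brows):
        for k in range(brow_ptr[bi], brow_ptr[bi + 1]):
            bj = bcol_idx[k]  # one index lookup per block, not per entry
            block = blocks[k]
            for r in range(b):
                acc = 0.0
                for c in range(b):
                    acc += block[r * b + c] * x[bj * b + c]
                y[bi * b + r] += acc
    return y


# The same 4x4 block-diagonal matrix in both formats (blocksize 2).
csr = ([0, 2, 4, 6, 8], [0, 1, 0, 1, 2, 3, 2, 3],
       [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
bsr = ([0, 1, 2], [0, 1], [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]])
x = [1.0, 1.0, 1.0, 1.0]
print(csr_matvec(*csr, x))                    # [3.0, 7.0, 11.0, 15.0]
print(bsr_matvec(bsr[0], bsr[1], bsr[2], 2, x))  # [3.0, 7.0, 11.0, 15.0]
```

Both calls compute the same product; only the storage and the inner loop structure differ.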

Also available in PostScript form:
Preliminary Performance Studies, July 1996 (34K gzipped PostScript file, 9 pages)

The test matrices were generated by reading sparsity patterns from Harwell-Boeing files and using each pattern as the block structure for a matrix of a given blocksize: every nonzero in the pattern becomes one dense block. The results shown are for matrix-vector and matrix-matrix multiplications only, rather than a full DAXPY, since we are interested in the efficiency of the sparse code. Source code for the performance testers is available from the authors.
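The construction described above can be sketched as follows. This is an illustration only, not the authors' test generator: the function name `expand_pattern` is hypothetical, the pattern is passed as a plain list of (row, column) pairs instead of being read from a Harwell-Boeing file, and filling the blocks with random values is an assumption, since the text does not say how entries were chosen.

```python
import random


def expand_pattern(n_rows, pattern, b, seed=0):
    """Turn a point sparsity pattern into BSR arrays with b-by-b blocks.

    pattern: iterable of (i, j) positions of nonzeros in the pattern.
    Returns (brow_ptr, bcol_idx, blocks); each block is a row-major
    list of b*b values. Random fill is an assumption for illustration.
    """
    rng = random.Random(seed)
    cols_by_row = [set() for _ in range(n_rows)]
    for i, j in pattern:
        cols_by_row[i].add(j)

    brow_ptr = [0]
    bcol_idx = []
    blocks = []
    for cols in cols_by_row:
        for j in sorted(cols):
            bcol_idx.append(j)
            blocks.append([rng.random() for _ in range(b * b)])
        brow_ptr.append(len(bcol_idx))
    return brow_ptr, bcol_idx, blocks


# A 2x2 lower-triangular pattern expanded with blocksize 3
# yields a 6x6 matrix holding three dense 3x3 blocks.
brow_ptr, bcol_idx, blocks = expand_pattern(2, [(0, 0), (1, 0), (1, 1)], b=3)
print(brow_ptr, bcol_idx, len(blocks))
```

With this scheme the same pattern can be reused at several blocksizes, so the block structure stays fixed while the matrix dimension and the number of stored entries grow with the blocksize.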

The current test results are for the following matrix patterns:

The experimental parameters are the blocksize and the number of right-hand sides. We present results of testing on a Sun Sparc 20 and an IBM RS6000. Each data point represents the average result of four runs with the same parameters.
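As a rough sketch of how such a data point might be produced, the helper below times a kernel over four runs and converts the mean time to Mflops. It is not the authors' test harness. The names `matmul_flops` and `average_mflops` are hypothetical, the flop count of two operations (one multiply, one add) per stored entry per right-hand side is a common convention assumed here, and averaging the run times (as opposed to averaging the per-run rates) is also an assumption.

```python
import time


def matmul_flops(n_stored, n_rhs):
    """Assumed flop count: one multiply and one add per stored entry per RHS."""
    return 2 * n_stored * n_rhs


def average_mflops(kernel, flops_per_call, runs=4):
    """Time `kernel()` `runs` times and report Mflops from the mean time."""
    total_seconds = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        kernel()
        total_seconds += time.perf_counter() - start
    mean_seconds = total_seconds / runs
    if mean_seconds <= 0.0:
        raise ValueError("kernel finished too quickly to time reliably")
    return flops_per_call / mean_seconds / 1.0e6


# Example: a stand-in kernel performing 10000 multiply-add pairs.
def demo_kernel():
    acc = 0.0
    for i in range(10000):
        acc += 1.5 * i
    return acc


rate = average_mflops(demo_kernel, matmul_flops(10000, 1))
print(f"{rate:.2f} Mflops")
```

The printed rate depends entirely on the machine and interpreter, so it is not comparable to the figures quoted for the Sparc 20 or RS6000.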

Source code for performance testers:

Last updated: July 25, 1996 by KAR.