## Sparse Matrices

The transformational approach and the MathBus will be used in a
restructuring compiler that will generate highly optimized parallel
code for sparse matrix problems. Sparse matrix computations are
ubiquitous in computational science since they arise whenever
differential equations are solved using numerical techniques such as
the finite element and finite difference methods. It is therefore
ironic that almost all of the work to date on automatic program
restructuring and parallelization, such as the HPF effort, has focused
on dense matrix problems. The main reason for this is that
restructuring compilers cannot analyze sparse matrix programs. Sparse
matrices are stored in special formats, such as compressed row or
column storage (CRS, CCS), and these formats complicate dependence
analysis. Consider matrix-vector multiplication, which is a key step
in conjugate gradient solvers, as well as in other iterative matrix
algorithms [44]. If **A** is a sparse matrix in CCS form,
the following code computes **Ax**, storing the result in vector **v**.

```
for i := 1 to N do                                   -- one iteration per column of A
    for j := A.column-pointer(i) to A.column-pointer(i+1) - 1 do
        v(A.row-index(j)) := v(A.row-index(j)) + A.val(j) * x(i)
    od
od
```
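For concreteness, here is a minimal runnable sketch of the same computation in Python, assuming 0-based indexing; the array names `col_ptr`, `row_idx`, and `val` are illustrative, not from the original fragment. Note that the result vector is written through `row_idx[j]`, the indirection that causes the analysis trouble discussed below.

```python
def ccs_matvec(col_ptr, row_idx, val, x, n_rows):
    """Compute v = A @ x for A in compressed column storage (CCS)."""
    v = [0.0] * n_rows
    for i in range(len(col_ptr) - 1):                  # one iteration per column of A
        for j in range(col_ptr[i], col_ptr[i + 1]):    # stored nonzeros of column i
            v[row_idx[j]] += val[j] * x[i]             # scatter through the indirection array
    return v

# A = [[4, 0, 0],
#      [0, 5, 2],
#      [1, 0, 3]]   stored column by column:
col_ptr = [0, 2, 3, 5]
row_idx = [0, 2, 1, 1, 2]
val = [4.0, 1.0, 5.0, 2.0, 3.0]
print(ccs_matvec(col_ptr, row_idx, val, [1.0, 1.0, 1.0], 3))  # [4.0, 7.0, 4.0]
```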

This small code fragment illustrates two key points regarding sparse
matrix programs and their parallelization using automatic code
restructuring tools. First, there is substantial parallelism in
sparse matrix problems. A sparse matrix program performs a subset of
the operations performed by the corresponding dense program; if two
operations can be done in parallel in the dense program, they can be
done in parallel in the sparse program. In the matrix-vector product,
this means that different rows of the matrix can be multiplied with
the vector in parallel; for each row, the partial products can be
accumulated in parallel. In fact, in algorithms like the multifrontal
algorithm for sparse Cholesky factorization, the zero entries in
sparse matrices can be exploited to extract *more* parallelism
than is present in the corresponding dense
program [72].
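As a rough illustration of the row-level parallelism (not the output of any compiler discussed here), the following Python sketch uses compressed row storage, where each entry of the result is an independent dot product. In practice one would use OpenMP, MPI, or similar; Python threads are limited by the GIL, so this is illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor

def crs_row_dot(row_ptr, col_idx, val, x, i):
    # Row i of A dotted with x; touches only result slot i, so rows never conflict.
    return sum(val[j] * x[col_idx[j]] for j in range(row_ptr[i], row_ptr[i + 1]))

def parallel_crs_matvec(row_ptr, col_idx, val, x):
    n_rows = len(row_ptr) - 1
    # The outer loop carries no dependence: each task writes a distinct
    # element of the result, so the rows may be dispatched concurrently.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(
            lambda i: crs_row_dot(row_ptr, col_idx, val, x, i), range(n_rows)))
```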

Second, the parallelism in sparse matrix programs is obscured by
the storage scheme for sparse matrices. In particular, the use of
indirection arrays, such as *row-index* in CCS, can foil the
ability of a restructuring compiler to generate parallel code. These
compilers perform *dependence analysis* to determine a partial
order on loop nest iterations, but the analysis techniques work only
when array subscripts are linear functions of loop index variables,
allowing standard tests like the GCD and Banerjee tests to be
used [126,91,18]. When complicated
subscripts such as those involving indirection arrays are used,
dependence analysis fails. With conventional compiler technology, the
matrix-vector product code shown above is doomed to run sequentially.
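To make the linearity requirement concrete, here is a sketch of the classical single-subscript GCD test in Python; this is the standard textbook formulation, not code from any of the cited systems.

```python
from math import gcd

def gcd_test_may_depend(a1, c1, a2, c2):
    """GCD test for two references A(a1*i + c1) and A(a2*j + c2).

    The linear Diophantine equation a1*i - a2*j = c2 - c1 has an integer
    solution only if gcd(a1, a2) divides c2 - c1; otherwise the two
    references can never touch the same element, and the compiler has
    proved independence.
    """
    return (c2 - c1) % gcd(a1, a2) == 0

# A(2*i) vs. A(2*i + 1): gcd(2, 2) = 2 does not divide 1, so no dependence.
print(gcd_test_may_depend(2, 0, 2, 1))  # False
# A(row_index(j)) is not a linear function of j at all, so this test
# (and Banerjee-style bounds tests) cannot even be applied.
```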

The Bernoulli project at Cornell is addressing this problem through
the use of program transformations to *generate* sparse matrix
codes from dense matrix programs. Intuitively, this process can be
viewed as a form of partial evaluation in which the compiler exploits
information about zeros/nonzeros in matrices to eliminate unnecessary
computations, and then chooses an appropriate representation for the
sparse matrices. Since the input to the compiler is a dense matrix
program, the ability to analyze and restructure programs, for example
by using the Lambda loop transformation tool-kit developed at
Cornell [65], is retained. However, the transformation to
sparse code requires information about the sparsity structure of
arrays. To get this information, we are integrating our compiler with
a discretization system being written by Rich Zippel. This system uses
Paul Chew's mesh generator, and incorporates symbolic and numerical
routines for implementing the weighted residual method of discretizing
differential equations. The discretization system has complete
knowledge of the sparsity structure of matrices since this is fully
determined by the meshing and discretization process. This information
will be passed over the MathBus to the restructuring compiler to
generate parallel code. Any bottlenecks detected by the restructuring
compiler can be passed back to the synthesizer so that a revised mesh
or different set of basis functions can be used for better parallel
performance. The two-way flow of information between the
discretization system and the restructuring compiler is an instance of
the synergy which results from an integrated system, a synergy that is
absent in current approaches. Distribution of sparse matrices across
the nodes of the parallel machine is done using standard methods like
recursive spectral bisection [92] and geometric
partitioning [79].
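The Bernoulli compiler itself is not reproduced here, but the partial-evaluation intuition can be sketched in a few lines of Python: the dense double loop over all (i, j) is specialized against a sparsity pattern fixed ahead of time (as it would be by the meshing and discretization process), and a compressed representation is chosen for the surviving iterations. All names below are hypothetical.

```python
def specialize_matvec(nonzeros, n_rows):
    """Toy 'partial evaluation' of a dense matvec against a known pattern.

    nonzeros: iterable of (i, j, value) triples fixed by the discretization.
    Returns a closure computing v = A @ x over only the stored entries.
    """
    # Choose a CRS-like representation for the surviving (i, j) iterations.
    rows = [[] for _ in range(n_rows)]
    for i, j, a in nonzeros:
        rows[i].append((j, a))

    def matvec(x):
        # Only the iterations known to be nonzero remain from the dense loop.
        return [sum(a * x[j] for j, a in rows[i]) for i in range(n_rows)]

    return matvec

# Sparsity (and here, values) known ahead of time, e.g. from the mesh:
A_entries = [(0, 0, 4.0), (1, 1, 5.0), (1, 2, 2.0), (2, 0, 1.0), (2, 2, 3.0)]
matvec = specialize_matvec(A_entries, 3)
print(matvec([1.0, 1.0, 1.0]))  # [4.0, 7.0, 4.0]
```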

To summarize, the compiler generates efficient, parallel sparse code by partially evaluating the dense program with respect to the sparsity structure of the matrices, which it obtains over the MathBus from the discretization system.
