Task Allocation and Performance Measures for the Box Data Type

by R. J. Hanson, J. J. Hassell, Y. T. Kwok

Introduction of the Problem

The Visual Numerics, Inc. IMSL Distributed Network Fortran Library product has generic functions and operators. These apply to several common tasks of numerical linear algebra and other standard computational algorithms. One of the types valid for these functions is the box data type. This is a rank-3 assumed-shape Fortran array with any standard precision, including complex types. For example suppose we have matrices . Each matrix has dimension . These problems are placed in a rank-3 array. This array is defined by a Fortran declaration:

REAL (kind(1D0)) A(m,n,k)

Each matrix is a rack of the box: A(:,:,p) = . This suite of problems provides an opportunity for parallel computations. Each rack is independent of the others, so matrix operations, such as computing the generalized inverse , are computable in parallel. Parallel methods lead to complications that are not present if each rack is completed in sequence. A reason for tolerating the complication is to improve the performance. We have hidden most of the added complexity in our box data type functions and defined operations.

In the rest of the paper we will illustrate the basic algorithms and design for the DNFL box data type functions. This is found in Section 2. Section 3 outlines the basic task allocation algorithm. In Section 4 timing data is presented that shows typical performance measurements. We present results for matrix inverses, solutions of linear algebraic equations, matrix products, Cholesky factorizations and two-dimensional discrete Fourier Transforms, or DFT. Our software uses the MPI system. The machines we timed are the Hitachi SR2201 and an array of SUN-Sparc work stations. The work stations are a single SparcServer 670 MP, two SparcStation 10, and a single SparcStation 20.

Problem Allocation

The programming style is based on the portable MPI model for parallel computing. The problems are completed based on a master-worker assignment strategy, with some important differences. To explain the differences we use the standard numbering for the nodes of the MPI communicator: . Here is the rank of the nodes, and is the number of machines in the communicator. We developed our algorithm starting from an example. The general algorithm illustrated by this example is correct for values of . Our approach adds extra details:

Self-Scheduling with Root Working

The basic master-worker algorithm consists of an overall loop that contains an inner loop. One loop is primarily for the master and the inner loop is for the workers. The root assigns tasks, and the worker nodes complete the tasks and return results. The root may also perform tasks. It receives incoming results and, at the end, signals each node to exit the loop. The reason for presenting the algorithm in this way, is to emphasize that one program unit executes at all nodes. What varies at the nodes is the place in that unit where control resides.

Here are some comments about this Algorithm A:

Performance Result Charts

The details of Algorithm A are tolerable provided there is an improvement in performance. In fact users who call our box data type functions do not need to deal with this complexity. We present some timings in this section that show improvements. We give completion times for computations, where the parallel strategy is compared with the non-parallel case . These average times are respectively denoted by and . The timings are repeated more than once to yield respective standard deviations and . A ratio that indicates the relative improvment of the parallel vs. the non-parallel times is given by . Using the formula for the differential of a quotient yields the first variation formula . We have observed that presenting values for is commonplace in the parallel computing literature. But presenting the first variation is somewhat novel. Our view is that the pair of numbers give a better idea of what can be repeated by an independent experiment. The bar charts, with error bars, that follow display these pairs for specific choices of computer, problem, problem size, number of repeats, number of nodes, and whether the root works or does not work. Here are some comments about the timings.

See the Performance Result Charts

References


1 Gropp, William, Lusk, E., Skjellum, A. (1994) USING MPI - Portable Parallel Programming with the Message-Passing Interface, The MIT Press, Cambridge, MA.

2 loc cit., pages 29-36