Task
Allocation and Performance Measures for the Box Data Type
by R. J. Hanson, J. J. Hassell, Y. T. Kwok
Introduction of the Problem
The Visual Numerics, Inc. IMSL Distributed Network
Fortran Library product has generic functions and operators.
These apply to several common tasks of numerical linear algebra
and other standard computational algorithms. One of the types
valid for these functions is the box data type. This is
a rank-3 assumed-shape Fortran array with any standard precision,
including complex types. For example suppose we have
matrices
. Each matrix has dimension
. These problems are placed in a rank-3
array. This array is defined by a Fortran declaration:
REAL (kind(1D0)) A(m,n,k)
Each matrix is a rack of the box: A(:,:,p)
=
. This suite
of
problems provides an opportunity for
parallel computations. Each rack is independent of the others,
so matrix operations, such as computing the generalized inverse
, are computable in parallel. Parallel
methods lead to complications that are not present if each rack
is completed in sequence. A reason for tolerating the complication
is to improve the performance. We have hidden most of the added
complexity in our box data type functions and defined operations.
In the rest of the paper we will illustrate the basic
algorithms and design for the DNFL box data type functions. This
is found in Section 2. Section 3 outlines the basic task allocation
algorithm. In Section 4 timing data is presented that shows typical
performance measurements. We present results for matrix inverses,
solutions of linear algebraic equations, matrix products, Cholesky
factorizations and two-dimensional discrete Fourier Transforms,
or DFT. Our software uses the MPI system. The machines we timed
are the Hitachi SR2201 and an array of SUN-Sparc work stations.
The work stations are a single SparcServer 670 MP, two SparcStation
10, and a single SparcStation 20.
Problem Allocation
The programming style is based on the portable MPI
model for parallel computing. The
problems
are completed based on a master-worker assignment strategy,
with some important differences. To explain the differences we
use the standard numbering for the nodes of the MPI communicator:
. Here
is the
rank of the nodes, and
is the
number of machines in the communicator. We developed our algorithm
starting from an example. The general algorithm illustrated by
this example is correct for values of
.
Our approach adds extra details:
- It is critical to have the task assignment algorithm
valid for any value of
.
- For the important special case
it is necessary to have the root machine working. This requirement
complicates the algorithm and code.
- We allow users to specify the order in which
nodes are to be allocated tasks. After the algorithm enters its
self-scheduling phase, assignment adapts so that the fastest completing
worker nodes get the next tasks.
- We allow users to flag nodes that are never to
be assigned any significant work. This avoids the use of certain
worker nodes without creating new MPI communicators.
- Tests are made for the cases where
.
The special value
denotes that MPI has
been finalized. For
, we do not use MPI
communication or other parallel routines in the algorithm given
in the next section. This test prevents calling MPI routines
when these calls are invalid or unnecessary. Users can choose
to avoid any MPI calls by setting
.
- At the outset we broadcast a list of nodes that
work on this computation. All nodes have this list, which is
used to schedule communication with the root.
Self-Scheduling with Root Working
The basic master-worker algorithm consists of an
overall loop that contains an inner loop. One loop is primarily
for the master and the inner loop is for the workers. The root
assigns tasks, and the worker nodes complete the tasks and return
results. The root may also perform tasks. It receives incoming
results and, at the end, signals each node to exit the loop. The
reason for presenting the algorithm in this way, is to emphasize
that one program unit executes at all nodes. What varies at
the nodes is the place in that unit where control resides.
Here are some comments about this Algorithm A:
- The problem and result data are general. To
help fix the ideas, consider the problem data as each matrix
.
The results are the generalized inverses
.
- The special tag, referred to below, is
conveniently defined as an integer whose value is distinct from
the index of any rack in the box. We use the value zero as the
special tag.
- The use of Algorithm A appears to address
parallel execution of many problems of the same type and size.
However, note that the algorithm can effectively assign a single
problem to a worker node. This would ideally be a fast machine
compared to the root node. Consider the particular values
,
the root not working, and the node priority order =
.
Thus at step 5, the root sends the single problem to the worker
node, who receives it. At step 9 the root receives the results.
At step 7 the root exits from the loop. At step 8 the worker
computes the results, sends them to the root, and exits the loop.
- This algorithm will not use parallel computing
with the particular values
. Users can
suppress parallel computing by exchanging values of
with
. This may be desirable for achieving
good performance.
- Speedups can be measured
by recording the total times for the algorithm to execute a box
data type problem for
compared to
and
the root working. These ratios are interesting and a few results
are presented in the next section.
Algorithm A
- Define the working list to be the ranks
of nodes, using the priority order, that will receive at least
one task. The number of nodes on this list is the smaller of
the number of tasks and the number of available nodes.
- Broadcast the working list to all nodes.
- This is the start of the overall loop. It
ends at step 11.
- If this is a worker node, not on the working
list, then exit the overall loop.
- The root node sends problems to worker nodes
on the working list. Each problem is tagged with its positive
rack index.
- This is the start of the self-scheduling
loop. It ends at step 10.
- This step is skipped by all worker nodes. If
there are no problems remaining and results are all in from the
worker nodes, then exit the outer loop. If there is a
problem remaining, the root is on the working list, then
compute the result.
- If this is a worker node, then compute the result.
Send the result and the associated tag to the root. Receive
data for a new problem. If this new problem has the special
tag, then exit from the outer loop. Otherwise cycle
on the self-scheduling loop.
- If this is the root, and if any result is due
from worker nodes, then receive it. Use the tag to identify the
results. If any problems remain, send that worker node a new
problem. If there are no problems remaining, send that worker
node a special tag. Cycle on the self-scheduling loop.
- This is the end of the self-scheduling loop.
- This is the end of the outer loop.
Performance Result Charts
The details of Algorithm A are tolerable provided
there is an improvement in performance. In fact users who call
our box data type functions do not need to deal with this complexity.
We present some timings in this section that show improvements.
We give completion times for computations, where the parallel
strategy
is compared with the non-parallel
case
. These average times are respectively
denoted by
and
.
The timings are repeated more than once to yield respective standard
deviations
and
.
A ratio that indicates the relative improvment of the parallel
vs. the non-parallel times is given by
.
Using the formula for the differential of a quotient yields the
first variation formula
. We have observed
that presenting values for
is commonplace
in the parallel computing literature. But presenting the first
variation
is somewhat novel. Our view
is that the pair of numbers
give a better
idea of what can be repeated by an independent experiment. The
bar charts, with error bars, that follow display these pairs for
specific choices of computer, problem, problem size, number of
repeats, number of nodes, and whether the root works or does not
work. Here are some comments about the timings.
- The computations used double precision IEEE arithmetic.
Our software has versions of the codes for single precision and
complex arithmetic, but this precision seemed representative of
typical speedups.
- Most of the computations showed speedup, compared
with sequentially computing the racks of the box. An exception
to this is the matrix multiply operator, .x.
and the FFT_BOX
and IFFT_BOX
DFT functions. On the network of Sun Sparc workstations, these
are slower than the sequential versions. But on the Hitachi there
is a significant speedup, as desired. We believe the reason for
this is the effective communication technology connecting nodes
in the Hitachi machine.
- Each study used four processors and a small problem
size,
. The timings were repeated thirty
times each. The dependent variable scale with value one is where
a speedup occurs.
- Results are given for the "root working"
and the "root not working." Typically it is better
for performance to have the "root not working" during
parallel computations.
- Speedups by a factor greater than the number
of processors can occur. This phenomena happens when the communication
is efficient and the worker nodes are more powerful than the root
in terms of current performance.
See the Performance Result Charts
References
1 Gropp, William, Lusk, E., Skjellum, A. (1994) USING MPI - Portable Parallel Programming with the Message-Passing Interface, The MIT Press, Cambridge, MA.
2 loc cit., pages 29-36