Programming on Parallel Computing Architectures

J. Benton

Introduction

Utilizing the maximum amount of processing power to solve certain problems has always been important for computer scientists.  Because the cost of processors is decreasing, parallel processing, the ability to execute several computations at once to solve a problem, has become an increasingly viable solution for success in this endeavor [2].  There are different ways of building parallel systems.  This article will give an overview of the types of parallel systems available and give examples of suitable problems for each system.

Parallel architectures generally lie in three categories, MISD (multiple instruction stream, single data stream), SIMD (single instruction stream, multiple data stream) and MIMD (multiple instruction stream, multiple data stream).  There have been no commercial MISD machines built [6] and as such, only the advantages of SIMD (pronounced symdee) and MIMD (pronounced mymdee) architectures will be discussed.

On SIMD systems, execution of a single instruction occurs simultaneously on different sets of data [2].  Each processor has its own memory for data, but there exists only one instruction memory that attaches to a single control processor.  The control processor broadcasts instructions to each of the other processors and has the responsibility of performing any nonparallel parts of execution.  Each processor can communicate with its neighboring processors.  Since none of the processors share memory, yet each executes the same instructions, all information must transmit in unison.  SIMD architectures work best on problems where the same functions execute on many sets of data, such as in image processing in which each pixel is operated on turn.

Unlike SIMD architectures, MIMD systems provide a separate set of instructions for each processor.  This allows the processors to work on different parts of a problem asynchronously and independently.  To communicate, a MIMD system uses either a shared-memory architecture or a distributed-memory architecture.  Shared memory MIMD systems rely on a networked assembly of memory for communication.  On the other hand, distributed memory systems consist of processors that have their own memory.  The most commonly used method of processor interaction on distributed-memory MIMD systems is message passing, which allows the passing of information between the processors through a communications network [8].  MIMD systems provide a way to utilize processing power to solve problems that split into parts where different but dependent calculations are performed, such as in physical simulations.

The remainder of this paper discusses both SIMD and MIMD architectures.  First, the paper focuses on issues involving SIMD systems and gives an example of how a specific SIMD program operates.  Then, the paper presents two ways of solving a problem on a MIMD system, considering performance issues that must be addressed.

Background

When looking at the speed of a parallel program the first step is to find how much work is executing in parallel and how much completes with only one processor.  Gene Amdahl, architect of the IBM 360 computers, developed a method of measuring speedup by using the number of processors executing in parallel in a formula [2].

Let Fractionparallel be the fraction of time the program spends in parallel, Fractionserial be the fraction of time the program spends performing serial operations and Processorsparallel be the number of processors executing in parallel:

  [5]

Amdahl's law is used to define speedup, or how much faster a machine will run given a number of processors.  An alternative method of defining speedup using Amdahl's law is,

  [6].

Evidence exists that Amdahl's law has shortcomings when applied to systems executing some programs on massively parallel processors.  Some alternative methods exist for measuring speedup on these systems [5].  However, Amdahl's law remains a sufficient estimate of speedup for this discussion.

SIMD

Computer programs that solve problems requiring similar action on different data are common [6].  These types of problems are said to have data parallelism.  SIMD architectures have provided us a way to apply parallel data to the same set of instructions by using processor elements (PEs).  A PE is a processor that has its own memory and is connected with other PEs.  Each PE performs a single operation on different but similarly structured sets of data, providing a high degree of speedup on massively data parallel problems.  Many machine-dependent languages exist to program SIMD systems.  Most of these languages require a specific topology.  This requirement creates a disadvantage by making the languages non-portable.  To counter-act this shortcoming, a programming language needs a method to define how PEs communicate with one another.  A machine-independent language, called Parallaxis, provides this ability and gives an environment to learn how to program SIMD systems [1].

            Parallaxis is based on the Modula-2 programming language, and each program written in Parallaxis includes both a parallel algorithm and information about the system topology and number of PEs in a particular system.  Because the topology of PEs exists in Parallaxis programs, the language uses virtual PEs defined in code to provide for portability.

Heat Transfer on a Plate

Presented here is an example algorithm used to simulate heat transfer across a plate to illustrate the abilities of Parallaxis on a SIMD system.  The connections and configuration of the SIMD system are defined first.  The two-dimensional grid arrangement in Figure 1 illustrates this.  Each PE handles data that represents a discrete location on a metal plate.

Figure 1.  Two-dimensional grid topology for our algorithm.  Each shaded region represents a PE and each edge represents a connection.

The results of this algorithm are such that each PE will have the average value of its surrounding PEs on each iteration, thereby spreading the value of the heated location across the plate.  The Appendix A shows the Parallaxis code for the above algorithm.

In this example, virtual PEs send information to their neighbors only if their neighbor exists.  That is, if a PE does not have a neighbor in the direction indicated, it will not attempt to send data and instead use data from its own memory to replace the data it would normally receive from a neighbor.

The algorithm first goes through a serial portion, where it sends an initial heat value to a starting location on the grid.  The initial location refreshes its value to simulate a constant hot spot on the plate.  A control processor, which sends instructions and executes serial parts of the algorithm, performs the processing required to send the value to the initial location.  During this time, each PE sits without executing any instructions of the algorithm.

All PEs perform the following loop after initialization.  Note that numSec indicates the number of times the loop will run, recTemp holds the total of all surrounding temperatures and currTemp equals the current temperature that the PE represents.

for(i=0;i<numSec;i++)
{
         recTemp = east processor temperature + west processor temperature +
                    south processor temperature + north processor temperature;
         currTemp = recTemp / 4;
         Refresh PE with hot spot
}

Speedup

When applying Amdahl's law to the above algorithm, a high speedup is expected.  That is, Amdahl's law estimates that if four processors are applied to this algorithm then a speedup close to four will occur.  This is because the serial portion of the program, which is responsible for initializing variables, is small.

 

MIMD

There are problems that SIMD systems are not well suited to solve.  Problems that involve conditional branches are able to slow down SIMD machines, causing some PEs to remain idle while others execute.  MIMD architectures alleviate this impediment to speedup.  Because the processors in a MIMD system execute asynchronously from one another, they do not need to communicate at any specified interval unless specifically programmed to do so. 

On a MIMD system, when a process has the data it needs to execute it is not necessary for it to communicate with other processors.  When work runs out, however, processors need a method to communicate with one another.  The Message Passing Interface (MPI) standard was developed to help facilitate communication and to provide portability across different systems [9].  MPI specifies the names, calling sequences and results of message passing subroutines called from programs [4].  MPI libraries exist for Fortran, C, and C++ and compile and link to programs just as other libraries do.

Shown here is an example of a parallel tree searching algorithm using the MPI standard on a distributed-memory MIMD architecture to show the concepts necessary to program MIMD systems by giving an example of tree searching in parallel.

The Knight's Tour

The Knight's Tour problem is presented here as an example of a tree searching problem.  The original Knight's Tour problem asks whether there exists a path on an 8 by 8 chessboard where a knight touches every square exactly once, ending one move away from the starting square.  For example, Figure 2 shows a knight's tour on an 8 by 8 chessboard; the location marked by a zero indicates the starting square.   This problem is a form of the problem of finding a Hamilton circuit within a graph [3], where the nodes are represented by squares and each edge represents a knight's move on the board.  The problem of finding a Hamilton circuit is NP-complete and requires exhaustive searching [7].

Figure 2.  An 8 x 8 reentrant knight's tour.

The original problem has been changed in order to eliminate some of the search space.   The problem will now be redefined that of, given an initial square, finding the number of non-reentrant tours, tours that do not necessarily end on the initial square, and are now searching on a 5 by 5 board.  In addition, the search is using backtracking to reduce the amount of searching done.  Non-reentrant tours are counted because no reentrant tours exist on 5 by 5 boards.  Searching occurs on a 5 by 5 board because it creates a complex enough graph to illustrate tree searching in parallel.  Figure 3 shows a 5 by 5 non-reentrant tour.

Figure 3.   A 5 x 5 non-reentrant knight's tour.

Algorithms for the Knight's Tour Problem

The knight needs to move so that it does not land on a square that it has previously visited.  A definition of the current problem state is used to ensure that no searched moves enter an invalid position.  This definition of the current state of a problem is called a partial solution of the problem, and the process of ensuring no invalid positions are searched is called backtracking.  The partial solutions of the problem are also used to indicate when the goal of a knight's tour has been reached.  For the example problem, a partial solution consists of a two-dimensional array representing the chessboard.  In the array, each element represents a chessboard square that is either empty or holds a number.  The number serves to indicate how many moves were required to reach that location on the chessboard.  Also part of the partial solution is a Cartesian coordinate giving the current location of the knight.  The coordinate system is such that (0,0) represents the first square on the chessboard.

A tree search tries to find a solution by searching each possible state until the goal is found.  It does this by systematically expanding partial solutions into all possible states.  Figure 4 shows how the 5x5 board is expanded.

Figure 4.  Each box represents an iteration of tree expansion.  Each partial solution is expanded into the next possible partial solutions.

The representation is used to describe three algorithms to solve the Knight's Tour problem.  First, a serial algorithm presents how backtracking is used to solve the problem.  The other algorithms are parallel.  One algorithm dedicates a processor to act as a centralized work distributor to delegate work to each of the other processors.  The other uses no centralized distributor and relies on each of the processors to request work from the others.

Serial Algorithm: Backtracking

Backtracking is used for the serial algorithm.  The variable board represents a 5x5 two-dimensional array, x_init represents an initial x-coordinate, y_init­ represents an initial y value, and depth represents the amount of moves needed to reach the current spot.  The variable num_solutions indicates the number of solutions found.

depth = 0
board(x_init, y_init) = 1
do
{
      if not at a depth of 25 (most possible moves) then
            expand next possible moves;
      otherwise num_solutions =  num_solutions + 1
} until no more partial solutions

By considering successive moves that the knight can make one can eliminate the impossible moves by checking their validity, thereby deleting unnecessary branches of the tree's search space.

Parallel Algorithm: Centralized Distributor Approach

Using one processor to give work to all other processors provides an introduction to work distribution, or load balancing.  Load balancing is the process of equalizing the amount of work that each processor performs.  If an algorithm that solves a problem does not provide load balancing, processors could be sitting idle while others are working to solve a problem.  For the centralized work distributor approach to the Knight's Tour problem, one processor is used to delegate work to all other processors.  Initially, the centralized distributor produces this work, but after the other processors have created some partial solutions of their own, they send a portion of them to the distributor.  In turn, the distributor sends the work it has received to any other process that needs something to do.

The centralized distributor algorithm can be broken into functions that the distributor performs and those that the other processors do.

Centralized Distributor

Create partial solutions to depth 1 using the backtracking algorithm from the serial program.

Divide partial solutions equally between processors and send partial solutions to processors in a round robin fashion until no work exists to send.

do
{
     If partial solutions are pending from a processor then
          receive those partial solutions and place into a queue.
     If a request for work is pending from a processor then
          receive request 
     If all other processors have requested work and the queue is empty then
          send a termination signal to all processes and quit
     otherwise
          send work to requesting processor.
} forever

Processors

Receive initial work from distributor.
do
{
     If the queue is empty, request work from distributor.  Wait for and receive work.  If a termination signal comes instead, terminate.
     Perform backtracking search for one level deep
     If a solution is found, count it and break loop
      Otherwise, place partial solutions on queue.
      If maximum expansions reached, send a portion of the work to the distributor.
} forever

In order to optimize this algorithm, we found the amount of work that the centralized distributor would send to processes, called distributor divide number, finding the value that provided the best time.  Separately, the optimum value for the work that processes would send the centralized distributor was found, called the process divide number. In addition, the number of partial solution expansions before sending the distributor work, called maximum expansions was found.  This was done by executing the parallel program on five processors, and holding the value of one of the variables constant while changing the other values.  Figures 5, 6 and 7 show the results of finding the optimum value of the distributor divide number, processor divide number and maximum expansions respectively.  The average times were found using five processors.

Figure 5.  Finding the optimum value for the way to divide work among processes.

Figure 6.  Finding the optimum value for the way to divide work to send to distributor.

Figure 7.  The maximum number of expansions to perform before sending work to the distributor.

Parallel Algorithm: Distributed Load Balancing Approach

The distributed load balancing approach is much like the centralized distributor approach without a central processor providing work.  Instead of each processor requesting work from a distributor, they request and receive work from each other, thereby allowing all of the processors to work on the problem simultaneously.  One processor performs the first expansion.  After that, all processors execute in parallel.  Each performs the following algorithm.

A specified processor expands problem to one level deep.
do
{
If the work queue is empty, request work.  Wait to receive work from at least one processor. 
If all other processors request work while waiting then terminate (summing the number of solutions found).
Perform backtracking search one level deep.
If out of work, break loop.
If work is incoming then receive work.
      If a solution is found, count it and break loop.
            Otherwise, place partial solutions on queue.
      If the maximum number of expansions not reached then break loop.
If other processors are requesting work then send a portion of the work to the requesting processor.
} forever

Like the centralized distributor approach, the load balancing approach also has a maximum number of expansions that execute before servicing other processes.  Figure 8 shows the optimum value found.  The values originate from runs the average of four runs on five processors.

Figure 8.  Finding the optimum value for the maximum expansions before servicing requests for work.

Evaluation and Comparison

When four processors are applied to a parallel program, we hope to get a perfect speedup of 4.  However, this case rarely occurs because of the overhead incurred from serial initialization, message passing, and the waiting time that each parallel process has.  Figure 9 shows the amount of time each parallel program takes to execute on a certain number of processors and Figure 10 shows the amount of speedup we achieve on each program.  The programs ran on a network of Intel Pentium 133 MHz machines, each running Windows NT 4.0 and containing 64 megabytes of RAM with the machines networked in a star configuration using a centralized hub.

Figure 9.  Time spent executing.

Figure 10.  Speedup of parallel programs.

From both programs, one can see that as more processors are applied, a degradation of performance occurs until a point where performance begins to improve.  To understand why this is so, look to the how quickly each program executes on a single processor.

In contrast to both of the parallel programs, Figure 11 shows that serial algorithm executes very quickly at 2.90 seconds.  The distributed load balancing program, running with one processor, completes execution in an average of 28.34 seconds, and the centralized distributor approach running with a distributor process and a working process on a single processor runs at 14.39 seconds.  The overhead we have incurred from message passing decreases the speed of our parallel programs enough that we have little-to-no improvement over a serial program.

Serial Program Time

Distributed Load Balancing Time

Centralized Distributor Time

(1 Processor, 2 Processes)

2.9 Seconds

28.34 Seconds

14.39 Seconds

Figure 11.   The amount of time of completion using one processor on all approaches.

Let us look at the cost of message passing.  To estimate the amount of time needed to pass messages to each machine, we will time how long it takes to send 1028 bytes of data from one machine to another.  For five tries, we average a time of 10 milliseconds for every 1028 bytes.  Each partial solution in the algorithm requires 112 bytes.  Using these values, we can find that it takes just over 1 millisecond to send a partial solution from one machine to another given that no other communication is occurring over the network (i.e. no network traffic related problems).  To understand how this affects our execution speed, we must establish how many partial solutions each processor is sending during execution.  It may be acceptable to have an increase in the number of partial solutions sent, provided the relationship between network traffic and problem completion is minimal.  Figure 12 provides a view into how the both load balancing approaches are affected as we apply more processors to the problem.  By comparing the completion times from Figure 9 to the total number of partial solutions sent, it is evident a correspondence exists between these two factors on both programs.  The amount of improvement differs between them, however.

Figure 12.   The distributed load balancing approach shows significant reduction in network traffic as more processors are applied to the problem.

Next, looking at the behavior of the wait time shown in Figure 13, one can see that the time a processor needs to wait after sending a request increases as we apply more processors in the centralized distributor program.  It is clear that when many processors request work, the work distributor can only send data to one processor at a time.  It can be hypothesized that as processors are added to this system, there is a higher likely-hood that there will be an increased wait time for all processors.  However, an interesting result appears.  When 5 processors are applied, the wait time decreases.  The search tree structure is the probable culprit behind this.  As the centralized distributor splits the problem into parts, some branches of the search tree yield dead ends and a processor must then request work, giving a higher wait time.  The centralized distributor will divide the problem in the same manner only when an identical number of processors are applied, thereby giving us consistent results on wait time.

Figure 13.   The average time spent waiting depends on the number of processors because of how much work is done by all of the processors.

The amount of wait time spent in the distributed load balancing approach is significantly higher than that of the centralized distributor approach for most cases.  One may think that this is in contrast to the concept of the algorithm.  Each processor that needs work is serviced by each of the other processors, hoping for less time to wait.  Considering other factors is important in the distributed load balancing approach, however.  There is a need to decide how much work is given the requesting processor.  Giving all work provides undesirable results, as it would cause the requesting processor to receive all work on all processors.  The solution used in this example is to give 100 partial solutions to each processor requesting work.  Another solution is to give a portion of the work equal the number of partial solutions divided by the number of processors.  Different solutions will generate different wait time distributions.

Even though centralized distributor program has a faster completion time, the distributed load balancing approach still has the highest amount of work load balance.  Applying more processors to either program would likely show that more speedup is possible.

Conclusion

Fast computers provide us a way to execute our computationally intensive programs quickly, but limitations to the speed of computers exist, and as such parallel computing has provided a way to remove that limitation and achieve a large amount of computing power.  The two parallel computing architectures presented here show how parallel computing performs, and also shows considerations that must be kept in mind when applying algorithms to parallel systems.

We have seen that SIMD systems are generally suited to problems that require few conditional branches and that distributed-memory MIMD systems work best on problems that do not require large amounts of communication.  Our experiments with the knight's tour problem have shown that by reducing the amount of communication required between processors, we can decrease the amount of time required to execute a program.

References

1
Bräunl, Thomas.  Parallaxis-III - A structured data-parallel programming language, 1995.  http://www.ee.uwa.edu.au/~braunl/parallaxis/
2
Dowd, K.  High Performance Computing, Second Ed.  O'Reilly & Associates, 1998.
3
Gibbons, A.  Algorithmic Graph Theory.  Cambridge University Press, Cambridge U.K., 1985.
4
Gropp, W., et al., Using MPI: Portable Parallel Programming with the Message-Passing Interface, Second Ed., MIT Press, 1999.
5
Gustafon, John L.  Reevaluating Amdahl's law.  Communications of the ACM 31, 5.   (May 1998).
6
Hennessey, J.L. and Patterson, D.A., Computer Architecture: A Quantitative Approach, Second Ed., Morgan Kaufmann Publishers, Inc.,  San Francisco, 1996.
7
McHugh, J.A., Algorithmic Graph Theory, Prentice Hall, New Jersey, 1990.
8
Pacheco, P.S., Parallel Programming with MPI, Morgan Kaufmann Publishers, Inc., San Francisco, 1997
9
Snir, M., et al., MPI: The Complete Reference, Second Ed., Vol. 1, MIT Press, 1998.

Bibliography

J. Benton is a graduate student at the Department of Computer Science and Engineering at Arizona State University. His research interests include parallel computing algorithms, distributed operating systems and fault tolerance.