Distributed-memory programming

The ``art" of distributed-memory programming

If you have not already discovered this, you will probably soon realize that there are significant differences in programming a distributed-memory (DM) machine compared to a conventional machine. In fact, some might say there is a real "art" to DM programming, and a way of thinking that just is not required elsewhere.

The primary reason DM machines are more difficult to use is the fact that, not only is the data in memory distributed, but, in general, the programmer is responsible for ensuring that data is in the right spot at the right time, typically by using a message passing library to send and receive data across a network to and from processing nodes in the machine. (A notable exception, of course, is virtual shared-memory machines, such as the KSR, which have operating systems designed to manage distributed data without explicit user control.) This responsibility on the shoulders of the user is far from trivial, particularly considering the fact that data movement across a network doesn't always behave predictably. When data messages are delayed due to backlog on the network, for example, program synchronization becomes an issue, and a given program may not behave deterministically - a characteristic that many programmers have always taken for granted and counted on as an indisputable fact.

Where does the "art" come in? Primarily in finding the right way to view an application so that a data distribution which maximizes efficiency comes to the fore. It's likely that with enough effort, virtually any distribution of data across a machine can be made to work. However, if the goal is to have a program that actually runs faster on multiple nodes, rather than slower, the most obvious data distribution might not achieve the goal. "Art" might occasionally translate to patience, while the application developer experiments with a range of distributions until the most efficient is found.

SPMD versus MPMD programming

The SP2 is equipped to run in both Single Program Multiple Data (SPMD) and Multiple Program Multiple Data (MPMD) modes. The former requires that each node run an identical (single) program, though the data being processed on each node may differ. The latter allows the user to run different programs on each of the nodes. Probably the most familiar example of MPMD programming is the manager-worker framework, where the user composes two program: a manager program running on one processor to perform I/O or other program set-up functions and then coordinate the efforts of the other processors, and a worker program running on the remaining processors to perform the basic parallel tasks in the application. Often, SPMD programs are designed to mimic this manager-worker model, since there is always a portion of work in an application which it makes sense to perform on only one processor. This mimicking is accomplished by having one of the processors in the pool act alternately as a manager and a worker. NOTE: One important advantage to adopting the SPMD mimic over the original MPMD manager-worker model is that the resulting code can be run on only one processor, thus greatly facilitating debugging. We include a conversion to SPMD example to illustrate how a typical MPMD manager-worker pair can be converted into a SPMD program.

Choosing a message passing library: PVM(e) or MPL?

If you have programmed in PVM in the past, and perhaps even have developed your application for workstation clusters or other machines using PVM, you are most likely interested in putting the PVM portability to use by porting your code without modification to the SP2. Unfortunately, there can be a performance issue regarding your choice of message-passing library. Standard PVM release 3.3.11 is installed on the SP2, and supports use of the SP2 as both cluster of workstations with ethernet interconnections (RS6K architecture) as a more tightly coupled system using the SP Switch (SP2MPI architecture). To use the switch while maintaining a good portion of PVM portability, IBM has provided their own proprietary version of PVM, called PVMe ("e" for "enhanced").

PVMe is able to access the switch, but there are some catches: it is compatible with a previous version of PVM (currently, 3.2.6), and it restricts the user to one PVM process per node - if you try to spawn more processes than you have nodes, PVMe will fail.

If you currently use any new features of PVM3.3 (global reduce operations, for example), then you will have to back-track to 3.2.6 compatibility to run on the switch, and, if you are accustomed to being able to overlap PVM processes on a node, you will have some work to do to adjust to the PVMe restriction. On the bright side, IBM seems committed to continuing to upgrade PVMe in the footsteps of PVM, so eventually those nice PVM3.3 features should be added to PVMe.

An alternative is to give up the PVM portability edge and use IBM's own message passing library, MPL. For obvious reasons, IBM has designed this library to use the switch, and it comes complete with global communication routines and other features not found in PVMe. Further, many of IBM's parallel programming tools for tasks such as program visualization and performance monitoring are available only to users of MPL. The structure of the message passing calls is very similar to those of Intel's NX and Thinking Machines' CMMD - so porting is not necessarily a difficult task. If you have familiarity with these other vendors syntax for message passing, using MPL should be pretty straightforward. If your only message passing experience is with PVM, then the sample programs included later in the tutorial may be useful, allowing you to directly compare a PVM(e) program with a corresponding MPL program which performs the same task. Also, later sections Message Passing with PVM(e) and Message Passing with MPL provide some examples of how to send and receive messages in each library.