Frequently Asked Questions



My code no longer works since the February 16th upgrade. What's wrong?
See the Quick-start notes for adapting to the upgrade. If your code still doesn't work after making the necessary changes, contact the Consulting Office at 301-975-2968.

Are the batch nodes dedicated to my job?
Yes. The batch environment on the SP2 is "single threaded": only one job runs at a time. Occasionally, "runaway" processes left over from previous jobs may appear. If you suspect that such a runaway process is interfering with your batch performance, please notify joyce@danube.

Where can I find documentation to help get me started?
See the pages on Sources of information and documentation.

Which message passing library is the best for me to use?
You have three choices: PVM, MPL, and PVMe. (At some point in the near future, MPI will be another alternative, possibly replacing MPL.)

For portability, choose PVM. It gives up some performance because at present it cannot access the full capabilities of the SP2's high speed switch; however, the drawbacks of the other choices make it a reasonable alternative. Also, the soon-to-be-released version 3.4 of PVM promises to support the SP2 as a machine type, using IBM-provided MPI underneath for high performance. This is likely to make PVM (along with MPI) the most attractive choice by sometime in the fall of 1995.

MPL and PVMe are both designed for optimized use of the high speed switch. Their disadvantages are that MPL is not portable to other machines, while PVMe is one major release behind PVM in supported functionality and restricts the user to only one process per node.

See the section on Choosing a message passing library: PVM(e) or MPL?.

Can PVM utilize the high performance switch?
Yes and no. PVM cannot use the switch at the same level as MPL and PVMe, which use what IBM calls the LSP (Light Speed Protocol) mode of the switch. PVM can use either the Ethernet connections between the nodes or the high speed switch via IP. To use the switch (rather than Ethernet), the PVM hostfile must contain the switch addresses (HSSW1.NIST.GOV, HSSW2.NIST.GOV, etc.) rather than the node addresses (grand1.nist.gov, grand2.nist.gov, etc.). Performance is not as good as with LSP, but it is significantly better than over Ethernet.
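As an illustration of the hostfile change described above, here is a minimal Python sketch that rewrites a list of node addresses into switch addresses, one host per line. It assumes that node grandN corresponds to switch address HSSWN, which the examples above suggest but do not state; the function names are hypothetical and are not part of PVM.

```python
import re

# Assumption: node "grandN.nist.gov" maps to switch address "HSSWN.NIST.GOV".
NODE_RE = re.compile(r"^grand(\d+)\.nist\.gov$", re.IGNORECASE)


def to_switch_name(node_name):
    """Translate a node address into the assumed matching switch address."""
    match = NODE_RE.match(node_name.strip())
    if match is None:
        raise ValueError(f"not a recognized node name: {node_name!r}")
    return f"HSSW{match.group(1)}.NIST.GOV"


def switch_hostfile(node_names):
    """Return hostfile text with one switch address per line."""
    return "".join(to_switch_name(name) + "\n" for name in node_names)


if __name__ == "__main__":
    print(switch_hostfile(["grand1.nist.gov", "grand2.nist.gov"]), end="")
```

Verify the resulting names against the actual switch addresses of your nodes before starting PVM with the generated hostfile.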

My LoadLeveler job hangs on the queue and never runs. What's wrong?
LoadLeveler is not very smart about batch class names, and even a slight misspelling will result in a job that hangs on the queue. If your job seems to be stuck on the queue (that is, no other jobs are running in the node pool for your class and your job still won't start), check the spelling of the class name in your command file. The name must match exactly, including capitalization. A list of valid class names can be found with the command /usr/local/bin/classes (or just "classes", if /usr/local/bin is in your PATH).
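The check described above can also be automated. The following Python sketch looks for a class line in the text of a command file and compares it, exactly, against a list of valid names. It assumes the class is given on a line of the form "# @ class = name"; the class names in the example are made up, and the real list should be taken from the output of the classes command.

```python
import re

# Assumption: the class appears on a keyword line such as "# @ class = name".
CLASS_RE = re.compile(r"^\s*#\s*@\s*class\s*=\s*(\S+)", re.IGNORECASE)


def find_class(command_file_text):
    """Return the class name from the first class line, or None if absent."""
    for line in command_file_text.splitlines():
        match = CLASS_RE.match(line)
        if match:
            return match.group(1)
    return None


def check_class(command_file_text, valid_classes):
    """Report whether the class name exactly matches one of valid_classes."""
    name = find_class(command_file_text)
    if name is None:
        return "no class keyword found"
    if name in valid_classes:
        return "ok"
    # Same letters but different capitalization is the easiest mistake to miss.
    near = [c for c in valid_classes if c.lower() == name.lower()]
    if near:
        return f"capitalization differs: did you mean {near[0]!r}?"
    return f"unknown class {name!r}"


if __name__ == "__main__":
    # "short" and "long" are hypothetical class names used for illustration.
    text = "# @ executable = a.out\n# @ class = Short\n# @ queue\n"
    print(check_class(text, ["short", "long"]))
```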

Where is the standard output from node processes written?
For MPL and PVMe batch jobs, you will find all standard output from your job in the output file created by the batch run, <program>.out.<jobid>.

For PVM jobs, output is usually written to the file /tmp/pvml.<uid> in the local filespace of the main PVM node in the virtual machine. You can either copy that file back to your home area at the end of a batch or interactive run (in batch, via a custom user cleanup script), or call the routine pvm_catchout(), which forces node output to appear on the console of an interactive session or in the output file of a batch job.

My batch output file shows the message "PVMD: One or more nodes have a pending PVM session, please reset them using option '-r'." What's wrong?
Occasionally, runaway processes can disrupt the normal procedure for allocating nodes for PVMe. The system-provided PVMe batch script automatically tries the reset option if it is unsuccessful in allocating nodes, and this usually works. In such a case, the above warning will still appear in your output file (for some reason, this message isn't sent to standard error as you would expect), but it will be followed by otherwise normal output from your program. If it is not, check the corresponding error file for information. If the script was unable to allocate nodes a second time, a message to this effect should appear in the error file. At that point, you can a) try submitting the job again, b) submit the job again later, or c) contact joyce@danube, karin@danube or mlo@danube to check whether there is a problem with the node pool you are trying to access.

My PVMe batch output file shows the message "No node is available to start the process All the nodes are already running a process." What's wrong?
PVMe constrains the user to one process per node. The above message is the result of trying to spawn a process when all nodes of the virtual machine are already running processes. To get around the problem, you can change to regular PVM (the easiest solution), or rewrite your program so that it requires only one process per node. This may mean having one process do double duty as both master and worker on one node, or limiting the worker processes to one less than the total number of available nodes.
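The arithmetic behind the two workarounds above can be sketched in a few lines of Python. The function below is hypothetical and does not call PVMe; it only computes how many processes a master may spawn, and how many processes end up doing worker duty, when each node can hold exactly one process.

```python
def plan_processes(nodes, master_also_works=False):
    """Return (processes_to_spawn, processes_doing_work) for one process per node.

    The master itself occupies one node, so at most nodes - 1 further
    processes can be spawned in either arrangement.
    """
    if nodes < 1:
        raise ValueError("need at least one node")
    if master_also_works:
        # The master doubles as a worker, so every node contributes to the work.
        return nodes - 1, nodes
    if nodes < 2:
        raise ValueError("a dedicated master needs at least two nodes")
    # Dedicated master: only the spawned processes do worker duty.
    return nodes - 1, nodes - 1


if __name__ == "__main__":
    print(plan_processes(8))
    print(plan_processes(8, master_also_works=True))
```

With a dedicated master, one node's worth of computing is given up; letting the master double as a worker recovers it at the cost of a more complicated program.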

Can I do any sort of checkpointing with the LoadLeveler batch system?
The current system does not support checkpointing. However, it is sometimes possible for users to implement their own checkpointing by saving temporary files and keeping track of where the program left off if it is interrupted, particularly by CPU time limits. The system batch scripts can aid in this process by calling a user-defined script just before a batch job is timed out. The user-defined script can save any files required for restart and resubmit the job. Contact karin@danube for details (on-line documentation is in development).
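To make the idea of user-level checkpointing concrete, here is a minimal Python sketch of a program that records its progress in a file after every step and picks up from that file when it is run again. The file layout, the function names, and the stop_after argument (which stands in for an interruption such as a CPU time limit) are all assumptions made for illustration; none of this is part of LoadLeveler or the system batch scripts.

```python
import json
import os


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write the state to a temporary file, then rename it into place,
    so that an interruption never leaves a half-written checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(path, n_steps, stop_after=None):
    """Sum the integers 0..n_steps-1, checkpointing after every step.

    Returns True when finished, False if stopped early (simulated interruption).
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["step"] < n_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return False
        state["total"] += state["step"]
        state["step"] += 1
        done_this_run += 1
        save_checkpoint(path, state)
    return True


if __name__ == "__main__":
    finished = run("checkpoint.json", 10)
    print(finished, load_checkpoint("checkpoint.json"))
```

A real program would checkpoint less often than every step, and a resubmitted batch job would simply call run() again with the same checkpoint file.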