Overview
The OBS supports complex scheduling of multiple batch jobs through two applications, batchmaster and batchslave. The user launches batchmaster and provides it with a task script, a Tcl script that describes the set of tasks for batchmaster to accomplish. The work itself is done by instances of batchslave that batchmaster launches. The task script may be modeled after the included simpletask.tcl or multitask.tcl sample scripts.
The OBS has been designed to control multiple sequential and concurrent micromagnetic simulations, but batchmaster and batchslave are completely general and may be used to schedule other types of jobs as well.
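The application batchmaster is launched by the command line:
    tclsh oommf.tcl batchmaster [standard options] task_script \
        [host [port]]
where task_script is the task script described above, and the optional host and port arguments give the network address on which the master waits for its slaves to connect (host defaults to localhost).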
When batchmaster is run, it sources the task script. Tcl commands in the task script should modify the global object $TaskInfo to inform the master what tasks to perform and optionally how to launch slaves to perform those tasks. The easiest way to create a task script is to modify one of the included example scripts. More detailed instructions are in the Batch task scripts section.
After sourcing the task script, batchmaster launches all the specified slaves, initializes each with a slave initialization script, and then feeds tasks sequentially from the task list to the slaves. When a slave completes a task it reports back to the master and is given the next unclaimed task. If there are no more tasks, the slave is shut down. When all the tasks are complete, the master prints a summary of the tasks and exits.
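The hand-out order described above can be pictured with a small stand-alone simulation. This is only a sketch, not batchmaster code: the proc name Dispatch is hypothetical, and it assumes for simplicity that slaves finish in the order they were started, which real runs need not do.

```tcl
# Illustrative simulation only; Dispatch is a hypothetical name, not OOMMF code.
# Assumption: slaves report back in the same order in which they were started.
proc Dispatch {tasks slaves} {
    set log {}
    set busy {}
    # Start-up: each slave claims the first unclaimed task, if any.
    foreach s $slaves {
        if {[llength $tasks] == 0} {
            lappend log "slave $s shut down"
            continue
        }
        lappend log "slave $s runs [lindex $tasks 0]"
        set tasks [lrange $tasks 1 end]
        lappend busy $s
    }
    # Each time a slave reports back it gets the next task or is shut down.
    while {[llength $busy] > 0} {
        set s [lindex $busy 0]
        set busy [lrange $busy 1 end]
        if {[llength $tasks] == 0} {
            lappend log "slave $s shut down"
        } else {
            lappend log "slave $s runs [lindex $tasks 0]"
            set tasks [lrange $tasks 1 end]
            lappend busy $s
        }
    }
    return $log
}

# Five tasks shared between two slaves.
puts [join [Dispatch {A B C D E} {0 1}] \n]
```

With five tasks and two slaves, the log shows the slaves alternating through the list until it is exhausted, after which each slave is shut down in turn.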
When the task script requests that jobs be launched and controlled off the local machine, with slaves running on remote machines, the command line argument host must be set to the local machine's network name, and the $TaskInfo methods AppendSlave and ModifyHostList must be called from inside the task script. Furthermore, OOMMF does not currently supply any methods for launching jobs on remote machines, so a task script that requests remote launches requires a working rsh command or equivalent. Details are given in the discussion of task scripts below.
The application batchslave may be launched by the command line:
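    tclsh oommf.tcl batchslave [standard options] \
        host port id password [script [arg ...]]
where host and port give the address of the master to contact, id and password are the ID and password the slave passes back to the master, and script is an optional Tcl script to source, with arg ... passed to it as arguments.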
In normal operation, the user does not launch batchslave. Instead, instances of batchslave are launched by batchmaster as instructed by a task script. Although batchmaster may launch any slaves requested by its task script, by default it launches instances of batchslave.
The function of batchslave is to make a connection to a master program, source the script and pass it the list of arguments arg .... Then it receives commands from the master, and evaluates them, making use of the facilities provided by script. Each command is typically a long-running one, such as solving a complete micromagnetic problem. When each command is complete, the batchslave reports back to its master program, asking for the next command. When the master program has no more commands batchslave terminates.
Inside batchmaster, each instance of batchslave is launched by evaluating a Tcl command. This command is called the spawn command, and it may be redefined by the task script in order to completely control which slave applications are launched and how they are launched. When batchslave is to be launched, the spawn command might be:
    exec tclsh oommf.tcl batchslave -tk 0 -- $server(host) $server(port) \
        $slaveid $passwd batchsolve.tcl -restart 1 &
The Tcl command exec is used to launch subprocesses. When the last argument to exec is &, the subprocess runs in the background. The rest of the spawn command should look familiar as the command line syntax for launching batchslave.
The example spawn command above cannot be completely provided by the task script, however, because parts of it are only known by batchmaster. Because of this, the task script should define the spawn command using "percent variables" which are substituted by batchmaster. Continuing the example, the task script provides the spawn command:
    exec %tclsh %oommf batchslave -tk 0 %connect_info \
        batchsolve.tcl -restart 1
batchmaster replaces %tclsh with the path to tclsh, and %oommf with the path to the OOMMF bootstrap application. It also replaces %connect_info with the five arguments from -- through $passwd, which give batchslave the hostname and port where batchmaster is waiting for it to report, and the ID and password it should pass back. In this example, the task script instructs batchslave to source the file batchsolve.tcl and pass it the arguments -restart 1. Finally, batchmaster always appends the argument & to the spawn command so that all slave applications are launched in the background.
The communication protocol between batchmaster and batchslave is evolving and is not described here. Check the source code for the latest details.
    $TaskInfo AppendTask A "BatchTaskRun taskA.mif"
The AppendTask method expects two arguments, a label for the task (here "A") and a script to accomplish the task. The script is passed across a network socket from batchmaster to a slave application, and is then interpreted by the slave. (In particular, keep in mind that the file system seen by the script is that of the machine on which the slave process is running.)
This example uses the default batchsolve.tcl procs to run the simulation defined by the taskA.mif MIF file. If you want to make changes to the MIF problem specifications on the fly, you will need to modify the default procs. This is done by creating a slave initialization script, via the call
    $TaskInfo SetSlaveInitScript { <insert script here> }
The slave initialization script does global initializations, and also generally redefines the SolverTaskInit proc; optionally the BatchTaskRelaxCallback and SolverTaskCleanup procs may be redefined as well. At the start of each task SolverTaskInit is called by BatchTaskRun (in batchsolve.tcl), at each control point BatchTaskRelaxCallback is executed, and at the end of each task SolverTaskCleanup is called. The first and third are passed the arguments that were passed to BatchTaskRun. A simple SolverTaskInit proc could be
    proc SolverTaskInit { args } {
        global mif basename outtextfile
        set A [lindex $args 0]
        set outbasename "$basename-A$A"
        $mif SetA $A
        $mif SetOutBaseName $outbasename
        set outtextfile [open "$outbasename.odt" "a+"]
        puts $outtextfile [GetTextData header \
                "Run on $basename.mif, with A=[$mif GetA]"]
    }
This proc receives the exchange constant A for this task on the argument list, and makes use of the global variables mif and basename. (Both should be initialized in the slave initialization script outside the SolverTaskInit proc.) It then stores the requested value of A in the mif object, sets up the base filename to use for output, and opens a text file to which tabular data will be appended. The handle to this text file is stored in the global outtextfile, which is closed by the default SolverTaskCleanup proc. A corresponding task script could be
    $TaskInfo AppendTask "A=13e-12 J/m" "BatchTaskRun 13e-12"
which runs a simulation with A set to 13e-12 J/m. This example is taken from the multitask.tcl sample script. (For commands accepted by mif objects, see the file mmsinit.cc. Another object that can be gainfully manipulated is solver, which is defined in solver.tcl.)
If you want to run more than one task at a time, then the $TaskInfo method AppendSlave will have to be invoked. This takes the form
    $TaskInfo AppendSlave <spawn count> <spawn command>
where <spawn command> is the command to launch the slave process, and <spawn count> is the number of slaves to launch with this command. (Typically <spawn count> should not be larger than the number of processors on the target system.) The default value for this item (which gets overwritten with the first call to $TaskInfo AppendSlave) is
    1 {Oc_Application Exec batchslave -tk 0 %connect_info batchsolve.tcl}
The Tcl command Oc_Application Exec is supplied by OOMMF and provides access to the same application-launching capability that is used by the OOMMF bootstrap application. Using a <spawn command> of Oc_Application Exec instead of exec %tclsh %oommf saves the spawning of an additional process. The default <spawn command> launches the batchslave application, with connection information provided by batchmaster, and using the script batchsolve.tcl.
Before evaluating the <spawn command>, batchmaster applies several percent-style substitutions useful in slave launch scripts: %tclsh, %oommf, %connect_info, %oommf_root, and %%. The first is the Tcl shell to use, the second is an absolute path to the OOMMF bootstrap program on the master machine, the third is connection information needed by the batchslave application, the fourth is the path to the OOMMF root directory on the master machine, and the last is interpreted as a single percent. batchmaster automatically appends the argument & to the <spawn command> so that the slave applications are launched in the background.
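The percent substitutions can be pictured with a short Tcl sketch. It is an illustration only and makes no claim about how batchmaster implements the substitution: the proc name ExpandSpawnCommand is hypothetical, as are all the sample paths, the port, the ID, and the password. The sketch uses the standard Tcl command string map, which tries keys in list order at each position, so the longer key %oommf_root is listed before %oommf.

```tcl
# Illustrative only: ExpandSpawnCommand is a hypothetical helper, not an OOMMF API.
proc ExpandSpawnCommand {template tclsh oommf root connect} {
    # string map checks keys in list order at each position, so %oommf_root
    # must precede %oommf; %% is listed first so it becomes a literal %.
    set map [list %% % \
                  %tclsh $tclsh \
                  %oommf_root $root \
                  %oommf $oommf \
                  %connect_info $connect]
    # Append & as the text describes, so the slave runs in the background.
    return "[string map $map $template] &"
}

# Hypothetical values standing in for what only batchmaster knows.
set connect  [list -- localhost 15136 0 secret]
set template {exec %tclsh %oommf batchslave -tk 0 %connect_info batchsolve.tcl -restart 1}

puts [ExpandSpawnCommand $template /usr/bin/tclsh \
          /path/to/oommf/oommf.tcl /path/to/oommf $connect]
```

Because string map makes a single pass, text produced by one substitution is never expanded again, so a %% in the template always yields exactly one percent sign.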
To launch batchslave on a remote host, use rsh in the spawn command, e.g.,
    $TaskInfo AppendSlave 1 {exec rsh foo tclsh oommf/oommf.tcl \
        batchslave -tk 0 %connect_info batchsolve.tcl}
This example assumes that tclsh is in the execution path on the remote machine foo, and that OOMMF is installed in the oommf directory under your home directory there. In addition, you will have to add the machine foo to the host connect list with
    $TaskInfo ModifyHostList +foo
and batchmaster must be run with the network interface specified as the server host (instead of the default localhost), e.g.,
    tclsh oommf.tcl batchmaster multitask.tcl bar
where bar is the name of the local machine.
This may seem a bit complicated, but the examples in the next section should make things clearer.
The first sample task script, simpletask.tcl, is run with the command
    tclsh oommf.tcl batchmaster simpletask.tcl
This example uses the default slave launch script, so a single slave is launched on the current machine, and the 3 simulations are run sequentially. Also, no slave init script is given, so the default procs in batchsolve.tcl are used. Output consists of magnetization states and tabular data at each control point, stored in files on the local machine with base names as specified in the MIF files.
    # FILE: simpletask.tcl
    #
    # This is a sample batch task file.  Usage example:
    #
    #   tclsh oommf.tcl batchmaster simpletask.tcl

    # Form task list
    $TaskInfo AppendTask A "BatchTaskRun taskA.mif"
    $TaskInfo AppendTask B "BatchTaskRun taskB.mif"
    $TaskInfo AppendTask C "BatchTaskRun taskC.mif"
The second task script is a more complicated example running concurrent processes on two machines. This script should be run with the command
    tclsh oommf.tcl batchmaster multitask.tcl bar
where bar is the name of the local machine.
Near the top of the multitask.tcl script several Tcl variables (RMT_MACHINE through A_list) are defined; these are used farther down in the script. The remote machine is specified as foo, which is used in the $TaskInfo AppendSlave and $TaskInfo ModifyHostList commands.
There are two AppendSlave commands, one to run two slaves on the local machine, and one to run a single slave on the remote machine (foo). The latter changes to a specified working directory before launching the batchslave application on the remote machine. (For this to work you must have rsh configured properly. In the future it may be possible to launch remote commands using the OOMMF account server application, thereby lessening the reliance on system commands like rsh.)
Below this the slave init script is defined. The Tcl regsub command is used to place the task script defined value of BASEMIF into the init script template. The init script is run on the slave when the slave is first brought up. It first reads the base MIF file into a newly created mms_mif instance. (The MIF file needs to be accessible by the slave process, irrespective of which machine it is running on.) Then replacement SolverTaskInit and SolverTaskCleanup procs are defined. The new SolverTaskInit interprets its first argument as a value for the exchange constant A. Note that this is different from the default SolverTaskInit proc, which interprets its first argument as the name of a MIF file to load. With this task script, a MIF file is read once when the slave is brought up, and then each task redefines only the value of A for the simulation (and corresponding changes to the output filenames and data table header).
Finally, the Tcl loop structure
    foreach A $A_list {
        $TaskInfo AppendTask "A=$A" "BatchTaskRun $A"
    }
is used to build up a task list consisting of one task for each value of A in A_list (defined at the top of the task script). For example, the first value of A is 10e-13, so the first task will have the label A=10e-13 and the corresponding script is BatchTaskRun 10e-13. The value 10e-13 is passed on by BatchTaskRun to the SolverTaskInit proc, which has been redefined to process this argument as the value for A, as described above.
There are 6 tasks in all, and 3 slave processes, so the first three tasks will run concurrently in the 3 slaves. As each slave finishes it will be given the next task, until all the tasks are complete.
    # FILE: multitask.tcl
    #
    # This is a sample batch task file.  Usage example:
    #
    #   tclsh oommf.tcl batchmaster [-tk 0] multitask.tcl hostname [port]
    #
    # Task script configuration
    set RMT_MACHINE   foo
    # Note that paths that contain any shell metacharacters are likely to fail.
    set RMT_TCLSH     {tclsh}
    set RMT_OOMMF     {/path/to/oommf/oommf.tcl}
    set RMT_WORK_DIR  {/path/to/oommf/app/mmsolve/scripts}
    set LCL_WORK_DIR  {/path/to/oommf/app/mmsolve/scripts}
    set BASEMIF taskA
    set A_list { 10e-13 10e-14 10e-15 10e-16 10e-17 10e-18 }

    # Slave launch commands: two slaves on the local machine, one on the remote
    $TaskInfo ModifyHostList +$RMT_MACHINE
    $TaskInfo AppendSlave 2 "[list cd $LCL_WORK_DIR] \;\
        Oc_Application Exec batchslave -tk 0 %connect_info batchsolve.tcl"
    $TaskInfo AppendSlave 1 "exec rsh $RMT_MACHINE \
        cd \"$RMT_WORK_DIR\" \\\;\
        \"$RMT_TCLSH\" \"$RMT_OOMMF\" batchslave -tk 0 %connect_info batchsolve.tcl"

    # Slave initialization script (with batchsolve.tcl proc redefinitions)
    set init_script {
        # Initialize solver. This is run at global scope
        set basename __BASEMIF__      ;# Base mif filename (global)
        mms_mif New mif
        $mif Read [FindFile ${basename}.mif]
        # Redefine TaskInit and TaskCleanup proc's
        proc SolverTaskInit { args } {
            global mif outtextfile basename
            set A [lindex $args 0]
            set outbasename "$basename-A$A"
            $mif SetA $A
            $mif SetOutBaseName $outbasename
            set outtextfile [open "$outbasename.odt" "a+"]
            puts $outtextfile [GetTextData header \
                    "mmSolve run on $basename.mif, with A=[$mif GetA]"]
            flush $outtextfile
        }
    }
    # Substitute $BASEMIF in for __BASEMIF__ in above script
    regsub -all -- __BASEMIF__ $init_script $BASEMIF init_script
    $TaskInfo SetSlaveInitScript $init_script

    # Create task list
    foreach A $A_list {
        $TaskInfo AppendTask "A=$A" "BatchTaskRun $A"
    }
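One way to check what a task-list loop like the one in multitask.tcl produces, without launching any slaves, is to run it against a stand-in for $TaskInfo in a plain tclsh. The sketch below is an assumption-laden illustration: StubTaskInfo is a hypothetical proc, not part of OOMMF, and it only records AppendTask calls while ignoring every other method.

```tcl
# Hypothetical stand-in for the object batchmaster normally provides.
# It records AppendTask calls and ignores every other method.
set recorded {}
proc StubTaskInfo {method args} {
    global recorded
    if {$method eq "AppendTask"} {
        lappend recorded $args
    }
}
set TaskInfo StubTaskInfo

# Same loop as in the sample script.
set A_list { 10e-13 10e-14 10e-15 10e-16 10e-17 10e-18 }
foreach A $A_list {
    $TaskInfo AppendTask "A=$A" "BatchTaskRun $A"
}

# Print each label with the script that would be sent to a slave.
foreach task $recorded {
    puts "[lindex $task 0] -> [lindex $task 1]"
}
```

Each printed line pairs a task label with its script, for example the label A=10e-13 with the script BatchTaskRun 10e-13, giving six lines for the six values in A_list.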