Overview
The OOMMF Batch System (OBS) supports complex scheduling of multiple batch jobs
with two applications, batchmaster and batchslave.
The user launches batchmaster and provides it with
a task script. The task script is a
Tcl script that describes the set of tasks for batchmaster
to accomplish. The work is actually done by instances of
batchslave that are launched by batchmaster.
The task script may be
modeled after the included simpletask.tcl or multitask.tcl
sample scripts.
The OBS has been designed to control multiple sequential and concurrent micromagnetic simulations, but batchmaster and batchslave are completely general and may be used to schedule other types of jobs as well.
The application batchmaster is launched by the command line:
tclsh oommf.tcl batchmaster [standard options] task_script \
   [host [port]]
When batchmaster is run, it sources the task script. Tcl commands in the task script should modify the global object $TaskInfo to inform the master what tasks to perform and optionally how to launch slaves to perform those tasks. The easiest way to create a task script is to modify one of the included example scripts. More detailed instructions are in the Batch task scripts section.
After sourcing the task script, batchmaster launches all the specified slaves, initializes each with a slave initialization script, and then feeds tasks sequentially from the task list to the slaves. When a slave completes a task it reports back to the master and is given the next unclaimed task. If there are no more tasks, the slave is shut down. When all the tasks are complete, the master prints a summary of the tasks and exits.
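The dispatch order described above can be illustrated with a small stand-alone Tcl sketch. This is not batchmaster's implementation: the proc name SimulateDispatch and the task durations are hypothetical, and the slaves are simulated by bookkeeping rather than launched as processes. It only shows how handing each idle slave the next unclaimed task distributes a task list.

```tcl
# Hypothetical simulation of the dispatch policy described above:
# each slave that becomes idle is handed the next unclaimed task.
# The durations are made-up numbers used only to decide who finishes first.
proc SimulateDispatch {tasks nslaves} {
    # tasks: list of {label duration} pairs; nslaves: number of slaves
    set assignment {}
    set free_at {}
    for {set i 0} {$i < $nslaves} {incr i} { lappend free_at 0 }
    foreach task $tasks {
        lassign $task label duration
        # Pick the slave that becomes free earliest (lowest id wins ties)
        set best 0
        for {set i 1} {$i < $nslaves} {incr i} {
            if {[lindex $free_at $i] < [lindex $free_at $best]} { set best $i }
        }
        lappend assignment [list $label $best]
        lset free_at $best [expr {[lindex $free_at $best] + $duration}]
    }
    return $assignment
}

# Four tasks shared between two simulated slaves
foreach pair [SimulateDispatch {{A 5} {B 2} {C 4} {D 1}} 2] {
    lassign $pair label slave
    puts "task $label -> slave $slave"
}
```

With these assumed durations, slave 1 finishes task B first and therefore picks up task C, while slave 0 takes task D after finishing task A.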
When the task script requests the launching and controlling of jobs off
the local machine, with slaves running on remote machines, then the
command line argument host must be set to the local machine's
network name, and the $TaskInfo methods AppendSlave and
ModifyHostList will need to be called from inside the task script.
Furthermore, OOMMF does not currently supply any methods for launching
jobs on remote machines, so a task script which requests the launching
of jobs on remote machines requires a working ssh command or
equivalent. (Details are given in the discussion of AppendSlave below.)
The application batchslave may be launched by the command line:
tclsh oommf.tcl batchslave [standard options] \
   host port id password [auxscript [arg ...]]
In normal operation, the user does not launch batchslave. Instead, instances of batchslave are launched by batchmaster as instructed by a task script. Although batchmaster may launch any slaves requested by its task script, by default it launches instances of batchslave.
The function of batchslave is to make a connection to a master program, source the auxscript, and pass it the list of arguments arg .... It then receives commands from the master and evaluates them, making use of the facilities provided by auxscript. Each command is typically a long-running one, such as solving a complete micromagnetic problem. When each command is complete, batchslave reports back to its master program, asking for the next command. When the master program has no more commands, batchslave terminates.
Inside batchmaster, each instance of batchslave is launched by evaluating a Tcl command. This command is called the spawn command, and it may be redefined by the task script in order to completely control which slave applications are launched and how they are launched. When batchslave is to be launched, the spawn command might be:
exec tclsh oommf.tcl batchslave -tk 0 -- $server(host) $server(port) \
   $slaveid $passwd batchsolve.tcl -restart 1 &
The Tcl command exec is used to launch subprocesses. When the last argument to exec is &, the subprocess runs in the background. The rest of the spawn command should look familiar as the command line syntax for launching batchslave.
The example spawn command above cannot be completely provided by the task script, however, because parts of it are only known by batchmaster. Because of this, the task script should define the spawn command using ``percent variables'' which are substituted by batchmaster. Continuing the example, the task script provides the spawn command:
exec %tclsh %oommf batchslave -tk 0 %connect_info \
   batchsolve.tcl -restart 1
batchmaster replaces %tclsh with the path to tclsh, and %oommf with the path to the OOMMF bootstrap application. It also replaces %connect_info with the five arguments from -- through $passwd in the previous example. These give batchslave
the hostname and port where batchmaster is waiting for
it to report, and the ID and password it should pass back.
In this example, the task script instructs batchslave to source the
file batchsolve.tcl and pass it the arguments -restart 1.
Finally, batchmaster always appends the argument & to
the spawn command so that all slave applications are launched in the
background.
The communication protocol between batchmaster and batchslave is evolving and is not described here. Check the source code for the latest details.
The application batchmaster creates an instance of a BatchTaskObj object with the name $TaskInfo. The task script uses method calls to this object to set up tasks to be performed. The only required call is to the AppendTask method, e.g.,
$TaskInfo AppendTask A "BatchTaskRun taskA.mif"
This method expects two arguments, a label for the task (here ``A'') and a script to accomplish the task. The script will be passed across a network socket from batchmaster to a slave application, and then the script will be interpreted by the slave. In particular, keep in mind that the file system seen by the script will be that of the machine on which the slave process is running.
This example uses the default batchsolve.tcl procs to run the simulation defined by the taskA.mif MIF 1.x file. If you want to make changes to the MIF problem specifications on the fly, you will need to modify the default procs. This is done by creating a slave initialization script, via the call
$TaskInfo SetSlaveInitScript { <insert script here> }
The slave initialization script does global initializations, and also usually redefines the SolverTaskInit proc; optionally the BatchTaskIterationCallback, BatchTaskRelaxCallback and SolverTaskCleanup procs may be redefined as well. At the start of each task SolverTaskInit is called by BatchTaskRun (in batchsolve.tcl), after each iteration BatchTaskIterationCallback is executed, at each control point BatchTaskRelaxCallback is run, and at the end of each task SolverTaskCleanup is called. SolverTaskInit and SolverTaskCleanup are passed the arguments that were passed to BatchTaskRun. A simple SolverTaskInit proc could be
proc SolverTaskInit { args } {
   global mif basename outtextfile
   set A [lindex $args 0]
   set outbasename "$basename-A$A"
   $mif SetA $A
   $mif SetOutBaseName $outbasename
   set outtextfile [open "$outbasename.odt" "a+"]
   puts $outtextfile [GetTextData header \
      "Run on $basename.mif, with A=[$mif GetA]"]
}
This proc receives the exchange constant A for this task on the argument list, and makes use of the global variables mif and basename. (Both should be initialized in the slave initialization script outside the SolverTaskInit proc.) It then stores the requested value of A in the mif object, sets up the base filename to use for output, and opens a text file to which tabular data will be appended. The handle to this text file is stored in the global outtextfile, which is closed by the default SolverTaskCleanup proc. A corresponding task script could be
$TaskInfo AppendTask "A=13e-12 J/m" "BatchTaskRun 13e-12"
which runs a simulation with A set to 13e-12 J/m. This example is taken from the multitask.tcl sample script. (For commands accepted by mif objects, see the file mmsinit.cc. Another object that can be gainfully manipulated is solver, which is defined in solver.tcl.)
If you want to run more than one task at a time, then the $TaskInfo method AppendSlave will have to be invoked. This takes the form
$TaskInfo AppendSlave <spawn count> <spawn command>
where <spawn command> is the command to launch the slave process, and <spawn count> is the number of slaves to launch with this command. (Typically <spawn count> should not be larger than the number of processors on the target system.) The default value for this item (which gets overwritten with the first call to $TaskInfo AppendSlave) is
1 {Oc_Application Exec batchslave -tk 0 %connect_info batchsolve.tcl}
The Tcl command Oc_Application Exec is supplied by OOMMF and provides access to the same application-launching capability that is used by the OOMMF bootstrap application. Using a <spawn command> of Oc_Application Exec instead of exec %tclsh %oommf saves the spawning of an additional process. The default <spawn command> launches the batchslave application, with connection information provided by batchmaster, and using the auxscript batchsolve.tcl.
Before evaluating the <spawn command>, batchmaster applies several percent-style substitutions useful in slave launch scripts: %tclsh, %oommf, %connect_info, %oommf_root, and %%. The first is the Tcl shell to use, the second is an absolute path to the OOMMF bootstrap program on the master machine, the third is connection information needed by the batchslave application, the fourth is the path to the OOMMF root directory on the master machine, and the last is interpreted as a single percent. batchmaster automatically appends the argument & to the <spawn command> so that the slave applications are launched in the background.
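As a rough illustration of this substitution step, the following stand-alone Tcl sketch expands the percent variables with string map. It is not batchmaster's own code: the proc name ExpandSpawnCommand and all paths and connection values are hypothetical placeholders chosen for the example.

```tcl
# Hypothetical stand-in for the percent substitution step; batchmaster's
# own code may differ. All values passed in are placeholders.
proc ExpandSpawnCommand {cmd tclsh oommf connect_info oommf_root} {
    set map [list \
        %tclsh        $tclsh \
        %oommf_root   $oommf_root \
        %oommf        $oommf \
        %connect_info $connect_info \
        %%            %]
    # string map scans left to right and, at each position, uses the first
    # key in the list that matches, so %oommf_root must precede %oommf.
    set expanded [string map $map $cmd]
    # Append & so that the slave is launched in the background
    return "$expanded &"
}

set cmd {exec %tclsh %oommf batchslave -tk 0 %connect_info batchsolve.tcl}
puts [ExpandSpawnCommand $cmd /usr/bin/tclsh /opt/oommf/oommf.tcl \
        "-- localhost 15136 0 secret" /opt/oommf]
```

Because string map makes a single pass over the input, a doubled %% in the spawn command yields one literal percent sign that is not itself expanded again.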
To launch batchslave on a remote host, use ssh in the spawn command, e.g.,
$TaskInfo AppendSlave 1 {exec ssh foo tclsh oommf/oommf.tcl \
   batchslave -tk 0 %connect_info batchsolve.tcl}
This example assumes tclsh is in the execution path on the remote machine foo, and OOMMF is installed off of your home directory. In addition, you will have to add the machine foo to the host connect list with
$TaskInfo ModifyHostList +foo
and batchmaster must be run with the network interface specified as the server host (instead of the default localhost), e.g.,
tclsh oommf.tcl batchmaster multitask.tcl bar
where bar is the name of the local machine.
This may seem a bit complicated, but the examples in the next section should make things clearer.
The first sample task script is a simple example that runs the 3 micromagnetic simulations described by the MIF 1.x files taskA.mif, taskB.mif and taskC.mif. It is launched with the command
tclsh oommf.tcl batchmaster simpletask.tcl
This example uses the default slave launch script, so a single slave is launched on the current machine, and the 3 simulations will be run sequentially. Also, no slave initialization script is given, so the default procs in batchsolve.tcl are used. Output will be magnetization states and tabular data at each control point, stored in files on the local machine with base names as specified in the MIF files.
# FILE: simpletask.tcl
#
# This is a sample batch task file.  Usage example:
#
#   tclsh oommf.tcl batchmaster simpletask.tcl
#
# Form task list
$TaskInfo AppendTask A "BatchTaskRun taskA.mif"
$TaskInfo AppendTask B "BatchTaskRun taskB.mif"
$TaskInfo AppendTask C "BatchTaskRun taskC.mif"
The second sample task script builds on the previous example by defining BatchTaskIterationCallback and BatchTaskRelaxCallback procedures in the slave init script. The first is set up to write tabular data every 10 iterations, while the second writes tabular data on each control point event. The data is written to the output file specified by the Base Output Filename entry in the input MIF files. Note that there is no magnetization vector field output in this example. This task script is launched the same way as the previous example:
tclsh oommf.tcl batchmaster octrltask.tcl
# FILE: octrltask.tcl
#
# This is a sample batch task file, with expanded output control.
# Usage example:
#
#   tclsh oommf.tcl batchmaster octrltask.tcl
#
# "Every" output selection count
set SKIP_COUNT 10

# Initialize solver. This is run at global scope
set init_script {
   # Text output routine
   proc MyTextOutput {} {
      global outtextfile
      puts $outtextfile [GetTextData data]
      flush $outtextfile
   }
   # Change control point output
   proc BatchTaskRelaxCallback {} {
      MyTextOutput
   }
   # Add output on iteration events
   proc BatchTaskIterationCallback {} {
      global solver
      set count [$solver GetODEStepCount]
      if { ($count % __SKIP_COUNT__) == 0 } { MyTextOutput }
   }
}

# Substitute $SKIP_COUNT in for __SKIP_COUNT__ in above "init_script"
regsub -all -- __SKIP_COUNT__ $init_script $SKIP_COUNT init_script
$TaskInfo SetSlaveInitScript $init_script

# Form task list
$TaskInfo AppendTask A "BatchTaskRun taskA.mif"
$TaskInfo AppendTask B "BatchTaskRun taskB.mif"
$TaskInfo AppendTask C "BatchTaskRun taskC.mif"
The third task script is a more complicated example running concurrent processes on two machines. This script should be run with the command
tclsh oommf.tcl batchmaster multitask.tcl bar
where bar is the name of the local machine.
Near the top of the multitask.tcl script several Tcl variables (RMT_MACHINE through A_list) are defined; these are used farther down in the script. The remote machine is specified as foo, which is used in the $TaskInfo AppendSlave and $TaskInfo ModifyHostList commands.
There are two AppendSlave commands, one to run two slaves on the local machine, and one to run a single slave on the remote machine (foo). The latter changes to a specified working directory before launching the batchslave application on the remote machine. (For this to work you must have ssh configured properly.)
Below this the slave initialization script is defined. The Tcl regsub command is used to place the task script defined value of BASEMIF into the init script template. The init script is run on the slave when the slave is first brought up. It first reads the base MIF file into a newly created mms_mif instance. (The MIF file needs to be accessible by the slave process, irrespective of which machine it is running on.) Then replacement SolverTaskInit and SolverTaskCleanup procs are defined. The new SolverTaskInit interprets its first argument as a value for the exchange constant A. Note that this is different from the default SolverTaskInit proc, which interprets its first argument as the name of a MIF 1.x file to load. With this task script, a MIF file is read once when the slave is brought up, and then each task redefines only the value of A for the simulation (and corresponding changes to the output filenames and data table header).
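The placeholder technique can be tried in isolation with the stand-alone sketch below. Because the init script is written inside braces, Tcl performs no variable substitution on it in the task script, so the value is spliced in as text with regsub before the script is handed over. The proc ShowBase and the use of a child interpreter as a stand-in for the slave are assumptions made for this illustration only; they are not part of OOMMF.

```tcl
# Value defined on the task script (master) side
set BASEMIF taskA

# Template: braces prevent substitution here, so __BASEMIF__ is a
# plain-text placeholder at this point.
set init_script {
    set basename __BASEMIF__
    # ShowBase is a hypothetical proc used only to inspect the result
    proc ShowBase {} {
        global basename
        return "base is $basename"
    }
}

# Replace every occurrence of the placeholder with the master-side value
regsub -all -- __BASEMIF__ $init_script $BASEMIF init_script

# Stand-in for the slave: evaluate the script at global scope in a
# separate interpreter that knows nothing about BASEMIF.
set slave [interp create]
$slave eval $init_script
puts [$slave eval ShowBase]
```

Note that regsub treats & and backslash specially in the replacement string, so a value containing those characters would need escaping before being substituted this way.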
Finally, the Tcl loop structure
foreach A $A_list {
   $TaskInfo AppendTask "A=$A" "BatchTaskRun $A"
}
is used to build up a task list consisting of one task for each value of A in A_list (defined at the top of the task script). For example, the first value of A is 10e-13, so the first task will have the label A=10e-13 and the corresponding script is BatchTaskRun 10e-13. The value 10e-13 is passed on by BatchTaskRun to the SolverTaskInit proc, which has been redefined to process this argument as the value for A, as described above.
There are 6 tasks in all, and 3 slave processes, so the first three tasks will run concurrently in the 3 slaves. As each slave finishes it will be given the next task, until all the tasks are complete.
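To check what a task list will contain before running it under batchmaster, the loop can be dry-run in a plain tclsh against a stub. The sketch below assumes a hypothetical command StubTaskInfo that stands in for the real BatchTaskObj instance; it records AppendTask calls and ignores every other method, so it says nothing about how batchmaster itself stores tasks.

```tcl
# Hypothetical stand-in for the BatchTaskObj instance. It only records
# AppendTask calls; all other methods are silently ignored.
set recorded_tasks {}
proc StubTaskInfo {method args} {
    global recorded_tasks
    if {$method eq "AppendTask"} {
        lassign $args label script
        lappend recorded_tasks [list $label $script]
    }
}
# Task scripts invoke methods as "$TaskInfo <method> ...", so pointing
# the variable at the stub command is enough for a dry run.
set TaskInfo StubTaskInfo

set A_list { 10e-13 10e-14 10e-15 10e-16 10e-17 10e-18 }
foreach A $A_list {
    $TaskInfo AppendTask "A=$A" "BatchTaskRun $A"
}

# List the label and script of each recorded task
foreach task $recorded_tasks {
    lassign $task label script
    puts "$label: $script"
}
```

The dry run lists six label and script pairs, one per entry in A_list, in the order in which they would be handed to the slaves.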
# FILE: multitask.tcl
#
# This is a sample batch task file.  Usage example:
#
#   tclsh oommf.tcl batchmaster multitask.tcl hostname [port]
#
# Task script configuration
set RMT_MACHINE   foo
set RMT_TCLSH     tclsh
set RMT_OOMMF     "/path/to/oommf/oommf.tcl"
set RMT_WORK_DIR  "/path/to/oommf/app/mmsolve/data"
set BASEMIF taskA
set A_list { 10e-13 10e-14 10e-15 10e-16 10e-17 10e-18 }

# Slave launch commands
$TaskInfo ModifyHostList +$RMT_MACHINE
$TaskInfo AppendSlave 2 "exec %tclsh %oommf batchslave -tk 0 \
   %connect_info batchsolve.tcl"
$TaskInfo AppendSlave 1 "exec ssh $RMT_MACHINE \
   cd $RMT_WORK_DIR \\\;\
   $RMT_TCLSH $RMT_OOMMF batchslave -tk 0 %connect_info batchsolve.tcl"

# Slave initialization script (with batchsolve.tcl proc
# redefinitions)
set init_script {
   # Initialize solver. This is run at global scope
   set basename __BASEMIF__      ;# Base mif filename (global)
   mms_mif New mif
   $mif Read [FindFile ${basename}.mif]
   # Redefine TaskInit and TaskCleanup proc's
   proc SolverTaskInit { args } {
      global mif outtextfile basename
      set A [lindex $args 0]
      set outbasename "$basename-A$A"
      $mif SetA $A
      $mif SetOutBaseName $outbasename
      set outtextfile [open "$outbasename.odt" "a+"]
      puts $outtextfile [GetTextData header \
         "Run on $basename.mif, with A=[$mif GetA]"]
      flush $outtextfile
   }
   proc SolverTaskCleanup { args } {
      global outtextfile
      close $outtextfile
   }
}
# Substitute $BASEMIF in for __BASEMIF__ in above script
regsub -all -- __BASEMIF__ $init_script $BASEMIF init_script
$TaskInfo SetSlaveInitScript $init_script

# Create task list
foreach A $A_list {
   $TaskInfo AppendTask "A=$A" "BatchTaskRun $A"
}