Brief Condor dagman HowTo
This small article will not go into any depth about what DAGs are, nor will it explain the terminology of Condor's dagman; for that, please consult the Condor manual (chapter 2.10).
Imagine having an application which is really I/O intensive, such that only, say, 10 jobs may run at any time. In the past you may have submitted all your jobs in the "Hold" status and released them slowly, either manually or via a script. But now you deem this too tedious and want to let Condor do that for you.
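For reference, that manual approach might have looked roughly like this (a sketch only; the cluster id 123 is made up):
# in the submit file: place every job on hold at submission
hold = True

# later, release jobs a few at a time by hand
condor_release 123.0
condor_release 123.1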
Take this submit file (I called it sleep.submit):
Executable = /bin/sleep
Arguments = $(length)
Error = /dev/null
Output = /dev/null
Log = /local/user/carsten/dag.log
Notification = NEVER
Universe = vanilla
queue
which does nothing other than sleep for the number of seconds given by the variable length (but you may put your own executable in there).
Now, DAGs are really powerful, but in our case there is no relation between the jobs other than that you want them to run. You need to create another file which looks like this (I called it sleep.dag, surprise!):
JOB SLEEP_01 sleep.submit
JOB SLEEP_02 sleep.submit
JOB SLEEP_03 sleep.submit
JOB SLEEP_04 sleep.submit
JOB SLEEP_05 sleep.submit
VARS SLEEP_01 length="14"
VARS SLEEP_02 length="24"
VARS SLEEP_03 length="43"
VARS SLEEP_04 length="11"
VARS SLEEP_05 length="52"
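(As an aside: if the jobs did depend on each other, a single extra line per dependency would express that in the DAG file, e.g.
# run SLEEP_02 only after SLEEP_01 has finished successfully
PARENT SLEEP_01 CHILD SLEEP_02
but we do not need any of that here.)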
What does it do? Condor's dagman will submit 5 jobs (internally named SLEEP_01, ..., SLEEP_05) and will pass the variable length into the submit file sleep.submit from above. To generate this file I used a simple bash script:
#!/bin/bash
NUMBER=${1:-50}
OUTPUT=${2:-sleep.dag}
# start with an empty file
> "$OUTPUT"
# first write the jobs part
for i in $(seq -w "$NUMBER"); do
  echo "JOB SLEEP_$i sleep.submit" >> "$OUTPUT"
done
echo >> "$OUTPUT"
# and then write the variables part
for i in $(seq -w "$NUMBER"); do
  # pick a random sleep length between 0 and 119 seconds
  tmp=$(( RANDOM % 120 ))
  echo "VARS SLEEP_$i length=\"$tmp\"" >> "$OUTPUT"
done
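Assuming you saved it as, say, gen_dag.sh (the name is arbitrary), a DAG with 5 jobs much like the one above (the random lengths will of course differ) could be generated with:
chmod +x gen_dag.sh
./gen_dag.sh 5 sleep.dag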
Of course, your script will vary.
Finally, you can submit this by running
condor_submit_dag -maxjobs 10 -maxidle 30 sleep.dag
where -maxjobs tells Condor to run at most 10 jobs concurrently on the cluster, and -maxidle allows at most 30 jobs to be idle in Condor's batch queue at any time.
But that's just the beginning: since Condor supports PRE as well as POST scripts, which are run before and after a job, one could use a PRE script to ensure that data is copied to the local node's hard disk, so the job itself does not need to do this. On submitting the DAG you could then specify -maxpre n to make sure that only n of these very I/O intensive PRE scripts run at the same time, while the number of running jobs could remain unlimited.
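In the DAG file this could look as follows (a sketch; stage_in.sh and clean_up.sh are hypothetical helper scripts):
JOB SLEEP_01 sleep.submit
SCRIPT PRE SLEEP_01 stage_in.sh
SCRIPT POST SLEEP_01 clean_up.sh
which would then be submitted with, e.g., condor_submit_dag -maxpre 5 sleep.dag.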
Use it!
Finally, in the log file Condor will tell you, for free, the current status of your jobs, and it can also keep track of which jobs failed (and restart them automatically if asked to do so)...
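To keep an eye on a running DAG, something like the following works (the log path is the one from the submit file above):
# follow the shared job log
tail -f /local/user/carsten/dag.log

# or list the DAG and its jobs in the queue
condor_q -dag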
--
CarstenAulbert - 08 Apr 2010