Brief Condor dagman HowTo

This short article will not go into any depth about what DAGs are, nor will it explain the terminology of Condor's dagman; for that, please consult the Condor manual (chapter 2.10).

Imagine having an application which is really I/O intensive, so that only, say, 10 jobs may run at any time. In the past you may have submitted all your jobs in the "Hold" status and released them slowly, either manually or via a script. If you now deem this too tedious, you can let Condor do it for you.

Take this submit file (call it sleep.submit):
Executable     = /bin/sleep
Arguments      = $(length)

Error   = /dev/null
Output  = /dev/null
Log     = /local/user/carsten/dag.log
Notification = NEVER

Universe = vanilla
queue

which does nothing else than sleep for a number of seconds given by the variable length (but you may put your own executable in there).

Now, DAGs are really powerful, but in our case there is no relation between the jobs other than that you want them to run. You need to create another file which looks like this (I called it sleep.dag, surprise!):

JOB SLEEP_01 sleep.submit
JOB SLEEP_02 sleep.submit
JOB SLEEP_03 sleep.submit
JOB SLEEP_04 sleep.submit
JOB SLEEP_05 sleep.submit

VARS SLEEP_01 length="14"
VARS SLEEP_02 length="24"
VARS SLEEP_03 length="43"
VARS SLEEP_04 length="11"
VARS SLEEP_05 length="52"

What does it do? Condor's dagman will submit 5 jobs (internally named SLEEP_01, ..., SLEEP_05) and will pass the variable length to the submit file sleep.submit from above. To generate this file I used a simple bash script:

#!/bin/bash

# number of jobs and output file name, with defaults
NUMBER=${1:-50}
OUTPUT=${2:-sleep.dag}

# truncate the output file
: > "$OUTPUT"

# first write the JOB part
for i in $(seq -w "$NUMBER"); do
    echo "JOB SLEEP_$i sleep.submit" >> "$OUTPUT"
done

echo >> "$OUTPUT"

# and then the VARS part with random sleep lengths (0-119 seconds)
for i in $(seq -w "$NUMBER"); do
    tmp=$(( RANDOM % 120 ))
    echo "VARS SLEEP_$i length=\"$tmp\"" >> "$OUTPUT"
done

Of course, your script will vary. ;-)
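For illustration, with NUMBER=2 the generated sleep.dag would look along these lines (the sleep lengths are random, so yours will differ):

JOB SLEEP_1 sleep.submit
JOB SLEEP_2 sleep.submit

VARS SLEEP_1 length="37"
VARS SLEEP_2 length="98"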

Finally, you can submit this by running
condor_submit_dag -maxjobs 10 -maxidle 30 sleep.dag

where -maxjobs tells Condor's dagman to have at most 10 jobs submitted to the cluster at any time, and -maxidle makes it stop submitting further jobs while 30 of them are sitting idle in Condor's batch queue.
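Such throttles can also be placed in the DAG file itself: dagman lets you assign jobs to a category and limit how many jobs of that category run at once. A small sketch (the category name IO_HEAVY is made up for this example):

CATEGORY SLEEP_01 IO_HEAVY
CATEGORY SLEEP_02 IO_HEAVY
MAXJOBS IO_HEAVY 10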

But that's just the beginning: Condor also supports PRE and POST scripts, run before and after a job. A PRE script could, for instance, ensure that data is copied to a node's local hard disk, so that the job itself does not need to do this. On submitting the DAG you could then specify -maxpre n to make sure that only n very I/O intensive PRE scripts run at the same time, while the number of running jobs itself could remain unlimited.
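In the DAG file this would look something like the following, where stage_in.sh and clean_up.sh are hypothetical helper scripts of your own:

SCRIPT PRE  SLEEP_01 stage_in.sh
SCRIPT POST SLEEP_01 clean_up.sh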

Use it! :-)

Finally, in the log file Condor will tell you - for free - the current status of your jobs, and it can also keep track of which jobs failed (and restart them automatically if asked to do so)...
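Should some jobs fail permanently, dagman writes a rescue DAG (a file named like sleep.dag.rescue001, depending on your Condor version) recording which jobs already finished. Simply running the same command again will, on reasonably recent Condor versions, pick up the rescue DAG and re-run only the failed jobs:

condor_submit_dag -maxjobs 10 -maxidle 30 sleep.dag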

-- CarstenAulbert - 08 Apr 2010

Topic revision: r2 - 10 Feb 2012, ArthurVarkentin