How to check the state of your jobs
High level overview via web browser
If you want to monitor whether and how your jobs progress, we offer a poor man's high-level overview named
CondorWatch. It lists all jobs currently known to the system, with submit hosts as columns and user/accounting tags as rows.
Each non-empty field shows three numbers: the first is how many jobs are currently running, the second is how many jobs encountered an error and are now awaiting further handling by the user (see below), and the third is how many jobs are currently waiting to be scheduled to a free resource. If jobs require multiple CPU cores, the total number of CPU cores is shown in parentheses after each number.
command line interface on the submit host
quick overview
For a rough overview, the command
condor_q
prints a summary of your own jobs; your user name is an implicit argument to
condor_q
.
more details
For more details, run
condor_q -nobatch
As with all command line options for condor executables, you can abbreviate an option name as long as the abbreviation stays unique, e.g.
condor_q -nob
will still work but
condor_q -no
will not.
where are my jobs running?
To list the machine each job is running on, the option
-run
may be used, e.g.
condor_q -nob -run
[...]
1517656.13 username 5/29 13:12 0+00:19:44 slot1_2@a6705.atlas.local
[..]
means that job number
.13
from cluster
1517656
was submitted on May 29th at 13:12 UTC. When
condor_q
was issued, the job had accumulated a wall clock run time of 19 minutes and 44 seconds and was running on node
a6705
.
One could now log in to that node with
ssh a6705
and inspect the system state with
htop
or other tools.
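When many jobs are listed, the node name can also be picked out of the output by script. The following is a minimal sketch in POSIX shell, not part of the condor tools, that splits one output line into its fields. It assumes the line has exactly the whitespace-separated layout of the example above (job ID first, run time fifth, slot and host sixth); all variable names are purely illustrative.

```shell
# Sample line copied from the example output above. In practice it would come
# from `condor_q -nob -run`, assuming the column layout stays the same.
line='1517656.13 username 5/29 13:12 0+00:19:44 slot1_2@a6705.atlas.local'

# Split the line on whitespace into the positional parameters $1..$6.
set -- $line

job_id=$1                # cluster.process, here 1517656.13
cluster=${job_id%%.*}    # everything before the first dot
proc=${job_id#*.}        # everything after the first dot
run_time=$5              # days+HH:MM:SS
slot_host=$6             # slot@fully.qualified.host

host=${slot_host#*@}     # drop the slot name: a6705.atlas.local
host=${host%%.*}         # drop the domain:    a6705

echo "cluster=$cluster proc=$proc run_time=$run_time host=$host"
```

For the sample line this prints `cluster=1517656 proc=13 run_time=0+00:19:44 host=a6705`. The sketch relies on the shell's word splitting, so it would need adjusting if a field ever contained spaces.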
how to debug failed jobs?
A job that failed (typically meaning it exited with a non-zero exit code) will usually be placed into the "hold" state. One can see the error messages by running
condor_q -nob -hold
[..]
45344094.0 user name 5/28 11:27 Error from slot1_14@a1821.atlas.local: Job has gone over memory limit of 8064 megabytes. Peak usage: 8048 megabytes.
In this case, the job requested about 8 GB of memory, but its usage reached or exceeded that limit. The slight discrepancy between the two numbers may arise from differences in how they are measured or computed.
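To compare the two numbers for many held jobs, they can be pulled out of the message text. This is a rough sketch that assumes the hold message has exactly the wording of the example above; the variable names are hypothetical, and other hold reasons will simply yield empty values.

```shell
# Hold message copied from the example above (assumed wording).
reason='Error from slot1_14@a1821.atlas.local: Job has gone over memory limit of 8064 megabytes. Peak usage: 8048 megabytes.'

# Extract the number following each key phrase; sed prints nothing if the
# phrase is absent, leaving the variable empty.
limit_mb=$(printf '%s\n' "$reason" | sed -n 's/.*memory limit of \([0-9][0-9]*\) megabytes.*/\1/p')
peak_mb=$(printf '%s\n' "$reason" | sed -n 's/.*Peak usage: \([0-9][0-9]*\) megabytes.*/\1/p')

# Only report when both numbers were found.
if [ -n "$limit_mb" ] && [ -n "$peak_mb" ]; then
    diff_mb=$((limit_mb - peak_mb))
    echo "limit=${limit_mb} MB peak=${peak_mb} MB difference=${diff_mb} MB"
fi
```

For the sample message this prints `limit=8064 MB peak=8048 MB difference=16 MB`, which makes the small gap between the reported limit and the reported peak explicit.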