How to check the state of your jobs

High level overview via web browser

If you want to monitor whether and how your jobs progress, we have a poor man's high level overview named CondorWatch available. Here, you can find all jobs currently known to the system, with submit hosts as columns and user/accounting tags as rows.

Each non-empty field shows three pieces of information: how many jobs are currently running, how many jobs encountered an error and are now held awaiting further handling by the user (see below), and how many jobs are still waiting to be scheduled onto a free resource. For jobs requiring multiple CPU cores, the total number of CPU cores is shown in parentheses after each number.

Command line interface on the submit host

Quick overview

For a rough overview, the command condor_q prints a summary for your user; your user name is used as an implicit argument to condor_q.

More details

If you want more details, you can get them by running
condor_q -nobatch

As with all command line options of the condor executables, you can abbreviate an option as long as it remains unique, e.g. condor_q -nob will still work but condor_q -no will not.
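Since condor_q -nobatch prints one plain-text line per job, its output can be post-processed with standard shell tools. A minimal sketch with made-up sample lines, assuming the usual column layout (job ID, owner, submit date/time, run time, state, priority, size, command), counting jobs per state:

```shell
# Count jobs per state from hypothetical `condor_q -nobatch` output lines;
# the 6th column is the job state: R = running, I = idle, H = held.
printf '%s\n' \
  '1517656.13  username  5/29 13:12  0+00:19:44 R  0  97.7 run.sh' \
  '1517656.14  username  5/29 13:12  0+00:00:00 I  0   0.0 run.sh' \
  '45344094.0  username  5/28 11:27  0+00:05:02 H  0  48.8 run.sh' |
  awk '{count[$6]++} END {for (s in count) print s, count[s]}' | sort
```

In practice one would pipe the real condor_q -nobatch output (minus its header and summary lines) into the awk stage instead of the printf sample.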

Where are my jobs running?

To list the machine each job is running on, use the option -run, e.g.
condor_q -nob -run
[...]
1517656.13  username          5/29 13:12   0+00:19:44 slot1_2@a6705.atlas.local
[..]
means that job 13 of cluster 1517656 was submitted on May 29th at 13:12 UTC. When condor_q was run, the job had accumulated a wall clock run time of 19 minutes and 44 seconds and was running on node a6705.

One could now log into that node with ssh a6705 and inspect the system state with htop or other tools.
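The node name can also be extracted mechanically from the condor_q -run output, e.g. to build the ssh target in a script. A small sketch using the sample line shown above; the last column holds the slot@host field:

```shell
# The last column of a `condor_q -run` line is slot@host; strip the slot
# prefix and the domain suffix to obtain the bare node name.
line='1517656.13  username          5/29 13:12   0+00:19:44 slot1_2@a6705.atlas.local'
host=$(echo "$line" | awk '{print $NF}' | cut -d@ -f2 | cut -d. -f1)
echo "$host"   # the node one would ssh into
```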

How to debug failed jobs?

A job which failed (usually meaning it exited with a non-zero exit code) will typically be placed in the "hold" state. The error messages can be displayed by running
condor_q -nob -hold
[..]
45344094.0  user name     5/28 11:27 Error from slot1_14@a1821.atlas.local: Job has gone over memory limit of 8064 megabytes. Peak usage: 8048 megabytes.

In this case, the job requested about 8 GB of memory and ran into that limit. The slight discrepancy between the two numbers may arise from the different ways in which they are measured.
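When scanning many held jobs at once, the limit and peak values can be pulled out of the hold reason with standard tools. A sketch using the sample message above (after raising the memory request, e.g. with condor_qedit, a held job can typically be put back into the queue with condor_release):

```shell
# Extract the memory limit and peak usage (in MB) from a hold reason string
reason='Error from slot1_14@a1821.atlas.local: Job has gone over memory limit of 8064 megabytes. Peak usage: 8048 megabytes.'
limit=$(echo "$reason" | grep -oE 'memory limit of [0-9]+' | grep -oE '[0-9]+')
peak=$(echo "$reason"  | grep -oE 'Peak usage: [0-9]+'    | grep -oE '[0-9]+')
echo "limit=${limit}MB peak=${peak}MB"
```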


Topic revision: r4 - 29 May 2019, CarstenAulbert