Condor High Throughput Computing System is a software framework for managing workload on a cluster of computers. It is a batch system that queues, distributes and runs software jobs on a set of dedicated computers, e.g. your idle desktop PC or some nodes of our cluster.
Condor runs on Linux, Solaris, BSD, Windows and others. The latest version is 7.0.1. Condor is free software; recent versions are released under the Apache License. The source code is open and can be downloaded from the project homepage.
Installation
Using apt-get
There is a condor Debian package in the repository that one can install using apt-get. Together with condor one also has to install boinc-client, which is used for backfill. For the local condor and boinc configuration please install
condor-config.
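The installation on a node then boils down to the following apt-get calls (a minimal sketch, assuming the package names above are available in our local repository):

apt-get update
apt-get install condor boinc-client condor-config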
Hints for building the debian package from source
A successful build requires:
- gcc-3.4 (for building glibc)
- libncurses-dev
- flex
- bison-1.35
- csh
- a plain make, i.e. do not use make -j N
make public generates the tar.gz installation trees:
condor-7.0.1-linux-x86_64-debian40-dynamic-unstripped.tar.gz
condor-7.0.1-linux-x86_64-debian40-dynamic.tar.gz
condor-7.0.1-linux-x86_64-debian40.tar.gz
condordebugsyms-7.0.1-linux-x86_64-debian40-dynamic.tar.gz
From these, one can install condor and build a deb package.
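A hedged sketch of installing directly from one of the tarballs, using the condor_install script shipped inside the release tree (the unpack directory name and the exact options are assumptions and may differ between Condor versions; the target directories match the configuration below):

tar -xzf condor-7.0.1-linux-x86_64-debian40-dynamic.tar.gz
cd condor-7.0.1
./condor_install --install-dir=/opt/condor --local-dir=/local/condor.$(hostname) --type=execute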
Configuration
The global configuration file is
/opt/condor/etc/condor_config
and it relies on the local configuration file
/etc/default/condor
The local config file decides which parameters are passed to the main config file. All machine types share a common part, which comes first.
Common part
HOSTNAME=$(hostname)
RELEASE_DIR=/opt/condor
LOCAL_DIR=/local/condor.$(hostname)
CONDOR_ADMIN=root@localhost
MAIL=/usr/bin/mail
COLLECTOR_NAME=MPI-GRAPHY-AEI-Hannover
LOCAL_HOST=$(hostname)
cat <<EOF
Condor Server
cat <<EOF
We use four head nodes h1, h2, h3 and h4 in a HAD (High Availability Daemon) configuration, so that the crucial services for the whole cluster do not depend on a single machine. The head nodes are also the machines from which you are allowed to submit Condor jobs.
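A minimal sketch of the HAD-related settings on the head nodes, following the pattern from the Condor manual; the host names h1-h4 are ours, but the port numbers, the timeout and the exact daemon list are illustrative assumptions, not a verbatim copy of our configuration:

cat <<EOF
CONDOR_HOST = h1,h2,h3,h4
HAD_PORT = 51450
HAD_LIST = h1:\$(HAD_PORT),h2:\$(HAD_PORT),h3:\$(HAD_PORT),h4:\$(HAD_PORT)
REPLICATION_PORT = 41450
REPLICATION_LIST = h1:\$(REPLICATION_PORT),h2:\$(REPLICATION_PORT),h3:\$(REPLICATION_PORT),h4:\$(REPLICATION_PORT)
HAD_USE_PRIMARY = TRUE
HAD_CONNECTION_TIMEOUT = 2
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, HAD, REPLICATION
EOF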
Condor Execution Nodes
cat <<EOF
START_BACKFILL=\$(StateTimer)>(1*\$(MINUTE))
EVICT_BACKFILL=\$(MachineBusy)
BOINC_HOME=/local/boinc
BOINC_Executable=/opt/boinc/bin/boinc_client
BOINC_Universe=vanilla
BOINC_InitialDir=\$(BOINC_HOME)
BOINC_Output=\$(BOINC_HOME)/boinc.out
BOINC_Error=\$(BOINC_HOME)/boinc.err
BOINC_Owner=boinc
EOF
The real compute units, the execute machines, are the nodes n*. For the concepts of checkpointing and backfill, please read the next two sections.
Checkpoint Server
- What is a checkpoint server?
A Condor pool can also be configured with one or more checkpoint servers that serve as a repository for checkpoints; this holds for Linux servers only. In case a job is evicted, the checkpoint server allows Condor to migrate the job to another machine and continue computing from the last checkpoint.
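Checkpointing only works for standard universe jobs, i.e. the program is relinked with condor_compile and submitted with universe = standard. A minimal sketch of such a submission from a head node, with file names chosen purely for illustration:

condor_compile gcc -o ckpt_test ckpt_test.c

cat <<EOF > ckpt_test.sub
universe   = standard
executable = ckpt_test
output     = ckpt_test.out
error      = ckpt_test.err
log        = ckpt_test.log
queue
EOF

condor_submit ckpt_test.sub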
A test performed using the simple C program below
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define DATASIZE (1024*1024*128)   /* 128M ints = 512 MB */
#define MINUTE   (60*1000*1000)    /* one minute in microseconds */

int main(int argc, char* argv[]) {
    int i;
    int *d;

    /* allocate a zero-filled 512 MB block */
    d = (int*)calloc(DATASIZE, sizeof(int));
    if (d == NULL) exit(1);

    int s = sizeof(int) * DATASIZE;
    printf("calloc(%d,%d) => %d\n", DATASIZE, (int)sizeof(int), s);

    /* touch only a handful of words, the rest stays zero */
    for (i = 24; i < 1048; i++) {
        d[i] = 1;
    }

    usleep(3 * MINUTE);   /* sleep 3 minutes, long enough to get checkpointed */
    free(d);
    fprintf(stderr, "%d free now\n", s);
    return 0;
}
shows that the memory is copied 1:1 to the checkpoint server, leading to large streams of zeros crossing the network. There is no compression at the moment, except for the data stored on the checkpoint server once the zeros have already crossed the network. For that reason we set up a checkpoint server on every compute node and force each machine to use its own.
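A minimal sketch of how such a node-local checkpoint server can be configured; the macro names are the standard Condor checkpoint-server knobs, but the spool directory and the exact values are assumptions rather than a verbatim copy of our config:

cat <<EOF
DAEMON_LIST = \$(DAEMON_LIST), CKPT_SERVER
CKPT_SERVER_DIR = /local/ckptsrv
CKPT_SERVER_HOST = \$(FULL_HOSTNAME)
USE_CKPT_SERVER = TRUE
STARTER_CHOOSES_CKPT_SERVER = TRUE
EOF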
Backfill
Condor can be configured to run backfill jobs whenever the condor_startd has no other work to perform. Currently, the only supported backfill environment is BOINC.
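Backfill also has to be switched on explicitly; a minimal sketch of the knobs involved, following the example from the Condor manual (the START_BACKFILL / EVICT_BACKFILL expressions used on our execute nodes are shown further above):

cat <<EOF
ENABLE_BACKFILL = TRUE
BACKFILL_SYSTEM = BOINC
# START_BACKFILL, EVICT_BACKFILL and the BOINC_* settings as in the execute node snippet
EOF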
Hints
DNS
It seems that it is not enough for Condor if you can merely ping every single machine in your pool by name. For that reason, all hosts should be listed in /etc/hosts. On the head nodes you should list all node names and IPs. On the compute nodes it is sufficient to have all head node names and the node's own name.
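An illustrative /etc/hosts fragment for a compute node; the IP addresses and the node name n042 are made up, only the naming scheme follows the h* / n* convention above:

cat <<EOF >> /etc/hosts
10.0.0.1   h1
10.0.0.2   h2
10.0.0.3   h3
10.0.0.4   h4
10.0.1.42  n042
EOF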
want_remote_io = True | False
This option controls how a file is opened and manipulated in a standard universe job. If this option is true, which is the default, then the condor_shadow makes all decisions about how each and every file should be opened by the executing job. This entails a network round trip (or more) from the job to the condor_shadow and back again for every single open(), in addition to other needed information about the file.

If set to false, then when the job queries the condor_shadow for the first time about how to open a file, the condor_shadow will inform the job to automatically perform all of its file manipulation on the local file system on the execute machine, and any file remapping will be ignored. This means that there must be a shared file system (such as NFS or AFS) between the execute machine and the submit machine and that ALL paths that the job could open on the execute machine must be valid. The ability of the standard universe job to checkpoint, possibly to a checkpoint server, is not affected by this attribute. However, when the job resumes it will be expecting the same file system conditions that were present when the job checkpointed.
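A hedged submit-file sketch with want_remote_io disabled for a standard universe job whose files live on a shared file system; the program and path names are purely illustrative:

cat <<EOF > nfs_job.sub
universe       = standard
executable     = my_program
input          = /home/user/data/input.dat
output         = nfs_job.out
error          = nfs_job.err
log            = nfs_job.log
want_remote_io = False
queue
EOF

condor_submit nfs_job.sub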