
Condor

Condor High Throughput Computing System is a software framework for managing workload on a cluster of computers. It is a batch system that distributes, queues and runs software jobs on a set of dedicated computers, e.g. your idle desktop PC or some nodes of our cluster.

Condor runs on Linux, Solaris, BSD, Windows and more. The latest version is 7.0.1. Condor is free software; recent versions are released under the Apache License. The source code is open and can be downloaded from the project homepage.


Installation

Using apt-get

There is a Condor Debian package in the repository that one can install using apt-get. Along with condor one also has to install boinc-client, which is used for the backfill. For the local Condor and BOINC configuration please install condor-config.
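
A minimal sketch of that installation step, assuming the package names above exist in the locally configured repository:

 # install Condor together with the BOINC client used for backfill,
 # plus the local configuration package
 apt-get update
 apt-get install condor boinc-client condor-config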

Hints for building the Debian package from source

A successful build requires:

  • gcc-3.4 (for building glibc)
  • libncurses-dev
  • flex
  • bison-1.35
  • csh
  • a plain make, i.e. do not use make -j n

make public generates the tar.gz installation trees:

 condor-7.0.1-linux-x86_64-debian40-dynamic-unstripped.tar.gz
 condor-7.0.1-linux-x86_64-debian40-dynamic.tar.gz
 condor-7.0.1-linux-x86_64-debian40.tar.gz
 condordebugsyms-7.0.1-linux-x86_64-debian40-dynamic.tar.gz

From these tarballs, one can install Condor directly or build a deb package.
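
As a rough sketch of the first option, one can unpack a generated tarball and run the installer script bundled in it; the installer options are not shown here and the unpacked directory name may differ:

 tar xzf condor-7.0.1-linux-x86_64-debian40-dynamic.tar.gz
 cd condor-7.0.1
 ./condor_install        # bundled installer script; see its --help for the available options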

Configuration

The global configuration file is

/opt/condor/etc/condor_config

and it is complemented by the local configuration file

/etc/default/condor

The local config file decides which parameters are passed to the main config file. First, however, there is a common part shared by all machines.

Common part

 HOSTNAME=$(hostname)
 RELEASE_DIR=/opt/condor
 LOCAL_DIR=/local/condor.$(hostname)
 CONDOR_ADMIN=root@localhost
 MAIL=/usr/bin/mail
 COLLECTOR_NAME=MPI-GRAPHY-AEI-Hannover
 LOCAL_HOST=$(hostname)
 cat <
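
The heredoc above is truncated; as a hedged sketch only, it presumably writes the shell variables into the local Condor configuration as the corresponding macros, roughly like this (the target file name is an assumption):

 ## assumed shape of the truncated heredoc; target file name is made up
 cat <<EOF >> ${LOCAL_DIR}/condor_config.local
 RELEASE_DIR = ${RELEASE_DIR}
 LOCAL_DIR = ${LOCAL_DIR}
 CONDOR_ADMIN = ${CONDOR_ADMIN}
 MAIL = ${MAIL}
 COLLECTOR_NAME = ${COLLECTOR_NAME}
 EOF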

Condor Server

 cat <

We use four head nodes h1, h2, h3, h4 in a HAD (High Availability of Daemons) configuration, so that the services crucial for the whole cluster do not depend on a single machine. The head nodes are also the machines from which you are allowed to submit Condor jobs.
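
The server-side configuration block above is truncated; as an illustrative sketch only, a HAD setup for the central manager daemons is usually expressed with macros along these lines (the port number and daemon list are assumptions, not copied from our configuration):

 ## assumed sketch of HAD-related macros on the head nodes h1-h4
 HAD_PORT = 51450
 HAD_LIST = h1:$(HAD_PORT), h2:$(HAD_PORT), h3:$(HAD_PORT), h4:$(HAD_PORT)
 HAD_USE_PRIMARY = FALSE
 DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, HAD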

Condor Execution Nodes

 cat <(1*\$(MINUTE))
 EVICT_BACKFILL=\$(MachineBusy)
 BOINC_HOME=/local/boinc
 BOINC_Executable=/opt/boinc/bin/boinc_client
 BOINC_Universe=vanilla
 BOINC_InitialDir=\$(BOINC_HOME)
 BOINC_Output=\$(BOINC_HOME)/boinc.out
 BOINC_Error=\$(BOINC_HOME)/boinc.err
 BOINC_Owner=boinc
 EOF

The real compute units, the execute machines, are the nodes n*. For the concepts of checkpointing and backfill, please read the next two sections.

Checkpoint Server

  • What is a checkpoint server?

A Condor pool can also be configured with one or more checkpoint servers that serve as a repository for checkpoints. This is supported only on Linux. In case a job is evicted, the checkpoint server allows Condor to migrate the job to another machine and continue computing from the last checkpoint.
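
Checkpointing applies to standard universe jobs, i.e. programs relinked with condor_compile and submitted from a head node. A minimal sketch, with made-up file names:

 # relink the program against Condor's checkpointing/remote-syscall libraries
 condor_compile gcc -o myjob myjob.c

and an example submit description file myjob.sub

 universe   = standard
 executable = myjob
 output     = myjob.out
 error      = myjob.err
 log        = myjob.log
 queue

submitted with condor_submit myjob.sub.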

  • Limitations

A test performed with the following simple C program

 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>
 #define DATASIZE 1024*1024*128
 #define MINUTE 60*1000*1000
 int main(int argc, char* argv[]) {
   int i;
   int *d;
   /* allocate and zero 512 MB */
   d = (int*)calloc(DATASIZE, sizeof(int));
   if (d == NULL) exit(1);
   int s = sizeof(int)*DATASIZE;
   printf("calloc(%d,%d) => %d\n", DATASIZE, (int)sizeof(int), s);
   /* touch only a tiny part of the allocation */
   for (i = 24; i < 1048; i++) {
     d[i] = 1;
   }
   /* sleep long enough for a checkpoint to be taken */
   usleep(3*MINUTE);
   free(d);
   fprintf(stderr, "%d free now\n", s);
   return 0;
 }

shows that the memory is copied 1:1 to the checkpoint server, leading to large streams of zeros crossing the network. There is currently no compression, except for the data on the checkpoint server once the zeros have arrived. For that reason we set up a checkpoint server on every compute node and force each machine to use it.
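
A hedged sketch of such a per-node setup in the Condor configuration (the macro names are standard, the values and the checkpoint directory are made-up examples):

 ## run a checkpoint server on every execute node and let jobs use the local one
 DAEMON_LIST = MASTER, STARTD, CKPT_SERVER
 CKPT_SERVER_HOST = $(FULL_HOSTNAME)
 CKPT_SERVER_DIR = /local/ckpt
 USE_CKPT_SERVER = TRUE
 STARTER_CHOOSES_CKPT_SERVER = TRUE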

Backfill

Condor can be configured to run backfill jobs whenever the condor_startd has no other work to perform. Currently, the only supported backfill environment is BOINC.
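
The execution-node block above has lost its first lines; as a hedged reconstruction using the standard backfill macros (the one-minute threshold matches the surviving fragment, the rest is an assumption; StateTimer and MachineBusy are helper macros from the stock example configuration), it presumably reads roughly:

 ENABLE_BACKFILL = TRUE
 BACKFILL_SYSTEM = BOINC
 START_BACKFILL = $(StateTimer) > (1 * $(MINUTE))
 EVICT_BACKFILL = $(MachineBusy)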

Hints

DNS

It seems that it is not enough for Condor if you can merely ping every machine in your pool by name; all hosts should also be listed in /etc/hosts. On the head nodes you should list the names and IP addresses of all nodes. On the nodes it is sufficient to list all head node names plus the node's own name.
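
For illustration, the /etc/hosts on a head node might then contain entries like these (addresses and domain are made up):

 10.10.0.1   h1.example.local    h1
 10.10.0.2   h2.example.local    h2
 10.10.0.3   h3.example.local    h3
 10.10.0.4   h4.example.local    h4
 10.10.1.1   n001.example.local  n001
 10.10.1.2   n002.example.local  n002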

WantRemoteIO

want_remote_io = True | False

This option controls how a file is opened and manipulated in a standard universe job. If this option is true, which is the default, then the condor_shadow makes all decisions about how each and every file should be opened by the executing job. This entails a network round trip (or more) from the job to the condor_shadow and back again for every single open(), in addition to other needed information about the file.

If set to false, then when the job queries the condor_shadow for the first time about how to open a file, the condor_shadow will inform the job to automatically perform all of its file manipulation on the local file system of the execute machine, and any file remapping will be ignored. This means that there must be a shared file system (such as NFS or AFS) between the execute machine and the submit machine, and that ALL paths that the job could open on the execute machine must be valid. The ability of the standard universe job to checkpoint, possibly to a checkpoint server, is not affected by this attribute. However, when the job resumes it will be expecting the same file system conditions that were present when the job checkpointed.
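
A hedged sketch of a standard universe submit description that switches remote I/O off (file names and paths are made up; note that they must also be valid on the execute machine, e.g. via NFS or AFS):

 universe       = standard
 executable     = myjob
 output         = /home/user/myjob.out
 error          = /home/user/myjob.err
 log            = /home/user/myjob.log
 want_remote_io = False
 queue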