Condor, BOINC and dynamic slots
As Condor's dynamic slots do not yet mix with "fetch work", we build the setup around a dedicated, cron-based script running on a submit host. The setup itself is quite easy once you get the idea.
Prerequisites
You need a version of the BOINC client recent enough to understand all XML options used below (6.10.x should be sufficient). Each execute node should have the target directory (/local/boinc) pre-created with proper permissions.
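A quick way to verify this prerequisite is to run a small check on each execute node as the account the jobs will run under. The following is only a sketch: the helper name check_boinc_dir is made up for illustration, and it merely tests that the directory exists and that the current user may create entries in it.

```shell
#!/bin/bash
# check_boinc_dir DIR: succeed only if DIR exists and the current user
# has write and search permission on it (hypothetical helper).
check_boinc_dir() {
    local dir=$1
    [ -d "$dir" ] && [ -w "$dir" ] && [ -x "$dir" ]
}

if check_boinc_dir /local/boinc; then
    echo "/local/boinc is ready"
else
    echo "/local/boinc is missing or not writable for $(id -un)" >&2
fi
```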
Wrapper script
Place this wrapper script under /local/boinc/boinc_starter.sh and make it executable:
#!/bin/bash
set -e
PATH=/bin:/usr/bin
##################################################################################
# start boinc from a schedd machine to fill the cluster (even with dynamic slots)
##################################################################################
# get slot from local machine ad:
SLOT=$(awk -F"[\"@]" '/^Name/ {print $2}' ${_CONDOR_MACHINE_AD})
# some useful variables
BOINCDIR=/local/boinc/${SLOT}
# note: UID is a read-only variable in bash, so different names are used here
BOINC_USER=boinc
BOINC_GROUP=boinc
# create environment
# slot directory
if [ ! -d "${BOINCDIR}" ]; then
mkdir -p "${BOINCDIR}"
fi
# account file
if [ ! -f "${BOINCDIR}/account_einstein.phys.uwm.edu.xml" ]; then
cat > "${BOINCDIR}/account_einstein.phys.uwm.edu.xml" <<EOF
<account>
<master_url>http://einstein.phys.uwm.edu/</master_url>
<authenticator>INSERT_YOUR_ACCOUNT_KEY_HERE</authenticator>
</account>
EOF
fi
# config file
if [ ! -f "${BOINCDIR}/cc_config.xml" ]; then
cat > "${BOINCDIR}/cc_config.xml" <<EOF
<cc_config>
<log_flags>
</log_flags>
<options>
<report_results_immediately>1</report_results_immediately>
<dont_contact_ref_site>1</dont_contact_ref_site>
<ncpus>1</ncpus>
<allow_multiple_clients>1</allow_multiple_clients>
<allow_remote_gui_rpc>0</allow_remote_gui_rpc>
<proxy_info>
<no_proxy></no_proxy>
</proxy_info>
</options>
</cc_config>
EOF
fi
# start the beast
exec /usr/bin/nice -n +19 /usr/bin/boinc --dir ${BOINCDIR} \
--update_prefs http://einstein.phys.uwm.edu/ --no_gui_rpc \
>> ${BOINCDIR}/boinc.out 2>> ${BOINCDIR}/boinc.err
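To see what the awk line at the top of the wrapper does, it can be tried on a small stand-in for the machine ad. This is a sketch only: the ad fragment and the host name node042.example.org are invented for illustration, and the function name slot_from_ad does not appear in the wrapper itself. The field separator splits on double quotes and on "@", so the second field of the Name line is the slot name.

```shell
#!/bin/bash
# Print the slot part of the Name attribute found in a machine ad file.
slot_from_ad() {
    awk -F'["@]' '/^Name/ {print $2}' "$1"
}

# Hypothetical machine ad fragment, for illustration only.
sample_ad=$(mktemp)
cat > "$sample_ad" <<'EOF'
Machine = "node042.example.org"
Name = "slot1_2@node042.example.org"
EOF

slot_from_ad "$sample_ad"    # prints: slot1_2
rm -f "$sample_ad"
```

With this input the wrapper would use /local/boinc/slot1_2 as its BOINC directory, so every dynamic slot gets its own client directory.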
On the submit machine
Create a crontab entry for the appropriate user that starts a script which does the following:
- Check the number of idle jobs for this user, e.g.
condor_q -constraint "JobStatus =?= 1" -format "%s\n" JobStatus carsten | wc -l
- exit if this number has reached or exceeds a predefined threshold
- submit more jobs into the pool to reach this threshold, e.g. pipe this to condor_submit on standard input:
Executable = /local/boinc/boinc_starter.sh
Error = /dev/null
Output = /dev/null
Log = /dev/null
Universe = vanilla
request_memory=400
request_cpus=1
request_disk=50
nice_user = true
on_exit_remove = true
Queue NUMBER
where NUMBER is the difference between threshold and current number of idle jobs.
Sample script:
#!/bin/bash
set -e
# set up environment
PATH=/bin:/usr/bin
source /opt/condor/condor.sh
# max number of idle jobs in queue
IDLE_THRESHOLD=200
# current number of idle jobs in queue
IDLE_CUR=$(condor_q -constraint "JobStatus =?= 1" -format "%s\n" JobStatus carsten|wc -l)
# do we need to act?
DIFF=$(($IDLE_THRESHOLD - $IDLE_CUR))
if [[ $DIFF -le 0 ]]; then
    exit 0
fi
(
cat <<EOF
Executable = /local/boinc/boinc_starter.sh
Error = /dev/null
Output = /dev/null
Log = /dev/null
Universe = vanilla
request_memory=400
request_cpus=1
request_disk=50
nice_user = true
on_exit_remove = true
Queue $DIFF
EOF
) | condor_submit
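The crontab entry that runs the sample script could be generated as sketched below. Both the ten-minute interval and the script path /home/carsten/bin/boinc_refill.sh are assumptions made for illustration; this page does not prescribe either. The snippet only prints the line, so it can be reviewed before it is added with crontab -e.

```shell
#!/bin/bash
# Hypothetical location of the sample script above.
REFILL_SCRIPT=/home/carsten/bin/boinc_refill.sh

# Print a crontab line that runs the given script every ten minutes
# and discards its output.
cron_line() {
    printf '*/10 * * * * %s >/dev/null 2>&1\n' "$1"
}

cron_line "$REFILL_SCRIPT"
```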
-- CarstenAulbert - 01 Jul 2011