abstract
We buy the same kinds of nodes for the LISA and the LIGO cluster. They
share the same data core switch. The LISA part has to be able to handle
parallel MPI jobs. In addition, LISA group scientists have to be
preferred when starting jobs.
drawbacks
The nodes which serve MPI jobs have a different setup than the
others. Currently, parallel universe jobs do not work together with
dynamic slots. We need dynamic slots to avoid memory competition on
the nodes: memory competition leads to swapping, which slows the node
down dramatically and makes jobs fail.
plan
- anybody can start standard universe jobs everywhere
- if LISA people start jobs, they get a better priority on the LISA nodes
- if LIGO people start jobs, they get a better priority on LIGO nodes
- vanilla universe jobs can start only on LIGO nodes, because we enable eviction on the LISA nodes
- parallel jobs can be started only from the dedicated LISA head node (a submit file sketch follows after this list)
- whenever a parallel job has been started, the jobs already running on the LISA nodes will be evicted to provide slots for the MPI job
- to have a criterion for this, LISA people will be members of a "lisa" group
- this group must be configured on the dedicated head node, but it can be configured everywhere
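As an illustration of the parallel job items above, a submit file used on the
LISA head node could look like the following sketch (the MPI wrapper script,
program name, machine count and file names are assumptions, not part of our
setup):
universe                = parallel
executable              = openmpiscript      # example wrapper shipped with condor; path may differ
arguments               = my_mpi_program     # hypothetical MPI binary
machine_count           = 8
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = mpi.$(Node).out
error                   = mpi.$(Node).err
log                     = mpi.log
queue
Such jobs are handled by the dedicated scheduler on the head node and only
match machines which advertise it (see the compute node configuration below).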
alternative plan
- we avoid dynamic slot provisioning and MPI jobs can run everywhere
setup
group ID
When a user logs in on the LISA head node, the group IDs of the user must be
read out. The "/etc/profile.atlas" (or "/etc/profile.lisa" if you like)
contains the following lines:
GIDLIST="$(id -G)"
export GIDLIST
This reads out the group IDs and writes them into the variable GIDLIST. Let
us say the LISA GID is 3000.
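A quick sanity check after logging in (the GID values shown are made up):
$ id -G
513 1000 3000
$ echo $GIDLIST
513 1000 3000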
condor configuration LISA head node
The environment variable GIDLIST must end up as a condor ClassAd attribute
which is attached to every job at submit time.
Incorporate the following condor configuration:
GIDL="$ENV(GIDLIST)"
SUBMIT_EXPRS = $(SUBMIT_EXPRS), GIDL
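Every job submitted from this node then carries the attribute in its ClassAd.
A quick way to verify (job ID and values are illustrative):
$ condor_q -long 123.0 | grep GIDL
GIDL = "513 1000 3000"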
condor configuration LISA compute nodes
Dynamic slot provisioning must be turned off. Avoid a setup like:
SLOT_TYPE_1 = cpu=100%, ram=100%, swap=50%
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1_PARTITIONABLE = True
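Instead, rely on condor's default of one static slot per core, or configure
static slots explicitly. A minimal sketch (the core count and the head node
name are assumptions), which also advertises the dedicated scheduler that
parallel universe jobs need:
# one static slot per core (8 cores assumed)
SLOT_TYPE_1               = cpus=1, ram=auto, swap=auto, disk=auto
SLOT_TYPE_1_PARTITIONABLE = False
NUM_SLOTS_TYPE_1          = 8
# let the dedicated scheduler on the LISA head node claim these slots
DedicatedScheduler = "DedicatedScheduler@lisa-head.example.org"
STARTD_ATTRS       = $(STARTD_ATTRS), DedicatedScheduler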
The Rank has to be influenced by the GID list. Add the following to
the condor configuration:
START = (Owner == "fehrmann")
RANK = stringListMember("3000", GIDL, " ")
The Rank can then be used to evict lower-ranked jobs when a higher-ranked job arrives.
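To illustrate how the Rank behaves, assume a LISA member whose id -G output is
"513 1000 3000" (made-up values) submits a job:
GIDL = "513 1000 3000"                        # attribute in the job ClassAd
stringListMember("3000", GIDL, " ")           # evaluates to True on the LISA nodes
RANK = 1                                      # True counts as 1; jobs without GID 3000 get Rank 0
The higher Rank lets the LISA job claim the slot and evict the lower-ranked job
running there.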
-- HenningFehrmann - 20 Mar 2013