abstract

We buy the same kind of nodes for the LISA and LIGO clusters. They share the same data core switch. The LISA part must be able to handle parallel MPI jobs. In addition, scientists in the LISA group must be given preference when starting jobs.

drawbacks

The nodes that serve MPI jobs have a different setup from the others. Currently, parallel universe jobs do not work together with dynamic slots. We need dynamic slots to avoid memory competition on the nodes. Memory competition leads to swapping, which slows the node down dramatically and causes jobs to fail.

plan

  • anybody can start standard universe jobs everywhere
  • if LISA people start jobs, they get a higher priority on the LISA nodes
  • if LIGO people start jobs, they get a higher priority on the LIGO nodes
  • vanilla universe jobs can start only on LIGO nodes, because we enable eviction on the LISA nodes
  • parallel jobs can be started only from the dedicated LISA head node
  • whenever a parallel job is started, the standard universe jobs running on the LISA nodes will be evicted to provide slots for the MPI job
  • so that LISA people can be identified, they will be members of a "lisa" group
  • this group must be configured on the dedicated head node, but it can be configured everywhere

alternative plan

  • we avoid dynamic slot provisioning, so that MPI can run everywhere

setup

group ID

When a user logs in on the LISA head node, the user's group IDs must be read out. To do this, "/etc/profile.atlas" (or "/etc/profile.lisa" if you like) contains the following lines:
GIDLIST="$(id -G)"
export GIDLIST
This reads the numeric IDs of all groups the user belongs to and stores them, separated by spaces, in the variable GIDLIST. Let's say the LISA gid is 3000.
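As a quick sanity check after login, the membership test can be sketched in plain shell. This is only an illustration: the helper name in_gid_list is hypothetical, and 3000 is the example LISA gid assumed above.

```shell
# Hypothetical helper: succeed if gid $1 occurs in the space-separated list $2.
in_gid_list() {
    gid="$1"
    list="$2"
    # Pad both sides with spaces so that 3000 does not match 30000.
    case " $list " in
        *" $gid "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Use GIDLIST from the profile if it is set, otherwise query id directly.
GIDLIST="${GIDLIST:-$(id -G)}"

# 3000 is the assumed example gid of the "lisa" group.
if in_gid_list 3000 "$GIDLIST"; then
    echo "member of the LISA group"
else
    echo "not a member of the LISA group"
fi
```

The padding with spaces matters because a plain substring search would treat gid 30000 as a match for 3000.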

condor configuration LISA head node

The environment variable GIDLIST must become a condor ClassAd attribute that is attached to every submitted job. Add the following to the condor configuration:
GIDL="$ENV(GIDLIST)"
SUBMIT_EXPRS = $(SUBMIT_EXPRS), GIDL
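To check that the attribute really reaches the job ClassAd, something like the following could be run on the LISA head node after condor has been reconfigured and a test job has been submitted. It is a sketch, not a tested procedure: the function name show_gidl is hypothetical, and it assumes that condor_config_val and condor_q are in the PATH.

```shell
# Hypothetical helper: show GIDL as seen by the configuration and by queued jobs.
show_gidl() {
    if ! command -v condor_config_val >/dev/null 2>&1; then
        echo "condor tools not found in PATH"
        return 1
    fi
    # Value that the configuration assigns to GIDL in the current environment.
    condor_config_val GIDL
    # GIDL attribute of the jobs currently in the queue, if there are any.
    condor_q -long | grep '^GIDL'
}

show_gidl || echo "GIDL check incomplete"
```

If no GIDL line shows up for a queued job, the job was probably submitted from a shell in which GIDLIST was not exported.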

condor configuration LISA compute nodes

Dynamic slot provisioning must be turned off. Do not use settings such as:
SLOT_TYPE_1 = cpu=100%, ram=100%, swap=50%
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1_PARTITIONABLE = True

The Rank must depend on the gid list. Add the following to the condor configuration:
START = (Owner == "fehrmann")
RANK = stringListMember("3000", GIDL, " ")

The Rank can be used to evict other jobs: when a job with a higher Rank is matched to a node, a running job with a lower Rank is preempted.

-- HenningFehrmann - 20 Mar 2013
Topic revision: r1 - 20 Mar 2013, HenningFehrmann