HTCondor configuration updates in 2015

(1) Using cgroups to softly enforce memory and core limits

Reasoning

In the past, we either relied on users' jobs to obey the limits they declared, or we tried to restrict jobs with ulimit in a wrapper script; however, this rarely worked reliably.

Using Linux cgroups, each job can now ask for a share of a system's RAM and is guaranteed that much memory (plus a little overhead). HTCondor offers the choice between hard limits (a job is killed automatically as soon as it hits its limit) and soft limits (a job may use more memory than it asked for, but will be killed as soon as another job needs that memory). To allow jobs to run as smoothly as possible, we will start with soft limits.

Impact for users

Your job's submit file now MUST include the following two items:

  • request_memory specifies the amount of memory the job will need (in MiByte). Try to be as accurate as possible: a value that is much too large will limit the number of jobs you can run concurrently, while a value that is too small will most likely lead to your jobs being killed by the system.
  • request_cpus specifies the number of logical CPU cores your job will need. Choose 1 for a single-threaded job and a larger value for multi-threaded programs.

The defaults will be 250 MiByte and a single core, i.e. if you do not specify anything, these defaults will be used, which may not be suitable for your jobs.
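
As an illustration, a minimal submit file honoring these settings could look like the following sketch (the executable name and all values are placeholders, adjust them to your actual job):

universe       = vanilla
# placeholder name, replace with your program
executable     = my_analysis
# memory in MiByte, adjust to what the job really needs
request_memory = 1000
# single-threaded job
request_cpus   = 1
output         = job.out
error          = job.err
log            = job.log
queue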

Jobs which are killed this way will NOT write anything about it to their stdout/stderr files, but you will be able to recognize it from the hold reason (shown e.g. by condor_q -hold).

Also, please do NOT use any memory-related requirement expressions anymore, as this will most likely cause your jobs to never run!
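
For example (with made-up numbers), instead of constraining the machine's Memory attribute in the requirements expression, request the memory explicitly:

# old style, please do not use anymore:
#   requirements = (Memory > 4000)
# new style:
request_memory = 4000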

Admin specific

To declare the cgroup to be used and to enable soft limits (see above), put this into the config and use the systemd service file (see section (5) below):

BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = soft
JOB_DEFAULT_REQUESTCPUS = 1
JOB_DEFAULT_REQUESTMEMORY = 250

(2) Declaring GPUs as a dynamic resource

Reasoning

So far, GPU usage has not been well documented and varied a lot between our clusters. HTCondor recently added GPU discovery to its code, and we are migrating to this feature.

Impact for users

If you want to use a GPU for your codes, please specify

request_GPUs = 1

If you need special capabilities, you may want to consider using specific requirements, e.g.

requirements = (CUDACapability >= 1.2) && $(requirements:True)
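
Putting this together with the requests from section (1), the GPU-related part of a submit file could look like the following sketch (the executable name and values are placeholders):

# placeholder name, replace with your program
executable     = my_gpu_code
request_cpus   = 1
request_memory = 2000
request_GPUs   = 1
requirements   = (CUDACapability >= 1.2) && $(requirements:True)
queue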

Admin specific

Setting these two knobs should be enough to automatically discover the GPUs and add the found resources to the machine ClassAd:
use feature : GPUs
GPU_DISCOVERY_EXTRA = -extra

Ensure the kernel drivers are in place so the GPU(s) can be auto-detected, i.e. the kernel drivers need to be loaded before HTCondor is started.

(3) Accounting group quotas

Reasoning

Impact for users

Admin specific

(4) Disable swap on execute nodes

Reasoning

In the past, compute nodes tended to become barely usable when jobs ran out of memory and started swapping, which affected other jobs running concurrently on the node.

Impact for users

Hopefully either none or better job throughput.

Admin specific

Change the FAI config so that no swap partition is created in class NODE_COMPUTE anymore.

(5) systemd service file

Admin specific

From now on, we will be using a systemd service file like the following:

[Unit]
Description=HTCondor Distributed High-Throughput-Computing
After=network.target atlas-cuda.service

[Service]
ControlGroup=/htcondor
LimitNOFILE=16384
ExecStart=/usr/sbin/condor_master -f
ExecStop=/usr/sbin/condor_off -master
ExecReload=/usr/bin/kill -HUP $MAINPID

[Install]
WantedBy=multi-user.target

(6) Enable preemption in major part of Atlas

This is only a possibility and currently NOT planned for production

Reasoning

Unlike on other clusters, we have so far not enabled preemption on Atlas. Usually, there were more than enough resources available to satisfy most demands. However, recently some jobs have been running for exceedingly long times (24-60 hrs, rarely more than 100 hrs), which leads to a couple of problems:

  • Resource waste - at least some of these jobs do not use any kind of checkpointing, which means they will restart from scratch if the job and/or the node is killed/rebooted during job execution
  • Fair share becomes impossible - a large number of long-running jobs prevents the scheduler from giving all users access to resources
  • Without preemption we may not be able to fulfill our computing resource obligation towards LIGO/VIRGO (see also accounting groups above)

Impact for users

We will configure the major share of our resources to only accept jobs when a specific keyword has been added to your submit files. Please only use this keyword if your jobs
  • run for less than 12 hours, or
  • are able to checkpoint and restart from such a checkpoint

Jobs capable of checkpointing (standard universe) will automatically be selected.

This checkpointing could be done via HTCondor's standard universe, the upcoming checkpointing feature for the vanilla universe, or a self-constructed checkpointing mechanism that inspects old log files to continue from the last known state.

If your jobs qualify according to the above restrictions, please add this to your submit file (unless you are using the standard universe, which we will allow by default):

+AllowPreemption = True
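
In context, such a submit file could look like this sketch (the executable name and values are placeholders):

# placeholder name; this job runs well below 12 hours
executable       = my_short_job
request_memory   = 1000
request_cpus     = 1
+AllowPreemption = True
queue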

Please note that we will still have machines which will not preempt jobs at all, but their number will be considerably smaller.

Admin specific

On the majority of systems we should add this to the START expression:

START = (Target.AllowPreemption =?= True) || ( Target.JobUniverse == 1)

On other hosts, just use the regular START = True.

(7) Change matching order

Reasoning

Currently HTCondor at Atlas uses the default negotiator ranking, which may leave many single CPU cores free and scattered throughout the system. Given that the latest pipelines are starting to use more threads and thus require more CPU cores, it would be better to first fill up the remaining free slots on already busy machines.

Impact for users

This should not impact users' jobs much if at all.

Admin specific

The negotiator needs to rank the candidate slots in a different way (mostly analogous to HTCondor's default settings):

  1. Each considered slot with "RemoteOwner" undefined gets a base score of 10,000,000,000 (this should match partitionable slots)
  2. For each available CPU core of this slot, the score is reduced by 1,000,000,000 (i.e. the fewer cores available, the higher the rank)
  3. The more RAM the slot has available, the lower the score (Memory is measured in KByte); with the given factors, one compute core is currently worth 100 GByte of RAM
  4. We also want jobs to end up on faster nodes if possible, therefore we introduce cpuinfo_bogomips, which should at least indicate how fast the CPUs are. HTCondor's benchmarked KFlops and Mips values seem to vary a lot between identical machines and are therefore not used. The factor used here is 100,000, as we expect the spread between slow and fast nodes to be O(1000).

NEGOTIATOR_PRE_JOB_RANK = (1.0e10 * ( ifThenElse( isUndefined(RemoteOwner), 1.0, 0.0))) \
                         - (1.0e9 * Cpus) \
                         - Memory \
                         + (1.0e5 * ( ifThenElse( isUndefined(cpuinfo_bogomips), 1.0, real(cpuinfo_bogomips) )))
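
Note that cpuinfo_bogomips is not a built-in machine attribute, so the execute nodes have to advertise it themselves. A minimal sketch of how this could be done in the node configuration (the numeric value is a placeholder that would have to be filled in per node, e.g. from /proc/cpuinfo at configuration time):

# placeholder value, to be set per node from /proc/cpuinfo
cpuinfo_bogomips = 4988.6
STARTD_ATTRS = $(STARTD_ATTRS) cpuinfo_bogomips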

-- CarstenAulbert - 19 Mar 2015