HTCondor configuration updates in 2015
(1) Using cgroups to softly enforce memory and core limits
Reasoning
In the past, we either relied on users' jobs to obey the limits they
declared, or we tried to limit jobs via ulimit in a wrapper script;
neither approach worked reliably.
Using Linux cgroups, each job can now ask for a share of a system's RAM
and is guaranteed that much memory (plus a little bit of overhead).
HTCondor gives the choice to use hard limits (where a job is
automatically killed as soon as it hits its limit) or soft limits (a job
may use more than it asked for but will be killed as soon as another
job requires that memory). To allow jobs to run as smoothly as possible,
we will start with the soft limit.
Impact for users
Your job's submission file now
MUST include the following two items:
- request_memory specifies the amount of memory the job will need (in MiBytes). Try to be as reasonable and specific as possible; a value that is much too large will limit the number of jobs you can run concurrently, while a value that is too small will most likely lead to jobs being killed by the system.
- request_cpus specifies the number of logical CPU cores your job will need. Choose 1 for a single-threaded job and a larger value for multi-threaded programs.
The defaults will be set to 250 MiByte and a single core, i.e. not specifying
anything will apply these defaults, which may not work for you.
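A minimal sketch of a submit file with both items set (the executable name and the requested values are just placeholders for illustration):
# hypothetical single-threaded job needing about 2 GiByte of RAM
universe       = vanilla
executable     = my_analysis.sh
request_memory = 2048
request_cpus   = 1
output         = job.$(Cluster).$(Process).out
error          = job.$(Cluster).$(Process).err
log            = job.$(Cluster).log
queue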
Jobs which are killed will
NOT write anything about it to their stdout/stderr files,
but you will be able to recognize this from the hold reason.
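For example, the hold reasons of your held jobs can be listed with the standard condor_q option:
condor_q -hold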
Also, please do
NOT use any memory-related requirement expression anymore, as this
will most likely cause your jobs to never run!
Admin specific
To declare the cgroup to be used and to enable soft limits (see above), put this into the config and use the
systemd service file (see below):
BASE_CGROUP = htcondor
CGROUP_MEMORY_LIMIT_POLICY = soft
JOB_DEFAULT_REQUESTCPUS = 1
JOB_DEFAULT_REQUESTMEMORY = 250
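To check on a node that the new values are actually picked up, the running configuration can be queried, e.g. (sketch):
condor_config_val CGROUP_MEMORY_LIMIT_POLICY JOB_DEFAULT_REQUESTMEMORY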
(2) Declaring GPUs as a dynamic resource
Reasoning
So far, GPU usage was not well documented and varied a lot between the
various clusters. HTCondor recently added GPU discovery to its code,
and we are migrating to this feature.
Impact for users
If you want to use a GPU for your codes, please specify
request_GPUs = 1
If you need special capabilities, you may want to consider using specific requirements, e.g.
requirements = (CUDACapability >= 1.2) && $(requirements:True)
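Putting it together, a GPU job's submit file could look roughly like this (executable and resource values are placeholders):
# hypothetical GPU job sketch
universe       = vanilla
executable     = my_gpu_code
request_GPUs   = 1
request_cpus   = 1
request_memory = 2048
requirements   = (CUDACapability >= 1.2) && $(requirements:True)
queue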
Admin specific
Setting these two knobs should be enough to automatically discover the GPUs and add
the found resources to the machine ClassAds:
use feature : GPUs
GPU_DISCOVERY_EXTRA = -extra
Ensure the kernel drivers are in place so that the GPU(s) can be auto-detected,
i.e. the kernel drivers need to be loaded before HTCondor is started.
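After restarting the startd with these knobs, the detected devices should show up in the slot ClassAds; a quick check could look like this (the slot name is a placeholder):
condor_status -long <slot-on-gpu-node> | grep -i -e gpu -e cuda
The discovery tool itself (found in HTCondor's libexec directory) can also be run by hand with the same extra flag, i.e. condor_gpu_discovery -extra.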
(3) Accounting group quotas
Reasoning
Impact for users
Admin specific
(4) Disable swap on execute nodes
Reasoning
In the past, compute nodes tended to become barely usable if jobs ran out
of memory and into swap, which affected other concurrently running jobs.
Impact for users
Hopefully either none at all or a better job throughput.
Admin specific
Change the FAI config so that no swap partition is created for class
NODE_COMPUTE
anymore.
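For nodes that are already installed, swap could also be disabled without reinstalling, e.g. (sketch):
swapoff -a
# additionally remove/comment out any swap entries in /etc/fstab so it stays off after a reboot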
(5) systemd service file
Admin specific
We will be using a systemd service file from now on, for example:
[Unit]
Description=HTCondor Distributed High-Throughput-Computing
After=network.target atlas-cuda.service
[Service]
ControlGroup=/htcondor
LimitNOFILE=16384
ExecStart=/usr/sbin/condor_master -f
ExecStop=/usr/sbin/condor_off -master
ExecReload=/usr/bin/kill -HUP $MAINPID
[Install]
WantedBy=multi-user.target
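Assuming the unit file is installed as /etc/systemd/system/condor.service (name and location are an assumption), it would be activated with the usual commands:
systemctl daemon-reload
systemctl enable condor.service
systemctl start condor.service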
(6) Enable preemption in major part of Atlas
This is only a possibility and currently NOT planned for production
Reasoning
Unlike other clusters, we have so far not enabled preemption on Atlas. Usually,
there were more than enough resources available to satisfy most demands.
However, recently some jobs were running for exceedingly long times (24-60 hrs,
rarely more than 100 hrs), and this leads to a couple of problems:
- Resource waste - at least some of these jobs do not use any kind of checkpointing, which means they will restart from scratch if the job and/or the node is killed/rebooted during job execution
- Fair share not being possible - a large number of long-running jobs will prevent the scheduler from giving all users access to resources
- Without preemption we may not be able to fulfill our computing resource obligation towards LIGO/VIRGO (see also accounting groups above)
Impact for users
We will configure the major share of our resources to only accept jobs when
a specific keyword has been added to your submit files. Please only use
this keyword if
- your jobs run for less than 12 hours, or
- your jobs can checkpoint and are able to restart from that checkpoint.
Jobs capable of checkpointing (standard universe) will automatically be selected.
This checkpointing could be done via HTCondor's standard universe,
the upcoming vanilla universe checkpointing feature, or a
self-constructed checkpointing mechanism that inspects old log files
to continue from the last known state.
If your jobs qualify according to the above restrictions, please add this
to your submit file (unless you are using the standard universe, which we
will allow by default):
+AllowPreemption = True
Please note that we will still have machines which will not preempt jobs
at all, but their number will be considerably smaller.
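A sketch of how the keyword would appear in a vanilla universe submit file (executable and resource values are placeholders; the job is assumed to be able to resume from its own checkpoints):
universe         = vanilla
executable       = my_resumable_job.sh
request_memory   = 1024
request_cpus     = 1
+AllowPreemption = True
queue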
Admin specific
On the majority of systems we should add this to the START expression:
START = (Target.AllowPreemption =?= True) || ( Target.JobUniverse == 1)
on other hosts just use the regular
START = TRUE
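One possible way to deploy this (assuming configuration fragments in /etc/condor/config.d) is to ship different fragments to the two host groups:
# on preemptable hosts, e.g. /etc/condor/config.d/50-start.conf
START = (Target.AllowPreemption =?= True) || (Target.JobUniverse == 1)
# on the remaining hosts
START = TRUE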
(7) Change matching order
Reasoning
Currently, HTCondor at Atlas uses the default negotiator ranking, which may leave many single CPU cores scattered free throughout the system. Given that the latest pipelines start to use more threads and thus require more CPU cores, it would be better to fill up the empty slots on busy machines first.
Impact for users
This should not impact users' jobs much if at all.
Admin specific
The negotiator needs to sort the candidate slots in a different way (mostly analogous to HTCondor's default settings):
- Each considered slot with "RemoteOwner" undefined gets a base score of 10,000,000,000 (this should match partitionable slots)
- For each available CPU core of this slot, the score is reduced by 1,000,000,000 (i.e. the fewer cores available, the higher the rank)
- The more RAM the slot has available, the lower the score (Memory is measured in KByte); with the given scores, one compute core is currently worth 100 GByte of RAM
- We also want jobs to flock to faster nodes if possible, therefore we introduce a custom attribute cpuinfo_bogomips which should at least indicate how fast the CPUs are (a possible way to advertise it is sketched below the expression). HTCondor's benchmarked KFlops and Mips values seem to vary a lot between identical machines and are therefore not used. The factor used here is 100,000 as we expect the spread between slow and fast nodes to be O(1000).
NEGOTIATOR_PRE_JOB_RANK = (1.0e10 * ( ifThenElse( isUndefined(RemoteOwner), 1.0, 0.0))) \
- (1.0e9 * Cpus) \
- Memory \
+ (1.0e5 * ( ifThenElse( isUndefined(cpuinfo_bogomips), 1.0, real(cpuinfo_bogomips) )))
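Note that cpuinfo_bogomips is not a standard HTCondor attribute, so the startds have to advertise it themselves. A minimal sketch, assuming the value is written into the local config at provisioning time (e.g. by FAI from /proc/cpuinfo):
# example value taken from /proc/cpuinfo on the node
cpuinfo_bogomips = 5600.00
STARTD_ATTRS = $(STARTD_ATTRS) cpuinfo_bogomips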
--
CarstenAulbert - 19 Mar 2015