How to use Condor's dynamic slot model more efficiently

Condor currently offers essentially two distinct ways to set up execute nodes to offer compute capabilities. The standard way is to create one slot per physical CPU core and distribute the system memory evenly among these slots, i.e. an execute node with 8 GB RAM and 4 CPU cores will be split into 4 slots with one core and 2 GB RAM each. A much larger machine with 64 GB RAM and 16 cores would be split into 16 slots with 4 GB RAM each. An admin can change this set-up manually, e.g. configure 2 slots to accept jobs needing only 500 MB RAM and redistribute the remaining RAM among the other slots. This model can be called the "static slot model".
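A static split like the one above is defined in the execute node's Condor configuration. The slot-type knobs below are standard Condor configuration, but the concrete layout (a 4-core, 8 GB node with two small and two large slots) is purely illustrative:

```
# Illustrative static slot layout for a 4-core, 8 GB node:
# two small 500 MB slots plus two larger slots taking the rest.
SLOT_TYPE_1 = cpus=1, memory=500
NUM_SLOTS_TYPE_1 = 2
SLOT_TYPE_2 = cpus=1, memory=3500
NUM_SLOTS_TYPE_2 = 2
```

The drawback is visible right away: a job needing 5 GB cannot run anywhere on this node, even though 8 GB are physically present.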

On the other hand, Condor offers a newer model called dynamic provisioning, where by default the whole machine is offered as a single slot and each job partitions off a subslot. In theory this should provide better utilization of resources, as larger jobs can run without the need to reconfigure the whole cluster, and small-memory jobs may run while there are still CPU cores available. However, the downside is that it is also possible to starve the cluster by having only very few, very large jobs running. Optimizing this is an ongoing process, but it needs help from the users as well.
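On the configuration side, dynamic provisioning boils down to declaring one partitionable slot that owns the whole machine. This is the standard Condor configuration for it; whether additional tuning is applied on a given cluster is up to the admins:

```
# Offer the whole machine as one slot that jobs can partition;
# subslots are carved off on demand and returned when jobs finish.
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
```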

How to use it properly

With the dynamic model you can ask the system for more than a single CPU core and for a specific amount of RAM by setting the following two variables in your submit file:
RequestMemory = 100
RequestCpus = 1

This tells Condor that the job wants to run on a single CPU core and needs 100 MB of RAM.
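For reference, these two requests slot into an otherwise ordinary submit file. The executable and file names below are placeholders, not part of any real job:

```
# Minimal submit file sketch; myjob, job.out, job.err and job.log
# are placeholder names for your own executable and output files.
Executable    = myjob
Output        = job.out
Error         = job.err
Log           = job.log
RequestMemory = 100
RequestCpus   = 1
Queue
```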

Enforcing the limits

On Atlas we currently enforce the RAM usage by placing the job under stricter ulimit rules which allow up to your requested amount of memory (plus a little bit of extra head room). If you hit this barrier, subsequent memory allocations by your job will fail. The trick is to specify this properly: too small and your jobs will fail, too large and the number of machines able to run these jobs will shrink.
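A reasonable rule of thumb (a suggestion, not an official Atlas policy) is to measure the job's peak memory use on a test run and request that value plus a modest margin, rather than a round, oversized number:

```
# If a test run peaked at roughly 900 MB, request a small margin
# above that instead of asking for, say, 4096 MB "to be safe".
RequestMemory = 1024
```

Requesting only what you need keeps more execute nodes eligible for your jobs and leaves room for other users' jobs on the same machines.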

In the (near) future we will also start to enforce the number of cores more strictly by using Linux's cgroups, i.e. if you request a single core but start to fork many processes, they will all be confined to a single core. Currently, such jobs have an adverse effect on other jobs running on the same execute node.
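On the administrative side this uses Condor's built-in cgroup support. BASE_CGROUP is the standard configuration knob; the value shown here is only an example, and the exact setup on Atlas may differ:

```
# Track and confine each job in its own cgroup under this hierarchy,
# so the kernel enforces the requested CPU and memory shares.
BASE_CGROUP = htcondor
```

From a user's point of view nothing changes in the submit file: RequestCpus simply becomes a hard cap instead of a polite request.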

-- CarstenAulbert - 11 Jul 2012

DocumentationForm

Title Dynamic slots with Condor
Description How to enable your jobs to run more efficiently on Condor/Atlas
Tags condor atlas dynamic slots
Category User