Atlas basic usage guide

First things first

Be nice to others, and others should be nice to you as well.

Please read this aloud: I will be nice to other users, and other users will be nice to me.

We have quite a few means to help you get more performance out of the cluster; if you experience problems, please let us know. Right now we try to stay out of your way, i.e. you are free to experiment and use the system in a way that maximizes your efficiency. As long as you don't get in the way of other users, go for it. In case of problems, don't hesitate to ask on atlas-users -(at)- aei.mpg.de or atlas_admin.

Getting access

If you want to get access to Atlas, please let us know the following:
  • your grid identity (output of grid-proxy-info; see the example below)
  • your preferred user name
  • your shell of choice

If you already have an account on the LDG, you should already have an Atlas account as well - if you cannot log in, please let us know!
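For example, you can look up your grid identity as follows (this assumes a valid proxy already exists, e.g. created with grid-proxy-init; the distinguished name shown is only a placeholder):

  $ grid-proxy-info
  subject  : /DC=org/DC=example/OU=People/CN=Jane Doe/CN=proxy
  issuer   : /DC=org/DC=example/OU=People/CN=Jane Doe
  identity : /DC=org/DC=example/OU=People/CN=Jane Doe
  ...
  timeleft : 11:59:58

The identity line is the grid identity we need.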

Where to log in, and what servers are available to me?

Atlas consists of a few log-in (head) nodes, many execute nodes, and storage servers which serve data in the background. Here's a list of important machines as well as our naming scheme. Only our head/log-in nodes are directly accessible from the outside. To log into other nodes within Atlas, you need to either use condor_ssh_to_job or use an Atlas-internal ssh key pair (see the example after the table). All externally reachable Atlas machines have an FQDN of the form MACHINENAME.atlas.aei.uni-hannover.de.
| type | name | cores | RAM | remarks |
| head node | atlas1 | 8 | 24 GB | Condor job submission, compiling, local interactive jobs, web server for LSC protected content |
| head node | atlas2 | 8 | 24 GB | Condor job submission, compiling, local interactive jobs |
| head node | atlas3 | 8 | 24 GB | Condor job submission, compiling, local interactive jobs, web server for LSC protected content |
| head node | atlas4 | 8 | 24 GB | Condor job submission, compiling, local interactive jobs, experimental packages from the LSCsoft-proposed repository |
| head node | titan1 | 8 | 64 GB | Condor job submission, compiling, local interactive jobs |
| head node | titan2 | 8 | 64 GB | Condor job submission, compiling, local interactive jobs |
| head node | titan3 | 16 | 48 GB | Condor job submission, compiling, local interactive jobs |
| execute nodes (2008) | n0001..n1675 | 4 | 8 GB | Jobs are started by Condor here. You may log into these nodes, but do not start jobs here bypassing Condor! |
| gpu nodes (2010) | gpu001..gpu0066 | 4 | 12 GB | Jobs are started by Condor here. You may log into these nodes, but do not start jobs here bypassing Condor! Each machine features 4 Nvidia GPGPUs, either Tesla C1060 or Tesla C2050. More information on how to access these will be added soon. |
| developer machine | n0000 | 4 | 8 GB | This machine can be used to develop and test your codes |
| data server | d01..d30 | 8 | 16 GB | Not directly accessible by users, but access is possible via /atlas/data/ |
| data server | s02..s12 | 8 | 16 GB | Also not directly accessible; these are our old home file servers, but still in good shape and in use |
| data server | mdsXXX, hsmXXX | up to 24 | up to 128 GB | mds machines are "MetaDataServers" responsible for serving the "meta" information of a home file system; hsm nodes multiplex access to the content of files from the home file system, i.e. they increase the available bandwidth |
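For example, to log in from the outside and then reach one of your running jobs on an execute node (USERNAME and the job id 1234.0 are placeholders):

  # log into a head node from the outside
  ssh USERNAME@atlas1.atlas.aei.uni-hannover.de

  # from the head node: open a shell inside one of your running Condor jobs
  condor_ssh_to_job 1234.0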

Special directories/variables/paths

/local

  • /local is always a local disk of the machine you are currently working on, i.e. the disk physically present in the server that hostname reports
  • under /local/user/ there are pre-created directories for each user. Feel free to use these, but try to clean up anything you no longer need
  • IMPORTANT: /local is never backed up, it is considered scratch space

Standard variables pointing here are $SCRATCH, $TMPDIR and $TMP
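A minimal job-script sketch using the scratch space could look like this (it assumes $SCRATCH points at your pre-created directory under /local/user/; all other names are placeholders):

  #!/bin/bash
  # work on the node-local disk, not on NFS
  WORKDIR="$SCRATCH/myrun_$$"        # hypothetical working directory
  mkdir -p "$WORKDIR"
  cd "$WORKDIR" || exit 1
  hostname > results.dat             # stand-in for your real analysis
  mkdir -p "$HOME/results"
  cp results.dat "$HOME/results/"    # keep only what you need on /home
  cd /
  rm -rf "$WORKDIR"                  # /local is scratch space, clean up after yourself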

/home

This directory is managed by the automounter on all machines, and every user entry here points to the same NFS-shared directory, i.e. the contents of /home/ should appear identical everywhere within Atlas.

Caveat: Due to NFS caching, do not rely on files created on one node appearing immediately on another machine. It can take up to 60 s for files to show up, and it is easy to create race conditions this way.
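If you really need to hand a file from one node to another via /home, allow for this delay rather than assuming the file appears instantly; a small sketch (file name and timeout are arbitrary):

  # on the consuming node: wait up to 2 minutes for a flag file written on another node
  FLAG="$HOME/run01/done.flag"   # hypothetical file
  for i in $(seq 1 24); do
      [ -e "$FLAG" ] && break
      sleep 5
  done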

/atlas

Another auto-mounted directory is /atlas with all its subdirectories. Most of these have stabilized over the past years; however, we reserve the right to introduce new features and changes here, though we will try hard to keep everything as backward compatible as possible. Please note that you need to change into a directory before it becomes visible! The subdirectories are as follows:

/atlas/data
Data from the data servers (GW files, SFTs) is served here. You will most likely never need to remember files under this path, as ligo_data_find results will point here.
/atlas/einstein
This path is mostly used for the Einstein@Home project and is generally not usable by other users.
/atlas/hsm
We are moving more and more files into our hierarchical storage management (HSM) system. Files under here are mostly relevant for certain projects or serve as backup space in case a data server fails. Currently, this is the path that changes most often.
/atlas/node
This is the legacy way to access the local disks of various systems, e.g. the local disk of n1234 can be found under /atlas/node/n1234/. It is better to use /atlas/user (see below).
/atlas/user
Use this path as a cluster-wide, stable way to access files on any machine where your files might end up, e.g. if you want to access your files on n1234, just look under /atlas/user/n1234/USERNAME (obviously replacing USERNAME with your own user name); see the example after this list. It looks almost the same as /atlas/node, but it is not, e.g. with /atlas/node you won't be able to access /local/user of a head node.
/atlas/v42
This is a currently experimental path where we plan to make available files which require high bandwidth. The plan is to use symbolic link farms to make access fully transparent to users' jobs and to keep all NFS traffic within the same rack. The name 42 was chosen because there are 42 execute nodes per rack. This is again subject to change.
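For example, to fetch a result file a job left under /local/user on node n1234 (node and file names are made up):

  # works from any machine within Atlas, e.g. a head node
  ls /atlas/user/n1234/USERNAME/
  cp /atlas/user/n1234/USERNAME/results.dat "$HOME/"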

Condor specifics

Atlas now runs a slot model called "dynamic slots" or "dynamic provisioning". This means each multi-core machine is no longer statically partitioned as it was before April 2012; instead, each machine advertises itself with all the CPU cores, disk space and memory it has. Each user job takes its requested slice away from this "master slot", so the user has to define the slice size. You should now specify in Condor's submit file how many CPU cores the job requires (RequestCpus) and how much memory it needs (RequestMemory). Don't despair if you forget: our defaults, which should work for a large fraction of jobs, are 1 CPU core and 1400 MB RAM.
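A minimal submit file sketch with these settings might look as follows (executable and arguments are placeholders; the resource requests simply restate the defaults):

  # sketch of a Condor submit description file, e.g. my_job.sub
  universe       = vanilla
  executable     = my_analysis
  arguments      = --config run.ini
  # one CPU core and 1400 MB of RAM (the Atlas defaults)
  request_cpus   = 1
  request_memory = 1400
  output         = job.out
  error          = job.err
  log            = job.log
  queue

Submit it from a head node with condor_submit my_job.sub.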

In our old set-up you were allowed to exceed your slot's memory constraints, which sometimes led to very bad situations. Condor therefore now enforces these limits more rigorously via the operating system's ulimit mechanism. It allows 10% additional headroom; beyond that, your job will not be able to allocate more memory than you requested! So please check the return values of your malloc calls!

Finally, it is also possible to use GPUs, but unfortunately not via the dynamic slot model. Therefore, you have to tell Condor explicitly that your job needs a GPU (more on this on a separate page later).