How to use dsh and password less ssh in between the nodes It's very easy to get password less ssh working in between the nodes, you just need a proper ssh keypair...
GridKaHOWTO Follow the whole process with the same user and the same browser! If you use Firefox, please make sure to use ESR version 60 or 68 a...
Planned downtimes This page summarizes planned/ongoing work within the Atlas cluster along with a few details. Usually, we will issue condor_off -peaceful at lea...
This is the public information on the ATLAS cluster operated by the Max Planck Institute for Gravitational Physics (also named Albert Einstein Institute, AEI) si...
Trying to get iPXE working as the default method for netinstalls (based on http://ipxe.org/howto/chainloading and https://doc.rogerwhittaker.org.uk/ipxe installati...
Rebuilding Debian's kernel (loosely following https://kernel-team.pages.debian.net/kernel-handbook/ch-common-tasks.html#s-common-official) pbuilder environment I...
First steps with spack Please note that all this was tested on an extremely minimally installed server. I.e. just installing something like doxygen can take a very...
Create a hybrid USB Image The goal is to create an image file which can be copied onto a USB stick and booted both via legacy BIOS as well as UEFI. This document ...
Simple ZFS ZVol testing creating baseline Create simple test data set in RAM: mkdir p /dev/shm/data for i in $(seq w 30); do dd if=/dev/null base64)" nosalt...
Detailed list of metrics we want to monitor Compute nodes (61) * CPU: user / nice / system / wait (4) * disk: * space available/free per locally defi...
Monitoring for Jessie and Beyond What do we want/need to monitor (metrics/checks) A non-exhaustive list of metrics and checks we need/would like to have, e.g....
Webserver serving user content If you need a webserver to serve content from your $HOME to the world, please create the directory ~/WWW on Atlas if it does not ex...
How to add a new host (salt era) This example will use einstein12 as a sample machine which before was known as ra15. Before you begin, you need to have ssh agent...
Cluster upgrade to Debian 8/Jessie We plan to use this page for keeping a record of where we are with respect to our full cluster upgrade to Debian Jessie. Curr...
FAI Jessie set up 1 base install via old fai jessie 1 base minimal config via salt 1 echo 'deb http://repo.atlas.local/reprepro fai contrib' /etc/apt/...
How to disable KM1/2 and use KM4 manually In order to disable KM1/2 and temporarily run with KM4 only, the following steps are needed (please monitor that each ch...
HSM file system check stats (last update: 2016-07-26T18:42Z) Planned steps (starting at 2016-07-26T11:00Z): 1 Issuing condor_hold to all jobs on all submit hos...
How to migrate from gitmaster to gitlab This document explains how to migrate away from gitmaster.atlas.aei.uni-hannover.de to the new gitlab.aei.uni-hannover.de:...
Aptly For the new LDG repo set , we are trying to use aptly as a potential successor to reprepro. The goals are: * support various Debian and Ubuntu releases ...
general questions are dual port 10GbaseT connectors possible Likely, will be clarified. are 40 GBits (QSFP ) possible Not clear yet. what is the meaning of a q...
Checking drive ordering between SL3000 and SAM Following Doc Id 1006246.1 to verify drive ordering matches between SL3000 and SAM/Solaris preparation 1 Shut d...
llldd * TCP connection from CIT (or sites) to special receiver machine (possibly need root access for John Zweizig) possibly SL6???? * from there UDP multic...
Condor Accounting Groups on Atlas In May 2015, LIGO introduced mandatory accounting groups for jobs running on the LIGO data grid (LDG). As Atlas is part of the L...
What is ATLAS? ATLAS is a general purpose compute cluster, located in the Albert Einstein Institute for Gravitational Physics, in Hannover Germany, on the campus ...
HTCondor configuration updates in 2015 (1) Using cgroups to softly enforce memory and core limits Reasoning In the past, we either relied on users' jobs to obey...
Benchmarking distributed file systems stupid fast tests first, all using small compute nodes and use iozone r 32 s $((2**24 2**25)) i 0 i 1 i 2 i 8 O I S...
Cheat sheet for rebooting E@H machines in Hannover If the E@H machines need to be rebooted (e.g. kernel upgrade) here's the proper ordering: Isolated machines Th...
SQLDump tests for Einstein@home On einstein db1 the following was found: # no compression /usr/bin/mysqldump opt master data=2 EinsteinAtHome mbuffer /dev/nu...
This is just an unsorted, unfiltered list of current tasks and services all over the AEI (and beyond) which could be counted as SYSOP related. It is neither compl...
Common guidelines for cluster usage This document describes common pitfalls and guidelines when using a large computing cluster. Some of the details are specifi...
Configuration Management (primer/summary/brainstormer) What's out there? These are not really meant for configuration mgmt (alone) and have their strengths somew...
Distributed/clustered file systems This page should summarize what scenarios such file systems could fulfill within Atlas and what we expect from it. Properties s...
HSM Upgrade July 2014 This is the proposed plan, small changes may be needed Move from x4270 to x4 2l Moving to new meta data server should result in more file s...
Planned steps for Atlas Update to Wheezy Steps to perform: * get new head nodes up and running * reinstall all old nodes a0101...a3842, gpu001...gpu0XX ...
Directory hierarchy for LSC files Storage structure for S4/S5/S6 data (past) In the past we used paths like these H/H1/RDS/C03/L1/H H1_RDS_C03_L1 822092472 60.gw...
Short overview on different configuration possibilities for FC arrays Our DDN FC array currently houses 600 disks (400 3 TB drives and 200 2 TB drives). DDN only ...
Benchmarking bcache Evaluating if bcache can/should be used on our compute nodes. Benchmarking was performed with iozone (revision 3.397) with the command line io...
basic configuration the default baud rate is 9600 first steps Use the serial console with a baud rate of 9600 and do system view interface Route Aggregation 1...
Create Service Data File for 6780 In case a hard drive is about to fail (or has already failed), Oracle support needs one special file collection, to create this,...
Comparing efficiency and size of compressors With samfs and archive dumps being rather large, we needed a good compression scheme for these. We use a 101GB (1033...
HSM upgrade Current status 2014-02-01T12:33Z: samfsdump/final backup stopped due to too many non-archived files. Rushing to archive those as fast as possible. 20...
First steps with Solaris 11 I'm using the old x4440 machine, booting off a solaris 11 CD based on sol 11_1 text x86.iso. You must exit the grub countdown with ESC...
Einstein@Home machine head count Please check this list to make migration and finding/cabling machines easier (racks used for these machines should be 79 (water c...
Building an atlas Kernel This assumes that a stable linux kernel source tree already exists on a machine (bob, /srv/kernel/linux stable), cloned via git clone git...
Rack layout which racks contained what and when Basic information * rack rows are numbered 1 to 10 * Water cooled racks are numbered 1 to 102 * open ...
New networking set up for Atlas From 2008 till 2013 we used a flat networking structure, i.e. all computers on the data network were connected "directly" to the c...
For Users * General Introduction for Users * Useful Items * How ATLAS stores files * ErrorMessages and how to fix them (not updated) General Document...
Category:Network Category:WovenSystems This Page is about our ATLAS.CoreSwitch. You might also want to read the TRX page. Configuration access serial not wor...
ATLAS Web Preferences The following settings are web preferences of the ATLAS web. These preferences overwrite the site level preferences in . and , and c...
Using cgroups to push backfill into the background apt-get update apt-get -y install boinc-client libboinc-app7 cgroup-bin rsync -a n0669:/etc/default/boinc-clien...
LINDY COMPower Switch LITE 8 Main.AlexPost 11 May 2009: This document describes the use of the LINDY 8port Power Strip 32453. The easiest way to access the po...
Create benchmark image with Debian live build apt get install live build lb config archives live.debian.net mkdir /srv/live default cd /srv/live default lb ...
Work planned for cluster shutdown on 2013 01 15 shutdown plan The following services will be shut down 1 all compute nodes possibly with the exception of "r...
Einstein@home RAID setup testing Over time the einstein abp1 download server (mostly BRP project) went through various updates to increase download throughput (ap...
Central benchmark page All our benchmark results should be linked from this page to help rediscovering already performed benchmarks. Ideally, a summary page will ...
1.: We were told that parts requiring maintenance have a warranty period of 2 years. Parts not requiring maintenance have a warranty period of 5 years. Dur...
AddNewUser How to add a new user All this is now done by atlas_adduser.pl from our git repo! The remaining stuff here is done for the fictitious user foo BAR, an...
ZFS Send/Receive Performance Testing Since we want to backup and move users' home file systems regularly between Thumpers we want to have as much speed as poss...
Testing different zpool layouts on Thumper Local testing with iozone Essentially the tests were made with a single iozone run per file system, zfs compression wa...
Iozone on Areca server Command line iozone a g 16G O n 32K y 32 q 32 i 0 i 2 Disk layout The disk layout for the four tests was in RAID1 or RAID10 mode o...
Myricom Benchmarking We received two Myricom NICs for evaluation 03:00.0 Ethernet controller: MYRICOM Inc. Myri 10G Dual Protocol NIC (10G PCIE 8A) These netperf ...
Measured performance with this command line bonnie -s 32768 -d /path/to/dir -u 1000 The results are # data server locally store03,32G,76714,99,358710,47,160723,3...
Testing different file systems on Atlas compute nodes We are using the attached ffsb profiles for testing, these are just a first shot at the problem, but might p...
Atlas Boinc Condor Scheduling As Condor's fetchwork does not seem to work with dynamic slots, we are working on our own "scheduling" system for BOINC Initial tho...
Atlas basic usage guide First things first Be nice to others, others should be nice to you as well :) Please read this aloud: I will be nice to other users, and ...
Move from Debian Lenny to Debian Squeeze Changes * updated packages from upstream Debian * Condor 7.6 with dynamic slots on most execute machines (exceptio...
Small tour of Atlas Atlas is a computer cluster situated in the basement of a university building near the Max Planck institute in Hannover. Since the ceiling is ...
Atlas Compute Node 2008 These Supermicro based machines were bought from Pyramid in 2008. Typical host names n0001 n1680 Spec table Chassis SC811T 300B (1...
Atlas Compute nodes Compute Node 2008 In 2008 we bought 1680 Supermicro based machines from Pyramid, getting a total of 6720 2.4GHz compute cores, 13TB R...
For Squeeze, everything needs to be performed in a chroot, i.e. run cowbuilder --update --basepath /var/cache/pbuilder/base.squeeze.amd64.cow/ # prepare environment apt ...
Shutdown priorities The following list puts priorities on computers, equipment and other items of interest. Computers in racks, which will stay powered up, should...
Where shall I put my (Condor) log files? As always, the correct answer is: It depends. You can put your log files in your home, e.g. assuming your user name is MY...
Windows Cisco VPN Client How to connect to the 10.117.0.0 network of Max Planck Institute with VPN The Cisco VPN Client with the pcf file called AEI 10NET.pfc can ...
Repair an XFS filesystem An XFS file system can become corrupted due to a power cut, a kernel bug or something else. To repair it, please use a recent version of xf...
Virtualbox and USB on gutsy Getting USB support working in virtualbox under Ubuntu gutsy 0) for general important info for virtualbox on gutsy, see FAQ: [31 ...
How to verify a S/MIME signed email (X.509) * save the email into a file * check that the certificate authority for this sender is known in your system, e.g...
Video capture box How to use the video capture box we got from Golm: Some good documentation on how to use VLC for capturing multiple devices and creating a mosaic wi...
Ssh password less Three steps to passless ssh (without ssh agent) ... 1 Use ssh keygen to generate the key pair ssh keygen t rsa ! do NOT use passphrase, ju...
Suse Cisco VPN Installing Cisco VPN Client in Suse If you want to connect with your suse to a VPN network, it's very simple. Now we install a VPN Client and ma...
In order to convert existing SVN repositories to git there are a few simple steps to take. First of all you'll need the following tools (if not already available)...
* install netsnmp pkg get i netsmp * create a configuration file /opt/csw/share/snmp/snmpd.conf rocommunity public disk /atlashome 5% load 12 6 3 syslocati...
How to mount your Atlas home on your computer Using sshfs it is quite simple but could be a little bit on the slower side of life. Prerequisites: 1. Make sure...
How to send automatic SMS messages (via host postfix) Create a temporary file which looks like this: To: 49123456789 This is the SMS text and then move the fil...
How to read failure information on Solaris and how to "repair" old faults. Using fmdump one can look at the past error logs, while fmadm faulty will display fault...
Supersede sendmail with postfix on Solaris 10 1. update blastwave packages: pkgutil u 1. install postfix: pkgutil i postfix 1. disable sendmail: svcadm...
Restart GEOSegDB if service does not work anymore * log into geosegdb either as root/suser/ldbd (x509 certs) * assume ldbd identity * run db2start * a...
Some things to be aware of when running Octave scripts on ATLAS: Octave path searching When an Octave script calls a non built in function, Octave will look thro...
How to export/import tapes from SL3000 Export tapes Explanatory task: tapes from the secondary copy of a file system should be exported * Log into metadata se...
How to recover a file you accidentally deleted? If your $HOME is on a ZFS file system 1 you can easily retrieve files you deleted from an automatic snapshot creat...
This basic tutorial describes the correct usage of the Redmine bugtracker. After the login you will be redirected to your own page which can be customized by usig...
How to rescue data from a broken disk If a disk is "only" throwing errors, but is not entirely dead yet, dd might help, but can cause a lot of grief. This recipe ...
How to protect a web page with .htaccess If you happen to need to protect a web page but not only for LSC usage, you need to perform the following steps: * Cre...
Collect evidence data for an Oracle case (HSM) Based on past experience we should perform the following before restarting samd or even rebooting the machine: ...
using iptables A gateway with two network cards connects a LAN with a WAN. This document describes how to forward ports of nodes in the LAN to ports of the gatewa...
Running compiled Matlab under Condor Problems Multiple users on the same node When running compiled Matlab codes under Condor, you need to be aware that the def...
Netboot This is a simple description of how to boot over a network using a kernel on the remote server. Server side configuration To provide net boot capabilities, yo...
What is LVM ? Note: This may answer the question, why editing the /etc config files does nothing. If you don't have a backup, you can re create the equivalent of ...
Dangers when sourcing the Matlab runtime environment Problem When sourcing the Matlab runtime environment (currently /opt/matlab/2008a/MCR/MCRSetup_R2008a_glnxa6...
Problems with X11 when connecting with MacOS X to headnodes If you experience problems running X programs, e.g. Matlab (even in command line mode), you might be h...
Local ssh configuration You can create shortcuts as well as special settings for hosts you want to connect to in the file $HOME/.ssh/config. The following lines c...
Using ligo_data_find (successor to LSCdataFind) The command line arguments are basically the same as before with LSCdataFind. One new feature is the P or no pro...
We have a pxe bootable live system to examine a node without touching the system on the harddrive. It is basically a self made chroot environment. usage * st...
Jumpstart Solaris How To clone our Solaris Sun boxes: Create flash archive flarcreate n "s01 flash" c R / x /atlashome /atlashome/carsten/s01.flar Our conf...
Keep your X.509 certificate alive after logging out of the cluster On some occasions one needs a valid grid proxy to access data, query the file database using LS...
HOWTO: Guideline to repair an offline computer Evaluate current status (remotely) 1. Try to log in via data network 1. Try to log in via mgmt network 1....
HOWTO Update ILOM Firmware cf. ILOM Howto Update ILOM Firmware * Log in to the ILOM CLI and type "version" to check * Download new Firmware versions http://w...
Main.HenningFehrmann 24 Jul 2008 abstract This page contains detected symptoms and the corresponding hardware problems. It is based on experiences. See the list ...
ATLAS Hardware Resources And Photo Gallery This is about the hardware resources that ATLAS is based on and related photos Computer node There are 1680 Computer n...
Why do I get an openssl related problem when logging into a machine? If you see something like this: GSSAPI Error: GSS Major Status: Authentication Failed GSS Min...
How to use gridftp In general the remote site should be running a gsiftp server which will accept your user credentials (X.509 grid certificate). The full syntax ...
How to monitor the current usage of LSC clusters or Atlas? There are two nice web pages summarizing this information. First of all there is watchtower which is a...
What is fdisk? fdisk is a command line partition manager available for a lot of platforms. Usage general fdisk -l Lists the partition table fdisk /dev/...
How to fix a Solaris boot archive After an update, the boot archive may be invalid/damaged. To fix this, boot into the failsafe mode (usually seco...
How to create a pulling foswiki set up with reprepro Assuming you are already in the location where the repository shall live, perform these magic steps: mkdir co...
This explains how to get FreeDOS working with TCP/IP and ssh. The Problem Sometimes, it is necessary to flash the BIOS, the IPMI card or to set the BIOS. For som...
RFC: Classes list for FAI/Lenny The Etch installation scheme with our FAI server used one class for each type of node (NODE_COMPUTE, STORAGE, ...) but this is som...
Getting VDT up and running on Debian Squeeze Problem description As of 2010 03 12 there is a new openssl version in squeeze which is somewhat incompatible with t...
Debootstrap It is sufficient to read the man page. To debootstrap from local deb mirror: debootstrap distribution dest. directory http://192.168.0.1:9999/deb...
This documentation will briefly summarize how to build packages. At the time of writing this, we focused on packages for Debian etch and lenny in i386 and amd64 v...
This is a short explanation of how to create a Debian source package. Assume you want to build package x.y.tar.gz. 1. Rename that one to package x.y.orig.tar.gz. 2....
Brief Condor dagman HowTo This small article will not get into any depth what DAGs are and will also not explain the terminology of Condor's dagman, for this plea...
Conserver Conserver is a server/client program enabling to pool console connections. The upstream page is at http://www.conserver.com/, in Debian the packages are...
How to create a Debian package? Tutorials: * Debian packaging tutorial by Lars Wirzenius * Debian packaging tutorial * Debian New Maintainer Guide Bad st...
Condor, BOINC and dynamic slots As Condor's dynamic slots don't yet mix with "fetch work", we build the system around a dedicated, cron based system running on a ...
ClusterMonitoring Description We need to monitor frequently various values in the cluster. See Ticket here: https://n0.aei.uni hannover.de/tracking/issues/show/...
Compiling tempo2 on Ubuntu AMD64 The pgplot packages shipped with Ubuntu are not consistently compiled using -fPIC, which results in an unusable cpgplot wrapper l...
Clean the local scratch space Sometimes you want to get rid of all temporary stuff you left on any compute node (e.g. you just finished a big search, all results...
Cleaning your $HOME Quick steps for those who have no time to spare 1. Locate directory which you don't need right now but which produced a lot of small Condo...
How to create a build farm. Installation/Preparation Install the necessary base packages sudo aptitude install pbuilder cowdancer reprepro rebuildd * pbuild...
How to build Matlab Before Installation You need: matlab installation cd ( or iso image ), pbuilder, installation key, license file to activate the product, ping...
updating the DB If you want to add names to the db (e.g. /etc/bind9/atlas.local.db), please make sure to follow these steps: (1) Increase serial number by using t...
Packaging on Bob This page documents how to use git-buildpackage to build packages locally and on the server. Quick recipe (only builds on the server) * insta...
Badblocks on XFS Essentially the same as for ext2/3 as shown on the smartmontools homepage. Please follow this example: smartctl reports bad sector # smartctl ...
Using Badblocks for testing hdds on the nodes one normally uses $ badblocks -v -p10 /dev/sdb which checks the disk. If it makes 10 clean runs it will exit. But if...
How to backup (and restore) a SVN repository Backup First create a hotcopy somewhere where enough free space is available: mkdir /tmp/hotcopy svnadmin hotcopy ...
Automatic environment setup Introduction After the upgrade to Debian Lenny, we finally offer the possibility to automatically set up your environment. This is bei...
Automating telnet Here is a quick and dirty bash script, showing how you could automate some configuration tasks to be done in a menu based telnet session. E.g. to d...
Apache Limit access to certain directories: Add AllowOverride AuthConfig in your config section in httpd.conf or where you define your servers. In the limit...
Building the GSTLAL Dependency Package for Lenny Preparation phase Start in a fresh location by issuing the following commands: mkdir -p gstlal/gst deps 1.0/pack...
Rack numbering We only count the cooled rack space with numbers, the open racks are "counted" with letters. Please look at the following diagram: Compute Nodes T...
IP Scheme for Servers Special IPs This is an IP Scheme for special IPs. For an IP Scheme of the Management and Data Network, refer here. There are separate tabl...
jumpstart A jumpstart server provides the information and configuration used to install other nodes. It serves tftp requests to provide a kernel and installatio...
Cornell Speed Test Very brief test pulling /dev/zero from gridftp1.cac.cornell.edu to atlas2:/dev/null (units are bytes for tcp blocksize and MByte/s for speeds):...
Preparation Start the pbuilder/cowbuilder environment and copy the installer to / inside this environment. aptitude install menu-xdg xdg-utils mkdir -p /usr/share/ico...
To simply create a nice cowbuilder environment for lenny within atlas just use: cowbuilder --create --distribution=lenny --debootstrapopts --arch=amd64 --basepath /var...
Cluster room cleaning schedule This page is about keeping all cluster related rooms in an orderly state, if you have suggestions for modifications, please contact...
what happens before the FAI installation * client * server * provides * * admin action * NICs do a DHCP request (BIOS default ) DHCP server IP addre...
This page contains a list of discovered network problems, corresponding tools and possible solutions. Wrong MAC address port assignment symptom The node is not...
IP Scheme for Servers (Lenny) This is an IP Scheme for Lenny Servers. For an IP Scheme of the Management and Data Network, refer here. For a scheme of Etch server...
IP Scheme for Servers (Squeeze) This is an IP Scheme for Squeeze Servers. For an IP Scheme of the Management and Data Network, refer here. For a scheme of Etch se...
IP Scheme for Servers (Etch) This is an IP Scheme for Servers. For an IP Scheme of the Management and Data Network, refer here. For a scheme of Lenny Servers, re...
IP Scheme The goal is to create a scheme which eases daily life. IP mapping should be more or less straightforward and extensible. For servers look at: * indi...
Overview of Topics on LSC Software packaging for Debian * Packaging Howto 1 * Packaging Howto 2 * How to build personal lal/lalapps on ATLAS * How to ...
Scrubbing local scratch space Scrubber Design Imagine a local partition which is quite crowded and needs to be cleaned. Cleaning should be performed based on fil...
Summary of Atlas Servers/Services If you want a quick overview, please refer to our server matrix page, otherwise just read on. External machines/services Head ...
Collection of HowTos OS hangs Try to reset node. OS hangs, even after reset Possible causes 1. hdd broken look if everything is well wired, change hdd, ma...
This page presents the list of faulty hardware. See and add symptoms. Please add dates here, when the problem occurred and when it was fixed. Henning or Carsten w...
Logcheck mail locations, related scripts and other mail locations on postfix server Log mail location on postfixserver logadmin account 1. Normal logcheck mail...
Case 72065766 Opened 2009 12 01 Synopsis After running zpool scrub on s13 a vdev was marked degraded due to excessive read errors. After exchanging the supposedl...
Sun Fire X4500 The Sun Fire X4500 Servers (nick named "Thumper") are the heart of our $HOME. All user data reside on these boxes. To ensure that there are as few ...
Source Package update sensors Binary Package update sensors Description Updates the database if sensor values change This package provides the infrastructure. Dependen...
Source Package computerdb Binary Package computerdb Description The atlasdb/computerdb to monitor the nodes. Dependencies ruby Maintainer Henning Perl Alexander Post ...
Source Package rails Binary Package rails Description MVC ruby based framework geared for web application development Rails is a full stack, open source web framework i...
Source Package scam Binary Package scam Description: Simple Cluster Administration Management A Web application to manage and administer a cluster Dependenci...
Backup Strategy for users' $HOME Rolling snapshots Each thumper will loop over all users' file systems and create a new snapshot every 6 hours. Old snapshots wil...
The CMC control units are used to monitor values of different sensors in the Rack or in the LCP. They can also be used to set critical values. As soon as a sensor...
Local scratch on nodes local partitions Storing data locally on the nodes is possible everywhere. Please remember you are free to log into any node manually (rsh...
Location The script is currently only placed on atlas1.atlas.aei.uni-hannover.de. Syntax /usr/local/bin/masscp The script has to be started with the path informa...
Benchmarks Network Disk Performance ffsb We are also using ffsb to simulate different workloads. The following tests were performed so far: * Testing diffe...
NOTE: these are the Debian PACKAGE MAINTAINER pages! If you are looking for USER instructions to install LSCSoft on Debian, please visit https://www.lsc group.p...
BACnet BACnet is a facility management protocol. The 'outside cooling' uses BACnet over IP. The protocol is transmitted via UDP packets. BACnet4linux The BACnet4...
An introduction to channel bonding can be found here. Description Why? Round robin is the only way to get more than the speed of a single interface for a single T...
ProCurve Switch 2900 Specification Visit http://www.hp.com/rnd/products/switches/ProCurve_Switch_2900_Series/overview.htm for more detailed information on the 29...
ProCurve Switch 1800 Specification Visit http://www.hp.com/rnd/products/switches/ProCurve_Switch_1800_Series/overview.htm for more detailed information on the 18...
"S.M.A.R.T" stands for "Self Monitoring, Analysis and Reporting Technology". The SMART protocol stores disk sensor values and event counters in a special area of the...
Currently Linux knows four different schedulers: CFQ CFQ, also known as "Completely Fair Queuing", is an I/O scheduler for the Linux kernel which was written by Je...
SNMP http://en.wikipedia.org/wiki/Snmp Standard tools: snmpwalk, snmpget/set mrtg Maybe also: Nagios http://www.zenoss.com/ general Rack We are using SNMP to re...
Root shell on tty9 On startup there is always an open root shell for maintenance on tty9. Feel free to use it carefully! /var/log/messages on tty11 On tty11 you...
Hostnames * The name of a node is n abcd . where abcd is a number between 0001 and 1340. * The data servers are called d ab where ab runs from 01 to 31. ...
Services we want/need to offer Please write down a list of services which we want/need to host along with the contact person who requested this and a brief descri...
Treiberkonzept The university gets alerts and reacts according to the following scheme: smoke alert check the room and inform the MPI in the case of fire of dense s...
You can connect to the switch directly via the serial interface, using minicom . To disable tagging on all interfaces, enter (in Configure mode): switchport all...
outdated did not work. The core network consists of a WovenSystems EFX1000 with 144 10Gb/s network ports. Main.CarstenAulbert 02 Feb 2009 * EFX1000 front i...
lm sensors config atlas Description configuration for lm sensors for the atlas cluster Will install config file '/etc/sensors.conf' and add kernel modules to '/e...
EDAC memory setup On one of our file servers based on Supermicros X7DVL which has 6 memory seats labeled {1,2}{A,B,C}. Unfortunately, one needs two DIMMs (showing...
Labeling the network cables Network cables are labeled according to this scheme: LABEL TYPE Mxx Management uplinks from Allied Telesys switch in rac...
Problems with h2 h2 is the machine which was up first and thus users like it and feel at home there; h1 came later and takes a bit of the load, but the main load i...
A problem with the web interface of the ganglia monitoring service is that really every single node is shown on the first page. This can be really annoying in a big clu...
Sun Fire X4500 We have got a Sun Fire X4500 , a dual Opteron storage server with 48 SATA 500GB drives from Sun Microsystems. Installation It weighs about 80kg...
Main.CarstenAulbert 08 Jan 2009 Special nodes The following nodes are considered special, please help to keep this list up to date! node name assigned job...
Principle The node number a particular file lives on is determined by its filename: take the first six digits of the md5sum of the name, take this as a hex number...
Main.CarstenAulbert 31 Dec 2008 After adding extra nodes in December 2008 we discovered that the SUN servers sent out a continuous stream of ARP requests into the...
Netconsole To enable netconsole on a node, just insert the 'netconsole' module as follows (just change node_ip): modprobe netconsole netconsole=4444@node_ip/eth1,5...
This is how to build and install personal lal/lalapps on ATLAS both use the globally available libmetaio8 lalapps uses the freshly installed personal lal Insi...
Running bonnie on a head node in various configurations Headnode parts: * dual socket, quad core CPUs (E5342) * 16 GB RAM * Areca 1261ML * 4 or 8 H...
Abstract This documentation describes the servers, their corresponding functions and their locations. Table of Server name location function * ip * FAI manage...
All times given are in UTC, usage and backup sizes in full GByte. If the last backup was done within the past 48hrs the times are green, otherwise red. Main.Ca...
Condor The Condor High Throughput Computing System is a software framework for managing workload on a cluster of computers. It is a kind of batch system that di...
abstract CondorDAG stands for Directed Acyclic Graph Condor DAG is a scheme to start jobs in a particular order. The start of subsequent jobs depends on the exit c...
Main.HenningFehrmann 17 Oct 2008 Todo installing the new nodes We get 336 new nodes and fill 8 racks altogether EFX recabling Each TRX has only three uplink...
Fresh install of a X4500 Files to add/overwrite to the default installation root root/.ssh root/.ssh/id_rsa # define local key root/.ssh/authorized_key...
MTA on nodes Following are some MTA candidates for nodes nullmailer Nullmailer is a replacement MTA for hosts, which relay to a fixed set of smart relays. It is ...
What to do if a drive failure occurs? First signs A drive failure might be reported via fmdump or zpool status, e.g. # fmdump -v Jul 16 22:18:52.7769 f52c874e 91...
Main.CarstenAulbert 05 Jun 2008 Data Servers Currently we have 30 data servers up and running. Within the cluster you can get these file areas automounted via /a...
Manual Matlab install 1 Grab copy of the ISO file from AEI's internal fileserver (ToDo: where) 1 Mount image via loopback on computer where it should be ins...
flar A flar image contains the solaris system archive similar to tar balls. It can be used to make a system snapshot and install it on other machines. Generatio...
$HOME file systems Where does the data live? We currently have 12 Sun Thumper X4500 which are used to store users' $HOME file systems. The users are distributed ...
Cluster File Systems A long list of what file systems are available can be found on Wikipedia, for us these file systems seem to be interesting enough for further...
How to determine which user to put onto which Sun server The idea is pretty simple and should work well. From our 12 Sun servers currently set up for users, we wi...
We have got a power strip from Rittal and Bürger Elektroniks with built-in inrush current limiter. For a single node the inrush current limiter works as it is sup...
In order to manage Issue #56 there is some basic knowledge required to use the TDS 3014B. So let's get started: It is highly recommended that you use the external...
Recover from Neterion NIC running at low MTU size Just run this: ifconfig xge0 unplumb # ifconfig xge1 unplumb cd /root/xge 2.0.7.6641 solbin/ ./install.sh jumbo ...
Working Schedule for UPS test on July 14th, 2008 The "Plan" This is a rough plan of actions which need to be performed on that day (all times are CEST = UTC+02:00) ...
Data available via LDR Our LDR server is named ldr.aei.uni-hannover.de (externally) and ldr.atlas.local (internally). You may need to use this environment variabl...
AEI/Han LDR box is 130.75.117.101 with user hanrobot. LDR is installed under /srv/LDR and before using it, you must source the appropriate files and also set the ...
User mapping scheme This is the first idea, it should already work nicely, but we may need to adjust it in the future. Initial user mapping We will start with 10...
Case38104140 Sun in principle acknowledged this error, but said this is Neterion's business. Neterion is working to reproduce this error. I love hotline ping pong...
Condor Job on Hold How to get the output: condor_q | awk '/ H / {print $1}' | xargs -n 1 condor_q -long | grep HoldReason Error: Failed to open file Error Message...
Error messages and their (likely) way of fixing them Hardware Software You can find a very short description of the symptoms here, for more information, follow ...
Though ATLAS has strong data servers, accessing files on the local hard drive of the nodes is much faster, and given that you could read files in parallel accessi...
SNMP OIDs on SunFire X4500 Here are some OIDs for Thumper Events: FAN front: OK=7 FAIL=5 enterprises.42.2.70.101.1.1.2.1.3.29 FAN 0 enterprises.42.2.70.101.1.1.2.1.3....
Useful tools making your life easier dsh the distributed/dancer shell Introduction The dancer shell is very useful to run identical commands on the whole cl...
Possible zpool configurations In the standard setup there are two system disks (either on controller c5 or c6 depending if one installs the box via a USB device ...
Mirroring system disks Right now jumpstart fails to create a mirrored system on our thumpers (see the corresponding Case ID). Thus it is required to create the mirror after...
Case ID38101108 X4500 reboots infinitely when jumpstarting with mirror setup We encountered a problem, that when setting up a mirrored system disk during jumpst...
Performance problems with X4500 using ZFS with NFSv3 When moving a large number of small files onto our X4500 boxes we found very bad performance numbers: Job ...
This is the condor local configuration used on ATLAS global HOSTNAME=$(hostname) RELEASE_DIR=/opt/condor LOCAL_DIR=/local/condor.$(hostname) CONDOR_ADMIN=root@lo...
Here are some links to useful Solaris documentation: Thumper * Sun Fire X4500 Server Administration Guide * Specifications ZFS * Manpage zfs * Tips for ...
Documentation on making fake data containing signals from pulsars in tight binaries 1. REDIRECT Making fake data containing signals from pulsars in tight binar...
SSL We now have the Wiki as well as the trac system working with HTTP over SSL. From now on, all traffic should be encrypted and one should be forwarded to port 4...
(Open)Solaris Solaris is an operating system designed by Sun Microsystems. In 2004 it was open sourced in the OpenSolaris project. * Solaris overview ...
Books for AEI Hannover Wishlist of books to order for AEI Hannover Library * Robert M. Wald, "General Relativity", Library * Sivia, Skilling, "Data Analysi...
Wiki Evaluation for pulgroup or getting away from the evil elog ... Motivation and overview There are various well known problems with our current pulgroup page...
useful tricks Print maximum number of errors (output from EFX, data is on c0 under /scr/EFX/RRD/db): ls err*rrd | xargs -i rrdtool graph /tmp/test DEF:err_r={}:e...
SMSWarningConditions This is also a todo list which needs to be implemented. An SMS is sent out to a list of numbers whenever an email is sent to sms@postfixser...
Category:Kernel Category:Network Finally the Channelbonding Documentation has been rewritten. I hope its new structure helps answering the questions! Some illustr...
File Staging Area File containing a list of missing Results from S5R1: /MI_missingS5R1a.txt.bz2 Einstein@Home Talk for APS 2008: /EinsteinAtHomeS4R2 G080189...
SNMPTRAPD In order to receive traps one has to install snmpd on a node. Edit the /etc/default/snmpd file and set the TRAPDRUN flag to yes. You will find trap ...
PLEASE DO NOT REMOVE ANY ITEM FROM THIS LIST. IF IT IS COMPLETE, THEN PLEASE MOVE IT TO THE END OF THE LIST UNDER "DONE". TO DO ClusterRoom Order fan trays ...
Configuration settings Change the configuration by using setup when logged in via ssh Increase ARP cache for this box The box comes with Linux kernel defaults, i...
Flat Fstat Metric LOGBOOK 15:48, 16 March 2008 (CET) Problematic result of old FstatMetric code is seen by running compareMetrics.m, and plotting the result usi...
NIS NIS, developed at Sun as "Yellow Pages" (see the names of its commands and configuration files), is a client-server directory service protocol. Visit a NIS HOWTO Configuration ...
NUTTunnel This script should enable you to tunnel UDP connections from .102 to the UPS in the prototype hall: # cat nut tunnel.sh #!/bin/bash # this is ugly ...
DNS scheme The internal domain is atlas.local (managed by us), external domain might be atlas.uni-hannover.de (managed by RRZN) Nodes Nodes have three networking...
SecondTest Compute Nodes A script is run for basic testing, here are the current findings: NodeError messages n0001 Permission denied. n0002 Permission denied. n0...
ManagementSwitches Test on two kinds of management switches AlliedTelesis AT 750/48 and Linksys SRW248, both have 48 Ports 10/100T and 4 gigabit uplink ports. Al...
Description Observations low load on the system 1. When the cooling system (DC4200 and Rack) are set to 'auto' on most of their switches and the thermal load...
Monday 2008 02 04 30 nodes in rack down at about 16:00 UTC All computers and switches were off by about 16:11 UTC Tuesday 2008 02 05 Computers were started at th...
PostgreSQL Setup On Debian, the PostgreSQL package automatically creates a user 'postgres'. For a quick start, do the following (assuming you installed version 7...
Areca test 4 Abstract confirm that the controller automatically repairs bad sectors, without need of a manual volume check Test steps * turn off the storage ...
Areca test 3 Abstract check if the controller does auto rebuild Test steps * there was a running rebuild, so I first powered off (init 0) * the drive in r...
Areca test 2 * first see areca_test_1 Abstract check if the controller repairs the drive while reading bad sectors HDD prepare * use the same setup as in...
Areca test 1 Abstract damage the raid area to test the error recovery function Initial setup Raid Subsystem Information Controller Name ARC 1261 Firmware Vers...
Stundenzettel Send your worked hours to Stundenzettel@lns01.aei.mpg.de. Subject: "worked" Insert the entries into the mail body. Format: ISO Date,Starttime,Endtime,De...
Netperf test with irqbalance userspace * same set up as in round 3, but with irqbalance user space daemon * virtually no change Setting eth1 tso off, ring ...
Category :Network Network Testing To ensure maximum network performance, we ran several tests. First only with a cable running between two nodes then with one or...
Round 3 * same test as in round 2 (burst mode for TCP_RR, c C everywhere) * new this time: tso off for nic Setting eth1 tso off Setting tcp congestion c...
Netperf results back2back round 1 * netperf 2.4.4 (Debian) * no special compile options * no special command line options, only UDP_STREAM, TCP_STREAM, T...
master script This script just uses dsh to call the various tests on the nodes. UDP_STREAM and TCP_STREAM are run first in single and in dual direction mode (with...
Steps performed * Power off data server * Take one drive out of the data server and put it into powered off compute node * power on compute node * cho...
Category: Rack Rittal rack design The water cooled Rittal racks are composed of two rack units sharing one Liquid Cooling Package LCP . The racks are on the rig...
With 1340 nodes it might be wise to split services among many boxes to ensure that not half of the cluster is waiting for a single server to serve data through its...
These items need to be fixed * mcelog needs to be installed and evaluated * maybe the former two and smartctl can be integrated into single script (Carsten ...
FirstTest LV compute The following items need to be checked for compute nodes (please add your name to the test you have performed). If you need much more space p...
single threaded bonnie Running bonnie in single threaded mode with this short script yielded the following results. script #!/bin/bash SCHEDS="anticipatory ...
Nodes * Power factor is too low, use different PSU! * Need electronic document with all MAC numbers (eth0,eth1,IPMI) * Why does SM suggest IPMI to have sam...
FirstTest LV storage The following items need to be checked for compute nodes (please add your name to the test you have performed). If you need much more space p...
ToDoStegmann * Very sharp edges in cable trays, need to be grinded^Wground(?) or capped with rubber * Some of the Cat7 cables are damaged one can see the ou...
These items need to be fixed * temperature monitoring needs to be set (much higher) 55C at least * mcelog needs to be installed and evaluated * maybe the fo...
This document simply follows the instructions given in http://www.lsc-group.phys.uwm.edu/LDR/doc/ldr.pdf. Prepare user and place where LDR will reside As root: ...
Enable blastwave package repository pkgadd -d http://www.blastwave.org/pkg_get.pkg With this you can install nice tools such as gtar, gcc, wget with pkg-get ins...
IPMI stands for Intelligent Platform Management Interface http://en.wikipedia.org/wiki/Intelligent_Platform_Management_Interface Issue IPMI has various purposes I...
Category : Network ethtool is used for querying settings of an ethernet device and changing them. Usage ethtool -y ethX ethtool -Y ethX ... Lowercase paramet...
SSH SSH User settings User settings are usually found in ${HOME}/.ssh/config and can be created by any text editor. Each setting can be specific to all hosts, a...
Compute cluster racks Each compute cluster rack will be filled with 42 compute nodes. Each compute node needs: * Height units needed: 1 * Ethernet conn...
Automatic testing To enable Woven techs to see the results of their tests, the following program will be run (until killed): Master The master process running on...
Initial set up Nodes o460 o498 (except o467) and nodes o536/o537 are connected to TRX100 1. Nodes o499 o530 and o538 o545 are connected to TRX100 2. Plan: The EFX...
Category:fai FAI is short for "Fully Automatic Installation". Installation List of important Files * /etc/fai/make-fai-nfsroot.conf add location of the Image...
NodeTests Tasks to do: Initial work (HP) Manual work * Blank disk of node, wipe by: dd if=/dev/zero of=/dev/sda; sync * Put MAC address into DHCP table o...
Category:Linux What is Postfix? Postfix is a mail transfer agent (MTA), a program for the routing and delivery of mail. Mails to be delivered on the local machin...
server Install rsh server using rhosts If you want passwordless access, edit $HOME/.rhosts in the home directory and add the nodes from which ac...
Category:Network Transmission Control Protocol (TCP) visit Wikipedia on TCP. Congestion control Is used to optimize the sender's behaviour to the current netwo...
Boundary Conditions Racks * Cooling: up to 42*220W = 9.24 kW per rack * Electrical Power: up to 9.24 kW per rack * 42 horizontal and a few vertical heig...
Faimond faimond catches installation messages on port 4711 sent by the clients. The clients use netcat to deliver the current status of the installation tasks like...
Lm-sensors lm-sensors is a hardware monitoring tool for Linux (cf. http://www.lm-sensors.org/). Installation apt-get install lm-sensors modprobe i2c-dev i2c...
Softupdate Softupdate runs through the fai installation and performs all the changes which have been made after the installation process. On the client side f...
Logcheck logcheck and syslog summary Logcheck This program periodically (default: every hour) searches through the logs in /var/log (generated by syslog or sys...
Category:Network VLANs are described in IEEE 802.1Q. Description VLANs add a 4-byte extension to the standard Ethernet frame, the VLAN tag. The tag is set by the...
Category: Network NFS stands for Network File System. The available manuals are sufficient. If it does not work after following the instructions, it may help to r...
Category : Network Overview NetPerf consists of two programs: * netperf, a command line tool to generate network traffic to a host * netserver, the daemon...
Category:Network Networking Layout for Foundry RX32 Switch The Switch The Foundry Networks switch RX32 (BigIron RX Series Switches [6]) has 32 slots, whi...
Tips and Tricks with svn Adding all not yet added files to a repository This should work: svn status | grep ^? | awk '{print $2;}' | xargs svn add or without the grep sv...
GlusterFS Abstract GlusterFS is a clustered file system capable of scaling to several petabytes. It aggregates various storage bricks over Infiniband RDMA or TC...
Automake The GNU Automake and Autoconf tools exist to generate the configure and makefiles needed to compile from source. This is mostly the case when the source i...
Checkinstall Checkinstall is a tool to easily generate a debian package from a source tree. To use checkinstall, simply substitute the command with "make install"...
Cisco IPX Category: Network The first step is to gain the Cisco device an ssh connect, we must connect with the blue(!?) Console Cable with the one side to the PC...
BIOS SuperMicro Although it is possible at least on AMD based A servers to flash the BIOS using the flashrom tool, it is unfortunately impossible to use /de...
ZFS ZFS is a file system created by Sun Microsystems for Solaris/OpenSolaris in 2005. ZFS combines concepts of filesystem and volume management. The name ZFS ori...
Benchmarks General description For the Leistungsbeschreibung of the Morgane cluster in 2006, we compiled a statically linked benchmark program which can be found...
Apt proxy Category:Linux The apt proxy is running on n0, reachable via 192.168.0.1 or 130.75.117.77 both at port 9999 To Use it, add to your /etc/apt/sources.list...
Netcat Category:Network Netcat is a tool to generate network traffic. Send nc -p sourceport destipaddr port and Receive nc -l -p listenport sourceipaddr port ...
Flow tools Category:Network flow tools contains some nice tools to measure network performance. Tools flow gen flow gen generates data that can be piped to flow...
Sophos Antivir on Samba The goal is to have a fileserver with live virus protection. We need: * samba sources (I used samba 3.0.22 13.16 because the new one...
FAI Installation Category:fai Fai Installation at the Max Planck Institute This command installs FAI with all dependencies: aptitude install fai-quickstart P...
Suse Hacks SLES/openSuse Everyone must be able to install or delete software from the shell. To do this your Suse needs a repository with the packages. Take Ca...
Install script Category:Installation The install script, as of Friday, May 25th 2007 #!/bin/bash export IMAGE_NAME=system image.tar.bz2 cd /tmp echo get...
Category:Howto What is initrd ToDo A few hints on initrd The Script that is executed after Startup is located /usr/share/initrd tools/lnuxrc Runlevel? Files,...
Category: Installation autonomous installation debian creating a package list In order to use the package list of a well-installed node, type dpkg --get-selectio...
Using the chroot operation, one can reroot a group or user to a different directory. The called program cannot reach further than the newly defined root. This wa...
Grsecurity Category: Kernel Grsecurity provides a kernel patch to enable additional security features. * constrains the write access in the kernel space ...