What is ATLAS?
ATLAS is a general-purpose compute cluster located at the Albert Einstein Institute for Gravitational Physics in Hannover, Germany, on the campus of the Leibniz University Hannover. Atlas was designed by Bruce Allen, Carsten Aulbert, and Henning Fehrman, and is primarily intended for the analysis of gravitational-wave detector data. It has a large number of CPU cores and a lot of storage space (central and distributed). The main design goal was to provide very high computing throughput at very low cost, primarily for 'trivially parallel' analysis. However, it can also efficiently run highly parallel low-latency codes.
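To make the 'trivially parallel' design goal concrete, the sketch below shows the pattern the cluster is optimized for: many independent jobs, each analyzing its own stretch of data, with no communication between them, so throughput scales simply with the number of cores. This is only an illustration; analyze_segment is a hypothetical placeholder and not part of any actual ATLAS pipeline.

```python
# Minimal sketch of a 'trivially parallel' workload: many independent jobs,
# each processing its own data segment, with no inter-job communication.
# 'analyze_segment' is a hypothetical stand-in for a real analysis code.
from multiprocessing import Pool

def analyze_segment(segment_id):
    """Stand-in for the analysis of one independent data segment."""
    # ... load the segment, run the search, write results ...
    return segment_id, "done"

if __name__ == "__main__":
    segment_ids = range(1000)          # independent units of work
    with Pool(processes=4) as pool:    # e.g. one worker per core of a node
        results = pool.map(analyze_segment, segment_ids)
```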
According to the June 2008 Top 500 list, ATLAS was the 58th fastest computer in the world, the 6th fastest in Germany, and the fastest computer in the world based on Gigabit Ethernet. Although the cluster was later expanded, we never re-ran the Linpack benchmark, and Atlas consequently dropped out of the Top 500 list in June 2011.
Our computing and storage capacity increases steadily as you can see from the following table:
Compute nodes
ATLAS currently has 1680 compute nodes, each with one Xeon 3220 CPU. The compute nodes consist of SuperMicro PDSML-LN2+ motherboards inside SuperMicro SC811T-300B 1U chassis. Each CPU has 4 cores clocked at 2.4 GHz. Each compute node has 8 GBytes of memory, so on average each CPU core has access to 2 GBytes of memory. Each compute node has a 500 GByte SATA disk in a hot-swap carrier, and a second (currently empty) carrier for an additional data storage disk. In 2010 we added 66 GPGPU servers for special purposes, each with a four-core Xeon (2.0 GHz) and 4 Nvidia C1060 or C2050 cards.
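As a back-of-the-envelope check, the per-node figures above imply the following aggregate capacities (my arithmetic, not an official specification):

```python
# Aggregate capacity implied by the per-node figures above
# (back-of-the-envelope arithmetic, not an official specification).
nodes = 1680
cores_per_node = 4            # one quad-core Xeon 3220 per node
mem_per_node_gb = 8
disk_per_node_gb = 500

total_cores = nodes * cores_per_node                    # 6,720 CPU cores
total_mem_gb = nodes * mem_per_node_gb                  # 13,440 GB of RAM
total_local_disk_tb = nodes * disk_per_node_gb / 1000   # 840 TB of local disk

print(total_cores, total_mem_gb, total_local_disk_tb)
```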
Central storage
For central storage, there are 40 Linux-based data servers which can store about 10 TByte or 28 TByte of data each. These data servers use Areca 1261 RAID-6 controllers with 16 x 750 GB or 16 x 2000 GB Hitachi SATA disks, and have 8 CPU cores and 16 GB of memory each. In addition, we have 13 Solaris-based Sun Fire X4500 data servers with about 18-19 TByte of storage space each, which are used for home directories. Each user has their home directory on an individual ZFS file system. Periodic 'snapshots' of these file systems are retained for backup purposes on different data servers. Each X4500 data server is connected to the core network switch with a dedicated 10 Gb/s Ethernet link. In 2010 we also added a hierarchical storage management system, which features about 80 TB of usable disk space on fast hard disks and is backed by a 1 PB tape library that stores two copies of the data to ensure redundancy.
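The per-server capacities quoted above are consistent with the RAID-6 overhead of two parity disks out of sixteen. A rough check, assuming the disk sizes are in vendor (decimal) units:

```python
# RAID-6 keeps (N - 2) disks' worth of usable space; two disks' worth of
# capacity go to parity. Disk sizes are taken in vendor (decimal) units.
def raid6_usable_tb(num_disks, disk_size_gb):
    return (num_disks - 2) * disk_size_gb / 1000.0

print(raid6_usable_tb(16, 750))    # 10.5 TB -> the "about 10 TByte" servers
print(raid6_usable_tb(16, 2000))   # 28.0 TB -> the "28 TByte" servers
```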
Networking
All of these storage and computing nodes are connected by a non-blocking 1 Gb/s + 10 Gb/s Ethernet network (although we are currently running in a slightly oversubscribed configuration). Each compute node has a pair of Gb/s Ethernet links. One link is connected to an (oversubscribed) management network. The second link is connected to a Woven Systems TRX-100 top-of-rack switch. This switch has 48 x 1 Gb/s ports (of which we use 42 in the compute node racks) and 3 x 10 Gb/s ports. The three 10 Gb/s ports from each top-of-rack switch are connected to a Woven Systems EFX-1000 core switch. This core switch is currently one of the highest-capacity Ethernet switches available, and has 144 x 10 Gb/s Ethernet ports.
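The 'slightly oversubscribed' remark can be quantified directly from the port counts above; the arithmetic is mine, not an official figure:

```python
# Oversubscription of one top-of-rack switch, from the port counts above.
downlink_gbps = 42 * 1     # 42 compute nodes at 1 Gb/s each
uplink_gbps = 3 * 10       # three 10 Gb/s uplinks to the core switch
print(downlink_gbps / uplink_gbps)   # 1.4 -> "slightly oversubscribed"
```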
After Fortinet acquired Woven Systems in 2009, we moved to a more complex three-stage setup, with two core switches at the center (each with up to 144 x 10 Gb/s ports) meshing together up to 24 x 24-port switches, which in turn connect to the aforementioned TRX-100 (now FS-100) top-of-rack switches.
Facility
Atlas is located in a custom-designed data center in the basement of the AEI laboratory building. The compute servers and storage servers are mounted in sealed Rittal 42U racks. Each pair of racks is cooled by a Rittal LCP+ cooling system, which transfers the heat from the computers into water. This maintains the compute nodes at 20 °C and contains the noise and airflow within the racks. Chilled water is provided by a dedicated chilling plant consisting of three external 200 kW chillers and a heat-exchanger system. All of the cluster-room electrical power is provided by a dedicated 640 kW MGE Galaxy 6000 UPS system, which provides a minimum of six minutes of runtime if mains power is lost, so that the computing systems can be shut down cleanly without data loss or file system corruption.
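As a rough consistency check of the facility numbers (my arithmetic, assuming the UPS is loaded at its full 640 kW rating), six minutes of runtime corresponds to about 64 kWh of stored energy, and the combined chiller capacity roughly matches the electrical load:

```python
# Rough figures implied by the facility numbers above (assumes full load).
ups_power_kw = 640
runtime_min = 6
stored_energy_kwh = ups_power_kw * runtime_min / 60   # 64 kWh at full load

cooling_kw = 3 * 200   # three 200 kW chillers, roughly matching the 640 kW load
print(stored_energy_kwh, cooling_kw)
```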