What is ATLAS?

ATLAS is a general-purpose compute cluster located at the Albert Einstein Institute for Gravitational Physics in Hannover, Germany, on the campus of the Leibniz University Hannover. ATLAS was designed by Bruce Allen, Carsten Aulbert, and Henning Fehrmann, and is primarily intended for the analysis of gravitational-wave detector data. It has a large number of CPU cores and a large amount of storage space (both central and distributed). The main design goal was to provide very high computing throughput at very low cost, primarily for 'trivially parallel' analysis. However, it can also efficiently run highly parallel low-latency codes.
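A 'trivially parallel' workload is one in which each piece of input data can be analyzed independently, with no communication between tasks. The following Python sketch illustrates the idea only; the segment list and the analyze() function are hypothetical placeholders, not the actual search pipelines run on ATLAS.

```python
# Illustrative sketch of a trivially parallel workload: independent data
# segments are processed with no communication between tasks. The segment
# layout and analyze() body are hypothetical, not ATLAS-specific code.
from multiprocessing import Pool

def analyze(segment):
    # Placeholder for per-segment work (e.g. filtering one stretch of data).
    start, stop = segment
    return (start, stop, sum(range(start, stop)))  # dummy result

if __name__ == "__main__":
    # 1000 independent segments; each can run on any free core or node.
    segments = [(i * 4096, (i + 1) * 4096) for i in range(1000)]
    with Pool() as pool:  # uses all local cores
        results = pool.map(analyze, segments)
    print(len(results), "segments processed independently")
```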

According to the June 2008 Top500 list, ATLAS was ranked the 58th fastest computer in the world and the 6th fastest in Germany. It was also the fastest computer in the world based on Gigabit Ethernet networking. Although the cluster was later expanded, we never re-ran the Linpack benchmark, and ATLAS consequently dropped out of the Top500 list in June 2011.

Our computing and storage capacity has grown steadily, as the following table shows:

| Year | Compute cores | Special hardware | Central data storage capacity | Central user storage space | Distributed (node) storage | Networking capacity |
| 2008 | 5368 @ 2.4 GHz | | 300 TB | 230 TB | 550 TB | 1440 Gbit/s full duplex |
| 2009 | 6720 @ 2.4 GHz | | 300 TB | 230 TB | 1180 TB | 1440 Gbit/s full duplex |
| 2010 | 6720 @ 2.4 GHz + 264 @ 2.0 GHz | 264 GPGPUs (Nvidia C1060 + C2050) | 700 TB | 310 TB (online) + 500 TB (tape) | 1250 TB | 1920 Gbit/s full duplex |
| 2011 | 6720 @ 2.4 GHz + 264 @ 2.0 GHz | 264 GPGPUs (Nvidia C1060 + C2050) | 900 TB | 1200 TB (online) + 1500 TB (tape) | 1250 TB | 1920 Gbit/s full duplex |
| 2013 | 8044 @ 3.1 GHz | 120 TB SSD space | 2.2 PB | | | |
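As a rough illustration of how throughput scales with the core counts and clock rates in the table, the sketch below estimates the theoretical peak for the 2008 and 2013 configurations. The figure of 4 double-precision FLOPs per core per cycle is an assumption (typical for SSE-era Xeon cores), not a number from this page; newer AVX-capable cores can retire more per cycle, and measured Linpack results are always lower than the theoretical peak.

```python
# Back-of-the-envelope peak-throughput estimate from the table above.
# FLOPS_PER_CORE_PER_CYCLE = 4 is an assumption, not a figure from the page.
FLOPS_PER_CORE_PER_CYCLE = 4

configs = {
    2008: [(5368, 2.4e9)],  # cores, clock in Hz
    2013: [(8044, 3.1e9)],
}

for year, parts in configs.items():
    peak = sum(cores * clock * FLOPS_PER_CORE_PER_CYCLE for cores, clock in parts)
    print(f"{year}: ~{peak / 1e12:.1f} TFLOP/s theoretical peak")
```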

Compute nodes

ATLAS currently has 1680 compute nodes with one Xeon 3220 CPU each. The compute nodes consist of SuperMicro PDSML-LN2+ motherboards inside a SuperMicro SC811T-300B 1U chassis. Each CPU has 4 cores clocked at 2.4 GHz. Each compute node has 8 GBytes of memory, so each CPU core has access to 2 GBytes of memory on average. Each compute node has a 500 GByte SATA disk in a hot-swap carrier, and a second (currently empty) carrier for an additional data storage disk. In 2010 we added 66 GPGPU servers for special purposes, each with a four-core Xeon CPU (2.0 GHz) and four Nvidia C1060 or C2050 cards.
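To see where the figure of roughly 2 GBytes per core comes from, one can read the core count and total memory directly on a node. The sketch below uses only standard Linux interfaces (os.cpu_count() and /proc/meminfo); it is not ATLAS-specific tooling.

```python
# Minimal sketch (Linux only) to confirm the per-node figures quoted above:
# 4 cores and 8 GB of RAM, i.e. roughly 2 GB per core.
import os

cores = os.cpu_count()

with open("/proc/meminfo") as f:
    mem_kib = int(next(line for line in f if line.startswith("MemTotal")).split()[1])

mem_gib = mem_kib / 1024 / 1024
print(f"{cores} cores, {mem_gib:.1f} GiB RAM -> {mem_gib / cores:.1f} GiB per core")
```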

Central storage

For central storage, there are 40 Linux-based data servers which can store about 10 TBytes or 28 TBytes of data each. These data servers use Areca 1261 RAID-6 controllers with 16 x 750 GB or 16 x 2000 GB Hitachi SATA disks, and each has 8 CPU cores and 16 GB of memory. In addition, we have 13 Solaris-based Sun Fire X4500 data servers with about 18-19 TBytes of storage space each, which are used for home directories. Each user has their home directory on an independent (individual) ZFS file system. Periodic 'snapshots' of these file systems are retained for backup purposes on different data servers. Each X4500 data server is connected to the core network switch with a dedicated 10 Gb/s Ethernet link. In 2010 we also added a hierarchical storage management system, which features about 80 TB of usable disk space on fast hard disks and is backed by a 1 PB tape library that stores two copies of the data to ensure redundancy.
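The quoted per-server capacities follow from the RAID-6 layout: two disks' worth of space hold parity, so 16 disks provide 14 disks of usable capacity. A minimal sketch of that arithmetic, using the disk sizes given above:

```python
# RAID-6 keeps two disks' worth of parity, so usable space is (disks - 2) * size.
# Disk counts and sizes are taken from the paragraph above.
def raid6_usable_tb(num_disks, disk_gb):
    return (num_disks - 2) * disk_gb / 1000  # vendor TB (10^3 GB)

print(raid6_usable_tb(16, 750))   # ~10.5 TB -> "about 10 TBytes"
print(raid6_usable_tb(16, 2000))  # 28.0 TB  -> "28 TBytes"
```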

Networking

All of these storage and computing nodes are connected with a non-blocking 1 Gb/s + 10 Gb/s Ethernet network (although we are currently running in a slightly oversubscribed configuration). Each compute node has two 1 Gb/s Ethernet links. One link is connected to an (oversubscribed) management network. The second link is connected to a Woven Systems TRX-100 top-of-rack switch. This switch has 48 x 1 Gb/s ports (of which we use 42 in the compute node racks) and 3 x 10 Gb/s ports. The three 10 Gb/s ports from each top-of-rack switch are connected to a Woven Systems EFX-1000 core switch. This core switch is currently one of the highest-capacity Ethernet switches available, with 144 x 10 Gb/s Ethernet ports.
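The 'slightly oversubscribed' remark can be quantified from the port counts given above: each top-of-rack switch aggregates 42 x 1 Gb/s node links onto 3 x 10 Gb/s uplinks. A small sketch of that calculation:

```python
# Oversubscription ratio per top-of-rack switch, using the port counts above.
downlink_gbps = 42 * 1   # compute-node links per rack switch
uplink_gbps = 3 * 10     # uplinks to the EFX-1000 core switch

print(f"oversubscription ratio: {downlink_gbps / uplink_gbps:.2f} : 1")  # 1.40 : 1
```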

After Fortinet acquired Woven Systems in 2009, we moved to a more complex three-stage setup: two core switches at the center (each with up to 144 x 10 Gbit/s ports) mesh together up to 24 24-port switches, which in turn connect to the aforementioned TRX-100 (now FS-100) top-of-rack switches.

Facility

ATLAS is located in a custom-designed data center in the basement of the AEI laboratory building. The compute servers and storage servers are mounted in sealed Rittal 42U racks. Each pair of racks is cooled by a Rittal LCP+ cooling system, which transfers the heat from the computers into water. This maintains the compute nodes at 20 degrees Celsius and contains the noise and airflow within the racks. Chilled water is provided by a dedicated chilling plant consisting of three external 200 kW chillers and a heat-exchanger system. All the cluster-room electrical power is provided by a dedicated 640 kW MGE Galaxy 6000 UPS system. This provides a minimum of six minutes of runtime if mains power is lost, so that the computing systems can be shut down smoothly without any data loss or file system corruption.
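As a quick sanity check on the UPS sizing, 640 kW sustained for at least six minutes corresponds to roughly 64 kWh of stored energy. This is simple arithmetic on the figures above, not a vendor specification:

```python
# Minimum stored energy implied by the UPS figures quoted above.
load_kw = 640
runtime_min = 6

energy_kwh = load_kw * runtime_min / 60
print(f"minimum stored energy at full load: {energy_kwh:.0f} kWh")  # 64 kWh
```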
