
FirstTest LV compute

The following items need to be checked for compute nodes (please add your name to the tests you have performed; if you need more space, please add footnotes).

The system design should be as simple as possible and all the components should be standard/commodity ones.:
State items present in box (just the single items for our records, details will be queried later) [CA]
Intel Xeon X3220 CPU
SuperMicro PDSML-LN2+ Motherboard (Rev 1.01)
SuperMicro 1UIPMI-B (Rev. 3.01)
4x ATP AJ56K72G8BJE6S RAM (2048 MB DDR2-667 ECC each)
Hitachi Deskstar HDP725050GLA360
SuperMicro Chassis SC811-SCA
Ablecom SP302/1S PSU (300W)
All nodes should be able to run the same Linux operating system version.:
[X] Debian etch able to run? [CA]
All components (compute nodes, storage nodes, head nodes) should be installed on slide rails that permit easy removal and internal access. The slide rails must be rated for the necessary weight.:
[X] Slide rails available? [CA]
[X] Easy operation? Yes, but the box needs to be removed from the rack if the cover is to be opened [CA]
The amount of power and cooling available for this cluster are 220W per HU on average.:
[X] Wattage of box less than 220W per HU [CA]
How much? Booting: 150W initially, dropping to 100W (idle). Full stress: 150W [CA]
What is the power factor?:
[ ] Cos phi > 0.9 NO [CA]
Measured: Idle: 0.82, Full load: 0.86 [CA]
Putting more boxes onto a single power phase yielded better results.
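For rack planning, the measured power factors translate into apparent power (VA) as follows. A small sketch using the wattages and power factors quoted above; the 230 V / 16 A phase rating is an assumption for illustration, not a measurement:

```python
def apparent_power_va(real_w: float, power_factor: float) -> float:
    """Apparent power S = P / cos(phi)."""
    return real_w / power_factor

# Full load: 150 W real power at cos(phi) = 0.86
s_full = apparent_power_va(150, 0.86)   # ~174 VA
# Idle: 100 W at cos(phi) = 0.82
s_idle = apparent_power_va(100, 0.82)   # ~122 VA

# Nodes per assumed 230 V / 16 A phase, limited by apparent power:
phase_va = 230 * 16                     # 3680 VA
nodes_per_phase = int(phase_va // s_full)
print(round(s_full, 1), round(s_idle, 1), nodes_per_phase)
```

The gap between real and apparent power is why phase loading improves when more boxes share a phase: the breaker and cabling must be sized for VA, not W.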
It should be possible to shut down and power up the entire cluster remotely. Hardware monitoring (IPMI v2.0 or better) should include the monitoring of the basic health (temperatures, voltages, fans) of all types of nodes and Serial over LAN (SoL) features for remote console access.:
[X] Hardware monitoring via IPMI possible? [CA]
[X] IPMI temperatures (which?) (CPU, System) [CA]
[X] IPMI voltages (which?) (CPU Core, DIMM, 3.3V, 5V, 5VSB, 12V, -12V, Battery) [CA]
[X] IPMI fan speeds (which?) (fan1 - nearest to PSU, fan2, fan3 - furthest from PSU)
[X] SoL working?
Compute nodes should have x86-64 compatible CPUs (also known as AMD64, Intel64, EM64T):
[X] Xeon 3220 (64bit extensions present) [CA]
Each compute node should have at least one processor; multiple-processor nodes are allowed.:
[X] 1 socket [CA]
Each processor should have at least one core; multi-core processors are allowed.:
[X] 4 cores/socket [CA]
Each compute node should have at least 2 GB of memory per core.:
[X] 8 GB present, i.e. 2 GB/core [CA]
The memory type should be the fastest standard ECC RAM matching the CPU and chipset. Registered/fully buffered memory should be chosen if needed for reliable operation.:
[X] ECC [CA]
[X] Fastest available for this platform? [CA]
When the nodes are provided with sufficient volumes of cooling air at 23 degrees Celsius and operated at full load (CPU and disk) the MTBF (Mean Time Between Failure) of each system power supply will be at least 100,000 hours.:
[X] MTBF of power supply > 100,000 hours? [CA]
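For scale, the 100,000-hour MTBF figure translates into an expected failure count as follows. A rough sketch assuming a constant failure rate; the fleet size of 100 nodes is an assumption for illustration:

```python
MTBF_H = 100_000       # required PSU MTBF from the spec
HOURS_PER_YEAR = 8760

def expected_failures(n_units: int, hours: int, mtbf: float) -> float:
    """Expected failure count assuming a constant failure rate of 1/MTBF."""
    return n_units * hours / mtbf

per_year = expected_failures(100, HOURS_PER_YEAR, MTBF_H)
print(round(per_year, 1))  # ~8.8 PSU failures per year across 100 nodes
```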
Each node should have copper Gb/s ethernet and basic on-board video.:
[X] 2 x 1 Gb/s Intel NIC on-board [CA]
[X] XGI Volari Z7 on-board [CA]
In this paragraph 1 GB = 1,000,000,000 bytes. Each node needs a local SATA hard disk for the operating system and scratch space. The size of this hard disk should be at least 100 GB + n × 75 GB, where n is the number of CPU cores per node (250 GB for a two-core system, 400 GB for a four-core system, 700 GB for an eight-core system). Disks should be rated for 24/7 operation and fully support S.M.A.R.T. The disk should be mounted in a hot-swappable front SATA carrier. Each compute node should also have at least one empty but fully wired hot-swap SATA disk carrier for later extensions.:
[X] 500 GB Hitachi HDP725050GLA360 (465 GB 1024-based) [CA]
[X] Second carrier fully wired and working? [CA]
[X] S.M.A.R.T. works [CA]
[X] Rated for 24/7 [CA]
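The disk-size requirement can be checked with a few lines; the GB/GiB conversion also explains why the "500 GB" disk reports about 465 in 1024-based units:

```python
def min_disk_gb(cores: int) -> int:
    # Requirement from the text: 100 GB + 75 GB per CPU core
    # (decimal GB, 1 GB = 1,000,000,000 bytes).
    return 100 + 75 * cores

# Quoted examples: 2 -> 250, 4 -> 400, 8 -> 700; this node has 4 cores.
assert min_disk_gb(4) == 400

# A decimal 500 GB disk expressed in 1024-based units:
gib = 500_000_000_000 / 2**30
print(int(gib))  # 465, matching the value reported above
```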
Compute nodes do not need high-availability features, such as redundant components, RAID arrays etc. Compute nodes should be as simple and basic as possible.:
[X] Very basic, no special add-ons? [CA]
Each node should have an IPMI v2.0 management card (BMC) installed; the ethernet connection can be shared with the on-board ethernet network connection for remote access, or the Vendor can provide a separate oversubscribed low-performance management network for this purpose.:
[X] IPMI present and working? [CA]
Nodes must be able to perform full power-off, reset, and wakeup via IPMI under all power cycling conditions (e.g. even if external power has been switched off and back on before wakeup). Wake-on-LAN is considered a bonus.:
[X] Box is running, power off via IPMI works?
[X] Box is running, reset (power cycle) via IPMI works?
[X] Box is powered off via IPMI, restart via IPMI possible?
[X] Box is powered off via OS, restart via IPMI possible?
[X] Box is powered off via power switch, restart via IPMI possible?
[X] Box is running, power cable is removed (> 1 minute), restart via IPMI possible?
Note: if an IPMI request is made directly after reconnecting the node, the IPMI card can be blanked and has to be reconfigured.
[X] Box is powered off via OS, power cable removed (> 1 minute), restart via IPMI possible?
[ ] Box is powered off via IPMI, restart via WoL possible?
[ ] Box is powered off via OS, restart via WoL possible?
[ ] Box is powered off via power switch, restart via WoL possible?
[ ] Box is running, power cable is removed (> 1 minute), restart via WoL possible?
[ ] Box is powered off via OS, power cable removed (> 1 minute), restart via WoL possible?
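Since WoL is listed as a bonus, a minimal sketch of building and sending the standard magic packet may help with these tests (the MAC address in the example is hypothetical):

```python
import socket

def wol_magic_packet(mac: str) -> bytes:
    """Standard WoL magic packet: 6 x 0xFF followed by the target MAC
    repeated 16 times (102 bytes total)."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError("expected a 6-byte MAC address")
    return b"\xff" * 6 + mac_bytes * 16

def send_wol(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    # UDP port 9 (discard) is a common convention for WoL
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        s.sendto(wol_magic_packet(mac), (broadcast, port))

pkt = wol_magic_packet("00:30:48:12:34:56")  # hypothetical MAC
print(len(pkt))  # 102
```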
Compute nodes must be bootable via the network (PXE) to allow remote, hands-off installation.:
[X] PXE is working [CA]
Floppy and CD drives are not needed for compute nodes.:
[X] Floppy is missing [CA]
[X] CD is missing [CA]
Vendor should provide a hands-off, network-based method for BIOS and IPMI upgrades and configuration. :
[X] BIOS can be flashed in hands-off manner?
[X] BIOS values (CMOS) can be set in hands-off manner?
[X] IPMI can be flashed in hands-off manner?
[X] IPMI values can be set in hands-off manner?
Nodes should be delivered with BIOS settings as per AEI specifications. Vendor must have an automated system to set BIOS values.:
[X] fulfilled? [CA] (except USB-Legacy mode, which we need for keyboard)
Sound components, mice and keyboards are not required. The nodes are required to work without any mouse and/or keyboard connected.:
[X] Audio is missing [CA]
[X] Mouse missing [CA]
[X] Keyboard missing [CA]
[X] System works with missing audio, mouse and/or keyboard [CA]
Mainboard sensors for all vital components (fans, temperatures, voltages) must be present and can be queried using at least one of lm_sensors or ipmitools.:
IPMI (see also above)
Temp: CPU, System
Voltage: CPU Core, DIMM, 3.3V, 5V, 5VSB, 12V, -12V, Battery
Fans: Fan1, Fan2, Fan3
[ ] lm_sensors working? (need more kernel modules, but should work)
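A sketch of how the listed readings could be checked programmatically by parsing `ipmitool sensor` output. The pipe-separated sample below is illustrative, not captured from this node; in practice you would feed in the output of `subprocess.check_output(["ipmitool", "sensor"])`:

```python
SAMPLE = """\
CPU Temp         | 38.000    | degrees C | ok
System Temp      | 31.000    | degrees C | ok
Fan1             | 5400.000  | RPM       | ok
CPU Core         | 1.264     | Volts     | ok
"""

def parse_sensors(text: str) -> dict:
    """Map sensor name -> (value, unit, status) from pipe-separated rows."""
    readings = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) >= 4:
            name, value, unit, status = fields[:4]
            readings[name] = (float(value), unit, status)
    return readings

sensors = parse_sensors(SAMPLE)
print(sensors["CPU Temp"])  # (38.0, 'degrees C', 'ok')
```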
Major/minor revision numbers of major components and embedded software (BIOS/Firmware) must be identical. (List of revision numbers, firmware version):
Supermicro SC811-SCA __________________ (Chassis) [CA]
SuperMicro PDSML-LN2+ Rev 1.01, BIOS 6.00 2007-12-14 (Motherboard) [CA]
1UIPMI-B Rev. 3.01, Firmware 2.7 (IPMI) [CA]
ATP AJ56K72G8BJE6S (Memory) [CA]
Hitachi Deskstar HDP725050GLA360 Firmware GM4OA50E (Hard disk) [CA]
Ablecom SP302/1S PSU (Power Supply) [CA]

The operating system on the cluster will be a 64-bit version of Linux, with a recent 2.6 kernel. Therefore, it is required that all hardware works under this OS.:
[X] Kernel 2.6.20 is working [CA]
The cluster must work with any major Linux distribution coming with a recent 2.6 kernel.:
[X] Debian etch is working [CA]
All compute nodes, head nodes, storage nodes and switches should be placed into 19-inch racks provided by AEI. The maximum depth of any installed equipment must not exceed 750mm.:
[X] Fulfilled for compute nodes [CA]
The air flow should be from the front to the back, i.e. cool air from the front should be taken in by the fans and the hot air blown out in the rear. This should be valid for the whole rack meaning the cooling air flow within the rack is horizontal.:
[X] Fulfilled [CA]
Each node will be clearly labeled with the node number and ethernet MAC address of the network card and of the IPMI interfaces. Three labels (node number, MAC ethernet and IPMI addresses) will appear on the front and the same three labels will appear at the rear of the chassis. The characters in labels should be as large as permitted by space on the chassis. Labels must be permanent and not peel or discolor after time. The exact naming scheme will be discussed and finalized after the order has been placed.:
[X] Node number readable from front? [CA]
[X] Node number readable from back? [CA]
[X] MAC of eth0 is readable from front? [CA]
[ ] MAC of eth0 is readable from back? NO [CA]
[X] MAC of eth1 is readable from front? [CA]
[ ] MAC of eth1 is readable from back? NO [CA]
[X] MAC of IPMI is readable from front (needed? same as eth0)? [CA]
[ ] MAC of IPMI is readable from back? NO (needed? same as eth0) [CA]
The Vendor will supply on a CD or on a floppy disk an ASCII text file containing a list of ethernet and IPMI MAC addresses and node names in a 4-column format.:
I had to type in the numbers myself
Contacted Vendor about this, should be fixed for future
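For reference, the requested 4-column ASCII list could be consumed like this. The assumed column order (node name, eth0 MAC, eth1 MAC, IPMI MAC) and the sample entries are illustrative only; the real layout is up to the Vendor:

```python
SAMPLE = """\
node001 00:30:48:aa:bb:01 00:30:48:aa:bb:02 00:30:48:aa:bb:01
node002 00:30:48:aa:cc:01 00:30:48:aa:cc:02 00:30:48:aa:cc:01
"""

def parse_mac_list(text: str) -> dict:
    """Map node name -> {eth0, eth1, ipmi} MAC addresses."""
    nodes = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 4:
            name, eth0, eth1, ipmi = fields
            nodes[name] = {"eth0": eth0, "eth1": eth1, "ipmi": ipmi}
    return nodes

nodes = parse_mac_list(SAMPLE)
print(nodes["node001"]["ipmi"])  # on this board, IPMI shares eth0's MAC
```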
[ ] Fulfilled - NO [CA]

During the acceptance tests the reported benchmark performance must be reproducible by AEI staff, although we allow for a 5% error margin on each individual score. Quoted scores were: (DR: 14.23, HT: 5.58, MC: 3.71, DM: 2.57):
[X] Fulfilled (see this list) [CA]
 DR:15.4286      HT:5.9416       MC:3.7209       DM:2.5555
 DR:17.8210      HT:6.0118       MC:3.7356       DM:2.5555
 DR:16.6135      HT:6.0306       MC:3.7337       DM:2.5568
 DR:17.7865      HT:5.9739       MC:3.7386       DM:2.5594
 DR:17.2420      HT:6.1822       MC:3.7366       DM:2.3636
 DR:18.2146      HT:6.0803       MC:3.7466       DM:2.5543
 DR:19.2518      HT:6.0306       MC:3.7446       DM:2.5568
 DR:18.0895      HT:6.2638       MC:3.7376       DM:2.5581
 DR:16.2943      HT:6.1251       MC:3.7416       DM:2.5568
 DR:16.4593      HT:6.0831       MC:3.7416       DM:2.5555
 DR:17.7358      HT:5.9844       MC:3.7209       DM:2.5555
 DR:14.9434      HT:5.9656       MC:3.7357       DM:2.5568
 DR:17.1191      HT:6.2111       MC:3.7475       DM:2.5607
 DR:17.0760      HT:6.0887       MC:3.7140       DM:2.5607
 DR:16.4022      HT:5.9847       MC:3.7407       DM:2.5594
 DR:17.2216      HT:6.1054       MC:3.7446       DM:2.5581
 DR:18.5208      HT:6.0062       MC:3.7396       DM:2.5555
 DR:17.7747      HT:6.1966       MC:3.7376       DM:2.5607
 DR:17.7671      HT:6.1364       MC:3.7337       DM:2.5581
 DR:19.5321      HT:6.0608       MC:3.7209       DM:2.5568
 DR:19.6205      HT:6.0170       MC:3.7425       DM:2.5568
 DR:17.5073      HT:5.9899       MC:3.7258       DM:2.5581
 DR:18.6469      HT:5.8967       MC:3.7436       DM:2.5543
 DR:17.7628      HT:6.1054       MC:3.7387       DM:2.5555
 DR:16.7256      HT:5.9656       MC:3.7376       DM:2.5581
 DR:19.0415      HT:5.9363       MC:3.7406       DM:2.5581
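The 5% acceptance criterion can be checked mechanically. The quoted scores and the sample run below are taken from the list above; the pass rule (measured at least 95% of quoted) is how we read the margin:

```python
QUOTED = {"DR": 14.23, "HT": 5.58, "MC": 3.71, "DM": 2.57}

def within_margin(measured: dict, quoted: dict, margin: float = 0.05) -> bool:
    """True if every measured score is at least (1 - margin) x quoted."""
    return all(measured[k] >= (1 - margin) * quoted[k] for k in quoted)

# First run from the list above:
run1 = {"DR": 15.4286, "HT": 5.9416, "MC": 3.7209, "DM": 2.5555}
print(within_margin(run1, QUOTED))  # True
```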

The Vendor will run a burn-in on each node before delivery, at least 24 hours long, using a memory/disk/CPU/network-card/video-card exercise program of their choice. The Vendor will supply documentation of the burn-in tests, including specification of what programs and parameters have been used, and the name of a contact person for queries regarding these tests.:
(Contacted Vendor about this, should be fixed for future)
Topic revision: r1 - 08 Jan 2008, Carsten