The following items need to be checked for the compute nodes (please add your name to each test you have performed; if you need more space, please add footnotes).
The system design should be as simple as possible and all the components should be standard/commodity ones.:
State the items present in the box (just the individual items for our records; details will be queried later) [CA]
Intel Xeon X3220 CPU
SuperMicro PDSML-LN2+ Motherboard (Rev 1.01)
SuperMicro 1UIPMI-B (Rev. 3.01)
4x ATP AJ56K72G8BJE6S RAM (2048 MB DDR2-667 ECC)
Hitachi Deskstar HDP725050GLA360
SuperMicro Chassis SC811-SCA
Ablecom SP302/1S PSU (300W)
All nodes should be able to run the same Linux operating system version.:
[X] Debian etch able to run? [CA]
All components (compute nodes, storage nodes, head nodes) should be installed on slide rails that permit easy removal and internal access. The slide rails must be rated for the necessary weight.:
[X] Slide rails available? [CA]
[X] Easy operation? Yes, but the box needs to be removed from the rack if the cover is to be opened [CA]
The amount of power and cooling available for this cluster is 220W per HU on average.:
[X] Wattage of box less than 220W per HU [CA]
How much? Booting: 150W initially, dropping to 100W (idle). Full stress: 150W [CA]
What is the power factor?:
[ ] Cos phi > 0.9 NO [CA]
Measured: Idle: 0.82, Full load: 0.86 [CA]
Putting more boxes onto a single power phase yielded a better overall power factor.
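As a worked example using the measured figures above: the apparent power per node is $S = P / \cos\varphi$, i.e. roughly $150\,\mathrm{W} / 0.86 \approx 174$~VA at full load and $100\,\mathrm{W} / 0.82 \approx 122$~VA at idle.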
It should be possible to shut down and power up the entire cluster remotely. Hardware monitoring (IPMI~v2.0 or better) should include the monitoring of the basic health (temperatures, voltages, fans) of all types of nodes and Serial over LAN (SoL) features for remote console access. :
[X] Hardware monitoring via IPMI possible? [CA]
[X] IPMI temperatures (which?) (CPU, System) [CA]
[X] IPMI voltages (which?) (CPU Core, DIMM, 3.3V, 5V, 5VSB, 12V, -12V, Battery) [CA]
[X] IPMI fan speeds (which?) (fan1 - nearest to PSU, fan2, fan3 - furthest from PSU)
[X] SoL working?
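As a cross-check of the monitoring items above, the BMC sensors can also be read remotely with ipmitool. The snippet below is a minimal sketch, assuming ipmitool is installed; the hostname and credentials are hypothetical placeholders.

  import subprocess

  # Hypothetical IPMI hostname and credentials of one compute node's BMC.
  HOST, USER, PASSWORD = "node01-ipmi", "ADMIN", "secret"

  def ipmi(*args):
      """Run an ipmitool command against the remote BMC (IPMI v2.0 / lanplus)."""
      cmd = ["ipmitool", "-I", "lanplus", "-H", HOST, "-U", USER, "-P", PASSWORD]
      return subprocess.check_output(cmd + list(args)).decode()

  # List every sensor the BMC exposes: temperatures, voltages and fan speeds.
  print(ipmi("sensor"))

Serial over LAN (the SoL check) would be started interactively with the "sol activate" subcommand of the same tool.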
Compute nodes should have x86-64 compatible CPUs (also known as AMD64, Intel64, EM64T):
[X] Xeon X3220 (64-bit extensions present) [CA]
Each compute node should have at least one processor; multiple-processor nodes are allowed.:
[X] 1 socket [CA]
Each processor should have at least one core; multi-core processors are allowed.:
[X] 4 cores/socket [CA]
Each compute node should have at least 2~GB of memory per core.:
[X] 8 GB present, i.e. 2 GB/core [CA]
The memory type should be the fastest standard ECC RAM matching the CPU and chipset. Registered/fully buffered memory should be chosen if needed for reliable operation.:
[X] ECC [CA]
[X] Fastest available for this platform? [CA]
When the nodes are provided with sufficient volumes of cooling air at 23 degrees Celsius and operated at full load (CPU and disk) the MTBF (Mean Time Between Failure) of each system power supply will be at least 100,000 hours.:
[X] MTBF of power supply > 100,000 hours? [CA]
Each node should have copper Gb/s ethernet and basic on-board video.:
[X] 2 x 1 Gb/s Intel NIC on-board [CA]
[X] XGI Volari Z7 on-board [CA]
In this paragraph 1~GB=$1\,000\,000\,000$~bytes. Each node needs a local SATA hard disk for the operating system and scratch space. The size of this hard disk should be at least 100~GB + $n$~$\times$~75~GB where $n$ is the number of CPU cores per node (250 GB for a two-core system, 400 GB for a four-core system, 700 GB for an eight-core system). Disks should be rated for 24/7~operation, and fully support S.M.A.R.T. The disk should be mounted in a hot-swappable front SATA carrier. Each compute node should also have at least one empty, but fully wired hot-swap SATA disk carrier for later extensions.:
[X] 500 GB Hitachi HDP725050GLA360 (465 GiB, 1024-based) [CA]
[X] Second carrier fully wired and working? [CA]
[X] S.M.A.R.T. works [CA]
[X] Rated for 24/7 [CA]
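A minimal sketch of the size rule from the paragraph above (with 1 GB = 10^9 bytes); the figures simply reproduce the examples given in the specification.

  # Required local disk size: 100 GB plus 75 GB per CPU core.
  def required_disk_gb(cores):
      return 100 + 75 * cores

  for cores in (2, 4, 8):
      print("%d cores -> %d GB" % (cores, required_disk_gb(cores)))
  # 2 cores -> 250 GB, 4 cores -> 400 GB, 8 cores -> 700 GB; the 500 GB disk
  # fitted in this 4-core node therefore satisfies the requirement.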
Compute nodes do not need high-availability features, such as redundant components, RAID arrays etc. Compute nodes should be as simple and basic as possible.:
[X] Very basic, no special add-ons? [CA]
Each node should have an IPMI v2.0 management card (BMC) installed; the ethernet connection can be shared with the on-board ethernet network connection for remote access, or the Vendor can provide a separate oversubscribed low-performance management network for this purpose.:
[X] IPMI present and working? [CA]
Nodes must be able to perform full power-off, reset, and wakeup via IPMI under all power cycling conditions (e.g. even if external power has been switched off and back on before wakeup). Wake-on-LAN is considered a bonus.:
[X] Box is running, power off via IPMI works?
[X] Box is running, reset (power cycle) via IPMI works?
[X] Box is powered off via IPMI, restart via IPMI possible?
[X] Box is powered off via OS, restart via IPMI possible?
[X] Box is powered off via power switch, restart via IPMI possible?
[X] Box is running, power cable is removed (> 1 minute), restart via IPMI possible?
Note: if an IPMI request is made directly after reconnecting the node, it can happen that the IPMI card is blanked and has to be reconfigured.
[X] Box is powered off via OS, power cable removed (> 1 minute), restart via IPMI possible?
[ ] Box is powered off via IPMI, restart via WoL possible?
[ ] Box is powered off via OS, restart via WoL possible?
[ ] Box is powered off via power switch, restart via WoL possible?
[ ] Box is running, power cable is removed (> 1 minute), restart via WoL possible?
[ ] Box is powered off via OS, power cable removed (> 1 minute), restart via WoL possible?
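The power-cycling checks above were driven via IPMI; a minimal sketch of the corresponding ipmitool calls, with the same hypothetical credentials as before, could look like this.

  import subprocess

  def ipmi_power(host, action, user="ADMIN", password="secret"):
      """action: one of 'status', 'on', 'off', 'cycle', 'reset'."""
      cmd = ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-P", password,
             "chassis", "power", action]
      return subprocess.check_output(cmd).decode().strip()

  print(ipmi_power("node01-ipmi", "status"))  # e.g. "Chassis Power is on"
  # ipmi_power("node01-ipmi", "off") / "on" / "cycle" cover the IPMI tests above;
  # the Wake-on-LAN tests use a separate WoL tool instead of IPMI.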
Compute nodes must be bootable via the network (PXE) to allow remote, hands-off installation.:
[X] PXE is working [CA]
Floppy and CD drives are not needed for compute nodes.:
[X] Floppy is missing [CA]
[X] CD is missing [CA]
Vendor should provide a hands-off, network-based method for BIOS and IPMI upgrades and configuration. :
[X] BIOS can be flashed in hands-off manner?
[X] BIOS values (CMOS) can be set in hands-off manner?
[X] IPMI can be flashed in hands-off manner?
[X] IPMI values can be set in hands-off manner?
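The Vendor's own mechanism is not documented here; as an illustration only, IPMI values can in principle be set in a scripted, hands-off way with ipmitool from the running OS, for example:

  import subprocess

  # Read the current IPMI LAN configuration of channel 1 (requires root and the
  # IPMI kernel drivers for the local "open" interface).
  print(subprocess.check_output(["ipmitool", "lan", "print", "1"]).decode())

  # Set the BMC's IP address on channel 1; the address is a hypothetical example.
  subprocess.check_call(["ipmitool", "lan", "set", "1", "ipaddr", "10.0.0.101"])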
Nodes should be delivered with BIOS settings as per AEI specifications. Vendor must have an automated system to set BIOS values.:
[X] Fulfilled? [CA] (except USB legacy mode, which we need for the keyboard)
Sound components, mice and keyboards are not required. The nodes are required to work without any mouse and/or keyboard connected.:
[X] Audio is missing [CA]
[X] Mouse missing [CA]
[X] Keyboard missing [CA]
[X] System works without audio, mouse and/or keyboard connected [CA]
Mainboard sensors for all vital components (fans, temperatures, voltages) must be present and must be queryable using at least one of lm_sensors or ipmitool.:
IPMI (see also above)
Temp: CPU, System
Voltage: CPU Core, DIMM, 3.3V, 5V, 5VSB, 12V, -12V, Battery
Fans: Fan1, Fan2, Fan3
[ ] lm_sensors working? (needs additional kernel modules, but should work)
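A minimal sketch of this check, assuming the lm_sensors userspace tools are installed and the hardware-monitoring kernel modules mentioned above have been loaded (the module names depend on this board's sensor chip):

  import subprocess

  # lm_sensors readings; empty output usually means the required
  # hardware-monitoring kernel module is not loaded yet.
  print(subprocess.check_output(["sensors"]).decode())

  # The same values are available through the BMC's sensor data repository.
  print(subprocess.check_output(["ipmitool", "sdr"]).decode())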
Major/minor revision numbers of major components and embedded software (BIOS/firmware) must be identical. (List revision numbers and firmware versions):
Supermicro SC811-SCA __________________ (Chassis) [CA]
SuperMicro PDSML-LN2+ Rev 1.01, BIOS 6.00 2007-12-14 (Motherboard) [CA]
1UIPMI-B Rev. 3.01, Firmware 2.7 (IPMI) [CA]
ATP AJ56K72G8BJE6S (Memory) [CA]
Hitachi Deskstar HDP725050GLA360 Firmware GM4OA50E (Hard disk) [CA]
Ablecom SP302/1S PSU (Power Supply) [CA]
The operating system on the cluster will be a 64-bit version of Linux, with a recent 2.6 kernel. Therefore, it is required that all hardware works under this OS.:
[X] Kernel 2.6.20 and 2.6.23.1 are working [CA]
The cluster must work with any major Linux distribution coming with a recent 2.6 kernel.:
[X] Debian etch is working [CA]
All compute nodes, head nodes, storage nodes and switches should be placed into 19-inch racks provided by AEI. The maximum depth of any installed equipment must not exceed 750mm.:
[X] Fulfilled for compute nodes [CA]
The air flow should be from the front to the back, i.e. cool air from the front should be taken in by the fans and the hot air blown out in the rear. This should be valid for the whole rack meaning the cooling air flow within the rack is horizontal.:
[X] Fulfilled [CA]
Each node will be clearly labeled with the node number and the ethernet MAC addresses of the network cards and of the IPMI interface. Three labels (node number, ethernet MAC addresses, IPMI MAC address) will appear on the front, and the same three labels will appear at the rear of the chassis. The characters in the labels should be as large as the space on the chassis permits. Labels must be permanent and must not peel or discolor over time. The exact naming scheme will be discussed and finalized after the order has been placed.:
[X] Node number readable from front? [CA]
[X] Node number readable from back? [CA]
[X] MAC of eth0 is readable from front? [CA]
[ ] MAC of eth0 is readable from back? NO [CA]
[X] MAC of eth1 is readable from front? [CA]
[ ] MAC of eth1 is readable from back? NO [CA]
[X] MAC of IPMI is readable from front (needed? same as eth0)? [CA]
[ ] MAC of IPMI is readable from back? NO (needed? same as eth0) [CA]
The Vendor will supply on a CD or on a floppy disk an ASCII text file containing a list of ethernet and IPMI MAC addresses and node names in a 4-column format.:
[ ] Fulfilled? NO [CA]
NOT FULFILLED: no such list was supplied; I had to type the MAC addresses into a file myself.
Contacted the Vendor about this; it should be fixed for future deliveries.
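The column layout of the file is not spelled out above; a minimal reader, assuming four whitespace-separated columns of node name, eth0 MAC, eth1 MAC and IPMI MAC, could look like this:

  def read_mac_list(path):
      """Parse the vendor-supplied ASCII MAC list (assumed 4-column layout)."""
      nodes = {}
      with open(path) as f:
          for line in f:
              if not line.strip() or line.startswith("#"):
                  continue
              name, eth0, eth1, ipmi = line.split()
              nodes[name] = {"eth0": eth0, "eth1": eth1, "ipmi": ipmi}
      return nodes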
During the acceptance tests the reported benchmark performance must be reproducible by AEI staff, although we allow for a 5\% error margin on each individual score. Quoted scores were: (DR: 14.23, HT: 5.58, MC: 3.71, DM: 2.57):
[X] Fulfilled (see this list) [CA]
DR:15.4286 HT:5.9416 MC:3.7209 DM:2.5555
DR:17.8210 HT:6.0118 MC:3.7356 DM:2.5555
DR:16.6135 HT:6.0306 MC:3.7337 DM:2.5568
DR:17.7865 HT:5.9739 MC:3.7386 DM:2.5594
DR:17.2420 HT:6.1822 MC:3.7366 DM:2.3636
DR:18.2146 HT:6.0803 MC:3.7466 DM:2.5543
DR:19.2518 HT:6.0306 MC:3.7446 DM:2.5568
DR:18.0895 HT:6.2638 MC:3.7376 DM:2.5581
DR:16.2943 HT:6.1251 MC:3.7416 DM:2.5568
DR:16.4593 HT:6.0831 MC:3.7416 DM:2.5555
DR:17.7358 HT:5.9844 MC:3.7209 DM:2.5555
DR:14.9434 HT:5.9656 MC:3.7357 DM:2.5568
DR:17.1191 HT:6.2111 MC:3.7475 DM:2.5607
DR:17.0760 HT:6.0887 MC:3.7140 DM:2.5607
DR:16.4022 HT:5.9847 MC:3.7407 DM:2.5594
DR:17.2216 HT:6.1054 MC:3.7446 DM:2.5581
DR:18.5208 HT:6.0062 MC:3.7396 DM:2.5555
DR:17.7747 HT:6.1966 MC:3.7376 DM:2.5607
DR:17.7671 HT:6.1364 MC:3.7337 DM:2.5581
DR:19.5321 HT:6.0608 MC:3.7209 DM:2.5568
DR:19.6205 HT:6.0170 MC:3.7425 DM:2.5568
DR:17.5073 HT:5.9899 MC:3.7258 DM:2.5581
DR:18.6469 HT:5.8967 MC:3.7436 DM:2.5543
DR:17.7628 HT:6.1054 MC:3.7387 DM:2.5555
DR:16.7256 HT:5.9656 MC:3.7376 DM:2.5581
DR:19.0415 HT:5.9363 MC:3.7406 DM:2.5581
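A minimal sketch of the 5\% margin check, assuming higher scores are better so that a measured value may be at most 5\% below the quoted one:

  # Quoted scores from the specification above.
  QUOTED = {"DR": 14.23, "HT": 5.58, "MC": 3.71, "DM": 2.57}

  def within_margin(measured, quoted, margin=0.05):
      """True if the measured score is no more than `margin` below the quoted one."""
      return measured >= (1.0 - margin) * quoted

  # One measured line from the list above, in its raw "KEY:value" format.
  line = "DR:15.4286 HT:5.9416 MC:3.7209 DM:2.5555"
  scores = {k: float(v) for k, v in (item.split(":") for item in line.split())}
  print(all(within_margin(scores[k], QUOTED[k]) for k in QUOTED))  # True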
The Vendor will run a burn-in on each node before delivery, at least 24 hours long, using a memory/disk/CPU/network-card/video-card exercise program of their choice. The Vendor will supply documentation of the burn-in tests, including specification of what programs and parameters have been used, and the name of a contact person for queries regarding these tests.:
NOT FULFILLED [CA]
(Contacted the Vendor about this; should be fixed for future deliveries)