New networking set-up for Atlas
From 2008 until 2013 we used a flat networking structure, i.e. all computers on the data network were connected "directly" to the core. This statement is not 100% accurate, but it is close enough.
- flat network 10.0.0.0/8, single core switch, two-layered network, all computers on 1 Gb/s connections
At the very core of our system we started with a single 144-port 10 Gb/s CX4 switch (FS1000) and connected 32 "top of rack" switches (TRX100, now FS100) to it. Each FS100 was connected to 42 compute nodes in its rack and had up to 4 uplinks to the core switch. All of our head and storage nodes were linked to the core via an FS100.
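A quick sanity check of the rack-level oversubscription implied by these numbers (42 nodes at 1 Gb/s behind up to 4 uplinks at 10 Gb/s; the variable names are ours):

```python
# Back-of-the-envelope check of the rack uplink ratio described above.
NODES_PER_RACK = 42
NODE_LINK_GBPS = 1
UPLINKS = 4
UPLINK_GBPS = 10

downstream = NODES_PER_RACK * NODE_LINK_GBPS   # 42 Gb/s of edge capacity
upstream = UPLINKS * UPLINK_GBPS               # 40 Gb/s towards the core

oversubscription = downstream / upstream
print(f"oversubscription {oversubscription:.2f}:1")  # 1.05:1
```

So with all 4 uplinks in place a fully busy rack is only very mildly oversubscribed.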
- still flat network 10.0.0.0/8, two core switches, 16 leaf switches creating 192 ports via switch fabric, special nodes with 10Gb/s ethernet
In order to scale out to a much larger number of 1 Gb/s ports, we changed the core to 2 FS1000 switches and connected them to 16 FS500 switches, which feature 24 10 Gb/s CX4 ports each. We used 6 ports per FS500 for an aggregated link to each of the two FS1000s (12 uplink ports per FS500), leaving 12 edge ports per FS500 and effectively creating a 192-port switch fabric (Clos set-up). For this "virtual switch" we use the term "core switch".
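The port budget of this Clos fabric can be checked directly from the numbers above (all figures are from the text):

```python
# Port budget of the FS500/FS1000 Clos fabric: 16 leaf switches
# (FS500, 24 ports each), 6 ports aggregated to each of the 2 spine
# switches (FS1000).
LEAVES = 16
PORTS_PER_LEAF = 24
SPINES = 2
UPLINK_PORTS_PER_SPINE = 6

uplinks_per_leaf = SPINES * UPLINK_PORTS_PER_SPINE   # 12 ports go up
edge_per_leaf = PORTS_PER_LEAF - uplinks_per_leaf    # 12 ports remain
edge_total = LEAVES * edge_per_leaf
print(edge_total)  # 192 usable 10 Gb/s edge ports
```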
We used the new ports to connect all of our data servers (about 30 ports), a few head nodes, and the newly acquired HSM system (about 8 ports) directly to the core switch via 10 Gb/s links. All of our O(40) compute racks at the time were still connected via FS100 switches (with at least 2 uplinks each).
- main network shrunk to 10.0.0.0/9, add 8 line cards, 8 FS500 switches, physically the same network, logically segmented w.r.t. compute nodes, routing done by FS100, need to use special 10.128.X.0/24 routes, auto-distributed via switches/quagga
The system used so far worked well, but the central elements of the core switch (FS1000 and FS500) use SVLANs for the different networking flows. Therefore, every MAC address needs to be tracked multiple times (and with different algorithms). With 6 SVLANs in use and up to 16384 storable MAC addresses, under ideal circumstances we would have been able to accommodate about 2730 MAC addresses. This limit is highly theoretical, however, due to imperfections imposed by the switch-internal algorithms, the storage/optimization schemes, and the very close range of MAC addresses present on the system.
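The per-SVLAN bookkeeping cost can be sketched as follows: each MAC address consumes one table entry per SVLAN, so the best-case number of distinct MACs is the table size divided by the SVLAN count (the real limit is lower still because of the hashing imperfections mentioned above):

```python
# Ideal-case MAC capacity when every MAC is stored once per SVLAN.
TABLE_ENTRIES = 16384   # storable MAC table entries
SVLANS = 6              # SVLANs in use

ideal_capacity = TABLE_ENTRIES // SVLANS
print(ideal_capacity)   # 2730
```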
We had been able to circumvent this problem before by reprogramming the MAC addresses on O(200) compute nodes. This, however, was no solution when adding O(2000) more MAC addresses to the network.
Therefore, we arrived at the following scheme.
- We shrink our initially used 10.0.0.0/8 network to 10.0.0.0/9. Until 2013 we had only used very few addresses from 10.128.0.0/9. Reducing it further to 10.0.0.0/12 would have meant changing IP addresses on a much larger scale.
- Each FS100 switch in rack X will be given the internal IP 10.128.X.254 and the external IP 10.0.0.X.
- Each compute node rack with number X will be on its own 10.128.X.0/24 subnet
- Compute nodes will use IPs from the 10.128.X.0/24 range with default router 10.128.X.254
- All IP addresses in 10.0.0.0/9 will be reachable directly; all 10.128.X.0/24 addresses need extra routing information added to each host
- Use OSPF for this; more specifically, use quagga to distribute these routes automatically.
- No Layer 3 NAT involved, i.e. all nodes need to know all IPs
- Move over to a new node naming scheme with corresponding IP addresses. Before: n0001 (10.10.0.1), n0002 (10.10.0.2), ..., n1234 (10.10.12.34), ... New scheme: n0002 -> a0102 (rack 01, height unit 02) with IP 10.128.1.2
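For the OSPF part, a quagga `ospfd.conf` along these lines would announce a rack subnet into the routing domain. This is only a sketch under the assumption of a single backbone area; the router ID and area number are our choices, not taken from the actual configuration:

```
! ospfd.conf sketch for the FS100 in rack 1
router ospf
 ospf router-id 10.0.0.1
 network 10.0.0.0/9 area 0.0.0.0
 network 10.128.1.0/24 area 0.0.0.0
```

Hosts running quagga then pick up the 10.128.X.0/24 routes automatically instead of needing static route entries.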
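Since the new name encodes rack and height unit directly, the name-to-IP mapping is purely mechanical. A minimal sketch, assuming the pattern aRRUU with two-digit rack and unit numbers (the function name is ours):

```python
import re

def new_name_to_ip(name: str) -> str:
    """Map a new-scheme node name aRRUU to its IP 10.128.RR.UU."""
    m = re.fullmatch(r"a(\d\d)(\d\d)", name)
    if not m:
        raise ValueError(f"not a new-scheme name: {name}")
    rack, unit = int(m.group(1)), int(m.group(2))
    return f"10.128.{rack}.{unit}"

print(new_name_to_ip("a0102"))  # 10.128.1.2
```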
Initially the switches were bought from WovenSystems. The large core switch was named EFX1000, the 24-port switches EFX500, and the top of rack switches TRX100. After Fortinet bought the WovenSystems "leftovers", these switches were renamed FS1000, FS500 and FS100.
When buying large quantities of compute nodes, one will inevitably get MAC addresses from a very small range, mostly consecutive ones. Add to that that only one of the two available ports is used on each node, and suddenly close to 100% of the MAC addresses on the network are odd (or even).
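The effect this has on a hash-based MAC table can be illustrated with a toy hash that selects a bucket from the low address bits (the vendor prefix, node count, and bucket count here are made up for the illustration):

```python
# Consecutive factory MACs, two ports per node, only the first port of
# each node in use -> every active MAC is even (or odd).
BASE = 0x001B21A00000                                # made-up vendor prefix
NODES = 2000
active_macs = [BASE + 2 * i for i in range(NODES)]   # port 0 of each node

# Toy hash: low 8 bits of the MAC select one of 256 buckets.
buckets = {mac & 0xFF for mac in active_macs}
print(len(buckets))  # 128 -> half of the buckets can never be hit
```

With all active MACs sharing the same parity, any hash that leans on the low bits leaves half its buckets empty while the other half overflow early, which is exactly the kind of "imperfection" that keeps the real MAC capacity well below the theoretical one.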
- 11 Oct 2013