With 1340 nodes it is probably wise to split services across many boxes, so that half of the cluster does not end up waiting for a single server to serve data through its 1 Gb/s link. To achieve this, we probably need the following (admittedly overkill) setup:
We need a single DHCP master which serves only as a DHCP server and NOT as a TFTP server, FAI/nfsroot server or repository. This machine should simply be idle most of the time.
We might want to have DNS on this box as well.
There should be a slave DHCP server that takes over if the first one is dead (-> look for failover in the man page of dhcpd).
The nodes should be split into as many groups as there are TFTP server addresses. E.g. if we use 30 file servers as TFTP servers with 2 links each (no channel bonding, separate IPs), then we should create 60 groups in the DHCP config file, generated from our database via script. Each group will then get its own allocated TFTP server address.
The algorithm used for that may simply be the mod operation on the node number, i.e. with 60 groups n0001 goes to group1, n0002 to group2, n1334 to group14 (1334 mod 60 = 14) and so on. Essentially, for the purposes of installation we will partition the whole cluster into these groups. All installation will run solely within these groups.
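The grouping above could be scripted roughly as follows. This is only a minimal sketch, not the actual generator script: the MAC addresses, the TFTP server IPs and the function names are made up for illustration, and it assumes that node numbers that are multiples of 60 land in group0. The output uses the standard dhcpd.conf "group" and "next-server" statements, where next-server is the address the node boots from via TFTP.

```python
from collections import defaultdict

NUM_GROUPS = 60  # 30 file servers x 2 links each, as in the example above


def group_of(node_name, num_groups=NUM_GROUPS):
    """Map a node name such as 'n1334' to its group number via mod."""
    return int(node_name[1:]) % num_groups


def render_dhcp_groups(nodes, tftp_ips):
    """Return dhcpd.conf text with one group stanza per populated group.

    nodes    -- dict: node name -> MAC address (hypothetical data; the
                real values would come from the node database)
    tftp_ips -- list of TFTP server addresses, indexed by group number
    """
    members = defaultdict(list)
    for name in sorted(nodes):
        members[group_of(name, len(tftp_ips))].append(name)

    lines = []
    for g in sorted(members):
        lines.append(f"# group{g}")
        lines.append("group {")
        # next-server tells the nodes in this group which TFTP server to use
        lines.append(f"  next-server {tftp_ips[g]};")
        for name in members[g]:
            lines.append(f"  host {name} {{ hardware ethernet {nodes[name]}; }}")
        lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical addresses: one IP per link, 60 in total.
    ips = [f"10.0.0.{i}" for i in range(NUM_GROUPS)]
    demo = {"n0001": "00:00:00:00:00:01", "n1334": "00:00:00:00:05:36"}
    print(render_dhcp_groups(demo, ips))
```

Keeping the mapping in one small function means faiboot and similar tools could reuse the same rule to find the server responsible for a given node.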
Each file server will run a TFTP server for its nodes.
TASK: We need to change faiboot, dosboot and so on to reflect this set-up!
Each file server will run its own FAI server. The nfs-root should also be on the file servers.
- What happens if m file servers are offline and not available? Is there an easy way to fail over?
- Distributing updates via dsh (kernel images, dosimages)?