This page contains a list of discovered network problems, the corresponding tools, and possible solutions.

Wrong MAC address <-> port assignment

symptom

The node is not reachable.

synopses

When a client is connected to a switch, the switch's internal MAC hash table stores the client's MAC address and the port the node is connected to. The switch may wrongly assume that the node is somewhere in the rest of the network and that the MAC address belongs to one of the LAGs. Sometimes the switch knows the right port number but permanently or frequently switches to a LAG entry.

check

Log in to the FS500 and check which port the node belongs to:
telnet 172.25.21.x
User: admin
Password: xxxxxxxxx
enable
show fdb-table learned 00:AA:BB:CC:DD:EE

   Mac Address       Port    IfIndex     Status   
   -----------------  --------  -------  ------------
   00:AA:BB:CC:DD:EE    LAG 1     26       Learned 
or
show fdb-table learned 00:AA:BB:CC:DD:EE

   Mac Address       Port    IfIndex     Status   
   -----------------  --------  -------  ------------
   00:AA:BB:CC:DD:EE    1/24      24       Learned  
If this MAC address belongs to a node attached directly to a port of this switch, the first entry (LAG 1) is wrong.
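The comparison of the two outputs above can be scripted. The following is a minimal sketch, assuming the output of show fdb-table learned has been captured to a file or pipe in exactly the layout shown above; the function names fdb_port and is_lag_entry are made up for this example.

```shell
# fdb_port MAC: read "show fdb-table learned" output on stdin and print
# the Port column of the row for MAC. The port may contain a space
# ("LAG 1"), so it is taken as everything between the MAC address and
# the last two columns (IfIndex, Status).
fdb_port() {
  awk -v mac="$1" '
    toupper($1) == toupper(mac) {
      port = $2
      for (i = 3; i <= NF - 2; i++) port = port " " $i
      print port
    }'
}

# is_lag_entry PORT: succeed if the port value names a LAG.
is_lag_entry() {
  case "$1" in
    LAG*) return 0 ;;
    *)    return 1 ;;
  esac
}

# Usage (fdb.txt is a hypothetical capture of the switch output):
#   port=$(fdb_port 00:AA:BB:CC:DD:EE < fdb.txt)
#   is_lag_entry "$port" && echo "suspicious: MAC learned on $port"
```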

solution

Flush all hash tables of all components concurrently several times (10 times with a delay of 2 seconds in between) until the MAC hash tables are almost empty. You might want to use dsh with the -r option. Flushing the MAC hash table only on the affected switch is not sufficient. Log in to a switch and run:
clear fdb-table learned all
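The repetition (10 rounds, 2 seconds apart) can be wrapped in a small helper. This is only a sketch: repeat_cmd is a made-up name, and flush_all_switches in the usage line is a hypothetical wrapper that would issue clear fdb-table learned all on every switch concurrently; how that wrapper reaches the switches (telnet, dsh, ...) is not shown here.

```shell
# repeat_cmd COUNT DELAY CMD [ARGS...]
# Run CMD COUNT times, sleeping DELAY seconds between two rounds.
repeat_cmd() {
  local count=$1 delay=$2
  shift 2
  local i=1
  while [ "$i" -le "$count" ]; do
    "$@"
    # No sleep after the final round.
    if [ "$i" -lt "$count" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
}

# Usage (flush_all_switches is hypothetical):
#   repeat_cmd 10 2 flush_all_switches
```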

No communication between two nodes

symptom

Nodes are reachable from most of the servers, but some pairs of nodes cannot see each other.

synopses

The MAC hash entries are correct and the nodes are reachable in general. Only some pairs can't communicate. This is probably related to some wrong entries for the line cards or fabric cards in the FS1000s.

check

  • ssh to node A works
  • ssh to node B works
  • ping from node A to node B does not work
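The three checks above can be run in one go. The sketch below assumes password-less ssh from the machine running it to both nodes and a Linux ping on node A; check_pair is a made-up name.

```shell
# check_pair A B: ssh to A, ssh to B, then ping B from A.
# Returns 0 if everything works, 1 if only the A -> B ping fails
# (the symptom described here), 2 if one of the nodes is not reachable.
check_pair() {
  local a=$1 b=$2
  # Left unquoted on purpose below so that it splits into separate options.
  local opts="-o BatchMode=yes -o ConnectTimeout=5"
  ssh $opts "$a" true || { echo "$a: ssh fails"; return 2; }
  ssh $opts "$b" true || { echo "$b: ssh fails"; return 2; }
  if ssh $opts "$a" ping -c 3 -q "$b" >/dev/null 2>&1; then
    echo "$a -> $b: ok"
  else
    echo "$a -> $b: ping fails although both nodes accept ssh"
    return 1
  fi
}

# Usage:
#   check_pair n0001 n0002
```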

solution

Flush all hash tables of all components concurrently several times (10 times with a delay of 2 seconds in between) until the MAC hash tables are almost empty. You might want to use dsh with the -r option. Flushing the MAC hash table only on the affected switch is not sufficient. Log in to a switch and run:
clear fdb-table learned all

Flaky port

symptom

Nodes are reachable sometimes. Communication is bad.

synopses

The port switches off and on frequently, at approximately 1 Hz.

check

The light on the FS100 (probably not on the FS500) switches on and off.

solution

Currently, we either change the port on the FS500 or reroute the traffic via another FS500 by changing the vLAN settings on the FS100. In the latter case, all ports have to be connected via either vLAN 1 or vLAN 2 to avoid the bad port. Usually the uplink port 0/49 is connected to vLAN 1 and 0/50 to vLAN 2. The problem might disappear after a reboot of the switch, but this hasn't been checked yet. A cluster-wide flush of the MAC hash tables might also help.

Wrong MTU setting

symptom

Packets get lost between nodes if the frame size is larger than 1600 bytes.

synopses

A switch does not transmit packets with jumbo frames, though it is correctly configured.

check

Run
ping -s 8000 -M do node
If there is no response, the jumbo frames are probably not handled correctly. Log in to the FS500s and run:
enable
debug
bcm-test
ports
If numbers smaller than 9044 appear in the max frame column (the last 4 lines are exceptions), then the hardware (Broadcom) MTU setting is wrong.
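Scanning the table by eye is error-prone, so the rule above can be sketched as a filter. The layout of the ports output is not reproduced on this page, so the sketch assumes a whitespace-separated table and takes the position of the max frame column as a parameter; small_max_frame and ports.txt are made-up names.

```shell
# small_max_frame COL: read the captured "ports" table on stdin and print
# every row whose value in column COL is a number below 9044.
# The last 4 lines are skipped, as they are the known exceptions.
small_max_frame() {
  awk -v col="$1" -v min=9044 '
    { line[NR] = $0 }
    END {
      for (i = 1; i <= NR - 4; i++) {
        split(line[i], f, " ")
        # Header and separator rows are ignored because they are not numeric.
        if (f[col] ~ /^[0-9]+$/ && f[col] + 0 < min) print line[i]
      }
    }'
}

# Usage (column number 2 is an assumption, adjust it to the real table):
#   small_max_frame 2 < ports.txt
```

No output means that no port outside the last 4 lines has a max frame value below 9044.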

solution

Proceed with:
exit
exit
configure
lag 1
mtu 9040
exit
lag 2
mtu 9040
exit 
exit
logout

You might want to check in the bcm-test mode whether the MTUs are correct now.

Random packet loss

symptom

Packets are getting lost randomly. Run an fping test over a longer period and count the number of lost packets.

synopses

Packets are getting lost randomly. The reason is unknown so far and is an issue for the developer. We observed two rates. A relatively low rate can be seen everywhere. If the rate is higher, it is probably caused by a flaky port; see the flaky port issue. Usually all nodes that are connected to the problematic port via a particular uplink of the FS100 are affected.

check

A global fping test shows the behavior. Each node runs fping against random peers at random times. If packets get lost, a counter for this pair increases. Look at the pair matrix and mark problematic pairs with a red dot. One then sees homogeneously distributed red points in the matrix, plus lines of higher point density. These lines correspond to problematic ports, which are probably flaky. A possible implementation is to start the following command line on each node concurrently (via dsh):
sleep $(( RANDOM % 300 ))
fping -b8040 -p 100 -c 10 -m  -q `for i in \`seq 1 20\`;do printf " n%04d " $(( $RANDOM%1675 ));done` 2>&1  | \
awk -F "/|%| " " { if (\$10!=0) {print \"$HOSTNAME \"\$1}}" | sed "s/n//g"|  tee /tmp/crossping

Collect the output of all nodes into one list. The entries are point pairs, which can be plotted with e.g. gnuplot. Each point represents a pair of nodes with at least one problematic connection.
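Besides plotting, the lines of higher point density can be found by counting how often each node appears in the collected list. A minimal sketch, assuming the list contains one "source target" pair per line as written to /tmp/crossping by the command above; rank_nodes and all_pairs.txt are made-up names.

```shell
# rank_nodes: read "source target" pairs on stdin and print
# "<count> <node>" for every node, most frequently affected node first.
rank_nodes() {
  awk '{ bad[$1]++; bad[$2]++ }
       END { for (n in bad) print bad[n], n }' |
    sort -k1,1nr -k2,2
}

# Usage (all_pairs.txt is the concatenation of /tmp/crossping of all nodes):
#   rank_nodes < all_pairs.txt | head
```

Nodes at the top of the list take part in many problematic pairs and are candidates for a flaky port.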

solution

If the packet loss rate is higher than the average, see the section on flaky ports. For the background rate we have no solution yet. In general, it turned out that flushing all MAC hash tables improves the situation and reduces this rate. We also observed that the rate is almost zero for packets with an MTU smaller than 6000 bytes and increases for bigger frames.

Changing to an MTU of 6000 bytes is probably an option.

Periodic packet loss

symptom

In an fping test, a small number of packets get lost every 148 s.

synopses

This seems to happen regularly when two nodes are connected to the same FS500. One or two packets get lost in an fping test, but only if the FS500 is connected via the LAGs to the FS1000s. Other nodes should not be attached to this switch, to avoid the random loss.

check

Run an fping between two nodes connected to the same FS500. Disconnect all other nodes from this switch. Run:
while [ TRUE ];do fping -p 10 -b8040 -C 100  -q 10.20.40.1 2>&1 | sed "s/[0-9,.,:,a-z,A-Z, ]//g"| \
  grep "\-"|wc -c | awk ' {if ($1!=0) { print systime()" "$1-1}}';done | tee /tmp/fping_out
and watch the output. The effect does not show up with ICMP packets going only from A to B; no such packet loss has been observed there.
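To see the period directly, the time between consecutive loss events can be computed from the recorded file. A minimal sketch, assuming /tmp/fping_out contains one "timestamp lost_packets" line per loss event as produced by the loop above; loss_intervals is a made-up name.

```shell
# loss_intervals: read "timestamp count" lines on stdin and print the
# number of seconds between each loss event and the previous one.
loss_intervals() {
  awk 'NR > 1 { print $1 - prev }
       { prev = $1 }'
}

# Usage: show the most frequent intervals first
#   loss_intervals < /tmp/fping_out | sort -n | uniq -c | sort -rn | head
```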

solution

This packet loss might disappear for a while when the LAGs are switched off and on. Log in to the FS500 and run:
configure
lag 1
shutdown
exit
lag 2 
shutdown
no shutdown
exit
lag 1
no shutdown
exit
exit
logout
Flushing all hash tables reduces the amplitude of the periodic packet loss. It turns out that this rate is much lower than that of the randomly lost packets. Peaks appear in the power spectrum after a few days of recording the fping outcome. This is probably not a big problem and can be ignored.

Packet duplication

symptom

Once in a while, packets are received twice. This happens every 10-800 s. When it happens, it happens 11 times within 11 s, at intervals of 1 s.

synopses

This is probably correlated with the fabric cards. Currently 11 are working and one is switched off.

check

One needs a clean environment: a switch with just two nodes connected to it and two LAGs going to the FS1000. Write two C programs that open ICMP sockets. Send 10000 packets/s of arbitrary size from A to B, such that the delay between the starts of consecutive transmissions is always 0.1 ms. Put an ID, the sequence number of the sent packet, into the header of the IP packets. It turned out that packets do not overtake one another, so the IDs of the received packets should be strictly increasing. If a packet arrives with exactly the same ID as the packet before, we have packet duplication.
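The duplicate criterion itself is easy to check offline. The sketch below is not the C receiver described above; it only assumes that the receiver logs the ID of every received packet, one per line and in arrival order, to a file. The names find_duplicates and received_ids.txt are made up.

```shell
# find_duplicates: read one packet ID per line on stdin (arrival order)
# and report every packet whose ID equals the ID of the packet before it.
find_duplicates() {
  awk 'NR > 1 && $1 == prev { print "duplicate id " $1 " at line " NR }
       { prev = $1 }'
}

# Usage:
#   find_duplicates < received_ids.txt
```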

solution

Probably not harmful.

General observations

A high-resolution ping test with a packet send rate of 10 kHz shows packet loss even in a lab setup with only two nodes connected to an FS500 with two uplink LAGs to the FS1000s. These are not single packets; 10-40 consecutive packets get lost at a time. So there is a connectivity loss of 1-4 ms.

-- HenningFehrmann - 18 Mar 2011
Topic revision: r16 - 25 Mar 2011, HenningFehrmann