This page contains a list of discovered network problems, the
corresponding tools, and possible solutions.
Wrong MAC address <-> port assignment
symptom
The node is not reachable.
synopses
If a client is connected to a switch, the internal MAC hash table stores
the MAC address and the port the node is connected to. It might happen
that the switch wrongly assumes that the node is somewhere in the rest
of the network and that the MAC address belongs to one of the LAGs.
Sometimes the switch knows the right port number but permanently or
frequently switches to a LAG entry.
check
Log in to the FS500 and check which port the node belongs to:
telnet 172.25.21.x
User: admin
Password: xxxxxxxxx
enable
show fdb-table learned 00:AA:BB:CC:DD:EE
Mac Address Port IfIndex Status
----------------- -------- ------- ------------
00:AA:BB:CC:DD:EE LAG 1 26 Learned
or
show fdb-table learned 00:AA:BB:CC:DD:EE
Mac Address Port IfIndex Status
----------------- -------- ------- ------------
00:AA:BB:CC:DD:EE 1/24 24 Learned
If this MAC address actually belongs to a node on a local port (as in
the second output), the first entry (LAG 1) is wrong.
solution
Flush the MAC hash tables of all components concurrently, several
times, until the tables are almost empty (about 10 times, with a delay
of 2 seconds in between). You might want to use dsh with the -r option.
Flushing the MAC hash table only on the affected switch is not
sufficient.
Log in to a switch and run:
clear fdb-table learned all
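The repetition described above (10 rounds, 2 seconds apart) can be sketched as a small shell helper. This is only a sketch: repeat_with_delay is a hypothetical name, and flush_all_switches in the usage comment is a placeholder for whatever site-specific script issues the clear command on all switches at once (for example via dsh); neither exists on this page.

```shell
# Hypothetical helper: run a command COUNT times, sleeping DELAY seconds
# between runs (no sleep after the last run).
repeat_with_delay() {
    count=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$count" ]; do
        "$@"
        if [ "$i" -lt "$count" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
}

# Usage (flush_all_switches is a placeholder, not a real tool):
#   repeat_with_delay 10 2 flush_all_switches
```

Keeping the loop separate from the flush command makes it easy to swap in a different way of reaching the switches.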
No communication between two nodes
symptom
Nodes are reachable from most of the servers, but some pairs of nodes
cannot see each other.
synopses
The MAC hash entries are correct and the nodes are reachable in general.
Only some pairs can't communicate. This is probably related to some
wrong entries for the line cards or fabric cards in the FS1000s.
check
- ssh to node A works
- ssh to node B works
- ping from node A to node B does not work
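The three checks above can be combined into one verdict. The following is a minimal sketch; classify_pair is a hypothetical helper, and the host names A and B in the usage comment are placeholders.

```shell
# Hypothetical helper: interpret the three checks.
# Arguments are exit statuses (0 = success) of:
#   $1 ssh to node A, $2 ssh to node B, $3 ping from node A to node B
classify_pair() {
    ssh_a=$1
    ssh_b=$2
    ping_ab=$3
    if [ "$ssh_a" -ne 0 ] || [ "$ssh_b" -ne 0 ]; then
        echo "node-unreachable"   # a different problem than this section
    elif [ "$ping_ab" -ne 0 ]; then
        echo "pair-blocked"       # the symptom described in this section
    else
        echo "ok"
    fi
}

# Usage from a server (A and B are placeholder host names):
#   ssh A true;                      a=$?
#   ssh B true;                      b=$?
#   ssh A ping -c 3 B >/dev/null;    p=$?
#   classify_pair "$a" "$b" "$p"
```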
solution
Flush the MAC hash tables of all components concurrently, several
times, until the tables are almost empty (about 10 times, with a delay
of 2 seconds in between). You might want to use dsh with the -r option.
Flushing the MAC hash table only on the affected switch is not
sufficient.
Log in to a switch and run:
clear fdb-table learned all
Flaky port
symptom
Nodes are reachable only some of the time. Communication is unreliable.
synopses
The port switches off and on frequently, at approximately 1 Hz.
check
The light on the FS100 (probably not on the FS500) switches on and off.
solution
Currently, we either move to another port on the FS500 or reroute the
traffic via another FS500 by changing the vLAN settings on the FS100. In
the latter case, all ports have to be connected via either vLAN 1 or
vLAN 2 to avoid the bad port. Usually the uplink port 0/49 is connected
to vLAN 1 and 0/50 to vLAN 2. The problem might disappear after a reboot
of the switch, but this has not been checked yet. A cluster-wide flush
of the MAC hash tables might also help.
Wrong MTU setting
symptom
Packets get lost between nodes if the frame size is bigger than
1600 bytes.
synopses
A switch does not transmit jumbo frames, even though it is correctly
configured.
check
Run
ping -s 8000 -M do node
If there is no response, the jumbo frames are probably not handled
correctly.
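To see what the ping test actually exercises: -s sets the ICMP payload size, and -M do forbids fragmentation, so the full IP packet must fit into the path MTU. The sketch below does the arithmetic for IPv4 without IP options (20-byte IP header, 8-byte ICMP header). The function names are hypothetical, and the MTU of 9000 in the usage comment is an assumed example value, not a setting taken from this page.

```shell
# Size of the IPv4 packet produced by "ping -s PAYLOAD":
# payload + 8 bytes ICMP header + 20 bytes IPv4 header (no options).
ping_packet_size() {
    echo $(( $1 + 8 + 20 ))
}

# Largest "ping -s" payload that fits into a given IP MTU unfragmented.
max_ping_payload() {
    echo $(( $1 - 8 - 20 ))
}

# Usage:
#   ping_packet_size 8000    # IP packet size for the test above
#   max_ping_payload 9000    # 9000 is an assumed example MTU
```

A payload of 8000 bytes therefore yields an 8028-byte IP packet, well above the 1600-byte limit named in the symptom.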
Log in to the FS500s and run
enable
debug
bcm-test
ports
If numbers smaller than 9044 appear in the max frame column (the last
4 lines are exceptions), then the hardware (Broadcom) MTU setting is
wrong.
solution
Proceed with:
exit
exit
configure
lag 1
mtu 9040
exit
lag 2
mtu 9040
exit
exit
logout
You might want to check in the bcm-test mode whether the MTUs are now
correct.
Random packet loss
symptom
Packets get lost randomly. Run an fping test over a longer period and
count the number of lost packets.
synopses
Packets get lost randomly. The reason is unknown so far and is an issue
for the developers. We observed two rates. A relatively low rate can be
seen everywhere. If the rate is higher, it is probably caused by a flaky
port; see the flaky port issue. Usually all nodes that are connected to
the problematic port via a particular uplink of the FS100 are affected.
check
A global fping test shows the behavior. Each node runs fping against
random peers at random times. If packets get lost, a counter for this
pair increases. Look at the pair matrix and mark problematic pairs with
a red dot. One then sees homogeneously distributed red points in the
matrix, plus lines of higher point density. These lines correspond to
problematic ports, which are probably flaky.
A possible implementation is to start the following command line on each
node concurrently (via dsh):
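# wait a random time (0-299 s) so that the nodes do not all start at once
sleep $(( RANDOM % 300 ))
# ping 20 random peers; print "<this host> <peer>" for every peer with loss
fping -b8040 -p 100 -c 10 -m -q $(for i in $(seq 1 20); do printf " n%04d " $(( RANDOM % 1675 )); done) 2>&1 | \
awk -F "/|%| " " { if (\$10!=0) {print \"$HOSTNAME \"\$1}}" | sed "s/n//g" | tee /tmp/crossping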
Collect the output of all nodes into one list. These are point pairs,
which can be plotted with e.g. gnuplot. Each point represents a pair of
nodes with at least one problematic connection.
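Besides plotting, the collected pair list can be summarized per node: nodes that sit behind a problematic port should show up in many more pairs than the background. The following is a minimal sketch, assuming the collected list has one "source destination" pair of node numbers per line; count_bad_nodes and the file name all_pairs.txt are hypothetical.

```shell
# Reads "src dst" lines on stdin and prints "count node", highest count
# first. Both ends of a bad pair are counted, since either one may be
# the node behind the flaky port.
count_bad_nodes() {
    awk 'NF >= 2 { n[$1]++; n[$2]++ }
         END { for (k in n) print n[k], k }' |
        sort -k1,1nr -k2,2n
}

# Usage (all_pairs.txt is a placeholder for the collected list):
#   count_bad_nodes < all_pairs.txt | head -n 20
```

Nodes at the top of this list correspond to the dense lines in the pair matrix and are candidates for the flaky-port checks.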
solution
If the packet loss rate is higher than average, see the section on
flaky ports.
For the background rate we have no solution yet. In general, it turned
out that flushing all MAC hash tables improves the situation and reduces
this rate.
We also observed that the rate is almost zero for packets smaller than
6000 bytes and increases for bigger frames. One can probably switch to
an MTU of 6000 bytes.
Periodic packet loss
symptom
In an fping test, a small number of packets gets lost every 148 s.
synopses
This seems to happen regularly if two nodes are connected to the same
FS500. One or two packets get lost in an fping test, but only if the
FS500 is connected via the LAGs to the FS1000s. No other nodes should be
attached to this switch, in order to exclude the random loss.
check
Run fping between two nodes connected to the same FS500. Disconnect all
other nodes from this switch. Run:
while true; do fping -p 10 -b8040 -C 100 -q 10.20.40.1 2>&1 | sed "s/[0-9,.,:,a-z,A-Z, ]//g" | \
grep "\-" | wc -c | awk ' {if ($1!=0) { print systime()" "$1-1}}'; done | tee /tmp/fping_out
and watch.
The effect does not show up with ICMP packets going only from A to B;
no such packet loss has been observed in that case.
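The loop above writes one line per lossy fping run to /tmp/fping_out, in the form "timestamp lost-packets". To check for a period, one can look at the time between successive loss events. This is a minimal sketch; loss_intervals is a hypothetical helper name.

```shell
# Reads "timestamp count" lines on stdin and prints the number of
# seconds between each loss event and the previous one.
loss_intervals() {
    awk 'NR > 1 { print $1 - prev }
         { prev = $1 }'
}

# Usage: list the most frequent intervals first.
#   loss_intervals < /tmp/fping_out | sort -n | uniq -c | sort -k1,1nr | head
```

Since each fping run in the loop lasts about one second (100 packets at 10 ms spacing), the intervals are only accurate to roughly a second.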
solution
This packet loss might disappear for a while when the LAGs are switched
off and on. Log in to the FS500 and run:
configure
lag 1
shutdown
exit
lag 2
shutdown
no shutdown
exit
lag 1
no shutdown
exit
exit
logout
Flushing all hash tables reduces the amplitude of the periodic packet
loss. It turns out that this rate is much lower than that of the random
packet loss. Peaks appear in the power spectrum after a few days of
recording the fping output. This is probably not a big problem and can
be ignored.
Packet duplication
symptom
Once in a while packets are received twice. This happens every 10-800 s.
When it happens, it happens 11 times within 11 s, with a delay of 1 s
between occurrences.
synopses
This is probably correlated with the fabric cards. Currently 11 of them
are working and one is switched off, which would match the 11
duplicates.
check
One needs a clean environment: a switch with just two nodes connected to
it and two LAGs going to the FS1000.
Write two C programs that open ICMP sockets. Send 10000 packets/s of
arbitrary size from A to B, in such a way that the time delay between
the start of successive transmissions is always 0.1 ms. Put an ID, the
sequence number of the sent packet, into the header of the IP packets.
It turned out that packets do not overtake one another, so the IDs of
the received packets never decrease and should be strictly increasing.
If a packet comes in with exactly the same ID as the packet before, we
have packet duplication.
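The receiver-side check can be illustrated without the C programs. The sketch below assumes that the receiver logs one packet ID per line in arrival order; check_ids and the log file name are hypothetical and not part of the original test programs.

```shell
# Reads one packet ID per line (arrival order) on stdin and reports
# IDs equal to the previous one (duplicates) or smaller than it
# (packets that overtook each other).
check_ids() {
    awk '{ id = $1 + 0 }
         NR > 1 && id == prev { print "duplicate", id }
         NR > 1 && id <  prev { print "out-of-order", id }
         { prev = id }'
}

# Usage (received_ids.log is a placeholder):
#   check_ids < received_ids.log
```

Gaps in the ID sequence are deliberately not reported here, since lost packets are covered by the other sections.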
solution
Probably not harmful.
General observations
A high-resolution ping test with a packet send rate of 10 kHz shows
packet loss even in a lab setup with only two nodes connected to an
FS500 with two uplink LAGs to the FS1000s. These are not single packets;
rather, 10-40 consecutive packets get lost at once, so there is a
connectivity loss of 1-4 ms.
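The outage duration follows directly from the send rate: a burst of N lost packets at a rate of R packets per second corresponds to N/R seconds without connectivity. A small sketch of this arithmetic, with outage_us as a hypothetical helper name:

```shell
# Outage duration in microseconds for LOST consecutive packets sent at
# RATE_HZ packets per second (integer arithmetic).
outage_us() {
    lost=$1
    rate_hz=$2
    echo $(( lost * 1000000 / rate_hz ))
}

# Usage:
#   outage_us 10 10000    # lower end of the observed bursts
#   outage_us 40 10000    # upper end of the observed bursts
```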
--
HenningFehrmann - 18 Mar 2011