BPF/XDP test
purpose
We try to ingest 4 x 40 Gbps on 4 dual-port NICs, clone the traffic, write one stream to disk, and transparently transmit the traffic to 4 x 40 Gbps egress ports.
We use Mellanox cards since the entire network infrastructure is based on Mellanox.
setting up an environment
The test code and ideas are based on the BPF/XDP tutorial. See
https://github.com/xdp-project/xdp-tutorial.git.
BPF acts on the traffic coming into the kernel on the XDP level.
We have the Ethernet header information along with the packet content. Based on the header information we can decide how to redirect the traffic. To learn how BPF works, we set up test environments. A first simple test works entirely with virtual NICs and network namespaces: we'd like to redirect the traffic coming in on one virtual NIC to another virtual NIC.
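The header-based decision mentioned above can be sketched in plain C. The helper below mirrors the bounds-checked parsing an XDP program has to perform between ctx->data and ctx->data_end, but it is written as ordinary user-space code so it can be tried without loading anything into the kernel. The names classify_frame, VERDICT_PASS and VERDICT_REDIRECT are hypothetical stand-ins for illustration, and redirecting only IPv4 frames is an assumed policy, not something this test setup does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical verdicts, standing in for XDP_PASS / XDP_REDIRECT. */
enum verdict { VERDICT_PASS = 0, VERDICT_REDIRECT = 1 };

#define ETH_HDR_LEN        14      /* 6 dst MAC + 6 src MAC + 2 EtherType */
#define ETHERTYPE_IPV4_VAL 0x0800  /* EtherType value for IPv4 */

/* Decide what to do with a frame based only on its Ethernet header.
 * data and data_end delimit the packet bytes, like ctx->data and
 * ctx->data_end do in an XDP program. */
static enum verdict classify_frame(const uint8_t *data, const uint8_t *data_end)
{
    /* Bounds check first: never read past the end of the packet. */
    if ((size_t)(data_end - data) < ETH_HDR_LEN)
        return VERDICT_PASS;

    /* The EtherType is stored big-endian at byte offset 12. */
    uint16_t ethertype = (uint16_t)((data[12] << 8) | data[13]);

    /* Assumed example policy: redirect IPv4, pass everything else. */
    return ethertype == ETHERTYPE_IPV4_VAL ? VERDICT_REDIRECT : VERDICT_PASS;
}
```

In a real XDP program the same check would operate on struct ethhdr from linux/if_ether.h, and the verifier would reject the program if the bounds check were missing.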
We set up a left and a right virtual environment.
The following commands need to be done as root or with sudo:
ip netns add left
ip netns add right
ip link add eth0 type veth peer name veth0
ip link add eth1 type veth peer name veth1
# move one end of each veth pair into its network namespace
ip link set veth0 netns left
ip link set veth1 netns right
ip link set eth0 up
ip link set eth1 up
ip netns exec left ip link set dev lo up
ip netns exec left ip link set dev veth0 up
ip netns exec right ip link set dev lo up
ip netns exec right ip link set dev veth1 up
ip netns exec left ip addr add 10.0.0.1/24 dev veth0
ip netns exec right ip addr add 10.0.0.2/24 dev veth1
ip addr add 10.0.0.1/24 dev eth1
A minimal version of the redirection code, which does no further manipulation, reads:
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Newer libbpf versions already define bpf_printk in bpf_helpers.h. */
#ifndef bpf_printk
#define bpf_printk(fmt, ...)                        \
({                                                  \
    char ____fmt[] = fmt;                           \
    bpf_trace_printk(____fmt, sizeof(____fmt),      \
                     ##__VA_ARGS__);                \
})
#endif

SEC("xdp_redirect")
int xdp_redirect_func(struct xdp_md *ctx)
{
    /* Index of the egress interface; must match the target NIC (eth0 in this test). */
    unsigned ifindex = 9;

    bpf_printk("redirecting: ingress %u queue index: %u dest: %u\n",
               ctx->ingress_ifindex, ctx->rx_queue_index, ifindex);

    /* bpf_redirect() returns XDP_REDIRECT on success and XDP_ABORTED on error. */
    return bpf_redirect(ifindex, 0);
}

SEC("xdp_pass")
int xdp_pass_func(struct xdp_md *ctx)
{
    bpf_printk("passing ingress %u\n", ctx->ingress_ifindex);
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
The Makefile is:
LLC ?= llc
CLANG ?= clang

XDP_TARGETS := xdp_prog_kern
XDP_C = ${XDP_TARGETS:=.c}
XDP_OBJ = ${XDP_C:.c=.o}

all: llvm-check $(XDP_OBJ)

llvm-check: $(CLANG) $(LLC)
	@for TOOL in $^ ; do \
		if [ ! $$(command -v $${TOOL} 2>/dev/null) ]; then \
			echo "*** ERROR: Cannot find tool $${TOOL}" ;\
			exit 1; \
		else true; fi; \
	done

$(XDP_OBJ): %.o: %.c Makefile $(COMMON_MK) $(KERN_USER_H) $(EXTRA_DEPS)
	$(CLANG) -S \
	    -target bpf \
	    -D __BPF_TRACING__ \
	    $(BPF_CFLAGS) \
	    -Wall \
	    -Wno-unused-value \
	    -Wno-pointer-sign \
	    -Wno-compare-distinct-pointer-types \
	    -Werror \
	    -O2 -emit-llvm -c -g -o ${@:.o=.ll} $<
	$(LLC) -march=bpf -filetype=obj -o $@ ${@:.o=.ll}

.PHONY: clean $(CLANG) $(LLC)

clean:
	rm -f *.ll
	rm -f $(XDP_OBJ)
To make this code run, we need to adapt the interface index, which should be the index of eth0:
id=$( ip address show dev eth0 |gawk -F ":" '/^[0-9]+:/ {print $1}' ) ; sed -i "s/ifindex = [0-9]*;/ifindex = $id;/" xdp_prog_kern.c
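As an alternative to extracting the index with gawk, the lookup can be sketched in user-space C with the POSIX function if_nametoindex. The wrapper name lookup_ifindex is hypothetical and only adds an error message; this is a minimal sketch, not part of the tutorial code.

```c
#include <net/if.h>
#include <stdio.h>

/* Hypothetical helper: return the kernel interface index for ifname,
 * or 0 if no such interface exists (0 is if_nametoindex's error value). */
static unsigned int lookup_ifindex(const char *ifname)
{
    unsigned int idx = if_nametoindex(ifname);

    if (idx == 0)
        fprintf(stderr, "interface %s not found\n", ifname);
    return idx;
}
```

Calling lookup_ifindex("eth0") in the root namespace should yield the same number that is printed at the start of the ip address show dev eth0 output.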
Compile the code with make and we are ready to go.
setting up BPF
We need to load the BPF blobs. At the XDP level both eth0 and veth0 in the left network namespace need to get the xdp_pass section.
sudo ip link set dev eth0 xdp obj ./xdp_prog_kern.o sec xdp_pass
sudo ip netns exec left ip link set dev veth0 xdp obj ./xdp_prog_kern.o sec xdp_pass
The right local NIC eth1 can load either the redirect section or the pass section:
sudo ip --force link set dev eth1 xdp obj ./xdp_prog_kern.o sec xdp_redirect
or
sudo ip --force link set dev eth1 xdp obj ./xdp_prog_kern.o sec xdp_pass
which triggers either redirecting or passing the packets. The last two commands can be applied alternately.
the experiment
veth1, the NIC inside the right network namespace, needs to constantly ping its peer eth1 in the root namespace:
sudo ip netns exec right ping 10.0.0.1
Listen to what arrives on veth0, the NIC inside the left network namespace, with
sudo ip netns exec left tcpdump -l -i veth0
The bpf_printk function allows some communication from kernel land to user space:
sudo cat /sys/kernel/debug/tracing/trace_pipe
and we see what is going on.
Switch back and forth between redirection and passing.
cloning
For cloning we need more. We need the helper function bpf_redirect_map along with AF_XDP sockets. See the advanced03-AF_XDP section in the tutorial and the AF_XDP kernel documentation for details.
It seems that a NIC writes into a user-space UMEM buffer, bypassing the kernel network stack. An AF_XDP socket (XSK) is created in user space and accesses the buffer.
Two processes with two individual sockets can also access the buffer, but only one NIC can. The cloning process has to happen in user space.
Possible ways are:
- We create two UMEM buffers, connected to different NICs. One process accesses both buffers, reading from one and writing to the other. A second process only reads from one buffer and writes the content to disk. See also io_uring.
- We use bpf_clone_redirect on the TC level.
- We use only bpf_redirect or bpf_redirect_map on the XDP level and either redirect the traffic to a NIC or write the stream to disk. In this case the FBFUSE won't get the stream for the hour of recording.
- Redirection alone can also be done easily in user space by other tools.
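The user-space cloning step from the first option above can be sketched independently of AF_XDP: once a frame is available in a buffer, it is written to two destinations. In the sketch below plain file descriptors stand in for the egress path and the disk file. The names write_all and clone_frame are hypothetical, and a real implementation on top of UMEM would presumably pass frame descriptors between rings instead of calling write.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes and EINTR.
 * Returns 0 on success, -1 on error. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted, try again */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Clone one frame: forward it to egress_fd and store a copy via disk_fd.
 * Returns 0 if both writes succeeded, -1 otherwise. */
static int clone_frame(int egress_fd, int disk_fd,
                       const void *frame, size_t len)
{
    if (write_all(egress_fd, frame, len) < 0)
        return -1;
    return write_all(disk_fd, frame, len);
}
```

Forwarding before storing is an arbitrary choice in this sketch; at 40 Gbps per port the two writes would more likely be handed to separate threads or submitted asynchronously, for example through io_uring.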
The kernel documentation lists a number of people working on BPF. We could possibly ask one of them for some hints.
--
HenningFehrmann - 21 Dec 2021