Design: Having XDP programs per RX-queue

From the very beginning of the XDP design it was envisioned (by Jesper) that it should be possible to assign an XDP program per NIC hardware RX-queue. This idea was rejected upstream due to usability concerns.

From 2016 up until July 2018 the model was "a single XDP program per netdev"; then Jakub Kicinski added XDP_ATTACHED_MULTI support for running multiple XDP programs simultaneously. In this mode, the kernel allows attaching XDP programs in the driver/generic hook and in hardware at the same time, given that the hardware has BPF offloading support.

In this document the idea of a per-RX-queue XDP program is revisited, and the rationale, design and semantics for this feature are outlined.

Background for XDP per RX-queue

This section explains the background and the need for supporting XDP programs attached to individual RX-queues of the NIC, and some of the steps already taken towards this goal.

Original idea

The original idea behind per-RX-queue handling comes from Van Jacobson's NetChannels (PDF), as this is the building block for creating a Single Producer Single Consumer (SPSC) channel into a socket or user application.

Small step in that direction: xdp_rxq_info

In kernel v4.16 struct xdp_rxq_info was introduced, which contains information specific to each RX-queue in the driver. E.g. this struct contains the RX queue_index number.

struct xdp_rxq_info {
        struct net_device *dev;
        u32 queue_index;
        u32 reg_state;
        struct xdp_mem_info mem;
};

In kernel v4.18, the enum xdp_mem_type was introduced and the type MEM_TYPE_ZERO_COPY was added.

The xdp_rxq_info is associated to the struct xdp_buff via the struct member rxq. The kernel code and the XDP program can use the struct xdp_rxq_info traits in different ways, e.g. AF_XDP uses the xdp_mem_type to determine how the packet was allocated. If the packet has mem type MEM_TYPE_ZERO_COPY, this means that the packet was allocated using a "zero-copy allocator/buffer provided by userspace" and the kernel-to-user copy can be elided.

enum xdp_mem_type {
        MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
        MEM_TYPE_PAGE_ORDER0,     /* Orig XDP full page model */
        MEM_TYPE_PAGE_POOL,
        MEM_TYPE_ZERO_COPY,
        MEM_TYPE_MAX,
};

struct xdp_mem_info {
        u16 type; /* enum xdp_mem_type */
        u16 id;
};
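
For illustration, the AF_XDP receive path roughly dispatches on this mem type; the sketch below is a simplified version of xsk_rcv(), not the exact kernel code:

/* Simplified sketch of how xsk_rcv() uses xdp_rxq_info */
static int xsk_rcv_sketch(struct xdp_sock *xs, struct xdp_buff *xdp)
{
        u32 len = xdp->data_end - xdp->data;

        /* Check the frame arrived on the dev/queue this socket is bound to */
        if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
                return -EINVAL;

        /* Zero-copy: the frame already lives in the userspace umem,
         * so the kernel-to-user copy can be elided.
         */
        if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY)
                return __xsk_rcv_zc(xs, xdp, len);

        return __xsk_rcv(xs, xdp, len); /* copy path */
}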

Bottleneck in single/global netdev XDP program

For XDP programs that use AF_XDP, the queue_index trait, which lets the XDP program know which RX-queue the packet originated from, is especially important.

AF_XDP sockets communicate between user-space and kernel-space via SPSC queues. An XDP socket is associated with a certain netdev RX-queue, which means that the (global/per-netdev) XDP program needs to track which RX-queue the packet was received on, and if there is a socket associated with that RX-queue, the packet is redirected to that socket. The XDP developer must make sure that the correct ctx->rx_queue_index is XDP_REDIRECT'ed into the corresponding XSKMAP index (else the frame is dropped in __xsk_map_redirect() by the dev and queue_id check in xsk_rcv()).
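
A minimal sketch of such a (global) XDP program, assuming user-space has created an XSKMAP called "xsks_map" and inserted a socket at the index of each flow-steered RX-queue (names and map size are placeholders):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_XSKMAP);
        __uint(max_entries, 64);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, sizeof(__u32));
} xsks_map SEC(".maps");

SEC("xdp")
int xdp_sock_prog(struct xdp_md *ctx)
{
        __u32 index = ctx->rx_queue_index;

        /* Only redirect if user-space registered a socket for this queue */
        if (bpf_map_lookup_elem(&xsks_map, &index))
                return bpf_redirect_map(&xsks_map, index, 0);

        return XDP_PASS;
}

char _license[] SEC("license") = "GPL";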

Further, an XDP program that directs certain flows to an XDP socket typically uses hardware flow-steering to make sure that the flow(s) end up on dedicated RX-queues. The non-AF_XDP RX-queues typically serve the regular kernel stack. This means that the model of a global/per-netdev XDP program penalizes the non-AF_XDP flows, since a "dummy"/trivial program is executed even for the flows that are headed for the kernel.

As XDP grows and more use-cases are added, I fear that the single XDP program per netdev is going to become a performance bottleneck, as the single XDP program will have to perform a lot of common checks before it knows which use-case a packet matches. E.g. as described above, AF_XDP redirect requires reading ctx->rx_queue_index, and the core xsk_rcv() function also needs to (re)check that this is correct.

With an XDP program per RX-queue, we can instead leverage the hardware to pre-filter/sort packets, and thus simplify the XDP programs. For AF_XDP zero-copy we already depend on NIC hardware filters being set up. The optimization for AF_XDP is that the checks of rx_queue_index (and dev) can be moved from the runtime fast-path to setup time.

The second-level optimization is to store extra info per RX-queue that allows us to take a more direct action. E.g. in the case of AF_XDP, storing the xsk_sock allows us to basically call xsk_rcv() directly, which in turn allows us to skip part of the XDP redirect-core code. (P.S. do remember to handle the flush at NAPI-end.)
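
A rough sketch of this idea, using a hypothetical per-RX-queue state (the rxq_xdp_state structure and the direct calls are assumptions, not existing driver code):

/* Hypothetical per-RX-queue XDP state, sketch only */
struct rxq_xdp_state {
        struct bpf_prog *xdp_prog;      /* RXQ-prog, or global-prog fallback */
        struct xdp_sock *xsk;           /* if set: deliver directly via xsk_rcv() */
};

static void rxq_xdp_handle(struct rxq_xdp_state *state, struct xdp_buff *xdp)
{
        u32 act;

        if (state->xsk) {
                /* Skip the BPF indirect call and the XDP redirect core;
                 * the dev/queue_id checks were already done at setup time.
                 * (Remember to xsk_flush() the socket at NAPI end.)
                 */
                xsk_rcv(state->xsk, xdp);
                return;
        }

        act = bpf_prog_run_xdp(state->xdp_prog, xdp);
        /* ... handle act: XDP_PASS/XDP_TX/XDP_REDIRECT/XDP_DROP as usual ... */
}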

Depending on NIC hardware filter setup

For XDP progs per RX-queue to make sense, we do need to set up NIC hardware filters to steer traffic to specific RX-queues. AF_XDP zero-copy already has this dependency.

There are several ways to configure NIC hardware filters, e.g. ethtool or TC hardware offloads. It is generally out of scope for XDP to do this setup itself. It is a setup dependency that needs to be handled (outside of, and) before attaching the XDP program.
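
For example, ethtool N-tuple filters can steer a given UDP flow to a dedicated RX-queue (device name, port and queue number below are placeholders):

ethtool -K eth0 ntuple on
ethtool -N eth0 flow-type udp4 dst-port 4242 action 16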

Interface semantics

The initial reason XDP programs per RX-queue were rejected was usability concerns. Thus, it is important that we define and agree on the behavior and semantic meaning of the user interface.

Proposal: RXQ-prog takes precedence

Proposal#1: The RX-queue xdp_prog (RXQ-prog) takes precedence over the global xdp_prog (global-prog) installed on the netdev. More specifically:

  • If a global-prog is already loaded, the RXQ-prog takes over/replaces it for this specific/single RXQ.
  • When the RXQ-prog is unloaded, the global-prog takes over (if installed). If no global-prog is installed, the RXQ-prog will simply be removed.

Details: What happens if someone unloads the global-prog?

  • Proposal#1.1 (Jesper): Unloading the global-prog keeps the RXQ-progs intact. The global-prog only impacts the behavior when the RXQ-prog is removed.
  • Argument: There is a reason some application installed this specific RXQ-prog; it would be unexpected for all these applications if a global unload (installing a NULL xdp_prog) were to ignore/clear their specific setup.

For completeness: What happens if someone loads a global-prog while existing RXQ-progs exist:

  • Then the RXQ-prog still takes precedence.
  • The global-prog is only installed on the RXQs that don't have an RXQ-prog associated.

TODO: What about XDP_ATTACHED_MULTI and XDP_ATTACHED_HW? Any implications? XDP_ATTACHED_SKB? Can netdevs reject RXQ-prog support?

Summary: There is a 2-level hierarchy of XDP programs: one global program, global-prog, and one per-RX-queue program, RXQ-prog. Both can be set. If an RXQ-prog is set, it takes precedence over the global-prog. When/if the RXQ-prog is removed, the global-prog is re-attached (if set). A sketch of this resolution logic is shown below.
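
A minimal sketch of the precedence resolution, assuming the refactor ideas described later where both programs live in net_device / netdev_rx_queue (the helper name and the netdev_rx_queue->xdp_prog member are proposals, not existing code):

/* Hypothetical helper, sketch only: which program is effective on a queue */
static struct bpf_prog *rxq_effective_xdp_prog(struct net_device *dev, u16 qid)
{
        /* dev->_rx[qid].xdp_prog is the proposed RXQ-prog storage */
        struct bpf_prog *prog = dev->_rx[qid].xdp_prog;

        return prog ? prog : dev->xdp_prog;     /* fall back to global-prog */
}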

Covering future semantics

NOTE: (Björn) Is this section valid? Do we really want to relax the BPF-program part? Yes, I'd like to get rid of the retpoline, but not the BPF program. What kind of builtins, other than trivial programs (e.g. always XDP_PASS or always XDP_DIRECT) are there?

It is part of the future optimization plans to change the RXQ-prog such that it isn't an actual BPF-prog any longer. E.g. in the case of AF_XDP sockets, where all frames from a specific RXQ need to be redirected into a single xsk_sock, we can avoid invoking BPF (which is a retpoline-hit indirect call) and instead call xsk_rcv() directly.
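
One way to express such a "builtin" could be a small type tag stored next to the per-queue program. This is purely a sketch; none of these names exist today:

/* Hypothetical builtin types, sketch only */
enum rxq_prog_type {
        RXQ_PROG_BPF,           /* a regular BPF program is attached */
        RXQ_PROG_BUILTIN_XSK,   /* builtin: deliver all frames via xsk_rcv() */
};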

How do we keep the userspace tools and semantics the same?

E.g. when listing the RXQ-progs, we should/must still display that an XDP program/hook is running on the RXQ. (If we are really nice, we can display that this is a "builtin" of type X.)

E.g. when the sysadmin forcefully unloads/removes the RXQ-prog from a queue, but this is a "builtin" type, then we should likely maintain the semantics and allow the sysadmin to unload/remove this XDP "hook".

Introspection tools

For the user/sysadmin it is also very important that the existing tools for listing XDP programs are extended with support for listing the RXQ-progs.

In proposal#1.1, clearing the global-prog does not clear the RXQ-progs, but we can extend the tools with a new option to forcefully unload all XDP programs, although this is disruptive to the applications.

iproute2

The iproute2 suite (ip link) has support for querying a device for XDP programs and also for attaching XDP programs. This would need to be extended for per-RX-queue programs (RXQ-progs).

For the query, the json output could be extended with a "qid" item. In the example below, RX-queue 7 has an attached RXQ-prog, and the program without the qid is the global-prog.

"xdp": {
    "mode": 4,
    "attached": [ {
	    "mode": 1,
	    "prog": {
		"id": 7,
		"tag": "b3b16d7ef8ea98e4",
		"jited": 1
	    }
	},{
	    "mode": 1,
	    "qid": 7,
	    "prog": {
		"id": 11,
		"tag": "a0e2f2463514a59c"
		"jited": 1
	    }
	} ]
},

and one example without a global-prog:

"xdp": {
    "mode": 1,
    "qid": 7,
    "prog": {
	"id": 11,
	"tag": "a0e2f2463514a59c",
	"jited": 1
    },
    "attached": [ {
	    "mode": 1,
	    "qid": 7,
	    "prog": {
		"id": 11,
		"tag": "a0e2f2463514a59c",
		"jited": 1
	    }
	} ]
},

In order to attach an RXQ-prog, the "qid" argument would need to be added to the "xdp" subcommand of ip-link.
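
A hypothetical invocation could then look like the following, where "qid 7" is the proposed new argument (not an existing ip-link option) and the object file name is a placeholder:

ip link set dev eth0 xdp qid 7 obj xdp_rxq_prog.o sec xdp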

TODO: Would it make sense to expose the number of available RX-queues via iproute2? Currently a user needs to query that via "ethtool --show-channels".

bpftool

The bpftool utility has, similar to iproute2, support for querying a device for XDP programs, and it would require RXQ-prog support as well. The query would be analogous to iproute2.
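
Today the device-level XDP attachment can be listed with e.g.:

bpftool net show dev eth0

and that listing would need to grow a per-queue id, analogous to the "qid" item proposed for the iproute2 JSON output above.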

Code assessment

Initial code assessment

Most of the XDP drivers already have an xdp_prog pointer stored in some per RX-queue data-structure. Thus, part of the driver code should be ready to handle per RX-queue xdp_prog.

The general code idea is that the driver simply fetches and runs the xdp_prog assigned to its RX-queue data structure. At runtime the driver doesn't really care if this is a global or an RX-queue specific program. (At setup time the driver, or preferably the core, can add extra constraint checks for RX-queue specific programs, for optimization reasons.)
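
A rough, driver-agnostic sketch of that fast-path (the my_rx_ring structure is an assumption standing in for the driver's own per-RX-queue structure):

/* Hypothetical driver per-RX-queue structure, sketch only */
struct my_rx_ring {
        struct bpf_prog *xdp_prog;      /* effective program: RXQ-prog or global-prog */
        /* ... ring descriptors, xdp_rxq_info, etc. ... */
};

/* Sketch: running XDP on one received frame in the driver NAPI poll */
static u32 my_rx_ring_run_xdp(struct my_rx_ring *ring, struct xdp_buff *xdp)
{
        struct bpf_prog *prog = READ_ONCE(ring->xdp_prog);

        if (!prog)
                return XDP_PASS;        /* no program attached to this queue */

        /* Identical call whether prog is a global-prog or an RXQ-prog */
        return bpf_prog_run_xdp(prog, xdp);
}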

The general question is: with two types of xdp_progs, a global and an RX-queue specific one, where do we store these? There is going to be a need to reapply e.g. the global xdp_prog in case an RX-queue xdp_prog is removed.

And is it possible/good to keep this info centrally in e.g. net_device, to simplify the driver interface?

Extending XDP-setup NDO commands

We (obviously) need to extend enum bpf_netdev_command with a new command, and extend struct netdev_bpf with a new union-struct to carry our info.

Questions:

  • Q1: What new commands do we need?
  • Q2: What should this new union-struct contain?
    • Should we think ahead and add a builtin-type member already?

The simplest way is adding a "qid" to the query/setup commands. XDP_QUERY_PROG with a qid present would query for the RXQ-prog on a certain queue; analogous for setup. The logic for the fallback when an RXQ-prog is removed, and the like, would live in dev_xdp_{un,}install, similar to how it is done today.

Downsides?

Code diff pointer:

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1377d085ef99..02feb990119c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -881,6 +881,7 @@ struct netdev_bpf {
                /* XDP_SETUP_PROG */
                struct {
                        u32 flags;
+                       u16 queue_id; /* -1 is global */
                        struct bpf_prog *prog;
                        struct netlink_ext_ack *extack;
                };
@@ -889,6 +890,7 @@ struct netdev_bpf {
                        u32 prog_id;
                        /* flags with which program was installed */
                        u32 prog_flags;
+                       u16 queue_id; /* -1 is global */
                };
                /* BPF_OFFLOAD_MAP_ALLOC, BPF_OFFLOAD_MAP_FREE */
                struct {

Refactor idea: move xdp_rxq_info to net_device/netdev_rx_queue

Should we move xdp_rxq_info into net_device->_rx[], which is a struct netdev_rx_queue array? (Saeed actually proposed this originally.)

/* This structure contains an instance of an RX queue. */
struct netdev_rx_queue {
#ifdef CONFIG_RPS
        struct rps_map __rcu            *rps_map;
        struct rps_dev_flow_table __rcu *rps_flow_table;
#endif
        struct kobject                  kobj;
        struct net_device               *dev;
        struct xdp_rxq_info             xdp_rxq;
#ifdef CONFIG_XDP_SOCKETS
        struct xdp_umem                 *umem;
#endif
} ____cacheline_aligned_in_smp;

As can be seen, it already contains an xdp_rxq_info member xdp_rxq, which is used by generic XDP. But given the restriction (which Daniel was wise enough to add) that XDP-native and XDP-generic cannot co-exist on the same net_device, this member could also be used by native-XDP.

Refactor idea: xdp/bpf_prog into netdev_rx_queue/net_device

The "global" bpf_prog in generic-XDP is stored in net_device member xdp_prog. For generic-XDP to gain XDP-prog per RX-queue support, we could extend netdev_rx_queue with a xdp_prog member (type struct bpf_prog).

It would be interesting to investigate if it is possible to make drivers (native-XDP) also use net_device->xdp_prog or netdev_rx_queue->xdp_prog instead of storing this in driver local data structures. (As XDP-native and XDP-generic cannot co-exist, this should be possible).

This would also remove the need for passing the XDP program in the ndo_bpf implementation for the XDP-case, but not for the XDP HW-case.

Code gotchas: driver runtime changing num queues

We need to check what happens when the driver changes the number of queues at runtime. This can happen via ethtool:

ethtool --help | grep Channels
        ethtool -l|--show-channels DEVNAME	Query Channels
        ethtool -L|--set-channels DEVNAME	Set Channels

Drivers already need to handle this, and last time I checked this was handled correctly in drivers. Still, this is a code area we need to take into account.

AFAIK this also affects net_device->_rx[] and dev->real_num_rx_queues, which might be relevant given the above refactor ideas.

NOTE: (Björn) For AF_XDP zero-copy we restrict/disable the possibility to change the number of queues via ethtool when there is a socket bound to a queue. This could apply to RXQ-prog as well.

Code notes for AF_XDP sockets

AF_XDP bind

The AF_XDP socket bind call is an interesting hook point that could be leveraged.

NOTE (Björn): I have some ideas here (RXQ-prog with the same lifetime as a socket, and more crazy stuff like relocating sockets into BPF code similar to BPF_LD_MAP_FD, but BPF_LD_SOCKET_FD).

The xdp_umem_assign_dev() call in xsk_bind() ends up calling the XDP setup-NDO:

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 07156f43d295..6004634858b3 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -473,7 +473,7 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
                              xs->umem->chunk_mask);
                xskq_set_umem(xs->umem->cq, xs->umem->size,
                              xs->umem->chunk_mask);
-
+// Notice: end-up calling into dev-XDP-setup call dev->netdev_ops->ndo_bpf()
                err = xdp_umem_assign_dev(xs->umem, dev, qid, flags);
                if (err)
                        goto out_unlock;

Inside xdp_umem_assign_dev() it also registers the umem in netdev->_rx[] (and _tx[]):

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index a264cf2accd0..8424368f834e 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -87,7 +87,7 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
                err = -EBUSY;
                goto out_rtnl_unlock;
        }
-
+       // Registers itself in netdev->_rx[] and _tx[]
        xdp_reg_umem_at_qid(dev, umem, queue_id);
        umem->dev = dev;
        umem->queue_id = queue_id;
@@ -104,7 +104,7 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
        bpf.command = XDP_SETUP_XSK_UMEM;
        bpf.xsk.umem = umem;
        bpf.xsk.queue_id = queue_id;
-
+       // Calling XDP setup-NDO:
        err = dev->netdev_ops->ndo_bpf(dev, &bpf);
        if (err)
                goto err_unreg_umem;

Date: 2021-09-20 Mon 18:33
