XDP multi buffer design

This document is a serious attempt to find a way to support multi-buffer packets with XDP, with the important criterion of not hurting the performance of the single-buffer-per-packet design.

Keywords: xdp vs. jumbo-frame, TSO/LRO, packet header split

Why XDP doesn't have multi-buffer support

XDP is designed for maximum performance, which is why certain driver-level use-cases, such as multi-buffer packets (e.g. jumbo-frames), were not supported. Supporting them would, for example, complicate the driver RX-loop and memory model handling.

The single-buffer-per-packet design is also tied into eBPF Direct-Access (DA) to packet data, which can only be allowed if the packet data is in contiguous memory. This DA feature is essential for XDP performance.
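
To make the contiguity requirement concrete, here is a minimal user-space sketch in C (not kernel or BPF code). The names struct pkt_view and read_ethertype() are hypothetical and only used for illustration; they mimic the data/data_end pair that an XDP program bounds-checks before a direct access.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_ETH_HLEN 14 /* Ethernet header length in bytes */

/* Hypothetical stand-in for the XDP context: one contiguous buffer. */
struct pkt_view {
	const uint8_t *data;     /* first byte of packet data */
	const uint8_t *data_end; /* one past the last byte */
};

/*
 * Return the EtherType in host byte order, or -1 if the Ethernet header
 * does not fit. One bounds check is sufficient only because
 * [data, data_end) is a single contiguous memory range.
 */
static int read_ethertype(const struct pkt_view *v)
{
	if ((size_t)(v->data_end - v->data) < MODEL_ETH_HLEN)
		return -1;
	return (v->data[12] << 8) | v->data[13];
}
```

If the packet were spread over several buffers, a check like this would no longer tell us whether the bytes behind data are the bytes we expect, which is why DA and the single-buffer design are so closely coupled.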

Use-cases for multi-buffer

What are the use-cases for multi-buffer packets:

Proposal#1: XDP only access first-buffer

The eBPF Direct-Access (DA) feature is essential for performance. Thus, we don't want to add an abstraction layer that transparently gives the XDP BPF-prog access to payload across multiple buffers.

One way forward is to define that XDP only gets access to the first packet buffer, and cannot see subsequent buffers. We can probably allow the bpf program to adjust subsequent buffers by pushing/pulling headers through bpf helpers.

We will extend xdp_buff/xdp_frame with a multi-buffer bit (mb) as an indication that this is a multi-buffer packet. This bit is needed by internal helpers (see later) and might be useful for the XDP-developer (e.g. to return XDP_PASS for this kind of frame).

Proposal#2: Access to additional fragments

Although by default you only have access to the first segment (fragment 0), it would make sense for an XDP application to access data further on in the packet.

If this gets done by a helper similar to bpf_msg_pull_data(), the linearization of the data would make this a show stopper from a performance point of view. As an alternative, it would be nice if we could get a rolling window into the packet, by selecting the fragment (page) we would like to look at. The downside of this is that the XDP program needs to take into account that some data (let's say a specific network struct) might be split across multiple pages.

There are several ways this could be implemented. The helper could simply be called something like bpf_xdp_fragment_select(struct xdp_buff *xdp_md, u8 frag). The implementation details could differ:

  • We would use the existing data* pointer fields to point to the fragment currently being accessed, and introduce a data_current_fragment (or similar) variable to indicate which fragment the data* values point to (or leave this to the XDP application to track).
  • Alternatively, we could keep the data* fields as is, and add some additional fragment_data* fields. However, this would add fields to the xdp_md/xdp_buff structure, and it would require additional helper functions (see below).
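
A rough user-space sketch of the first variant (re-pointing data* at the selected fragment) could look like the following. This is only a model under stated assumptions: struct xdp_ctx_model, struct frag_model, model_fragment_select() and the limit of 17 fragments are hypothetical illustration names and values, not existing kernel API.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_FRAGS 17 /* assumed upper bound on fragments */

/* Hypothetical description of one fragment (page) of the packet. */
struct frag_model {
	uint8_t *start;
	size_t len;
};

/* Hypothetical context: data/data_end always describe the selected fragment. */
struct xdp_ctx_model {
	uint8_t *data;
	uint8_t *data_end;
	uint8_t data_current_fragment;
	uint8_t nr_frags; /* fragment 0 is the first buffer */
	struct frag_model frags[MODEL_MAX_FRAGS];
};

/*
 * Move the window to fragment 'frag'. Returns 0 on success, or -1 if the
 * fragment does not exist, in which case the window is left untouched.
 */
static int model_fragment_select(struct xdp_ctx_model *ctx, uint8_t frag)
{
	if (frag >= ctx->nr_frags)
		return -1;
	ctx->data = ctx->frags[frag].start;
	ctx->data_end = ctx->data + ctx->frags[frag].len;
	ctx->data_current_fragment = frag;
	return 0;
}
```

Note that after every select the program has to redo its bounds checks against the new data/data_end pair, and a struct that straddles two fragments still cannot be read through a single window.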

XDP multi-buffer extensions and complications

How, and with what, do we need to extend the XDP data structures to handle multi-buffer packets?

XDP tail adjust

The BPF XDP helper bpf_xdp_adjust_tail changes the length of the packet. Currently it can only shrink the packet (but we need to extend it to allow growing the tail, as e.g. IPsec and DNS-cache need this).

UPDATE: The helper can also grow the packet size as of kernel v5.8, commit c8741e2bfe87 ("xdp: Allow bpf_xdp_adjust_tail() to grow packet size").

With multi-buffer XDP packets (and proposal#1), this adjust tail helper becomes problematic.

Wouldn't it be easier to disallow a BPF-prog that uses this helper when the driver has configured multi-buffer? Or would that be too restrictive, if jumbo-frames are very uncommon and only enabled because the switch infra could not be changed?

Perhaps it is better to let bpf_xdp_adjust_tail() fail at runtime?
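
As a sketch of what "fail at runtime" could mean, the user-space model below rejects the call whenever the mb bit is set and otherwise applies the usual bounds checks. Everything here is an assumption for illustration: struct xdp_buff_model, model_adjust_tail(), the -EOPNOTSUPP return code and the 14-byte minimum length are not taken from the kernel implementation.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MIN_LEN 14 /* assumed minimum: keep at least an Ethernet header */

/* Hypothetical, simplified view of xdp_buff for a single buffer. */
struct xdp_buff_model {
	uint8_t *data;
	uint8_t *data_end;
	uint8_t *data_hard_end; /* end of the usable tailroom */
	bool mb;                /* multi-buffer bit */
};

/* Grow (delta > 0) or shrink (delta < 0) the tail; 0 or negative errno. */
static int model_adjust_tail(struct xdp_buff_model *xdp, int delta)
{
	ptrdiff_t len = xdp->data_end - xdp->data;
	ptrdiff_t room = xdp->data_hard_end - xdp->data_end;

	if (xdp->mb)
		return -EOPNOTSUPP; /* assumed policy: refuse multi-buffer packets */
	if (delta > room)
		return -EINVAL;     /* not enough tailroom to grow */
	if (len + delta < MODEL_MIN_LEN)
		return -EINVAL;     /* would shrink below the minimum length */
	xdp->data_end += delta;
	return 0;
}
```

The advantage of this runtime variant is that the same BPF-prog keeps working for all single-buffer packets, and only has to handle an error for the (presumably rare) multi-buffer ones.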

Use-cases for adjust tail

Use-cases that need to adjust tail of packet:

  • ICMP replies directly from XDP need to shorten packet tail, but this use-case doesn't use fragments. See bpf-sample.
  • IPsec needs to add/extend the packet tail for the IPsec trailer, again unlikely that this needs fragments(?). (This use-case convinced me that we need to add extend-tail support to bpf_xdp_adjust_tail)
  • DNS or memcached replies directly from XDP, need to extend packet tail, to have room for reply. (It would be interesting to allow larger replies, but I'm not sure we should ever support that).

XDP access to full packet length?

We likely need to provide the full packet length to the XDP core and also to the XDP-developer. The use-case for XDP-developers is validating the length of the packet against what the packet headers claim.
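
The validation use-case can be sketched in user-space C as follows. The function name ipv4_len_plausible() and its parameters are hypothetical; the sketch assumes the caller has already established that the frame is untagged Ethernet carrying IPv4, and that full_len is the total length over all buffers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_ETH_HLEN 14      /* Ethernet header, no VLAN tag assumed */
#define MODEL_IPV4_MIN_HLEN 20 /* IPv4 header without options */

/*
 * first/first_len: the contiguous first buffer (the only directly
 *                  accessible part of the packet).
 * full_len:        total packet length across all buffers.
 */
static bool ipv4_len_plausible(const uint8_t *first, size_t first_len,
			       size_t full_len)
{
	size_t tot_len;

	if (first_len < MODEL_ETH_HLEN + MODEL_IPV4_MIN_HLEN)
		return false;
	/* IPv4 total length: bytes 2-3 of the IP header, network byte order. */
	tot_len = ((size_t)first[MODEL_ETH_HLEN + 2] << 8) |
		  first[MODEL_ETH_HLEN + 3];
	/* Frames may carry padding, so headers may claim less, never more. */
	return MODEL_ETH_HLEN + tot_len <= full_len;
}
```

With only the first-buffer length available, a 3000-byte IP packet split over several buffers would look truncated; this is the reason the full length has to be exposed in some form.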

But if we need to know the full length when the first buffer is processed, then this affects the driver's RX-loop, because we need to "collect" all the buffers before we can know the length (although some HW provides this in the first descriptor).

We likely have to change the driver's RX-loop anyhow, as XDP_TX and XDP_REDIRECT will also need to "collect" all buffers before the packet can be forwarded. (Although this could potentially happen later in the driver loop, when it finds the End-Of-Packet descriptor bit).
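
The "collect" step can be illustrated with a small user-space model of an RX ring. The descriptor layout (struct rx_desc_model) and the function model_collect_packet() are hypothetical; real descriptor formats differ per NIC.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical RX descriptor. */
struct rx_desc_model {
	uint32_t len; /* bytes stored in this buffer */
	bool eop;     /* End-Of-Packet bit */
};

/*
 * Walk ring[start..n) until a descriptor with EOP set is found.
 * Returns the number of descriptors that make up the packet and stores
 * the full packet length in *full_len. Returns 0 (and leaves *full_len
 * untouched) if EOP has not been seen yet, i.e. the packet is incomplete.
 */
static size_t model_collect_packet(const struct rx_desc_model *ring, size_t n,
				   size_t start, uint32_t *full_len)
{
	uint32_t total = 0;
	size_t i;

	for (i = start; i < n; i++) {
		total += ring[i].len;
		if (ring[i].eop) {
			*full_len = total;
			return i - start + 1;
		}
	}
	return 0;
}
```

The model makes the cost visible: the XDP program for a packet cannot be given the full length until the loop has touched every descriptor of that packet.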

Storage space for multi-buffer references/segments

A multi-buffer packet consists of several frame segments. The data structure used for holding these packet buffers/segments also needs to be discussed.

To describe each segment we need a pointer (to the page or data start), offset and length. There are (at-least) two existing kernel data structures that have such a layout, bio_vec and skb_frag_t (aka skb_frag_struct).

Layout of skb_frag_struct:

$ pahole -C skb_frag_struct vmlinux
struct skb_frag_struct {
        struct {
                struct page * p;                         /*     0     8 */
        } page;                                          /*     0     8 */
        __u32                      page_offset;          /*     8     4 */
        __u32                      size;                 /*    12     4 */

        /* size: 16, cachelines: 1, members: 3 */
        /* last cacheline: 16 bytes */
};

Layout of bio_vec:

$ pahole -C bio_vec vmlinux
struct bio_vec {
        struct page              * bv_page;              /*     0     8 */
        unsigned int               bv_len;               /*     8     4 */
        unsigned int               bv_offset;            /*    12     4 */

        /* size: 16, cachelines: 1, members: 3 */
        /* last cacheline: 16 bytes */
};

The skb_frag_t would be the most obvious choice, as we already have to write this when creating an SKB, in the skb_shared_info area, whose layout looks like this:

$ pahole -C skb_shared_info vmlinux
struct skb_shared_info {
        __u8                       __unused;             /*     0     1 */
        __u8                       meta_len;             /*     1     1 */
        __u8                       nr_frags;             /*     2     1 */
        __u8                       tx_flags;             /*     3     1 */
        short unsigned int         gso_size;             /*     4     2 */
        short unsigned int         gso_segs;             /*     6     2 */
        struct sk_buff     * frag_list;                  /*     8     8 */
        struct skb_shared_hwtstamps hwtstamps;           /*    16     8 */
        unsigned int               gso_type;             /*    24     4 */
        u32                        tskey;                /*    28     4 */
        atomic_t                   dataref;              /*    32     0 */

        /* XXX 8 bytes hole, try to pack */

        void *                     destructor_arg;       /*    40     8 */
        skb_frag_t                 frags[17];            /*    48   272 */

        /* size: 320, cachelines: 5, members: 13 */
        /* sum members: 312, holes: 1, sum holes: 8 */
};

Another problem is the size of these structs: each buffer/segment takes 16 bytes, and we likely need to support 17 segments to be compatible with SKBs, which results in a size of 272 bytes.

One idea is that we simply use the same memory area that we have to store skb_shared_info into. This allows us to get the SKB setup for free, when doing XDP_PASS or when doing SKB alloc after XDP_REDIRECT.
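
To get a feeling for the numbers, the following user-space sketch computes where such a shared-info area would sit in the first buffer and how much packet data remains. The 16, 17 and 320 byte values come from the pahole output above and 256 is XDP_PACKET_HEADROOM; the 4096-byte frame size, the 64-byte cacheline alignment and all model_* names are assumptions made for this illustration.

```c
#include <stddef.h>

#define MODEL_FRAME_SIZE  4096u /* assumption: one 4 KiB page per buffer */
#define MODEL_HEADROOM     256u /* XDP_PACKET_HEADROOM */
#define MODEL_SHINFO_SIZE  320u /* struct skb_shared_info, per pahole above */
#define MODEL_FRAG_SIZE     16u /* one skb_frag_t, per pahole above */
#define MODEL_MAX_FRAGS     17u /* frags[17] */
#define MODEL_CACHELINE     64u /* assumed alignment of the shared-info area */

/* Round v up to the next multiple of a. */
static size_t model_align_up(size_t v, size_t a)
{
	return (v + a - 1) / a * a;
}

/* Bytes needed for the fragment array alone. */
static size_t model_frags_array_size(void)
{
	return MODEL_MAX_FRAGS * MODEL_FRAG_SIZE;
}

/* Offset of the shared-info area from the start of the frame. */
static size_t model_shinfo_offset(void)
{
	return MODEL_FRAME_SIZE -
	       model_align_up(MODEL_SHINFO_SIZE, MODEL_CACHELINE);
}

/* Maximum packet bytes that fit in the first buffer. */
static size_t model_first_buf_max_data(void)
{
	return model_shinfo_offset() - MODEL_HEADROOM;
}
```

Under these assumptions the fragment array takes 17 * 16 = 272 bytes, the shared-info area starts at offset 4096 - 320 = 3776, and the first buffer can hold at most 3776 - 256 = 3520 bytes of packet data.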

Side-note: There have been suggestions before to unify bio_vec and skb_frag_t (see here). Update: This unification has already been done (see here).

Helper-proposal#1 for XDP multi-buffer bpf helpers

Assuming the XDP multi-buffer layout relies on a skb_shared_info structure at the end of the first buffer to link all subsequent buffers, we can leverage the skb_frags* utilities in bpf helpers. Moreover, assuming each buffer is one page in size, it is possible to compute the headroom/tailroom available in the buffer to push/pop headers through bpf helpers.

Critique: IMHO these helpers are too low level, as they expose direct manipulation of the offset and len of individual segments/fragments as UAPI. This will limit the flexibility of the internal kernel API.

  • bpf_xdp_adjust_frag_offset(index,offset)

    Adjust the skb_frag_t offset of the fragment at position index in an XDP multi-buffer.

  • bpf_xdp_adjust_frag_len(index,len)

    Adjust the skb_frag_t length of the fragment at position index in an XDP multi-buffer.

  • bpf_xdp_get_frag(index,*offset,*len)

    Return the offset and length of the fragment at position index.

  • bpf_xdp_get_frag_count()

    Return the total number of fragments for a given xdp multi-buffer.
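
To show how low-level these helpers are, here is a user-space model of one possible reading of their semantics. The proposal does not say whether the offset/len arguments are absolute values or deltas, so this sketch assumes signed deltas, assumes that moving the offset keeps the end of the fragment fixed, and assumes one 4096-byte page per buffer. All model_* names and struct layouts are hypothetical.

```c
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096
#define MODEL_MAX_FRAGS 17

struct frag_model {
	uint32_t offset; /* start of data within the page */
	uint32_t len;    /* bytes of data */
};

struct mb_model {
	uint8_t nr_frags;
	struct frag_model frags[MODEL_MAX_FRAGS];
};

static int model_get_frag_count(const struct mb_model *mb)
{
	return mb->nr_frags;
}

static int model_get_frag(const struct mb_model *mb, uint32_t index,
			  uint32_t *offset, uint32_t *len)
{
	if (index >= mb->nr_frags)
		return -1;
	*offset = mb->frags[index].offset;
	*len = mb->frags[index].len;
	return 0;
}

/* delta < 0 pushes into the headroom, delta > 0 pops bytes from the front. */
static int model_adjust_frag_offset(struct mb_model *mb, uint32_t index,
				    int32_t delta)
{
	int64_t new_off, new_len;

	if (index >= mb->nr_frags)
		return -1;
	new_off = (int64_t)mb->frags[index].offset + delta;
	new_len = (int64_t)mb->frags[index].len - delta;
	if (new_off < 0 || new_len < 0)
		return -1;
	mb->frags[index].offset = (uint32_t)new_off;
	mb->frags[index].len = (uint32_t)new_len;
	return 0;
}

/* delta > 0 grows into the tailroom, delta < 0 trims bytes from the end. */
static int model_adjust_frag_len(struct mb_model *mb, uint32_t index,
				 int32_t delta)
{
	int64_t new_len;

	if (index >= mb->nr_frags)
		return -1;
	new_len = (int64_t)mb->frags[index].len + delta;
	if (new_len < 0 ||
	    (int64_t)mb->frags[index].offset + new_len > MODEL_PAGE_SIZE)
		return -1;
	mb->frags[index].len = (uint32_t)new_len;
	return 0;
}
```

Even in this toy form, the BPF-prog has to reason about per-fragment offsets and page boundaries, which is exactly the internal detail the critique above worries about freezing into UAPI.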

Helper-proposal#2 for XDP multi-buffer bpf helpers

Assuming "XDP only access first-buffer" (Proposal#1) is selected, then the BPF-program cannot access the remaining buffers/segments. The BPF-programmer still needs some knowledge about these remaining buffers/segments. The question is what makes sense to expose and why (the use-case).

Technically this can be either exported/exposed via 1) the ctx object "xdp_md", or 2) via BPF-helper calls.

Helper proposal-A: Allow moving data between first and second buffer. The use-case is for packet-header split (see Google's presentation). The NIC hardware might have split the packet wrong, either placing some packet headers in the second buffer, or placing some payload-data in the first buffer. Adding this helper will allow the BPF-programmer to fix up the packet-header split done by hardware. When the BPF-program parses the packet headers, the program logic can detect the wrong header split, and use this helper to adjust.
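
A user-space sketch of what such a fix-up helper would have to do internally is shown below. It is only a model: struct seg_model, model_pull_to_first() and model_push_to_second() are hypothetical names, and a real helper would operate on the driver's buffers rather than on plain arrays.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical segment: 'len' data bytes at base + off, inside 'cap' bytes. */
struct seg_model {
	uint8_t *base;
	size_t cap;
	size_t off;
	size_t len;
};

/* Headers landed in the second buffer: move n bytes to the tail of the first. */
static int model_pull_to_first(struct seg_model *first,
			       struct seg_model *second, size_t n)
{
	if (n > second->len || first->off + first->len + n > first->cap)
		return -1; /* not enough data, or no tailroom in first buffer */
	memcpy(first->base + first->off + first->len,
	       second->base + second->off, n);
	first->len += n;
	second->off += n;
	second->len -= n;
	return 0;
}

/* Payload landed in the first buffer: move its last n bytes to the second. */
static int model_push_to_second(struct seg_model *first,
				struct seg_model *second, size_t n)
{
	if (n > first->len || n > second->off)
		return -1; /* not enough data, or no headroom in second buffer */
	second->off -= n;
	memcpy(second->base + second->off,
	       first->base + first->off + first->len - n, n);
	second->len += n;
	first->len -= n;
	return 0;
}
```

Both directions depend on spare room (tailroom in the first buffer, headroom in the second), so whether this helper can work at all depends on how the driver lays out its buffers.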

In the case where you need to add something to the first fragment (0) of a packet, let's say add a tunnel header, there might not be enough room as the hardware filled the entire first page. We need some additional helpers, some examples:

bpf_xdp_fragment_split(struct xdp_buff *xdp_md, u8 fragment, u32 split_at);
bpf_xdp_fragment_add(struct xdp_buff *xdp_md, u8 add_frag_after_frag);
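
As an illustration of the split operation, the user-space model below turns one fragment entry into two entries that reference the same page. The struct layouts and model_fragment_split() are hypothetical; in particular, a real implementation would also have to take an extra page reference for the second entry, which is not modelled here.

```c
#include <stdint.h>
#include <string.h>

#define MODEL_MAX_FRAGS 17

struct frag_model {
	void *page;      /* stand-in for struct page * */
	uint32_t offset;
	uint32_t len;
};

struct mb_model {
	uint8_t nr_frags;
	struct frag_model frags[MODEL_MAX_FRAGS];
};

/*
 * Split 'fragment' so that its first split_at bytes stay in place and the
 * rest becomes a new entry directly after it. Returns 0 on success, -1 if
 * the index is invalid, the array is full, or the split point is not
 * strictly inside the fragment.
 */
static int model_fragment_split(struct mb_model *mb, uint8_t fragment,
				uint32_t split_at)
{
	struct frag_model *f;

	if (fragment >= mb->nr_frags || mb->nr_frags >= MODEL_MAX_FRAGS)
		return -1;
	f = &mb->frags[fragment];
	if (split_at == 0 || split_at >= f->len)
		return -1;
	/* Shift the following entries up by one to make room. */
	memmove(f + 2, f + 1,
		(size_t)(mb->nr_frags - fragment - 1) * sizeof(*f));
	f[1].page = f->page;
	f[1].offset = f->offset + split_at;
	f[1].len = f->len - split_at;
	f->len = split_at;
	mb->nr_frags++;
	return 0;
}
```

After splitting fragment 0 right behind the headers, a new fragment with room for e.g. a tunnel header could then be inserted in front of or between the two halves, which is presumably the job of the second helper.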

Discuss: What info does BPF-programmer need about buffers/segments?

Info: multi-buffer packet?

BPF-programmer should have a way to tell if an xdp_buff contains more segments/buffers. The current proposal (upstream) is to mark this via a single bit in xdp_buff->mb (and xdp_frame->mb), but this is an internal representation.

How do we expose this to the BPF-program?


  • Give BPF-prog info via ctx xdp_md.
  • Add BPF-helper to extract this information.

Sameeh has proposed the BPF-helper: bpf_xdp_get_frag_count()

Info: Use-case counting bytes (accounting)

The XDP BPF-programmer will often need to do basic traffic accounting of packets and bytes (as XDP hooks don't provide this). Thus, we need to export/expose at least byte counters from buffers/segments.
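
The accounting use-case boils down to summing the first-buffer length and the lengths of all further segments. A user-space sketch, with hypothetical struct and function names and an assumed limit of 17 fragments:

```c
#include <stdint.h>

#define MODEL_MAX_FRAGS 17

/* Hypothetical per-packet view: only lengths matter for accounting. */
struct mb_pkt_model {
	uint32_t first_len;                 /* bytes in the first buffer */
	uint8_t nr_frags;                   /* number of additional segments */
	uint32_t frag_len[MODEL_MAX_FRAGS]; /* bytes in each additional segment */
};

struct counters_model {
	uint64_t packets;
	uint64_t bytes;
};

/* Full packet length: first buffer plus all additional segments. */
static uint32_t model_total_len(const struct mb_pkt_model *p)
{
	uint32_t total = p->first_len;
	uint8_t i;

	for (i = 0; i < p->nr_frags && i < MODEL_MAX_FRAGS; i++)
		total += p->frag_len[i];
	return total;
}

/* Update packet and byte counters for one received packet. */
static void model_account(struct counters_model *c,
			  const struct mb_pkt_model *p)
{
	c->packets++;
	c->bytes += model_total_len(p);
}
```

A BPF-prog that only sees data_end - data would count 3520 bytes for a 9000-byte jumbo-frame in this model, so byte counters would silently be wrong without some exported total.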

Sameeh has proposed:

Date: 2021-09-20 Mon 18:33