Brainstorm notes

Table of Contents

This doc is for brainstorming. Watch out for likely duplicates, clean them up progressively.

Turn these notes into real issue or bug descriptions and move them into other docs like xdp_issues.html

Various brainstorm ideas

Crazy-idea: XDP-redirect re-handle about-to-be-dropped packets

Ref slides:

Idea slide 31: On XDP-redirect packets can be dropped, after the XDP prog have run, and today we have a 'drop-tracepoint' to handle/debug this. It would be more interesting to call a new XDP-type program that get the about-to-be-dropped packet, and allow it to take a new decision. Wondering what this can be used for, click the youtube link in the slide, and blow-your-mind! (

Tracking and troubleshooting eBPF

Want to track when BPF progs gets loaded/unloaded

  • Both for system audit and troubleshooting purposes

BPF code do have LSM/SElinux hooks

  • LSM maintainers request, use Audit subsystem instead
  • Jiri Olsa currently working on this

Usability: XDP redirect not impl by driver gives WARN_ON

E.g. AF_XDP via xdpsock uses redirect, and there is not feedback when running prog on driver without redirect-support. Need to look in dmesg to see big splast.

Usability: HW offload on non-supported driver/firmware return EINVAL

E.g. Loading XDP/ebpf prog with HW offload on Netronome NIC, but wrong firmware, just says EINVAL. No help or hints that firmware needs to be upgraded.

How can we improve the usability / user-experience here?

Same EINVAL when loading on a NIC driver other than Netronome. Can we instead return ENOTSUP ("Operation not supported"), to differentiate the cases?

Extend xdp_buff with member data_hard_end

Issue arose when driver ixgbe and i40e choose to violate one-page-per-packet. After this point, the frame-tail size became variable. This cause issues when creating an SKB from an xdp_frame, see the workarounds in cpumap and veth, where the truesize is wrong and skb_shared_info placement is variable.

This can be solved by introducing a new xdp_buff member named data_hard_end, which specify the maximum address data_end can be extended to.

This will allow:

  1. BPF helper (bpf_xdp_adjust_tail) to extend "tail" (not just shrink as today). This allows for use-cases (1) do fast DNS replies via XDP_TX, (2) NDIV (haproxy) also have some use-case that use tail-extend.
  2. Create better SKBs based on an xdp_frame. (1) Allow placing skb_shared_info section consistently. (2) Use correct truesize. (3) Allow for TCP coalescing these kind of SKBs.

DMA mapping xdp_frame

Keeping DMA mapping in XDP return API.

samples/bpf upstream napi_monitor

Missing an ifindex to match on in tracepoint

Napatech have descriptors in-front of packet payload

Proprietary commercial companies like NapaTech, which also maps packets into userspace, deliver dynamic descriptor info in-front of the packet, like our data_meta area.

As far as I can see, via their public docs[4], they have 4 different dynamic descriptor formats. With BTF and metadata we should have more flexibility than them :-)

It is a bit interesting to look at what they expose. I recommend looking/clicking at the header-file[5][6][7][8] as it shows they use a lot of C-bit-fields to compress the size. Do BTF support C-bit fields?

[4] [5] pktdescr_dyn1.h [6] pktdescr_dyn2.h [7] pktdescr_dyn3.h [8] pktdescr_dyn4.h

I find it interesting to see that in (dyn1+2), default decode offsets into the packet of L3 and L4 (and L4 payload), but allows them to be programmable. Also notice the "color" members, which are programmable, and sometimes are use as a 64-bit unique correlation key[8] (e.g. identifying flows).

Maybe I should have looked at their standard format before the dynamic(?) Their pktdescr_std0.h [9] is placed in front of all packets being received by the adapter when the adapter is operating in STANDARD or EXTENDED mode. The Extended descriptors are placed after pktdescr_std0.h [9], which contains a lot of info on the types in different layers, see[10][11][12].

[9] pktdescr_std0.h [10] pktdescr_ext7.h [11] pktdescr_ext8.h [12] pktdescr_ext9.h

Old list: watch out for dublicates

Old list of stuff I need to work-on/fix for XDP/bpf project:

  • XDP return frame API (needed by AF_XDP ZC)
  • Bulking API for return frame API
  • Bulking API for ndo_xdp_xmit
  • Address massive XDP regression due to CONFIG_RETPOLINE
  • Introduce bulking for generic-XDP (PoC test show +30% perf!!!)
  • Implement ndo_xdp_xmit for macvlan (fast guest delivery)
  • Improve BPF doc
  • Improve XDP doc
  • Work on XDP article with Toke+Alexei+Daniel
  • Better integration of XDP in Suricata (multiple small thing)
  • Find XDP feature/capability API (use in Suricata)
  • Help integrate AF_XDP in Suricata
  • Ship bpftool in distros (start with static linked libbpf)
  • Make libbpf a shared lib in distros (fix lib versioning)
  • Help (Ahern) get XDP route/FIB lookup helper integrated
  • Work on bridge FIB table lookup helper
  • XDP get more info transferred to CPUMAP skb creation time
  • cpumap: RX hash support
  • cpumap: HW csum offload/info
  • Improve XDP cpumap redirect example: flow hashing (fix NIC HW hash)
  • Help get AF_XDP API and performance aligned
  • Help get AF_XDP zero-copy integrated via XDP return API
  • convert tracepoint to use ifindex instead of names (strcpy overhead)
  • XDP_REDIRECT: Detect buggy-drivers forgetting to clear per-CPU map
  • Streamline eBPF map-create return codes on errors
  • Upstream xdp_bench01 sample to be standard way to measure XDP perf
  • Fix that TCP traffic with XDP generic on virtual net_devices are broken
  • xdp: avoid leaking info stored in xdp_frame data on page reuse
  • XDP_REDIRECT implemement in every driver
  • XDP data-meta implemement in every driver
  • Improve samples/bpf: XDP progs should take ifconfig/net_device names
  • Improve samples/bpf: Avoid including ./arch/x86/include/asm/cpufeature.h

Date: 2021-09-20 Mon 18:33