Top-level XDP project management

Current high-priority tasks @short
Important medium-term tasks @medium
Longer-term preliminary plans @long
Site navigation

This file contains the high-level XDP tasks. The items in this file are divided into three categories, reflecting the urgency and time frame we envision them being completed; although this is somewhat of a soft categorisation, as the development is also dependent on the priorities of the people and organisations involved. These categories are:

Current high-priority tasks
Important medium-term tasks
Longer-term preliminary plans

Tasks are filed under the headings corresponding to these categories below, and each category contains a description of what it (is supposed to) include.

Current high-priority tasks @short

These are tasks that are blockers for getting the XDP system ready for general consumption, i.e., to a state that might reasonably be considered "feature complete". This means that tasks in this category include essential pieces of missing functionality, or significant user interface issues that need to be resolved for wider adoption.

Tasks in this category should have a fleshed-out problem description, and preferably a plan and someone working on them.

TODO Consistency for statistics with XDP

The short of it is that we need consistency in the counters across NIC drivers and virtual devices. Right now stats are specific to a driver with no clear accounting for the packets and bytes handled in XDP.

Progress: @dsahern (David Ahern) have started an email thread on the subject: consistency for statistics with XDP mode

TODO Missing update of ifconfig counters

Some drivers are not updating the "ifconfig" stats counters, when in XDP mode. This makes receive or send via XDP invisible to sysadm/management tools. This for-sure is going to cause confusion.

Closer look at other drivers.

ixgbe driver is doing the right thing.
i40e had a bug, where RX/TX stats are swapped (fixed in commit cdec2141c24e(v4.20-rc1) ("i40e: report correct statistics when XDP is enabled")).
mlx5 driver is not updating the regular RX/TX counters, but A LOT of other ethtool stats counters (which are the ones I usually monitor when testing).

NEXT Figure out the counter semantics upstream

Need to have an upstream discussion, on what is the semantic. IHMO the regular RX/TX counters must be updated even for XDP frames.

TODO Statistics per XDP-action

Accounting per XDP-action is also inconsistent across drivers. Some driver account nothing, while others have elaborate counters exposed as ethtool stats.

The common pattern is that XDP programs do their own accounting of action as they see fit, and export this as BPF maps to their associated userspace application, which a very per application specific approach.

David Ahern (@dsahern) argues that sysadm's need something more generic and consistent, as they need a way to diagnose the system, regardless of the XDP program that some developer asked to get installed. This argument that a 3rd person should be able to diagnose a running XDP program was generally accepted upstream.

There were a lot of performance concerns around introducing XDP-action counters. Generally upstream voices wanted this to be opt-in, e.g. something that the sysadm explicitly enable when needed, and not default on.

NEXT Implement simple XDP-actions counter and measure

As Saeed suggested do something really simple and central like:

+++ b/include/linux/filter.h
@@ -651,7 +651,9 @@ static __always_inline u32 bpf_prog_run_xdp(const
struct bpf_prog *prog,
         * already takes rcu_read_lock() when fetching the program, so
         * it's not necessary here anymore.
         */
-       return BPF_PROG_RUN(prog, xdp);
+       u32 ret = BPF_PROG_RUN(prog, xdp);
+       xdp->xdp_rxq_info.stats[ret]++
+       return ret;
 }

WARNING: Realised above code is subject to speculative execution side-channel leaking.

And measure if the performance concerns are real or not. Ilias or Jesper can measure this on a slow ARM64 platform (espressobin), as only testing this on Intel might lead to the wrong conclusion.

TODO Exporting XDP-actions counter to userspace

Collecting XDP-actions counter is only the first step. We need to figure out the best solution for exporting this to userspace.

One option is to piggyback on ethtool stats, which the drivers already use, and it would also standardise the driver related XDP ethool stats.

TODO Better ndo_xdp_xmit resource management

Stalled on Intel/Magnus NIC queue management interface.

Driver resources needed to handle a ndo_xdp_xmit() is currently tied to the driver having loaded an RX XDP program. This is strange, as allocating these Driver TX HW resources is independent.

This can quickly lead to exhausting HW resources, like IRQs lines or NIC TX HW queues, given it is assumed a TX queue is alloc/dedicated for each CPU core.

NEXT Change non-map xdp_redirect helper to use a hidden map

To be able to tie resource allocation to the interface maps (devmap), we first need to change the non-map redirect variant so it uses a map under the hood. Since xdp_redirect_map() is also significantly faster than the non-map variant, this change should be a win in itself.

v1 comments and discussion: Patch 1 Patch 2

v3 patchwork link

TODO Ethtool interface for enabling TX resources

Turns out the initial idea of using insertion into devmap as a trigger for resource allocation doesn't work because of generic XDP. So we'll need an ethtool interface; look into the existing channel configuration interface on the kernel side and figure out how to express XDP resource allocation in a good way.

Interface for defining what a TX resource is

Need to define:

Number of TX queues
Algorithm for picking one on a given redirect (e.g., hash on CPU)
Queueing behaviour

How to handle life cycle management?

TODO Add automatic TX resource allocation to libbpf

Because we can't tie resource allocation to map insertion on the kernel side, we need to solve the UI interface in userspace. So add a hook/wrapper to libbpf that will automatically allocate TX resources when inserting into a map.

TODO XDP feature flags

We are probably going to need feature flags for XDP after all. There are use cases (e.g. Surricata, VM migration) that will want to know what to expect from the system before committing to loading an XDP program.

DONE Submit LPC network track talk about: XDP feature detection

CLOSED: [2019-07-31 Wed 17:10]

Title:

Improving the XDP User eXperience: via feature detection

Abstract:

The most common asked question is: Does my NIC support XDP, and our current answer is read the source code. We really need to come up with a better answer.

The real issue is that users can attach an XDP bpf_prog to a drivers that use features the driver doesn't implement, which cause silent drops. Or user doesn't notice, that NIC loading fallback to generic-XDP, which is first discovered when observing lower performance, or worse not all features are supported with generic-XDP, resulting in unexpected packet drops.

BPF feature detection, recently added to bpftool, is based on probing the BPF-core by loading BPF-programs using individual features (notice BPF load time, not attaching it).

Even if your BPF loader doesn't use feature probing, it will notice if loaded on a incompatible kernel. As an BPF-prog using something the kernel BPF-core doesn't support will get rejected at load-time, before you attach the BPF-prog.

This doesn't work for XDP, as features vary on a per driver basis. Currently an XDP BPF-prog isn't aware of that driver it will get used on, until driver attach-time. Unfortunately, due to BPF tail-calls, we cannot use the driver attach-time hook to check for compatibility (given new XDP BPF-progs can be indirectly "attached" via tail-call map inserts).

In this talk, we will investigate the possibilities of doing XDP feature check at BPF load-time, by assigning an ifindex to the BPF-prog. The ground work have already been laid by XDP hardware offload, which already need ifindex at BPF load-time (to perform BPF byte-code translation into NIC compatible code).

The open question are:

Can the verifier detect/deduce XDP feature in use, for us?

How does drivers express/expose XDP features?

Are features more than XDP return codes, like meta-data support?

How does this interact with generic-XDP?

How to expose this to userspace? (to answer does NIC support XDP)

How to handle tail-call map inserts?

NEXT Propose a driver API to communicate feature flags

Daniel: Needs to go through driver BPF ndo

Needs to be an API that queries support. We cannot validate on program load time because of tail call. Not even with cooperation from the tail-calling program, because that may not know what features are used by the programs it is tail-calling into.

Notes: implementation plan

On load: verifier populates bpf_prog->feature_bits
On load, if ifindex: reject if prog->features ^ iface->features
On attach: reject if prog->features ^ iface->features
Tail call maps:
- On insert: if (map->feature_lock) {reject_if(prog->features ^ map->feature_lock)} else Map->features |= prog->features
- On prog load, if ifindex: reject if map->features ^ ifindex->features, else progs->maps[]->feature_lock = ifindex->features progs->features |= ifindex->features

TODO Add a userspace API to query features

Netlink? Ethtool?

DONE Multiple XDP programs on a single interface

CLOSED: [2020-12-15 Tue 13:18]

Being able to load multiple programs on the same XDP interface is an important use case if XDP is to see more widespread deployment (as otherwise applications will step on each others' toes when trying to accelerate using XDP).

The upstream solution for this is known as "dynamic re-linking", which allows BPF global functions to be replaced by other programs. The last piece of the kernel functionality for this was posted upstream on January 18th: https://lore.kernel.org/bpf/20200118000657.2135859-1-ast@kernel.org/T/#t

DONE Build an XDP chain loader on top of the kernel support

CLOSED: [2020-06-02 Tue 18:45]

This is implemented as part of libxdp in the XDP tools repo. The current implementation works for loading a dispatcher with multiple programs, but only if they are all attached at the same time (until kernel support for re-attaching freplace programs land).

DONE Implement freplace re-attachment

CLOSED: [2020-12-15 Tue 13:17]

To support seamlessly attaching additional XDP programs onto an interface, we need the kernel to support re-attaching an existing freplace program in multiple places, so the programs can be attached to a new dispatcher which can then be atomically swapped onto the interface XDP hook.

This is now supported in the 5.10 kernel, with the userspace support available in libxdp and xdp-tools.

Important medium-term tasks @medium

These are tasks that are important to fix in the medium term, but that are not immediate blockers for functionality. This includes things like expanding driver support, and adding new features that improve things, but which are not essential for the basic usefulness of XDP.

Tasks in this category should at a minimum have a fleshed-out problem description.

TODO Usability of programs in samples/bpf

The samples/bpf programs xdp_redirect + xdp_redirect_map are very user unfriendly. #1 they use raw ifindex'es as input + output. #2 the pkt/s number count RX packets, not TX'ed packets which can be dropped silently. Red Hat QA, got very confused by #2.

DONE Change sample programs to accept ifnames as well as indexes

CLOSED: [2019-06-25 Tue 13:32]

Fixed by https://patchwork.ozlabs.org/patch/1121683/

NEXT Add TX counters to redirect samples/bpf programs

Simply include/sample the net_device TX stats.

DONE Fix unloading wrong XDP on xdp-sample exit

CLOSED: [2020-01-24 Fri 10:50]

The XDP sample programs unconditionally unload the current running XDP program (via -1) on exit. If users are not careful with the order in-which they start and stop XDP programs, then they get confused.

This was fixed upstream by by Maciej Fijalkowski.

TODO Change XDP-samples to enforce native-XDP and report if not avail

The default behaviour when attaching an XDP program on a driver that doesn't have native-XDP is to fallback to generic-XDP, without notifying the user of the end-state.

This behaviour is also used by xdp-samples, which unfortunately have lead end-users to falsely think a given driver supports native-XDP. (QA are using these xdp-samples and create cases due to this confusion).

Proposal is to change xdp-samples to enforce native-XDP, and report if this was not possible, together with help text that display cmdline option for enabling generic-XDP/SKB-mode.

TODO Add xdpsock option to allow XDP_PASS for AF_XDP zero-copy mode

In AF_XDP zero-copy mode, sending frame to the network stack via XDP_PASS results in an expense code path, e.g new page_alloc for copy of payload and SKB alloc. We need this test how slow this code path is.

Also consider testing XDP-level redirect out another net_device with AF_XDP-ZC enabled. (I think this will just drop the packets due to mem_type).

TODO xdp_monitor: record and show errno

It would be a big help diagnosing XDP issues if the xdp_monitor program also reported the errno.

TODO xdp_monitor: convert to use raw-tracepoints

The raw-tracepoints are suppose to be much faster, and XDP monitor want to have as little impact on the system as possible. Thus, convert to use raw-tracepoints.

TODO BPF-selftests - top-level TODOs

The kernel git-tree contains a lot of selftests for BPF located in: tools/testing/selftests/bpf/.

XDP (and its performance gain) is tied closely to NIC driver code, which makes it hard to implement selftests for (including benchmark selftests). Still we should have a goal of doing functional testing of the XDP core-code components (via selftests).

Since driver veth got native-XDP support, we have an opportunity for writing selftests that cover both generic-XDP and native-XDP.

DONE bpf-selftest: improve XDP VLAN selftests

CLOSED: [2020-01-24 Fri 10:55]

Assignment is to improve the selftest shell-script to test both generic-XDP and native-XDP (for veth driver).

In-progress:

Patchset V1

XDP add/remove VLAN headers have a selftest in tools/testing/selftests/bpf/ in files test_xdp_vlan.c and test_xdp_vlan.sh. This test was developed in conjunction with fixing a bug in generic-XDP (see kernel commit 297249569932 ("net: fix generic XDP to handle if eth header was mangled")).

Since driver veth got native-XDP support, the selftest no-longer tests generic-XDP code path.

The ip utility (from iproute2) already support specifying, that an XDP prog must use generic XDP when loading an XDP prog (option xdpgeneric).

TODO bpf-selftest: find XDP-selftests affected by veth native-XDP

When driver veth got native-XDP support, then the XDP-selftests that were based on veth changed from testing generic-XDP into testing native-XDP.

Assignments:

Determine how many and which veth based XDP-selftests are affected
Convert these selftests to test both generic-XDP and native-XDP

TODO Make more XDP tests using BPF_PROG_TEST_RUN

Tweet by Björn Töpel (@bjorntopel):

Many people aren't aware of the BPF_PROG_TEST_RUN command. It's really neat being able to test your XDP programs "offline". The selftests use this a lot. Docs: https://t.co/GDd7SfNYng and examples in tools/testing/selftests/bpf/.

NEXT Could this be a "introduction job"?

TODO Busy-poll support for AF_XDP

Adding BUSY_POLL support to AF_XDP sockets was presented at the Linux Plumbers Conference 2018 in Vancouver, BC. With this feature, the NAPI context of the driver is executed from the process context by the application calling the poll() syscall, instead of being driven by the softirq mechanism. This has a number of benefits, for example, being able to efficiently use a single core for application, driver and other kernel infra that the application might need. With softirq, we would need two cores to maintain any performance. Another benefit is that the cachelines containing the descriptors do not have to bounce between two caches, since this is now a core local operation as the driver and the application is on the same core. The drawback is that we now have to use a syscall (poll) in the data path and this will slow things down.

There is already a busy_poll mechanisms in the kernel: /proc/sys/net/core/busy_poll. When writing a non zero value in this file, the busy poll support will be enabled for ALL sockets in the system. There are a number of issues with this existing code when applied to AF_XDP sockets.

The batch size is hardcoded to 8, a value that is too small for the

fast processing of XDP.

The semantics of poll() in busy_poll mode is that if you provide more

than one file descriptor, it will drive the napi context of the first one supplied and if it has a packet, then it will NOT drive any of the other. In other words, it will quit once it has found an fd with a packet. This will not work for us, since we need all fd's napis to be called since it is very likely that a packet will be found in each of them. One could argue that this can be solved in user-space by manipulating the array of fds supplied to poll() before every singel call, but this would really complicate multi socket handling in user-space.

The option is global across all sockets. Enough said.

My suggestion for addressing these issues is to introduce a new busy_poll option that is only for AF_XDP called XDP_BUSY_POLL_BATCH_SIZE (or something like it). This is a setsockopt that can be supplied to individual AF_XDP sockets and the batch size can thus also be set individually by suppling a value > 0. The semantics of this mode is that both Rx and Tx have to be driven by calling poll(). There is no guarantee that your packets will arrive or be sent unless you call poll() (a sendto() will still work for the Tx case, though, but it is not necessary). In this first patch set, we can still get interrupts and processing from NAPI in this mode, but we have some ideas on how to disable this so that NAPI is only driven from poll(). But that is for a later patch set. Note that the sematics would not change when we introduce this as we already today say that you must call poll(), since there is no guarantee otherwise that you will receive or send packets.

When suppling multiple busy_poll AF_XDP sockets to poll() all of them will get theire napi contexts executed, so it is guaranteed that all of them will be driven. It is also possible to mix regular sockets, global busy_poll sockets and the new AF_XDP sockets in the same poll() call. The semantics for each type will be maintained, as expected.

From an implementation point of view, I believe this can be implemented with minimal changes to the net and fs code. We can get this new behavior by using the standar fd (non-busy poll path) and then drive the napi from the xsk specific poll callback. We do need to change one internal interface in order to be able to have a variable batch size. And Jesper's xdp_rxq_info struct need to be enlarged with a napi_id field that the drivers need to populate. This can then be used by the xsk poll code to drive the correct NAPI.

TODO Exposing more kernel data structures through helpers

One of the strengths of XDP over kernel bypass solutions is the ability to re-use existing in-kernel data structures, such as the routing table. This happens through kernel helpers. We already have routing, but we will need more helpers.

NEXT Layer-2 bridging helper

The obvious next step is a l2 bridging helper that mirrors the l3 routing one. Should be fairly straight forward to implement.

RFC series posted for this.

TODO What to do about broadcast/multicast?

See Handling multicast below.

TODO Connection/flow tracking

Exposing either full conntrack, or the more light-weight flow tracking support would make things like stateful firewalls easier to implement. Probably need a concrete use case for this first, though.

DONE Port iproute2 to libbpf

CLOSED: [2020-12-15 Tue 13:18]

The iproute2 XDP loader does not use libbpf, because its implementation predates the introduction of libbpf as the "official" upstream-blessed library. This means that there are compatibility issues between the way iproute2 handles maps, and the way libbpf-based loaders do.

The easiest way to fix this is probably just to port iproute2 to use libbpf, which has now been merged into upstream iproute2.

TODO Metadata available to programs

The metadata available to XDP programs through the XDP context object could be expanded with other useful entries. This section collects lists of which items would be useful to have, and explains why for each of them.

DONE XDP frame length

CLOSED: [2020-10-19 Mon 12:55]

DONE: Merge commit: https://git.kernel.org/torvalds/c/5cc5924d8315

The length of the XDP frame (as opposed to the data packet) is needed for various things:

Tail-extend (e.g., DNS); currently packets can only be shrunk at the tail
Correctly reporting skb true-size when generated from XDP frame
For moving skb allocation out of drivers (long-term)

TODO Metadata from hardware

There are various hardware metadata items that would be useful for XDP programs to access, to reduce the amount of processing that needs to happen in eBPF. These include:

Checksum
Hash value
Flow designator
Higher-level protocol header offsets
Timestamps

TODO Needs BTF-based metadata

To express this in a vendor-neutral way, we probably need to depend on the BTF-based metadata for XDP

TODO Improvements to XDP_REDIRECT

TODO Handling multicast

We want to be able to REDIRECT a packet to multiple interfaces in order to implement multicast. Relevant for the Layer-2 bridging helper.

From the discussion at netconf2019 (), we probably want to implement it by:

Using a map-in-map (or map flag) to designate a group of egress interfaces (put multiple ifaces into a map and send to all of them in one operation).
Just copy the packet rather than try to do clever refcnt stuff for multi-xmit.

As a first pass, we probably want to implement a simple multi-xmit, and leave off the possibility for the eBPF program to modify the packet between destinations.

TODO Queueing and QoS

There is currently no queueing in XDP_REDIRECT, apart from the bulking queue used in the devmaps. This means that rate transitions can't be handled properly (e.g., redirect 100Gbps->10Gbps).

To fix this, we would need to allow queueing of some sort. Options:

Allow user to define a queueing structure as part of the TXQ setup
Create the low-level hooks necessary for eBPF-programmable queueing
Have some kind of "intermediate queueing structure" that can be a redirect target, does queueing, and can forward packets on afterwards

TODO Handling XDP_REDIRECT failures

Presently, an XDP program cannot know if a call to bpf_redirect_map() is going to fail. This means that programs end up doing things like using duplicate maps for checking if a redirect map entry exists, or to packets being silently dropped.

DONE Allow lookups in devmap/cpumap

CLOSED: [2019-06-29 Sat 18:24]

Return pointer to same ID as we insert, just copy it to a scratch space first.

Merged as 0cdbb4b09a0.

DONE Make bpf_redirect_map() check map contents

CLOSED: [2019-06-29 Sat 18:25]

If we simply make the redirect_map() helper check that something exists in that position in the map, we at least catch the case where a map entry is empty (where we know it is going to fail). If we then later add the capability for a map to ensure that TX resources are available on insert, we will get this check for free.

What is the right return code for the helper on failure? We already return XDP_ABORTED on invalid flags, so we could create a new REDIRECT_PASS_ON_INVALID that make non-existent map entries return XDP_PASS. This makes it easy to implement "defer to stack on error" type programs, and it makes it possible to disambiguate between "flag not understood" and "flag understood and map is empty".

Mellanox drivers turn a failure in xdp_do_redirect() into an ABORTED action, other drivers do not.

Merged as 43e74c0267a3.

NEXT In samples/bpf: use that bpf_redirect_map() can check map content

Toke's patchset for allowing reading redirect maps were accepted upstream, and also the fallback return codes if map index were unpopulated. BUT the samples/bpf programs does not take advantage of this. Task is to update samples/bpf to demonstrate these new features.

CANCELLED Retire bpf_redirect() helper CANCELLED

CLOSED: [2020-01-16 Thu 15:49]

Rather than extending the bpf_redirect() helper to use hidden maps, it is probably better to document it as deprecated and point users to use the bpf_redirect_map() helper.

TODO What happens if redirect fails even though TX resources are available?

Even with the above, we can still get failures if, e.g., TX ring runs out of space. We can't really handle this by return, so what should we do instead? Maybe just defer this to XDP hook at TX.

TODO Add selftests for devmaps

[2019-08-09 Fri 20:51]

TODO REDIRECT_MAP idx type - what happens with namespace moving?

[2019-05-13 Mon 14:03]

TODO Multi-buffer XDP (aka. jumbo-frames)

As described in the design document areas/core/xdp-multi-buffer01-design.html we need to come up with a serious attempt at supporting multi-buffer XDP handling. The term multi-buffer covers the uses cases: jumbo-frames, TSO/LRO, packet header split.

Add TODO steps here to move this forward.

DONE Start upstream discussion on multi-buffer

CLOSED: [2019-07-03 Wed 13:15]

Link to upstream discussion: https://lore.kernel.org/netdev/20190626103829.5360ef2d@carbon/

TODO Create plan to move forward with initial constraints

As Willem points out: As long as we don't arrive at a design that cannot be extended with those features later.

TODO Find driver developers that will participate in multi-buffer work

We need some developers that have an actual use-case for this multi-buffer feature, and also buy-in from some driver team, else this sub-project will not see any progress. The Amazon guys seems to have a interest.

Longer-term preliminary plans @long

These are longer-term plans or ideas that either fall into the "nice to have" category, or which are blocked on other pieces of kernel development; as well as ideas for things we may include in the future but which is not a high priority at the moment.

Tasks in this category do not require any particular level of description; so some of them may simply have a sentence or two describing them.

Maybe write a tuning doc as well?

WAIT status as this is low priority for now.

A limitation of the current design of XDP is that programs get no feedback if a redirect to another device fails. Instead, the packet is just silently dropped, and the only way to see why is by attaching to the right tracepoint. This is especially problematic when forwarding packets from a fast device to a slower one. And the way XDP_REDIRECT is implemented, there is no way for the XDP program to gain insight into the state of the device being forwarded to.

We believe that a possible fix for this is to add another eBPF hook at packet egress from a device, i.e., at the latest possible time before a packet is put into the device TX ring. At this point, it is possible for the driver to supply information about the current state of the TX ring buffer (such as free space), which the eBPF program can react appropriately to, for example by signaling ingress XDP programs to send traffic another way if the TX ring is full, or by implementing AQM-like reactions when TX ring pressure increases.

A crazy idea is to allow this egress eBPF hook to perform a new XDP action if it sees the TX ring is full, such as redirecting the frame out another interface. Allowing the full XDP feature set of modifying and truncating packet length would also make is possible to implement a signaling protocol like that described in http://doi.acm.org/10.1145/3098822.3098825.

Top-level XDP project management

Table of Contents