Project management for CPUMAP

Important medium-term tasks @medium
- TODO Use-case cpumap: Solving qdisc TX lock congestion for BW shaping
Longer-term preliminary plans @long
- Allocate SKB outside driver code
- TODO Various cpumap features
Completed tasks
- DONE Implement CPUMAP redirect with connection hashing over CPUs
- DONE Feature: cpumap bulk alloc SKBs

This document contains org-mode tasks and TODOs for cpumap.

The BPF map type CPUMAP (BPF_MAP_TYPE_CPUMAP) is used by XDP as an XDP_REDIRECT target, via the BPF helper bpf_redirect_map(). This map type redirects raw XDP frames to another CPU. The remote CPU then does the SKB allocation and calls the normal network stack.

This is a scalability and isolation mechanism that allows separating the early driver-level XDP layer from the rest of the netstack, and assigning dedicated CPUs to this stage. This basically allows for 10G wirespeed pre-filtering via BPF, on the serving end-host running Linux.

Top-level headings indicate category, as explained in ../xdp-project.html.

Important medium-term tasks @medium

TODO Use-case cpumap: Solving qdisc TX lock congestion for BW shaping

This is a project that verifies and demonstrates how cpumap can be used for more complex use-cases like IP-forwarding and bandwidth shaping, while working in concert with other parts of the network stack, like qdisc and other eBPF hooks.

The use-case is a service provider (ISP or cloud provider) that wants to bandwidth rate-limit outgoing traffic for each service, e.g. via the HTB qdisc, but does not need global NIC rate-limiting. The services that need individual rate-limiting can vary widely, e.g. ISP end-user customers buying bandwidth, daemon services using network traffic, a container, etc. The requirements are that the service can be matched via a BPF hook or TC filter, and that the NIC link-rate is higher than the sold rate.

The problem: With a default HTB setup, such a service will not scale with the number of CPUs and NIC hardware RX+TX queues. The reason is that every transmitted packet has to be serialised through the qdisc root-lock on the HTB root qdisc. This causes many CPUs to contend on this lock.

This project demonstrates how this can be solved by using the MQ qdisc and attaching an HTB qdisc on each leaf of MQ, and via an eBPF prog making sure that the customer/service grouping is maintained and hits the same CPU+MQ TXq.
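As a rough illustration of the silo idea, the sketch below models in plain user-space C the mapping that an eBPF prog would need to keep stable: a customer always maps to the same CPU, and that CPU always selects the same MQ TX queue. The struct, the function name and the modulo-based assignment are hypothetical and not taken from the scripts or repos referenced in these notes. The queue_mapping is stored as CPU + 1, because the value zero signals "not recorded" (see the code analysis further down).

```c
#include <stdint.h>

/* Hypothetical per-customer silo: the CPU handling the customer and the
 * value to store in skb->queue_mapping for that CPU's MQ leaf. */
struct silo {
	uint32_t cpu;           /* CPU the customer is pinned to */
	uint16_t queue_mapping; /* value for skb->queue_mapping (1-indexed) */
};

/* Assumption: customers are spread over CPUs by a simple modulo, and
 * there is one TX queue (one MQ leaf with its own HTB) per CPU.
 * n_cpus must be non-zero. */
static struct silo silo_for_customer(uint32_t customer_id, uint32_t n_cpus)
{
	struct silo s;

	s.cpu = customer_id % n_cpus;
	s.queue_mapping = (uint16_t)(s.cpu + 1); /* 0 means "not recorded" */
	return s;
}
```

The point of the design is that the mapping is a pure function of the customer: as long as XDP (for the cpumap redirect) and the TC hook (for queue_mapping) agree on it, all packets of one customer are shaped by a single HTB instance under a single lock.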

DONE cpumap: Create script MQ-HTB silo setup

https://github.com/netoptimizer/network-testing/blob/master/tc/tc_mq_htb_setup.sh

DONE cpumap: Test cpumap redirect works with MQ-HTB silo setup

Yes, the MQ-HTB silo setup (example tc_mq_htb_setup.sh) works.

BUT watch out, as XPS (Transmit Packet Steering) will take precedence over any changes to skb->queue_mapping. You need to disable XPS via mask=00 in the files /sys/class/net/DEV/queues/tx-*/xps_cpus.
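For programs that want to do this without a shell, a minimal C sketch of the same sysfs write could look as follows. The function names are made up for this example, and the caller is assumed to know the device name and number of TX queues; writing the mask normally requires root.

```c
#include <stdio.h>

/* Build the sysfs path of the XPS CPU mask for one TX queue.
 * Returns the snprintf() result (needed length, or negative on error). */
static int xps_cpus_path(char *buf, size_t len, const char *dev,
			 unsigned int txq)
{
	return snprintf(buf, len, "/sys/class/net/%s/queues/tx-%u/xps_cpus",
			dev, txq);
}

/* Write a hex CPU mask (e.g. "00" to disable XPS) to the given file.
 * Returns 0 on success, -1 on failure (not root, no such queue, ...). */
static int write_mask(const char *path, const char *mask)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fputs(mask, f) == EOF) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}
```

Disabling XPS on a device then amounts to looping over its TX queue numbers, building each path with xps_cpus_path() and calling write_mask(path, "00").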

To help people out, here is a script for easier setup of XPS: xps_setup.sh. Run the script with --help.

DONE Allow BPF clsact write access to skb->queue_mapping

Summary: We chose NOT to create a BPF helper for skb_set_queue_mapping; instead we allow the BPF TC hook (clsact/cls_bpf) write access to skb->queue_mapping.

Fixed in BPF-NEXT commit: https://git.kernel.org/bpf/bpf-next/c/74e31ca850c1

When doing a CPUMAP redirect, the SKB that gets created has no skb->queue_mapping configured. This could be set by e.g. using the TC filter/action skbedit (see man tc-skbedit(8)), but we also want to allow this from TC cls_bpf. Reading queue_mapping from TC cls_bpf is already possible.

DONE Code analysis before BPF helper for skb_set_queue_mapping

Following the kernel code is a little tricky, so adding some notes here. Invoking skb_set_queue_mapping() directly sets skb->queue_mapping to the value provided.

static inline void skb_set_queue_mapping(struct sk_buff *skb, u16 queue_mapping)
{
        skb->queue_mapping = queue_mapping;
}

The confusing part is that the value zero is used for signalling that no queue number is "recorded". So, when helper functions record or extract the queue_mapping, they adjust with +/-1. The naming of the helper functions is also a bit confusing, as they are called skb_xxx_rx_queue(). One could say that the queue_mapping is 1-indexed when using skb_set_queue_mapping() to set it directly.

static inline void skb_record_rx_queue(struct sk_buff *skb, u16 rx_queue)
{
        skb->queue_mapping = rx_queue + 1;
}

static inline u16 skb_get_rx_queue(const struct sk_buff *skb)
{
        return skb->queue_mapping - 1;
}

static inline bool skb_rx_queue_recorded(const struct sk_buff *skb)
{
        return skb->queue_mapping != 0;
}

When skbedit records the queue_mapping, it also does a direct store of the value, without taking this adjustment into account (looked through kernel code net/sched/act_skbedit.c and iproute2 code tc/m_skbedit.c).

When the kernel extracts the skb->queue_mapping in skb_tx_hash(), the helpers skb_rx_queue_recorded() and skb_get_rx_queue() are used.

/*
 * Returns a Tx hash based on the given packet descriptor a Tx queues' number
 * to be used as a distribution range.
 */
static u16 skb_tx_hash(const struct net_device *dev,
                       const struct net_device *sb_dev,
                       struct sk_buff *skb)
{
        u32 hash;
        u16 qoffset = 0;
        u16 qcount = dev->real_num_tx_queues;

        if (dev->num_tc) {
                u8 tc = netdev_get_prio_tc_map(dev, skb->priority);

                qoffset = sb_dev->tc_to_txq[tc].offset;
                qcount = sb_dev->tc_to_txq[tc].count;
        }

        if (skb_rx_queue_recorded(skb)) { // Test if set here
                hash = skb_get_rx_queue(skb); // Extract here
                while (unlikely(hash >= qcount))
                        hash -= qcount;
                return hash + qoffset; // RETURN here
        }

        return (u16) reciprocal_scale(skb_get_hash(skb), qcount) + qoffset;
}
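To make the effect of the off-by-one concrete, here is a small user-space model of the queue_mapping branch above. It is only a sketch: model_tx_queue() is a made-up name, the types are plain stdint types instead of kernel ones, and the hash fallback is simplified to a modulo instead of reciprocal_scale().

```c
#include <stdint.h>

/* User-space model of the queue selection in skb_tx_hash().
 * queue_mapping is the raw value stored in skb->queue_mapping,
 * qcount/qoffset describe the TX queue range (qcount must be non-zero),
 * and flow_hash stands in for skb_get_hash(). */
static uint16_t model_tx_queue(uint16_t queue_mapping, uint16_t qcount,
			       uint16_t qoffset, uint32_t flow_hash)
{
	if (queue_mapping != 0) {               /* skb_rx_queue_recorded() */
		uint32_t q = queue_mapping - 1; /* skb_get_rx_queue() */

		while (q >= qcount)             /* wrap into the valid range */
			q -= qcount;
		return (uint16_t)(q + qoffset);
	}
	/* Assumption: plain modulo instead of reciprocal_scale() */
	return (uint16_t)(flow_hash % qcount + qoffset);
}
```

With 8 TX queues and no offset, a direct store of 3 selects TX queue 2, a store of 1 selects queue 0, and a store of 0 falls through to the flow hash.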

Details: How skb_tx_hash is connected to __netdev_pick_tx() can also be a little tricky to spot. Unless XPS, via get_xps_queue, overrides by finding a "new_index", skb_tx_hash will be asked to determine the queue_index. (And for IP-forwarding there is no sk socket.)

DONE TC-skbedit: possible issue with TC action skbedit

The man page for tc-skbedit(8) does not mention that the queue_mapping is 1-indexed, but my code analysis says so. This should be reported, perhaps after testing it first. This is true, but it might not be an issue for users of skbedit, because TC MQ-indexing classids also start at "1".

The biggest surprise for users is likely that XPS will overwrite their choice (and XPS seems to be enabled by default). Users must first disable XPS via mask=00, before their changes via skbedit take any effect.

DONE cpumap: create git project xdp-cpumap-tc

Create git project xdp-cpumap-tc, that demonstrates how XDP cpumap redirect can be used together with Linux TC (Traffic Control) for solving the Qdisc locking problem.

TODO xdp-cpumap-tc: add some documentation

TODO xdp-cpumap-tc: public release of code/repo

TODO cpumap: quantify effect of cpumap (currently) not transferring RX-csum info

Verify that the csum is recalculated on the IP-forward TX path for cpumap redirected packets. Try to assess/measure the overhead this causes.

Longer-term preliminary plans @long

Allocate SKB outside driver code

The revolutionary and controversial idea with cpumap redirect is to remove the SKB allocation inside the driver's RX-path. Instead an xdp_frame is created that contains enough info to allocate the SKB later outside the driver code, in cpumap's case on a remote CPU, but the idea is to take this even further.

The future plans are working towards creating Ethernet drivers that have no SKB allocations. Instead they will send these xdp_frame packets (in a bulk) to the network stack, which will handle creation of the SKBs and populate the appropriate SKB-fields. This work is currently stalled on finding a generic way to transfer the info needed for the SKB-fields, which is related to the XDP metadata and BTF working areas.

TODO Various cpumap features

This section contains various feature ideas related to cpumap.

TODO BUG: cpumap not working for generic-XDP

The cpumap does not handle redirect for generic-XDP. This has already confused people. For completeness this should be fixed, but there is also an opportunity for performance improvements.

TODO Feature: cpumap could use netif_receive_skb_list()

TODO Feature: cpumap implement GRO handling

TODO sample xdp_redirect_cpu should output interface-name and ifindex

This is a minor improvement. QA is giving output from xdp_redirect_cpu, but forgets to say what interface this is loaded on. Solve by simply including the interface in the output. Even better would be outputting the driver name for the corresponding ifindex.

DONE investigate sample xdp_redirect_cpu counters

While playing with xdp_redirect_cpu, the enqueue and kthread counters are not getting updated; something in the stats reporting looks wrong. Maybe the latest upstream changes to libbpf broke something?

Fixed in: f9e6bfdbaf0c ("samples/bpf: Xdp_redirect_cpu fix missing tracepoint attach")

TODO cpumap: implement dynamic load-balancer that is OoO safe

This is a project to test the limits of eBPF programming.

The kernel's existing facilities that cpumap is "competing" against, RSS (Receive Side Scaling) and RPS (Receive Packet Steering), have mechanisms that make sure that Out-of-Order (OoO) packets do not occur when the config changes at runtime. For cpumap this is left up to the eBPF programmer to handle, if dynamic config changes are needed.

The question is how difficult this is to implement in eBPF, and whether we need some helpers or extra state exposed to eBPF, e.g. in-flight-packets, for this to be possible to implement in eBPF.

Idea from slide 29 + 30: Implement in BPF-code an Out-of-Order safe way to use cpumap-redirect to dynamically load-balance IP-flows.
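One possible approach, sketched here in user-space C, is idle-gap (flowlet-style) reassignment: a flow is only moved to another CPU after it has been silent for some time. This is not necessarily what the slides propose. The table layout, names and the idle_gap_ns parameter are all hypothetical; in a BPF program the state would live in a BPF map and the timestamp would come from bpf_ktime_get_ns().

```c
#include <stdint.h>

#define FLOW_SLOTS 1024 /* hypothetical table size */

struct flow_state {
	uint32_t cpu;       /* CPU currently assigned to this flow slot */
	uint64_t last_seen; /* timestamp (ns) of the last packet */
	int      in_use;    /* slot has seen at least one packet */
};

static struct flow_state flows[FLOW_SLOTS];

/* Return the CPU for a packet of the flow with the given hash.
 * wanted_cpu is what the load-balancer would prefer right now.
 * The flow only moves to wanted_cpu if it is new or has been idle for
 * at least idle_gap_ns; otherwise it stays on its current CPU. */
static uint32_t pick_cpu_idle_gap(uint32_t flow_hash, uint32_t wanted_cpu,
				  uint64_t now_ns, uint64_t idle_gap_ns)
{
	struct flow_state *f = &flows[flow_hash % FLOW_SLOTS];

	if (!f->in_use || now_ns - f->last_seen >= idle_gap_ns) {
		f->cpu = wanted_cpu; /* new or idle flow: (re)assign */
		f->in_use = 1;
	}
	f->last_seen = now_ns;
	return f->cpu;
}
```

Note the limitation: this is only OoO safe if idle_gap_ns is longer than the worst-case time a packet can sit in the old CPU's queue. Knowing that bound is exactly the in-flight-packets state asked about above, which this sketch does not provide.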

Completed tasks

DONE Implement CPUMAP redirect with connection hashing over CPUs

The xdp_redirect_cpu code should by default distribute packets via IP-flow hashing to avoid creating Out-of-Order packets, if someone just runs this out of the box.

Upstream in commits:

c4c202175424 ("Merge branch 'bpf-sample-cpumap-lb'")
1bca4e6b1863 ("samples/bpf: xdp_redirect_cpu load balance like Suricata")
11395686586b ("samples/bpf: add Paul Hsieh's (LGPL 2.1) hash function SuperFastHash")
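The idea behind hashing on IP-flows can be sketched in a few lines of user-space C. This is an illustration only, not the code from the commits above: mix32() is a stand-in integer mixer (NOT SuperFastHash), and flow_to_cpu() is a hypothetical name. In an XDP program the resulting index would be used as the key when calling bpf_redirect_map() on the cpumap.

```c
#include <stdint.h>

/* Stand-in integer mixer (NOT SuperFastHash); any well-mixing
 * function would do for this illustration. */
static uint32_t mix32(uint32_t x)
{
	x ^= x >> 16;
	x *= 0x45d9f3bU;
	x ^= x >> 16;
	x *= 0x45d9f3bU;
	x ^= x >> 16;
	return x;
}

/* Pick a CPU index for an IPv4 flow (n_cpus must be non-zero).
 * Adding the two addresses makes the key independent of direction, so
 * both directions of a connection land on the same CPU. */
static uint32_t flow_to_cpu(uint32_t saddr, uint32_t daddr,
			    uint8_t proto, uint32_t n_cpus)
{
	uint32_t key = saddr + daddr; /* commutative: A+B == B+A */

	return mix32(key ^ proto) % n_cpus;
}
```

Because the CPU is a pure function of the flow, packets of one flow never race each other on different CPUs, which is what avoids Out-of-Order delivery in the static case.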

DONE Feature: cpumap bulk alloc SKBs

Upstream: 8f0504a97e1b ("bpf: cpumap do bulk allocation of SKBs")

It would be fairly trivial to bulk alloc SKBs via the kmem_cache_alloc_bulk() API, when dequeuing in the kthread (cpu_map_kthread_run). One point is that when dequeuing xdp_frames from the ptr_ring, a cacheline shared between two CPUs is touched, and we should reduce that time window as much as possible. Then there is of course also the performance improvement of bulk alloc.
