Evolving the page_pool API

The page_pool API was accepted into kernel 4.18 and is currently only used by one driver, mlx5. This document is about evolving the page_pool API.

Generally, kernel 4.18 introduced the ability for XDP to know what memory allocator type the driver's (per) RX-queue is using, via xdp_rxq_info and info in the xdp_frame. This opens up room for innovation in this area.

The page_pool is NOT finished

Let's be clear upfront: the page_pool API is not finished (as of this writing <2018-11-28 Wed> and kernel release v4.19).

It is not ready to be used by all drivers in its current state. The mlx5 driver is also not using the API optimally.

We need both to evolve the API and to make some fundamental design choices about the direction of the page_pool API.

Overall goals of page_pool

The overall goals of the page_pool are:

  1. Keep pages DMA-mapped (drivers only use DMA-sync where needed).
  2. Faster recycle API than the page-allocator (see the order-0 page benchmark below).
  3. Common allocator API for simplifying NIC drivers.
  4. Better integration with the network stack, not just XDP usage.

mlx5 today

Today mlx5 is not using the DMA-mapping part and is basically using page_pool as a fast alloc-side cache (see later why it is fast), and as a return point for XDP-redirect. Packets sent towards the netstack still use a refcnt based recycle scheme, implemented as a driver layer on top of page_pool.

History

The page_pool was adapted from Jesper's proposals at MM-summit 2016+2017:

  • http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf
  • http://people.netfilter.org/hawk/presentations/MM-summit2017/MM-summit2017-JesperBrouer.pdf

Jesper posted some RFC page_pool patches around that time. That code was adapted and is now part of the network stack, under net/core/page_pool.c and include/net/page_pool.h.

The original code was designed to be part of the kernel MM-layer. It hooked into the put_page() call, and also used parts of struct-page for storing info that allows returning the page back to the originating device (to allow keeping the page DMA-mapped for the device).

Basic idea:

  • Pages stay DMA-mapped as long as they are in the pool.
  • Drivers use the DMA-sync API to control when data is writable by the CPU (see the sketch below this list).
  • No need to keep track of pages traveling through the system; they are returned via the put_page() API.
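
A minimal sketch of the DMA-sync idea from the second bullet, assuming the driver (or the pool) has the page's dma_addr stored somewhere; only the received bytes are synced for CPU access, while the page itself stays mapped for the device:

#include <linux/dma-mapping.h>

/* Sketch only: dev, dma_addr, headroom and pkt_len are assumed to come
 * from the driver's RX descriptor handling. The page stays DMA-mapped;
 * we only sync the region the CPU is about to read.
 */
static void rx_sync_frame_for_cpu(struct device *dev, dma_addr_t dma_addr,
				  unsigned int headroom, unsigned int pkt_len)
{
	dma_sync_single_for_cpu(dev, dma_addr + headroom, pkt_len,
				DMA_FROM_DEVICE);
}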

The XDP dream scenario

The XDP dream scenario is about returning frame-pages with no refcnt changes to the struct-page. The optimal XDP case is one page per frame/packet with page refcnt==1, and a put_page() call that catches refcnt==1 and recycles the page for reuse.

This is basically what happens today in mlx5, when XDP-redirect activates ndo_xdp_xmit() out another device, and at the other device's DMA-TX completion the pages are returned via xdp_return_frame(), which calls page_pool_put_page().

The goal of zero atomic operations is very hard to achieve, as some kind of sync operation is needed when recycling pages back to the originating device, but that can be amortized through bulking (see later).

When no XDP program is configured on the device, it might be better for memory usage to split the page into two packets and use a refcnt based model, as the refcnt cost is likely dwarfed by the overhead of the rest of the network stack, making the memory saving worth the speed trade-off.

Status today: lost reliable return hook

The big difference today is that the page_pool is not hooked into put_page() and the MM-layer. This means we lost the reliable return hook point that made it easy to keep the page DMA-mapped.

When XDP redirects (XDP_REDIRECT) a frame as an xdp_frame, the XDP return API gives us a reliable return hook. (We don't do it today, but it should be fairly easy to keep the frame DMA-mapped for the redirect+return path.)

When the frame travels as an SKB into the network stack, we lose the reliable return hook, as the netstack returns pages based on refcnt.

Regaining the return hook or tracking pages

When wanting to keep pages DMA-mapped, we need to figure out: (1) how to make sure DMA-mapped frames are not leaked, (2) how to make sure all pages are eventually DMA-unmapped appropriately, and (3) where to store the DMA-addr (for use in DMA-sync and the eventual unmap).

There are three options: two involve creating a return hook, and the third instead keeps track of outstanding frames traveling the system.

What are the options for reestablishing a return hook point? The two return hooks are:

  • Page allocator layer, via put_page().
  • Network stack layer, via __kfree_skb().

The alternative is to keep track of outstanding frames at the driver layer via elevating refcnt.

An advantage of a guaranteed return hook (which page_pool doesn't currently depend on) is that the device driver can use memory pages that are special, e.g. a device HW memory area, or pages already mapped into userspace.

Return-hook: put_page

The original idea was to modify put_page() to catch pages with refcnt==1 and recycle those. This was rejected upstream, but meanwhile a hook has been created at the exact spot we need.

The function is called put_devmap_managed_page().

It is used by HMM (Heterogeneous Memory Management), which handles device memory like GPU on-board memory. The DAX system also leverages this via the type MEMORY_DEVICE_FS_DAX.

The questions are: (1) can page_pool also leverage this, and (2) is the performance good enough?

TODO investigate: the page "zonenum" must be ZONE_DEVICE, whose semantics are unclear; more info is needed. Can this type of page be used for "normal" network stack delivery?

The code that ends up being called is __put_devmap_managed_page(page). The callback in __put_devmap_managed_page() is implemented by calling: page->pgmap->page_free(page, page->pgmap->data).

The part of struct-page containing this area looks like:

struct {        /* ZONE_DEVICE pages */
        /** @pgmap: Points to the hosting device page map. */
        struct dev_pagemap *pgmap;
        unsigned long hmm_data;
        unsigned long _zd_pad_1;        /* uses mapping */
};

TODO: Read include/linux/memremap.h and figure out (a) what struct dev_pagemap is used for, and (b) what it means and what requirements are associated with being a MEMORY_DEVICE_PRIVATE type using ZONE_DEVICE.

For storing the DMA-addr (dma_addr_t), we could use the hmm_data area.

In mm/hmm.c the callback pgmap->page_free is assigned to hmm_devmem_free, and pgmap->data is the instance of the devmem itself. This could fairly easily be adapted to call page_pool_put_page(), where pgmap->data would be the instance of the page_pool.
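
A hedged sketch of that adaptation (the callback name is made up, and the page_pool_put_page() signature of that kernel era is assumed):

/* Hypothetical: free callback for ZONE_DEVICE pages owned by a page_pool.
 * pgmap->data is assumed to have been set to the page_pool instance.
 */
static void page_pool_devmap_page_free(struct page *page, void *data)
{
	struct page_pool *pool = data;

	page_pool_put_page(pool, page);
}

/* Setup (sketch): pgmap->page_free = page_pool_devmap_page_free;
 *                 pgmap->data      = pool;
 */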

The level and use of indirect calls (e.g. page->pgmap->page_free) are slightly concerning (in a spectre-v2 context).

Feasibility (not needed for page_pool as-is): investigate whether these pages can be VMA-mapped into userspace. There are indications that this is not possible, given that the ZONE_DEVICE part of struct-page overlaps with the page->mapping area.

Return-hook: via SKB

Another possible return hook is when the SKB (sk_buff) is freed.

Basically all SKB freeing goes through __kfree_skb(). Following the call path, with the called functions expanded inline, down to where the page address is freed:

void __kfree_skb(struct sk_buff *skb)
{
        skb_release_all(skb) {
                if (likely(skb->head)) {
                        skb_release_data(skb) {
                                skb_free_head(skb) {
                                        if (skb->head_frag) {
                                                skb_free_frag(skb->head) {
                                                        page_frag_free(addr);
                                                }
                                        }
                                }
                        }
                }
        }
}

Here the skb->head_frag bit is used to indicate that the packet data is allocated from a page (covering both page fragments and order-0 pages).

We need some extra info at __kfree_skb() time, in order to know what page_pool the data-page needs to be returned to, and this info can only be derived from the SKB. Extending struct sk_buff (SKB) is a very sensitive topic and is in general not allowed. Thus, we have to reduce the info needed in the SKB to the absolute minimum.

The DMA-addr (8-bytes) could be stored in the struct-page at page->private.
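
A sketch of that storage idea, using the existing page_private() helpers (and assuming dma_addr_t fits in an unsigned long on the target architectures):

#include <linux/mm.h>

/* At DMA-map time: stash the DMA address in the page itself */
static void stash_dma_addr(struct page *page, dma_addr_t dma_addr)
{
	set_page_private(page, (unsigned long)dma_addr);
}

/* At DMA-sync/unmap time: read it back */
static dma_addr_t fetch_dma_addr(struct page *page)
{
	return (dma_addr_t)page_private(page);
}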

The XDP-redirect system uses struct xdp_mem_info for storing the necessary info in the (struct) xdp_frame. The xdp_mem_info is 8 bytes and could be reduced further: (1) 'type' can easily be made smaller (enum xdp_mem_type has only a few values), and (2) 'id' is artificially capped at MEM_ID_MAX=0xFFFE, so it also fits in fewer bits (a shrunken layout is sketched after the struct below).

struct xdp_mem_info {
        u32                        type;                 /*     0     4 */
        u32                        id;                   /*     4     4 */
        /* size: 8, cachelines: 1, members: 2 */
        /* last cacheline: 8 bytes */
};
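
A hypothetical shrunken layout (not an upstream struct), just to illustrate the reductions mentioned above:

/* Hypothetical: 'type' fits easily, 'id' fits in 16 bits due to the
 * MEM_ID_MAX=0xFFFE cap.
 */
struct xdp_mem_info_small {
	u16 type;	/* enum xdp_mem_type has only a handful of values */
	u16 id;		/* capped at MEM_ID_MAX=0xFFFE */
};	/* 4 bytes instead of 8 */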

Reusing xdp_mem_info for storing the info would allow us to reuse the xdp_return_frame() API more or less directly.

Crazy idea: use one bit in the SKB saying this originates from an xdp_frame, then via the skb->head pointer find the offset to the start of the xdp_frame, use its struct xdp_mem_info, and return the frame via the xdp_return_frame() API.
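
A purely hypothetical sketch of that idea (the SKB bit does not exist, and it is assumed the xdp_frame metadata sits at the start of the headroom, i.e. at skb->head, as convert_to_xdp_frame() arranges for the raw buffer):

#include <net/xdp.h>

/* Hypothetical: only valid if some new skb bit says the buffer
 * originated from an xdp_frame placed at the start of the headroom.
 */
static void skb_return_via_xdp_mem_info(struct sk_buff *skb)
{
	struct xdp_frame *xdpf = (struct xdp_frame *)skb->head;

	xdp_return_frame(xdpf);	/* uses xdpf->mem (struct xdp_mem_info) */
}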

Keep track of pages

The network stack current handling of packet-data pages is refcnt based.

So, what is the trick used by drivers that want to (1) keep the page DMA-mapped and (2) do faster recycling (than going through the page-allocator)? It obviously needs to leverage a refcnt based scheme on the page.

The general idea is to elevate the refcnt on the page before sending it through the network stack, and keep the page pointer on a tracking waiting-queue/list (often a queue). Thus, this waiting-queue contains pages that are in-flight, and there is no signal or callback when they are "done". To figure out when a page is available for reuse, the driver must poll or revisit this waiting-queue and, by observing the page refcnt, know whether the page can be reused now.

Most drivers poll or revisit the waiting-queue when doing other work, like when needing to refill the RX-ring with fresh frames/pages. There are different strategies when the head of the waiting-queue contains a page that is still in-flight, as it could mean several things: (1) it could indicate too-fast recycling, meaning the next pages are likely also not ready for reuse, or (2) this page could be held by e.g. the TCP stack for retransmit while the next page is ready for reuse. Thus, there is a choice to either (a) keep the page for the next round (hoping it will be reusable soon) or (b) release the page by DMA-unmapping it and returning it to the page-allocator.
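
A sketch of the refcnt check this scheme relies on (the surrounding waiting-queue handling is driver-specific and omitted):

#include <linux/mm.h>

/* The refcnt was elevated before the page went into the stack, so once
 * it drops back to one, no other user holds the page any longer.
 */
static bool page_is_reusable_now(struct page *page)
{
	return page_ref_count(page) == 1;
}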

When the page_pool interacts with XDP, it gets pages returned from XDP via the xdp_return_frame API hook/callback. Given the network stack doesn't provide such a hook, the mlx5 driver used its existing refcnt based recycle queue on top of page_pool. And page_pool was modified to fall through to put_page() if the refcnt is not equal to 1. This double layer means the page_pool cannot be responsible for the DMA-mapping.

An alternative to introducing a return-hook for the network stack is to integrate a refcnt based waiting-queue inside page_pool, which is only used when frames/pages travel through the network stack. This is a hybrid approach, but integrating it would allow page_pool to keep the DMA-mapping.

Understanding page_pool details

Some details about the page_pool API that might not be obvious.

Extremely fast alloc page

The page_pool leverages the knowledge/requirement that allocations MUST happen from NAPI context. (During driver init of the RX ring, which is not NAPI context, it is known that no concurrent users of this page_pool exist, so it is still safe.)

A NIC driver creates a page_pool per RX-queue. Combined with the protection provided by the NAPI context (per RX-queue), this allows page_pool to get pages from a completely unlocked array-style stack-queue (see struct pp_alloc_cache). It is difficult to get any faster than this.
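
For reference, a per-RX-queue pool is set up roughly like this (a sketch modeled on how mlx5 does it today; the rx_ring_size, numa_node, dev and rxq parameters are assumptions about the driver's own structures):

#include <net/page_pool.h>
#include <net/xdp.h>

/* Sketch: create a pool for one RX-queue and register it with the XDP
 * memory model, so xdp_return_frame() knows how to give pages back.
 */
static struct page_pool *rxq_create_page_pool(struct device *dev, int numa_node,
					      unsigned int rx_ring_size,
					      struct xdp_rxq_info *rxq)
{
	struct page_pool_params pp_params = {
		.order		= 0,			/* order-0 pages */
		.flags		= 0,			/* mlx5 today: no PP_FLAG_DMA_MAP */
		.pool_size	= rx_ring_size,
		.nid		= numa_node,
		.dev		= dev,			/* for DMA mapping purposes */
		.dma_dir	= DMA_FROM_DEVICE,
	};
	struct page_pool *pool = page_pool_create(&pp_params);

	if (IS_ERR(pool))
		return pool;

	/* Tell the XDP return path that this RX-queue uses a page_pool
	 * (error handling elided in this sketch).
	 */
	xdp_rxq_info_reg_mem_model(rxq, MEM_TYPE_PAGE_POOL, pool);
	return pool;
}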

Code from: __page_pool_get_cached()

if (likely(pool->alloc.count)) {
        /* Fast-path */
        page = pool->alloc.cache[--pool->alloc.count];
        return page;
}

Extremely fast recycle direct

Another optimization leveraged by page_pool: for frames that need to be dropped while still running under the RX NAPI context, either for error cases or an explicit drop due to the XDP_DROP action, the API call page_pool_recycle_direct() can be used, which, as described above, simply returns the page to the array stack (code in __page_pool_recycle_direct()).
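
A sketch of the driver-side call for the XDP_DROP/error path, while still in the pool's RX softirq context (the surrounding RX loop is assumed):

#include <net/page_pool.h>

/* Only safe from the NAPI/RX softirq context that owns this pool */
static void rx_drop_frame(struct page_pool *pool, struct page *page)
{
	page_pool_recycle_direct(pool, page);	/* back onto the unlocked array stack */
}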

One advantage is that, even with XDP_REDIRECT, the redirect core-code can choose to drop frames and see almost the same drop performance as driver-level code (via calling xdp_return_frame_rx_napi).

Fast rhashtable lookup in __xdp_return

In __xdp_return there is a rhashtable_lookup() call on the xdp_mem_info->id, which looks like a performance issue, but it isn't. While we would like to amortize this via an explicit bulk return API, the lookup is surprisingly fast.

The reason this is fast is that the hash function defined (xdp_mem_id_hashfn) is not really a hash function, as it simply returns the id as the key. Given we use a cyclically increasing ID, the key/id is unique and "hash-distributed" enough already.
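
For reference, the function looks approximately like this (paraphrased from net/core/xdp.c of that era):

static u32 xdp_mem_id_hashfn(const void *data, u32 len, u32 seed)
{
	const u32 *k = data;
	const u32 key = *k;

	/* Use the cyclically increasing ID directly as the hash key */
	return key;
}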

Warning: missing pieces in page_pool

Currently: fully-locked page-recycle call

When the RFC page_pool was ripped out and converted to be used in the network stack, the ALF (Array-based Lock-Free) queue was dropped. Instead, ptr_ring was used to replace the internal page_pool queue. The ptr_ring actually does have some performance advantages over the ALF-queue, e.g. it reduces cross-CPU cache-coherency traffic and is faster cross-CPU.

One disadvantage is that the ptr_ring_produce (or ptr_ring_produce_bh) call takes a lock, and it is currently called per returned page, see __page_pool_recycle_into_ring(). This is obviously a scalability issue waiting to happen, when/if multiple CPUs want to return packets originating from the same RX-queue.

See the benchmark section below ("What is the overhead of the fully-locked page-recycle call") showing this is still very fast in the non-lock-congested case.

Thus, this needs to be fixed/improved. The basic idea for addressing it is bulking, but there are two ways to introduce that: (1) expose an explicit bulk return API, or (2) hide it in the page_pool API via a clever lockless per-CPU store (that returns pages in bulk).

Jesper has a lot of details for option (2), as a significant performance gain can be achieved by having knowledge about (and separating) what context the kernel is running in (softirq/bh, hardirq, process-context).
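
For option (1), a minimal sketch of what an explicit bulk-return API could look like (page_pool_put_page_bulk() is an assumption, not an existing API), amortizing the ptr_ring producer lock over a whole batch:

#include <linux/mm.h>
#include <linux/ptr_ring.h>
#include <net/page_pool.h>

/* Hypothetical: return 'count' pages with one producer-lock round-trip,
 * instead of one lock per page as __page_pool_recycle_into_ring() does.
 */
static void page_pool_put_page_bulk(struct page_pool *pool,
				    struct page **pages, int count)
{
	int i;

	spin_lock(&pool->ring.producer_lock);
	for (i = 0; i < count; i++) {
		if (__ptr_ring_produce(&pool->ring, pages[i])) {
			/* Ring full: release the page. A real implementation
			 * must DMA-unmap before handing it back.
			 */
			put_page(pages[i]);
		}
	}
	spin_unlock(&pool->ring.producer_lock);
}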

Benchmark based development

In order to live up to the performance requirements of XDP, it is important that we use a benchmark-driven development model. Thus, doing Proof-of-Concept (PoC) implementations and modifications to the kernel, and evaluating the performance at every step, is encouraged.

What is the overhead of "fully-locked page-recycle call"

Benchmark-based development notes: a Proof-of-Concept (PoC) modified the mlx5 driver and bq_xmit_all() (devmap.c) to call xdp_return_frame (instead of the normal xdp_return_frame_rx_napi). The test used xdp_redirect_map to redirect frames out a driver with ndo_xdp_xmit, but with XDP manually unloaded from the egress driver, causing the driver's ndo_xdp_xmit to return an error and thus all xdp_frames to be freed/returned. Diff: (1/14340444 - 1/12728465) * 10^9 = -8.83 ns difference.

  • The performance with xdp_return_frame was: 12,728,465 pps.
  • The performance with xdp_return_frame_rx_napi was: 14,340,444 pps.

Thus, in a non-lock congested case, the performance is still very good.

What is the overhead of page order-0 allocations

The base overhead of allocating and freeing an order-0 page in a tight loop, always hitting the per-CPU (PCP) fast-path, was measured with the prototype kernel module page_bench01.

dmesg output
[ 1315.759541] page_bench01: alloc_pages order:0(4096B/x1) 223 cycles per-4096B 223 cycles
[ 1315.768139] time_bench: Type:alloc_pages_order_step Per elem: 223 cycles(tsc) 62.096 ns (step:0)
 - (measurement period time:0.062096443 sec time_interval:62096443)
 - (invoke count:1000000 tsc_interval:223548984)

This shows an alloc+free order-0 page cost of 223 cycles or 62.096 ns. This almost consumes the entire budget for 10Gbit/s wirespeed, which is 67.2 ns (for minimum-size packets). Converting 62.096 ns to packets per second gives (1/62.096*10^9) = 16,104,097 pps, which doesn't represent any real pps rate (it is just to relate the cost to pps numbers).

XDP optimization for ndo_xdp_xmit

When xdp_frames are redirected and sent out a remote device via ndo_xdp_xmit, the frame is again DMA-mapped for transmit on this remote device, and at XDP TX-queue completion time the frame/page is DMA-unmapped again.

On many systems, the underlying DMA-engine/device is the same. Thus, the DMA-addr of the page that we keep DMA-mapped for RX is actually the same and valid for the TX net_device. It would be an optimization to avoid this DMA-map/unmap and replace it with a DMA-sync, if this condition is satisfied.

Thus, we would need an API (accessible by XDP in ndo_xdp_xmit) that can (1) tell us whether the condition is satisfied, and (2) return the DMA-addr to be used for TX.
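
A purely hypothetical sketch of how such an API could be consumed from an ndo_xdp_xmit implementation; xdp_frame_dma_compatible() and xdp_frame_dma_addr() are made-up names for points (1) and (2):

#include <linux/dma-mapping.h>
#include <net/xdp.h>

/* Hypothetical helpers answering (1) and (2) above; they do not exist. */
bool xdp_frame_dma_compatible(struct xdp_frame *xdpf, struct device *tx_dev);
dma_addr_t xdp_frame_dma_addr(struct xdp_frame *xdpf);

static int xmit_one_frame(struct device *dma_dev, struct xdp_frame *xdpf)
{
	dma_addr_t dma;

	if (xdp_frame_dma_compatible(xdpf, dma_dev)) {
		/* Reuse the RX-side mapping: only a DMA-sync is needed */
		dma = xdp_frame_dma_addr(xdpf);
		dma_sync_single_for_device(dma_dev, dma, xdpf->len,
					   DMA_BIDIRECTIONAL);
	} else {
		dma = dma_map_single(dma_dev, xdpf->data, xdpf->len,
				     DMA_TO_DEVICE);
		if (dma_mapping_error(dma_dev, dma))
			return -ENOMEM;
	}
	/* ... place 'dma' and xdpf->len on the TX descriptor ring ... */
	return 0;
}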

Date: 2021-09-20 Mon 18:33
