Implementing page_pool return-hook via SKB

Basic idea
(pre-RFC) Wrong recycle on two branches
RFC-patchset
- RFC-feedback
  - Case: skb_try_coalesce()
  - Case: skb_gro_receive()
Patchset: DMA addr in struct page
Testing procedures
- Multicast traffic
- TCP traffic

This document is the development notes, while implementing page_pool return-hook via SKB (See section "Return-hook: via SKB" in page_pool01_evolving_API.html)

Basic idea

SKB freeing basically all goes through __kfree_skb(), and following the code path expanding relevant code-inside (see below), then is it fairly clear that skb_free_head() is a candidate for a page_pool return hook.

void __kfree_skb(struct sk_buff *skb)
{
        skb_release_all(skb) {
                if (likely(skb->head)) {
                        skb_release_data(skb) {
                                skb_free_head(skb) {
                                        // Catch page_pool pages here via xdp_return_skb_page
                                        if (skb->head_frag) {
                                                skb_free_frag(skb->head) {
                                                        page_frag_free(addr);
                                                }
                                        }
                                }
                        }
                }
        }
}

(pre-RFC) Wrong recycle on two branches

Git tree: https://github.com/apalos/bpf-next Branches:

mvneta_03_page_pool_recycle
mvneta_04_page_pool_recycle_xdp

The issue is wrong placement of xdp_mem_info in struct sk_buff, as it is placed in between headers_start and headers_end.

The issue arise when either using skb_copy() or pskb_copy(), which allocates a new data-area and copy over contents. But __copy_skb_header() memcpy, old SKB area between headers_start to headers_end, which also include xdp_mem_info. Thus, on kfree_skb the hook for page_pool return is invoked, for this mem-area that didn't originate from page_pool.

Move xdp_mem_info, just after members (flags) head_frag and pfmemalloc. As a future plan, we could move introduce a __u8 flags member to xdp_mem_info and move flags head_frag and pfmemalloc into this area.

Fixed in this commit that will be squashed before upstreaming.

RFC-patchset

Upstream RFC patchset was send <2018-12-07 Fri>.

Subject: [net-next PATCH RFC 0/8] page_pool DMA handling and allow to recycles frames via SKB
Message-ID: <154413868810.21735.572808840657728172.stgit@firesoul>
https://patchwork.ozlabs.org/project/netdev/list/?series=80284&state=%2a

This patchset corresponds to:

Git tree: https://github.com/apalos/bpf-next
Branch: mvneta_07_page_pool_recycle

RFC-feedback

DaveM didn't reject the 4 bytes extra in SKB (link)

DaveM pointed out that page->private is only 32-bit on 32-bit Architectures AND use-cases exists where hardware need 64-bit DMA addresses. Thus, the page->private member isn't large enough in those use-cases.

Eric Dumazet pointed out that we might be missing some case, that manipulates the pages directly:

skb_try_coalesce
splice() stuff
GRO
skb cloning
tcp_zerocopy_receive() (TCP RX zero copy)
more generally all paths that can grab pages from skbs.

One design choice that solves many of Eric's concerns is that page_pool pages (that still contain DMA state) MUST NOT be used as part of SKB "frags". Only the skb->head that points to the data "page", can belong to a page_pool. After our changes to mvneta driver, we use build_skb() that uses the page_pool data-pointer as skb->head. The driver (mvneta) still have jumbo-frame support which uses SKB "frags", to handle that case, we "release" the page_pool state by calling page_pool_unmap_page() in driver, when assigning (the payload beyond PAGE_SIZE) it as "frags" (via skb_add_rx_frag() call).

An page_pool page, can actually be "appended"/merged into another SKB as a skb->frag_list. This might be the least intrusive thing, to detect mem_info->type is set, and not reject e.g. coalesce but instead fall-over to skb->flag_list coalescing.

TODO: go through each case Eric pointed out and check constraints.

Case: skb_try_coalesce()

skb_try_coalesce() was broken, as it converts the skb->head into a skb "frags" of another SKB.

Look into code solutions, in prioritized order:

"release" page_pool state, via calling page_pool_unmap_page()
use skb->frag_list coalesce instead
simply reject coalesce action if mem_info->type is set (last-resort)

Case: skb_gro_receive()

Patchset: DMA addr in struct page

Cover-letter

Subject: Fix page_pool API and dma address storage

As pointed out by David Miller in [1] the current page_pool implementation stores dma_addr_t in page->private. This won't work on 32-bit platforms with 64-bit DMA addresses since the page->private is an unsigned long and the dma_addr_t a u64.

Since no driver is yet using the DMA mapping capabilities of the API let's fix this by storing the information in 'struct page' and use that to store and retrieve DMA addresses from network drivers.

As long as the addresses returned from dma_map_page() are aligned the first bit, used by the compound pages code should not be set.

Ilias tested the first two patches on Espressobin driver mvneta, for which we have patches for using the DMA API of page_pool.

[1]: https://lore.kernel.org/netdev/20181207.230655.1261252486319967024.davem@davemloft.net/

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>

— Question: Who of the maintainers MM or netdev will take these changes?

Stgit mail command

stg mail --prefix=net-next --version="V3" --edit-cover  --cc meup \
 --to netdev --to linux-mm@kvack.org \
 --cc willy@infradead.org --cc davem \
 --cc mgorman@techsingularity.net --cc akpm \
 --cc toke --cc ilias \
 --cc tariq --cc saeed \
 --cc alex \
 mm-dma_addr_to-struct-page..dma_attr

Using DMA_ATTR_SKIP_CPU_SYNC

page_pool: use DMA_ATTR_SKIP_CPU_SYNC for DMA mappings

As pointed out by Alexander Duyck, the DMA mapping done in page_pool needs to use the DMA attribute DMA_ATTR_SKIP_CPU_SYNC.

As the principle behind page_pool keeping the pages mapped is that the driver takes over the DMA-sync steps.

Testing procedures

Multicast traffic

Multicast traffic requires skb_clone'ing, so here is a test case with multicast:

I setup:

IP 192.168.200.1 is configured on the espressobin board.
IP 192.168.200.2 is configured on client/PC

On espressobin start iperf server:

ifconfig eth0 192.168.200.1
route add -net 224.0.0.0 netmask 240.0.0.0 eth0
iperf -s -u -B 226.94.1.1 -i 1

On client/PC start iperf client::

iperf -c 226.94.1.1 -u -T 32 -t 30 -i 1 -B 192.168.200.2 -b1000Mbit

With iperf multicast the server joins the mcast stream and starts receiving packets.

TCP traffic

Both scp and iperf seem to trigger skb_try_coalesce()

On espressobin start iperf server:

ifconfig eth0 192.168.200.1
iperf -s

iperf -c 192.168.200.1

or from a host send a file to espressobin via scp

Implementing page_pool return-hook via SKB

Table of Contents