page_pool handling in-flight frames during shutdown

Table of Contents

This issue is (unfortunately) blocking drivers from using the DMA-mapping feature of page_pool.

Issue: lifetime of device + page_pool

There is an issue/challenge when a page_pool object itself is freed, while there are still packets in-flight (that need to be returned to the page_pool). (The issue is more prone to happen, when SKBs can carry page_pool pages).

Today, this is handled by __xdp_return() via RCU lookups (in rhashtable), which will simply call put_page() if the page_pool ID isn't in the rhashtable. This is only valid/correct if page_pool is NOT used for DMA-mapping.

Solution#1 - kill performance

(Wrong solution)

The naive solution, that would kill performance, is to have a refcnt on page_pool for each outstanding packet-page. And use this refcnt to know when all pages have been returned, and thus when it is safe to free the page_pool.

Solution#2

Create a smaller object that can be put into rhashtable, that only store the struct device pointer, that is the one used for the DMA-unmap call.

This is not optimal, as we still don't know when this smaller object can be freed from the rhashtable. (This both pin-down rhashtable IDs and memory, for the lifetime of the kernel)

Solution#3

Keep counters for packets in-flight, or rather outstanding-pages. This will allow us to know when it is safe to free (and remove the page_pool object from rhashtable).

This solution can like solution#1, easily kill performance, depending on implementation details. E.g. it is bad to have a single in-flight counter that is updated on both alloc and free time, because it will cause cache-line bouncing (stressing CPU cache coherency protocol).

Properties of page_pool to realise:

  • Alloc happens RX time and is protected by NAPI/softirq, which guarantees no-concurrent access e.g. to page_pool pp_alloc_cache.
  • Free can run concurrently on remote CPUs, and ptr_ring is used for synchronise return of pages (via producer lock).
  • The ptr_ring doesn't account number of objects in the ring.

The proposed solution is having two (unsigned) counters, that can be on different cache-lines, and can be used to deduct in-flight packets. This is done by mapping the unsigned "sequence" counters onto signed Two's complement arithmetic operations. This is e.g. used by kernel's time_after macros, described in kernel commit 1ba3aab3033b and 5a581b367b5, and also explained in this Wikipedia article and RFC1982.

Thus, these two increment counters only need to be read and compared, when checking if it's safe to free the page_pool structure. Which will only happen when driver have disconnected RX/alloc side. Thus, on a non-fast-path.

The counters should track number of "real" page allocation/free operations. The pages getting recycled, and is storing in internal ptr_ring, is basically counted as part of in-flight pages. This, also reduce the effect on the fast-path recycling code.

On the alloc side the counter can be incremented lockless, as it is updated under NAPI/softirq protection.

On the free-side, the operation can happen on remote CPUs. The operation that can happen (on remote CPUs) is in-case ptr_ring is full at return/free time, which result in the page being returned to page-allocator. Thus, this counter need to be updated atomically. The atomic cost could be mitigated via using lib/percpu_counter.c.

Number of pages held 'in-flight' by the page_pool, is also relevant for drivers not using the DMA mapping, as indicate/include the pages stored in the ptr_ring. Later, when/if system comes under memory pressure, we want to allow page-allocator to request page_pool to release resources, where this count could be used to select among page_pools. It is also useful for stats and debugging.

Complications with solution#3

Driver take down procedure need to change. Fortunately only one driver (mlx5) uses page_pool.

Current procedure is:

static void mlx5e_free_rq(struct mlx5e_rq *rq)
{
        // [...]
        xdp_rxq_info_unreg(&rq->xdp_rxq);
        if (rq->page_pool)
                page_pool_destroy(rq->page_pool);

With handling of in-flight packet-pages, we might have to postpone calling page_pool_destroy(). We could move this into xdp_rxq_info_unreg(), but XDP mem model code need to stay allocator agnostic. Thus, we likely need to extend struct xdp_mem_allocator with destructor call back.

Further more, we likely need to have both a "request-cleanup" and "destructor" callback. As we want to release as many memory resources as possible as early as possible, and only wait for the packet-pages in-flight.

Extra: We can add a REG_STATE_DYING to XDP (struct xdp_rxq_info), which can/might help us catch invalid driver use-cases.

TODO: cleanup/error code-path in drivers

We cannot completely remove page_pool_destroy() from the API, as drivers setup paths can fail, and they might need to free the page_pool resource explicitly. We could export a __page_pool_free() function.

Found bug/issue.

xdp: fix leak of IDA cyclic ID if rhashtable_insert_slow fails

Fix error handling case, where inserting ID with rhashtable_insert_slow fails in xdp_rxq_info_reg_mem_model, which leads to never releasing the IDA ID, as the lookup in xdp_rxq_info_unreg_mem_model fails and thus ida_simple_remove() is never called.

Fix by releasing ID via ida_simple_remove(), and mark xdp_rxq->mem.id with zero, which is already checked in xdp_rxq_info_unreg_mem_model().

diff --git a/net/core/xdp.c b/net/core/xdp.c
index 4b2b194f4f1f..762abeb89847 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -301,6 +301,8 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
        /* Insert allocator into ID lookup table */
        ptr = rhashtable_insert_slow(mem_id_ht, &id, &xdp_alloc->node);
        if (IS_ERR(ptr)) {
+               ida_simple_remove(&mem_id_pool, xdp_rxq->mem.id);
+               xdp_rxq->mem.id = 0;
                errno = PTR_ERR(ptr);
                goto err;
        }

Notes while cleanup patchset

Keeping some notes while cleaning up patchset. This section might not make sense, consider removing this section later.

Prepare mlx5

Removed comment.

+void __page_pool_free(struct page_pool *pool)
+{
+       /* API user must call page_pool_request_shutdown first, and
+        * assure that it was successful
+        */

page_pool: introduce page_pool_free

In case driver fails to register the page_pool with XDP return API (via xdp_rxq_info_reg_mem_model()), then the driver can free the page_pool resources more directly than calling page_pool_destroy(), which does a unnecessarily RCU free procedure.

This patch is preparing for removing page_pool_destroy(), from driver invocation.

mlx5: more correct usage of page_pool API

The page_pool API states user is responsible for invoking page_pool_put_page once. This were not done in mlx5e_page_release() when recycle is false. This e.g. happens from mlx5e_free_rq() when tearing down resources.

This API omission is not critical, as mlx5 doesn't use page_pool for DMA-mapping yet. This becomes important later when tracking in-flight frames. This change makes the pages on the driver local page_cache, to go through the page_pool system.

In mlx5e_free_rq() moved the page_pool_destroy() call to after the mlx5e_page_release() calls, as it is more correct.

Strange use in i40e

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 320562b39686..441323ca1464 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3248,7 +3248,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)

        if (ring->vsi->type == I40E_VSI_MAIN)
                xdp_rxq_info_unreg_mem_model(&ring->xdp_rxq);
-
+// FIXME: Why not using xdp_rxq_info_unreg() ???
        ring->xsk_umem = i40e_xsk_umem(ring);
        if (ring->xsk_umem) {
                ring->rx_buf_len = ring->xsk_umem->chunk_size_nohr -

page_pool: handle in-flight pages and remove page_pool_destroy

This patch adds tracking of in-flight pages for page_pool.

The existing scheme where the XDP memory model unregister waits one RCU grace-period is insufficient for page_pool, as xdp_frame's can potentially be stuck on a remote drivers TX ring queue for longer.

This patch adds accounting for pages allocated and freed, tracking in-flight pages. This adds more strict requirements for page_pool, as it can no-longer be used as shim-layer for the page allocator. There is now a more strict requirement of calling page_pool_put_page() or page_pool_unmap_page(), when a page leaves the page_pool.

For page accounting

This patch removes page_pool_destroy().

Adds page_pool_request_shutdown(), which is used by XDP memory model unregister.

If a driver implements a local page caching scheme on top of this (like mlx5) which also cache DMA-mappings, then the in-flight accounting, can be used for detecting if it leaks DMA mappings.

Notes for patchset RFCv1-pre-upstream

stg mail –edit-cover –version="RFCv1-pre-upstream" –to meup \ –to ilias –to toke –to tariq \ –cc ivan.khoronzhuk@linaro.org –cc grygorii.strashko@ti.com \ 01-net-page_pool-add-helper1..split-page-DMA-issue

Subj: Bugfixes related to page_pool

I've not had time to write a cover-letter…

I wanted to show you the patches, I'm currently working on, which I guess are blocking the drivers/net/ethernet/ti/ update. I'm running out of time (as it my birthday today and I have plans), so just dumping these… not-cleaned up patches.

My notes are here: https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool03_shutdown_inflight.org

Notes for patchset RFCv2-pre-upstream

stg mail –edit-cover –version="RFCv2-pre-upstream" –to meup \ –to ilias –to toke –to tariq \ –cc ivan.khoronzhuk@linaro.org –cc grygorii.strashko@ti.com \ 01-net-page_pool-add-helper1..split-page-DMA-issue

Subj: Bugfixes related to page_pool

Just wanted to send out an updated version of the patchset.

Patchset submission

Reorg order of patches, such that mlx5 and cpumap fixes appear before introducing in-flight handling. API page_pool_release_page introduced earlier.

Cover letter V1 + V2

xdp: page_pool fixes and in-flight accounting

This patchset fix page_pool API and users, such that drivers can use it for DMA-mapping. A number of places exist, where the DMA-mapping would not get released/unmapped, all these are fixed. This occurs e.g. when an xdp_frame gets converted to an SKB. As network stack doesn't have any callback for XDP memory models.

The patchset also address a shutdown race-condition. Today removing a XDP memory model, based on page_pool, is only delayed one RCU grace period. This isn't enough as redirected xdp_frames can still be in-flight on different queues (remote driver TX, cpumap or veth).

We stress that when drivers use page_pool for DMA-mapping, then they MUST use one packet per page. This might change in the future, but more work lies ahead, before we can lift this restriction.

This patchset change the page_pool API to be more strict, as in-flight page accounting is added.

Stg mail command V1:

stg mail --edit-cover --version="net-next v1" --to meup \
 --to netdev \
 --to ilias --to toke --to tariq \
 --cc ivan.khoronzhuk@linaro.org --cc grygorii.strashko@ti.com \
 --cc mcroce@redhat.com --cc toshiaki.makita1@gmail.com \
 01-net-page_pool-add-helper1..tracepoint02-page-pool

Message-ID: <156045046024.29115.11802895015973488428.stgit@firesoul>

Link: https://lore.kernel.org/netdev/156045046024.29115.11802895015973488428.stgit@firesoul/

Stg mail command V2:

stg mail --edit-cover --version="net-next v2" --to meup \
 --to netdev \
 --to ilias --to toke --to tariq \
 --cc ivan.khoronzhuk@linaro.org --cc grygorii.strashko@ti.com \
 --cc mcroce@redhat.com --cc toshiaki.makita1@gmail.com \
 01-net-page_pool-add-helper1..fix-device-refcnt

Link: https://lore.kernel.org/netdev/156086304827.27760.11339786046465638081.stgit@firesoul

https://patchwork.ozlabs.org/project/netdev/list/?series=114508&state=%2a

cpumap issue and fix

Patch-desc: xdp: page_pool related fix to cpumap

When converting an xdp_frame into an SKB, and sending this into the network stack, then the underlying XDP memory model need to release associated resources, because the network stack don't have callbacks for XDP memory models. The only memory model that needs this is page_pool, when a driver use the DMA-mapping feature.

Introduce page_pool_release_page(), which basically does the same as page_pool_unmap_page(). Add xdp_release_frame() as the XDP memory model interface for calling it, if the memory model match MEM_TYPE_PAGE_POOL, to save the function call overhead for others. Have cpumap call xdp_release_frame() before xdp_scrub_frame().

Patch-desc: mlx5: more strict use of page_pool API

The mlx5 driver is using page_pool, but not for DMA-mapping (currently), and is a little too relaxed about returning or releasing page resources, as it is not strictly necessary, when not using DMA-mappings.

As this patchset is working towards tracking page_pool resources, to know about in-flight frames on shutdown. Then fix places where mlx5 leak page_pool resource.

In case of dma_mapping_error, when recycle into page_pool.

In mlx5e_free_rq() moved the page_pool_destroy() call to after the mlx5e_page_release() calls, as it is more correct.

In mlx5e_page_release() when no recycle was requested, then release page from the page_pool, via page_pool_release_page().

Patch-desc: xdp: tracking page_pool resources and safe removal

This patch is needed before we can allow drivers to use page_pool for DMA-mappings. Today with page_pool and XDP return API, it is possible to remove the page_pool object (from rhashtable), while there are still in-flight packet-pages. This is safely handled via RCU and failed lookups in __xdp_return() fallback to call put_page(), when page_pool object is gone. In-case page is still DMA mapped, this will result in page note getting correctly DMA unmapped.

To solve this, the page_pool is extended with tracking in-flight pages. And XDP disconnect system queries page_pool and waits, via workqueue, for all in-flight pages to be returned.

To avoid killing performance when tracking in-flight pages, the implement use two (unsigned) counters, that in placed on different cache-lines, and can be used to deduct in-flight packets. This is done by mapping the unsigned "sequence" counters onto signed Two's complement arithmetic operations. This is e.g. used by kernel's time_after macros, described in kernel commit 1ba3aab3033b and 5a581b367b5, and also explained in RFC1982.

The trick is these two incrementing counters only need to be read and compared, when checking if it's safe to free the page_pool structure. Which will only happen when driver have disconnected RX/alloc side. Thus, on a non-fast-path.

It is chosen that page_pool tracking is also enabled for the non-DMA use-case, as this can be used for statistics later.

After this patch, using page_pool requires more strict resource "release", e.g. via page_pool_release_page() that was introduced in this patchset, and previous patches implement/fix this more strict requirement.

Drivers no-longer call page_pool_destroy(). Drivers already call xdp_rxq_info_unreg() which call xdp_rxq_info_unreg_mem_model(), which will attempt to disconnect the mem id, and if attempt fails schedule the disconnect for later via delayed workqueue.

Patch-desc: xdp: force mem allocator removal and periodic warning

If bugs exists or are introduced later e.g. by drivers misusing the API, then we want to warn about the issue, such that developer notice. This patch will generate a bit of noise in form of periodic pr_warn every 30 seconds.

It is not nice to have this stall warning running forever. Thus, this patch will (after 120 attempts) force disconnect the mem id (from the rhashtable) and free the page_pool object. This will cause fallback to the put_page() as before, which only potentially leak DMA-mappings, if objects are really stuck for this long. In that unlikely case, a WARN_ONCE should show us the call stack.

Tests

(Established): Test if __xdp_return() can hit no page_pool id issue

First establish if this code can be hit:

diff --git a/net/core/xdp.c b/net/core/xdp.c
index 3d53f9f247e5..6114c80393db 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -338,6 +338,8 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
                        napi_direct &= !xdp_return_frame_no_direct();
                        page_pool_put_page(xa->page_pool, page, napi_direct);
                } else {
+                       pr_warn("%s() XXX issue if page_pool(id:%d) use DMA\n",
+                               __func__, mem->id);
                        put_page(page);
                }
                rcu_read_unlock();

The driver mlx5 (in mlx5e_xdp_set) reset the NIC-ring "channels", when changing between XDP and non-XDP mode.

In that case, the mlx5 driver doesn't reuse the page_pool, instead when a XDP program is attached it "close" and free all the "channels", where mlx5e_free_rq calls xdp_rxq_info_unreg as shown by this perf-probe stack trace:

xdp_rxq_info  1745 [001]  1529.547422: probe:xdp_rxq_info_unreg_2: (ffffffff8179caa6)
        ffffffff8179caa7 xdp_rxq_info_unreg+0x17 (/boot/vmlinux-5.1.0-bpf-next-page-pool+)
        ffffffffa022d32a mlx5e_free_rq+0x3a ([mlx5_core])
        ffffffffa022f0b2 mlx5e_close_channel+0x22 ([mlx5_core])
        ffffffffa0231486 mlx5e_close_channels+0x26 ([mlx5_core])
        ffffffffa0232ac7 mlx5e_close_locked+0x47 ([mlx5_core])
        ffffffffa0232d4c mlx5e_xdp+0x19c ([mlx5_core])
        ffffffff8176a39c dev_xdp_install+0x3c (/boot/vmlinux-5.1.0-bpf-next-page-pool+)
        ffffffff81788806 do_setlink+0xcd6 (/boot/vmlinux-5.1.0-bpf-next-page-pool+)
        ffffffff81788a10 rtnl_setlink+0xd0 (/boot/vmlinux-5.1.0-bpf-next-page-pool+)
        [...]

And xdp_rxq_info_reg is called by mlx5e_alloc_rq, but cannot be seen by (below) call stack as it is inlined in mlx5e_open_rq.

xdp_rxq_info  1806 [000]  1883.326305:     probe:xdp_rxq_info_reg: (ffffffff8179cae0)
        ffffffff8179cae1 xdp_rxq_info_reg+0x1 (/boot/vmlinux-5.1.0-bpf-next-page-pool+)
        ffffffffa022e6e3 mlx5e_open_rq+0x153 ([mlx5_core])
        ffffffffa0231395 mlx5e_open_channels+0xc25 ([mlx5_core])
        ffffffffa023289a mlx5e_open_locked+0x2a ([mlx5_core])
        ffffffffa0232d8a mlx5e_xdp+0x1da ([mlx5_core])
        ffffffff8176a39c dev_xdp_install+0x3c (/boot/vmlinux-5.1.0-bpf-next-page-pool+)
        ffffffff8177663e dev_change_xdp_fd+0xce (/boot/vmlinux-5.1.0-bpf-next-page-pool+)
        ffffffff81788806 do_setlink+0xcd6 (/boot/vmlinux-5.1.0-bpf-next-page-pool+)
        ffffffff81788a10 rtnl_setlink+0xd0 (/boot/vmlinux-5.1.0-bpf-next-page-pool+)
        ffffffff81786da2 rtnetlink_rcv_msg+0x122 (/boot/vmlinux-5.1.0-bpf-next-page-pool+)
        ffffffff817d4157 netlink_rcv_skb+0x37 (/boot/vmlinux-5.1.0-bpf-next-page-pool+)
        ffffffff817d3a49 netlink_unicast+0x169 (/boot/vmlinux-5.1.0-bpf-next-page-pool+)
        ffffffff817d3d71 netlink_sendmsg+0x291 (/boot/vmlinux-5.1.0-bpf-next-page-pool+)
        ffffffff817519b0 sock_sendmsg+0x30 (/boot/vmlinux-5.1.0-bpf-next-page-pool+)
        ffffffff81752eb8 __sys_sendto+0xe8 (/boot/vmlinux-5.1.0-bpf-next-page-pool+)
        ffffffff81752f04 __x64_sys_sendto+0x24 (/boot/vmlinux-5.1.0-bpf-next-page-pool+)
        ffffffff81002252 do_syscall_64+0x42 (/boot/vmlinux-5.1.0-bpf-next-page-pool+)
        ffffffff81a0008c entry_SYSCALL_64+0x7c (/boot/vmlinux-5.1.0-bpf-next-page-pool+)
            7f4d4090e4ed __send+0x1d (/usr/lib64/libc-2.26.so)
                  402332 main+0x422 (/home/jbrouer/kernel-bpf-samples/xdp_rxq_info)

For testing, I used XDP redirect map command:

sudo ./xdp_redirect_map  $(</sys/class/net/mlx5p1/ifindex) \
                         $(</sys/class/net/ixgbe1/ifindex)

It took a couple of tries. Confirmed: The pr_warn() was triggered, when XDP program was stopped, while having a packet generator running. It might have increased the chance that the ixgbe adaptor was causing resets:

May 21 16:36:14 broadwell kernel: ixgbe 0000:01:00.1 ixgbe2: initiating reset to clear Tx work after link loss
May 21 16:36:14 broadwell kernel: ixgbe 0000:01:00.1 ixgbe2: Reset adapter
May 21 16:36:15 broadwell kernel: ixgbe 0000:01:00.1 ixgbe2: detected SFP+: 4
May 21 16:36:15 broadwell kernel: ixgbe 0000:01:00.1 ixgbe2: initiating reset to clear Tx work after link loss
May 21 16:36:15 broadwell kernel: XXX mlx5e_free_rq()
May 21 16:36:15 broadwell kernel: XXX mlx5e_free_rq()
May 21 16:36:15 broadwell kernel: XXX mlx5e_free_rq()
May 21 16:36:15 broadwell kernel: XXX mlx5e_free_rq()
May 21 16:36:15 broadwell kernel: XXX mlx5e_free_rq()
May 21 16:36:15 broadwell kernel: XXX mlx5e_free_rq()
May 21 16:36:15 broadwell kernel: mlx5_core 0000:03:00.0 mlx5p1: Link up
May 21 16:36:15 broadwell kernel: ixgbe 0000:01:00.1 ixgbe2: Reset adapter
May 21 16:36:15 broadwell kernel: __xdp_return() XXX issue if page_pool(id:207) use DMA
May 21 16:36:15 broadwell kernel: __xdp_return() XXX issue if page_pool(id:207) use DMA
May 21 16:36:15 broadwell kernel: __xdp_return() XXX issue if page_pool(id:207) use DMA
May 21 16:36:15 broadwell kernel: __xdp_return() XXX issue if page_pool(id:207) use DMA
May 21 16:36:15 broadwell kernel: __xdp_return() XXX issue if page_pool(id:207) use DMA
May 21 16:36:15 broadwell kernel: __xdp_return() XXX issue if page_pool(id:207) use DMA
[...]
May 21 16:36:19 broadwell kernel: __xdp_return() XXX issue if page_pool(id:207) use DMA
May 21 16:36:19 broadwell kernel: __xdp_return() XXX issue if page_pool(id:207) use DMA
May 21 16:36:19 broadwell kernel: __xdp_return() XXX issue if page_pool(id:207) use DMA
May 21 16:36:19 broadwell kernel: ixgbe 0000:01:00.1: removed PHC on ixgbe2
May 21 16:36:19 broadwell kernel: ixgbe 0000:01:00.1 ixgbe2: detected SFP+: 4
May 21 16:36:19 broadwell kernel: ixgbe 0000:01:00.1: Multiqueue Enabled: Rx Queue count = 6, Tx Queue count = 6 XDP Queue count = 0
May 21 16:36:19 broadwell kernel: ixgbe 0000:01:00.1: registered PHC device on ixgbe2
May 21 16:36:19 broadwell kernel: ixgbe 0000:01:00.1 ixgbe2: detected SFP+: 4
May 21 16:36:19 broadwell kernel: ixgbe 0000:01:00.1 ixgbe2: NIC Link is Up 10 Gbps, Flow Control: RX/TX

Test work-in-progress patch

After fixing mlx5 to call xdp_rxq_info_unreg(&rq->xdp_rxq) later and let the non-recycle path call page_pool_put_page(). The basics work:

[ 1290.790220] XXX __mem_id_disconnect() id:184
[ 1290.797905] XXX mlx5e_free_rq()
[ 1290.802080] XXX __mem_id_disconnect() id:185
[ 1290.807158] ixgbe 0000:01:00.1 ixgbe2: detected SFP+: 4
[ 1290.813819] ixgbe 0000:01:00.1 ixgbe2: initiating reset to clear Tx work after link loss
[ 1290.814690] XXX mlx5e_free_rq()
[ 1290.827424] XXX __mem_id_disconnect() id:186
[ 1290.832841] XXX __page_pool_safe_to_destroy() inflight:511
[ 1290.839355] XXX xdp_rxq_info_unreg_mem_model() - start page_pool shutdown/destroy id(186)
[ 1290.896866] mlx5_core 0000:03:00.0 mlx5p1: Link up
[ 1291.035127] ixgbe 0000:01:00.1 ixgbe2: Reset adapter
[ 1291.335453] ixgbe 0000:01:00.1 ixgbe2: detected SFP+: 4
[ 1291.587066] ixgbe 0000:01:00.1 ixgbe2: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[ 1291.875054] XXX mem_id_disconnect_defer_retry() id:186 call __mem_id_disconnect again
[ 1291.883661] XXX __mem_id_disconnect() id:186

Update: this bug was caused by CPUMAP redirect not calling page_pool_unmap_page():

[ 7600.046747] XXX mem_id_disconnect_defer_retry() id:121 call __mem_id_disconnect again
[ 7600.064706] XXX __mem_id_disconnect() id:121
[ 7600.070009] XXX page_pool_inflight() inflight:17102512 hold:17103469 released:957
[ 7600.078811] XXX __page_pool_request_shutdown() inflight:17102512
[ 7600.085844] XXX page_pool_inflight() inflight:17102512 hold:17103469 released:957
[ 7600.094627] XXX __page_pool_safe_to_destroy() inflight:17102512
[ 7600.101574] XXX mem_id_disconnect_defer_retry() id:121 call schedule_delayed_work
[ 7601.134756] XXX mem_id_disconnect_defer_retry() id:121 call __mem_id_disconnect again
[ 7601.152716] XXX __mem_id_disconnect() id:121
[ 7601.158024] XXX page_pool_inflight() inflight:17102512 hold:17103469 released:957
[ 7601.166825] XXX __page_pool_request_shutdown() inflight:17102512
[ 7601.173865] XXX page_pool_inflight() inflight:17102512 hold:17103469 released:957
[ 7601.182648] XXX __page_pool_safe_to_destroy() inflight:17102512

Confused mlx5 doesn't fully use ring-size

Summary (TLDR): There was nothing wrong with the in-flight tracking, it was just mlx5 driver that only fills it's ring-buffer with page_pool size minus 64.

The mlx5 driver configures two different ring-sizes depending on if XDP is used or not. For the XDP case ring-size 1024 is used, and non-XDP 512 is used.

When debugging, I was seeing (XDP case) only 960 pages "tracked" on a unused page_pool ring (for non-XDP case 448):

[  370.223589] XXX page_pool_inflight() inflight:960 hold:960 released:0
[  370.231079] XXX __page_pool_request_shutdown() inflight:960

It turned out that the mlx5 driver refill function mlx5e_post_rx_mpwqes() only refill up-to ring-size minus 64.

@@ -624,7 +624,11 @@ bool mlx5e_post_rx_mpwqes(struct mlx5e_rq *rq)
        mlx5e_poll_ico_cq(&sq->cq, rq);

        missing = mlx5_wq_ll_missing(wq) - rq->mpwqe.umr_in_progress;
-
+       pr_warn("XXX DEBUG %s()  missing:%d x64:%d (%ld)\n", __func__,
+               missing, (missing*64), MLX5_MPWRQ_PAGES_PER_WQE); //DEBUG
+// Results first time called:
+// missing:7 x64:448 (64)   <-- non-XDP
+// missing:15 x64:960 (64)  <-- XDP-mode
        if (unlikely(rq->mpwqe.umr_in_progress > rq->mpwqe.umr_last_bulk))
                rq->stats->congst_umr++;

@@ -635,7 +639,7 @@ bool mlx5e_post_rx_mpwqes(struct mlx5e_rq *rq)
        head = rq->mpwqe.actual_wq_head;
        i = missing;
        do {
-               if (unlikely(mlx5e_alloc_rx_mpwqe(rq, head)))
+               if (unlikely(mlx5e_alloc_rx_mpwqe(rq, head))) // bulks 64
                        break;
                head = mlx5_wq_ll_get_wqe_next_ix(wq, head);
        } while (--i);

TCP performance difference with XDP_PASS

Tom Barbette reported that when loading an XDP_PASS program, then TCP flows were slower and cost more CPU, but only for driver mlx5 (not for e.g. i40e). (via video link: https://www.youtube.com/watch?v=o5hlJZbN4Tk&feature=youtu.be)

On my system it was even worse, with an XDP_PASS program loaded, and iperf (6 parallel TCP flows) I would see 100% CPU usage and total 83.3 Gbits/sec. With non-XDP case, I saw 58% CPU (43% idle) and total 89.7 Gbits/sec

This was kind of hard to root-cause, but I solved it by increasing the TCP socket size used by the iperf tool, like this:

$ iperf -s –window 4M


Server listening on TCP port 5001 TCP window size: 416 KByte (WARNING: requested 4.00 MByte)


Given I could reproduce, I took at closer look at perf record/report stats, and it was actually quite clear that this was related to stalling on getting pages from the page allocator (function calls top#6 get_page_from_freelist and free_pcppages_bulk).

Using my tool: ethtool_stats.pl https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl

It was clear that the mlx5 driver page-cache was not working:

Ethtool(mlx5p1  ) stat:     6653761 (   6,653,761) <= rx_cache_busy /sec
Ethtool(mlx5p1  ) stat:     6653732 (   6,653,732) <= rx_cache_full /sec
Ethtool(mlx5p1  ) stat:      669481 (     669,481) <= rx_cache_reuse /sec
Ethtool(mlx5p1  ) stat:           1 (           1) <= rx_congst_umr /sec
Ethtool(mlx5p1  ) stat:     7323230 (   7,323,230) <= rx_csum_unnecessary /sec
Ethtool(mlx5p1  ) stat:        1034 (       1,034) <= rx_discards_phy /sec
Ethtool(mlx5p1  ) stat:     7323230 (   7,323,230) <= rx_packets /sec
Ethtool(mlx5p1  ) stat:     7324244 (   7,324,244) <= rx_packets_phy /sec

While the non-XDP case looked like this:

Ethtool(mlx5p1  ) stat:      298929 (     298,929) <= rx_cache_busy /sec
Ethtool(mlx5p1  ) stat:      298971 (     298,971) <= rx_cache_full /sec
Ethtool(mlx5p1  ) stat:     3548789 (   3,548,789) <= rx_cache_reuse /sec
Ethtool(mlx5p1  ) stat:     7695476 (   7,695,476) <= rx_csum_complete /sec
Ethtool(mlx5p1  ) stat:     7695476 (   7,695,476) <= rx_packets /sec
Ethtool(mlx5p1  ) stat:     7695169 (   7,695,169) <= rx_packets_phy /sec
Manual consistence calc: 7695476-((3548789*2)+(298971*2)) = -44

With the increased TCP window size, the mlx5 driver cache is working better, but not optimally, see below. I'm getting 88.0 Gbits/sec with 68% CPU usage.

Ethtool(mlx5p1  ) stat:      894438 (     894,438) <= rx_cache_busy /sec
Ethtool(mlx5p1  ) stat:      894453 (     894,453) <= rx_cache_full /sec
Ethtool(mlx5p1  ) stat:     6638518 (   6,638,518) <= rx_cache_reuse /sec
Ethtool(mlx5p1  ) stat:           6 (           6) <= rx_congst_umr /sec
Ethtool(mlx5p1  ) stat:     7532983 (   7,532,983) <= rx_csum_unnecessary /sec
Ethtool(mlx5p1  ) stat:         164 (         164) <= rx_discards_phy /sec
Ethtool(mlx5p1  ) stat:     7532983 (   7,532,983) <= rx_packets /sec
Ethtool(mlx5p1  ) stat:     7533193 (   7,533,193) <= rx_packets_phy /sec
Manual consistence calc: 7532983-(6638518+894453) = 12

To understand why this is happening, you first have to know that the difference is between the two RX-memory modes used by mlx5 for non-XDP vs XDP. With non-XDP two frames are stored per memory-page, while for XDP only a single frame per page is used. The packets available in the RX-rings are actually the same, as the ring sizes are non-XDP=512 vs. XDP=1024.

I believe, the real issue is that TCP use the SKB->truesize (based on frame size) for different memory pressure and window calculations, which is why it solved the issue to increase the window size manually.

Case: XDP_PASS and 6 parallel iperf TCP flows:

May 31 15:24:23 broadwell kernel: XXX mlx5e_free_rq()
May 31 15:24:23 broadwell kernel: XXX __mem_id_disconnect() id:49
May 31 15:24:23 broadwell kernel: XXX page_pool_inflight() inflight:1136 hold:10665185 released:10664049
May 31 15:24:23 broadwell kernel: XXX __page_pool_request_shutdown() inflight:1136
May 31 15:24:23 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:10665185 released:10665185
May 31 15:24:23 broadwell kernel: XXX mlx5e_free_rq()
May 31 15:24:23 broadwell kernel: XXX __mem_id_disconnect() id:50
May 31 15:24:23 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:10665185 released:10665185
May 31 15:24:23 broadwell kernel: XXX page_pool_inflight() inflight:1039 hold:10766397 released:10765358
May 31 15:24:23 broadwell kernel: XXX __page_pool_request_shutdown() inflight:1039
May 31 15:24:23 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:10766397 released:10766397
May 31 15:24:23 broadwell kernel: XXX mlx5e_free_rq()
May 31 15:24:23 broadwell kernel: XXX __mem_id_disconnect() id:51
May 31 15:24:23 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:10766397 released:10766397
May 31 15:24:23 broadwell kernel: XXX page_pool_inflight() inflight:1109 hold:10648725 released:10647616
May 31 15:24:23 broadwell kernel: XXX __page_pool_request_shutdown() inflight:1109
May 31 15:24:23 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:10648725 released:10648725
May 31 15:24:23 broadwell kernel: XXX mlx5e_free_rq()
May 31 15:24:23 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:10648725 released:10648725
May 31 15:24:23 broadwell kernel: XXX __mem_id_disconnect() id:52
May 31 15:24:23 broadwell kernel: XXX page_pool_inflight() inflight:1104 hold:10770818 released:10769714
May 31 15:24:23 broadwell kernel: XXX __page_pool_request_shutdown() inflight:1104
May 31 15:24:23 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:10770818 released:10770818
May 31 15:24:24 broadwell kernel: XXX mlx5e_free_rq()
May 31 15:24:24 broadwell kernel: XXX __mem_id_disconnect() id:53
May 31 15:24:24 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:10770818 released:10770818
May 31 15:24:24 broadwell kernel: XXX page_pool_inflight() inflight:1132 hold:10635235 released:10634103
May 31 15:24:24 broadwell kernel: XXX __page_pool_request_shutdown() inflight:1132
May 31 15:24:24 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:10635235 released:10635235

$ nstat -n && sleep 1 && nstat
#kernel
IpInReceives                    221138             0.0
IpInDelivers                    221138             0.0
IpOutRequests                   60230              0.0
TcpInSegs                       221136             0.0
TcpOutSegs                      60230              0.0
TcpExtTCPHPHits                 216175             0.0
TcpExtTCPHPAcks                 1                  0.0
TcpExtTCPBacklogCoalesce        4595               0.0
TcpExtTCPBacklogDrop            1                  0.0
TcpExtTCPRcvCoalesce            4270               0.0
TcpExtTCPOFOQueue               87                 0.0
TcpExtTCPWantZeroWindowAdv      215                0.0
TcpExtTCPOrigDataSent           1                  0.0
TcpExtTCPDelivered              1                  0.0
TcpExtTCPAckCompressed          48                 0.0
IpExtInOctets                   10644770764        0.0
IpExtOutOctets                  3132896            0.0
IpExtInNoECTPkts                7357218            0.0

Case: non-XDP and 6 parallel iperf TCP flows:

May 31 15:24:24 broadwell kernel: XXX mlx5e_alloc_rq() pool_size: 512
May 31 15:24:24 broadwell kernel: XXX mlx5e_alloc_rq() pool_size: 512
May 31 15:24:24 broadwell kernel: mlx5_core 0000:03:00.0 mlx5p1: Link up
[...]
May 31 15:28:58 broadwell kernel: XXX mlx5e_free_rq()
May 31 15:28:58 broadwell kernel: XXX __mem_id_disconnect() id:55
May 31 15:28:58 broadwell kernel: XXX page_pool_inflight() inflight:589 hold:13434 released:12845
May 31 15:28:58 broadwell kernel: XXX __page_pool_request_shutdown() inflight:589
May 31 15:28:58 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:13434 released:13434
May 31 15:28:58 broadwell kernel: XXX mlx5e_free_rq()
May 31 15:28:58 broadwell kernel: XXX __mem_id_disconnect() id:56
May 31 15:28:58 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:13434 released:13434
May 31 15:28:58 broadwell kernel: XXX page_pool_inflight() inflight:514 hold:11891 released:11377
May 31 15:28:58 broadwell kernel: XXX __page_pool_request_shutdown() inflight:514
May 31 15:28:58 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:11891 released:11891
May 31 15:28:58 broadwell kernel: XXX mlx5e_free_rq()
May 31 15:28:58 broadwell kernel: XXX __mem_id_disconnect() id:57
May 31 15:28:58 broadwell kernel: XXX page_pool_inflight() inflight:595 hold:14093 released:13498
May 31 15:28:58 broadwell kernel: XXX __page_pool_request_shutdown() inflight:595
May 31 15:28:58 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:14093 released:14093
May 31 15:28:58 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:11891 released:11891
May 31 15:28:58 broadwell kernel: XXX mlx5e_free_rq()
May 31 15:28:58 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:14093 released:14093
May 31 15:28:58 broadwell kernel: XXX __mem_id_disconnect() id:58
May 31 15:28:58 broadwell kernel: XXX page_pool_inflight() inflight:598 hold:180742 released:180144
May 31 15:28:58 broadwell kernel: XXX __page_pool_request_shutdown() inflight:598
May 31 15:28:58 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:180742 released:180742
May 31 15:28:58 broadwell kernel: XXX mlx5e_free_rq()
May 31 15:28:58 broadwell kernel: XXX __mem_id_disconnect() id:59
May 31 15:28:58 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:180742 released:180742
May 31 15:28:58 broadwell kernel: XXX page_pool_inflight() inflight:608 hold:173294 released:172686
May 31 15:28:58 broadwell kernel: XXX __page_pool_request_shutdown() inflight:608
May 31 15:28:58 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:173294 released:173294

$ nstat -n && sleep 1 && nstat
#kernel
IpInReceives                    286604             0.0
IpInDelivers                    286604             0.0
IpOutRequests                   205151             0.0
TcpInSegs                       286604             0.0
TcpOutSegs                      205147             0.0
TcpExtDelayedACKLost            1                  0.0
TcpExtTCPHPHits                 280054             0.0
TcpExtTCPBacklogCoalesce        5925               0.0
TcpExtTCPDSACKOldSent           1                  0.0
TcpExtTCPRcvCoalesce            5959               0.0
TcpExtTCPWantZeroWindowAdv      2                  0.0
IpExtInOctets                   11276032476        0.0
IpExtOutOctets                  10668956           0.0
IpExtInNoECTPkts                7778473            0.0

CPUMAP redirect not releasing page_pool pages

Update: this bug was caused by CPUMAP redirect not calling page_pool_unmap_page(), should likely have a page_pool_release_page.

When fixing this, we need to extend XDP return API to handle this "release".

Produced shutdown issue via: sudo ./xdp_redirect_cpu –dev mlx5p1 –qsize 64 –cpu 4 –prog xdp_cpu_map0 –sec 3

Netperf TCP_STREAM and meanwhile stop XDP program.

Jun 04 17:52:10 broadwell kernel: XXX __mem_id_disconnect() id:61
Jun 04 17:52:10 broadwell kernel: XXX page_pool_inflight() inflight:8092173 hold:8093040 released:867
Jun 04 17:52:10 broadwell kernel: XXX __page_pool_request_shutdown() inflight:8092173
Jun 04 17:52:10 broadwell kernel: XXX page_pool_inflight() inflight:8092113 hold:8093040 released:927
Jun 04 17:52:10 broadwell kernel: XXX __page_pool_safe_to_destroy() inflight:8092113
Jun 04 17:52:10 broadwell kernel: XXX __mem_id_disconnect() id:62
Jun 04 17:52:10 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:960 released:960
Jun 04 17:52:10 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:960 released:960
Jun 04 17:52:10 broadwell kernel: XXX __mem_id_disconnect() id:63
Jun 04 17:52:10 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:960 released:960
Jun 04 17:52:10 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:960 released:960
Jun 04 17:52:10 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:960 released:960
Jun 04 17:52:10 broadwell kernel: XXX __mem_id_disconnect() id:64
Jun 04 17:52:10 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:960 released:960
Jun 04 17:52:10 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:960 released:960
Jun 04 17:52:10 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:960 released:960
Jun 04 17:52:10 broadwell kernel: XXX __mem_id_disconnect() id:65
Jun 04 17:52:10 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:960 released:960
Jun 04 17:52:10 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:960 released:960
Jun 04 17:52:10 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:960 released:960
Jun 04 17:52:10 broadwell kernel: XXX __mem_id_disconnect() id:66
Jun 04 17:52:10 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:960 released:960
Jun 04 17:52:10 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:960 released:960
Jun 04 17:52:10 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:960 released:960
Jun 04 17:52:10 broadwell kernel: XXX page_pool_inflight() inflight:0 hold:960 released:960
Jun 04 17:52:10 broadwell kernel: mlx5_core 0000:03:00.0 mlx5p1: Link up
Jun 04 17:52:11 broadwell kernel: XXX mem_id_disconnect_defer_retry() id:61 call __mem_id_disconnect again
Jun 04 17:52:11 broadwell kernel: XXX __mem_id_disconnect() id:61
Jun 04 17:52:11 broadwell kernel: XXX page_pool_inflight() inflight:8092113 hold:8093040 released:927
Jun 04 17:52:11 broadwell kernel: XXX __page_pool_request_shutdown() inflight:8092113
Jun 04 17:52:11 broadwell kernel: XXX page_pool_inflight() inflight:8092113 hold:8093040 released:927
Jun 04 17:52:11 broadwell kernel: XXX __page_pool_safe_to_destroy() inflight:8092113
Jun 04 17:52:12 broadwell kernel: XXX mem_id_disconnect_defer_retry() id:61 call schedule_delayed_work
Jun 04 17:52:12 broadwell kernel: XXX mem_id_disconnect_defer_retry() id:61 call __mem_id_disconnect again
Jun 04 17:52:12 broadwell kernel: XXX __mem_id_disconnect() id:61
Jun 04 17:52:12 broadwell kernel: XXX page_pool_inflight() inflight:8092113 hold:8093040 released:927
Jun 04 17:52:12 broadwell kernel: XXX __page_pool_request_shutdown() inflight:8092113
Jun 04 17:52:12 broadwell kernel: XXX page_pool_inflight() inflight:8092113 hold:8093040 released:927
Jun 04 17:52:12 broadwell kernel: XXX __page_pool_safe_to_destroy() inflight:8092113
Jun 04 17:52:13 broadwell kernel: XXX mem_id_disconnect_defer_retry() id:61 call schedule_delayed_work
Jun 04 17:52:13 broadwell kernel: XXX mem_id_disconnect_defer_retry() id:61 call __mem_id_disconnect again
Jun 04 17:52:13 broadwell kernel: XXX __mem_id_disconnect() id:61

Tracepoints

I got inspired by proof-reading Brendan Gregg's book on BPF, specifically the bpftrace tools examples. By adding some tracepoints, we can allow for debugging and getting stats from page_pool, without adding more kernel code.

Playing with bpftrace

Can I attach to several XDP tracepoints, and then extract only the first two struct members, which will be:

TP_STRUCT__entry(
        __field(int, map_id)  or __field(int, prog_id)
        __field(u32, act)

bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

bpftrace -e 'tracepoint:xdp:xdp_redirect_map { @[comm] = count(); }'

List avail: bpftrace -lv tracepoint:xdp:xdp_redirect_map

$ sudo bpftrace -lv tracepoint:xdp:xdp_redirect_map
tracepoint:xdp:xdp_redirect_map
    int prog_id;
    u32 act;
    int ifindex;
    int err;
    int to_ifindex;
    u32 map_id;
    int map_index;
bpftrace -e 'tracepoint:xdp:xdp_redirect_map { @[comm, args->prog_id] = count(); }'

Currently all XDP tracepoints have "act" for XDP action:

$ sudo bpftrace -e 'tracepoint:xdp:xdp_* { @action[args->act] = count(); }'
Attaching 8 probes...
^C

@action[4]: 9965660

Examples: attach to all map XDP tracepoints:

sudo bpftrace -e 'tracepoint:xdp:xdp_*map* { @map_id[comm, args->map_id] = count(); }'
Attaching 5 probes...
^C

@map_id[swapper/2, 113]: 1428
@map_id[swapper/0, 113]: 2085
@map_id[ksoftirqd/4, 113]: 2253491
@map_id[ksoftirqd/2, 113]: 25677560
@map_id[ksoftirqd/0, 113]: 29004338
@map_id[ksoftirqd/3, 113]: 31034885

Could not find a native bpftrace way to exit after N seconds, so I used this shell trick instead:

sudo bpftrace -e \
 'tracepoint:xdp:xdp_devmap_xmit { @[args->map_id] = @[args->map_id] + args->sent; }' &\
PID=$! ; sleep 10 && sudo kill -SIGINT $PID

Output:

$ sudo bpftrace -e \
>  'tracepoint:xdp:xdp_devmap_xmit { @[args->map_id] = @[args->map_id] + args->sent; }' &\
> PID=$! ; sleep 10 && sudo kill -SIGINT $PID
[1] 6029
Attaching 1 probe...
[jbrouer@broadwell ~]$ 

@[173]: 144475692

bpftrace for XDP/page_pool disconnect/shutdown

sudo bpftrace -e 'kprobe:__mem_id_disconnect { @[kstack] = count(); }'

Re-introduce bug in CPUMAP to provoke shutdown/disconnect issue.

$  sudo bpftrace -e  'kprobe:__mem_id_disconnect { @[kstack] = count(); }'
Attaching 1 probe...
^C

@[
    __mem_id_disconnect+1
    xdp_rxq_info_unreg_mem_model+100
    xdp_rxq_info_unreg+30
    mlx5e_free_rq+132
    mlx5e_close_channel+34
    mlx5e_close_channels+38
    mlx5e_close_locked+71
    mlx5e_xdp+412
    dev_xdp_install+60
    dev_change_xdp_fd+206
    do_setlink+3286
    rtnl_setlink+208
    rtnetlink_rcv_msg+290
    netlink_rcv_skb+55
    netlink_unicast+361
    netlink_sendmsg+657
    sock_sendmsg+48
    __sys_sendto+232
    __x64_sys_sendto+36
    do_syscall_64+66
    entry_SYSCALL_64_after_hwframe+68
]: 6
@[
    __mem_id_disconnect+1
    xdp_rxq_info_unreg_mem_model+100
    xdp_rxq_info_unreg+30
    mlx5e_free_rq+132
    mlx5e_close_channel+34
    mlx5e_close_channels+38
    mlx5e_close_locked+71
    mlx5e_xdp+412
    dev_xdp_install+60
    do_setlink+3286
    rtnl_setlink+208
    rtnetlink_rcv_msg+290
    netlink_rcv_skb+55
    netlink_unicast+361
    netlink_sendmsg+657
    sock_sendmsg+48
    __sys_sendto+232
    __x64_sys_sendto+36
    do_syscall_64+66
    entry_SYSCALL_64_after_hwframe+68
]: 6
@[
    __mem_id_disconnect+1
    mem_id_disconnect_defer_retry+27
    process_one_work+390
    worker_thread+48
    kthread+273
    ret_from_fork+31
]: 121

Coded up a force-shutdown mechanism after 120 attempts:

Jun 09 13:03:57 broadwell kernel: mlx5_core 0000:03:00.0 mlx5p1: Link up
Jun 09 13:04:28 broadwell kernel: mem_id_disconnect_defer_retry() stalled mem.id=28 shutdown 31 attempts 30 sec
Jun 09 13:04:58 broadwell kernel: mem_id_disconnect_defer_retry() stalled mem.id=28 shutdown 61 attempts 61 sec
Jun 09 13:05:29 broadwell kernel: mem_id_disconnect_defer_retry() stalled mem.id=28 shutdown 91 attempts 92 sec
Jun 09 13:06:00 broadwell kernel: mem_id_disconnect_defer_retry() stalled mem.id=28 shutdown 121 attempts 122 sec
Jun 09 13:06:01 broadwell kernel: ------------[ cut here ]------------
Jun 09 13:06:01 broadwell kernel: Still in-flight pages:10053168 hold:10054146 released:978
Jun 09 13:06:01 broadwell kernel: WARNING: CPU: 1 PID: 0 at net/core/page_pool.c:349 __page_pool_free+0x78/0xc0
Jun 09 13:06:01 broadwell kernel: Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp tun bridge ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter rpcrdma sunrpc ib_umad rdma_ucm ib_ipoib rdma_cm iw_cm ib_cm mlx5_ib i40iw ib_uverbs ib_core coretemp intel_cstate intel_uncore ftdi_sio intel_rapl_perf pcspkr usbserial i2c_i801 wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad pcc_cpufreq sch_fq_codel ip_tables x_tables igb ixgbe i40e mdio mlx5_core ptp nfp i2c_algo_bit i2c_core pps_core hid_generic
Jun 09 13:06:01 broadwell kernel: CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W         5.2.0-rc1-bpf-next-page-pool+ #171
Jun 09 13:06:01 broadwell kernel: Hardware name: Supermicro Super Server/X10SRi-F, BIOS 2.0a 08/01/2016
Jun 09 13:06:01 broadwell kernel: RIP: 0010:__page_pool_free+0x78/0xc0
Jun 09 13:06:01 broadwell kernel: Code: 4d 85 e4 75 3f 48 89 df e8 65 ff ff ff 84 c0 75 1b 8b 8b 40 05 00 00 48 c7 c7 38 83 16 82 8b 53 20 89 d6 29 ce e8 68 d7 92 ff <0f> 0b 48 8b bb 08 05 00 00 e8 0a db a7 ff 48 89 df 5b 5d 41 5c e9
Jun 09 13:06:01 broadwell kernel: RSP: 0018:ffffc9000003eee8 EFLAGS: 00010296
Jun 09 13:06:01 broadwell kernel: RAX: 0000000000000039 RBX: ffff88882a376000 RCX: 0000000000000000
Jun 09 13:06:01 broadwell kernel: RDX: 0000000000000039 RSI: ffffffff82a64fd9 RDI: ffffffff82a62b6c
Jun 09 13:06:01 broadwell kernel: RBP: ffff88882a3764c8 R08: ffffffff82a64fa0 R09: 0000000000028cc0
Jun 09 13:06:01 broadwell kernel: R10: ffffc9000003ef40 R11: 0000000080000001 R12: 0000000000000000
Jun 09 13:06:01 broadwell kernel: R13: 0000000000000202 R14: 0000000000000009 R15: 0000000000000000
Jun 09 13:06:01 broadwell kernel: FS:  0000000000000000(0000) GS:ffff88885c640000(0000) knlGS:0000000000000000
Jun 09 13:06:01 broadwell kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 09 13:06:01 broadwell kernel: CR2: 00007f54c1822000 CR3: 000000087f20a006 CR4: 00000000003606e0
Jun 09 13:06:01 broadwell kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 09 13:06:01 broadwell kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jun 09 13:06:01 broadwell kernel: Call Trace:
Jun 09 13:06:01 broadwell kernel:  <IRQ>
Jun 09 13:06:01 broadwell kernel:  __xdp_mem_allocator_rcu_free+0x4d/0x50
Jun 09 13:06:01 broadwell kernel:  rcu_core+0x2c4/0x4d0
Jun 09 13:06:01 broadwell kernel:  __do_softirq+0xdd/0x323
Jun 09 13:06:01 broadwell kernel:  irq_exit+0xb6/0xc0
Jun 09 13:06:01 broadwell kernel:  smp_apic_timer_interrupt+0x68/0x150
Jun 09 13:06:01 broadwell kernel:  apic_timer_interrupt+0xf/0x20
Jun 09 13:06:01 broadwell kernel:  </IRQ>
Jun 09 13:06:01 broadwell kernel: RIP: 0010:cpuidle_enter_state+0xad/0x420
Jun 09 13:06:01 broadwell kernel: Code: c4 0f 1f 44 00 00 31 ff e8 30 37 9d ff 80 7c 24 0b 00 74 12 9c 58 f6 c4 02 0f 85 40 03 00 00 31 ff e8 07 4e a2 ff fb 45 85 ed <0f> 88 0e 03 00 00 4c 2b 24 24 49 63 f5 48 ba cf f7 53 e3 a5 9b c4
Jun 09 13:06:01 broadwell kernel: RSP: 0018:ffffc900031ebe98 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
Jun 09 13:06:01 broadwell kernel: RAX: ffff88885c669440 RBX: ffffffff82316920 RCX: 000000000000001f
Jun 09 13:06:01 broadwell kernel: RDX: 0000005e14cbccaf RSI: 00000000238e3c20 RDI: 0000000000000000
Jun 09 13:06:01 broadwell kernel: RBP: ffff88885c672500 R08: 0000000000000002 R09: 0000000000028cc0
Jun 09 13:06:01 broadwell kernel: R10: ffffc900031ebe78 R11: 0000000000000356 R12: 0000005e14cbccaf
Jun 09 13:06:01 broadwell kernel: R13: 0000000000000004 R14: 0000000000000004 R15: 0000000000000000
Jun 09 13:06:01 broadwell kernel:  ? cpuidle_enter_state+0x90/0x420
Jun 09 13:06:01 broadwell kernel:  cpuidle_enter+0x29/0x40
Jun 09 13:06:01 broadwell kernel:  do_idle+0x1a1/0x1e0
Jun 09 13:06:01 broadwell kernel:  cpu_startup_entry+0x19/0x20
Jun 09 13:06:01 broadwell kernel:  start_secondary+0x10f/0x140
Jun 09 13:06:01 broadwell kernel:  secondary_startup_64+0xa4/0xb0
Jun 09 13:06:01 broadwell kernel: ---[ end trace c381db20c024d417 ]---

Adding tracepoints for xdp mem_disconnect

When adding a tracepoint, consider if a kprobe could serve the same purpose.

As shown above we can use kprobes via kprobe:__mem_id_disconnect, but info is lacking info 'safe_to_remove' and 'force' event.

More importantly we want a stable pointer to struct xdp_mem_allocator, just before it's scheduled for RCU removal.

Adding tracepoint xdp:mem_disconnect

Base tests for it tracepoint exists:

perf record -e xdp:mem_disconnect
sudo bpftrace -e 'tracepoint:xdp:mem_disconnect { @[comm] = count(); }'

Question: can we extract xa->allocator info ?

First need to mount kernel source:

sshfs -o allow_other 192.168.42.3:git/kernel/bpf-next ~/git/kernel/bpf-next

sudo bpftrace -e 'tracepoint:xdp:mem_disconnect { @[comm] = count(); }'

More advanced fails:

sudo bpftrace -e 'tracepoint:xdp:mem_disconnect { @[comm] = args->xa->mem.id; }'
Unknown struct/union: 'const struct xdp_mem_allocator'

sudo bpftrace -e 'tracepoint:xdp:mem_disconnect { $a = args->xa; }'

If we manually define struct xdp_mem_allocator (and struct xdp_mem_info) inside the bpftrace script, and manually type-cast, then it works. Behind the scenes bpftrace will do bpf_probe_read to extract the info from kernel memory.

#!/usr/bin/bpftrace
#include <linux/types.h>

struct xdp_mem_info {
        u32 type; /* enum xdp_mem_type, but known size type */
        u32 id;
};

struct xdp_mem_allocator {
        struct xdp_mem_info mem;
        union {
                void *allocator;
                struct page_pool *page_pool;
                struct zero_copy_allocator *zc_alloc;
        };
        int disconnect_cnt;
        unsigned long defer_start;
        //struct rhash_head node;
        //struct rcu_head rcu;
        //struct delayed_work defer_wq;
        //unsigned long defer_warn;
};

tracepoint:xdp:mem_disconnect {
        $xa = (struct xdp_mem_allocator *)args->xa;
        $cntA = args->disconnect_cnt;
        //$cntB = $xa->disconnect_cnt;

        /* Extract mem.id in two different ways */
        $idA = args->mem_id;
        $idB = $xa->mem.id;

        if ($cntA > 1) {
                printf("%s: mem.id=%d re-scheduled", probe, $idA);
        }

        if (args->force == 1) {
                printf("%s: mem.id=%d FORCE shutdown", probe, $idA);
        }

        $page_pool = args->allocator;
        // $page_pool = $xa->allocator;
        @memid_to_page_pool[$idA] = $page_pool;

        time();
        printf("%s: mem.id-A:%d B:%d 0x%lX\n", probe, $idA, $idB, $page_pool);
}

Adding tracepoint for xdp mem_connect

Also add a tracepoint called xdp:mem_connect to pickup ifindex.

Patch desc for adding xdp tracepoints

xdp: add tracepoints for XDP mem

These tracepoints make it easier to troubleshoot XDP mem id disconnect and deferred shutdown procedure.

The xdp:mem_disconnect tracepoint cannot be replaced via kprobe. It is placed at the last stable place for the pointer to struct xdp_mem_allocator, just before it's scheduled for RCU removal. It also extract info on 'safe_to_remove' and 'force'.

Detailed info about in-flight pages is not available at this layer. The next patch will added tracepoints needed at the page_pool layer for this.

Adding tracepoints for page_pool

Patch desc: page_pool: add tracepoints for page_pool details need by XDP

The xdp tracepoints for mem id disconnect don't carry information about, why it was not safe_to_remove. The tracepoint page_pool:page_pool_inflight in this patch can be used for extract this info for further debugging.

This patchset also adds tracepoint for the pages_state_* release/hold transitions, including a pointer to the page. This can be used for stats about in-flight pages, or used to debug page leakage via keeping track of page pointer and combining this with kprobe for __put_page().

bpftrace notes

Compile notes

I want build, but install in usr/local/stow

mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/usr/local/stow/bpftrace-git01 ..
make -j 4
make install
-- Installing: /usr/local/stow/bpftrace-git01/bin/bpftrace

Use stow to enable/disable using this version:

cd /usr/local/stow/
sudo stow    bpftrace-git01/
sudo stow -D bpftrace-git01/

Date: 2021-09-20 Mon 18:33

Validate