
dm-pcache – persistent-memory cache for block devices #35

Closed
blktests-ci[bot] wants to merge 11 commits into for-next_base from
series/975154=>for-next

Conversation


blktests-ci Bot commented Jul 10, 2025

Pull request for series with
subject: dm-pcache – persistent-memory cache for block devices
version: 2
url: https://patchwork.kernel.org/project/linux-block/list/?series=979565

Consolidate common PCACHE helpers into a new header so that subsequent
patches can include them without repeating boiler-plate.

- Logging macros with unified prefix and location info.
- Common constants (KB/MB helpers, metadata replica count, CRC seed).
- On-disk metadata header definition and CRC helper.
- Sequence-number comparison that handles wrap-around.
- pcache_meta_find_latest() to pick the newest valid metadata copy.

Signed-off-by: Dongsheng Yang <[email protected]>
This patch introduces *backing_dev.{c,h}*, a self-contained layer that
handles all interaction with the *backing block device* where cache
write-back and cache-miss reads are serviced.  Isolating this logic
keeps the core dm-pcache code free of low-level bio plumbing.

* Device setup / teardown
  - Opens the target with `dm_get_device()`, stores `bdev`, file and
    size, and initialises a dedicated `bioset`.
  - Gracefully releases resources via `backing_dev_stop()`.

* Request object (`struct pcache_backing_dev_req`)
  - Two request flavours:
    - REQ-type – cloned from an upper `struct bio` issued to
      dm-pcache; trimmed and re-targeted to the backing LBA.
    - KMEM-type – maps an arbitrary kernel memory buffer
      into a freshly built bio.
  - Private completion callback (`end_req`) propagates status to the
    upper layer and handles resource recycling.

* Submission & completion path
  - A lock-protected submit queue plus worker (`req_submit_work`) lets
    pcache push many requests asynchronously while still allowing
    callers to submit a backing_dev_req from atomic context.
  - End-io handler moves finished requests to a completion list processed
    by `req_complete_work`, ensuring callbacks run in process context.
  - Direct-submit option for non-atomic context.

* Flush
  - `backing_dev_flush()` issues a flush to persist backing-device data.

Signed-off-by: Dongsheng Yang <[email protected]>
Add cache_dev.{c,h} to manage the persistent-memory device that stores
all pcache metadata and data segments.  Splitting this logic out keeps
the main dm-pcache code focused on policy while cache_dev handles the
low-level interaction with the DAX block device.

* DAX mapping
  - Opens the underlying device via dm_get_device().
  - Uses dax_direct_access() to obtain a direct linear mapping; falls
    back to vmap() when the range is fragmented.

* On-disk layout
  ┌─ 4 KB ─┐  super-block (SB)
  ├─ 4 KB ─┤  cache_info[0]
  ├─ 4 KB ─┤  cache_info[1]
  ├─ 4 KB ─┤  cache_ctrl
  └─ ...  ─┘  segments
  Constants and macros in the header expose offsets and sizes.

* Super-block handling
  - sb_read(), sb_validate(), sb_init() verify magic, CRC32 and host
    endianness (flag *PCACHE_SB_F_BIGENDIAN*).
  - Formatting zeroes the metadata replicas and initialises the segment
    bitmap when the SB is blank.

* Segment allocator
  - Bitmap protected by seg_lock; find_next_zero_bit() yields the next
    free 16 MB segment.

* Lifecycle helpers
  - cache_dev_start()/stop() encapsulate init/exit and are invoked by
    dm-pcache core.
  - Gracefully handles errors: CRC mismatch, wrong endianness, device
    too small (< 512 MB), or failed DAX mapping.

Signed-off-by: Dongsheng Yang <[email protected]>
Introduce segment.{c,h}, an internal abstraction that encapsulates
everything related to a single pcache *segment* (the fixed-size
allocation unit stored on the cache-device).

* On-disk metadata (`struct pcache_segment_info`)
  - Embedded `struct pcache_meta_header` for CRC/sequence handling.
  - `flags` field encodes a “has-next” bit and a 4-bit *type* class
    (`CACHE_DATA` added as the first type).

* Initialisation
  - `pcache_segment_init()` populates the in-memory
    `struct pcache_segment` from a given segment id, data offset and
    metadata pointer, computing the usable `data_size` and virtual
    address within the DAX mapping.

* IO helpers
  - `segment_copy_to_bio()` / `segment_copy_from_bio()` move data
    between pmem and a bio, using `_copy_mc_to_iter()` and
    `_copy_from_iter_flushcache()` to tolerate hw memory errors and
    ensure durability.
  - `segment_pos_advance()` advances an internal offset while staying
    inside the segment’s data area.

These helpers allow upper layers (cache key management, write-back
logic, GC, etc.) to treat a segment as a contiguous byte array without
knowing about DAX mappings or persistence details.

Signed-off-by: Dongsheng Yang <[email protected]>
Introduce *cache_segment.c*, the in-memory/on-disk glue that lets a
`struct pcache_cache` manage its array of data segments.

* Metadata handling
  - Loads the most-recent replica of both the segment-info block
    (`struct pcache_segment_info`) and per-segment generation counter
    (`struct pcache_cache_seg_gen`) using `pcache_meta_find_latest()`.
  - Updates those structures atomically with CRC + sequence rollover,
    writing alternately to the two metadata slots inside each segment.

* Segment initialisation (`cache_seg_init`)
  - Builds a `struct pcache_segment` pointing to the segment’s data
    area, sets up locks, generation counters, and, when formatting a new
    cache, zeroes the on-segment kset header.

* Linked-list of segments
  - `cache_seg_set_next_seg()` stores the *next* segment id in
    `seg_info->next_seg` and sets the HAS_NEXT flag, allowing a cache to
    span multiple segments. This is important for allowing other segment
    types to be added in the future.

* Runtime life-cycle
  - Reference counting (`cache_seg_get/put`) with invalidate-on-last-put
    that clears the bitmap slot and schedules cleanup work.
  - Generation bump (`cache_seg_gen_increase`) persists a new generation
    record whenever the segment is modified.

* Allocator
  - `get_cache_segment()` uses a bitmap and per-cache hint to pick the
    next free segment, retrying with micro-delays when none are
    immediately available.

Signed-off-by: Dongsheng Yang <[email protected]>
Introduce cache_writeback.c, which implements the asynchronous write-back
path for pcache.  The new file is responsible for detecting dirty data,
organising it into an in-memory tree, issuing bios to the backing block
device, and advancing the cache’s *dirty tail* pointer once data has
been safely persisted.

* Dirty-state detection
  - `__is_cache_clean()` reads the kset header at `dirty_tail`, checks
    magic and CRC, and thus decides whether there is anything to flush.

* Write-back scheduler
  - `cache_writeback_work` is queued on the cache task-workqueue and
    re-arms itself at `PCACHE_CACHE_WRITEBACK_INTERVAL`.
  - Uses an internal spin-protected `writeback_key_tree` to batch keys
    belonging to the same stripe before IO.

* Key processing
  - `cache_kset_insert_tree()` decodes each key inside the on-media
    kset, allocates an in-memory key object, and inserts it into the
    writeback_key_tree.
  - `cache_key_writeback()` builds a *KMEM-type* backing request that
    maps the persistent-memory range directly into a WRITE bio and
    submits it with `submit_bio_noacct()`.
  - After all keys from the writeback_key_tree have been flushed,
    `backing_dev_flush()` issues a single FLUSH to ensure durability.

* Tail advancement
  - Once a kset is written back, `cache_pos_advance()` moves
    `cache->dirty_tail` by the exact on-disk size and the new position is
    persisted via `cache_encode_dirty_tail()`.
  - When the `PCACHE_KSET_FLAGS_LAST` flag is seen, the write-back
    engine switches to the next segment indicated by `next_cache_seg_id`.

Signed-off-by: Dongsheng Yang <[email protected]>
Introduce cache_gc.c, a self-contained engine that reclaims cache
segments whose data have already been flushed to the backing device.
Running in the cache workqueue, the GC keeps segment usage below the
user-configurable *cache_gc_percent* threshold.

* need_gc() – decides when to trigger GC by checking:
  - *dirty_tail* vs *key_tail* position,
  - kset integrity (magic + CRC),
  - bitmap utilisation against the gc-percent threshold.

* Per-key reclamation
  - Decodes each key in the target kset (`cache_key_decode()`).
  - Drops the segment reference with `cache_seg_put()`, allowing the
    segment to be invalidated once all keys are gone.
  - When the reference count hits zero the segment is cleared from
    `seg_map`, making it immediately reusable by the allocator.

* Scheduling
  - `pcache_cache_gc_fn()` loops until no more work is needed, then
    re-queues itself after *PCACHE_CACHE_GC_INTERVAL*.

Signed-off-by: Dongsheng Yang <[email protected]>
Add *cache_key.c* which becomes the heart of dm-pcache’s
in-memory index and on-media key-set (“kset”) format.

* Key objects (`struct pcache_cache_key`)
  - Slab-backed allocator & ref-count helpers
  - `cache_key_encode()/decode()` translate between in-memory keys and
    their on-disk representation, validating CRC when
    *cache_data_crc* is enabled.

* Kset construction & persistence
  - Per-kset buffer lives in `struct pcache_cache_kset`; keys are
    appended until full or *force_close* triggers an immediate flush.
  - `cache_kset_close()` writes the kset to the *key_head* segment,
    automatically chaining a *LAST* kset header when rolling over to a
    freshly allocated segment.

* Red-black tree with striping
  - Cache space is divided into *subtrees* to reduce lock
    contention; each subtree owns its own RB-root + spinlock.
  - Complex overlap-resolution logic (`cache_insert_fixup()`) ensures
    newly inserted keys never leave overlapping stale ranges behind
    (head/tail/contain/contained cases handled).

* Replay on start-up
  - `cache_replay()` walks from *key_tail* to *key_head*, re-hydrates
    keys, validates CRC/magic, and skips placeholder “empty” keys left
    by read-misses.

* Background maintenance
  - `clean_work` lazily prunes invalidated keys after GC.
  - `kset_flush_work` closes a kset from a background worker.

With this patch dm-pcache can persistently track cached extents, rebuild
its index after crash, and guarantee non-overlapping key space – paving
the way for functional read/write caching.

Signed-off-by: Dongsheng Yang <[email protected]>
Introduce cache_req.c, the high-level engine that
drives I/O requests through dm-pcache. It decides whether data is served
from the cache or fetched from the backing device, allocates new cache
space on writes, and flushes dirty ksets when required.

* Read path
  - Traverses the striped RB-trees to locate cached extents.
  - Generates backing READ requests for gaps and inserts placeholder
    “empty” keys to avoid duplicate fetches.
  - Copies valid data directly from pmem into the caller’s bio; CRC and
    generation checks guard against stale segments.

* Write path
  - Allocates space in the current data segment via cache_data_alloc().
  - Copies data from the bio into pmem, then inserts or updates keys,
    splitting or trimming overlapped ranges as needed.
  - Adds each new key to the active kset; forces kset close when FUA is
    requested or the kset is full.

* Miss handling
  - create_cache_miss_req() builds a backing READ, optionally attaching
    an empty key.
  - miss_read_end_req() replaces the placeholder with real data once the
    READ completes, or deletes it on error.

* Flush support
  - cache_flush() iterates over all ksets and forces them to close,
    ensuring data durability when REQ_PREFLUSH is received.

Signed-off-by: Dongsheng Yang <[email protected]>
Add cache.c and cache.h that introduce the top-level
“struct pcache_cache”. This object glues together the backing block
device, the persistent-memory cache device, segment array, RB-tree
indexes, and the background workers for write-back and garbage
collection.

* Persistent metadata
  - pcache_cache_info tracks options such as cache mode, data-crc flag
    and GC threshold, written atomically with CRC+sequence.
  - key_tail and dirty_tail positions are double-buffered and recovered
    at mount time.

* Segment management
  - kvcalloc()’d array of pcache_cache_segment objects, bitmap for fast
    allocation, refcounts and generation numbers so GC can invalidate
    old extents safely.
  - First segment hosts a pcache_cache_ctrl block shared by all
    threads.

* Request path hooks
  - pcache_cache_handle_req() dispatches READ, WRITE and FLUSH bios to
    the engines added in earlier patches.
  - Per-CPU data_heads support lock-free allocation of space for new
    writes.

* Background workers
  - Delayed work items for write-back (5 s) and GC (5 s).
  - clean_work removes stale keys after segments are reclaimed.

* Lifecycle helpers
  - pcache_cache_start()/stop() bring the cache online, replay keys,
    start workers, and flush everything on shutdown.

With this piece in place dm-pcache has a fully initialised cache object
capable of serving I/O and maintaining its on-disk structures.

Signed-off-by: Dongsheng Yang <[email protected]>
Add the top-level integration pieces that make the new persistent-memory
cache target usable from device-mapper:

* Documentation
  - `Documentation/admin-guide/device-mapper/dm-pcache.rst` explains the
    design, table syntax, status fields and runtime messages.

* Core target implementation
  - `dm_pcache.c` and `dm_pcache.h` register the `"pcache"` DM target,
    parse constructor arguments, create workqueues, and forward bios to
    the cache core added in earlier patches.
  - Supports flush/FUA, status reporting, and a “gc_percent” message.
  - Discard is not currently supported.
  - Table reload for a live target is not currently supported.

* Device-mapper tables now accept lines like
    pcache <pmem_dev> <backing_dev> writeback <true|false>

Signed-off-by: Dongsheng Yang <[email protected]>
Author

blktests-ci Bot commented Jul 10, 2025

Upstream branch: f4ca523
series: https://patchwork.kernel.org/project/linux-block/list/?series=979565
version: 2

blktests-ci Bot closed this Jul 10, 2025
blktests-ci Bot deleted the series/975154=>for-next branch July 23, 2025 02:12
blktests-ci Bot pushed a commit that referenced this pull request Feb 6, 2026
Commit 1767bb2 ("ipv6: mcast: Don't hold RTNL for
IPV6_ADD_MEMBERSHIP and MCAST_JOIN_GROUP.") removed the RTNL lock for
IPV6_ADD_MEMBERSHIP and MCAST_JOIN_GROUP operations. However, this
change triggered the following call trace on my BeagleBone Black board:
  WARNING: net/8021q/vlan_core.c:236 at vlan_for_each+0x120/0x124, CPU#0: rpcbind/481
  RTNL: assertion failed at net/8021q/vlan_core.c (236)
  Modules linked in:
  CPU: 0 UID: 997 PID: 481 Comm: rpcbind Not tainted 6.19.0-rc7-next-20260130-yocto-standard+ #35 PREEMPT
  Hardware name: Generic AM33XX (Flattened Device Tree)
  Call trace:
   unwind_backtrace from show_stack+0x28/0x2c
   show_stack from dump_stack_lvl+0x30/0x38
   dump_stack_lvl from __warn+0xb8/0x11c
   __warn from warn_slowpath_fmt+0x130/0x194
   warn_slowpath_fmt from vlan_for_each+0x120/0x124
   vlan_for_each from cpsw_add_mc_addr+0x54/0x98
   cpsw_add_mc_addr from __hw_addr_ref_sync_dev+0xc4/0xec
   __hw_addr_ref_sync_dev from __dev_mc_add+0x78/0x88
   __dev_mc_add from igmp6_group_added+0x84/0xec
   igmp6_group_added from __ipv6_dev_mc_inc+0x1fc/0x2f0
   __ipv6_dev_mc_inc from __ipv6_sock_mc_join+0x124/0x1b4
   __ipv6_sock_mc_join from do_ipv6_setsockopt+0x84c/0x1168
   do_ipv6_setsockopt from ipv6_setsockopt+0x88/0xc8
   ipv6_setsockopt from do_sock_setsockopt+0xe8/0x19c
   do_sock_setsockopt from __sys_setsockopt+0x84/0xac
   __sys_setsockopt from ret_fast_syscall+0x0/0x54

This trace occurs because vlan_for_each() is called within
cpsw_ndo_set_rx_mode(), which expects the RTNL lock to be held.
Since modifying vlan_for_each() to operate without the RTNL lock is not
straightforward, and because ndo_set_rx_mode() is invoked both with and
without the RTNL lock across different code paths, simply adding
rtnl_lock() in cpsw_ndo_set_rx_mode() is not a viable solution.

To resolve this issue, we opt to execute the actual processing within
a work queue, following the approach used by the icssg-prueth driver.

Please note: To reproduce this issue, I manually reverted the changes to
am335x-bone-common.dtsi from commit c477358 ("ARM: dts: am335x-bone:
switch to new cpsw switch drv") in order to revert to the legacy cpsw
driver.

Fixes: 1767bb2 ("ipv6: mcast: Don't hold RTNL for IPV6_ADD_MEMBERSHIP and MCAST_JOIN_GROUP.")
Signed-off-by: Kevin Hao <[email protected]>
Cc: [email protected]
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
blktests-ci Bot pushed a commit that referenced this pull request Mar 18, 2026
Commit a75b2be ("iommu: Add iommu_driver_get_domain_for_dev()
helper") introduced iommu_driver_get_domain_for_dev() for driver
code paths that hold iommu_group->mutex while attaching a device
to an IOMMU domain.

The same commit also added a lockdep assertion in
iommu_get_domain_for_dev() to ensure that callers do not hold
iommu_group->mutex when invoking it.

On powerpc platforms, when PCI device ownership is switched from
BLOCKED to the PLATFORM domain, the attach callback
spapr_tce_platform_iommu_attach_dev() still calls
iommu_get_domain_for_dev(). This happens while iommu_group->mutex
is held during domain switching, which triggers the lockdep warning
below during PCI enumeration:

WARNING: drivers/iommu/iommu.c:2252 at iommu_get_domain_for_dev+0x38/0x80, CPU#2: swapper/0/1
Modules linked in:
CPU: 2 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.0.0-rc2+ #35 PREEMPT
Hardware name: IBM,9105-22A Power11 (architected) 0x820200 0xf000007 of:IBM,FW1120.00 (RB1120_115) hv:phyp pSeries
NIP:  c000000000c244c4 LR: c00000000005b5a4 CTR: c00000000005b578
REGS: c00000000a7bf280 TRAP: 0700   Not tainted  (7.0.0-rc2+)
MSR:  8000000002029033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 22004422  XER: 0000000a
CFAR: c000000000c24508 IRQMASK: 0
GPR00: c00000000005b5a4 c00000000a7bf520 c000000001dc8100 0000000000000001
GPR04: c00000000f972f10 0000000000000000 0000000000000000 0000000000000001
GPR08: 0000001ffbc60000 0000000000000001 0000000000000000 0000000000000000
GPR12: c00000000005b578 c000001fffffe480 c000000000011618 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: ffffffffffffefff 0000000000000000 c000000002d30eb0 0000000000000001
GPR24: c0000000017881f8 0000000000000000 0000000000000001 c00000000f972e00
GPR28: c00000000bbba0d0 0000000000000000 c00000000bbba0d0 c00000000f972e00
NIP [c000000000c244c4] iommu_get_domain_for_dev+0x38/0x80
LR [c00000000005b5a4] spapr_tce_platform_iommu_attach_dev+0x2c/0x98
Call Trace:
 iommu_get_domain_for_dev+0x68/0x80 (unreliable)
 spapr_tce_platform_iommu_attach_dev+0x2c/0x98
 __iommu_attach_device+0x44/0x220
 __iommu_device_set_domain+0xf4/0x194
 __iommu_group_set_domain_internal+0xec/0x228
 iommu_setup_default_domain+0x5f4/0x6a4
 __iommu_probe_device+0x674/0x724
 iommu_probe_device+0x50/0xb4
 iommu_add_device+0x48/0x198
 pci_dma_dev_setup_pSeriesLP+0x198/0x4f0
 pcibios_bus_add_device+0x80/0x464
 pci_bus_add_device+0x40/0x100
 pci_bus_add_devices+0x54/0xb0
 pcibios_init+0xd8/0x140
 do_one_initcall+0x8c/0x598
 kernel_init_freeable+0x3ec/0x850
 kernel_init+0x34/0x270
 ret_from_kernel_user_thread+0x14/0x1c

Fix this by using iommu_driver_get_domain_for_dev() instead of
iommu_get_domain_for_dev() in spapr_tce_platform_iommu_attach_dev(),
which is the appropriate helper for callers holding the group mutex.

Cc: [email protected]
Fixes: a75b2be ("iommu: Add iommu_driver_get_domain_for_dev() helper")
Closes: https://patchwork.ozlabs.org/project/linuxppc-dev/patch/[email protected]/
Signed-off-by: Nilay Shroff <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Tested-by: Venkat Rao Bagalkote <[email protected]>
[Maddy: Added Closes, tested and reviewed by tags]
Signed-off-by: Madhavan Srinivasan <[email protected]>
Link: https://patch.msgid.link/[email protected]
