dm-pcache – persistent-memory cache for block devices #35
Closed
blktests-ci[bot] wants to merge 11 commits into for-next_base
Consolidate common PCACHE helpers into a new header so that subsequent
patches can include them without repeating boilerplate.
- Logging macros with unified prefix and location info.
- Common constants (KB/MB helpers, metadata replica count, CRC seed).
- On-disk metadata header definition and CRC helper.
- Sequence-number comparison that handles wrap-around.
- pcache_meta_find_latest() to pick the newest valid metadata copy.
Signed-off-by: Dongsheng Yang <[email protected]>
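The wrap-around comparison and the "newest valid copy" selection described above can be sketched in userspace C as follows. This is only an illustration under stated assumptions: `struct meta_copy`, `seq_after()` and `meta_find_latest()` are hypothetical names, the 8-bit sequence width is assumed, and the `valid` flag stands in for a real CRC check. It is not the patch's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one metadata replica. */
struct meta_copy {
	uint8_t seq;	/* sequence number; wraps from 255 back to 0 (assumed width) */
	bool valid;	/* stand-in for "CRC check passed" */
};

/*
 * True when a is newer than b. The difference is reduced modulo 256 and
 * read as a signed 8-bit value (two's complement assumed), so seq 0 counts
 * as newer than seq 255.
 */
static bool seq_after(uint8_t a, uint8_t b)
{
	return (int8_t)(uint8_t)(a - b) > 0;
}

/* Return the valid replica with the newest sequence number, or NULL. */
static const struct meta_copy *meta_find_latest(const struct meta_copy *copies,
						size_t n)
{
	const struct meta_copy *latest = NULL;

	assert(copies != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!copies[i].valid)
			continue;	/* skip replicas that failed validation */
		if (!latest || seq_after(copies[i].seq, latest->seq))
			latest = &copies[i];
	}
	return latest;
}
```

With two replicas holding sequence numbers 255 and 0, the sketch picks the one with sequence 0, which is the behaviour a plain `>` comparison would get wrong.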
This patch introduces *backing_dev.{c,h}*, a self-contained layer that
handles all interaction with the *backing block device* where cache
write-back and cache-miss reads are serviced. Isolating this logic
keeps the core dm-pcache code free of low-level bio plumbing.
* Device setup / teardown
- Opens the target with `dm_get_device()`, stores `bdev`, file and
size, and initialises a dedicated `bioset`.
- Gracefully releases resources via `backing_dev_stop()`.
* Request object (`struct pcache_backing_dev_req`)
- Two request flavours:
- REQ-type – cloned from an upper `struct bio` issued to
dm-pcache; trimmed and re-targeted to the backing LBA.
- KMEM-type – maps an arbitrary kernel memory buffer
into a freshly built bio.
- Private completion callback (`end_req`) propagates status to the
upper layer and handles resource recycling.
* Submission & completion path
- Lock-protected submit queue + worker (`req_submit_work`) let pcache
push many requests asynchronously while also allowing callers
to submit a backing_dev_req from atomic context.
- End-io handler moves finished requests to a completion list processed
by `req_complete_work`, ensuring callbacks run in process context.
- Direct-submit option for non-atomic context.
* Flush
- `backing_dev_flush()` issues a flush to persist backing-device data.
Signed-off-by: Dongsheng Yang <[email protected]>
Add cache_dev.{c,h} to manage the persistent-memory device that stores
all pcache metadata and data segments. Splitting this logic out keeps
the main dm-pcache code focused on policy while cache_dev handles the
low-level interaction with the DAX block device.
* DAX mapping
- Opens the underlying device via dm_get_device().
- Uses dax_direct_access() to obtain a direct linear mapping; falls
back to vmap() when the range is fragmented.
* On-disk layout
┌─ 4 KB ─┐ super-block (SB)
├─ 4 KB ─┤ cache_info[0]
├─ 4 KB ─┤ cache_info[1]
├─ 4 KB ─┤ cache_ctrl
└─ ... ─┘ segments
Constants and macros in the header expose offsets and sizes.
* Super-block handling
- sb_read(), sb_validate(), sb_init() verify magic, CRC32 and host
endianness (flag *PCACHE_SB_F_BIGENDIAN*).
- Formatting zeroes the metadata replicas and initialises the segment
bitmap when the SB is blank.
* Segment allocator
- Bitmap protected by seg_lock; find_next_zero_bit() yields the next
free 16 MB segment.
* Lifecycle helpers
- cache_dev_start()/stop() encapsulate init/exit and are invoked by
dm-pcache core.
- Gracefully handles errors: CRC mismatch, wrong endianness, device
too small (< 512 MB), or failed DAX mapping.
Signed-off-by: Dongsheng Yang <[email protected]>
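The layout diagram above implies some simple offset arithmetic, sketched below. All macro and function names here are hypothetical, and the numbers are taken only from this description (4 KB metadata blocks, 16 MB segments, 512 MB minimum device size); the real header may define them differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants derived from the layout diagram. */
#define META_BLOCK_SIZE		4096ull
#define SB_OFF			0ull
#define CACHE_INFO_OFF(i)	(META_BLOCK_SIZE * (1ull + (i)))	/* i = 0 or 1 */
#define CACHE_CTRL_OFF		(META_BLOCK_SIZE * 3ull)
#define SEGMENTS_OFF		(META_BLOCK_SIZE * 4ull)
#define SEG_SIZE		(16ull << 20)	/* 16 MB per segment */
#define MIN_DEV_SIZE		(512ull << 20)	/* smaller devices are rejected */

/* Number of whole segments that fit on the device; 0 if it is too small. */
static uint64_t seg_count(uint64_t dev_size)
{
	if (dev_size < MIN_DEV_SIZE)
		return 0;
	return (dev_size - SEGMENTS_OFF) / SEG_SIZE;
}

/* Byte offset of a segment from the start of the cache device. */
static uint64_t seg_offset(uint64_t seg_id, uint64_t nr_segs)
{
	assert(seg_id < nr_segs);
	return SEGMENTS_OFF + seg_id * SEG_SIZE;
}
```

Under these assumptions a device of exactly 512 MB yields 31 segments rather than 32, because the 16 KB of leading metadata leaves the last 16 MB slot incomplete.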
Introduce segment.{c,h}, an internal abstraction that encapsulates
everything related to a single pcache *segment* (the fixed-size
allocation unit stored on the cache-device).
* On-disk metadata (`struct pcache_segment_info`)
- Embedded `struct pcache_meta_header` for CRC/sequence handling.
- `flags` field encodes a “has-next” bit and a 4-bit *type* class
(`CACHE_DATA` added as the first type).
* Initialisation
- `pcache_segment_init()` populates the in-memory
`struct pcache_segment` from a given segment id, data offset and
metadata pointer, computing the usable `data_size` and virtual
address within the DAX mapping.
* IO helpers
- `segment_copy_to_bio()` / `segment_copy_from_bio()` move data
between pmem and a bio, using `_copy_mc_to_iter()` and
`_copy_from_iter_flushcache()` to tolerate hw memory errors and
ensure durability.
- `segment_pos_advance()` advances an internal offset while staying
inside the segment’s data area.
These helpers allow upper layers (cache key management, write-back
logic, GC, etc.) to treat a segment as a contiguous byte array without
knowing about DAX mappings or persistence details.
Signed-off-by: Dongsheng Yang <[email protected]>
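As a rough illustration of the `flags` encoding and of `segment_pos_advance()`, the sketch below packs a "has-next" bit and a 4-bit type into one word and advances an offset inside a bounded data area. The bit positions, the numeric value of the data type, and every identifier are assumptions made for the example; the description above does not state them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout: bit 0 = has-next, bits 1..4 = segment type. */
#define SEG_F_HAS_NEXT		(1u << 0)
#define SEG_TYPE_SHIFT		1
#define SEG_TYPE_MASK		(0xFu << SEG_TYPE_SHIFT)
#define SEG_TYPE_CACHE_DATA	1u	/* hypothetical value for CACHE_DATA */

/* Replace the 4-bit type field, leaving all other flag bits untouched. */
static uint32_t seg_flags_set_type(uint32_t flags, uint32_t type)
{
	assert(type <= 0xFu);
	return (flags & ~SEG_TYPE_MASK) | (type << SEG_TYPE_SHIFT);
}

static uint32_t seg_flags_type(uint32_t flags)
{
	return (flags & SEG_TYPE_MASK) >> SEG_TYPE_SHIFT;
}

static bool seg_flags_has_next(uint32_t flags)
{
	return (flags & SEG_F_HAS_NEXT) != 0;
}

/* Hypothetical position inside one segment's data area. */
struct seg_pos {
	uint32_t data_size;	/* usable bytes in the segment */
	uint32_t off;		/* current offset, always <= data_size */
};

/* Advance the offset; the caller must not step past the data area. */
static void seg_pos_advance(struct seg_pos *pos, uint32_t len)
{
	assert(pos->off <= pos->data_size);
	assert(len <= pos->data_size - pos->off);
	pos->off += len;
}
```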
Introduce *cache_segment.c*, the in-memory/on-disk glue that lets a
`struct pcache_cache` manage its array of data segments.
* Metadata handling
- Loads the most-recent replica of both the segment-info block
(`struct pcache_segment_info`) and per-segment generation counter
(`struct pcache_cache_seg_gen`) using `pcache_meta_find_latest()`.
- Updates those structures atomically with CRC + sequence rollover,
writing alternately to the two metadata slots inside each segment.
* Segment initialisation (`cache_seg_init`)
- Builds a `struct pcache_segment` pointing to the segment’s data
area, sets up locks, generation counters, and, when formatting a new
cache, zeroes the on-segment kset header.
* Linked-list of segments
- `cache_seg_set_next_seg()` stores the *next* segment id in
`seg_info->next_seg` and sets the HAS_NEXT flag, allowing a cache to
span multiple segments. This also allows other types of
segment to be added in the future.
* Runtime life-cycle
- Reference counting (`cache_seg_get/put`) with invalidate-on-last-put
that clears the bitmap slot and schedules cleanup work.
- Generation bump (`cache_seg_gen_increase`) persists a new generation
record whenever the segment is modified.
* Allocator
- `get_cache_segment()` uses a bitmap and per-cache hint to pick the
next free segment, retrying with micro-delays when none are
immediately available.
Signed-off-by: Dongsheng Yang <[email protected]>
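The bitmap-plus-hint allocation idea can be sketched as below. It is a single-threaded userspace model: locking, the retry delay, and the kernel's bitmap helpers are left out, and `struct seg_alloc`, `seg_get()` and `seg_put()` are hypothetical names rather than the functions in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical allocator state: one bit per segment plus a search hint. */
struct seg_alloc {
	uint64_t *bitmap;	/* bit set = segment in use */
	uint32_t nbits;		/* number of segments */
	uint32_t hint;		/* index at which the next search starts */
};

/*
 * Claim the first free segment at or after the hint, wrapping around once.
 * Returns the segment id, or -1 when every segment is in use (a caller
 * could then wait briefly and retry).
 */
static int seg_get(struct seg_alloc *a)
{
	assert(a->nbits > 0 && a->hint < a->nbits);
	for (uint32_t n = 0; n < a->nbits; n++) {
		uint32_t i = (a->hint + n) % a->nbits;
		uint64_t mask = 1ull << (i % 64);

		if (a->bitmap[i / 64] & mask)
			continue;
		a->bitmap[i / 64] |= mask;
		a->hint = (i + 1) % a->nbits;
		return (int)i;
	}
	return -1;
}

/* Release a segment so that a later seg_get() can reuse it. */
static void seg_put(struct seg_alloc *a, uint32_t id)
{
	assert(id < a->nbits);
	a->bitmap[id / 64] &= ~(1ull << (id % 64));
}
```

Starting each search at the hint avoids rescanning the low, usually busy, end of the bitmap on every allocation.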
Introduce cache_writeback.c, which implements the asynchronous write-back
path for pcache. The new file is responsible for detecting dirty data,
organising it into an in-memory tree, issuing bios to the backing block
device, and advancing the cache’s *dirty tail* pointer once data has
been safely persisted.
* Dirty-state detection
- `__is_cache_clean()` reads the kset header at `dirty_tail`, checks
magic and CRC, and thus decides whether there is anything to flush.
* Write-back scheduler
- `cache_writeback_work` is queued on the cache task-workqueue and
re-arms itself at `PCACHE_CACHE_WRITEBACK_INTERVAL`.
- Uses an internal spin-protected `writeback_key_tree` to batch keys
belonging to the same stripe before IO.
* Key processing
- `cache_kset_insert_tree()` decodes each key inside the on-media
kset, allocates an in-memory key object, and inserts it into the
writeback_key_tree.
- `cache_key_writeback()` builds a *KMEM-type* backing request that
maps the persistent-memory range directly into a WRITE bio and
submits it with `submit_bio_noacct()`.
- After all keys from the writeback_key_tree have been flushed,
`backing_dev_flush()` issues a single FLUSH to ensure durability.
* Tail advancement
- Once a kset is written back, `cache_pos_advance()` moves
`cache->dirty_tail` by the exact on-disk size and the new position is
persisted via `cache_encode_dirty_tail()`.
- When the `PCACHE_KSET_FLAGS_LAST` flag is seen, the write-back
engine switches to the next segment indicated by `next_cache_seg_id`.
Signed-off-by: Dongsheng Yang <[email protected]>
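The tail-advancement rule can be pictured with the small model below. The structures are hypothetical simplifications: a position is just a segment id plus an offset, and a kset header carries only its size, a "last" marker and the next segment id. Restarting at offset 0 in the next segment is an assumption of this sketch, and persisting the new tail is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical position in the cache: which segment, and where in it. */
struct cache_pos {
	uint32_t seg_id;
	uint32_t off;
};

/* Hypothetical, reduced view of an on-media kset header. */
struct kset_hdr {
	uint32_t size;		/* on-disk size of this kset in bytes */
	bool last;		/* models the "last kset in segment" flag */
	uint32_t next_seg_id;	/* only meaningful when last is true */
};

/*
 * Move the dirty tail past a kset that has been written back. A "last"
 * kset redirects the tail to the start of the next segment (assumed);
 * any other kset advances it by exactly the kset's size.
 */
static void dirty_tail_advance(struct cache_pos *tail, const struct kset_hdr *k,
			       uint32_t seg_data_size)
{
	if (k->last) {
		tail->seg_id = k->next_seg_id;
		tail->off = 0;
		return;
	}
	assert(k->size <= seg_data_size - tail->off);
	tail->off += k->size;
}
```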
Introduce cache_gc.c, a self-contained engine that reclaims cache
segments whose data have already been flushed to the backing device.
Running in the cache workqueue, the GC keeps segment usage below the
user-configurable *cache_gc_percent* threshold.
* need_gc() – decides when to trigger GC by checking:
- *dirty_tail* vs *key_tail* position,
- kset integrity (magic + CRC),
- bitmap utilisation against the gc-percent threshold.
* Per-key reclamation
- Decodes each key in the target kset (`cache_key_decode()`).
- Drops the segment reference with `cache_seg_put()`, allowing the
segment to be invalidated once all keys are gone.
- When the reference count hits zero the segment is cleared from
`seg_map`, making it immediately reusable by the allocator.
* Scheduling
- `pcache_cache_gc_fn()` loops until no more work is needed, then
re-queues itself after *PCACHE_CACHE_GC_INTERVAL*.
Signed-off-by: Dongsheng Yang <[email protected]>
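The three checks listed for need_gc() can be combined as in the sketch below. The structure, the field names and the exact comparison (usage at or above the percentage triggers GC) are assumptions for illustration; the real function reads this state from the cache object and the on-media kset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical position in the cache. */
struct cache_pos {
	uint32_t seg_id;
	uint32_t off;
};

/* Hypothetical snapshot of the state the GC decision depends on. */
struct gc_state {
	struct cache_pos key_tail;	/* oldest key still indexed */
	struct cache_pos dirty_tail;	/* oldest data not yet written back */
	bool kset_valid;		/* kset at key_tail passed magic + CRC */
	uint32_t segs_used;		/* bits set in the segment bitmap */
	uint32_t segs_total;
	uint32_t gc_percent;		/* user-configured threshold, 0..100 */
};

static bool pos_equal(struct cache_pos a, struct cache_pos b)
{
	return a.seg_id == b.seg_id && a.off == b.off;
}

static bool need_gc_sketch(const struct gc_state *s)
{
	assert(s->gc_percent <= 100);
	/* Nothing between key_tail and dirty_tail has been flushed yet. */
	if (pos_equal(s->key_tail, s->dirty_tail))
		return false;
	/* Never reclaim based on a kset that fails validation. */
	if (!s->kset_valid)
		return false;
	/* Compare usage with the threshold without losing precision. */
	return (uint64_t)s->segs_used * 100 >=
	       (uint64_t)s->segs_total * s->gc_percent;
}
```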
Add *cache_key.c* which becomes the heart of dm-pcache’s
in-memory index and on-media key-set (“kset”) format.
* Key objects (`struct pcache_cache_key`)
- Slab-backed allocator & ref-count helpers
- `cache_key_encode()/decode()` translate between in-memory keys and
their on-disk representation, validating CRC when
*cache_data_crc* is enabled.
* Kset construction & persistence
- Per-kset buffer lives in `struct pcache_cache_kset`; keys are
appended until full or *force_close* triggers an immediate flush.
- `cache_kset_close()` writes the kset to the *key_head* segment,
automatically chaining a *LAST* kset header when rolling over to a
freshly allocated segment.
* Red-black tree with striping
- Cache space is divided into *subtrees* to reduce lock
contention; each subtree owns its own RB-root + spinlock.
- Complex overlap-resolution logic (`cache_insert_fixup()`) ensures
newly inserted keys never leave overlapping stale ranges behind
(head/tail/contain/contained cases handled).
* Replay on start-up
- `cache_replay()` walks from *key_tail* to *key_head*, re-hydrates
keys, validates CRC/magic, and skips
placeholder “empty” keys left by read-misses.
* Background maintenance
- `clean_work` lazily prunes invalidated keys after GC.
- `kset_flush_work` is a background work item that closes a kset.
With this patch dm-pcache can persistently track cached extents, rebuild
its index after crash, and guarantee non-overlapping key space – paving
the way for functional read/write caching.
Signed-off-by: Dongsheng Yang <[email protected]>
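The four overlap cases named above (head/tail/contain/contained) can be told apart with plain interval arithmetic. The sketch below classifies how a newly inserted range relates to one existing range; the enum, the naming convention (cases are described from the new key's point of view) and the function are hypothetical, and the actual fix-up (trimming, splitting or deleting the old key) is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical key range: [off, off + len). */
struct key_range {
	uint64_t off;
	uint64_t len;
};

enum overlap {
	OVERLAP_NONE,		/* ranges are disjoint */
	OVERLAP_HEAD,		/* new key covers the front part of the old key */
	OVERLAP_TAIL,		/* new key covers the back part of the old key */
	OVERLAP_CONTAIN,	/* new key covers the old key entirely */
	OVERLAP_CONTAINED,	/* new key lies strictly inside the old key */
};

static enum overlap classify_overlap(struct key_range new_key,
				     struct key_range old_key)
{
	uint64_t ns = new_key.off, ne = new_key.off + new_key.len;
	uint64_t os = old_key.off, oe = old_key.off + old_key.len;

	assert(new_key.len > 0 && old_key.len > 0);
	if (ne <= os || ns >= oe)
		return OVERLAP_NONE;
	if (ns <= os && ne >= oe)
		return OVERLAP_CONTAIN;		/* old key becomes fully stale */
	if (ns <= os)
		return OVERLAP_HEAD;		/* old key keeps only [ne, oe) */
	if (ne >= oe)
		return OVERLAP_TAIL;		/* old key keeps only [os, ns) */
	return OVERLAP_CONTAINED;		/* old key must be split in two */
}
```

Each non-trivial result maps to one fix-up action on the old key, which is how an insert can avoid leaving stale overlapping ranges in the tree.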
Introduce cache_req.c, the high-level engine that
drives I/O requests through dm-pcache. It decides whether data is served
from the cache or fetched from the backing device, allocates new cache
space on writes, and flushes dirty ksets when required.
* Read path
- Traverses the striped RB-trees to locate cached extents.
- Generates backing READ requests for gaps and inserts placeholder
“empty” keys to avoid duplicate fetches.
- Copies valid data directly from pmem into the caller’s bio; CRC and
generation checks guard against stale segments.
* Write path
- Allocates space in the current data segment via cache_data_alloc().
- Copies data from the bio into pmem, then inserts or updates keys,
splitting or trimming overlapped ranges as needed.
- Adds each new key to the active kset; forces kset close when FUA is
requested or the kset is full.
* Miss handling
- create_cache_miss_req() builds a backing READ, optionally attaching
an empty key.
- miss_read_end_req() replaces the placeholder with real data once the
READ completes, or deletes it on error.
* Flush support
- cache_flush() iterates over all ksets and forces them to close,
ensuring data durability when REQ_PREFLUSH is received.
Signed-off-by: Dongsheng Yang <[email protected]>
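The read path's split into cache hits and gaps can be modelled as a walk over a sorted, non-overlapping list of cached extents. The sketch below is a simplified userspace stand-in for the RB-tree traversal: a flat array replaces the tree, and `split_read()` and its types are hypothetical names. Each miss piece it reports corresponds to a range for which a backing READ and a placeholder key would be created.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cached extent: [off, off + len). */
struct extent {
	uint64_t off;
	uint64_t len;
};

/* One piece of a read request, served either from cache or from backing. */
struct piece {
	uint64_t off;
	uint64_t len;
	bool hit;
};

static size_t emit(struct piece *out, size_t n, size_t max_out,
		   uint64_t off, uint64_t len, bool hit)
{
	assert(n < max_out);
	out[n] = (struct piece){ .off = off, .len = len, .hit = hit };
	return n + 1;
}

/*
 * Split the request [off, off + len) into hit and miss pieces.
 * cached[] must be sorted by offset and must not overlap.
 * Returns the number of pieces written to out[].
 */
static size_t split_read(const struct extent *cached, size_t n_cached,
			 uint64_t off, uint64_t len,
			 struct piece *out, size_t max_out)
{
	uint64_t cur = off, end = off + len;
	size_t n = 0;

	for (size_t i = 0; i < n_cached && cur < end; i++) {
		uint64_t cs = cached[i].off, ce = cs + cached[i].len;

		if (ce <= cur)
			continue;	/* extent lies before the request */
		if (cs >= end)
			break;		/* extent lies after the request */
		if (cs > cur) {		/* gap before this extent: miss */
			n = emit(out, n, max_out, cur, cs - cur, false);
			cur = cs;
		}
		uint64_t hit_end = ce < end ? ce : end;

		n = emit(out, n, max_out, cur, hit_end - cur, true);
		cur = hit_end;
	}
	if (cur < end)			/* trailing gap: miss */
		n = emit(out, n, max_out, cur, end - cur, false);
	return n;
}
```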
Add cache.c and cache.h that introduce the top-level
“struct pcache_cache”. This object glues together the backing block
device, the persistent-memory cache device, segment array, RB-tree
indexes, and the background workers for write-back and garbage
collection.
* Persistent metadata
- pcache_cache_info tracks options such as cache mode, data-crc flag
and GC threshold, written atomically with CRC+sequence.
- key_tail and dirty_tail positions are double-buffered and recovered
at mount time.
* Segment management
- kvcalloc()’d array of pcache_cache_segment objects, bitmap for fast
allocation, refcounts and generation numbers so GC can invalidate
old extents safely.
- First segment hosts a pcache_cache_ctrl block shared by all
threads.
* Request path hooks
- pcache_cache_handle_req() dispatches READ, WRITE and FLUSH bios to
the engines added in earlier patches.
- Per-CPU data_heads support lock-free allocation of space for new
writes.
* Background workers
- Delayed work items for write-back (5 s) and GC (5 s).
- clean_work removes stale keys after segments are reclaimed.
* Lifecycle helpers
- pcache_cache_start()/stop() bring the cache online, replay keys,
start workers, and flush everything on shutdown.
With this piece in place dm-pcache has a fully initialised cache object
capable of serving I/O and maintaining its on-disk structures.
Signed-off-by: Dongsheng Yang <[email protected]>
Add the top-level integration pieces that make the new persistent-memory
cache target usable from device-mapper:
* Documentation
- `Documentation/admin-guide/device-mapper/dm-pcache.rst` explains the
design, table syntax, status fields and runtime messages.
* Core target implementation
- `dm_pcache.c` and `dm_pcache.h` register the `"pcache"` DM target,
parse constructor arguments, create workqueues, and forward bios to
the cache core added in earlier patches.
- Supports flush/FUA, status reporting, and a “gc_percent” message.
- Discard is not currently supported.
- Table reload for a live target is not currently supported.
* Device-mapper tables now accept lines like
pcache <pmem_dev> <backing_dev> writeback <true|false>
Signed-off-by: Dongsheng Yang <[email protected]>
Upstream branch: f4ca523
blktests-ci Bot pushed a commit that referenced this pull request on Feb 6, 2026
Commit 1767bb2 ("ipv6: mcast: Don't hold RTNL for IPV6_ADD_MEMBERSHIP and
MCAST_JOIN_GROUP.") removed the RTNL lock for IPV6_ADD_MEMBERSHIP and
MCAST_JOIN_GROUP operations. However, this change triggered the following
call trace on my BeagleBone Black board:
WARNING: net/8021q/vlan_core.c:236 at vlan_for_each+0x120/0x124, CPU#0: rpcbind/481
RTNL: assertion failed at net/8021q/vlan_core.c (236)
Modules linked in:
CPU: 0 UID: 997 PID: 481 Comm: rpcbind Not tainted 6.19.0-rc7-next-20260130-yocto-standard+ #35 PREEMPT
Hardware name: Generic AM33XX (Flattened Device Tree)
Call trace:
unwind_backtrace from show_stack+0x28/0x2c
show_stack from dump_stack_lvl+0x30/0x38
dump_stack_lvl from __warn+0xb8/0x11c
__warn from warn_slowpath_fmt+0x130/0x194
warn_slowpath_fmt from vlan_for_each+0x120/0x124
vlan_for_each from cpsw_add_mc_addr+0x54/0x98
cpsw_add_mc_addr from __hw_addr_ref_sync_dev+0xc4/0xec
__hw_addr_ref_sync_dev from __dev_mc_add+0x78/0x88
__dev_mc_add from igmp6_group_added+0x84/0xec
igmp6_group_added from __ipv6_dev_mc_inc+0x1fc/0x2f0
__ipv6_dev_mc_inc from __ipv6_sock_mc_join+0x124/0x1b4
__ipv6_sock_mc_join from do_ipv6_setsockopt+0x84c/0x1168
do_ipv6_setsockopt from ipv6_setsockopt+0x88/0xc8
ipv6_setsockopt from do_sock_setsockopt+0xe8/0x19c
do_sock_setsockopt from __sys_setsockopt+0x84/0xac
__sys_setsockopt from ret_fast_syscall+0x0/0x54
This trace occurs because vlan_for_each() is called within
cpsw_ndo_set_rx_mode(), which expects the RTNL lock to be held.
Since modifying vlan_for_each() to operate without the RTNL lock is not
straightforward, and because ndo_set_rx_mode() is invoked both with and
without the RTNL lock across different code paths, simply adding
rtnl_lock() in cpsw_ndo_set_rx_mode() is not a viable solution.
To resolve this issue, we opt to execute the actual processing within
a work queue, following the approach used by the icssg-prueth driver.
Please note: To reproduce this issue, I manually reverted the changes to
am335x-bone-common.dtsi from commit c477358 ("ARM: dts: am335x-bone:
switch to new cpsw switch drv") in order to revert to the legacy cpsw
driver.
Fixes: 1767bb2 ("ipv6: mcast: Don't hold RTNL for IPV6_ADD_MEMBERSHIP and MCAST_JOIN_GROUP.")
Signed-off-by: Kevin Hao <[email protected]>
Cc: [email protected]
Link: https://patch.msgid.link/[email protected]
Signed-off-by: Jakub Kicinski <[email protected]>
blktests-ci Bot pushed a commit that referenced this pull request on Mar 18, 2026
Commit a75b2be ("iommu: Add iommu_driver_get_domain_for_dev() helper")
introduced iommu_driver_get_domain_for_dev() for driver code paths that
hold iommu_group->mutex while attaching a device to an IOMMU domain.
The same commit also added a lockdep assertion in
iommu_get_domain_for_dev() to ensure that callers do not hold
iommu_group->mutex when invoking it.
On powerpc platforms, when PCI device ownership is switched from BLOCKED
to the PLATFORM domain, the attach callback
spapr_tce_platform_iommu_attach_dev() still calls
iommu_get_domain_for_dev(). This happens while iommu_group->mutex is
held during domain switching, which triggers the lockdep warning below
during PCI enumeration:
WARNING: drivers/iommu/iommu.c:2252 at iommu_get_domain_for_dev+0x38/0x80, CPU#2: swapper/0/1
Modules linked in:
CPU: 2 UID: 0 PID: 1 Comm: swapper/0 Not tainted 7.0.0-rc2+ #35 PREEMPT
Hardware name: IBM,9105-22A Power11 (architected) 0x820200 0xf000007 of:IBM,FW1120.00 (RB1120_115) hv:phyp pSeries
NIP: c000000000c244c4 LR: c00000000005b5a4 CTR: c00000000005b578
REGS: c00000000a7bf280 TRAP: 0700 Not tainted (7.0.0-rc2+)
MSR: 8000000002029033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 22004422 XER: 0000000a
CFAR: c000000000c24508 IRQMASK: 0
GPR00: c00000000005b5a4 c00000000a7bf520 c000000001dc8100 0000000000000001
GPR04: c00000000f972f10 0000000000000000 0000000000000000 0000000000000001
GPR08: 0000001ffbc60000 0000000000000001 0000000000000000 0000000000000000
GPR12: c00000000005b578 c000001fffffe480 c000000000011618 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: ffffffffffffefff 0000000000000000 c000000002d30eb0 0000000000000001
GPR24: c0000000017881f8 0000000000000000 0000000000000001 c00000000f972e00
GPR28: c00000000bbba0d0 0000000000000000 c00000000bbba0d0 c00000000f972e00
NIP [c000000000c244c4] iommu_get_domain_for_dev+0x38/0x80
LR [c00000000005b5a4] spapr_tce_platform_iommu_attach_dev+0x2c/0x98
Call Trace:
iommu_get_domain_for_dev+0x68/0x80 (unreliable)
spapr_tce_platform_iommu_attach_dev+0x2c/0x98
__iommu_attach_device+0x44/0x220
__iommu_device_set_domain+0xf4/0x194
__iommu_group_set_domain_internal+0xec/0x228
iommu_setup_default_domain+0x5f4/0x6a4
__iommu_probe_device+0x674/0x724
iommu_probe_device+0x50/0xb4
iommu_add_device+0x48/0x198
pci_dma_dev_setup_pSeriesLP+0x198/0x4f0
pcibios_bus_add_device+0x80/0x464
pci_bus_add_device+0x40/0x100
pci_bus_add_devices+0x54/0xb0
pcibios_init+0xd8/0x140
do_one_initcall+0x8c/0x598
kernel_init_freeable+0x3ec/0x850
kernel_init+0x34/0x270
ret_from_kernel_user_thread+0x14/0x1c
Fix this by using iommu_driver_get_domain_for_dev() instead of
iommu_get_domain_for_dev() in spapr_tce_platform_iommu_attach_dev(),
which is the appropriate helper for callers holding the group mutex.
Cc: [email protected]
Fixes: a75b2be ("iommu: Add iommu_driver_get_domain_for_dev() helper")
Closes: https://patchwork.ozlabs.org/project/linuxppc-dev/patch/[email protected]/
Signed-off-by: Nilay Shroff <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Tested-by: Venkat Rao Bagalkote <[email protected]>
[Maddy: Added Closes, tested and reviewed by tags]
Signed-off-by: Madhavan Srinivasan <[email protected]>
Link: https://patch.msgid.link/[email protected]
Pull request for series with
subject: dm-pcache – persistent-memory cache for block devices
version: 2
url: https://patchwork.kernel.org/project/linux-block/list/?series=979565