
Commit 1bf3ce2

yangdongsheng authored and kawasaki committed
dm-pcache: initial dm-pcache target
Add the top-level integration pieces that make the new persistent-memory
cache target usable from device-mapper:

* Documentation
  - `Documentation/admin-guide/device-mapper/dm-pcache.rst` explains the
    design, table syntax, status fields and runtime messages.

* Core target implementation
  - `dm_pcache.c` and `dm_pcache.h` register the `"pcache"` DM target, parse
    constructor arguments, create workqueues, and forward bios to the cache
    core added in earlier patches.
  - Supports flush/FUA, status reporting, and a “gc_percent” message.
  - Does not support discard currently.
  - Does not support table reload for a live target currently.

* Device-mapper tables now accept lines like
    pcache <pmem_dev> <backing_dev> writeback <true|false>

Signed-off-by: Dongsheng Yang <[email protected]>
1 parent 658d9cc commit 1bf3ce2

8 files changed

Lines changed: 763 additions & 0 deletions

Documentation/admin-guide/device-mapper/dm-pcache.rst

Lines changed: 201 additions & 0 deletions
@@ -0,0 +1,201 @@
.. SPDX-License-Identifier: GPL-2.0

=================================
dm-pcache — Persistent Cache
=================================

*Author: Dongsheng Yang <[email protected]>*

This document describes *dm-pcache*, a Device-Mapper target that lets a
byte-addressable *DAX* (persistent-memory, “pmem”) region act as a
high-performance, crash-persistent cache in front of a slower block
device. The code lives in ``drivers/md/dm-pcache/``.

Quick feature summary
=====================

* *Write-back* caching (only mode currently supported).
* *16 MiB segments* allocated on the pmem device.
* *Data CRC32* verification (optional, per cache).
* Crash-safe: every metadata structure is duplicated
  (``PCACHE_META_INDEX_MAX == 2``) and protected with CRC+sequence numbers.
* *Multi-tree indexing* (one radix tree per CPU backend) for high PMem
  parallelism.
* Pure *DAX path* I/O – no extra BIO round-trips.
* *Log-structured write-back* that preserves backend crash-consistency.

-------------------------------------------------------------------------------

Constructor
===========

::

   pcache <cache_dev> <backing_dev> [<number_of_optional_arguments> <cache_mode writeback> <data_crc true|false>]

=========================  ====================================================
``cache_dev``              Any DAX-capable block device (``/dev/pmem0``…).
                           All metadata *and* cached blocks are stored here.

``backing_dev``            The slow block device to be cached.

``cache_mode``             Optional. Only ``writeback`` is accepted at the
                           moment.

``data_crc``               Optional, defaults to ``false``.
                           ``true``  – store CRC32 for every cached entry and
                           verify on reads.
                           ``false`` – skip CRC (faster).
=========================  ====================================================

Example
-------

.. code-block:: shell

   dmsetup create pcache_sdb --table \
     "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true"

The first time a pmem device is used, dm-pcache formats it automatically
(super-block, cache_info, etc.).
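
The table line in the example can also be assembled by a small helper, which
makes the optional-argument count explicit. The following is only a sketch:
``pcache_table`` is a hypothetical function name, not part of dmsetup or the
kernel, and the sector count in the demo call is made up (on a real system it
would come from ``blockdev --getsz``).

```shell
# Build a dm-pcache table line from its parts.
# pcache_table is a hypothetical helper, not part of dmsetup or the kernel.
pcache_table() {
    sectors=$1            # backing device size in 512-byte sectors
    cache_dev=$2          # DAX-capable pmem device
    backing_dev=$3        # slow block device to be cached
    data_crc=${4:-false}  # documented default is false

    case $data_crc in
        true|false) ;;
        *) echo "data_crc must be true or false" >&2; return 1 ;;
    esac

    # "4" announces the four optional words that follow it.
    echo "0 $sectors pcache $cache_dev $backing_dev 4 cache_mode writeback data_crc $data_crc"
}

# Demo with a made-up size; no device is touched.
pcache_table 2097152 /dev/pmem0 /dev/sdb true
```

The printed string can be passed to ``dmsetup create <name> --table``.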

-------------------------------------------------------------------------------

Status line
===========

``dmsetup status <device>`` (``STATUSTYPE_INFO``) prints:

::

   <sb_flags> <seg_total> <cache_segs> <segs_used> \
   <gc_percent> <cache_flags> \
   <key_head_seg>:<key_head_off> \
   <dirty_tail_seg>:<dirty_tail_off> \
   <key_tail_seg>:<key_tail_off>

Field meanings
--------------

===============================  =============================================
``sb_flags``                     Super-block flags (e.g. endian marker).

``seg_total``                    Number of physical *pmem* segments.

``cache_segs``                   Number of segments used for cache.

``segs_used``                    Segments currently allocated (bitmap weight).

``gc_percent``                   Current GC high-water mark (0-90).

``cache_flags``                  | Bit 0 – DATA_CRC enabled
                                 | Bit 1 – INIT_DONE (cache initialised)
                                 | Bits 2-5 – cache mode (0 == WB).

``key_head``                     Where new key-sets are being written.

``dirty_tail``                   First dirty key-set that still needs
                                 write-back to the backing device.

``key_tail``                     First key-set that may be reclaimed by GC.
===============================  =============================================
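
As an illustration of the field layout, the sketch below splits a status
payload into the nine fields listed above and decodes the ``cache_flags``
bits. ``decode_pcache_status`` is a hypothetical helper and the sample values
are invented. It assumes the flag fields can be read by shell arithmetic
(decimal, or hexadecimal with a ``0x`` prefix); if the kernel prints them in
another form they would need to be normalised first.

```shell
# Decode the target-specific part of a dm-pcache status line.
# decode_pcache_status is a hypothetical helper; field order follows the
# "Status line" section. Flag values are assumed to be decimal or 0x-prefixed.
decode_pcache_status() {
    # Word-split the payload into positional parameters $1..$9.
    set -- $1
    [ $# -eq 9 ] || { echo "expected 9 fields, got $#" >&2; return 1; }

    flags=$6
    echo "segs_used=$4/$2"
    echo "gc_percent=$5"
    echo "data_crc=$(( $flags & 1 ))"            # bit 0
    echo "init_done=$(( ($flags >> 1) & 1 ))"    # bit 1
    echo "cache_mode=$(( ($flags >> 2) & 15 ))"  # bits 2-5
    echo "key_head=seg ${7%%:*} off ${7#*:}"
    echo "dirty_tail=seg ${8%%:*} off ${8#*:}"
    echo "key_tail=seg ${9%%:*} off ${9#*:}"
}

# Demo with an invented payload; no device is queried.
decode_pcache_status "0 256 255 12 70 3 5:1024 2:4096 1:0"
```

``dmsetup status`` prefixes the payload with ``<start> <length> <target>``, so
on a live system the first three words would have to be dropped, for example
with ``cut -d' ' -f4-``.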

-------------------------------------------------------------------------------

Messages
========

*Change GC trigger*

::

   dmsetup message <dev> 0 gc_percent <0-90>
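
A wrapper script can check the documented 0-90 range before sending the
message. The functions below are a sketch with hypothetical names
(``valid_gc_percent``, ``set_gc_percent``); how the target itself reacts to
out-of-range values is not covered here.

```shell
# Validate a gc_percent value before sending it to the target.
# Both function names are hypothetical wrappers around dmsetup.
valid_gc_percent() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;   # reject empty and non-numeric input
    esac
    [ "$1" -le 90 ]                # documented range is 0-90
}

set_gc_percent() {
    dev=$1
    pct=$2
    if ! valid_gc_percent "$pct"; then
        echo "gc_percent must be an integer in 0-90, got '$pct'" >&2
        return 1
    fi
    dmsetup message "$dev" 0 gc_percent "$pct"
}
```

For example, ``set_gc_percent pcache_sdb 80`` would issue the message, while
``set_gc_percent pcache_sdb 95`` would fail before calling dmsetup.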

-------------------------------------------------------------------------------

Theory of operation
===================

Sub-devices
-----------

====================  =========================================================
backing_dev           Any block device (SSD/HDD/loop/LVM, etc.).
cache_dev             DAX device; must expose direct-access memory.
====================  =========================================================

Segments and key-sets
---------------------

* The pmem space is divided into *16 MiB segments*.
* Each write allocates space from a per-CPU *data_head* inside a segment.
* A *cache-key* records a logical range on the origin and where it lives
  inside pmem (segment + offset + generation).
* 128 keys form a *key-set* (kset); ksets are written sequentially in pmem
  and are themselves crash-safe (CRC).
* The pair *(key_tail, dirty_tail)* delimits clean/dirty and live/dead ksets.
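
The 16 MiB segment size gives a quick upper bound on how many segments a pmem
device can hold. The sketch below is plain arithmetic, not driver logic:
``max_segments`` is a hypothetical helper, and the real ``seg_total`` may be
lower because on-media metadata also needs space.

```shell
# Upper bound on the number of 16 MiB segments that fit in a pmem device.
# max_segments is a hypothetical helper; metadata overhead is ignored.
SEG_SIZE=$(( 16 * 1024 * 1024 ))

max_segments() {
    # $1: device size in bytes (e.g. from: blockdev --getsize64 /dev/pmem0)
    echo $(( $1 / SEG_SIZE ))
}

# Demo: a 4 GiB device holds at most 256 segments.
max_segments $(( 4 * 1024 * 1024 * 1024 ))
```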

Write-back
----------

Dirty keys are queued into a tree; a background worker copies data
back to the backing_dev and advances *dirty_tail*. A FLUSH/FUA bio from the
upper layers forces an immediate metadata commit.

Garbage collection
------------------

GC starts when ``segs_used >= seg_total * gc_percent / 100``. It walks
from *key_tail*, frees segments whose every key has been invalidated, and
advances *key_tail*.
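
The trigger condition can be evaluated from the status fields. This sketch
assumes integer arithmetic with truncating division, which may differ in
detail from what the driver does; ``gc_should_run`` is a hypothetical name.

```shell
# Evaluate the documented GC trigger: segs_used >= seg_total * gc_percent / 100.
# gc_should_run is a hypothetical helper; truncating division is an assumption.
gc_should_run() {
    segs_used=$1
    seg_total=$2
    gc_percent=$3
    [ "$segs_used" -ge $(( seg_total * gc_percent / 100 )) ]
}

# Demo: with 256 segments and gc_percent 70 the threshold is 179 segments.
if gc_should_run 180 256 70; then
    echo "GC would start"
fi
```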

CRC verification
----------------

If ``data_crc`` is enabled, dm-pcache computes a CRC32 over every cached data
range when it is inserted and stores it in the on-media key. Reads
validate the CRC before copying to the caller.
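
The store-then-verify pattern can be illustrated in user space. The sketch
below uses the POSIX ``cksum`` utility purely as a stand-in checksum; it is
not claimed to use the same CRC variant as dm-pcache, and ``checksum_of`` and
``verify`` are hypothetical helper names.

```shell
# Store-then-verify pattern, with POSIX cksum as a stand-in checksum.
# The CRC variant used by cksum is not claimed to match dm-pcache's.
checksum_of() {
    # Print only the checksum value of the data on stdin.
    cksum | cut -d' ' -f1
}

verify() {
    # $1: checksum recorded at insert time; data to check arrives on stdin.
    [ "$(checksum_of)" = "$1" ]
}

stored=$(printf 'cached block' | checksum_of)    # computed at insert time
printf 'cached block' | verify "$stored" && echo "read ok"
printf 'cached blocK' | verify "$stored" || echo "CRC mismatch"
```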

-------------------------------------------------------------------------------

Failure handling
================

* *pmem media errors* – all metadata copies are read with
  ``copy_mc_to_kernel``; an uncorrectable error is logged and aborts
  initialisation.
* *Cache full* – if no free segment can be found, writes return ``-EBUSY``;
  dm-pcache retries internally (request deferral).
* *System crash* – on attach, the driver replays ksets from *key_tail* to
  rebuild the in-core trees; every segment’s generation guards against
  use-after-free keys.

-------------------------------------------------------------------------------

Limitations & TODO
==================

* Only *write-back* mode; other modes planned.
* Only FIFO cache invalidation; other policies (LRU, ARC...) planned.
* Table reload is not supported currently.
* Discard is not supported yet; support is planned.

-------------------------------------------------------------------------------

Example workflow
================

.. code-block:: shell

   # 1. Create devices
   dmsetup create pcache_sdb --table \
     "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true"

   # 2. Put a filesystem on top
   mkfs.ext4 /dev/mapper/pcache_sdb
   mount /dev/mapper/pcache_sdb /mnt

   # 3. Tune GC threshold to 80 %
   dmsetup message pcache_sdb 0 gc_percent 80

   # 4. Observe status
   watch -n1 'dmsetup status pcache_sdb'

   # 5. Shutdown
   umount /mnt
   dmsetup remove pcache_sdb

-------------------------------------------------------------------------------

``dm-pcache`` is under active development; feedback, bug reports and patches
are very welcome!

MAINTAINERS

Lines changed: 8 additions & 0 deletions
@@ -6947,6 +6947,14 @@ S: Maintained
 F: Documentation/admin-guide/device-mapper/vdo*.rst
 F: drivers/md/dm-vdo/
 
+DEVICE-MAPPER PCACHE TARGET
+M: Dongsheng Yang <[email protected]>
+M: Zheng Gu <[email protected]>
+
+S: Maintained
+F: Documentation/admin-guide/device-mapper/dm-pcache.rst
+F: drivers/md/dm-pcache/
+
 DEVLINK
 M: Jiri Pirko <[email protected]>

drivers/md/Kconfig

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -659,4 +659,6 @@ config DM_AUDIT
659659

660660
source "drivers/md/dm-vdo/Kconfig"
661661

662+
source "drivers/md/dm-pcache/Kconfig"
663+
662664
endif # MD

drivers/md/Makefile

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -71,6 +71,7 @@ obj-$(CONFIG_DM_RAID) += dm-raid.o
7171
obj-$(CONFIG_DM_THIN_PROVISIONING) += dm-thin-pool.o
7272
obj-$(CONFIG_DM_VERITY) += dm-verity.o
7373
obj-$(CONFIG_DM_VDO) += dm-vdo/
74+
obj-$(CONFIG_DM_PCACHE) += dm-pcache/
7475
obj-$(CONFIG_DM_CACHE) += dm-cache.o
7576
obj-$(CONFIG_DM_CACHE_SMQ) += dm-cache-smq.o
7677
obj-$(CONFIG_DM_EBS) += dm-ebs.o

drivers/md/dm-pcache/Kconfig

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,17 @@
1+
config DM_PCACHE
2+
tristate "Persistent cache for Block Device (Experimental)"
3+
depends on BLK_DEV_DM
4+
depends on DEV_DAX
5+
help
6+
PCACHE provides a mechanism to use persistent memory (e.g., CXL persistent memory,
7+
DAX-enabled devices) as a high-performance cache layer in front of
8+
traditional block devices such as SSDs or HDDs.
9+
10+
PCACHE is implemented as a kernel module that integrates with the block
11+
layer and supports direct access (DAX) to persistent memory for low-latency,
12+
byte-addressable caching.
13+
14+
Note: This feature is experimental and should be tested thoroughly
15+
before use in production environments.
16+
17+
If unsure, say 'N'.

drivers/md/dm-pcache/Makefile

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
dm-pcache-y := dm_pcache.o cache_dev.o segment.o backing_dev.o cache.o cache_gc.o cache_writeback.o cache_segment.o cache_key.o cache_req.o
2+
3+
obj-m += dm-pcache.o
