201 changes: 201 additions & 0 deletions Documentation/admin-guide/device-mapper/dm-pcache.rst
@@ -0,0 +1,201 @@
.. SPDX-License-Identifier: GPL-2.0

=================================
dm-pcache — Persistent Cache
=================================

*Author: Dongsheng Yang <[email protected]>*

This document describes *dm-pcache*, a Device-Mapper target that lets a
byte-addressable *DAX* (persistent-memory, “pmem”) region act as a
high-performance, crash-persistent cache in front of a slower block
device. The code lives in ``drivers/md/dm-pcache/``.

Quick feature summary
=====================

* *Write-back* caching (only mode currently supported).
* *16 MiB segments* allocated on the pmem device.
* *Data CRC32* verification (optional, per cache).
* Crash-safe: every metadata structure is duplicated
  (``PCACHE_META_INDEX_MAX == 2``) and protected with CRC+sequence numbers.
* *Multi-tree indexing* (one radix tree per CPU backend) for high PMem
  parallelism.
* Pure *DAX path* I/O – no extra BIO round-trips.
* *Log-structured write-back* that preserves backend crash-consistency.

-------------------------------------------------------------------------------

Constructor
===========

::

   pcache <cache_dev> <backing_dev> [<number_of_optional_arguments> <cache_mode writeback> <data_crc true|false>]

========================= ====================================================
``cache_dev``             Any DAX-capable block device (``/dev/pmem0``…).
                          All metadata *and* cached blocks are stored here.

``backing_dev``           The slow block device to be cached.

``cache_mode``            Optional. Only ``writeback`` is accepted at the
                          moment.

``data_crc``              Optional; defaults to ``false``.

                          * ``true`` – store a CRC32 for every cached entry
                            and verify it on reads.
                          * ``false`` – skip the CRC (faster).
========================= ====================================================

Example
-------

.. code-block:: shell

   dmsetup create pcache_sdb --table \
     "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true"

The first time a pmem device is used, dm-pcache formats it automatically
(super-block, cache_info, etc.).
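
The optional-argument count in the table line is easy to get wrong when
options are added or removed. The following is a minimal sketch, assuming the
count is the number of words that follow it (two per key/value pair, as in the
example above). ``pcache_table`` is a hypothetical helper, not part of dmsetup
or dm-pcache:

.. code-block:: shell

   # Build a dm-pcache table line (hypothetical helper).
   # Usage: pcache_table <sectors> <cache_dev> <backing_dev> [<key> <value>]...
   # The optional-argument count is assumed to be the number of words that
   # follow it, so it is derived from the remaining positional parameters.
   pcache_table() {
       sectors=$1 cache_dev=$2 backing_dev=$3
       shift 3
       if [ "$#" -eq 0 ]; then
           # No optional arguments: emit only the two mandatory devices.
           printf '0 %s pcache %s %s\n' "$sectors" "$cache_dev" "$backing_dev"
       else
           printf '0 %s pcache %s %s %s %s\n' \
               "$sectors" "$cache_dev" "$backing_dev" "$#" "$*"
       fi
   }

For example, ``pcache_table 2048 /dev/pmem0 /dev/sdb data_crc true`` prints
``0 2048 pcache /dev/pmem0 /dev/sdb 2 data_crc true``, which can then be passed
to ``dmsetup create --table``.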

-------------------------------------------------------------------------------

Status line
===========

``dmsetup status <device>`` (``STATUSTYPE_INFO``) prints:

::

   <sb_flags> <seg_total> <cache_segs> <segs_used> \
   <gc_percent> <cache_flags> \
   <key_head_seg>:<key_head_off> \
   <dirty_tail_seg>:<dirty_tail_off> \
   <key_tail_seg>:<key_tail_off>

Field meanings
--------------

=============================== =============================================
``sb_flags``                    Super-block flags (e.g. endian marker).

``seg_total``                   Number of physical *pmem* segments.

``cache_segs``                  Number of segments used for cache.

``segs_used``                   Segments currently allocated (bitmap weight).

``gc_percent``                  Current GC high-water mark (0-90).

``cache_flags``                 | Bit 0 – DATA_CRC enabled
                                | Bit 1 – INIT_DONE (cache initialised)
                                | Bits 2-5 – cache mode (0 == WB)

``key_head``                    Where new key-sets are being written.

``dirty_tail``                  First dirty key-set that still needs
                                write-back to the backing device.

``key_tail``                    First key-set that may be reclaimed by GC.
=============================== =============================================
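
The ``cache_flags`` bit layout in the table can be unpacked with shell
arithmetic. This is a minimal sketch: ``pcache_decode_flags`` is a hypothetical
helper, and it assumes the value is passed in a form the shell can evaluate
(decimal, or hexadecimal with a ``0x`` prefix), since the radix used by the
status line is not stated here:

.. code-block:: shell

   # Decode a <cache_flags> value (hypothetical helper).
   # Bit 0 = DATA_CRC, bit 1 = INIT_DONE, bits 2-5 = cache mode.
   # Assumption: $1 is decimal or 0x-prefixed hexadecimal.
   pcache_decode_flags() {
       flags=$(( $1 ))
       printf 'data_crc=%d init_done=%d cache_mode=%d\n' \
           "$(( flags & 1 ))" \
           "$(( (flags >> 1) & 1 ))" \
           "$(( (flags >> 2) & 15 ))"
   }

For example, ``pcache_decode_flags 3`` prints
``data_crc=1 init_done=1 cache_mode=0``.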

-------------------------------------------------------------------------------

Messages
========

*Change GC trigger*

::

   dmsetup message <dev> 0 gc_percent <0-90>
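
A script that sets the trigger may want to check the value against the
documented 0-90 range before sending the message. The following is a minimal
sketch; ``pcache_gc_percent_valid`` and ``pcache_set_gc`` are hypothetical
wrapper names, and only the ``dmsetup message`` invocation itself comes from
this document:

.. code-block:: shell

   # Succeed only if $1 is a non-negative integer in the range 0-90.
   pcache_gc_percent_valid() {
       case $1 in
           ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit
       esac
       [ "$1" -le 90 ]
   }

   # Hypothetical wrapper: validate first, then send the message.
   pcache_set_gc() {
       dev=$1 percent=$2
       if ! pcache_gc_percent_valid "$percent"; then
           echo "gc_percent must be an integer in 0-90: $percent" >&2
           return 1
       fi
       dmsetup message "$dev" 0 gc_percent "$percent"
   }

With these definitions, ``pcache_set_gc pcache_sdb 80`` sends the message,
while ``pcache_set_gc pcache_sdb 95`` fails without calling dmsetup.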

-------------------------------------------------------------------------------

Theory of operation
===================

Sub-devices
-----------

==================== =========================================================
backing_dev          Any block device (SSD/HDD/loop/LVM, etc.).
cache_dev            DAX device; must expose direct-access memory.
==================== =========================================================

Segments and key-sets
---------------------

* The pmem space is divided into *16 MiB segments*.
* Each write allocates space from a per-CPU *data_head* inside a segment.
* A *cache-key* records a logical range on the origin and where it lives
  inside pmem (segment + offset + generation).
* 128 keys form a *key-set* (kset); ksets are written sequentially in pmem
  and are themselves crash-safe (CRC).
* The pair *(key_tail, dirty_tail)* delimits clean/dirty and live/dead ksets:
  *key_tail* separates dead ksets from live ones, and *dirty_tail* separates
  clean ksets from dirty ones.
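
As a worked example of the segment arithmetic: with 16 MiB segments, a byte
offset splits into a segment index and an offset inside that segment. The
sketch below assumes offsets are measured from the start of the first segment
and ignores any space reserved for metadata; ``pcache_seg_of`` and
``PCACHE_SEG_SIZE`` are hypothetical names:

.. code-block:: shell

   PCACHE_SEG_SIZE=$(( 16 * 1024 * 1024 ))   # 16 MiB per segment

   # Print "<segment>:<offset>" for a byte offset into the segment area.
   # Assumption: offset 0 is the first byte of segment 0.
   pcache_seg_of() {
       printf '%d:%d\n' \
           "$(( $1 / PCACHE_SEG_SIZE ))" \
           "$(( $1 % PCACHE_SEG_SIZE ))"
   }

For example, ``pcache_seg_of 33558528`` prints ``2:4096``, because
33558528 = 2 * 16777216 + 4096.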

Write-back
----------

Dirty keys are queued into a tree; a background worker copies data
back to the backing_dev and advances *dirty_tail*. A FLUSH/FUA bio from the
upper layers forces an immediate metadata commit.

Garbage collection
------------------

GC starts when ``segs_used >= seg_total * gc_percent / 100``. It walks
from *key_tail*, frees segments whose every key has been invalidated, and
advances *key_tail*.
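
The trigger condition can be checked from a script using the ``segs_used``,
``seg_total`` and ``gc_percent`` values reported by the status line. This is a
minimal sketch; ``pcache_gc_due`` is a hypothetical helper, and the use of
integer division is an assumption:

.. code-block:: shell

   # Succeed when GC should start, per the condition stated above:
   #   segs_used >= seg_total * gc_percent / 100
   # Assumption: the division truncates, as shell arithmetic does.
   pcache_gc_due() {
       segs_used=$1 seg_total=$2 gc_percent=$3
       [ "$segs_used" -ge "$(( seg_total * gc_percent / 100 ))" ]
   }

With ``seg_total`` 100 and ``gc_percent`` 70, ``pcache_gc_due 70 100 70``
succeeds and ``pcache_gc_due 69 100 70`` fails.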

CRC verification
----------------

If ``data_crc`` is enabled, dm-pcache computes a CRC32 over every cached data
range when it is inserted and stores it in the on-media key. Reads
validate the CRC before copying to the caller.
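
The store-on-insert, verify-on-read pattern can be illustrated in user space.
In the sketch below, ``cksum`` (the POSIX CRC) is a stand-in and is *not* the
CRC32 variant used by the kernel; ``data_crc_of`` and ``data_crc_ok`` are
hypothetical names:

.. code-block:: shell

   # Checksum of a string. cksum implements the POSIX CRC, which differs
   # from the kernel's CRC32; it is used here for illustration only.
   data_crc_of() {
       printf '%s' "$1" | cksum | cut -d' ' -f1
   }

   # "Read" side: recompute the checksum of the data in $1 and compare it
   # with the value in $2 that was stored at insert time.
   data_crc_ok() {
       [ "$(data_crc_of "$1")" = "$2" ]
   }

A caller would store ``$(data_crc_of "$data")`` when the data is inserted and
call ``data_crc_ok "$data" "$stored"`` before returning it.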

-------------------------------------------------------------------------------

Failure handling
================

* *pmem media errors* – all metadata copies are read with
  ``copy_mc_to_kernel``; an uncorrectable error is logged and aborts
  initialisation.
* *Cache full* – if no free segment can be found, writes return ``-EBUSY``;
  dm-pcache retries internally (request deferral).
* *System crash* – on attach, the driver replays ksets from *key_tail* to
  rebuild the in-core trees; every segment’s generation guards against
  use-after-free keys.

-------------------------------------------------------------------------------

Limitations & TODO
==================

* Only *write-back* mode is supported; other modes are planned.
* Only FIFO cache invalidation is supported; other policies (LRU, ARC, ...)
  are planned.
* Table reload is not currently supported.
* Discard support is planned.

-------------------------------------------------------------------------------

Example workflow
================

.. code-block:: shell

   # 1. Create devices
   dmsetup create pcache_sdb --table \
     "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true"

   # 2. Put a filesystem on top
   mkfs.ext4 /dev/mapper/pcache_sdb
   mount /dev/mapper/pcache_sdb /mnt

   # 3. Tune GC threshold to 80 %
   dmsetup message pcache_sdb 0 gc_percent 80

   # 4. Observe status
   watch -n1 'dmsetup status pcache_sdb'

   # 5. Shutdown
   umount /mnt
   dmsetup remove pcache_sdb

-------------------------------------------------------------------------------

``dm-pcache`` is under active development; feedback, bug reports and patches
are very welcome!
8 changes: 8 additions & 0 deletions MAINTAINERS
@@ -6947,6 +6947,14 @@ S: Maintained
F: Documentation/admin-guide/device-mapper/vdo*.rst
F: drivers/md/dm-vdo/

DEVICE-MAPPER PCACHE TARGET
M: Dongsheng Yang <[email protected]>
M: Zheng Gu <[email protected]>
L: [email protected]
S: Maintained
F: Documentation/admin-guide/device-mapper/dm-pcache.rst
F: drivers/md/dm-pcache/

DEVLINK
M: Jiri Pirko <[email protected]>
L: [email protected]
2 changes: 2 additions & 0 deletions drivers/md/Kconfig
@@ -659,4 +659,6 @@ config DM_AUDIT

source "drivers/md/dm-vdo/Kconfig"

source "drivers/md/dm-pcache/Kconfig"

endif # MD
1 change: 1 addition & 0 deletions drivers/md/Makefile
@@ -71,6 +71,7 @@ obj-$(CONFIG_DM_RAID) += dm-raid.o
obj-$(CONFIG_DM_THIN_PROVISIONING) += dm-thin-pool.o
obj-$(CONFIG_DM_VERITY) += dm-verity.o
obj-$(CONFIG_DM_VDO) += dm-vdo/
obj-$(CONFIG_DM_PCACHE) += dm-pcache/
obj-$(CONFIG_DM_CACHE) += dm-cache.o
obj-$(CONFIG_DM_CACHE_SMQ) += dm-cache-smq.o
obj-$(CONFIG_DM_EBS) += dm-ebs.o
17 changes: 17 additions & 0 deletions drivers/md/dm-pcache/Kconfig
@@ -0,0 +1,17 @@
config DM_PCACHE
	tristate "Persistent cache for Block Device (Experimental)"
	depends on BLK_DEV_DM
	depends on DEV_DAX
	help
	  PCACHE provides a mechanism to use persistent memory (e.g., CXL
	  persistent memory, DAX-enabled devices) as a high-performance cache
	  layer in front of traditional block devices such as SSDs or HDDs.

	  PCACHE is implemented as a kernel module that integrates with the
	  block layer and supports direct access (DAX) to persistent memory
	  for low-latency, byte-addressable caching.

	  Note: This feature is experimental and should be tested thoroughly
	  before use in production environments.

	  If unsure, say 'N'.
3 changes: 3 additions & 0 deletions drivers/md/dm-pcache/Makefile
@@ -0,0 +1,3 @@
dm-pcache-y := dm_pcache.o cache_dev.o segment.o backing_dev.o cache.o cache_gc.o cache_writeback.o cache_segment.o cache_key.o cache_req.o

obj-$(CONFIG_DM_PCACHE) += dm-pcache.o