28 changes: 28 additions & 0 deletions .github/workflows/kernel_build.yml
@@ -0,0 +1,28 @@
name: blktests-ci

on:
  pull_request:

jobs:
  build-kernel:
    runs-on: ubuntu-latest
    steps:
      - name: Configure git
        run: |
          git config --global --add safe.directory '*'
      - name: Checkout git
        run: |
          sudo apt-get install -y libelf-dev
          mkdir -p linux
          cd linux
          git init
          git remote add origin https://github.com/${{ github.repository }}
          git fetch origin --depth=5 ${{ github.event.pull_request.head.sha }}
          git reset --hard ${{ github.event.pull_request.head.sha }}
          git log -1
      - name: Build kernel
        run: |
          cd linux
          make defconfig
          make -j 8

200 changes: 200 additions & 0 deletions Documentation/admin-guide/device-mapper/dm-pcache.rst
@@ -0,0 +1,200 @@
.. SPDX-License-Identifier: GPL-2.0

=================================
dm-pcache — Persistent Cache
=================================

*Author: Dongsheng Yang <[email protected]>*

This document describes *dm-pcache*, a Device-Mapper target that lets a
byte-addressable *DAX* (persistent-memory, “pmem”) region act as a
high-performance, crash-persistent cache in front of a slower block
device. The code lives in ``drivers/md/dm-pcache/``.

Quick feature summary
=====================

* *Write-back* caching (only mode currently supported).
* *16 MiB segments* allocated on the pmem device.
* *Data CRC32* verification (optional, per cache).
* Crash-safe: every metadata structure is duplicated
  (``PCACHE_META_INDEX_MAX == 2``) and protected with CRC and sequence numbers.
* *Multi-tree indexing* (one radix tree per CPU backend) for high pmem
  parallelism.
* Pure *DAX path* I/O – no extra BIO round-trips.
* *Log-structured write-back* that preserves backend crash-consistency.

-------------------------------------------------------------------------------
Constructor
===========

::

    pcache <cache_dev> <backing_dev> <cache_mode> <data_crc>

========================= ====================================================
``cache_dev``             Any DAX-capable block device (``/dev/pmem0``…).
                          All metadata *and* cached blocks are stored here.

``backing_dev``           The slow block device to be cached.

``cache_mode``            Only ``writeback`` is accepted at the moment.

``data_crc``              ``true``  – store CRC32 for every cached entry and
                          verify on reads
                          ``false`` – skip CRC (faster)
========================= ====================================================

Example
-------

.. code-block:: shell

   dmsetup create pcache_sdb --table \
     "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb writeback true"

The first time a pmem device is used, dm-pcache formats it automatically
(super-block, cache_info, etc.).
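The table line used in the example can also be assembled by a small helper before it is handed to ``dmsetup``. The following is a minimal sketch, not part of dm-pcache: the function name ``pcache_table`` and the sample sector count are assumptions made for illustration. It only builds the string and does not touch any device.

```shell
# Hypothetical helper: print a dm-pcache table line for "dmsetup create".
# Arguments: <sectors> <cache_dev> <backing_dev> <data_crc: true|false>
pcache_table() {
    sectors=$1
    cache_dev=$2
    backing_dev=$3
    data_crc=$4

    # data_crc is documented as either "true" or "false".
    case "$data_crc" in
        true|false) ;;
        *) echo "data_crc must be true or false" >&2; return 1 ;;
    esac

    # "writeback" is the only cache mode accepted at the moment.
    printf '0 %s pcache %s %s writeback %s\n' \
        "$sectors" "$cache_dev" "$backing_dev" "$data_crc"
}

# Build the line for a 1 GiB backing device (2097152 sectors of 512 bytes).
pcache_table 2097152 /dev/pmem0 /dev/sdb true
```

On a real system the sector count would come from ``blockdev --getsz`` as in the example above, and the result would be passed to ``dmsetup create <name> --table``.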

-------------------------------------------------------------------------------
Status line
===========

``dmsetup status <device>`` (``STATUSTYPE_INFO``) prints:

::

    <sb_flags> <seg_total> <cache_segs> <segs_used> \
    <gc_percent> <cache_flags> \
    <key_head_seg>:<key_head_off> \
    <dirty_tail_seg>:<dirty_tail_off> \
    <key_tail_seg>:<key_tail_off>

Field meanings
--------------

=============================== =============================================
``sb_flags``                    Super-block flags (e.g. endian marker).

``seg_total``                   Number of physical *pmem* segments.

``cache_segs``                  Number of segments used for cache.

``segs_used``                   Segments currently allocated (bitmap weight).

``gc_percent``                  Current GC high-water mark (0-90).

``cache_flags``                 Bit 0 – DATA_CRC enabled
                                Bit 1 – INIT_DONE (cache initialised)
                                Bits 2-5 – cache mode (0 == WB).

``key_head``                    Where new key-sets are being written.

``dirty_tail``                  First dirty key-set that still needs
                                write-back to the backing device.

``key_tail``                    First key-set that may be reclaimed by GC.
=============================== =============================================
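The field layout above can be decoded with a few lines of shell. This is a sketch under stated assumptions: the helper name ``pcache_decode_status`` and the sample values are invented for illustration, the flag fields are assumed to be printed as plain integers (decimal or ``0x``-prefixed), and the input is only the target-specific part of the line, after the ``<start> <length> pcache`` prefix that ``dmsetup status`` prints.

```shell
# Hypothetical helper: decode the nine target-specific status fields.
pcache_decode_status() {
    # Split the whitespace-separated fields into named variables.
    read -r sb_flags seg_total cache_segs segs_used gc_percent cache_flags \
        key_head dirty_tail key_tail <<EOF
$1
EOF

    echo "seg_total: $seg_total"
    echo "cache_segs: $cache_segs"
    echo "segs_used: $segs_used"
    echo "gc_percent: $gc_percent"
    # cache_flags: bit 0 = DATA_CRC, bit 1 = INIT_DONE, bits 2-5 = cache mode.
    echo "data_crc: $(( $cache_flags & 1 ))"
    echo "init_done: $(( ($cache_flags >> 1) & 1 ))"
    echo "cache_mode: $(( ($cache_flags >> 2) & 15 ))"
    # Positions are printed as <seg>:<off>; split on the colon.
    echo "key_head: seg ${key_head%%:*} off ${key_head##*:}"
    echo "dirty_tail: seg ${dirty_tail%%:*} off ${dirty_tail##*:}"
    echo "key_tail: seg ${key_tail%%:*} off ${key_tail##*:}"
}

# Made-up sample values, not captured from a real device.
pcache_decode_status "0 64 62 10 70 3 5:4096 3:0 1:0"
```

With ``cache_flags`` equal to 3, the sketch reports DATA_CRC and INIT_DONE as set and cache mode 0 (write-back).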

-------------------------------------------------------------------------------
Messages
========

*Change GC trigger*

::

    dmsetup message <dev> 0 gc_percent <0-90>
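A wrapper can reject out-of-range values before the message reaches the target. The sketch below is illustrative only: ``pcache_set_gc_percent`` and the ``DMSETUP`` override variable are assumptions introduced here, not part of dm-pcache or ``dmsetup``. Setting ``DMSETUP=echo`` turns the call into a dry run that prints the arguments instead of sending them.

```shell
# Hypothetical wrapper: validate gc_percent (0-90) before sending it.
pcache_set_gc_percent() {
    dev=$1
    pct=$2

    # Accept only non-empty strings made of digits.
    case "$pct" in
        ''|*[!0-9]*) echo "gc_percent must be an integer" >&2; return 1 ;;
    esac

    # The documented range for the message is 0-90.
    if [ "$pct" -gt 90 ]; then
        echo "gc_percent must be between 0 and 90" >&2
        return 1
    fi

    # DMSETUP is an assumed override hook; it defaults to the real tool.
    ${DMSETUP:-dmsetup} message "$dev" 0 gc_percent "$pct"
}

# Dry run: print the dmsetup arguments instead of executing dmsetup.
DMSETUP=echo
pcache_set_gc_percent pcache_sdb 80
```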

-------------------------------------------------------------------------------
Theory of operation
===================

Sub-devices
-----------

==================== =========================================================
backing_dev          Any block device (SSD/HDD/loop/LVM, etc.).
cache_dev            DAX device; must expose direct-access memory.
==================== =========================================================

Segments and key-sets
---------------------

* The pmem space is divided into *16 MiB segments*.
* Each write allocates space from a per-CPU *data_head* inside a segment.
* A *cache-key* records a logical range on the origin and where it lives
  inside pmem (segment + offset + generation).
* 128 keys form a *key-set* (kset); ksets are written sequentially in pmem
  and are themselves crash-safe (CRC).
* *dirty_tail* separates clean ksets from dirty ones, and *key_tail*
  separates dead (reclaimable) ksets from live ones.
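The 16 MiB segment size gives a quick upper bound on how many segments a pmem region can hold. The arithmetic below is a rough sketch: the helper name ``pcache_max_segments`` is an assumption, and the real ``seg_total`` reported in the status line may be lower because space is also needed for the super-block and other metadata.

```shell
# Segment size documented for dm-pcache, in MiB.
SEG_SIZE_MIB=16

# Hypothetical helper: upper bound on the number of 16 MiB segments that
# fit into a pmem region of the given size in bytes (integer division).
pcache_max_segments() {
    pmem_bytes=$1
    echo $(( $pmem_bytes / ($SEG_SIZE_MIB * 1024 * 1024) ))
}

# A 1 GiB region holds at most 64 segments.
pcache_max_segments $(( 1024 * 1024 * 1024 ))
```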

Write-back
----------

Dirty keys are queued into a tree; a background worker copies data
back to the backing_dev and advances *dirty_tail*. A FLUSH/FUA bio from the
upper layers forces an immediate metadata commit.

Garbage collection
------------------

GC starts when ``segs_used >= seg_total * gc_percent / 100``. It walks
from *key_tail*, frees segments whose every key has been invalidated, and
advances *key_tail*.
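The trigger condition can be checked by hand against values taken from the status line. This is a minimal sketch that assumes integer arithmetic with truncating division; the function name ``pcache_gc_due`` is hypothetical and the kernel's exact rounding is not confirmed here.

```shell
# Hypothetical check mirroring the documented condition:
#   segs_used >= seg_total * gc_percent / 100
# Returns success (0) when GC would start, failure (1) otherwise.
pcache_gc_due() {
    segs_used=$1
    seg_total=$2
    gc_percent=$3
    [ "$segs_used" -ge $(( $seg_total * $gc_percent / 100 )) ]
}

# With 64 segments and gc_percent 70 the threshold is 64 * 70 / 100 = 44.
if pcache_gc_due 44 64 70; then echo "GC due"; else echo "GC idle"; fi
if pcache_gc_due 43 64 70; then echo "GC due"; else echo "GC idle"; fi
```

Lowering ``gc_percent`` makes GC start earlier, while a value of 0 makes the condition true at all times.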

CRC verification
----------------

If ``data_crc`` is enabled, dm-pcache computes a CRC32 over every cached data
range when it is inserted and stores it in the on-media key. Reads
validate the CRC before copying to the caller.

-------------------------------------------------------------------------------
Failure handling
================

* *pmem media errors* – all metadata copies are read with
  ``copy_mc_to_kernel``; an uncorrectable error logs and aborts initialisation.
* *Cache full* – if no free segment can be found, writes return ``-EBUSY``;
  dm-pcache retries internally (request deferral).
* *System crash* – on attach, the driver replays ksets from *key_tail* to
  rebuild the in-core trees; every segment’s generation guards against
  use-after-free keys.

-------------------------------------------------------------------------------
Limitations & TODO
==================

* Only *write-back* mode is supported; other modes are planned.
* Only FIFO cache invalidation is supported; other policies (LRU, ARC, ...)
  are planned.
* Table reload is not currently supported.
* Discard support is planned.

-------------------------------------------------------------------------------
Example workflow
================

.. code-block:: shell

   # 1. Create devices
   dmsetup create pcache_sdb --table \
     "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb writeback true"

   # 2. Put a filesystem on top
   mkfs.ext4 /dev/mapper/pcache_sdb
   mount /dev/mapper/pcache_sdb /mnt

   # 3. Tune GC threshold to 80 %
   dmsetup message pcache_sdb 0 gc_percent 80

   # 4. Observe status
   watch -n1 'dmsetup status pcache_sdb'

   # 5. Shutdown
   umount /mnt
   dmsetup remove pcache_sdb

-------------------------------------------------------------------------------
``dm-pcache`` is under active development; feedback, bug reports and patches
are very welcome!
9 changes: 9 additions & 0 deletions MAINTAINERS
@@ -6946,6 +6946,15 @@ S: Maintained
F: Documentation/admin-guide/device-mapper/vdo*.rst
F: drivers/md/dm-vdo/

DEVICE-MAPPER PCACHE TARGET
M: Dongsheng Yang <[email protected]>
M: Zheng Gu <[email protected]>
R: Linggang Zeng <[email protected]>
L: [email protected]
S: Maintained
F: Documentation/admin-guide/device-mapper/dm-pcache.rst
F: drivers/md/dm-pcache/

DEVLINK
M: Jiri Pirko <[email protected]>
L: [email protected]
2 changes: 2 additions & 0 deletions drivers/md/Kconfig
@@ -659,4 +659,6 @@ config DM_AUDIT

source "drivers/md/dm-vdo/Kconfig"

source "drivers/md/dm-pcache/Kconfig"

endif # MD
1 change: 1 addition & 0 deletions drivers/md/Makefile
@@ -71,6 +71,7 @@ obj-$(CONFIG_DM_RAID) += dm-raid.o
obj-$(CONFIG_DM_THIN_PROVISIONING) += dm-thin-pool.o
obj-$(CONFIG_DM_VERITY) += dm-verity.o
obj-$(CONFIG_DM_VDO) += dm-vdo/
obj-$(CONFIG_DM_PCACHE) += dm-pcache/
obj-$(CONFIG_DM_CACHE) += dm-cache.o
obj-$(CONFIG_DM_CACHE_SMQ) += dm-cache-smq.o
obj-$(CONFIG_DM_EBS) += dm-ebs.o
17 changes: 17 additions & 0 deletions drivers/md/dm-pcache/Kconfig
@@ -0,0 +1,17 @@
config DM_PCACHE
	tristate "Persistent cache for Block Device (Experimental)"
	depends on BLK_DEV_DM
	depends on DEV_DAX
	help
	  PCACHE provides a mechanism to use persistent memory (e.g., CXL
	  persistent memory, DAX-enabled devices) as a high-performance cache
	  layer in front of traditional block devices such as SSDs or HDDs.

	  PCACHE is implemented as a kernel module that integrates with the
	  block layer and supports direct access (DAX) to persistent memory
	  for low-latency, byte-addressable caching.

	  Note: This feature is experimental and should be tested thoroughly
	  before use in production environments.

	  If unsure, say 'N'.
3 changes: 3 additions & 0 deletions drivers/md/dm-pcache/Makefile
@@ -0,0 +1,3 @@
dm-pcache-y := dm_pcache.o cache_dev.o segment.o backing_dev.o cache.o cache_gc.o cache_writeback.o cache_segment.o cache_key.o cache_req.o

obj-$(CONFIG_DM_PCACHE) += dm-pcache.o