dm-pcache �� persistent-memory cache for block devices by blktests-ci[bot] · Pull Request #14 · linux-blktests/linux-block

blktests-ci · 2025-06-10T04:38:36Z

Pull request for series with
subject: dm-pcache �� persistent-memory cache for block devices
version: 1
url: https://patchwork.kernel.org/project/linux-block/list/?series=968947

Consolidate common PCACHE helpers into a new header so that subsequent patches can include them without repeating boiler-plate. - Logging macros with unified prefix and location info. - Common constants (KB/MB helpers, metadata replica count, CRC seed). - On-disk metadata header definition and CRC helper. - Sequence-number comparison that handles wrap-around. - pcache_meta_find_latest() to pick the newest valid metadata copy. Signed-off-by: Dongsheng Yang <[email protected]>

This patch introduces *backing_dev.{c,h}*, a self-contained layer that handles all interaction with the *backing block device* where cache write-back and cache-miss reads are serviced. Isolating this logic keeps the core dm-pcache code free of low-level bio plumbing. * Device setup / teardown - Opens the target with `dm_get_device()`, stores `bdev`, file and size, and initialises a dedicated `bioset`. - Gracefully releases resources via `backing_dev_stop()`. * Request object (`struct pcache_backing_dev_req`) - Two request flavours: - REQ-type – cloned from an upper `struct bio` issued to dm-pcache; trimmed and re-targeted to the backing LBA. - KMEM-type – maps an arbitrary kernel memory buffer into a freshly built. - Private completion callback (`end_req`) propagates status to the upper layer and handles resource recycling. * Submission & completion path - Lock-protected submit queue + worker (`req_submit_work`) let pcache push many requests asynchronously, at the same time, allow caller to submit backing_dev_req in atomic context. - End-io handler moves finished requests to a completion list processed by `req_complete_work`, ensuring callbacks run in process context. - Direct-submit option for non-atomic context. * Flush - `backing_dev_flush()` issues a flush to persist backing-device data. Signed-off-by: Dongsheng Yang <[email protected]>

Add cache_dev.{c,h} to manage the persistent-memory device that stores all pcache metadata and data segments. Splitting this logic out keeps the main dm-pcache code focused on policy while cache_dev handles the low-level interaction with the DAX block device. * DAX mapping - Opens the underlying device via dm_get_device(). - Uses dax_direct_access() to obtain a direct linear mapping; falls back to vmap() when the range is fragmented. * On-disk layout ┌─ 4 KB ─┐ super-block (SB) ├─ 4 KB ─┤ cache_info[0] ├─ 4 KB ─┤ cache_info[1] ├─ 4 KB ─┤ cache_ctrl └─ ... ─┘ segments Constants and macros in the header expose offsets and sizes. * Super-block handling - sb_read(), sb_validate(), sb_init() verify magic, CRC32 and host endianness (flag *PCACHE_SB_F_BIGENDIAN*). - Formatting zeroes the metadata replicas and initialises the segment bitmap when the SB is blank. * Segment allocator - Bitmap protected by seg_lock; find_next_zero_bit() yields the next free 16 MB segment. * Lifecycle helpers - cache_dev_start()/stop() encapsulate init/exit and are invoked by dm-pcache core. - Gracefully handles errors: CRC mismatch, wrong endianness, device too small (< 512 MB), or failed DAX mapping. Signed-off-by: Dongsheng Yang <[email protected]>

Introduce segment.{c,h}, an internal abstraction that encapsulates everything related to a single pcache *segment* (the fixed-size allocation unit stored on the cache-device). * On-disk metadata (`struct pcache_segment_info`) - Embedded `struct pcache_meta_header` for CRC/sequence handling. - `flags` field encodes a “has-next” bit and a 4-bit *type* class (`CACHE_DATA` added as the first type). * Initialisation - `pcache_segment_init()` populates the in-memory `struct pcache_segment` from a given segment id, data offset and metadata pointer, computing the usable `data_size` and virtual address within the DAX mapping. * IO helpers - `segment_copy_to_bio()` / `segment_copy_from_bio()` move data between pmem and a bio, using `_copy_mc_to_iter()` and `_copy_from_iter_flushcache()` to tolerate hw memory errors and ensure durability. - `segment_pos_advance()` advances an internal offset while staying inside the segment’s data area. These helpers allow upper layers (cache key management, write-back logic, GC, etc.) to treat a segment as a contiguous byte array without knowing about DAX mappings or persistence details. Signed-off-by: Dongsheng Yang <[email protected]>

Introduce *cache_segment.c*, the in-memory/on-disk glue that lets a `struct pcache_cache` manage its array of data segments. * Metadata handling - Loads the most-recent replica of both the segment-info block (`struct pcache_segment_info`) and per-segment generation counter (`struct pcache_cache_seg_gen`) using `pcache_meta_find_latest()`. - Updates those structures atomically with CRC + sequence rollover, writing alternately to the two metadata slots inside each segment. * Segment initialisation (`cache_seg_init`) - Builds a `struct pcache_segment` pointing to the segment’s data area, sets up locks, generation counters, and, when formatting a new cache, zeroes the on-segment kset header. * Linked-list of segments - `cache_seg_set_next_seg()` stores the *next* segment id in `seg_info->next_seg` and sets the HAS_NEXT flag, allowing a cache to span multiple segments. This is important to allow other type of segment added in future. * Runtime life-cycle - Reference counting (`cache_seg_get/put`) with invalidate-on-last-put that clears the bitmap slot and schedules cleanup work. - Generation bump (`cache_seg_gen_increase`) persists a new generation record whenever the segment is modified. * Allocator - `get_cache_segment()` uses a bitmap and per-cache hint to pick the next free segment, retrying with micro-delays when none are immediately available. Signed-off-by: Dongsheng Yang <[email protected]>

Introduce cache_writeback.c, which implements the asynchronous write-back path for pcache. The new file is responsible for detecting dirty data, organising it into an in-memory tree, issuing bios to the backing block device, and advancing the cache’s *dirty tail* pointer once data has been safely persisted. * Dirty-state detection - `__is_cache_clean()` reads the kset header at `dirty_tail`, checks magic and CRC, and thus decides whether there is anything to flush. * Write-back scheduler - `cache_writeback_work` is queued on the cache task-workqueue and re-arms itself at `PCACHE_CACHE_WRITEBACK_INTERVAL`. - Uses an internal spin-protected `writeback_key_tree` to batch keys belonging to the same stripe before IO. * Key processing - `cache_kset_insert_tree()` decodes each key inside the on-media kset, allocates an in-memory key object, and inserts it into the writeback_key_tree. - `cache_key_writeback()` builds a *KMEM-type* backing request that maps the persistent-memory range directly into a WRITE bio and submits it with `submit_bio_noacct()`. - After all keys from the writeback_key_tree have been flushed, `backing_dev_flush()` issues a single FLUSH to ensure durability. * Tail advancement - Once a kset is written back, `cache_pos_advance()` moves `cache->dirty_tail` by the exact on-disk size and the new position is persisted via `cache_encode_dirty_tail()`. - When the `PCACHE_KSET_FLAGS_LAST` flag is seen, the write-back engine switches to the next segment indicated by `next_cache_seg_id`. Signed-off-by: Dongsheng Yang <[email protected]>

Introduce cache_gc.c, a self-contained engine that reclaims cache segments whose data have already been flushed to the backing device. Running in the cache workqueue, the GC keeps segment usage below the user-configurable *cache_gc_percent* threshold. * need_gc() – decides when to trigger GC by checking: - *dirty_tail* vs *key_tail* position, - kset integrity (magic + CRC), - bitmap utilisation against the gc-percent threshold. * Per-key reclamation - Decodes each key in the target kset (`cache_key_decode()`). - Drops the segment reference with `cache_seg_put()`, allowing the segment to be invalidated once all keys are gone. - When the reference count hits zero the segment is cleared from `seg_map`, making it immediately reusable by the allocator. * Scheduling - `pcache_cache_gc_fn()` loops until no more work is needed, then re-queues itself after *PCACHE_CACHE_GC_INTERVAL*. Signed-off-by: Dongsheng Yang <[email protected]>

Add *cache_key.c* which becomes the heart of dm-pcache’s in-memory index and on-media key-set (“kset”) format. * Key objects (`struct pcache_cache_key`) - Slab-backed allocator & ref-count helpers - `cache_key_encode()/decode()` translate between in-memory keys and their on-disk representation, validating CRC when *cache_data_crc* is enabled. * Kset construction & persistence - Per-kset buffer lives in `struct pcache_cache_kset`; keys are appended until full or *force_close* triggers an immediate flush. - `cache_kset_close()` writes the kset to the *key_head* segment, automatically chaining a *LAST* kset header when rolling over to a freshly allocated segment. * Red-black tree with striping - Cache space is divided into *subtrees* to reduce lock contention; each subtree owns its own RB-root + spinlock. - Complex overlap-resolution logic (`cache_insert_fixup()`) ensures newly inserted keys never leave overlapping stale ranges behind (head/tail/contain/contained cases handled). * Replay on start-up - `cache_replay()` walks from *key_tail* to *key_head*, re-hydrates keys, validates CRC/magic, seamlessly skipping placeholder “empty” keys left by read-misses. * Background maintenance - `clean_work` lazily prunes invalidated keys after GC. - `kset_flush_work` background thread to close a kset. With this patch dm-pcache can persistently track cached extents, rebuild its index after crash, and guarantee non-overlapping key space – paving the way for functional read/write caching. Signed-off-by: Dongsheng Yang <[email protected]>

Introduce cache_req.c, the high-level engine that drives I/O requests through dm-pcache. It decides whether data is served from the cache or fetched from the backing device, allocates new cache space on writes, and flushes dirty ksets when required. * Read path - Traverses the striped RB-trees to locate cached extents. - Generates backing READ requests for gaps and inserts placeholder “empty” keys to avoid duplicate fetches. - Copies valid data directly from pmem into the caller’s bio; CRC and generation checks guard against stale segments. * Write path - Allocates space in the current data segment via cache_data_alloc(). - Copies data from the bio into pmem, then inserts or updates keys, splitting or trimming overlapped ranges as needed. - Adds each new key to the active kset; forces kset close when FUA is requested or the kset is full. * Miss handling - create_cache_miss_req() builds a backing READ, optionally attaching an empty key. - miss_read_end_req() replaces the placeholder with real data once the READ completes, or deletes it on error. * Flush support - cache_flush() iterates over all ksets and forces them to close, ensuring data durability when REQ_PREFLUSH is received. Signed-off-by: Dongsheng Yang <[email protected]>

Add cache.c and cache.h that introduce the top-level “struct pcache_cache”. This object glues together the backing block device, the persistent-memory cache device, segment array, RB-tree indexes, and the background workers for write-back and garbage collection. * Persistent metadata - pcache_cache_info tracks options such as cache mode, data-crc flag and GC threshold, written atomically with CRC+sequence. - key_tail and dirty_tail positions are double-buffered and recovered at mount time. * Segment management - kvcalloc()’d array of pcache_cache_segment objects, bitmap for fast allocation, refcounts and generation numbers so GC can invalidate old extents safely. - First segment hosts a pcache_cache_ctrl block shared by all threads. * Request path hooks - pcache_cache_handle_req() dispatches READ, WRITE and FLUSH bios to the engines added in earlier patches. - Per-CPU data_heads support lock-free allocation of space for new writes. * Background workers - Delayed work items for write-back (5 s) and GC (5 s). - clean_work removes stale keys after segments are reclaimed. * Lifecycle helpers - pcache_cache_start()/stop() bring the cache online, replay keys, start workers, and flush everything on shutdown. With this piece in place dm-pcache has a fully initialised cache object capable of serving I/O and maintaining its on-disk structures. Signed-off-by: Dongsheng Yang <[email protected]>

Add the top-level integration pieces that make the new persistent-memory cache target usable from device-mapper: * Documentation - `Documentation/admin-guide/device-mapper/dm-pcache.rst` explains the design, table syntax, status fields and runtime messages. * Core target implementation - `dm_pcache.c` and `dm_pcache.h` register the `"pcache"` DM target, parse constructor arguments, create workqueues, and forward BIOS to the cache core added in earlier patches. - Supports flush/FUA, status reporting, and a “gc_percent” message. - Dont support discard currently. - Dont support table reload for live target currently. * Device-mapper tables now accept lines like pcache <pmem_dev> <backing_dev> writeback <true|false> Signed-off-by: Dongsheng Yang <[email protected]>

blktests-ci · 2025-06-10T04:38:37Z

Upstream branch: 19272b3
series: https://patchwork.kernel.org/project/linux-block/list/?series=968947
version: 1

blktests-ci · 2025-07-10T12:01:46Z

Upstream branch: 8c2e52e
series: https://patchwork.kernel.org/project/linux-block/list/?series=979565
version: 2

blktests-ci · 2025-07-10T12:10:11Z

Upstream branch: 8c2e52e
series: https://patchwork.kernel.org/project/linux-block/list/?series=979565
version: 2

blktests-ci · 2025-07-10T12:13:01Z

Github failed to update this PR after force push. Close it.

In KVM guests with Hyper-V hypercalls enabled, the hypercalls HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST and HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST_EX allow a guest to request invalidation of portions of a virtual TLB. For this, the hypercall parameter includes a list of GVAs that are supposed to be invalidated. However, when non-canonical GVAs are passed, there is currently no filtering in place and they are eventually passed to checked invocations of INVVPID on Intel / INVLPGA on AMD. While AMD's INVLPGA silently ignores non-canonical addresses (effectively a no-op), Intel's INVVPID explicitly signals VM-Fail and ultimately triggers the WARN_ONCE in invvpid_error(): invvpid failed: ext=0x0 vpid=1 gva=0xaaaaaaaaaaaaa000 WARNING: CPU: 6 PID: 326 at arch/x86/kvm/vmx/vmx.c:482 invvpid_error+0x91/0xa0 [kvm_intel] Modules linked in: kvm_intel kvm 9pnet_virtio irqbypass fuse CPU: 6 UID: 0 PID: 326 Comm: kvm-vm Not tainted 6.15.0 #14 PREEMPT(voluntary) RIP: 0010:invvpid_error+0x91/0xa0 [kvm_intel] Call Trace: vmx_flush_tlb_gva+0x320/0x490 [kvm_intel] kvm_hv_vcpu_flush_tlb+0x24f/0x4f0 [kvm] kvm_arch_vcpu_ioctl_run+0x3013/0x5810 [kvm] Hyper-V documents that invalid GVAs (those that are beyond a partition's GVA space) are to be ignored. While not completely clear whether this ruling also applies to non-canonical GVAs, it is likely fine to make that assumption, and manual testing on Azure confirms "real" Hyper-V interprets the specification in the same way. Skip non-canonical GVAs when processing the list of address to avoid tripping the INVVPID failure. Alternatively, KVM could filter out "bad" GVAs before inserting into the FIFO, but practically speaking the only downside of pushing validation to the final processing is that doing so is suboptimal for the guest, and no well-behaved guest will request TLB flushes for non-canonical addresses. Fixes: 2609708 ("KVM: x86: hyper-v: Handle HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST{,EX} calls gently") Cc: [email protected] Signed-off-by: Manuel Andreas <[email protected]> Suggested-by: Vitaly Kuznetsov <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Sean Christopherson <[email protected]>

Without the change `perf `hangs up on charaster devices. On my system it's enough to run system-wide sampler for a few seconds to get the hangup: $ perf record -a -g --call-graph=dwarf $ perf report # hung `strace` shows that hangup happens on reading on a character device `/dev/dri/renderD128` $ strace -y -f -p 2780484 strace: Process 2780484 attached pread64(101</dev/dri/renderD128>, strace: Process 2780484 detached It's call trace descends into `elfutils`: $ gdb -p 2780484 (gdb) bt #0 0x00007f5e508f04b7 in __libc_pread64 (fd=101, buf=0x7fff9df7edb0, count=0, offset=0) at ../sysdeps/unix/sysv/linux/pread64.c:25 #1 0x00007f5e52b79515 in read_file () from /<<NIX>>/elfutils-0.192/lib/libelf.so.1 #2 0x00007f5e52b25666 in libdw_open_elf () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1 #3 0x00007f5e52b25907 in __libdw_open_file () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1 #4 0x00007f5e52b120a9 in dwfl_report_elf@@ELFUTILS_0.156 () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1 #5 0x000000000068bf20 in __report_module (al=al@entry=0x7fff9df80010, ip=ip@entry=139803237033216, ui=ui@entry=0x5369b5e0) at util/dso.h:537 #6 0x000000000068c3d1 in report_module (ip=139803237033216, ui=0x5369b5e0) at util/unwind-libdw.c:114 #7 frame_callback (state=0x535aef10, arg=0x5369b5e0) at util/unwind-libdw.c:242 #8 0x00007f5e52b261d3 in dwfl_thread_getframes () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1 #9 0x00007f5e52b25bdb in get_one_thread_cb () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1 #10 0x00007f5e52b25faa in dwfl_getthreads () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1 #11 0x00007f5e52b26514 in dwfl_getthread_frames () from /<<NIX>>/elfutils-0.192/lib/libdw.so.1 #12 0x000000000068c6ce in unwind__get_entries (cb=cb@entry=0x5d4620 <unwind_entry>, arg=arg@entry=0x10cd5fa0, thread=thread@entry=0x1076a290, data=data@entry=0x7fff9df80540, max_stack=max_stack@entry=127, best_effort=best_effort@entry=false) at util/thread.h:152 #13 0x00000000005dae95 in thread__resolve_callchain_unwind (evsel=0x106006d0, thread=0x1076a290, cursor=0x10cd5fa0, sample=0x7fff9df80540, max_stack=127, symbols=true) at util/machine.c:2939 #14 thread__resolve_callchain_unwind (thread=0x1076a290, cursor=0x10cd5fa0, evsel=0x106006d0, sample=0x7fff9df80540, max_stack=127, symbols=true) at util/machine.c:2920 #15 __thread__resolve_callchain (thread=0x1076a290, cursor=0x10cd5fa0, evsel=0x106006d0, evsel@entry=0x7fff9df80440, sample=0x7fff9df80540, parent=parent@entry=0x7fff9df804a0, root_al=root_al@entry=0x7fff9df80440, max_stack=127, symbols=true) at util/machine.c:2970 #16 0x00000000005d0cb2 in thread__resolve_callchain (thread=<optimized out>, cursor=<optimized out>, evsel=0x7fff9df80440, sample=<optimized out>, parent=0x7fff9df804a0, root_al=0x7fff9df80440, max_stack=127) at util/machine.h:198 #17 sample__resolve_callchain (sample=<optimized out>, cursor=<optimized out>, parent=parent@entry=0x7fff9df804a0, evsel=evsel@entry=0x106006d0, al=al@entry=0x7fff9df80440, max_stack=max_stack@entry=127) at util/callchain.c:1127 #18 0x0000000000617e08 in hist_entry_iter__add (iter=iter@entry=0x7fff9df80480, al=al@entry=0x7fff9df80440, max_stack_depth=127, arg=arg@entry=0x7fff9df81ae0) at util/hist.c:1255 #19 0x000000000045d2d0 in process_sample_event (tool=0x7fff9df81ae0, event=<optimized out>, sample=0x7fff9df80540, evsel=0x106006d0, machine=<optimized out>) at builtin-report.c:334 #20 0x00000000005e3bb1 in perf_session__deliver_event (session=0x105ff2c0, event=0x7f5c7d735ca0, tool=0x7fff9df81ae0, file_offset=2914716832, file_path=0x105ffbf0 "perf.data") at util/session.c:1367 #21 0x00000000005e8d93 in do_flush (oe=0x105ffa50, show_progress=false) at util/ordered-events.c:245 #22 __ordered_events__flush (oe=0x105ffa50, how=OE_FLUSH__ROUND, timestamp=<optimized out>) at util/ordered-events.c:324 #23 0x00000000005e1f64 in perf_session__process_user_event (session=0x105ff2c0, event=0x7f5c7d752b18, file_offset=2914835224, file_path=0x105ffbf0 "perf.data") at util/session.c:1419 #24 0x00000000005e47c7 in reader__read_event (rd=rd@entry=0x7fff9df81260, session=session@entry=0x105ff2c0, --Type <RET> for more, q to quit, c to continue without paging-- quit prog=prog@entry=0x7fff9df81220) at util/session.c:2132 #25 0x00000000005e4b37 in reader__process_events (rd=0x7fff9df81260, session=0x105ff2c0, prog=0x7fff9df81220) at util/session.c:2181 #26 __perf_session__process_events (session=0x105ff2c0) at util/session.c:2226 #27 perf_session__process_events (session=session@entry=0x105ff2c0) at util/session.c:2390 #28 0x0000000000460add in __cmd_report (rep=0x7fff9df81ae0) at builtin-report.c:1076 #29 cmd_report (argc=<optimized out>, argv=<optimized out>) at builtin-report.c:1827 #30 0x00000000004c5a40 in run_builtin (p=p@entry=0xd8f7f8 <commands+312>, argc=argc@entry=1, argv=argv@entry=0x7fff9df844b0) at perf.c:351 #31 0x00000000004c5d63 in handle_internal_command (argc=argc@entry=1, argv=argv@entry=0x7fff9df844b0) at perf.c:404 #32 0x0000000000442de3 in run_argv (argcp=<synthetic pointer>, argv=<synthetic pointer>) at perf.c:448 #33 main (argc=<optimized out>, argv=0x7fff9df844b0) at perf.c:556 The hangup happens because nothing in` perf` or `elfutils` checks if a mapped file is easily readable. The change conservatively skips all non-regular files. Signed-off-by: Sergei Trofimovich <[email protected]> Acked-by: Namhyung Kim <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Namhyung Kim <[email protected]>

Symbolize stack traces by creating a live machine. Add this functionality to dump_stack and switch dump_stack users to use it. Switch TUI to use it. Add stack traces to the child test function which can be useful to diagnose blocked code. Example output: ``` $ perf test -vv PERF_RECORD_ ... 7: PERF_RECORD_* events & perf_sample fields: 7: PERF_RECORD_* events & perf_sample fields : Running (1 active) ^C Signal (2) while running tests. Terminating tests with the same signal Internal test harness failure. Completing any started tests: : 7: PERF_RECORD_* events & perf_sample fields: ---- unexpected signal (2) ---- #0 0x55788c6210a3 in child_test_sig_handler builtin-test.c:0 #1 0x7fc12fe49df0 in __restore_rt libc_sigaction.c:0 #2 0x7fc12fe99687 in __internal_syscall_cancel cancellation.c:64 #3 0x7fc12fee5f7a in clock_nanosleep@GLIBC_2.2.5 clock_nanosleep.c:72 #4 0x7fc12fef1393 in __nanosleep nanosleep.c:26 #5 0x7fc12ff02d68 in __sleep sleep.c:55 #6 0x55788c63196b in test__PERF_RECORD perf-record.c:0 #7 0x55788c620fb0 in run_test_child builtin-test.c:0 #8 0x55788c5bd18d in start_command run-command.c:127 #9 0x55788c621ef3 in __cmd_test builtin-test.c:0 #10 0x55788c6225bf in cmd_test ??:0 #11 0x55788c5afbd0 in run_builtin perf.c:0 #12 0x55788c5afeeb in handle_internal_command perf.c:0 #13 0x55788c52b383 in main ??:0 #14 0x7fc12fe33ca8 in __libc_start_call_main libc_start_call_main.h:74 #15 0x7fc12fe33d65 in __libc_start_main@@GLIBC_2.34 libc-start.c:128 #16 0x55788c52b9d1 in _start ??:0 ---- unexpected signal (2) ---- #0 0x55788c6210a3 in child_test_sig_handler builtin-test.c:0 #1 0x7fc12fe49df0 in __restore_rt libc_sigaction.c:0 #2 0x7fc12fea3a14 in pthread_sigmask@GLIBC_2.2.5 pthread_sigmask.c:45 #3 0x7fc12fe49fd9 in __GI___sigprocmask sigprocmask.c:26 #4 0x7fc12ff2601b in __longjmp_chk longjmp.c:36 #5 0x55788c6210c0 in print_test_result.isra.0 builtin-test.c:0 #6 0x7fc12fe49df0 in __restore_rt libc_sigaction.c:0 #7 0x7fc12fe99687 in __internal_syscall_cancel cancellation.c:64 #8 0x7fc12fee5f7a in clock_nanosleep@GLIBC_2.2.5 clock_nanosleep.c:72 #9 0x7fc12fef1393 in __nanosleep nanosleep.c:26 #10 0x7fc12ff02d68 in __sleep sleep.c:55 #11 0x55788c63196b in test__PERF_RECORD perf-record.c:0 #12 0x55788c620fb0 in run_test_child builtin-test.c:0 #13 0x55788c5bd18d in start_command run-command.c:127 #14 0x55788c621ef3 in __cmd_test builtin-test.c:0 #15 0x55788c6225bf in cmd_test ??:0 #16 0x55788c5afbd0 in run_builtin perf.c:0 #17 0x55788c5afeeb in handle_internal_command perf.c:0 #18 0x55788c52b383 in main ??:0 #19 0x7fc12fe33ca8 in __libc_start_call_main libc_start_call_main.h:74 #20 0x7fc12fe33d65 in __libc_start_main@@GLIBC_2.34 libc-start.c:128 #21 0x55788c52b9d1 in _start ??:0 7: PERF_RECORD_* events & perf_sample fields : Skip (permissions) ``` Signed-off-by: Ian Rogers <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Namhyung Kim <[email protected]>

Calling perf top with branch filters enabled on Intel CPU's with branch counters logging (A.K.A LBR event logging [1]) support results in a segfault. $ perf top -e '{cpu_core/cpu-cycles/,cpu_core/event=0xc6,umask=0x3,frontend=0x11,name=frontend_retired_dsb_miss/}' -j any,counter ... Thread 27 "perf" received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x7fffafff76c0 (LWP 949003)] perf_env__find_br_cntr_info (env=0xf66dc0 <perf_env>, nr=0x0, width=0x7fffafff62c0) at util/env.c:653 653 *width = env->cpu_pmu_caps ? env->br_cntr_width : (gdb) bt #0 perf_env__find_br_cntr_info (env=0xf66dc0 <perf_env>, nr=0x0, width=0x7fffafff62c0) at util/env.c:653 #1 0x00000000005b1599 in symbol__account_br_cntr (branch=0x7fffcc3db580, evsel=0xfea2d0, offset=12, br_cntr=8) at util/annotate.c:345 #2 0x00000000005b17fb in symbol__account_cycles (addr=5658172, start=5658160, sym=0x7fffcc0ee420, cycles=539, evsel=0xfea2d0, br_cntr=8) at util/annotate.c:389 #3 0x00000000005b1976 in addr_map_symbol__account_cycles (ams=0x7fffcd7b01d0, start=0x7fffcd7b02b0, cycles=539, evsel=0xfea2d0, br_cntr=8) at util/annotate.c:422 #4 0x000000000068d57f in hist__account_cycles (bs=0x110d288, al=0x7fffafff6540, sample=0x7fffafff6760, nonany_branch_mode=false, total_cycles=0x0, evsel=0xfea2d0) at util/hist.c:2850 #5 0x0000000000446216 in hist_iter__top_callback (iter=0x7fffafff6590, al=0x7fffafff6540, single=true, arg=0x7fffffff9e00) at builtin-top.c:737 #6 0x0000000000689787 in hist_entry_iter__add (iter=0x7fffafff6590, al=0x7fffafff6540, max_stack_depth=127, arg=0x7fffffff9e00) at util/hist.c:1359 #7 0x0000000000446710 in perf_event__process_sample (tool=0x7fffffff9e00, event=0x110d250, evsel=0xfea2d0, sample=0x7fffafff6760, machine=0x108c968) at builtin-top.c:845 #8 0x0000000000447735 in deliver_event (qe=0x7fffffffa120, qevent=0x10fc200) at builtin-top.c:1211 #9 0x000000000064ccae in do_flush (oe=0x7fffffffa120, show_progress=false) at util/ordered-events.c:245 #10 0x000000000064d005 in __ordered_events__flush (oe=0x7fffffffa120, how=OE_FLUSH__TOP, timestamp=0) at util/ordered-events.c:324 #11 0x000000000064d0ef in ordered_events__flush (oe=0x7fffffffa120, how=OE_FLUSH__TOP) at util/ordered-events.c:342 #12 0x00000000004472a9 in process_thread (arg=0x7fffffff9e00) at builtin-top.c:1120 #13 0x00007ffff6e7dba8 in start_thread (arg=<optimized out>) at pthread_create.c:448 #14 0x00007ffff6f01b8c in __GI___clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78 The cause is that perf_env__find_br_cntr_info tries to access a null pointer pmu_caps in the perf_env struct. A similar issue exists for homogeneous core systems which use the cpu_pmu_caps structure. Fix this by populating cpu_pmu_caps and pmu_caps structures with values from sysfs when calling perf top with branch stack sampling enabled. [1], LBR event logging introduced here: https://lore.kernel.org/all/[email protected]/ Reviewed-by: Ian Rogers <[email protected]> Signed-off-by: Thomas Falcon <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Namhyung Kim <[email protected]>

The test starts a workload and then opens events. If the events fail to open, for example because of perf_event_paranoid, the gopipe of the workload is leaked and the file descriptor leak check fails when the test exits. To avoid this cancel the workload when opening the events fails. Before: ``` $ perf test -vv 7 7: PERF_RECORD_* events & perf_sample fields: --- start --- test child forked, pid 1189568 Using CPUID GenuineIntel-6-B7-1 ------------------------------------------------------------ perf_event_attr: type 0 (PERF_TYPE_HARDWARE) config 0xa00000000 (cpu_atom/PERF_COUNT_HW_CPU_CYCLES/) disabled 1 ------------------------------------------------------------ sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8 sys_perf_event_open failed, error -13 ------------------------------------------------------------ perf_event_attr: type 0 (PERF_TYPE_HARDWARE) config 0xa00000000 (cpu_atom/PERF_COUNT_HW_CPU_CYCLES/) disabled 1 exclude_kernel 1 ------------------------------------------------------------ sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8 = 3 ------------------------------------------------------------ perf_event_attr: type 0 (PERF_TYPE_HARDWARE) config 0x400000000 (cpu_core/PERF_COUNT_HW_CPU_CYCLES/) disabled 1 ------------------------------------------------------------ sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8 sys_perf_event_open failed, error -13 ------------------------------------------------------------ perf_event_attr: type 0 (PERF_TYPE_HARDWARE) config 0x400000000 (cpu_core/PERF_COUNT_HW_CPU_CYCLES/) disabled 1 exclude_kernel 1 ------------------------------------------------------------ sys_perf_event_open: pid 0 cpu -1 group_fd -1 flags 0x8 = 3 Attempt to add: software/cpu-clock/ ..after resolving event: software/config=0/ cpu-clock -> software/cpu-clock/ ------------------------------------------------------------ perf_event_attr: type 1 (PERF_TYPE_SOFTWARE) size 136 config 0x9 (PERF_COUNT_SW_DUMMY) sample_type IP|TID|TIME|CPU read_format ID|LOST disabled 1 inherit 1 mmap 1 comm 1 enable_on_exec 1 task 1 sample_id_all 1 mmap2 1 comm_exec 1 ksymbol 1 bpf_event 1 { wakeup_events, wakeup_watermark } 1 ------------------------------------------------------------ sys_perf_event_open: pid 1189569 cpu 0 group_fd -1 flags 0x8 sys_perf_event_open failed, error -13 perf_evlist__open: Permission denied ---- end(-2) ---- Leak of file descriptor 6 that opened: 'pipe:[14200347]' ---- unexpected signal (6) ---- iFailed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon Failed to read build ID for //anon #0 0x565358f6666e in child_test_sig_handler builtin-test.c:311 #1 0x7f29ce849df0 in __restore_rt libc_sigaction.c:0 #2 0x7f29ce89e95c in __pthread_kill_implementation pthread_kill.c:44 #3 0x7f29ce849cc2 in raise raise.c:27 #4 0x7f29ce8324ac in abort abort.c:81 #5 0x565358f662d4 in check_leaks builtin-test.c:226 #6 0x565358f6682e in run_test_child builtin-test.c:344 #7 0x565358ef7121 in start_command run-command.c:128 #8 0x565358f67273 in start_test builtin-test.c:545 #9 0x565358f6771d in __cmd_test builtin-test.c:647 #10 0x565358f682bd in cmd_test builtin-test.c:849 #11 0x565358ee5ded in run_builtin perf.c:349 #12 0x565358ee6085 in handle_internal_command perf.c:401 #13 0x565358ee61de in run_argv perf.c:448 #14 0x565358ee6527 in main perf.c:555 #15 0x7f29ce833ca8 in __libc_start_call_main libc_start_call_main.h:74 #16 0x7f29ce833d65 in __libc_start_main@@GLIBC_2.34 libc-start.c:128 #17 0x565358e391c1 in _start perf[851c1] 7: PERF_RECORD_* events & perf_sample fields : FAILED! ``` After: ``` $ perf test 7 7: PERF_RECORD_* events & perf_sample fields : Skip (permissions) ``` Fixes: 16d00fe ("perf tests: Move test__PERF_RECORD into separate object") Signed-off-by: Ian Rogers <[email protected]> Tested-by: Arnaldo Carvalho de Melo <[email protected]> Cc: Adrian Hunter <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: Athira Rajeev <[email protected]> Cc: Chun-Tse Shao <[email protected]> Cc: Howard Chu <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: James Clark <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: Kan Liang <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Peter Zijlstra <[email protected]> Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

This reverts commit 3451cf3 and fixes the following KASAN complaint when running test zbd/013: BUG: KASAN: slab-use-after-free in null_handle_data_transfer+0x88c/0xe50 [null_blk] Write of size 4096 at addr ffff8881ab162000 by task (udev-worker)/78072 CPU: 8 UID: 0 PID: 78072 Comm: (udev-worker) Not tainted 6.18.0-rc5-dbg #14 PREEMPT 737e33391e24fa2fcd9958673f6992b5ee131a07 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Call Trace: <TASK> show_stack+0x4d/0x60 dump_stack_lvl+0x61/0x80 print_address_description.constprop.0+0x8b/0x310 print_report+0xfd/0x1d7 kasan_report+0xde/0x1c0 kasan_check_range+0x10c/0x1f0 __asan_memcpy+0x3f/0x70 null_handle_data_transfer+0x88c/0xe50 [null_blk] null_process_cmd+0x1a4/0x370 [null_blk] null_process_zoned_cmd+0x1ff/0x3c0 [null_blk] null_handle_cmd+0x1bd/0x580 [null_blk] null_queue_rq+0x568/0x970 [null_blk] null_queue_rqs+0xe5/0x2b0 [null_blk] __blk_mq_flush_list+0x83/0xb0 blk_mq_dispatch_queue_requests+0x3d7/0x660 blk_mq_flush_plug_list+0x1a1/0x730 __blk_flush_plug+0x290/0x540 blk_finish_plug+0x53/0xc0 read_pages+0x456/0xad0 page_cache_ra_unbounded+0x3cd/0x6e0 force_page_cache_ra+0x1f0/0x370 page_cache_sync_ra+0x158/0x870 filemap_get_pages+0x327/0xcb0 filemap_read+0x336/0xd30 blkdev_read_iter+0x15c/0x430 vfs_read+0x79a/0x1150 ksys_read+0xfd/0x230 __x64_sys_read+0x76/0xc0 x64_sys_call+0x143c/0x17e0 do_syscall_64+0x96/0x360 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> Allocated by task 0 on cpu 0 at 3226.274686s: kasan_save_stack+0x2a/0x50 kasan_save_track+0x1c/0x70 kasan_save_alloc_info+0x3d/0x50 __kasan_kmalloc+0xa0/0xb0 __kmalloc_cache_noprof+0x2e9/0x8a0 kmem_cache_free+0x590/0x870 mempool_free_slab+0x1b/0x20 mempool_free+0xd1/0x9b0 bio_free+0x15e/0x1c0 bio_put+0x34f/0x790 bio_endio+0x31d/0x6c0 blk_update_request+0x425/0xfb0 blk_mq_end_request+0x5d/0x370 null_cmd_timer_expired+0x43/0x60 [null_blk] __hrtimer_run_queues+0x53e/0xb40 hrtimer_interrupt+0x32f/0x850 __sysvec_apic_timer_interrupt+0xdc/0x360 sysvec_apic_timer_interrupt+0xa4/0xe0 asm_sysvec_apic_timer_interrupt+0x1f/0x30 Freed by task 14 on cpu 0 at 3226.398721s: kasan_save_stack+0x2a/0x50 kasan_save_track+0x1c/0x70 __kasan_save_free_info+0x3f/0x60 __kasan_slab_free+0x67/0x80 kfree+0x170/0x780 slab_free_after_rcu_debug+0x6c/0x250 rcu_do_batch+0x369/0x13f0 rcu_core+0x385/0x5a0 rcu_core_si+0x12/0x20 handle_softirqs+0x1a3/0x930 run_ksoftirqd+0x3e/0x60 smpboot_thread_fn+0x311/0xa00 kthread+0x3cc/0x830 ret_from_fork+0x39c/0x500 ret_from_fork_asm+0x11/0x20 Last potentially related work creation: kasan_save_stack+0x2a/0x50 kasan_record_aux_stack+0xad/0xc0 __call_rcu_common.constprop.0+0xfb/0xbb0 call_rcu+0x12/0x20 kmem_cache_free+0x5bc/0x870 mempool_free_slab+0x1b/0x20 mempool_free+0xd1/0x9b0 bio_free+0x15e/0x1c0 bio_put+0x34f/0x790 bio_endio+0x31d/0x6c0 blk_update_request+0x425/0xfb0 blk_mq_end_request+0x5d/0x370 null_cmd_timer_expired+0x43/0x60 [null_blk] __hrtimer_run_queues+0x53e/0xb40 hrtimer_interrupt+0x32f/0x850 __sysvec_apic_timer_interrupt+0xdc/0x360 sysvec_apic_timer_interrupt+0xa4/0xe0 asm_sysvec_apic_timer_interrupt+0x1f/0x30 The buggy address belongs to the object at ffff8881ab162000 which belongs to the cache kmalloc-32 of size 32 The buggy address is located 0 bytes inside of freed 32-byte region [ffff8881ab162000, ffff8881ab162020) Cc: Keith Busch <[email protected]> Signed-off-by: Bart Van Assche <[email protected]>

This reverts commit 3451cf3 and fixes the following KASAN complaint when running test zbd/013: BUG: KASAN: slab-use-after-free in null_handle_data_transfer+0x88c/0xe50 [null_blk] Write of size 4096 at addr ffff8881ab162000 by task (udev-worker)/78072 CPU: 8 UID: 0 PID: 78072 Comm: (udev-worker) Not tainted 6.18.0-rc5-dbg #14 PREEMPT 737e33391e24fa2fcd9958673f6992b5ee131a07 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 Call Trace: <TASK> show_stack+0x4d/0x60 dump_stack_lvl+0x61/0x80 print_address_description.constprop.0+0x8b/0x310 print_report+0xfd/0x1d7 kasan_report+0xde/0x1c0 kasan_check_range+0x10c/0x1f0 __asan_memcpy+0x3f/0x70 null_handle_data_transfer+0x88c/0xe50 [null_blk] null_process_cmd+0x1a4/0x370 [null_blk] null_process_zoned_cmd+0x1ff/0x3c0 [null_blk] null_handle_cmd+0x1bd/0x580 [null_blk] null_queue_rq+0x568/0x970 [null_blk] null_queue_rqs+0xe5/0x2b0 [null_blk] __blk_mq_flush_list+0x83/0xb0 blk_mq_dispatch_queue_requests+0x3d7/0x660 blk_mq_flush_plug_list+0x1a1/0x730 __blk_flush_plug+0x290/0x540 blk_finish_plug+0x53/0xc0 read_pages+0x456/0xad0 page_cache_ra_unbounded+0x3cd/0x6e0 force_page_cache_ra+0x1f0/0x370 page_cache_sync_ra+0x158/0x870 filemap_get_pages+0x327/0xcb0 filemap_read+0x336/0xd30 blkdev_read_iter+0x15c/0x430 vfs_read+0x79a/0x1150 ksys_read+0xfd/0x230 __x64_sys_read+0x76/0xc0 x64_sys_call+0x143c/0x17e0 do_syscall_64+0x96/0x360 entry_SYSCALL_64_after_hwframe+0x4b/0x53 </TASK> Allocated by task 0 on cpu 0 at 3226.274686s: kasan_save_stack+0x2a/0x50 kasan_save_track+0x1c/0x70 kasan_save_alloc_info+0x3d/0x50 __kasan_kmalloc+0xa0/0xb0 __kmalloc_cache_noprof+0x2e9/0x8a0 kmem_cache_free+0x590/0x870 mempool_free_slab+0x1b/0x20 mempool_free+0xd1/0x9b0 bio_free+0x15e/0x1c0 bio_put+0x34f/0x790 bio_endio+0x31d/0x6c0 blk_update_request+0x425/0xfb0 blk_mq_end_request+0x5d/0x370 null_cmd_timer_expired+0x43/0x60 [null_blk] __hrtimer_run_queues+0x53e/0xb40 hrtimer_interrupt+0x32f/0x850 __sysvec_apic_timer_interrupt+0xdc/0x360 sysvec_apic_timer_interrupt+0xa4/0xe0 asm_sysvec_apic_timer_interrupt+0x1f/0x30 Freed by task 14 on cpu 0 at 3226.398721s: kasan_save_stack+0x2a/0x50 kasan_save_track+0x1c/0x70 __kasan_save_free_info+0x3f/0x60 __kasan_slab_free+0x67/0x80 kfree+0x170/0x780 slab_free_after_rcu_debug+0x6c/0x250 rcu_do_batch+0x369/0x13f0 rcu_core+0x385/0x5a0 rcu_core_si+0x12/0x20 handle_softirqs+0x1a3/0x930 run_ksoftirqd+0x3e/0x60 smpboot_thread_fn+0x311/0xa00 kthread+0x3cc/0x830 ret_from_fork+0x39c/0x500 ret_from_fork_asm+0x11/0x20 Last potentially related work creation: kasan_save_stack+0x2a/0x50 kasan_record_aux_stack+0xad/0xc0 __call_rcu_common.constprop.0+0xfb/0xbb0 call_rcu+0x12/0x20 kmem_cache_free+0x5bc/0x870 mempool_free_slab+0x1b/0x20 mempool_free+0xd1/0x9b0 bio_free+0x15e/0x1c0 bio_put+0x34f/0x790 bio_endio+0x31d/0x6c0 blk_update_request+0x425/0xfb0 blk_mq_end_request+0x5d/0x370 null_cmd_timer_expired+0x43/0x60 [null_blk] __hrtimer_run_queues+0x53e/0xb40 hrtimer_interrupt+0x32f/0x850 __sysvec_apic_timer_interrupt+0xdc/0x360 sysvec_apic_timer_interrupt+0xa4/0xe0 asm_sysvec_apic_timer_interrupt+0x1f/0x30 The buggy address belongs to the object at ffff8881ab162000 which belongs to the cache kmalloc-32 of size 32 The buggy address is located 0 bytes inside of freed 32-byte region [ffff8881ab162000, ffff8881ab162020) Cc: Keith Busch <[email protected]> Signed-off-by: Bart Van Assche <[email protected]> Reviewed-by: Chaitanya Kulkarni <[email protected]>

The ETM decoder incorrectly assumed that auxtrace queue indices were equivalent to CPU number. This assumption is used for inserting records into the queue, and for fetching queues when given a CPU number. This assumption held when Perf always opened a dummy event on every CPU, even if the user provided a subset of CPUs on the commandline, resulting in the indices aligning. For example: # event : name = cs_etm//u, , id = { 2451, 2452 }, type = 11 (cs_etm), size = 136, config = 0x4010, { sample_period, samp> # event : name = dummy:u, , id = { 2453, 2454, 2455, 2456 }, type = 1 (PERF_TYPE_SOFTWARE), size = 136, config = 0x9 (PER> 0 0 0x200 [0xd0]: PERF_RECORD_ID_INDEX nr: 6 ... id: 2451 idx: 2 cpu: 2 tid: -1 ... id: 2452 idx: 3 cpu: 3 tid: -1 ... id: 2453 idx: 0 cpu: 0 tid: -1 ... id: 2454 idx: 1 cpu: 1 tid: -1 ... id: 2455 idx: 2 cpu: 2 tid: -1 ... id: 2456 idx: 3 cpu: 3 tid: -1 Since commit 811082e ("perf parse-events: Support user CPUs mixed with threads/processes") the dummy event no longer behaves in this way, making the ETM event indices start from 0 on the first CPU recorded regardless of its ID: # event : name = cs_etm//u, , id = { 771, 772 }, type = 11 (cs_etm), size = 144, config = 0x4010, { sample_period, sample> # event : name = dummy:u, , id = { 773, 774 }, type = 1 (PERF_TYPE_SOFTWARE), size = 144, config = 0x9 (PERF_COUNT_SW_DUM> 0 0 0x200 [0x90]: PERF_RECORD_ID_INDEX nr: 4 ... id: 771 idx: 0 cpu: 2 tid: -1 ... id: 772 idx: 1 cpu: 3 tid: -1 ... id: 773 idx: 0 cpu: 2 tid: -1 ... id: 774 idx: 1 cpu: 3 tid: -1 This causes the following segfault when decoding: $ perf record -e cs_etm//u -C 2,3 -- true $ perf report perf: Segmentation fault -------- backtrace -------- #0 0xaaaabf9fd020 in ui__signal_backtrace setup.c:110 #1 0xffffab5c7930 in __kernel_rt_sigreturn [vdso][930] #2 0xaaaabfb68d30 in cs_etm_decoder__reset cs-etm-decoder.c:85 #3 0xaaaabfb65930 in cs_etm__get_data_block cs-etm.c:2032 #4 0xaaaabfb666fc in cs_etm__run_per_cpu_timeless_decoder cs-etm.c:2551 #5 0xaaaabfb6692c in (cs_etm__process_timeless_queues cs-etm.c:2612 #6 0xaaaabfb63390 in cs_etm__flush_events cs-etm.c:921 #7 0xaaaabfb324c0 in auxtrace__flush_events auxtrace.c:2915 #8 0xaaaabfaac378 in __perf_session__process_events session.c:2285 #9 0xaaaabfaacc9c in perf_session__process_events session.c:2442 #10 0xaaaabf8d3d90 in __cmd_report builtin-report.c:1085 #11 0xaaaabf8d6944 in cmd_report builtin-report.c:1866 #12 0xaaaabf95ebfc in run_builtin perf.c:351 #13 0xaaaabf95eeb0 in handle_internal_command perf.c:404 #14 0xaaaabf95f068 in run_argv perf.c:451 #15 0xaaaabf95f390 in main perf.c:558 #16 0xffffaab97400 in __libc_start_call_main libc_start_call_main.h:74 #17 0xffffaab974d8 in __libc_start_main@@GLIBC_2.34 libc-start.c:128 #18 0xaaaabf8aa8f0 in _start perf[7a8f0] Fix it by inserting into the queues based on CPU number, rather than using the index. Fixes: 811082e ("perf parse-events: Support user CPUs mixed with threads/processes") Signed-off-by: James Clark <[email protected]> Tested-by: Leo Yan <[email protected]> Cc: Adrian Hunter <[email protected]> Cc: Alexander Shishkin <[email protected]> Cc: [email protected] Cc: Ian Rogers <[email protected]> Cc: Ingo Molnar <[email protected]> Cc: Jiri Olsa <[email protected]> Cc: John Garry <[email protected]> Cc: Mark Rutland <[email protected]> Cc: Mike Leach <[email protected]> Cc: Namhyung Kim <[email protected]> Cc: Peter Zijlstra <[email protected]> Cc: Suzuki Poulouse <[email protected]> Cc: Thomas Falcon <[email protected]> Cc: Will Deacon <[email protected]> Signed-off-by: Arnaldo Carvalho de Melo <[email protected]>

Ihor Solodrai says: ==================== selftests/bpf: Fixes for userspace ASAN This series includes various fixes aiming to enable test_progs run with userspace address sanitizer on BPF CI. The first five patches add a simplified implementation of strscpy() to selftests/bpf and then replace strcpy/strncpy usages across the tests with it. See relevant discussions [1][2]. Patch #6 fixes the selftests/bpf/test_progs build with: SAN_CFLAGS="-fsanitize=address -fno-omit-frame-pointer" The subsequent patches fix bugs reported by the address sanitizer on attempt to run the tests. [1] https://lore.kernel.org/bpf/CAADnVQ+9uw2_o388j43EWiAPdMB=3FLx2jq-9zRSvqrv-wgRag@mail.gmail.com/ [2] https://lore.kernel.org/bpf/[email protected]/ --- v3->v4: - combine strscpy and ASAN series into one (Alexei) - make the count arg of strscpy() optional via macro and fixup relevant call sites (Alexei) - remove strscpy_cat() from this series (Alexei) v3: https://lore.kernel.org/bpf/[email protected]/ v2->v3: - rebase on top of "selftests/bpf: Add and use strscpy()" - https://lore.kernel.org/bpf/[email protected]/ - uprobe_multi_test.c: memset static struct child at the beginning of a test *and* zero out child->thread in release_child (patch #9, Mykyta) - nits in test_sysctl.c (patch #11, Eduard) - bpftool_helpers.c: update to use strscpy (patch #14, Alexei) - add __asan_on_error handler to still dump test logs even with ASAN build (patch #15, Mykyta) v2: https://lore.kernel.org/bpf/[email protected]/ v1->v2: - rebase on bpf (v1 was targeting bpf-next) - add ASAN flag handling in selftests/bpf/Makefile (Eduard) - don't override SIGSEGV handler in test_progs with ASAN (Eduard) - add error messages in detect_bpftool_path (Mykyta) - various nits (Eduard, Jiri, Mykyta, Alexis) v1: https://lore.kernel.org/bpf/[email protected]/ ==================== Link: https://patch.msgid.link/[email protected] Signed-off-by: Alexei Starovoitov <[email protected]>

For target nvmet_ctrl_free() flushes ctrl->async_event_work. If nvmet_ctrl_free() runs on nvmet-wq, the flush re-enters workqueue completion for the same worker:- A. Async event work queued on nvmet-wq (prior to disconnect): nvmet_execute_async_event() queue_work(nvmet_wq, &ctrl->async_event_work) nvmet_add_async_event() queue_work(nvmet_wq, &ctrl->async_event_work) B. Full pre-work chain (RDMA CM path): nvmet_rdma_cm_handler() nvmet_rdma_queue_disconnect() __nvmet_rdma_queue_disconnect() queue_work(nvmet_wq, &queue->release_work) process_one_work() lock((wq_completion)nvmet-wq) <--------- 1st nvmet_rdma_release_queue_work() C. Recursive path (same worker): nvmet_rdma_release_queue_work() nvmet_rdma_free_queue() nvmet_sq_destroy() nvmet_ctrl_put() nvmet_ctrl_free() flush_work(&ctrl->async_event_work) __flush_work() touch_wq_lockdep_map() lock((wq_completion)nvmet-wq) <--------- 2nd Lockdep splat: ============================================ WARNING: possible recursive locking detected 6.19.0-rc3nvme+ #14 Tainted: G N -------------------------------------------- kworker/u192:42/44933 is trying to acquire lock: ffff888118a00948 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: touch_wq_lockdep_map+0x26/0x90 but task is already holding lock: ffff888118a00948 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: process_one_work+0x53e/0x660 3 locks held by kworker/u192:42/44933: #0: ffff888118a00948 ((wq_completion)nvmet-wq){+.+.}-{0:0}, at: process_one_work+0x53e/0x660 #1: ffffc9000e6cbe28 ((work_completion)(&queue->release_work)){+.+.}-{0:0}, at: process_one_work+0x1c5/0x660 #2: ffffffff82d4db60 (rcu_read_lock){....}-{1:3}, at: __flush_work+0x62/0x530 Workqueue: nvmet-wq nvmet_rdma_release_queue_work [nvmet_rdma] Call Trace: __flush_work+0x268/0x530 nvmet_ctrl_free+0x140/0x310 [nvmet] nvmet_cq_put+0x74/0x90 [nvmet] nvmet_rdma_free_queue+0x23/0xe0 [nvmet_rdma] nvmet_rdma_release_queue_work+0x19/0x50 [nvmet_rdma] process_one_work+0x206/0x660 worker_thread+0x184/0x320 kthread+0x10c/0x240 ret_from_fork+0x319/0x390 Move async event work to a dedicated nvmet-aen-wq to avoid reentrant flush on nvmet-wq. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Chaitanya Kulkarni <[email protected]> Signed-off-by: Keith Busch <[email protected]>

This fixes the following trace caused by not dropping l2cap_conn reference when user->remove callback is called: [ 97.809249] l2cap_conn_free: freeing conn ffff88810a171c00 [ 97.809907] CPU: 1 UID: 0 PID: 1419 Comm: repro_standalon Not tainted 7.0.0-rc1-dirty #14 PREEMPT(lazy) [ 97.809935] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014 [ 97.809947] Call Trace: [ 97.809954] <TASK> [ 97.809961] dump_stack_lvl (lib/dump_stack.c:122) [ 97.809990] l2cap_conn_free (net/bluetooth/l2cap_core.c:1808) [ 97.810017] l2cap_conn_del (./include/linux/kref.h:66 net/bluetooth/l2cap_core.c:1821 net/bluetooth/l2cap_core.c:1798) [ 97.810055] l2cap_disconn_cfm (net/bluetooth/l2cap_core.c:7347 (discriminator 1) net/bluetooth/l2cap_core.c:7340 (discriminator 1)) [ 97.810086] ? __pfx_l2cap_disconn_cfm (net/bluetooth/l2cap_core.c:7341) [ 97.810117] hci_conn_hash_flush (./include/net/bluetooth/hci_core.h:2152 (discriminator 2) net/bluetooth/hci_conn.c:2644 (discriminator 2)) [ 97.810148] hci_dev_close_sync (net/bluetooth/hci_sync.c:5360) [ 97.810180] ? __pfx_hci_dev_close_sync (net/bluetooth/hci_sync.c:5285) [ 97.810212] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 97.810242] ? up_write (./arch/x86/include/asm/atomic64_64.h:87 (discriminator 5) ./include/linux/atomic/atomic-arch-fallback.h:2852 (discriminator 5) ./include/linux/atomic/atomic-long.h:268 (discriminator 5) ./include/linux/atomic/atomic-instrumented.h:3391 (discriminator 5) kernel/locking/rwsem.c:1385 (discriminator 5) kernel/locking/rwsem.c:1643 (discriminator 5)) [ 97.810267] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 97.810290] ? rcu_is_watching (./arch/x86/include/asm/atomic.h:23 ./include/linux/atomic/atomic-arch-fallback.h:457 ./include/linux/context_tracking.h:128 kernel/rcu/tree.c:752) [ 97.810320] hci_unregister_dev (net/bluetooth/hci_core.c:504 net/bluetooth/hci_core.c:2716) [ 97.810346] vhci_release (drivers/bluetooth/hci_vhci.c:691) [ 97.810375] ? __pfx_vhci_release (drivers/bluetooth/hci_vhci.c:678) [ 97.810404] __fput (fs/file_table.c:470) [ 97.810430] task_work_run (kernel/task_work.c:235) [ 97.810451] ? __pfx_task_work_run (kernel/task_work.c:201) [ 97.810472] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 97.810495] ? do_raw_spin_unlock (./include/asm-generic/qspinlock.h:128 (discriminator 5) kernel/locking/spinlock_debug.c:142 (discriminator 5)) [ 97.810527] do_exit (kernel/exit.c:972) [ 97.810547] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 97.810574] ? __pfx_do_exit (kernel/exit.c:897) [ 97.810594] ? lock_acquire (kernel/locking/lockdep.c:470 (discriminator 6) kernel/locking/lockdep.c:5870 (discriminator 6) kernel/locking/lockdep.c:5825 (discriminator 6)) [ 97.810616] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 97.810639] ? do_raw_spin_lock (kernel/locking/spinlock_debug.c:95 (discriminator 4) kernel/locking/spinlock_debug.c:118 (discriminator 4)) [ 97.810664] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 97.810688] ? find_held_lock (kernel/locking/lockdep.c:5350 (discriminator 1)) [ 97.810721] do_group_exit (kernel/exit.c:1093) [ 97.810745] get_signal (kernel/signal.c:3007 (discriminator 1)) [ 97.810772] ? security_file_permission (./arch/x86/include/asm/jump_label.h:37 security/security.c:2366) [ 97.810803] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 97.810826] ? vfs_read (fs/read_write.c:555) [ 97.810854] ? __pfx_get_signal (kernel/signal.c:2800) [ 97.810880] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 97.810905] ? __pfx_vfs_read (fs/read_write.c:555) [ 97.810932] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 97.810960] arch_do_signal_or_restart (arch/x86/kernel/signal.c:337 (discriminator 1)) [ 97.810990] ? __pfx_arch_do_signal_or_restart (arch/x86/kernel/signal.c:334) [ 97.811021] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 97.811055] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 97.811078] ? ksys_read (fs/read_write.c:707) [ 97.811106] ? __pfx_ksys_read (fs/read_write.c:707) [ 97.811137] exit_to_user_mode_loop (kernel/entry/common.c:66 kernel/entry/common.c:98) [ 97.811169] ? rcu_is_watching (./arch/x86/include/asm/atomic.h:23 ./include/linux/atomic/atomic-arch-fallback.h:457 ./include/linux/context_tracking.h:128 kernel/rcu/tree.c:752) [ 97.811192] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 97.811215] ? trace_hardirqs_off (./include/trace/events/preemptirq.h:36 (discriminator 33) kernel/trace/trace_preemptirq.c:95 (discriminator 33) kernel/trace/trace_preemptirq.c:90 (discriminator 33)) [ 97.811240] do_syscall_64 (./include/linux/irq-entry-common.h:226 ./include/linux/irq-entry-common.h:256 ./include/linux/entry-common.h:325 arch/x86/entry/syscall_64.c:100) [ 97.811268] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 97.811292] ? exc_page_fault (arch/x86/mm/fault.c:1480 (discriminator 3) arch/x86/mm/fault.c:1527 (discriminator 3)) [ 97.811318] entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130) [ 97.811338] RIP: 0033:0x445cfe [ 97.811352] Code: Unable to access opcode bytes at 0x445cd4. Code starting with the faulting instruction =========================================== [ 97.811360] RSP: 002b:00007f65c41c6dc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [ 97.811378] RAX: fffffffffffffe00 RBX: 00007f65c41c76c0 RCX: 0000000000445cfe [ 97.811391] RDX: 0000000000000400 RSI: 00007f65c41c6e40 RDI: 0000000000000004 [ 97.811403] RBP: 00007f65c41c7250 R08: 0000000000000000 R09: 0000000000000000 [ 97.811415] R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffe8 [ 97.811428] R13: 0000000000000000 R14: 00007fff780a8c00 R15: 00007f65c41c76c0 [ 97.811453] </TASK> [ 98.402453] ================================================================== [ 98.403560] BUG: KASAN: use-after-free in __mutex_lock (kernel/locking/mutex.c:199 kernel/locking/mutex.c:694 kernel/locking/mutex.c:776) [ 98.404541] Read of size 8 at addr ffff888113ee40a8 by task khidpd_00050004/1430 [ 98.405361] [ 98.405563] CPU: 1 UID: 0 PID: 1430 Comm: khidpd_00050004 Not tainted 7.0.0-rc1-dirty #14 PREEMPT(lazy) [ 98.405588] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014 [ 98.405600] Call Trace: [ 98.405607] <TASK> [ 98.405614] dump_stack_lvl (lib/dump_stack.c:122) [ 98.405641] print_report (mm/kasan/report.c:379 mm/kasan/report.c:482) [ 98.405667] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 98.405691] ? __virt_addr_valid (arch/x86/mm/physaddr.c:55) [ 98.405724] ? __mutex_lock (kernel/locking/mutex.c:199 kernel/locking/mutex.c:694 kernel/locking/mutex.c:776) [ 98.405748] kasan_report (mm/kasan/report.c:221 mm/kasan/report.c:597) [ 98.405778] ? __mutex_lock (kernel/locking/mutex.c:199 kernel/locking/mutex.c:694 kernel/locking/mutex.c:776) [ 98.405807] __mutex_lock (kernel/locking/mutex.c:199 kernel/locking/mutex.c:694 kernel/locking/mutex.c:776) [ 98.405832] ? do_raw_spin_lock (kernel/locking/spinlock_debug.c:95 (discriminator 4) kernel/locking/spinlock_debug.c:118 (discriminator 4)) [ 98.405859] ? l2cap_unregister_user (./include/linux/list.h:381 (discriminator 2) net/bluetooth/l2cap_core.c:1723 (discriminator 2)) [ 98.405888] ? __pfx_do_raw_spin_lock (kernel/locking/spinlock_debug.c:114) [ 98.405915] ? __pfx___mutex_lock (kernel/locking/mutex.c:775) [ 98.405939] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 98.405963] ? lock_acquire (kernel/locking/lockdep.c:470 (discriminator 6) kernel/locking/lockdep.c:5870 (discriminator 6) kernel/locking/lockdep.c:5825 (discriminator 6)) [ 98.405984] ? find_held_lock (kernel/locking/lockdep.c:5350 (discriminator 1)) [ 98.406015] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 98.406038] ? lock_release (kernel/locking/lockdep.c:5536 kernel/locking/lockdep.c:5889 kernel/locking/lockdep.c:5875) [ 98.406061] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 98.406085] ? _raw_spin_unlock_irqrestore (./arch/x86/include/asm/irqflags.h:42 ./arch/x86/include/asm/irqflags.h:119 ./arch/x86/include/asm/irqflags.h:159 ./include/linux/spinlock_api_smp.h:178 kernel/locking/spinlock.c:194) [ 98.406107] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 98.406130] ? __timer_delete_sync (kernel/time/timer.c:1592) [ 98.406158] ? l2cap_unregister_user (./include/linux/list.h:381 (discriminator 2) net/bluetooth/l2cap_core.c:1723 (discriminator 2)) [ 98.406186] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 98.406210] l2cap_unregister_user (./include/linux/list.h:381 (discriminator 2) net/bluetooth/l2cap_core.c:1723 (discriminator 2)) [ 98.406263] hidp_session_thread (./include/linux/instrumented.h:112 ./include/linux/atomic/atomic-instrumented.h:400 ./include/linux/refcount.h:389 ./include/linux/refcount.h:432 ./include/linux/refcount.h:450 ./include/linux/kref.h:64 net/bluetooth/hidp/core.c:996 net/bluetooth/hidp/core.c:1305) [ 98.406293] ? __pfx_hidp_session_thread (net/bluetooth/hidp/core.c:1264) [ 98.406323] ? kthread (kernel/kthread.c:433) [ 98.406340] ? __pfx_hidp_session_wake_function (net/bluetooth/hidp/core.c:1251) [ 98.406370] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 98.406393] ? find_held_lock (kernel/locking/lockdep.c:5350 (discriminator 1)) [ 98.406424] ? __pfx_hidp_session_wake_function (net/bluetooth/hidp/core.c:1251) [ 98.406453] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 98.406476] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:79 (discriminator 1)) [ 98.406499] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 98.406523] ? kthread (kernel/kthread.c:433) [ 98.406539] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 98.406565] ? kthread (kernel/kthread.c:433) [ 98.406581] ? __pfx_hidp_session_thread (net/bluetooth/hidp/core.c:1264) [ 98.406610] kthread (kernel/kthread.c:467) [ 98.406627] ? __pfx_kthread (kernel/kthread.c:412) [ 98.406645] ret_from_fork (arch/x86/kernel/process.c:164) [ 98.406674] ? __pfx_ret_from_fork (arch/x86/kernel/process.c:153) [ 98.406704] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 98.406728] ? __pfx_kthread (kernel/kthread.c:412) [ 98.406747] ret_from_fork_asm (arch/x86/entry/entry_64.S:258) [ 98.406774] </TASK> [ 98.406780] [ 98.433693] The buggy address belongs to the physical page: [ 98.434405] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0xffff888113ee7c40 pfn:0x113ee4 [ 98.435557] flags: 0x200000000000000(node=0|zone=2) [ 98.436198] raw: 0200000000000000 ffffea0004244308 ffff8881f6f3ebc0 0000000000000000 [ 98.437195] raw: ffff888113ee7c40 0000000000000000 00000000ffffffff 0000000000000000 [ 98.438115] page dumped because: kasan: bad access detected [ 98.438951] [ 98.439211] Memory state around the buggy address: [ 98.439871] ffff888113ee3f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 98.440714] ffff888113ee4000: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 98.441580] >ffff888113ee4080: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 98.442458] ^ [ 98.443011] ffff888113ee4100: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 98.443889] ffff888113ee4180: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [ 98.444768] ================================================================== [ 98.445719] Disabling lock debugging due to kernel taint [ 98.448074] l2cap_conn_free: freeing conn ffff88810c22b400 [ 98.450012] CPU: 1 UID: 0 PID: 1430 Comm: khidpd_00050004 Tainted: G B 7.0.0-rc1-dirty #14 PREEMPT(lazy) [ 98.450040] Tainted: [B]=BAD_PAGE [ 98.450047] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014 [ 98.450059] Call Trace: [ 98.450065] <TASK> [ 98.450071] dump_stack_lvl (lib/dump_stack.c:122) [ 98.450099] l2cap_conn_free (net/bluetooth/l2cap_core.c:1808) [ 98.450125] l2cap_conn_put (net/bluetooth/l2cap_core.c:1822) [ 98.450154] session_free (net/bluetooth/hidp/core.c:990) [ 98.450181] hidp_session_thread (net/bluetooth/hidp/core.c:1307) [ 98.450213] ? __pfx_hidp_session_thread (net/bluetooth/hidp/core.c:1264) [ 98.450271] ? kthread (kernel/kthread.c:433) [ 98.450293] ? __pfx_hidp_session_wake_function (net/bluetooth/hidp/core.c:1251) [ 98.450339] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 98.450368] ? find_held_lock (kernel/locking/lockdep.c:5350 (discriminator 1)) [ 98.450406] ? __pfx_hidp_session_wake_function (net/bluetooth/hidp/core.c:1251) [ 98.450442] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 98.450471] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:79 (discriminator 1)) [ 98.450499] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 98.450528] ? kthread (kernel/kthread.c:433) [ 98.450547] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 98.450578] ? kthread (kernel/kthread.c:433) [ 98.450598] ? __pfx_hidp_session_thread (net/bluetooth/hidp/core.c:1264) [ 98.450637] kthread (kernel/kthread.c:467) [ 98.450657] ? __pfx_kthread (kernel/kthread.c:412) [ 98.450680] ret_from_fork (arch/x86/kernel/process.c:164) [ 98.450715] ? __pfx_ret_from_fork (arch/x86/kernel/process.c:153) [ 98.450752] ? srso_alias_return_thunk (arch/x86/lib/retpoline.S:221) [ 98.450782] ? __pfx_kthread (kernel/kthread.c:412) [ 98.450804] ret_from_fork_asm (arch/x86/entry/entry_64.S:258) [ 98.450836] </TASK> Fixes: b4f34d8 ("Bluetooth: hidp: add new session-management helpers") Reported-by: soufiane el hachmi <[email protected]> Tested-by: soufiane el hachmi <[email protected]> Signed-off-by: Luiz Augusto von Dentz <[email protected]>

Cancelling the I/O and admin tagsets during nvme-loop controller reset or shutdown is unnecessary. The subsequent destruction of the I/O and admin queues already waits for all in-flight target operations to complete. Cancelling the tagsets first also opens a race window. After a request tag has been cancelled, a late completion from the target may still arrive before the queues are destroyed. In that case the completion path may access a request whose tag has already been cancelled or freed, which can lead to a kernel crash. Please see below the kernel crash encountered while running blktests nvme/040: run blktests nvme/040 at 2026-03-08 06:34:27 loop0: detected capacity change from 0 to 2097152 nvmet: adding nsid 1 to subsystem blktests-subsystem-1 nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349. nvme nvme6: creating 96 I/O queues. nvme nvme6: new ctrl: "blktests-subsystem-1" nvme_log_error: 1 callbacks suppressed block nvme6n1: no usable path - requeuing I/O nvme6c6n1: Read(0x2) @ LBA 2096384, 128 blocks, Host Aborted Command (sct 0x3 / sc 0x71) blk_print_req_error: 1 callbacks suppressed I/O error, dev nvme6c6n1, sector 2096384 op 0x0:(READ) flags 0x2880700 phys_seg 1 prio class 2 block nvme6n1: no usable path - requeuing I/O Kernel attempted to read user page (236) - exploit attempt? (uid: 0) BUG: Kernel NULL pointer dereference on read at 0x00000236 Faulting instruction address: 0xc000000000961274 Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries Modules linked in: nvme_loop nvme_fabrics loop nvmet null_blk rpadlpar_io rpaphp xsk_diag bonding rfkill nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink pseries_rng dax_pmem vmx_crypto drm drm_panel_orientation_quirks xfs mlx5_core nvme bnx2x sd_mod nd_pmem nd_btt nvme_core sg papr_scm tls libnvdimm ibmvscsi ibmveth scsi_transport_srp nvme_keyring nvme_auth mdio hkdf pseries_wdt dm_mirror dm_region_hash dm_log dm_mod fuse [last unloaded: loop] CPU: 25 UID: 0 PID: 0 Comm: swapper/25 Kdump: loaded Not tainted 7.0.0-rc3+ #14 PREEMPT Hardware name: IBM,9043-MRX Power11 (architected) 0x820200 0xf000007 of:IBM,FW1120.00 (RF1120_128) hv:phyp pSeries NIP: c000000000961274 LR: c008000009af1808 CTR: c00000000096124c REGS: c0000007ffc0f910 TRAP: 0300 Not tainted (7.0.0-rc3+) MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22222222 XER: 00000000 CFAR: c008000009af232c DAR: 0000000000000236 DSISR: 40000000 IRQMASK: 0 GPR00: c008000009af17fc c0000007ffc0fbb0 c000000001c78100 c0000000be05cc00 GPR04: 0000000000000001 0000000000000000 0000000000000007 0000000000000000 GPR08: 0000000000000000 0000000000000000 0000000000000002 c008000009af2318 GPR12: c00000000096124c c0000007ffdab880 0000000000000000 0000000000000000 GPR16: 0000000000000010 0000000000000000 0000000000000004 0000000000000000 GPR20: 0000000000000001 c000000002ca2b00 0000000100043bb2 000000000000000a GPR24: 000000000000000a 0000000000000000 0000000000000000 0000000000000000 GPR28: c000000084021d40 c000000084021d50 c0000000be05cd60 c0000000be05cc00 NIP [c000000000961274] blk_mq_complete_request_remote+0x28/0x2d4 LR [c008000009af1808] nvme_loop_queue_response+0x110/0x290 [nvme_loop] Call Trace: 0xc00000000502c640 (unreliable) nvme_loop_queue_response+0x104/0x290 [nvme_loop] __nvmet_req_complete+0x80/0x498 [nvmet] nvmet_req_complete+0x24/0xf8 [nvmet] nvmet_bio_done+0x58/0xcc [nvmet] bio_endio+0x250/0x390 blk_update_request+0x2e8/0x68c blk_mq_end_request+0x30/0x5c lo_complete_rq+0x94/0x110 [loop] blk_complete_reqs+0x78/0x98 handle_softirqs+0x148/0x454 do_softirq_own_stack+0x3c/0x50 __irq_exit_rcu+0x18c/0x1b4 irq_exit+0x1c/0x34 do_IRQ+0x114/0x278 hardware_interrupt_common_virt+0x28c/0x290 Since the queue teardown path already guarantees that all target-side operations have completed, cancelling the tagsets is redundant and unsafe. So avoid cancelling the I/O and admin tagsets during controller reset and shutdown. Reviewed-by: Christoph Hellwig <[email protected]> Signed-off-by: Nilay Shroff <[email protected]> Signed-off-by: Keith Busch <[email protected]>

kawasaki and others added 12 commits June 9, 2025 20:58

adding ci files

3d460ec

blktests-ci Bot added new V1 linus-master labels Jun 10, 2025

blktests-ci Bot force-pushed the linus-master_base branch from 3d460ec to ac8b12e Compare July 10, 2025 11:56

blktests-ci Bot added V2 and removed V1 labels Jul 10, 2025

blktests-ci Bot closed this Jul 10, 2025

blktests-ci Bot deleted the series/968947=>linus-master branch July 23, 2025 02:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dm-pcache �� persistent-memory cache for block devices#14

dm-pcache �� persistent-memory cache for block devices#14
blktests-ci[bot] wants to merge 12 commits intolinus-master_basefrom
series/968947=>linus-master

blktests-ci Bot commented Jun 10, 2025

Uh oh!

blktests-ci Bot commented Jun 10, 2025

Uh oh!

blktests-ci Bot commented Jul 10, 2025

Uh oh!

blktests-ci Bot commented Jul 10, 2025

Uh oh!

blktests-ci Bot commented Jul 10, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

blktests-ci Bot commented Jun 10, 2025

Uh oh!

blktests-ci Bot commented Jun 10, 2025

Uh oh!

blktests-ci Bot commented Jul 10, 2025

Uh oh!

blktests-ci Bot commented Jul 10, 2025

Uh oh!

blktests-ci Bot commented Jul 10, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants