Releases: anakryiko/wprof

wprof v0.3

30 Apr 23:13

wprof v0.3 Release Notes

Highlights

  • User-defined tracing (utrace) — a comprehensive new subsystem for
    custom event capture with a flexible DSL supporting uprobes, kprobes,
    USDT, tracepoints, raw tracepoints, and BPF program probes.
  • Python function tracing — deterministic Python and PyTorch function
    call tracing via injection, with stitched Python + native stack traces.
  • JSON output mode — full JSON trace output with documented schema.
  • PMU counter collection — hardware and software performance counter
    capture per scheduling event.
  • Request listing and filtering — post-capture request analysis with
    sorting, filtering, and top/bottom-N selection.

New Features

User-Defined Tracing (-U)

A new DSL-based subsystem for capturing custom events alongside
built-in tracing. Define probes with -U '<definition>' or from a
file with -U @filepath.

Supported probe types:

  • uprobes (u:, uret:, uspan:) — userspace function entry/exit
  • kprobes (k:, kret:, kspan:) — kernel function entry/exit
  • USDT (usdt:provider:name) — User Statically-Defined Tracepoints
  • classic tracepoints (tp:category:name) — perf tracepoint events
  • raw tracepoints (raw_tp:name) — BTF-based kernel tracepoints
  • BPF probes (bpf:, bpfret:, bpfspan:) — tracing loaded BPF programs
  • generic spans (probe1 ~~ probe2) — arbitrary entry/exit pairs

Features:

  • Argument capture by index (arg:0) or name (arg:prev_comm) with
    automatic type inference from BTF, tracefs format files, or USDT ELF notes
  • Wildcard capture (arg:*) for all available arguments
  • Optional explicit type annotation (arg:0:str, arg:1:u32->my_name)
  • Stack trace capture (stack parameter)
  • Name format templates with argument substitution
    (| name:'syscall #{id}' |)
  • Custom probe IDs for track grouping (| id:my_probe |)
  • Binary/process filtering (path:, pid:)

Events appear as Perfetto slices/instants with argument annotations and
as structured JSON with typed argument values.

See UTRACE.md for full documentation.

Python Stack Traces (-f py-stacks)

BPF-based Python stack trace capture:

  • Captures Python call stacks and stitches them with native (C/C++)
    stack traces for unified call stacks in timer and off-CPU events.
  • -e py-stacks-only to show only Python frames without native stacks.
  • Auto-discovers Python processes, or target specific ones with
    -f py-stacks=PID or -f py-stacks=nvidia-smi.

Python Function Tracing (-f py-trace, -f py-torch)

Deterministic function-level tracing for Python and PyTorch applications
via library injection:

  • Python tracing (-f py-trace): Captures Python function calls and
    returns via PyEval_SetProfile, producing exact call trees with
    timestamps. Rendered as collapsible tracks under kernel threads in
    Perfetto.
  • PyTorch tracing (-f py-torch): Captures PyTorch operator execution
    via the RecordFunction callback system, covering autograd and C++ threads.
  • Supports both statically and dynamically linked Python and libpytorch.

JSON Output Mode (-J)

Full JSON trace output as newline-delimited JSON:

  • -J <file> writes JSON trace to file; -J - writes to stdout
  • Complete event coverage: scheduling, interrupts, workqueue, task
    lifecycle, CUDA, Python, PyTorch, utrace, requests, sched-ext
  • Documented schema available via --json-schema flag
  • Float-second timestamps with nanosecond precision
  • Structured stack traces with symbolized frames and source locations

See JSON_SCHEMA.md for the full data model.
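
Because the output is newline-delimited JSON, it can be consumed one line at a time without loading the whole trace. The following is a minimal sketch of such a reader. The field name `type` and the sample records are placeholders invented for illustration, not taken from the wprof schema; consult JSON_SCHEMA.md or --json-schema for the real field names.

```python
import io
import json
from collections import Counter


def count_events(stream, key="type"):
    """Count records per value of `key` in a newline-delimited JSON stream.

    `key` is a placeholder; the real field names are defined by the schema.
    """
    counts = Counter()
    for line in stream:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)  # each line holds one complete JSON value
        if isinstance(record, dict):
            counts[record.get(key, "<missing>")] += 1
    return counts


# Made-up sample records, not real wprof output.
sample = io.StringIO(
    '{"type": "switch", "ts": 1.000000001}\n'
    '{"type": "timer", "ts": 1.000000002}\n'
    '{"type": "switch", "ts": 1.000000003}\n'
)
print(count_events(sample))
```

When the trace is written to stdout with -J -, the same function could be given `sys.stdin` as the stream.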

PMU Counter Collection (--pmu)

Hardware and software performance counter capture:

  • --pmu r003c — raw PMU event
  • --pmu cpu/cpu-cycles/ — named PMU event
  • --pmu sw:page-faults — software event
  • --pmu L1-icache-loads — cache event
  • --pmu derived:ipc=cpu_instructions/cpu_cpu-cycles — derived counters
  • PMU values attached to scheduling events (context switches, interrupts)
  • Rendered as annotations in Perfetto and as pmus arrays in JSON
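
The derived:ipc example above reads as a ratio of two counters (instructions over cycles). As a rough sketch of that arithmetic, assuming the derived expression is a plain division of the two counter values for one interval (the notes do not say how wprof evaluates it), and using made-up counter values:

```python
def derived_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    if denominator == 0:
        return None
    return numerator / denominator


# Hypothetical counter values for one scheduling interval.
instructions = 1_200_000
cycles = 800_000
ipc = derived_ratio(instructions, cycles)
print(f"ipc = {ipc:.2f}")  # prints "ipc = 1.50"
```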

Request Listing (--req-list)

Post-capture analysis of completed requests:

  • --req-list — list all completed requests
  • --req-sort latency / --req-sort-asc / --req-sort-desc — sort by
    field
  • --req-filter 'latency>1ms' — filter by field expressions
  • --req-top-n 10 / --req-bottom-n 10 — limit output
  • -S req — capture stack traces at request lifecycle events
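
The combined effect of filtering, sorting, and top-N selection can be illustrated with a small sketch. This only mirrors the semantics described above and is not wprof's implementation; the record fields (`id`, `latency_ns`), the accepted units, and the restriction to `<` and `>` are assumptions made for the example.

```python
import re

# Unit multipliers to nanoseconds; the set of units wprof accepts is an assumption.
UNITS_NS = {"ns": 1, "us": 1_000, "ms": 1_000_000, "s": 1_000_000_000}


def parse_latency_filter(expr):
    """Parse an expression like 'latency>1ms' into (operator, threshold_ns)."""
    m = re.fullmatch(r"latency\s*([<>])\s*(\d+)(ns|us|ms|s)", expr.strip())
    if m is None:
        raise ValueError(f"unsupported filter: {expr!r}")
    op, value, unit = m.groups()
    return op, int(value) * UNITS_NS[unit]


def select_requests(requests, expr, top_n):
    """Keep requests matching the filter, sort by latency descending, take top_n."""
    op, threshold = parse_latency_filter(expr)
    if op == ">":
        kept = [r for r in requests if r["latency_ns"] > threshold]
    else:
        kept = [r for r in requests if r["latency_ns"] < threshold]
    kept.sort(key=lambda r: r["latency_ns"], reverse=True)
    return kept[:top_n]


# Made-up requests; field names are hypothetical.
requests = [
    {"id": 1, "latency_ns": 500_000},
    {"id": 2, "latency_ns": 3_000_000},
    {"id": 3, "latency_ns": 1_500_000},
    {"id": 4, "latency_ns": 9_000_000},
]
for r in select_requests(requests, "latency>1ms", top_n=2):
    print(r["id"], r["latency_ns"])
```

With the sample data, three requests exceed 1 ms and the two slowest (ids 4 and 2) are kept.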

IRQ Collection Control (-f softirq/hardirq/irq)

Fine-grained control over interrupt event capture:

  • -f softirq / -f hardirq — enable specific IRQ types
  • -f irq — enable both softirq and hardirq
  • -f no-softirq / -f no-hardirq — disable specific IRQ types

Custom Metadata (-M)

Attach arbitrary key=value metadata to recordings:

  • -M key=value — repeatable, stored in data file
  • Appears in JSON header metadata object and replay info
  • Session timestamp automatically recorded in UTC

Sandboxing Support

New options for running wprof in partially untrusted environments:

  • --record — explicitly enforce recording mode (mutually exclusive
    with --replay)
  • --seal-output (hidden) — prevent subsequent -D, -T, -J options,
    allowing a trusted runner to lock down output paths before passing
    control to untrusted arguments
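
One way a trusted runner might use these options is to place its own output options first, then --seal-output, and only then append arguments it does not control. The sketch below only assembles the argument list. The helper name, the output path, and the choice of -J as the trusted output option are illustrative assumptions, and the notes do not confirm that this exact combination is accepted.

```python
def build_wprof_argv(trusted_output_args, untrusted_args):
    """Place trusted options first, then --seal-output, then untrusted arguments."""
    trusted = ["wprof", "--record"] + list(trusted_output_args)
    return trusted + ["--seal-output"] + list(untrusted_args)


# Hypothetical output path chosen by the trusted runner.
argv = build_wprof_argv(["-J", "/srv/traces/run.json"], ["-f", "py-stacks"])
print(" ".join(argv))
# A real runner would hand argv to subprocess.run(argv, check=True).
```

Ordering matters here: per the description above, only output options that come after --seal-output are rejected, so the runner's own paths must precede it.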

Perfetto Trace Improvements

  • Restructured thread tracks: Timer, CUDA API, request, and utrace
    events each get their own collapsible child track under the thread's
    scheduler track, reducing visual clutter.
  • CUDA GPU hierarchy: GPU tracks sorted numerically with GPU #N at top.
  • Python/PyTorch tracks: Nested as collapsible rows under kernel threads.
  • Request visualization: Per-thread request activity tracks with
    -e req-split (default) and -e req-embed options.

Bug Fixes

  • Fix per-CPU stack trace scratch buffer clobbering that could corrupt
    stack traces under heavy load.
  • Warn when not running as root in capture mode.
  • Error out on bare -R without an output mode (-T, -J, -I, or
    --req-list).
  • Numerous other small bug fixes across pytrace, CUDA, request tracking,
    sched-ext, ELF symbol resolution, and Perfetto rendering.

Performance

  • Significantly sped up BPF ringbuf setup at startup.
  • Improved ringbuf usage logic to reduce occasional data drops.
  • Bumped default ringbuf size to 16MB for busier hosts.

Internal / Data Format

  • Complete redesign of wprof.data persistence format for better
    performance and reduced disk space.
  • Revamped stack symbolization pipeline.
  • Improved --replay-info output with detailed per-event-type
    breakdowns, stack trace statistics, and PMU data sizes.

Full Changelog: v0.2.1...v0.3

wprof v0.2.1

21 Jan 19:36

Bug-fix release that resolves a kernel stack trace symbolization issue on arm64 by updating blazesym to its latest release.

Full Changelog: v0.2...v0.2.1

wprof v0.2

13 Jan 18:25

Release notes

  • scheduler-centric per-CPU view (-e sched) is now supported;
  • GPU tracing support (-f cuda) using ptrace-based code injection into target processes;
  • reworked stack trace support in Perfetto traces: stack traces are now attached to events and slices directly (relies on recently added Perfetto support);
  • more sched-ext specific metrics are now collected (-f scx-layer).

Full Changelog: v0.1...v0.2

wprof v0.1

10 Oct 00:00

First official release, in preparation for packaging.

Full Changelog: https://github.com/anakryiko/wprof/commits/v0.1