Skip to content

proxy-client: fix TSan data race in clientDestroy#286

Merged
ryanofsky merged 2 commits into
bitcoin-core:masterfrom
ryanofsky:pr/drace
Jun 11, 2026
Merged

proxy-client: fix TSan data race in clientDestroy#286
ryanofsky merged 2 commits into
bitcoin-core:masterfrom
ryanofsky:pr/drace

Conversation

@ryanofsky

@ryanofsky ryanofsky commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

#273 added a new test which exposed a longstanding race condition in clientDestroy logging code, causing the sanitize CI job to now fail with a TSAN error. Fix the race by simplifying the logging code.

Details about the fix are in the commit message. TSAN error looked like

[ TEST ] test.cpp:437: Destroying ProxyClient<> with destroy method after peer disconnect
...
  Write of size 8 at 0x721400005020 by thread T24 (mutexes: write M0):
    #0 std::__1::__function::__func<mp::ProxyClientBase<mp::test::messages::FooCallback, mp::test::FooCallback>::ProxyClientBase(mp::test::messages::FooCallback::Client, mp::Connection*, bool)::{lambda()#1}, void ()>::operator()() /home/runner/work/libmultiprocess/libmultiprocess/include/mp/proxy-io.h:522 (mptest+0x242b20)
    #1 void mp::Unlock<mp::Lock, std::__1::function<void ()>&>(mp::Lock&, std::__1::function<void ()>&) /nix/store/6z6xj37ljxwrwd5w3rs0zyfkrxd48lzl-libcxx-21.1.2-dev/include/c++/v1/__functional/function.h:274 (mptest+0x35582d)
    #2 mp::Connection::~Connection() /home/runner/work/libmultiprocess/libmultiprocess/src/mp/proxy.cpp:150 (mptest+0x34e596)
    ...
    #11 mp::test::TestSetup::TestSetup(bool)::{lambda()#1}::operator()() const /home/runner/work/libmultiprocess/libmultiprocess/test/mp/test/test.cpp:112 (mptest+0x1504fe)

  Previous read of size 8 at 0x721400005020 by thread T26:
    #0 void mp::clientDestroy<mp::ProxyClient<mp::test::messages::FooCallback> >(mp::ProxyClient<mp::test::messages::FooCallback>&) /home/runner/work/libmultiprocess/libmultiprocess/include/mp/proxy-types.h:647 (mptest+0x342d2a)
     ...
    #21 std::__1::__function::__func<mp::ProxyServerBase<mp::test::messages::FooInterface, mp::test::FooImplementation>::~ProxyServerBase()::{lambda()#1}, void ()>::operator()() /nix/store/6z6xj37ljxwrwd5w3rs0zyfkrxd48lzl-libcxx-21.1.2-dev/include/c++/v1/__functional/function.h:174 (mptest+0x347c76)
     ...
    #23 void* std::__1::__thread_proxy[abi:ne210101]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, mp::EventLoop::startAsyncThread()::$_0> >(void*) /home/runner/work/libmultiprocess/libmultiprocess/src/mp/proxy.cpp:309 (mptest+0x353ef2)

  Thread T24 (tid=10032, running) created by main thread at:
    ...
    #4 mp::test::TestCase437::run() /home/runner/work/libmultiprocess/libmultiprocess/test/mp/test/test.cpp:449 (mptest+0x1459a7)

https://github.com/bitcoin-core/libmultiprocess/actions/runs/26851297744/job/79183796152

Note: #273 made it possible for this TSAN error to trigger, but it didn't trigger most of the time, so it was not detected in CI. There is also an interaction with #218 and #273 + #218 together make it much more likely to trigger

clientDestroy() reads m_context.connection to decide whether to use
MP_LOG vs KJ_LOG, but Connection::~Connection() can set it to null
concurrently from the event loop thread (via the disconnect_cb sync
cleanup callback) while the destructor runs on an async cleanup thread,
causing a TSan-reported data race. The race is exposed by the test added
in commit 90be835 ("test: regression for ~ProxyClient destroy after peer
disconnect").

The KJ_LOG fallback was only needed before commit 315ff53 ("refactor:
Add ProxyContext EventLoop* member"), when logging required going
through connection->m_loop. Since that commit, m_context.loop is a
direct EventLoopRef that is always valid regardless of whether
m_context.connection is null. The KJ_LOG branch is now dead code, so
drop it and the connection check entirely.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@DrahtBot

DrahtBot commented Jun 3, 2026

Copy link
Copy Markdown

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Reviews

See the guideline for information on the review process.

Type Reviewers
ACK maflcko, ViniciusCestarii, xyzconstant
Concept ACK w0xlt
Stale ACK enirox001

If your review is incorrectly listed, please copy-paste <!--meta-tag:bot-skip--> into the comment that the bot should ignore.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #293 (proxy-types: prevent race condition when reading client.m_context.connection by ViniciusCestarii)
  • #269 (proxy: add local connection limit to ListenConnections by enirox001)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

gen.cpp uses IWYU pragma: keep for include which appears to be necessary for
newer versions of capnproto (1.4.0 but not 1.1.0)
@ryanofsky

ryanofsky commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator Author

@w0xlt

w0xlt commented Jun 4, 2026

Copy link
Copy Markdown

Concept ACK,

@enirox001 enirox001 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK 039e5ac

Built and ran mptest with -fsanitize=thread, all 11 tests pass cleanly.

The fix makes sense to me. Instead of adding synchronization for what constitutes a log line, we drop the branch and unconditionally log via MP_LOG.

@ryanofsky

Copy link
Copy Markdown
Collaborator Author

Thanks for the reviews. I also went back and tried to debug why CI did not catch the TSAN error. It turns out to be because an interaction between #273 and #218. The new test added in #273 does not trigger the TSAN error deterministically. When it is run standalone without #218 there's usually no error, although I did see a TSAN error once running the test maybe 10 times. But when #273 and #218 are combined, the test triggers TSAN failures most of the time. This seems to happen because #218 changes clientDestroy to call CxxTypeName which can change thread timing.

@ryanofsky

Copy link
Copy Markdown
Collaborator Author

Started to merge this but I saw some more IWYU errors locally that seem better to fix. I think the errors don't happen in CI because nixpkgs are older and use an older version of capnproto. I also ran IWYU locally before my last push, but didn't see these errors because I was using a combined branch with #287 with example files removed. Errors look like:

/home/ryan/mp/build-default/example/calculator.capnp.proxy-client.c++ should add these lines:
#include <capnp/common.h>                    // for MessageSize
#include <string>                            // for basic_string

Updated 039e5ac -> 90982f7 (pr/drace.3 -> pr/drace.4, compare) touching up IWYU commit to fix more local IWYU errors

@maflcko

maflcko commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

lgtm ACK 90982f7

A bit tedious, that different capnp versions require different set of includes, but I guess there is nothing that can be done about that in this repo.

@DrahtBot DrahtBot requested a review from enirox001 June 10, 2026 10:40
@ryanofsky

Copy link
Copy Markdown
Collaborator Author

A bit tedious, that different capnp versions require different set of includes, but I guess there is nothing that can be done about that in this repo.

Yes, this repo is taking a different strategy for IWYU than the Bitcoin core repo so it is easier to run IWYU locally.

With Bitcoin core AFAIK you need to run IWYU in a container and can't run it as part of your normal development process because headers must be changed to match what a single CI machine enforces, and pragmas are discouraged.

With this repo there is no special IWYU configuration. IWYU can be run in any configuration and "pragma: keeps" can be added freely to turn off complaints from other configurations. From my perspective this is good because there's no need for a container to run IWYU, and you can waste less time debugging IWYU errors and figuring out what "correct" includes are. You get the main benefits of IWYU (faster compiles, cleaner headers) at the cost of needing a few mysterious pragmas. I think this is about as much as you can hope for with IWYU.

@maflcko

maflcko commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Right, it is obviously more correct to run iwyu for the full range of all supported third-party dependency versions (especially if they have different include requirements).

Maybe it could make sense to expand the CI to cover both oldeps-iwyu and nightly-iwyu (or so) here?

I think for Bitcoin Core this doesn't matter, because the dependencies don't change the include requirements across versions. (E.g. boost multi index is just a single include header)

If that were to change, we should do the same. But no strong opinion, and this seems to be trailing off a bit 😅

@ViniciusCestarii ViniciusCestarii left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK, 90982f7

This change removes differentiation on the logging when the connection was torn down "IPC interrupted client destroy" but I believe this is fine and simpler than locking a mutex.

@xyzconstant

Copy link
Copy Markdown
Contributor

tACK 90982f7

@ryanofsky ryanofsky merged commit 9885d7d into bitcoin-core:master Jun 11, 2026
20 of 23 checks passed
@w0xlt

w0xlt commented Jun 11, 2026

Copy link
Copy Markdown

post-merge ACK

Sjors added a commit to Sjors/sv2-tp that referenced this pull request Jun 17, 2026
8412fcdc65 Merge bitcoin-core/libmultiprocess#295: Mark Waiter m_cv as guarded by m_mutex
1593ee2d18 Merge bitcoin-core/libmultiprocess#294: test: Add passDouble smoke test
9885d7dd33 Merge bitcoin-core/libmultiprocess#286: proxy-client: fix TSan data race in clientDestroy
fa35501c4f Mark Waiter m_cv as guarded by m_mutex
faaedb11f8 test: Add passDouble smoke test
733c64318d Merge bitcoin-core/libmultiprocess#292: type-number: fix clang-tidy modernize-use-nullptr
9cc3479ab3 Merge bitcoin-core/libmultiprocess#291: cmake: Add `mp_headers` custom target
201abd9e3a Merge bitcoin-core/libmultiprocess#289: cmake: make target_capnp_sources use CURRENT dirs
99820c8aec Merge bitcoin-core/libmultiprocess#279: doc: Add comments to FIELD_* constants in proxy.h
73b985540c Merge bitcoin-core/libmultiprocess#278: doc: Fix and expand design.md
e7e91b2e23 Merge bitcoin-core/libmultiprocess#277: Add std::unordered_set support and a helper BuildList to dedup list build handlers
91a951f59a tidy fix: modernize-use-nullptr
16362f42d0 cmake: Add `mp_headers` custom target
615a94fe3a cmake: document ONLY_CAPNP option in target_capnp_sources
90982f75c6 mpgen: iwyu changes required by previous commit
25bb3e67f3 proxy-client: fix TSan data race in clientDestroy
620f297f31 cmake: make target_capnp_sources use CURRENT dirs
9de4b885aa test: use camelCase + $Proxy.name for FooStruct fields
011b91793d type: add std::unordered_set support
20d19b9644 proxy: add BuildList helper and dedup map/set/vector build handlers
e863c6cdf6 doc: Add comments to FIELD_* constants in proxy.h
18db0ab957 doc: Fix and expand design.md
61de697536 Merge bitcoin-core/libmultiprocess#273: proxy-client: tolerate exceptions from remote destroy during cleanup
9cec9d6ca5 Merge bitcoin-core/libmultiprocess#243: mpgen: support primitive std::optional struct fields
4aaff11374 Merge bitcoin-core/libmultiprocess#238: cmake, ci: updates for recent nixpkgs
2ac55a56b5 Merge bitcoin-core/libmultiprocess#218: Better error and log messages
6de92e1c73 proxy-client: tolerate exceptions from remote destroy during cleanup
90be8354d4 test: regression for ~ProxyClient destroy after peer disconnect
3c69d125a1 Merge bitcoin-core/libmultiprocess#260: event loop: tolerate unexpected exceptions in `post()` callbacks
b8a48c65e6 event loop: tolerate unexpected exceptions in `post()` callbacks
f787863d2c Merge bitcoin-core/libmultiprocess#270: doc: Bump version 10 > 11
a22f602910 doc: Bump version 10 > 11
4eae445d6d debug: Add TypeName() function and log statements for Proxy objects being created and destroyed
f326c5b1b7 logging: Add better logging on IPC server-side failures
6dbfa56a04 mpgen: support primitive std::optional struct fields
8d1277deb5 mpgen refactor: add AccessorType function
db716bbcba mpgen refactor: Move field handling code to FieldList class
db7acb3ce2 ci: Fix shell.nix compatibility with CMake 4.0
91a7759a9a cmake: Fix IWYU in nix by adding CMAKE_CXX_IMPLICIT_INCLUDE_DIRECTORIES

git-subtree-dir: src/ipc/libmultiprocess
git-subtree-split: 8412fcdc659e1379f9b4dea896c26bc1c5f3afa8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants