(feat): per-IOC locking#165
Conversation
Previously abort() returned the error on CancelledError and raised a new one on other failures, leaving the chain errored so any queued commit (notably the disconnect transaction) was never executed — IOCs could stay Active in CF after an upload failure. Co-authored-by: Sky Brewer <[email protected]>
8fc2db4 to
85ed5e4
Compare
jacomago
left a comment
There was a problem hiding this comment.
I'm wondering if we can get better testing for this.
I'm thinking we could make the multi-ioc container tests could have maxActive = 2, since there are 4 iocs.
Would also be nice to have an IOC with more channels than the commitMax which is defaulted to 5000, could simply set it to 1 and make sure tests work?
The single global DeferredLock serialised all IOC commits behind one queue. Under load, the lock depth determined how long IOCs waited to be marked active or inactive, and one stuck commit blocked every other. Per-IOC locks let different IOCs commit in parallel while keeping same-IOC transactions serialised. _state_lock (threading.Lock) guards the shared iocs/channel_ioc_ids dicts against concurrent thread writes. The CF push receives a targeted snapshot of the channels relevant to this commit (records being deleted plus the IOC's current channel set) so snapshot cost scales with the commit rather than total channel count. _ioc_channels tracks which channels belong to each IOC for O(1) disconnect cost; registration deduplicates to prevent double-counting. stopService drains all in-flight per-IOC locks before running the clean_on_stop sweep, preventing Active pushes from racing the sweep. The commit path moves to @inlineCallbacks so retry waits use task.deferLater — no sleeping worker threads between attempts. Also introduces CFUpdateAbortedError to distinguish exhausted CF retries from cancellation, and fixes chain_error to surface unexpected exceptions rather than swallowing them. Co-authored-by: Sky Brewer <[email protected]>
deferToThread is monkeypatched to return synchronously-resolved Deferreds so the full _commit_with_lock callback chain can be exercised without a reactor. The prepare_result tuple matches the (ioc_info, record_info_by_name, records_to_delete, iocs_snap, ciids_snap) signature of _prepare_commit; the push phase is controlled via _push_to_cf_async. Covers lock identity, iocid routing, prune lifecycle, and all four chain_error paths. Co-authored-by: Sky Brewer <[email protected]>
The global lock is gone; update demo.conf to reflect that a stuck pushAlwaysRetry now only holds that IOC's lock rather than blocking all commits, fix an incorrect default value comment, and correct a long-standing typo in the same pass. Also rename single-letter CastFactory identifiers (P/P2 → proto/waiting, addr → _addr) to satisfy linter, and correct the _stop_service_with_lock docstring which still claimed the lifecycle lock prevents concurrent commits (commits now use per-IOC locks and are drained explicitly before the clean_on_stop sweep).
application.py unconditionally overwrote session.trlimit with the config value, which defaulted to 0 (no limit), silently negating the trlimit=5000 class default added in 89be20c. Reference CollectionSession.trlimit directly so the default cannot silently drift if the class default ever changes, and add a regression test.
538b445 to
c51af3e
Compare
Twisted's default thread pool size is 10. With maxActive IOC commits in-flight, each holding a pool thread for the CF HTTP call, the pool becomes the bottleneck before maxActive connections are ever fully utilised. Set the pool to maxActive + 10 (headroom for clean_service and other thread users) so throughput scales with the configured connection limit rather than the thread pool default.
c51af3e to
d35922d
Compare
jacomago
left a comment
There was a problem hiding this comment.
Still think we should have more integration tests where the ioc count is purposefully above the configuration maxActive etc. Maybe even just alter the configuraiton for the multi-ioc tests to have maxActive =1 .
Sorry, forgot to reply to this before @jacomago. (Always annoying how to deal with the summary comments in GH which aren't threads...) I don't think we should be further extending the current "e2e" tests run in this project's CI. In fact, I think we should be making an effort to remove them (which actually is intended to partially be enabled by this MR). I can, however, accept trying to increase coverage in a more isolated "mock-y" way, if that's what you feel is needed. |
Ok, not the e2e tests, I agree they should be removed. Then I think tests that explicitly have maxActive and commitAmount (or whatever the variable is called) being low so that its clear the behaviour is still correct and can handle multiple IOCs. |
- connections beyond maxActive go to Wait, not activated - waiting connections are promoted FIFO when an active one completes - NActive decrements when active completes with no waiters - connections closed before activation are removed from Wait without affecting NActive
|



No description provided.