buffer task-started events that arrive before task-sent #31

Merged
aviator-app[bot] merged 1 commit into main from orphan-started-buffer
Apr 29, 2026

Conversation

@tulioz tulioz commented Apr 29, 2026

Contributor

celery's task-sent and task-started events ride independent broker connections with no ordering guarantee, so a fast pickup can flip their order. when that happened, the late task-sent entry sat in the in-flight cache until its TTL expired, and the queue_wait sample was silently dropped.

aviator-app Bot commented Apr 29, 2026


Current Aviator status

This PR was merged using Aviator.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a buffer to handle out-of-order Celery events where a 'task-started' event arrives before its corresponding 'task-sent' event. By storing these 'orphan' started events in an OrderedDict, the system can still calculate the queue wait time when the 'task-sent' event eventually arrives, instead of leaving an unmatched entry in the in-flight cache until it expires.

Feedback suggests increasing the TTL for these orphan events to 60 seconds to better accommodate clock skew between workers and the monitor. It is also recommended to add thread synchronization (locks) around shared state such as the orphan buffer and the in-flight cache, to avoid race conditions between the event receiver thread and the pruning thread.
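
The buffering idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class `QueueWaitTracker`, its method names, and the constant `ORPHAN_CACHE_SIZE` are hypothetical, and timestamps are passed in directly rather than read from Celery event payloads.

```python
from collections import OrderedDict

# Hypothetical cap, chosen to mirror the 10,000-item size mentioned in the review.
ORPHAN_CACHE_SIZE = 10_000


class QueueWaitTracker:
    """Hypothetical tracker that pairs task-sent and task-started in either order."""

    def __init__(self):
        self.in_flight = {}                  # task_id -> task-sent timestamp
        self.orphan_started = OrderedDict()  # task_id -> task-started timestamp
        self.queue_waits = []                # recorded queue_wait samples, in seconds

    def on_task_sent(self, task_id, sent_ts):
        started_ts = self.orphan_started.pop(task_id, None)
        if started_ts is not None:
            # task-started arrived first: pair the two events now.
            self.queue_waits.append(max(0.0, started_ts - sent_ts))
            return
        self.in_flight[task_id] = sent_ts

    def on_task_started(self, task_id, started_ts):
        sent_ts = self.in_flight.pop(task_id, None)
        if sent_ts is not None:
            # Normal order: task-sent was already seen.
            self.queue_waits.append(max(0.0, started_ts - sent_ts))
            return
        # No matching task-sent yet: buffer the orphan, evicting the oldest when full.
        self.orphan_started[task_id] = started_ts
        while len(self.orphan_started) > ORPHAN_CACHE_SIZE:
            self.orphan_started.popitem(last=False)
```

With this shape, a flipped pair such as `on_task_started("a", 105.0)` followed by `on_task_sent("a", 100.0)` still yields one sample and leaves both caches empty.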


_WORKER_HEARTBEAT_TTL_SEC = 120
_PRUNE_INTERVAL_SEC = 30
_ORPHAN_STARTED_TTL_SEC = 10

medium

A 10-second TTL might be too aggressive given that the pruning logic in _prune compares the worker-generated event timestamp with the monitor's local time. If a worker's clock is behind the monitor's clock by more than 10 seconds, its task-started events will be pruned immediately upon the next _prune cycle, defeating the purpose of the buffer. Since the cache size is already capped at 10,000 items (which has a negligible memory footprint), consider increasing this TTL to 60 or 120 seconds to better account for clock skew and broker latency.

Suggested change
- _ORPHAN_STARTED_TTL_SEC = 10
+ _ORPHAN_STARTED_TTL_SEC = 60
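
To make the clock-skew concern concrete, here is a small sketch of an age check of the kind the comment describes. The function `is_expired` and the numbers are illustrative assumptions; the actual comparison inside `_prune` is not shown on this page.

```python
# Hypothetical stand-in for the TTL constant discussed in the review.
ORPHAN_STARTED_TTL_SEC = 10


def is_expired(event_ts, monitor_now, ttl=ORPHAN_STARTED_TTL_SEC):
    """Return True when the event looks older than ttl from the monitor's clock."""
    return monitor_now - event_ts > ttl


# Assumed scenario: the worker's clock runs 15 seconds behind the monitor's.
monitor_now = 1_000.0
worker_skew = -15.0
event_ts = monitor_now + worker_skew  # event stamped "just now" by the worker

expired_with_10s = is_expired(event_ts, monitor_now)          # True
expired_with_60s = is_expired(event_ts, monitor_now, ttl=60)  # False
```

Under these assumed numbers, a brand-new event already appears 15 seconds old to the monitor, so a 10-second TTL drops it on the next prune cycle while a 60-second TTL keeps it.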

Comment on lines +287 to +288
while len(self._orphan_started) > _ORPHAN_STARTED_CACHE_SIZE:
    self._orphan_started.popitem(last=False)

medium

While OrderedDict is generally thread-safe for single operations in CPython, this compound check-and-pop logic is not atomic. Since _record_task_started is called from the event receiver thread and _prune runs in a separate timer thread, there is a small race condition where the size could change between the len() check and popitem(). Given the existing patterns in this class, this might be acceptable, but for better robustness, consider wrapping accesses to shared state like _orphan_started and _in_flight with a threading.Lock.
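
One way the locking suggestion could look is sketched below. `OrphanBuffer` and its methods are hypothetical names introduced for illustration; the class in this PR may organize its state differently. The point is only that the size check, the eviction, and the prune scan each run under the same `threading.Lock`.

```python
import threading
from collections import OrderedDict


class OrphanBuffer:
    """Hypothetical lock-guarded buffer of orphan task-started timestamps."""

    def __init__(self, max_size):
        self._lock = threading.Lock()
        self._items = OrderedDict()  # task_id -> started timestamp
        self._max_size = max_size

    def add(self, task_id, started_ts):
        # The insert and the check-and-pop loop run as one atomic step.
        with self._lock:
            self._items[task_id] = started_ts
            while len(self._items) > self._max_size:
                self._items.popitem(last=False)

    def pop(self, task_id):
        with self._lock:
            return self._items.pop(task_id, None)

    def prune(self, now, ttl):
        # Collect expired keys first, then delete, all under the lock.
        with self._lock:
            expired = [k for k, ts in self._items.items() if now - ts > ttl]
            for k in expired:
                del self._items[k]
            return len(expired)

    def __len__(self):
        with self._lock:
            return len(self._items)
```

A single coarse lock keeps the sketch simple; whether the extra contention matters depends on event volume, which this page does not state.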

@aviator-app aviator-app Bot merged commit 7c1f185 into main Apr 29, 2026
5 checks passed