fix(mcp): keep strong reference to episode queue worker task to prevent silent GC #1575

Open
rneissl wants to merge 1 commit into getzep:main from rneissl:fix/mcp-queue-worker-gc-anchor

rneissl commented Jun 11, 2026

PR: Fix queue worker garbage-collection under streamable-http transport

What

Stores a strong reference to the asyncio.Task created by QueueService.add_episode_task so the worker can't be garbage-collected mid-execution.

Why

Per Python asyncio docs:

"Save a reference to the result of this function, to avoid a task disappearing mid-execution. The event loop only keeps weak references to tasks."

Without that anchor, the event loop's weak reference is the only thing pointing at the worker task. Under GC pressure (here, streamable-http request handling), the task can be collected before its first await self._episode_queues[group_id].get() suspends it. The result is a silent failure: add_memory queues episodes, but no worker ever processes them and nothing is logged.
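
For reference, the anchoring pattern that the asyncio docs recommend alongside that warning can be sketched as follows. This is a generic illustration, not code from this repository: `worker`, `spawn`, and `background_tasks` are hypothetical names chosen for the example.

```python
import asyncio

# Strong references to in-flight tasks; the event loop itself only holds weak ones.
background_tasks: set[asyncio.Task] = set()


async def worker(n: int) -> int:
    # Hypothetical stand-in for real background work.
    await asyncio.sleep(0)
    return n * 2


def spawn(n: int) -> asyncio.Task:
    task = asyncio.create_task(worker(n))
    background_tasks.add(task)                        # anchor the task
    task.add_done_callback(background_tasks.discard)  # release it once finished
    return task


async def main() -> list[int]:
    tasks = [spawn(i) for i in range(3)]
    assert len(background_tasks) == 3  # every task is anchored while pending
    results = await asyncio.gather(*tasks)
    await asyncio.sleep(0)  # give any remaining done callbacks a chance to run
    return results


results = asyncio.run(main())
```

This PR uses a dict keyed by group_id rather than a set, since there is one long-lived worker per group, but the idea is the same: hold the reference for as long as the task should live, and drop it when the task ends.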

Reported in #1574. Fixes #1574.

How

Three small changes in mcp_server/src/services/queue_service.py:

  1. __init__: add self._worker_tasks: dict[str, asyncio.Task] = {} — strong-ref storage
  2. add_episode_task: store the task: self._worker_tasks[group_id] = asyncio.create_task(...)
  3. _process_episode_queue finally-block: self._worker_tasks.pop(group_id, None) — clean up reference when worker exits
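
The three changes can be sketched roughly as below. This is a simplified stand-in written for illustration, not the contents of queue_service.py: the real class's signatures, logging, and its bookkeeping for whether a worker is already running are not shown in this description, so the sketch assumes a zero-argument async `process_func` and uses membership in `_worker_tasks` as the "worker running" check. The `demo` function and the `demo_group` id are likewise hypothetical.

```python
import asyncio
from collections.abc import Awaitable, Callable


class QueueService:
    """Simplified stand-in for illustration; not the real mcp_server class."""

    def __init__(self) -> None:
        self._episode_queues: dict[str, asyncio.Queue] = {}
        # Change 1: strong-reference storage, one worker task per group_id.
        self._worker_tasks: dict[str, asyncio.Task] = {}

    async def add_episode_task(
        self, group_id: str, process_func: Callable[[], Awaitable[None]]
    ) -> None:
        queue = self._episode_queues.setdefault(group_id, asyncio.Queue())
        await queue.put(process_func)
        # Assumption: dict membership doubles as the "worker running" flag.
        if group_id not in self._worker_tasks:
            # Change 2: keep the create_task() result instead of discarding it.
            self._worker_tasks[group_id] = asyncio.create_task(
                self._process_episode_queue(group_id)
            )

    async def _process_episode_queue(self, group_id: str) -> None:
        try:
            while True:
                process_func = await self._episode_queues[group_id].get()
                try:
                    await process_func()
                finally:
                    self._episode_queues[group_id].task_done()
        finally:
            # Change 3: drop the reference when the worker exits.
            self._worker_tasks.pop(group_id, None)


async def demo() -> tuple[list[str], bool, bool]:
    service = QueueService()
    processed: list[str] = []

    async def episode() -> None:
        processed.append("episode-1")

    await service.add_episode_task("demo_group", episode)
    await service._episode_queues["demo_group"].join()  # wait for processing
    anchored = "demo_group" in service._worker_tasks    # held while worker idles

    worker = service._worker_tasks["demo_group"]
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    released = "demo_group" not in service._worker_tasks  # popped in finally
    return processed, anchored, released


processed, anchored, released = asyncio.run(demo())
```

The demo only checks the bookkeeping (the reference exists while the worker is alive and is removed when it exits); it does not attempt to reproduce the GC race itself.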

Verification

Tested on zepai/knowledge-graph-mcp:1.0.2-standalone with FalkorDB backend, streamable-http transport, in a Kubernetes Deployment.

Before patch: add_memory returns "queued" but no log activity, no LLM calls, no new graph nodes, no error.

After patch:

12:45:09 INFO services.queue_service - Starting episode queue worker for group_id: kiroot_v2
12:45:09 INFO services.queue_service - Processing episode None for group kiroot_v2
12:45:13 INFO httpx - HTTP Request: POST .../v1/responses "HTTP/1.1 200 OK"
12:45:14 INFO httpx - HTTP Request: POST .../v1/embeddings "HTTP/1.1 200 OK" (×6)
[entity-extraction, dedup, embedding, edge-resolution loop continues for ~51s]
12:46:01 INFO graphiti_core.graphiti - Completed add_episode in 51522.60 ms
12:46:01 INFO services.queue_service - Successfully processed episode None for group kiroot_v2

Graph: Episodic node count went from 569 → 570, new entities and edges extracted from the test episode.

Backwards compatibility

None broken. Public API of QueueService unchanged. Only adds an internal _worker_tasks dict and stores/cleans up references inside existing methods. Behavior under stdio transport is identical to before (the GC race didn't manifest there because handler completion didn't trigger the same GC pattern).

Tests

The MCP server doesn't ship asyncio-stress tests for the queue worker today. Adding a regression test would require simulating GC pressure, which is non-trivial. Manual reproduction steps are in the linked issue. Happy to add a test if maintainers point at a similar testing pattern in the repo.


Patch developed and verified by Roland Neissl (BAB IKT, Austria) while migrating the graphiti-mcp service from the ToolHive operator to a plain GitOps deployment. Originally verified on zepai/knowledge-graph-mcp:1.0.2-standalone (2026-05-10), then rebased and re-verified against main on 2026-06-11.

asyncio.create_task() results must be referenced or the event
loop's weak reference allows the GC to collect the worker
mid-execution. Under streamable-http transport this manifests
as add_memory queueing episodes that are never processed, with
no error logged. Store the task in _worker_tasks and drop the
reference when the worker exits.

@zep-cla-assistant (Contributor)


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. For privacy information, see our Privacy Notice. You can sign the CLA by posting a pull request comment in the format below.


I have read the CLA Document and I hereby sign the CLA behalf on myself, e-mail: [email protected]

or

I have read the CLA Document and I hereby sign the CLA behalf of my company, e-mail: [email protected]

Signature is valid for 6 months.


This bot will be retriggered when the Contributor License Agreement comment has been provided. Posted by the CLA Assistant Lite bot.

Development

Successfully merging this pull request may close these issues.

MCP server: queue worker silently garbage-collected under streamable-http — add_memory queues but never processes