Missing retry backoff in Fetcher for NotLeaderForPartitionError causes tight fetch loop during broker restarts #1155

@vanchaxy

Description

When Fetcher._proc_fetch_request receives NotLeaderForPartitionError or UnknownTopicOrPartitionError in the per-partition response loop, it calls self._client.force_metadata_update() but does not apply any backoff/sleep before the next fetch cycle.

Normally, a Kafka broker holds a fetch request for up to fetch.max.wait.ms (default 500 ms) when not enough data is available, and only then responds. However, when a partition has no leader (e.g. during a broker restart or partition reassignment), the broker returns NotLeaderForPartitionError immediately without entering the wait phase. The response time drops to just the network round trip (~10-20 ms).

Without a backoff on the client side, the fetch loop spins as fast as the network allows, effectively DDoS-ing the Kafka cluster with fetch requests. The tight loop also causes memory usage to spike in the consumer process.
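To put rough numbers on this, here is a back-of-the-envelope sketch. The helper `fetch_rate_per_second` is hypothetical (not part of aiokafka), and the 100 ms backoff is an assumed value for `self._retry_backoff`; the 500 ms and 10 ms figures are the ones quoted above. It only models one fetch loop issuing requests back to back.

```python
def fetch_rate_per_second(response_time_ms: float, backoff_ms: float = 0.0) -> float:
    """Requests per second for a loop that sends the next fetch as soon as
    the previous one returns, optionally sleeping backoff_ms in between."""
    return 1000.0 / (response_time_ms + backoff_ms)


# Idle partition with a leader: the broker holds the request for fetch.max.wait.ms.
normal = fetch_rate_per_second(500)

# Leaderless partition: the error comes back after one network round trip.
tight = fetch_rate_per_second(10)

# Same error path, with an assumed 100 ms client-side backoff.
with_backoff = fetch_rate_per_second(10, backoff_ms=100)

print(normal, tight, round(with_backoff, 2))
```

Under these assumptions the request rate jumps by a factor of about 50 while the partition is leaderless, and a backoff of the assumed size brings it back to the same order of magnitude as normal operation.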

Note that the general KafkaError exception handler at the top of the method (added in #534) does include await asyncio.sleep(self._retry_backoff), but the per-partition error handling branch for NotLeaderForPartitionError inside the response processing loop does not.

Expected behaviour

After receiving NotLeaderForPartitionError or UnknownTopicOrPartitionError, the fetcher should sleep for self._retry_backoff (the same as it does for the general KafkaError case), giving the cluster time to elect new partition leaders without being overwhelmed by requests.

  elif error_type in (
      Errors.NotLeaderForPartitionError,
      Errors.UnknownTopicOrPartitionError,
  ):
      self._client.force_metadata_update()
      # Proposed addition: this backoff is currently missing
      await asyncio.sleep(self._retry_backoff)
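The effect of that one extra line can be illustrated with a small standalone simulation. This is only a sketch and does not use aiokafka: `count_fetches`, `RTT`, `RETRY_BACKOFF` and `ELECTION_TIME` are hypothetical names with assumed values, and the "broker" is just an `asyncio.sleep` that returns after one simulated round trip.

```python
import asyncio

RTT = 0.01            # simulated network round trip in seconds (assumed)
RETRY_BACKOFF = 0.1   # stand-in for self._retry_backoff (assumed value)
ELECTION_TIME = 0.2   # how long the partition stays leaderless (assumed)


async def count_fetches(backoff: float) -> int:
    """Count fetch requests sent while the simulated partition has no leader."""
    loop = asyncio.get_running_loop()
    leader_back_at = loop.time() + ELECTION_TIME
    requests = 0
    while loop.time() < leader_back_at:
        await asyncio.sleep(RTT)      # broker answers immediately with an error
        requests += 1
        await asyncio.sleep(backoff)  # backoff of 0 reproduces the tight loop
    return requests


async def main() -> tuple[int, int]:
    # Run both variants one after the other over the same leaderless window.
    return await count_fetches(0.0), await count_fetches(RETRY_BACKOFF)


without_backoff, with_backoff = asyncio.run(main())
print(f"requests without backoff: {without_backoff}, with backoff: {with_backoff}")
```

With these assumed timings the variant without a backoff sends several times more requests during the same leaderless window. Exact counts depend on event loop scheduling, so they are not quoted here.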
