Describe the bug
We've seen the client stuck on a pending transaction when a broker was removed from a cluster.
The client kept sending AddPartitionsToTxnRequest to the wrong broker, and every attempt failed because that broker was not responding.
I think the root cause is that _coordinator_dead is only called upon receiving a NOT_COORDINATOR error from the broker, so the cached coordinator never expires if the broker is no longer available.
The problem seems to also affect other requests that make use of coordinators.
Expected behaviour
The client should expire the sender's coordinator cache whenever a MetadataResponse shows that a coordinator is no longer present.
The cache could also expire after a timeout when errors with the coordinator persist.
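
To make the two proposed expiry rules concrete, here is a minimal sketch. It is not the client's actual code: `CoordinatorCache`, its method names, and the `error_ttl` parameter are all hypothetical, and the real sender keeps its coordinator state differently. It only illustrates dropping a cached coordinator when it disappears from metadata, or when errors persist beyond a time limit.

```python
import time
from typing import Callable, Dict, Iterable, Optional


class CoordinatorCache:
    """Hypothetical cache mapping a coordinator key to a broker node id."""

    def __init__(
        self,
        error_ttl: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._error_ttl = error_ttl  # assumed limit for persistent errors
        self._clock = clock
        self._nodes: Dict[str, int] = {}
        self._first_error: Dict[str, float] = {}

    def set(self, key: str, node_id: int) -> None:
        self._nodes[key] = node_id
        self._first_error.pop(key, None)

    def get(self, key: str) -> Optional[int]:
        return self._nodes.get(key)

    def invalidate(self, key: str) -> None:
        self._nodes.pop(key, None)
        self._first_error.pop(key, None)

    def on_metadata_update(self, live_node_ids: Iterable[int]) -> None:
        # Rule 1: drop coordinators that the latest metadata no longer lists.
        live = set(live_node_ids)
        stale = [k for k, node in self._nodes.items() if node not in live]
        for key in stale:
            self.invalidate(key)

    def on_request_success(self, key: str) -> None:
        # A successful request ends the current error streak.
        self._first_error.pop(key, None)

    def on_request_error(self, key: str) -> None:
        # Rule 2: drop the coordinator once errors have lasted error_ttl seconds.
        if key not in self._nodes:
            return
        now = self._clock()
        first = self._first_error.setdefault(key, now)
        if now - first >= self._error_ttl:
            self.invalidate(key)
```

After either rule fires, `get()` returns `None`, which a caller could treat as the signal to look the coordinator up again instead of retrying against the unreachable broker.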
Environment (please complete the following information):
We can't give precise version information: we observed this issue from the server side and do not control the client.
Reproducible example
Not easy to reproduce. One would need to open some transactions and keep them open while a broker is decommissioned.