
feat(consumer): set client.rack placement from ZONE env var#546

Open
phacops wants to merge 2 commits into main from claude/consumer-placement-zone-config-rcm3b6

phacops commented Jun 28, 2026

Contributor

Summary

Allows the consumer placement (client.rack) to be set in the Kafka consumer configuration based on a ZONE environment variable, so a rack-aware fetch strategy (fetch-from-follower) can be enabled later on. Implemented in both the Python and Rust runtimes.

When the consumer configuration is built and the ZONE environment variable is set, its value is propagated to librdkafka's client.rack config. This lets the consumer advertise its availability zone to the broker.

An explicit client.rack provided in the default config or via override params always takes precedence over the env var.

This is opt-in and off by default (no client.rack unless ZONE is set), and rack-aware fetch only takes effect once the broker is configured with replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector. Per the discussion below, this is intended as a foundational / emergency-only capability and should not be enabled fleet-wide until workloads are spread evenly across zones.
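The precedence rules described in this summary can be sketched in a few lines of Python. This is a minimal illustration, not the code from the PR: `apply_zone_placement` and its `environ` parameter are hypothetical names introduced here, and treating an empty `ZONE` value as unset is an assumption.

```python
import os
from typing import Any, Dict, Mapping, Optional

ZONE_ENV_VAR = "ZONE"  # env var name taken from the PR description


def apply_zone_placement(
    config: Mapping[str, Any],
    override_params: Optional[Mapping[str, Any]] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of config with client.rack resolved."""
    env = os.environ if environ is None else environ
    result = dict(config)

    zone = env.get(ZONE_ENV_VAR)
    # Assumption: an empty ZONE is treated the same as an unset one.
    # Only fill in client.rack when the config does not already define it.
    if zone and "client.rack" not in result:
        result["client.rack"] = zone

    # Overrides are applied last, so an explicit client.rack always wins.
    if override_params:
        result.update(override_params)
    return result
```

With `ZONE=us-central1-a` and an empty config, the result carries `client.rack` set to that zone; with `ZONE` unset, the key is absent, which matches the opt-in behavior described above.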

Related

  • getsentry/ops#21545 — spreads workloads evenly across zones (prerequisite before broadly enabling rack-aware placement).

Changes

  • arroyo/backends/kafka/configuration.py: read the ZONE env var (ZONE_ENV_VAR) in build_kafka_consumer_configuration and set client.rack when present and not already configured.
  • rust-arroyo/src/backends/kafka/config.rs: read the ZONE env var in KafkaConfig::new_consumer_config and set client.rack before applying override params (so overrides win).
  • Unit tests in both runtimes covering the env-var case, the absent case, and explicit-override precedence.
  • CHANGELOG.md: note the new feature.

Test plan

  • Python: pytest tests/backends/test_kafka.py -k "client_rack or zone" — all 3 tests pass.
  • Rust: cargo test --lib backends::kafka::config — passes; cargo fmt --check and cargo check clean.

🤖 Generated with Claude Code

https://claude.ai/code/session_01YS4onNgraFjT9gffNP6Jzo

Propagate the ZONE environment variable to librdkafka's client.rack
config when building the consumer configuration, so consumers advertise
their availability zone to the broker. This is a prerequisite for
enabling a rack-aware fetch strategy (fetch-from-follower) later on.

An explicit client.rack in the provided config still takes precedence.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
Claude-Session: https://claude.ai/code/session_01YS4onNgraFjT9gffNP6Jzo
phacops requested review from a team as code owners June 28, 2026 18:50
…untime

Mirror the Python behavior in the rust-arroyo consumer config: propagate
the ZONE environment variable to librdkafka's client.rack when building the
consumer configuration. An explicit client.rack override still takes
precedence.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
Claude-Session: https://claude.ai/code/session_01YS4onNgraFjT9gffNP6Jzo
@untitaker

Member

arroyo already allows you to pass in arbitrary rdkafka consumer options, so you could patch this into snuba directly.

is the idea that we can override this setting for arbitrary arroyo consumers in our infra in emergency situations?

if so, i wonder if we should instead support arbitrary overrides like this:

export ARROYO_RDKAFKA_CONFIG='{"client.rack": ...}'

while easy to do and very powerful, i think this overlaps a bit with other topicctl stuff we want to do. would have to sync with @enochtangg but i think we already may have a plan for setting arbitrary consumer options?
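A rough sketch of how such a JSON-valued override could be merged into a consumer config is shown below. It only illustrates the idea floated in this comment: `ARROYO_RDKAFKA_CONFIG` is the name proposed here, not a setting this thread confirms exists, and `apply_env_overrides` is a hypothetical helper.

```python
import json
import os
from typing import Any, Dict, Mapping, Optional

# Name proposed in the comment above; treated here as hypothetical.
OVERRIDE_ENV_VAR = "ARROYO_RDKAFKA_CONFIG"


def apply_env_overrides(
    config: Mapping[str, Any],
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: merge JSON overrides from the environment into config."""
    env = os.environ if environ is None else environ
    merged = dict(config)

    raw = env.get(OVERRIDE_ENV_VAR)
    if not raw:
        return merged  # nothing to override

    overrides = json.loads(raw)
    if not isinstance(overrides, dict):
        raise ValueError(f"{OVERRIDE_ENV_VAR} must contain a JSON object")

    # Environment overrides win over whatever the application configured.
    merged.update(overrides)
    return merged
```

Letting the environment win is one possible design choice for an emergency lever; a real implementation would also need to decide whether some keys should be protected from overrides.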

phacops commented Jun 28, 2026

Contributor Author

I think that’s something you’d want to do at the platform level, for every consumer, which is why I opened the PR here. As a user of arroyo and the streaming platform, I don’t want to have to know about placement or where my consumer consumes from.

Happy to contain it to Snuba too for now and wait for whatever plans you have for this.

fpacifici commented Jun 28, 2026

Contributor

As we discussed in Slack last week, please avoid making the consumers zone aware unless we can spread consumers out across zones evenly.
[Image: Screenshot 2026-06-28 at 1 23 57 PM]

Right now there is a significant imbalance between the nodes in us-central1-a and those in the other two zones, which will put more load on the Kafka brokers in that zone. Our Kafka infrastructure is sized on the assumption that load is spread more or less evenly.

If you need to do an experiment to troubleshoot the incident, this can be alright on the spans cluster with the streaming oncall aware of the experiment. There are so many distinct workloads on the cluster that an imbalance on one consumer will probably be acceptable. On transactions things are different, the utilization of the system is considerably higher.

However, we cannot make this a platform feature in the general case with today's infra, which is basically guaranteed to be unevenly distributed across zones. I would consider this an emergency-only feature, to be used only for experiments and incidents.
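The sizing concern can be illustrated with a small calculation. The sketch below assumes every consumer fetches the same volume and that each consumer reads only from replicas in its own zone. The consumer counts and the zone names other than us-central1-a are made up for illustration and are not taken from the screenshot.

```python
from collections import Counter
from typing import Dict, Iterable


def fetch_share_by_zone(consumer_zones: Iterable[str]) -> Dict[str, float]:
    """Fraction of fetch traffic served by brokers in each zone, assuming
    equal volume per consumer and strictly zone-local fetching."""
    counts = Counter(consumer_zones)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {zone: n / total for zone, n in counts.items()}


# Hypothetical placement: 6 of 10 consumers happen to run in one zone.
consumers = ["us-central1-a"] * 6 + ["us-central1-b"] * 2 + ["us-central1-c"] * 2
shares = fetch_share_by_zone(consumers)
```

Under these assumptions the brokers in us-central1-a would serve 60% of the fetch traffic, against the roughly one third per zone that an evenly sized cluster expects.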

phacops commented Jun 28, 2026

Contributor Author

Fair enough, we don't have to merge this as a platform feature. That said, it still has to be added before we can make use of it, whether or not the workload is balanced. Setting it doesn't mean it will be in use right away.

> On transactions things are different, the utilization of the system is considerably higher.

A bit confused by this. Are you suggesting the transactions consumer has a higher utilization of the system overall compared to eap-items?

@fpacifici

Contributor

> A bit confused by this. Are you suggesting the transactions consumer has a higher utilization of the system overall compared to eap-items?

It is a smaller cluster. It has higher utilization per node (we are about to scale it up) and fewer different workloads.
This means an imbalanced workload will have a larger impact on transactions than on spans.
So caution is needed before running an experiment there.

phacops commented Jun 28, 2026

Contributor Author

Ah, I understand.

phacops commented Jun 28, 2026

Contributor Author

By the way, we still need to run the broker with this selector so it's not like this would be enabled by default. It's just laying the foundation for this.

replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector
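For context, here is a simplified Python model of how a rack-aware selector chooses the replica a consumer reads from. It is a sketch of the general behavior, not the broker's actual code: the `Replica` type and `select_read_replica` function are hypothetical, and it assumes all listed replicas are in sync and that brokers advertise their own rack (via `broker.rack`).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Replica:
    broker_id: int
    rack: Optional[str]
    log_end_offset: int
    is_leader: bool = False


def select_read_replica(replicas: List[Replica], client_rack: Optional[str]) -> Replica:
    """Simplified rack-aware selection over in-sync replicas."""
    leader = next(r for r in replicas if r.is_leader)

    # Without client.rack, the consumer keeps fetching from the leader.
    if not client_rack:
        return leader

    same_rack = [r for r in replicas if r.rack == client_rack]
    if not same_rack:
        return leader  # no replica in the client's rack: fall back to the leader
    if leader in same_rack:
        return leader  # prefer the leader when it is already in the same rack

    # Otherwise pick the most caught-up replica in the client's rack.
    return max(same_rack, key=lambda r: r.log_end_offset)
```

The first branch is why setting `client.rack` alone changes nothing: until the broker runs a rack-aware selector, consumers are always served by the leader.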

fpacifici commented Jun 28, 2026

Contributor

> Ah, I understand.

It should be possible to do this test safely on that cluster as well. We just need to ensure the oncall is aware and rollback is ready.

phacops commented Jun 29, 2026

Contributor Author

Here is the change to properly spread workloads across zones: getsentry/ops#21545


4 participants