Skip to content

feat(eap): speed up cross-item queries with local trace_id joins [EAP-377]#8131

Open
phacops wants to merge 8 commits into
masterfrom
claude/cross-item-queries-improve-t9qbi0
Open

feat(eap): speed up cross-item queries with local trace_id joins [EAP-377]#8131
phacops wants to merge 8 commits into
masterfrom
claude/cross-item-queries-improve-t9qbi0

Conversation

@phacops

@phacops phacops commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Implements EAP-377 — "Speed up cross item queries using local joins".

Background

Cross-item queries filter the outer query with trace_id IN (<subquery>). Today the trace_id on both sides is rewritten by UUIDColumnProcessor into replaceAll(toString(trace_id), '-', ''), which defeats the trace_id bloom-filter index, and the condition is moved to PREWHERE. Because eap_items is sharded by trace_id, the join can instead run locally on each shard and use the index — but that requires the comparison to be on the bare UUID column and distributed_product_mode='local'.

What this does

Behind a new runtime config use_local_join_for_cross_item_queries (default 0/off), cross-item queries:

  • Compare the bare trace_id UUID on both sides of the IN so the bloom-filter index applies. The trace-ids subquery selects/group-bys the raw trace_id, and the outer condition is emitted as raw SQL (trace_id IN (<subquery>)) instead of in(replaceAll(toString(trace_id),…), …).
  • Keep the condition in WHERE instead of PREWHERE. The raw-SQL condition contains no trace_id Column node, so PrewhereProcessor leaves it in WHERE. PREWHERE is incompatible with distributed_product_mode='local' due to a ClickHouse bug (per the ticket).
  • Set distributed_product_mode='local' so the join is pushed down to the local storage nodes (the storage's ClickhouseSettingsOverride uses overwrite_existing: false, so the per-query setting wins over the default global).

When the flag is off (the default), behavior is unchanged: the legacy dash-stripped in(trace_id, (subquery)) + distributed join.

Note on the table-aliasing work in the ticket

The ticket lists generic query-pipeline support for table aliasing as task (1). This PR avoids that by keeping the uncorrelated IN-subquery form: the inner and outer trace_id references live in separate query scopes, so qualified aliases aren't required for distributed_product_mode='local' to be unambiguous. Gating behind a config lets the EAP team validate on real data and roll back instantly.

Changes

  • cross_item_queries.py: use_local_join_for_cross_item_queries() flag, trace_id_in_subquery_condition() (raw vs legacy), CROSS_ITEM_DISTRIBUTED_PRODUCT_MODE, and a shared apply_cross_item_outer_query_settings() helper used by the resolvers; the trace-ids subquery selects the bare trace_id when enabled.
  • Wired through the TraceItemTable, TimeSeries, and GetTraces resolvers.

Tests

  • tests/web/rpc/v1/test_cross_item_local_join.py: unit tests asserting both flag states for the condition shape and the ClickHouse settings.
  • Cross-item functional suites for all three endpoints run the default (legacy) path; flag-on parity subclasses re-run them with the optimization enabled. A dedicated end-to-end test asserts the outer query carries distributed_product_mode='local' — under SNUBA_SETTINGS=test_distributed it exercises the trace_id IN (subquery) + local-join path against the Distributed table engine (single shard, cluster_one_sh).

Rollout

Default off. Enable per-environment with the use_local_join_for_cross_item_queries runtime config.

Legal Boilerplate

Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. and is gonna need some rights from me in order to utilize my contributions in this here PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.

🤖 Generated with Claude Code

https://claude.ai/code/session_011q2WUYZRg6VSfcmndVCWHU

…-377]

Cross-item queries filter the outer query with `trace_id IN (<subquery>)`.
Today the trace_id on both sides is rewritten to
`replaceAll(toString(trace_id), '-', '')` by the UUIDColumnProcessor, which
defeats the trace_id bloom-filter index, and the condition is moved to
PREWHERE. Because eap_items is sharded by trace_id, the join can instead run
locally on each shard and use the index.

Behind a new runtime config `use_local_join_for_cross_item_queries`
(default off), cross-item queries now:

- compare the bare `trace_id` UUID column on both sides of the IN (the
  subquery selects/group-bys the raw trace_id; the outer condition is emitted
  as raw SQL), so the bloom-filter index applies;
- keep the condition in WHERE rather than PREWHERE (the raw-SQL condition has
  no trace_id Column node for PrewhereProcessor to move) — PREWHERE is
  incompatible with distributed_product_mode='local' due to a ClickHouse bug;
- set distributed_product_mode='local' so the join is pushed down to the
  local storage nodes.

This avoids the generic query-pipeline table-aliasing work by using the
uncorrelated IN-subquery form, which does not require qualified aliases.

Applied consistently across the TraceItemTable, TimeSeries and GetTraces
endpoints. Adds unit tests for the new helpers and re-runs the existing
cross-item functional tests with the flag enabled to verify result parity.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
Claude-Session: https://claude.ai/code/session_011q2WUYZRg6VSfcmndVCWHU
@phacops phacops requested review from a team as code owners June 28, 2026 19:01
@linear-code

linear-code Bot commented Jun 28, 2026

Copy link
Copy Markdown

EAP-377

… flag

Per review: the `trace_id IN (subquery)` condition and the
distributed_product_mode='local' setting are now applied unconditionally for
cross-item queries instead of being gated behind the
`use_local_join_for_cross_item_queries` runtime config. The raw-SQL condition
and raw trace_id subquery select are correct in all cases, so the flag (and
the dash-stripped / PREWHERE legacy path) is removed.

- Remove `use_local_join_for_cross_item_queries()` and
  `local_join_clickhouse_settings()`; expose a `LOCAL_JOIN_CLICKHOUSE_SETTINGS`
  constant instead.
- `trace_id_in_subquery_condition()` always emits the raw `trace_id IN (...)`.
- The trace-ids subquery always selects/group-bys the bare trace_id.
- Drop the now-redundant flag-toggling test subclasses; the base cross-item
  functional suites exercise the only path. Simplify the unit test.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
Claude-Session: https://claude.ai/code/session_011q2WUYZRg6VSfcmndVCWHU

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 1b80122. Configure here.

Comment thread snuba/web/rpc/v1/resolvers/common/cross_item_queries.py Outdated
claude and others added 6 commits June 28, 2026 20:38
The test_distributed (0, 2) shard failed at redis_db fixture setup with
`redis.exceptions.ReadOnlyError: You can't write against a read only replica`
across tests/test_api.py — a transient Redis Cluster replica/failover blip
unrelated to this change. Re-running.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
Claude-Session: https://claude.ai/code/session_011q2WUYZRg6VSfcmndVCWHU
Address review feedback:
- Replace the `LOCAL_JOIN_CLICKHOUSE_SETTINGS` dict (only ever held one entry)
  with a `LOCAL_JOIN_DISTRIBUTED_PRODUCT_MODE` string constant.
- Extract the identical outer-query settings block (skip outer-query sampling
  for cross-item queries + set distributed_product_mode='local') that was
  duplicated in the TraceItemTable and TimeSeries resolvers into a shared
  `apply_cross_item_outer_query_settings()` helper. GetTraces sets the same
  setting via its existing `_build_snuba_request(clickhouse_settings=...)` param
  (its outer query isn't sampled, so the sampling branch doesn't apply there).

Co-Authored-By: Claude Opus 4.8 <[email protected]>
Claude-Session: https://claude.ai/code/session_011q2WUYZRg6VSfcmndVCWHU
- Rename LOCAL_JOIN_DISTRIBUTED_PRODUCT_MODE -> CROSS_ITEM_DISTRIBUTED_PRODUCT_MODE
  so the constant name describes the setting rather than baking in its "local"
  value.
- Add test_cross_item_query_local_join_end_to_end: runs a full cross-item query
  and asserts both correct results and that the outer query carries
  distributed_product_mode='local'. Under SNUBA_SETTINGS=test_distributed this
  exercises the trace_id IN (subquery) + local-join path against the Distributed
  table engine (single shard, cluster_one_sh), where the setting actually applies.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
Claude-Session: https://claude.ai/code/session_011q2WUYZRg6VSfcmndVCWHU
Re-introduce `use_local_join_for_cross_item_queries` (default off) gating the
EAP-377 optimization, so it can be rolled out and rolled back without a deploy.
When disabled, the legacy behavior is used: dash-stripped
`in(trace_id, (subquery))`, a dash-stripped subquery select, and no
`distributed_product_mode` override.

Gated in three places, all in cross_item_queries.py:
- trace_id_in_subquery_condition(): raw `trace_id IN (...)` vs legacy in_cond.
- the trace-ids subquery selects the bare trace_id vs the dash-stripped column.
- apply_cross_item_outer_query_settings(): only pushes
  distributed_product_mode='local' when enabled (GetTraces gates its own
  clickhouse_settings the same way).

Tests: unit tests assert both flag states; the cross-item functional suites run
the default path, and flag-on parity subclasses re-run them with the
optimization enabled across all three endpoints.

Co-Authored-By: Claude Opus 4.8 <[email protected]>
Claude-Session: https://claude.ai/code/session_011q2WUYZRg6VSfcmndVCWHU
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants