feat(eap): speed up cross-item queries with local trace_id joins [EAP-377]#8131
Open
phacops wants to merge 8 commits into
Open
feat(eap): speed up cross-item queries with local trace_id joins [EAP-377]#8131phacops wants to merge 8 commits into
phacops wants to merge 8 commits into
Conversation
…-377] Cross-item queries filter the outer query with `trace_id IN (<subquery>)`. Today the trace_id on both sides is rewritten to `replaceAll(toString(trace_id), '-', '')` by the UUIDColumnProcessor, which defeats the trace_id bloom-filter index, and the condition is moved to PREWHERE. Because eap_items is sharded by trace_id, the join can instead run locally on each shard and use the index. Behind a new runtime config `use_local_join_for_cross_item_queries` (default off), cross-item queries now: - compare the bare `trace_id` UUID column on both sides of the IN (the subquery selects/group-bys the raw trace_id; the outer condition is emitted as raw SQL), so the bloom-filter index applies; - keep the condition in WHERE rather than PREWHERE (the raw-SQL condition has no trace_id Column node for PrewhereProcessor to move) — PREWHERE is incompatible with distributed_product_mode='local' due to a ClickHouse bug; - set distributed_product_mode='local' so the join is pushed down to the local storage nodes. This avoids the generic query-pipeline table-aliasing work by using the uncorrelated IN-subquery form, which does not require qualified aliases. Applied consistently across the TraceItemTable, TimeSeries and GetTraces endpoints. Adds unit tests for the new helpers and re-runs the existing cross-item functional tests with the flag enabled to verify result parity. Co-Authored-By: Claude Opus 4.8 <[email protected]> Claude-Session: https://claude.ai/code/session_011q2WUYZRg6VSfcmndVCWHU
… flag Per review: the `trace_id IN (subquery)` condition and the distributed_product_mode='local' setting are now applied unconditionally for cross-item queries instead of being gated behind the `use_local_join_for_cross_item_queries` runtime config. The raw-SQL condition and raw trace_id subquery select are correct in all cases, so the flag (and the dash-stripped / PREWHERE legacy path) is removed. - Remove `use_local_join_for_cross_item_queries()` and `local_join_clickhouse_settings()`; expose a `LOCAL_JOIN_CLICKHOUSE_SETTINGS` constant instead. - `trace_id_in_subquery_condition()` always emits the raw `trace_id IN (...)`. - The trace-ids subquery always selects/group-bys the bare trace_id. - Drop the now-redundant flag-toggling test subclasses; the base cross-item functional suites exercise the only path. Simplify the unit test. Co-Authored-By: Claude Opus 4.8 <[email protected]> Claude-Session: https://claude.ai/code/session_011q2WUYZRg6VSfcmndVCWHU
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 1b80122. Configure here.
The test_distributed (0, 2) shard failed at redis_db fixture setup with `redis.exceptions.ReadOnlyError: You can't write against a read only replica` across tests/test_api.py — a transient Redis Cluster replica/failover blip unrelated to this change. Re-running. Co-Authored-By: Claude Opus 4.8 <[email protected]> Claude-Session: https://claude.ai/code/session_011q2WUYZRg6VSfcmndVCWHU
Address review feedback: - Replace the `LOCAL_JOIN_CLICKHOUSE_SETTINGS` dict (only ever held one entry) with a `LOCAL_JOIN_DISTRIBUTED_PRODUCT_MODE` string constant. - Extract the identical outer-query settings block (skip outer-query sampling for cross-item queries + set distributed_product_mode='local') that was duplicated in the TraceItemTable and TimeSeries resolvers into a shared `apply_cross_item_outer_query_settings()` helper. GetTraces sets the same setting via its existing `_build_snuba_request(clickhouse_settings=...)` param (its outer query isn't sampled, so the sampling branch doesn't apply there). Co-Authored-By: Claude Opus 4.8 <[email protected]> Claude-Session: https://claude.ai/code/session_011q2WUYZRg6VSfcmndVCWHU
- Rename LOCAL_JOIN_DISTRIBUTED_PRODUCT_MODE -> CROSS_ITEM_DISTRIBUTED_PRODUCT_MODE so the constant name describes the setting rather than baking in its "local" value. - Add test_cross_item_query_local_join_end_to_end: runs a full cross-item query and asserts both correct results and that the outer query carries distributed_product_mode='local'. Under SNUBA_SETTINGS=test_distributed this exercises the trace_id IN (subquery) + local-join path against the Distributed table engine (single shard, cluster_one_sh), where the setting actually applies. Co-Authored-By: Claude Opus 4.8 <[email protected]> Claude-Session: https://claude.ai/code/session_011q2WUYZRg6VSfcmndVCWHU
Co-Authored-By: Claude Opus 4.8 <[email protected]> Claude-Session: https://claude.ai/code/session_011q2WUYZRg6VSfcmndVCWHU
Re-introduce `use_local_join_for_cross_item_queries` (default off) gating the EAP-377 optimization, so it can be rolled out and rolled back without a deploy. When disabled, the legacy behavior is used: dash-stripped `in(trace_id, (subquery))`, a dash-stripped subquery select, and no `distributed_product_mode` override. Gated in three places, all in cross_item_queries.py: - trace_id_in_subquery_condition(): raw `trace_id IN (...)` vs legacy in_cond. - the trace-ids subquery selects the bare trace_id vs the dash-stripped column. - apply_cross_item_outer_query_settings(): only pushes distributed_product_mode='local' when enabled (GetTraces gates its own clickhouse_settings the same way). Tests: unit tests assert both flag states; the cross-item functional suites run the default path, and flag-on parity subclasses re-run them with the optimization enabled across all three endpoints. Co-Authored-By: Claude Opus 4.8 <[email protected]> Claude-Session: https://claude.ai/code/session_011q2WUYZRg6VSfcmndVCWHU
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Implements EAP-377 — "Speed up cross item queries using local joins".
Background
Cross-item queries filter the outer query with
trace_id IN (<subquery>). Today thetrace_idon both sides is rewritten byUUIDColumnProcessorintoreplaceAll(toString(trace_id), '-', ''), which defeats thetrace_idbloom-filter index, and the condition is moved toPREWHERE. Becauseeap_itemsis sharded bytrace_id, the join can instead run locally on each shard and use the index — but that requires the comparison to be on the bare UUID column anddistributed_product_mode='local'.What this does
Behind a new runtime config
use_local_join_for_cross_item_queries(default0/off), cross-item queries:trace_idUUID on both sides of theINso the bloom-filter index applies. The trace-ids subquery selects/group-bys the rawtrace_id, and the outer condition is emitted as raw SQL (trace_id IN (<subquery>)) instead ofin(replaceAll(toString(trace_id),…), …).WHEREinstead ofPREWHERE. The raw-SQL condition contains notrace_idColumnnode, soPrewhereProcessorleaves it inWHERE.PREWHEREis incompatible withdistributed_product_mode='local'due to a ClickHouse bug (per the ticket).distributed_product_mode='local'so the join is pushed down to the local storage nodes (the storage'sClickhouseSettingsOverrideusesoverwrite_existing: false, so the per-query setting wins over the defaultglobal).When the flag is off (the default), behavior is unchanged: the legacy dash-stripped
in(trace_id, (subquery))+ distributed join.Note on the table-aliasing work in the ticket
The ticket lists generic query-pipeline support for table aliasing as task (1). This PR avoids that by keeping the uncorrelated
IN-subquery form: the inner and outertrace_idreferences live in separate query scopes, so qualified aliases aren't required fordistributed_product_mode='local'to be unambiguous. Gating behind a config lets the EAP team validate on real data and roll back instantly.Changes
cross_item_queries.py:use_local_join_for_cross_item_queries()flag,trace_id_in_subquery_condition()(raw vs legacy),CROSS_ITEM_DISTRIBUTED_PRODUCT_MODE, and a sharedapply_cross_item_outer_query_settings()helper used by the resolvers; the trace-ids subquery selects the baretrace_idwhen enabled.Tests
tests/web/rpc/v1/test_cross_item_local_join.py: unit tests asserting both flag states for the condition shape and the ClickHouse settings.distributed_product_mode='local'— underSNUBA_SETTINGS=test_distributedit exercises thetrace_id IN (subquery)+ local-join path against the Distributed table engine (single shard,cluster_one_sh).Rollout
Default off. Enable per-environment with the
use_local_join_for_cross_item_queriesruntime config.Legal Boilerplate
Look, I get it. The entity doing business as "Sentry" was incorporated in the State of Delaware in 2015 as Functional Software, Inc. and is gonna need some rights from me in order to utilize my contributions in this here PR. So here's the deal: I retain all rights, title and interest in and to my contributions, and by keeping this boilerplate intact I confirm that Sentry can use, modify, copy, and redistribute my contributions, under Sentry's choice of terms.
🤖 Generated with Claude Code
https://claude.ai/code/session_011q2WUYZRg6VSfcmndVCWHU