feat(data-pipeline): OTLP gRPC trace export#2171
Draft
bm1549 wants to merge 17 commits into
Draft
Conversation
Adds tonic dependency with transport and codegen features required for implementing gRPC-based OTLP trace export. Also adds h2 and tokio-stream as dev dependencies for gRPC test infrastructure. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…Config Re-add OtlpProtocol::Grpc (previously removed) with support for gRPC as a valid export protocol alongside HTTP/JSON and HTTP/Protobuf. Add OtlpGrpcTraceConfig struct to hold gRPC-specific configuration (endpoint, headers, timeout, semantics flag). The Grpc variant now parses successfully from "grpc" string and is available for use by the gRPC exporter implementation (Task 3+). Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…_grpc Add grpc_exporter.rs with OtlpGrpcTransport, build_grpc_channel, and send_otlp_traces_grpc. Implements a minimal ProstCodecImpl since tonic 0.14 removed ProstCodec from its public API. All items are pub(crate); the dispatch wiring comes in the next task. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Downstream tasks depend on OtlpGrpcTraceConfig, build_grpc_channel, send_otlp_traces_grpc, and OtlpGrpcTransport. Add public and crate-private re-exports to libdd-data-pipeline's otlp module. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Introduces `OtlpExportMode` to unify HTTP and gRPC OTLP transport selection. Renames `TraceExporter.otlp_config` to `otlp: Option<OtlpExportMode>`, updates the dispatch in `send_trace_chunks_inner` to match on the enum variant, and adds `send_otlp_grpc_inner` that calls `send_otlp_traces_grpc` with the pre-built `OtlpGrpcTransport`. Builder updated to wrap the HTTP config in `OtlpExportMode::Http`; gRPC wiring follows in Task 5. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…point configured Update TraceExporterBuilder to dispatch to the correct OtlpExportMode at build time based on otlp_protocol. When OtlpProtocol::Grpc is set and an endpoint URL is provided, call build_grpc_channel and wrap the result in OtlpExportMode::Grpc(OtlpGrpcTransport); otherwise fall back to OtlpExportMode::Http. When no otlp_endpoint is set, otlp is None and the exporter uses the Datadog agent path. Also update the set_otlp_protocol doc comment to reflect that grpc is now a supported protocol string. Three new builder tests cover: grpc+endpoint succeeds, grpc without endpoint falls back to agent, and https:// with grpc is rejected at build. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Update doc comment, inline comment, and tests for ddog_trace_exporter_config_set_otlp_protocol to reflect that "grpc" is now a supported value. Replaces set_otlp_protocol_rejects_grpc_and_unknown with set_otlp_protocol_accepts_all_three_protocols and set_otlp_protocol_rejects_unknown; updates config_otlp_protocol_test to assert "grpc" -> success with OtlpProtocol::Grpc stored. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…r Buffer poll contract Tower's Buffer service requires poll_ready/poll_reserve to be called before call/send_item. tonic 0.14's Grpc::unary does not call poll_ready internally, so callers must invoke client.ready().await first. Without this, the exporter panicked with "send_item called without first calling poll_reserve" on the first gRPC send. Also removes the now-stale #![allow(dead_code)] annotation since the gRPC transport is fully wired up. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…rocess h2 server
Adds two tests in test_trace_exporter_otlp_grpc.rs:
- grpc_export_sends_decodable_request: spins up an in-process h2 server,
configures a TraceExporter with OtlpProtocol::Grpc, sends a msgpack-encoded
trace via spawn_blocking, and asserts the server received a decodable
ExportTraceServiceRequest with the expected service.name attribute.
- grpc_protocol_string_parses: verifies OtlpProtocol::from_str("grpc") returns
OtlpProtocol::Grpc, covering the FromStr path without FFI ceremony.
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- Fix doc placeholder in OtlpExportMode: [] → [TraceExporter] - Remove duplicate grpc parse tests (keep grpc_parses_successfully) - Remove #[allow(dead_code)] from OtlpExportMode (now wired up) - Remove #[allow(unused_imports)] from otlp/mod.rs re-exports - Mark OtlpGrpcTraceConfig.endpoint_url with #[allow(dead_code)] (unused by gRPC path) Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
🎉 All green!🧪 All tests passed 🎯 Code Coverage (details) 🔗 Commit SHA: 9dcddce | Docs | Datadog PR Page | Give us feedback! |
Contributor
📚 Documentation Check Results📦
|
Contributor
Clippy Allow Annotation ReportComparing clippy allow annotations between branches:
Summary by Rule
Annotation Counts by File
Annotation Stats by Crate
About This ReportThis report tracks Clippy allow annotations for specific rules, showing how they've changed in this PR. Decreasing the number of these annotations generally improves code quality. |
Contributor
🔒 Cargo Deny Results📦
|
…ri/coverage - Move tonic to a cfg(not(target_arch = "wasm32")) dependency and gate the grpc_exporter module, OtlpExportMode::Grpc, the builder construction arm, and send_otlp_grpc_inner behind the same cfg. tonic's transport (hyper/tokio/socket2) does not build for wasm32, which broke `cargo check -p libdd-data-pipeline --target wasm32-unknown-unknown --no-default-features`. - Add #[cfg_attr(miri, ignore)] to build_with_grpc_https_endpoint_rejected, matching its sibling builder tests: build() starts the shared runtime whose worker hits a syscall Miri cannot execute. - Raise the e2e gRPC test's per-request timeout (set_connection_timeout) so the heavily instrumented coverage build does not time out the in-process round trip. Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
… + benchmark - Attach the `datadog-client-computed-stats` marker as request metadata inside send_otlp_traces_grpc (new client_computed_stats arg) instead of cloning the whole OtlpGrpcTransport (and its header Vec) on every export. - Add unit tests for grpc_status_to_error (transient -> Io, application -> Request) and attach_metadata (valid header + token, invalid header skipped, stats marker). - Add a criterion benchmark (benches/otlp_grpc_export.rs) for the gRPC wire-frame encoding and end-to-end native-spans -> framed-wire preparation across trace sizes. Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
…review cleanups Addresses pre-push review findings across the OTLP gRPC change: - builder: on wasm32, reject `OtlpProtocol::Grpc` with a clear error instead of falling through to the HTTP arm — which stored `OtlpProtocol::Grpc` in `OtlpTraceConfig` and panicked at `encode()`'s `unreachable!()` on first send (a panic across the FFI boundary). The unreachable arms are now truly unreachable on every target. - grpc_status_to_error: map `DeadlineExceeded` to `TimedOut` (was `ConnectionRefused`); the unit test now asserts the exact error kind. - ProstDecoder: drain the buffer via `copy_to_bytes`, robust if tonic's `DecodeBuf` ever becomes non-contiguous (`chunk()` could truncate). - extract `build_otlp_resource_info`, shared by the HTTP and gRPC send paths; hoist the OTLP drop_chunks/empty-check guard above the dispatch match. - drop the unused `OtlpGrpcTraceConfig::endpoint_url` field; gate `prost` off wasm32 since only the (wasm-gated) gRPC exporter uses it. Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
The `grpc_export_sends_decodable_request` test timed out (Io::TimedOut at the full per-request timeout) on the alpine/arm release-build jobs in GitLab CI, while passing locally and on centos-amd. Root cause: the `#[tokio::test]` current-thread runtime hosted both the in-process h2 server task and the `spawn_blocking` exporter coordination, so under parallel CI load the server task could be starved and the client's send would hang to the timeout. Run the h2 server on its own OS thread + current-thread runtime, and drive the exporter's own runtime via `send` directly on the test thread (no ambient async runtime, so no `spawn_blocking`). The two sides now have independent threads and runtimes and cannot starve each other. Verified 15/15 runs locally. Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
The previous rewrite blocked the test on `server.join()`, whose `while h2_conn.accept()` loop only ends when the pooled client connection closes — which the exporter never does — so the alpine/centos release-build jobs hung to GitLab's 2h `job_execution_timeout`. It passed locally only because process exit killed the detached runtime before the join mattered. Detach the server thread and time-bound every server-side await: a 60s overall ceiling (so a never-connecting client can't wedge the thread) and a 10s response-flush drive. The test no longer joins the server — verification comes from `send` returning Ok (client received grpc-status:0) plus the decoded request on the channel. 20/20 local runs, ~100ms each. Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
The in-process exporter<->h2-server round trip is too timing-sensitive for heavily contended CI runners: the client `send` has hit its request timeout on macos-15 (GitHub) and the alpine/arm release-build matrix (GitLab) across multiple runs, despite passing locally (20/20) and on less-loaded runners, and despite the live backend verification in PR #2171 succeeding. Mark it #[ignore] so it no longer runs in the default CI suites (it can't flake the PR or burn release-build runner time), while keeping it runnable on demand via `--run-ignored all`. The gRPC export path stays covered by this crate's unit tests (ProstCodec, grpc_status_to_error, attach_metadata, build_grpc_channel, builder dispatch, protocol parse) and by the live backend verification. Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
Contributor
Artifact Size Benchmark Reportaarch64-alpine-linux-musl
aarch64-unknown-linux-gnu
libdatadog-x64-windows
libdatadog-x86-windows
x86_64-alpine-linux-musl
x86_64-unknown-linux-gnu
|
…; un-ignore Root cause of the CI flakiness (found, not masked): the in-process h2 test server accepted the stream once, then read the request body inline without continuing to poll the `server::Connection`. Polling the connection is what reads incoming frames off the socket and routes DATA to an open stream's body. When the client's request DATA frame was already buffered during `accept()` (the common coalesced case) the read returned immediately and the test passed; when the DATA frame landed *after* `accept()` returned (more likely under load, as HEADERS and DATA split across socket reads), nothing ever polled the connection to read it, so `body.data()` hung until the client's request timeout fired — surfacing as `Io(Kind(TimedOut))` / "Timeout expired" on the contended macos-15 and alpine/arm runners. Confirmed by instrumentation (server reached "stream accepted" then stalled 30s with no body) and by it being independent of the exporter's worker-thread count (1 vs 4 both failed), ruling out CPU starvation / a product bug. Fix: keep the `accept()` loop running (which drives the connection and flushes responses) and handle each Export stream in a spawned task. Reproduced the failure under 6x CPU oversubscription (24/160 failures) and verified the fix (160/160, then 120/120, all green) with the default 1-worker runtime. Reverts the earlier #[ignore] from 69aab62 — the test is correct and robust now, so it runs in CI again. Co-Authored-By: Claude Sonnet 4.6 (1M context) <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Adds OTLP gRPC as a third trace-export transport alongside HTTP/JSON and HTTP/protobuf, selectable via the OTel-standard
OTEL_EXPORTER_OTLP_TRACES_PROTOCOL=grpc. A lazy tonicChannelis built from the configured endpoint at builder time;send_trace_chunks_innerdispatches through a newOtlpExportModeenum to either the existing HTTP path or the gRPC path, which sends anExportTraceServiceRequestovertonic::client::Grpc.Motivation
libdatadog's OTLP trace export was HTTP-only. Java and .NET already do gRPC OTLP by hand; PHP, Python, Ruby, and Rust all export through libdatadog FFI, so adding gRPC here unblocks all four at once, and lets SDKs honor
OTEL_EXPORTER_OTLP_TRACES_PROTOCOL=grpcinstead of warning or coercing it to HTTP.Additional Notes
OtlpExportMode { Http(OtlpTraceConfig), Grpc(OtlpGrpcTransport) }replaces theotlp_configfield; the variant is chosen at build time and is mutually exclusive.default-features = false, features = ["transport","codegen"](theprostfeature doesn't exist in 0.14; codegen carries the prost integration). A smallProstCodecImplstands in for theProstCodecthat 0.14 dropped from the public API.Grpc::ready().awaitis driven beforeunary(): tonic'sChannelis a towerBufferthat panics ifcallprecedespoll_ready; both are inside onetokio::time::timeoutso connect and RPC share the deadline.https://endpoints are rejected at build time with a clear error (use a TLS-terminating sidecar). Retry on transient gRPC status is out of scope for v1 (TODO left in place).datadog-client-computed-statsheader are carried through as gRPC metadata.How to test the change?
cargo nextest run -p libdd-data-pipeline -p libdd-data-pipeline-ffi(204 tests),cargo test --doc, clippy, and fmt pass. An in-process h2 server test (test_trace_exporter_otlp_grpc.rs) drives a real gRPC send and asserts the server decodes anExportTraceServiceRequestwith the expectedservice.name; builder and FFI tests covergrpcacceptance andhttps://rejection.:4317) to the backend. Spans landed underservice:verify-grpc-1454: operationgrpc.verify.op, resources checkout/payment/inventory, taggedotlp.transport:grpc(1523 hits viatrace.grpc.verify.op.hits, 0 trace-writer errors). A known-good OpenTelemetry SDK gRPC client through the same Agent produced the same result.BREAKING CHANGE: adds the
OtlpProtocol::Grpcvariant to an exhaustive public enum, and embedding a tonicChannelinTraceExporterdrops itsUnwindSafe/RefUnwindSafeauto-trait impls. libdatadog consumers pin by version, so they pick this up on the next release.