docs: Dekaf consumer behaviors + Spark Structured Streaming guidance #3094

Open
jwhartley wants to merge 1 commit into master from docs/dekaf-consumer-spark

@jwhartley

Contributor

What

Extends using-dekaf.md with consumer guidance that has come up repeatedly in support, in two layers:

  1. Consumer behaviors to know (any client): offsets are journal byte positions; the advertised latest offset can transiently move backward during a broker hand-off and is not data loss (with how to confirm via flowctl collections read); Avro logicalType decoding (e.g. uuid -> UUID object); parallelism via journal splits.
  2. Reading from Apache Spark Structured Streaming: avoid maxOffsetsPerTrigger (byte-budget cap drops partial records), handle failOnDataLoss (it aborts on the transient backward-offset case), and set spark.sql.avro.datetimeRebaseModeInRead explicitly (PERMISSIVE silently nulls pre-Gregorian dates, SPARK-31404). Plus an example reader config.
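A minimal sketch of how the Spark guidance in item 2 might translate into reader options, written in Python for PySpark. It is not the example config from the changed page. The helper names (`dekaf_kafka_options`, `SPARK_CONF`, `load_stream`) are hypothetical, the SASL_SSL/PLAIN settings are an assumption about how the Dekaf endpoint authenticates, setting `failOnDataLoss` to `false` is only one way to "handle" it, and `CORRECTED` is just one of the values Spark accepts for the rebase mode (`EXCEPTION`, `CORRECTED`, `LEGACY`); pick the one that matches your data.

```python
def dekaf_kafka_options(bootstrap_servers, topic, username, token):
    """Build options for Spark's Kafka source (hypothetical helper).

    Deliberately omits maxOffsetsPerTrigger, per the guidance above.
    """
    # Assumption: the endpoint uses SASL_SSL with the PLAIN mechanism.
    jaas = (
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="{username}" password="{token}";'
    )
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "kafka.security.protocol": "SASL_SSL",
        "kafka.sasl.mechanism": "PLAIN",
        "kafka.sasl.jaas.config": jaas,
        "subscribe": topic,
        "startingOffsets": "earliest",
        # Assumption: tolerate the transient backward latest-offset case
        # instead of aborting the query.
        "failOnDataLoss": "false",
    }


# Set the Avro datetime rebase mode explicitly rather than relying on a default.
SPARK_CONF = {"spark.sql.avro.datetimeRebaseModeInRead": "CORRECTED"}


def load_stream(spark, options, conf=SPARK_CONF):
    """Apply the session conf and return a streaming DataFrame.

    `spark` is an existing SparkSession with the Kafka source on its classpath.
    """
    for key, value in conf.items():
        spark.conf.set(key, value)
    reader = spark.readStream.format("kafka")
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader.load()
```

Keeping the options in a plain dictionary makes it easy to check in review or in a unit test that `maxOffsetsPerTrigger` has not crept back in.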

Why

These are recurring, non-obvious Dekaf consumer issues. The byte-offset model, the transient latest-offset regression, and the Avro decoding traps each surfaced as "missing data" reports that turned out to be consumer-side or transient. The transient-latest behavior is tracked in #3092.
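To illustrate the consumer-side view of the transient latest-offset regression, here is a small Python sketch of a tracker that treats a backward-moving "latest" as noise rather than data loss. The class and method names are hypothetical and not part of Dekaf or any Kafka client; the sketch assumes, as described above, that the regression is transient and that offsets are journal byte positions, so differences between offsets are byte counts, not record counts.

```python
class LatestOffsetTracker:
    """Remember the highest advertised latest offset per partition (hypothetical)."""

    def __init__(self):
        self._high = {}

    def observe(self, partition, advertised_latest):
        """Record an advertised latest offset and return the value to trust."""
        previous = self._high.get(partition)
        if previous is None or advertised_latest >= previous:
            self._high[partition] = advertised_latest
            return advertised_latest
        # The advertised value moved backward: keep the earlier high-water mark.
        return previous

    def backlog_bytes(self, partition, consumed_offset):
        """Estimate the unread backlog in bytes, not in records."""
        high = self._high.get(partition, consumed_offset)
        return max(0, high - consumed_offset)


tracker = LatestOffsetTracker()
tracker.observe(0, 10_000)
trusted = tracker.observe(0, 9_200)  # transient regression during a hand-off
backlog = tracker.backlog_bytes(0, 9_500)
```

A tracker like this only smooths monitoring and lag estimates; confirming that no data was lost still relies on reading the collection directly, as the description notes.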

Part 1 lives with the general consumer guidance so non-Spark consumers (Flink, librdkafka, kcat) benefit too; Part 2 is the Spark-specific config that builds on it.

Notes

  • Single file changed, no code. Generic content, no customer specifics.

Add a 'Consumer behaviors to know' section to using-dekaf.md (offsets are
journal byte positions; the advertised latest offset can transiently move
backward on a broker hand-off and is not data loss; Avro logicalType
decoding; parallelism via journal splits) and a 'Reading from Apache Spark
Structured Streaming' section (avoid maxOffsetsPerTrigger, handle
failOnDataLoss, set the Avro datetime rebase mode explicitly).
@github-actions

🚀 Preview deployed to https://docs.estuary.dev/pr-preview/pr-3094/

📄 Changed pages:

@jwhartley requested review from aeluce and jshearer June 30, 2026 02:50