GH-49946: [Format] Better document equivalence between IPC file and streams#49947

Draft
pitrou wants to merge 1 commit into apache:main from pitrou:ipc-file-stream-equivalence

pitrou (Member) commented May 7, 2026

Rationale for this change

As discussed in https://lists.apache.org/thread/jpxl3yzm96wkxzb1clokxklsy32b3plh, we want to better document the rough equivalence between IPC files and streams.

What changes are included in this PR?

  • Recommendation around emitting "nice" IPC file footers that don't reorder, omit or repeat batches
  • Better outlining of deviations of the IPC file format vs. the IPC streaming format
  • Assorted minor wording and presentation improvements in the columnar / IPC doc
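
As a rough illustration of the first point, a "nice" footer could be checked along the following lines. This is a minimal sketch and not part of the PR: the `Block` fields are modeled on the footer `Block` struct in `File.fbs`, while `footer_matches_stream_order` and the `scanned_offsets` input (the offsets of the record batch messages as met when scanning the file sequentially, the way a stream reader would) are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Block:
    """Location of one message in an IPC file (modeled on File.fbs)."""
    offset: int           # position of the message from the start of the file
    metadata_length: int  # size of the Flatbuffers metadata
    body_length: int      # size of the message body


def footer_matches_stream_order(
    footer_blocks: Sequence[Block],
    scanned_offsets: Sequence[int],
) -> bool:
    """Return True if the footer lists exactly the batches a sequential
    scan would meet, in the same order.

    Assumption: scanned_offsets is in physical (increasing) order, so any
    reordered, omitted, or repeated footer entry makes the lists differ.
    """
    return [block.offset for block in footer_blocks] == list(scanned_offsets)


# Hypothetical offsets, for illustration only.
scanned = [8, 520, 1032]
nice = [Block(8, 256, 256), Block(520, 256, 256), Block(1032, 256, 256)]
reordered = [nice[1], nice[0], nice[2]]
print(footer_matches_stream_order(nice, scanned))       # True
print(footer_matches_stream_order(reordered, scanned))  # False
```

The comparison is deliberately strict: a footer that passes yields the same batches whether it is read through the footer or scanned as a stream.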

Are these changes tested?

N/A.

Are there any user-facing changes?

No.

pitrou force-pushed the ipc-file-stream-equivalence branch from 3da85f2 to 6ea2485 on May 7, 2026 14:27

pitrou (Member, Author) commented May 7, 2026

@github-actions crossbow submit preview-docs

github-actions (Bot) commented May 7, 2026

Revision: 6ea2485

Submitted crossbow builds: ursacomputing/crossbow @ actions-0208c5a8f4

Task Status
preview-docs GitHub Actions

pitrou force-pushed the ipc-file-stream-equivalence branch from 6ea2485 to 28765e5 on May 7, 2026 15:17

pitrou (Member, Author) commented May 7, 2026

@github-actions crossbow submit preview-docs

github-actions (Bot) commented May 7, 2026

Revision: 28765e5

Submitted crossbow builds: ursacomputing/crossbow @ actions-6cfd80b0e4

Task Status
preview-docs GitHub Actions

pitrou force-pushed the ipc-file-stream-equivalence branch from 28765e5 to 51fb5a1 on May 7, 2026 16:01

pitrou (Member, Author) commented May 7, 2026

@github-actions crossbow submit preview-docs

github-actions (Bot) commented May 7, 2026

Revision: 51fb5a1

Submitted crossbow builds: ursacomputing/crossbow @ actions-3cd6e7c16a

Task Status
preview-docs GitHub Actions

pitrou (Member, Author) commented May 7, 2026

paleolimbot (Member) left a comment

Thank you for writing this up!

Comment on lines +1350 to +1351
The ``Buffer`` Flatbuffers value describes the location and size of a buffer's
data, relatively to the start of the RecordBatch message's body.
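
To illustrate what "relative to the start of the message body" means in practice, here is a minimal sketch. It assumes the body has already been read into memory as `bytes`; the `Buffer` fields mirror the `offset` and `length` members of the Flatbuffers struct, and `read_buffer` is a hypothetical helper, not an Arrow API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Buffer:
    """Mirrors the Flatbuffers Buffer struct: both fields count bytes."""
    offset: int  # relative to the start of the message body
    length: int


def read_buffer(body: bytes, buf: Buffer) -> bytes:
    """Slice one buffer out of a RecordBatch message body.

    The offset is applied to the body itself, not to the enclosing
    file or stream, so no global position is needed here.
    """
    end = buf.offset + buf.length
    if buf.offset < 0 or buf.length < 0 or end > len(body):
        raise ValueError("buffer lies outside the message body")
    return body[buf.offset:end]


# Illustrative body: 8 bytes of validity bitmap followed by 8 bytes of data.
body = bytes([0xFF]) + bytes(7) + b"abcdefgh"
print(read_buffer(body, Buffer(offset=8, length=8)))  # b'abcdefgh'
```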

paleolimbot (Member) commented

I believe the offsets can be global in the dissociated IPC protocol (metadata and bodies sent on separate streams), although I forget if that was ever actually implemented anywhere.

pitrou (Member, Author) commented

Hmm, the question is then whether this section of the doc describes the IPC protocol in a narrow sense, or whether it applies to any variation over it such as the Dissociated IPC protocol.

For readability I think it's better if it describes the concrete IPC protocol as commonly implemented. What do you think?

paleolimbot (Member) commented

I don't mind either way, I was just reminded of it when Ben was working on the nanoarrow IPC writer. Will RST let you make a footnote?

Comment on lines +1541 to +1542
compliant writers SHOULD arrange the IPC File footer so that an IPC File can be
read using an IPC Stream reader with equivalent results.

paleolimbot (Member) commented

It may be nice at some point to indicate in the "features" section of a flatbuffers Schema that the stream can definitely be read as an IPC stream (i.e., doesn't differ between what one would get from reading using the blocks in the footer). The fact that nanoarrow does this blindly is not great and I'll fix it, but it is a cool feature that you can do full scans without random access in most cases.

pitrou (Member, Author) commented

What do you call the features section? Is it this enum (arrow/format/Schema.fbs, lines 71 to 81 in a0d2885)?

enum Feature : long {
  /// Needed to make flatbuffers happy.
  UNUSED = 0,
  /// The stream makes use of multiple full dictionaries with the
  /// same ID and assumes clients implement dictionary replacement
  /// correctly.
  DICTIONARY_REPLACEMENT = 1,
  /// The stream makes use of compressed bodies as described
  /// in Message.fbs.
  COMPRESSED_BODY = 2
}

I don't think anyone is reading this table currently.
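
For context, a reader that did consult the declared features might look roughly like the sketch below. This is purely illustrative and not taken from any Arrow implementation: the enum values are copied from the snippet above, while `unsupported_features` and its inputs are hypothetical.

```python
from enum import IntEnum
from typing import Iterable, List, Set


class Feature(IntEnum):
    """Values copied from the Feature enum in Schema.fbs."""
    UNUSED = 0
    DICTIONARY_REPLACEMENT = 1
    COMPRESSED_BODY = 2


def unsupported_features(declared: Iterable[int], supported: Set[Feature]) -> List[int]:
    """Return the declared features this (hypothetical) reader cannot handle.

    Values unknown to this reader are reported as plain integers.
    """
    missing: List[int] = []
    for value in declared:
        try:
            feature = Feature(value)
        except ValueError:
            missing.append(value)  # newer than this reader knows about
            continue
        if feature is not Feature.UNUSED and feature not in supported:
            missing.append(feature)
    return missing


# A reader that handles compressed bodies but not dictionary replacement.
print(unsupported_features([1, 2], {Feature.COMPRESSED_BODY}) == [Feature.DICTIONARY_REPLACEMENT])  # True
```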

paleolimbot (Member) commented

That's what I was thinking of (but no need to deal with this now 🙂 )

github-actions (Bot) added the awaiting merge label and removed the awaiting review label on May 7, 2026

pitrou (Member, Author) commented May 11, 2026

github-actions (Bot) added the awaiting changes label and removed the awaiting merge label on May 11, 2026
2 participants