Stream large files with sketches instead of full reads #1

Description

@delcenjo

For very large datasets, profiling currently reads the whole file into memory through polars. Backing the numeric profile with a t-digest and the categorical profile with a count-min or HyperLogLog (HLL) sketch would let dsdiff build each profile in a single streaming pass and diff files that do not fit in memory, without changing the PSI semantics.
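To make the categorical half of the proposal concrete, here is a minimal sketch in pure Python, not dsdiff's actual code. It assumes PSI means the population stability index, and every name in it (`CountMinSketch`, `sketch_stream`, `psi`) is hypothetical. A count-min sketch cannot enumerate its keys, so the category list is passed in explicitly here; a real implementation would likely track heavy hitters alongside the sketch.

```python
import hashlib
import math
from typing import Iterable, Iterator


class CountMinSketch:
    """Fixed-size frequency sketch. Estimates may overcount, never undercount."""

    def __init__(self, width: int = 2048, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.total = 0  # number of items seen so far
        self.table = [[0] * width for _ in range(depth)]

    def _slots(self, key: str) -> Iterator[int]:
        # One independent hash per row, obtained by salting BLAKE2b with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(2, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        self.total += count
        for row, col in enumerate(self._slots(key)):
            self.table[row][col] += count

    def estimate(self, key: str) -> int:
        # The minimum over rows is the estimate least inflated by collisions.
        return min(self.table[row][col] for row, col in enumerate(self._slots(key)))


def sketch_stream(values: Iterable[str]) -> CountMinSketch:
    """Build a sketch in one pass; `values` can be a lazy iterator over a file."""
    sketch = CountMinSketch()
    for value in values:
        sketch.add(value)
    return sketch


def psi(expected: CountMinSketch, actual: CountMinSketch,
        categories: Iterable[str], eps: float = 1e-6) -> float:
    """Population stability index over the given categories, from estimated counts."""
    score = 0.0
    for cat in categories:
        # Clamp proportions to eps so empty categories do not produce log(0).
        e = max(expected.estimate(cat) / expected.total, eps)
        a = max(actual.estimate(cat) / actual.total, eps)
        score += (a - e) * math.log(a / e)
    return score
```

Because each sketch has a fixed size, memory stays constant regardless of file length, and the PSI formula itself is unchanged; only the counts feeding it become estimates. For example, `psi(sketch_stream(old_values), sketch_stream(new_values), ["a", "b"])` compares two streams without materializing either one.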

Metadata

Assignees

No one assigned

Labels

enhancement: New feature or request
