tinyparquet

Version: v0.1.0-beta

A header-only C++17 Parquet reader with a zero-dependency core. It memory-maps Parquet files and decodes them without requiring the Apache Thrift compiler, Apache Arrow, Boost, or system-installed zlib or Snappy libraries. Snappy and LZ4 are decoded natively; GZIP, ZSTD, and BROTLI are optional and build from sources bundled under third_party/.

The Parquet Challenge

Parquet is a complex columnar data format built for big data ecosystems. Reading it means handling three layers:

Metadata Serialization: Parquet schemas and structural metadata are written at the end of the file using Apache Thrift's TCompactProtocol.
Data Page Compression: Column chunks are broken down into pages that are usually compressed using algorithms like Snappy, GZIP, LZ4, or Zstandard.
Values Encoding: The actual scalar values are tightly packed using Dictionary Encoding, Run-Length Encoding (RLE), Bit-Packing, or Delta Encoding.

tinyparquet implements a native decoder for all three of these layers without a single external dependency.
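
To make the metadata layer concrete, here is a minimal sketch of the two integer primitives that TCompactProtocol builds on (ULEB128 varints and ZigZag-encoded signed integers), plus the one-byte field header. The names ByteCursor, ReadVarint, ZigZagDecode, FieldHeader, and ReadFieldHeader are hypothetical and chosen for illustration; they are not taken from tinyparquet's source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical bounds-checked cursor; not part of tinyparquet's API.
struct ByteCursor {
    const uint8_t* data;
    size_t size;
    size_t pos = 0;

    uint8_t NextByte() {
        if (pos >= size) throw std::out_of_range("unexpected end of buffer");
        return data[pos++];
    }
};

// ULEB128: 7 payload bits per byte, least-significant group first.
// The high bit is set on every byte except the last.
inline uint64_t ReadVarint(ByteCursor& c) {
    uint64_t result = 0;
    for (int shift = 0; shift < 64; shift += 7) {
        uint8_t b = c.NextByte();
        result |= static_cast<uint64_t>(b & 0x7F) << shift;
        if ((b & 0x80) == 0) return result;
    }
    throw std::runtime_error("varint longer than 10 bytes");
}

// ZigZag maps 0, 1, 2, 3, ... back to 0, -1, 1, -2, ...
inline int64_t ZigZagDecode(uint64_t n) {
    return static_cast<int64_t>(n >> 1) ^ -static_cast<int64_t>(n & 1);
}

struct FieldHeader {
    int16_t id;
    uint8_t type;  // 0 means STOP (end of struct)
};

// Field header byte: high nibble = field-id delta, low nibble = type.
// A zero delta means the id follows as a ZigZag varint.
inline FieldHeader ReadFieldHeader(ByteCursor& c, int16_t previous_id) {
    uint8_t b = c.NextByte();
    uint8_t type = b & 0x0F;
    if (type == 0) return {previous_id, 0};
    uint8_t delta = b >> 4;
    int16_t id = delta != 0
        ? static_cast<int16_t>(previous_id + delta)
        : static_cast<int16_t>(ZigZagDecode(ReadVarint(c)));
    return {id, type};
}
```

A struct decoder would call ReadFieldHeader in a loop, passing the previous field id each time, and stop when the returned type is 0.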

Supported Formats

Compression Codecs

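Parquet files support many compression algorithms. tinyparquet decodes the codecs listed below. SNAPPY and LZ4_RAW are implemented natively in the header; GZIP, ZSTD, and BROTLI are optional and rely on the bundled third_party sources: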

UNCOMPRESSED
SNAPPY (Native zero-dependency decompressor included)
LZ4_RAW (Native zero-dependency decompressor included)
GZIP (via bundled third_party/miniz)
ZSTD (via bundled third_party/zstd)
BROTLI (via bundled third_party/brotli)
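
As an illustration of what a native codec involves, the following is a minimal sketch of an LZ4 block decoder, which is the format the LZ4_RAW codec uses (raw blocks without the LZ4 frame header). Lz4DecodeBlock is a hypothetical name and this is not the code in decompress.h; a production decoder would normally also preallocate the output from the page's known uncompressed size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: decodes one raw LZ4 block into a byte vector.
inline std::vector<uint8_t> Lz4DecodeBlock(const uint8_t* src, size_t n) {
    std::vector<uint8_t> out;
    size_t i = 0;

    auto need = [&](size_t count) {
        if (count > n - i) throw std::runtime_error("truncated LZ4 block");
    };
    // A 4-bit length of 15 is extended by following bytes until one is < 255.
    auto read_length = [&](size_t base) {
        size_t len = base;
        if (base == 15) {
            uint8_t b;
            do {
                need(1);
                b = src[i++];
                len += b;
            } while (b == 255);
        }
        return len;
    };

    while (i < n) {
        uint8_t token = src[i++];

        // High nibble: number of literal bytes copied verbatim.
        size_t literal_len = read_length(token >> 4);
        need(literal_len);
        out.insert(out.end(), src + i, src + i + literal_len);
        i += literal_len;
        if (i == n) break;  // the final sequence carries literals only

        // Two-byte little-endian offset back into the output so far.
        need(2);
        size_t offset = src[i] | (static_cast<size_t>(src[i + 1]) << 8);
        i += 2;
        if (offset == 0 || offset > out.size())
            throw std::runtime_error("invalid LZ4 match offset");

        // Low nibble: match length minus the 4-byte minimum.
        size_t match_len = read_length(token & 0x0F) + 4;

        // Copy byte by byte: the match may overlap the bytes being written.
        size_t from = out.size() - offset;
        for (size_t k = 0; k < match_len; ++k) {
            uint8_t b = out[from + k];
            out.push_back(b);
        }
    }
    return out;
}
```

The byte-wise match copy is deliberate: an offset smaller than the match length repeats recently written bytes, which is how LZ4 represents runs.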

Data Encodings

PLAIN (Raw values)
PLAIN_DICTIONARY / RLE_DICTIONARY
RLE (Definition & Repetition Levels)
DELTA_BINARY_PACKED (Integers)
DELTA_BYTE_ARRAY (Strings)
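
The RLE encoding used for levels and dictionary indices is a hybrid of run-length runs and bit-packed runs. The sketch below decodes that hybrid for a given bit width, assuming the caller has already stripped any length or bit-width prefix. DecodeRleHybrid is a hypothetical name used for illustration, not a function from tinyparquet.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper: decodes up to num_values values of bit_width bits
// each from a sequence of RLE and bit-packed runs.
inline std::vector<uint32_t> DecodeRleHybrid(const uint8_t* src, size_t n,
                                             int bit_width, size_t num_values) {
    assert(bit_width >= 0 && bit_width <= 32);
    std::vector<uint32_t> out;
    out.reserve(num_values);
    size_t i = 0;
    auto next = [&]() -> uint8_t {
        if (i >= n) throw std::runtime_error("truncated RLE/bit-packed data");
        return src[i++];
    };
    const size_t value_bytes = (static_cast<size_t>(bit_width) + 7) / 8;
    const uint64_t mask = (uint64_t{1} << bit_width) - 1;

    while (out.size() < num_values) {
        // Run header: ULEB128 varint whose low bit selects the run type.
        uint64_t header = 0;
        int shift = 0;
        for (;;) {
            uint8_t b = next();
            header |= static_cast<uint64_t>(b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
            if (shift > 28) throw std::runtime_error("run header too long");
        }

        if (header & 1) {
            // Bit-packed run: (header >> 1) groups of 8 values, packed
            // least-significant bit first.
            size_t count = static_cast<size_t>(header >> 1) * 8;
            uint64_t bits = 0;
            int avail = 0;
            for (size_t k = 0; k < count && out.size() < num_values; ++k) {
                while (avail < bit_width) {
                    bits |= static_cast<uint64_t>(next()) << avail;
                    avail += 8;
                }
                out.push_back(static_cast<uint32_t>(bits & mask));
                bits >>= bit_width;
                avail -= bit_width;
            }
        } else {
            // RLE run: one little-endian value repeated (header >> 1) times.
            size_t count = static_cast<size_t>(header >> 1);
            uint32_t value = 0;
            for (size_t b = 0; b < value_bytes; ++b)
                value |= static_cast<uint32_t>(next()) << (8 * b);
            for (size_t k = 0; k < count && out.size() < num_values; ++k)
                out.push_back(value);
        }
    }
    return out;
}
```

For example, with a bit width of 3 the bytes 0x0A 0x04 describe an RLE run of five 4s, and 0x03 0x88 0xC6 0xFA describe one bit-packed group holding the values 0 through 7.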

Architecture

Header-Only: Drop tinyparquet.hpp into your project.
Zero-Dependency: No external libraries required. The custom decompress.h implements Snappy and LZ4 from scratch.
Zero-Copy Architecture: Uses POSIX mmap so the decoder reads bytes directly from the mapped file instead of first copying the entire file into a heap buffer.
Custom Thrift Decoder: Implements a lightweight TCompactProtocol decoder to parse Parquet FileMetaData without compiling Thrift structs.
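
The mmap and metadata points above can be sketched together: map the file read-only, then use the Parquet trailer (a 4-byte little-endian footer length followed by the magic bytes "PAR1") to find where the Thrift-encoded FileMetaData begins. MappedFile, FooterSpan, and LocateFooter are hypothetical names for illustration and may not match how tinyparquet structures this internally.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical RAII wrapper around a read-only POSIX mapping.
class MappedFile {
public:
    explicit MappedFile(const std::string& path) {
        int fd = ::open(path.c_str(), O_RDONLY);
        if (fd < 0) throw std::runtime_error("cannot open " + path);
        struct stat st;
        if (::fstat(fd, &st) != 0 || st.st_size <= 0) {
            ::close(fd);
            throw std::runtime_error("cannot stat or empty file: " + path);
        }
        size_ = static_cast<size_t>(st.st_size);
        void* p = ::mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd, 0);
        ::close(fd);  // the mapping remains valid after the descriptor closes
        if (p == MAP_FAILED) throw std::runtime_error("mmap failed: " + path);
        data_ = static_cast<const uint8_t*>(p);
    }
    ~MappedFile() { ::munmap(const_cast<uint8_t*>(data_), size_); }
    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;

    const uint8_t* data() const { return data_; }
    size_t size() const { return size_; }

private:
    const uint8_t* data_ = nullptr;
    size_t size_ = 0;
};

struct FooterSpan {
    size_t offset;    // where the serialized FileMetaData starts
    uint32_t length;  // its size in bytes
};

// File layout: "PAR1" <data> <footer> <4-byte LE footer length> "PAR1"
inline FooterSpan LocateFooter(const uint8_t* data, size_t size) {
    if (size < 12 || std::memcmp(data, "PAR1", 4) != 0 ||
        std::memcmp(data + size - 4, "PAR1", 4) != 0)
        throw std::runtime_error("not a Parquet file");
    const uint8_t* p = data + size - 8;
    uint32_t len = static_cast<uint32_t>(p[0]) |
                   (static_cast<uint32_t>(p[1]) << 8) |
                   (static_cast<uint32_t>(p[2]) << 16) |
                   (static_cast<uint32_t>(p[3]) << 24);
    if (len > size - 12) throw std::runtime_error("footer length out of range");
    return {size - 8 - len, len};
}
```

With these pieces, a reader would construct a MappedFile, call LocateFooter(file.data(), file.size()), and hand the resulting byte range to the Thrift decoder without copying it.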

Usage

Include the single header file in your C++ project:

#include "tinyparquet.hpp"
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    try {
        // Initialize the reader
        tinyparquet::Reader reader("testing/alltypes_plain.snappy.parquet");
        auto metadata = reader.GetMetaData();
        
        std::cout << "Rows: " << metadata.num_rows << "\n";
        
        // Extract values from an integer column
        auto int_reader = reader.GetColumnReader("int_col");
        std::vector<int32_t> int_values;
        int_reader.ReadAllInt32(int_values);
        
        for (auto v : int_values) {
            std::cout << v << " ";
        }
        std::cout << "\n";
    } catch (const std::exception& e) {
        std::cerr << "Error: " << e.what() << "\n";
        return 1;
    }
    return 0;
}

Compilation

Since tinyparquet is a header-only library, no separate library compilation is required. Simply compile your code with C++17 support:

g++ -std=c++17 main.cpp -o app

To enable GZIP, ZSTD, and BROTLI decompression, compile the bundled third_party sources and define the matching TINYPARQUET_ENABLE_* macros when building your code:

gcc -c third_party/miniz/miniz.c -o miniz.o
gcc -c third_party/zstd/zstd.c -o zstd.o
gcc -c -Ithird_party/brotli/c/include third_party/brotli/c/common/*.c third_party/brotli/c/dec/*.c
g++ -std=c++17 -Ithird_party/miniz -Ithird_party/zstd -Ithird_party/brotli/c/include -DTINYPARQUET_ENABLE_GZIP -DTINYPARQUET_ENABLE_ZSTD -DTINYPARQUET_ENABLE_BROTLI main.cpp *.o -o app

Development & Testing

Test files are located in the testing/ directory. They are taken from the official apache/parquet-testing repository to check compliance with the spec.

To regenerate tinyparquet.hpp from the src/ directory after making changes:

python3 amalgamate.py
