Version: v0.1.0-beta
A zero-dependency, header-only C++17 Parquet reader. It is designed to memory-map compressed Parquet files and decode them directly in memory without requiring the Apache Thrift compiler, Apache Arrow, Boost, zlib, Snappy, or any other external compression library.
Parquet is a complex columnar data format built for big data ecosystems. Reading it means handling three primary layers:
- Metadata Serialization: Parquet schemas and structural metadata are written at the end of the file using Apache Thrift's TCompactProtocol.
- Data Page Compression: Column chunks are broken down into pages that are usually compressed using algorithms like Snappy, GZIP, LZ4, or Zstandard.
- Values Encoding: The actual scalar values are tightly packed using Dictionary Encoding, Run-Length Encoding (RLE), Bit-Packing, or Delta Encoding.
tinyparquet implements a native decoder for all three of these layers without a single external dependency.
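To make "written at the end of the file" concrete: an unencrypted Parquet file starts and ends with the 4-byte magic PAR1, and the 4 bytes before the trailing magic hold the little-endian length of the Thrift-encoded FileMetaData. The sketch below locates that metadata in a byte buffer. FooterSpan and LocateFooter are hypothetical names chosen for illustration; they are not taken from tinyparquet.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>

// Hypothetical helper type: where the Thrift-encoded FileMetaData lives.
struct FooterSpan {
    std::size_t offset;    // byte offset of the metadata within the file
    std::uint32_t length;  // metadata size in bytes
};

// Assumes `data` holds the entire (unencrypted) file, e.g. an mmap'ed region.
// Returns std::nullopt if the magic bytes or the footer length are invalid.
inline std::optional<FooterSpan> LocateFooter(const std::uint8_t* data, std::size_t size) {
    constexpr std::size_t kMagicLen = 4;
    constexpr std::size_t kTrailerLen = 8;  // 4-byte length + "PAR1"

    // Smallest possible file: leading magic + trailer.
    if (data == nullptr || size < kMagicLen + kTrailerLen) return std::nullopt;
    if (std::memcmp(data, "PAR1", kMagicLen) != 0) return std::nullopt;
    if (std::memcmp(data + size - kMagicLen, "PAR1", kMagicLen) != 0) return std::nullopt;

    // Footer length is stored little-endian just before the trailing magic.
    const std::uint8_t* p = data + size - kTrailerLen;
    const std::uint32_t len = std::uint32_t(p[0]) | (std::uint32_t(p[1]) << 8) |
                              (std::uint32_t(p[2]) << 16) | (std::uint32_t(p[3]) << 24);

    // The metadata must fit between the leading magic and the trailer.
    if (len > size - kMagicLen - kTrailerLen) return std::nullopt;
    return FooterSpan{size - kTrailerLen - len, len};
}
```

The returned span is what a TCompactProtocol decoder would then parse.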
Parquet files support many compression algorithms. tinyparquet decodes SNAPPY and LZ4_RAW natively in header-only C++ and can optionally handle GZIP, ZSTD, and BROTLI through bundled third-party sources:
- UNCOMPRESSED
- SNAPPY (Native zero-dependency decompressor included)
- LZ4_RAW (Native zero-dependency decompressor included)
- GZIP (via bundled third_party/miniz)
- ZSTD (via bundled third_party/zstd)
- BROTLI (via bundled third_party/brotli)
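To show why a codec such as LZ4_RAW is small enough to implement natively, here is a minimal sketch of an LZ4 block decoder (Parquet's LZ4_RAW is the plain LZ4 block format without framing). It is an illustration only: Lz4RawDecompress is a hypothetical name, the code is not taken from tinyparquet's decompress.h, and it favors clarity over speed.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Illustrative LZ4 block decoder (hypothetical name, not tinyparquet's API).
inline std::vector<std::uint8_t> Lz4RawDecompress(const std::uint8_t* src, std::size_t n) {
    std::vector<std::uint8_t> out;
    std::size_t i = 0;

    // A nibble value of 15 means the length continues in following bytes,
    // each added to the total, until a byte below 255 is seen.
    auto read_extended = [&](std::size_t len) {
        if (len != 15) return len;
        std::uint8_t b;
        do {
            if (i >= n) throw std::runtime_error("lz4: truncated length");
            b = src[i++];
            len += b;
        } while (b == 255);
        return len;
    };

    while (i < n) {
        const std::uint8_t token = src[i++];

        // 1. Literal run: bytes copied verbatim from the input.
        const std::size_t lit_len = read_extended(token >> 4);
        if (lit_len > n - i) throw std::runtime_error("lz4: truncated literals");
        out.insert(out.end(), src + i, src + i + lit_len);
        i += lit_len;
        if (i == n) break;  // the final sequence carries literals only

        // 2. Match: a 2-byte little-endian offset back into the output.
        if (n - i < 2) throw std::runtime_error("lz4: truncated offset");
        const std::size_t offset = src[i] | (std::size_t(src[i + 1]) << 8);
        i += 2;
        if (offset == 0 || offset > out.size()) throw std::runtime_error("lz4: bad offset");

        // The stored match length is biased by the minimum match length of 4.
        const std::size_t match_len = read_extended(token & 0x0F) + 4;

        // Copy byte by byte so overlapping matches (offset < length) repeat correctly.
        const std::size_t from = out.size() - offset;
        for (std::size_t k = 0; k < match_len; ++k) {
            const std::uint8_t b = out[from + k];
            out.push_back(b);
        }
    }
    return out;
}
```

A production decoder would normally preallocate the output using the uncompressed page size from the page header rather than growing a vector.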
tinyparquet decodes the following value encodings:
- PLAIN (Raw values)
- PLAIN_DICTIONARY / RLE_DICTIONARY
- RLE (Definition & Repetition Levels)
- DELTA_BINARY_PACKED (Integers)
- DELTA_BYTE_ARRAY (Strings)
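As an illustration of the RLE entry above, the sketch below decodes Parquet's RLE/bit-packing hybrid, which is used for definition and repetition levels and for dictionary indices. Each run starts with a ULEB128 header: an even header is an RLE run (count, then one value stored in ceil(bit_width / 8) bytes), and an odd header is a bit-packed run of groups of 8 values packed LSB-first. DecodeRleHybrid is a hypothetical name; the sketch assumes the caller already knows bit_width and has skipped any length or bit-width prefix that precedes the runs.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Illustrative decoder (hypothetical name, not tinyparquet's API).
inline std::vector<std::uint32_t> DecodeRleHybrid(const std::uint8_t* src, std::size_t n,
                                                  unsigned bit_width, std::size_t num_values) {
    if (bit_width > 32) throw std::invalid_argument("rle: bit_width must be <= 32");
    std::vector<std::uint32_t> out;
    out.reserve(num_values);
    const std::size_t value_bytes = (bit_width + 7) / 8;
    std::size_t i = 0;

    while (out.size() < num_values) {
        // Run header: ULEB128 varint, 7 payload bits per byte.
        std::uint64_t header = 0;
        for (unsigned shift = 0;; shift += 7) {
            if (i >= n || shift > 63) throw std::runtime_error("rle: bad run header");
            const std::uint8_t b = src[i++];
            header |= std::uint64_t(b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
        }

        if ((header & 1) == 0) {
            // RLE run: one little-endian value repeated `count` times.
            const std::uint64_t count = header >> 1;
            if (value_bytes > n - i) throw std::runtime_error("rle: truncated RLE run");
            std::uint32_t value = 0;
            for (std::size_t k = 0; k < value_bytes; ++k) {
                value |= std::uint32_t(src[i + k]) << (8 * k);
            }
            i += value_bytes;
            for (std::uint64_t c = 0; c < count && out.size() < num_values; ++c) {
                out.push_back(value);
            }
        } else {
            // Bit-packed run: `groups` groups of 8 values, bit_width bytes per group.
            const std::uint64_t groups = header >> 1;
            if (bit_width != 0 && groups > (n - i) / bit_width) {
                throw std::runtime_error("rle: truncated bit-packed run");
            }
            std::uint64_t bit_pos = 0;
            for (std::uint64_t v = 0; v < groups * 8 && out.size() < num_values; ++v) {
                // Gather bit_width bits, least significant bit first.
                std::uint64_t acc = 0;
                for (unsigned b = 0; b < bit_width; ++b) {
                    const std::uint64_t bit = bit_pos + b;
                    acc |= std::uint64_t((src[i + bit / 8] >> (bit % 8)) & 1) << b;
                }
                out.push_back(static_cast<std::uint32_t>(acc));
                bit_pos += bit_width;
            }
            i += static_cast<std::size_t>(groups * bit_width);
        }
    }
    return out;
}
```

For example, with bit_width 3 the bytes 08 05 03 88 C6 FA describe an RLE run of four 5s followed by one bit-packed group holding 0 through 7.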
Key features:
- Header-Only: Drop tinyparquet.hpp into your project.
- Zero-Dependency: No external libraries required. The custom decompress.h implements Snappy and LZ4 from scratch.
- Zero-Copy Architecture: Uses POSIX mmap to access file data in place, without first buffering entire files into memory.
- Custom Thrift Decoder: Implements a lightweight TCompactProtocol decoder to parse Parquet FileMetaData without compiling Thrift structs.
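The Thrift compact protocol needs only a few primitives, which is why a hand-written decoder is practical. The sketch below shows three of them: ULEB128 varints, zigzag-decoded signed integers, and field headers (high nibble is the field-id delta, low nibble is the type, and type 0 marks the end of a struct). CompactReader is a hypothetical class written for illustration, not tinyparquet's decoder, and it tracks field ids for a single struct level only; nested structs would need a stack of previous ids.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Illustrative cursor over TCompactProtocol bytes (hypothetical, single struct level).
class CompactReader {
public:
    struct FieldHeader {
        std::int16_t id;    // absolute field id
        std::uint8_t type;  // compact type code: 5 = i32, 6 = i64, 0 = STOP, ...
    };

    CompactReader(const std::uint8_t* data, std::size_t size)
        : p_(data), end_(data + size) {}

    // Unsigned LEB128: 7 payload bits per byte, high bit set means "more bytes".
    std::uint64_t ReadVarint() {
        std::uint64_t result = 0;
        for (unsigned shift = 0; shift < 64; shift += 7) {
            if (p_ == end_) throw std::runtime_error("thrift: truncated varint");
            const std::uint8_t b = *p_++;
            result |= std::uint64_t(b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
        }
        throw std::runtime_error("thrift: varint too long");
    }

    // Signed integers (i16/i32/i64) are zigzag-encoded and then written as varints.
    std::int64_t ReadZigZag() {
        const std::uint64_t n = ReadVarint();
        return static_cast<std::int64_t>(n >> 1) ^ -static_cast<std::int64_t>(n & 1);
    }

    // Returns type 0 (STOP) when the end of the struct is reached.
    FieldHeader ReadFieldHeader() {
        if (p_ == end_) throw std::runtime_error("thrift: truncated field header");
        const std::uint8_t b = *p_++;
        const std::uint8_t type = b & 0x0F;
        if (type == 0) return {0, 0};
        const std::uint8_t delta = b >> 4;
        // A zero delta means the field id follows explicitly as a zigzag varint.
        last_id_ = delta != 0 ? static_cast<std::int16_t>(last_id_ + delta)
                              : static_cast<std::int16_t>(ReadZigZag());
        return {last_id_, type};
    }

private:
    const std::uint8_t* p_;
    const std::uint8_t* end_;
    std::int16_t last_id_ = 0;
};
```

For instance, the bytes 15 02 decode as field 1 of type i32 with value 1, which is how the version field at the start of FileMetaData can appear on disk.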
Include the single header file in your C++ project:
#include "tinyparquet.hpp"
#include <cstdint>
#include <exception>
#include <iostream>
#include <vector>
int main() {
    try {
        // Initialize the reader (memory-maps the file)
        tinyparquet::Reader reader("testing/alltypes_plain.snappy.parquet");
        auto metadata = reader.GetMetaData();
        std::cout << "Rows: " << metadata.num_rows << "\n";
        // Extract values from an integer column
        auto int_reader = reader.GetColumnReader("int_col");
        std::vector<int32_t> int_values;
        int_reader.ReadAllInt32(int_values);
        for (auto v : int_values) {
            std::cout << v << " ";
        }
        std::cout << "\n";
    } catch (const std::exception& e) {
        // Report any failure raised while opening or decoding the file
        std::cerr << "Error: " << e.what() << "\n";
        return 1;
    }
    return 0;
}
Since tinyparquet is a header-only library, no separate library compilation is required. Simply compile your code with C++17 support:
g++ -std=c++17 main.cpp -o app
If you wish to enable GZIP, ZSTD, and BROTLI decompression, define the corresponding TINYPARQUET_ENABLE_* macros and compile the bundled third_party sources:
gcc -c third_party/miniz/miniz.c -o miniz.o
gcc -c third_party/zstd/zstd.c -o zstd.o
gcc -c -Ithird_party/brotli/c/include third_party/brotli/c/common/*.c third_party/brotli/c/dec/*.c
g++ -std=c++17 -Ithird_party/miniz -Ithird_party/zstd -Ithird_party/brotli/c/include -DTINYPARQUET_ENABLE_GZIP -DTINYPARQUET_ENABLE_ZSTD -DTINYPARQUET_ENABLE_BROTLI main.cpp *.o -o app
Test files are located in the testing/ directory. They are taken from the official apache/parquet-testing repository to check compliance with the spec.
To regenerate tinyparquet.hpp from the src/ directory after making changes:
python3 amalgamate.py