Gr-SA is an index for DNA pattern matching on Wheeler Deterministic Finite Automata (WDFAs). Co-lexicographic ordering of the automaton enables binary-search-based pattern lookup, analogous to suffix array, but operating directly on graphs.
- C++20 compiler (GCC or Clang)
- CMake 3.16+
- x86-64 (AVX2) or ARM (NEON) recommended for best performance; scalar fallback is available
No external dependencies beyond the bundled include/elias_fano/ library.
mkdir -p build && cd build
cmake ..
make -j4The binary is placed at build/gr_sa. Release builds use -O3 -march=native -funroll-loops with LTO enabled.
# Debug build (no optimizations, LTO disabled)
cmake -DCMAKE_BUILD_TYPE=Debug ..
# Cross-compile or CI: disable -march=native
cmake -DENABLE_NATIVE_ARCH=OFF .../build/gr_sa -b input.wg index.bin # default layout: HLD
./build/gr_sa -b -l hld input.wg index.bin # HLD (label-length heaviness)
./build/gr_sa -b -l hldn input.wg index.bin # HLD (node-count heaviness)
./build/gr_sa -b -l bfs input.wg index.bin # BFS order
./build/gr_sa -b -l dfs input.wg index.bin # DFS pre-order# Single pattern — prints co-lex Wheeler ID range
./build/gr_sa -q index.bin ACGTACGT
# Interactive mode (reads patterns from stdin)
./build/gr_sa -q index.bin
# Batch: patterns from stdin, existence counts (0/1) to file
./build/gr_sa -t index.bin counts.txt
# Batch: patterns from stdin, Step-3 GFA states traversed per query to file
./build/gr_sa -d index.bin traversal.txtBatch modes read one DNA pattern per line from stdin:
cat queries.txt | ./build/gr_sa -t index.bin output.txt| Flag | Description |
|---|---|
hld (default) |
Heavy-light decomposition of the inf-predecessor forest; heaviness = subtree label length. Keeps the inf_preds backward walk cache-local. |
hldn |
Same as hld but heaviness = subtree node count. A/B-testable alternative. |
bfs |
BFS path-compression order. Fast to build; good forward-traversal locality. |
dfs |
DFS pre-order. Every node's forward-reachable subtree is a contiguous ID range. |
n m sigma p # states, edges, alphabet size, source count
<char> # sigma lines, one label per line (e.g. A C G T)
<in-degree>... # n space-separated in-degrees
<out-degree>... # n space-separated out-degrees
<edge-labels> # m edge labels in Wheeler order (no separators)
<path-string> # full concatenated path string
Sample graphs are in graphs/. DNA.wg and out.wg are small test graphs. chr-22-4.txt is a large genomic dataset (pair it with the query set q-1500-5.txt).
Queries follow a three-step pipeline defined in WDFAIndex::count / locate:
-
Prefix binary search —
_longest_prefix_matchbinary-searches for the longest prefix of the pattern present in the Wheeler graph. A pre-computed tabulation table (default: all 16-mers) seeds the co-lex interval; exact k-length matches skip the oracle entirely. -
Oracle extension —
compute_Treturns the co-lex interval[inf, sup]for the matched prefix.compute_candidatefinds the single candidate state forprefix + next_char.LabelsMap::has_labelconfirms the transition exists. -
Graph traversal —
GFA::match_patternwalks the path-compressed graph from the candidate state to verify the remaining suffix. The-dmode counts GFA nodes visited in this step.
| Method | Returns | Notes |
|---|---|---|
count(pattern) |
size_t (0 or 1) |
Existence check; Step 3 walks a single candidate only |
locate(pattern) |
optional<pair<GFA_position, GFA_position>> |
Co-lex range as GFA positions |
locate_range(pattern) |
pair<uint64_t, uint64_t> |
Co-lex range as raw Wheeler IDs; used by -q mode |
Note:
count()returns 0 or 1, not true multiplicity — Step 3 only confirms existence via a single candidate walk.
WDFA file → parse adjacency list → GFA (path-compressed)
→ GFAOracle (inf/sup predecessor arrays)
→ LabelsMap (Elias-Fano edge labels)
→ tabulated_DNA (k-mer table)
→ WDFAIndex (serialized binary)
| Component | Files | Role |
|---|---|---|
GFA |
src/GFA/GFA.cpp, include/GFA/GFA.hpp |
Path-compressed graph. Consecutive degree-1 nodes are merged. Labels stored in a pooled DnaPacked blob (CSR); hot fields in 16-byte NodeHot AoS records. |
GFAOracle |
src/InfSupOracles/GFAOracle.cpp |
Flat uint32 inf_preds/sup_preds arrays for co-lex predecessor navigation. |
WDFAIndex |
src/WDFAIndex.hpp |
Header-only template on OracleType; owns query algorithm and binary serialization. |
tabulated_DNA |
src/InfSupOracles/tabulation.cpp |
Pre-computed co-lex intervals for all k-length DNA patterns (default k=16). |
DnaPacked |
include/bit_packing/DnaPacked.hpp |
2-bit DNA encoding (A=00 C=01 G=10 T=11); SIMD comparison via AVX2/NEON. |
LabelsMap |
include/LabelsMap/LabelsMap.hpp |
Elias-Fano bitvector per alphabet symbol; O(1) has_label range queries. |
elias_fano |
include/elias_fano/ |
Succinct bitvector with rank/select (external library). |
Two distinct node ID spaces exist throughout the code:
StateID(uint64_t) — original co-lex Wheeler-order node ID; the canonical identity used by the algorithm.GFAId/gfa_index_t(uint32_t) — index into the path-compressed GFA's internal arrays. One GFA node may represent many original Wheeler states.
Use GFA::get_state_position(StateID) to convert a StateID to a GFA_position{GFAId, offset}.
Defined directly in source (not via CMake). Uncomment to enable:
| Macro | Location | Effect |
|---|---|---|
DEBUG |
src/main.cpp, src/WDFAIndex.hpp |
Verbose step-by-step output during build and query |
DEBUG_PRINT |
src/WDFAIndex.hpp |
Full memory breakdown from get_memory_usage() |
ENABLE_PERF_TIMING |
src/WDFAIndex.hpp |
Per-step timing stats printed at end of -t mode |
ENABLE_PREFIX_LOGGING |
src/WDFAIndex.hpp |
Writes longest-prefix length per query to <output>.prefix_len |
include/testing.hpp provides verify_navigation(edges, gfa, oracle) — a friend-access sanity check that walks every original edge and confirms GFA adjacency and inf/sup bracket consistency. Useful during development after building the index.
src/
main.cpp CLI entry point and batch I/O
WDFAIndex.hpp Header-only index (template on OracleType)
GFA/GFA.cpp
InfSupOracles/GFAOracle.cpp
InfSupOracles/tabulation.cpp
include/
GFA/GFA.hpp
InfSupOracles/GFAOracle.hpp
InfSupOracles/tabulation.hpp
LabelsMap/LabelsMap.hpp
bit_packing/DnaPacked.hpp
utils/utils.hpp
elias_fano/ Succinct bitvector library (external)
testing.hpp
graphs/ Sample .wg files and query sets