Skip to content

regindex/Gr-SA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Gr-SA

Gr-SA is an index for DNA pattern matching on Wheeler Deterministic Finite Automata (WDFAs). Co-lexicographic ordering of the automaton enables binary-search-based pattern lookup, analogous to suffix array, but operating directly on graphs.

Requirements

  • C++20 compiler (GCC or Clang)
  • CMake 3.16+
  • x86-64 (AVX2) or ARM (NEON) recommended for best performance; scalar fallback is available

No external dependencies beyond the bundled include/elias_fano/ library.

Build

mkdir -p build && cd build
cmake ..
make -j4

The binary is placed at build/gr_sa. Release builds use -O3 -march=native -funroll-loops with LTO enabled.

# Debug build (no optimizations, LTO disabled)
cmake -DCMAKE_BUILD_TYPE=Debug ..

# Cross-compile or CI: disable -march=native
cmake -DENABLE_NATIVE_ARCH=OFF ..

Usage

Build an index

./build/gr_sa -b input.wg index.bin            # default layout: HLD
./build/gr_sa -b -l hld  input.wg index.bin    # HLD (label-length heaviness)
./build/gr_sa -b -l hldn input.wg index.bin    # HLD (node-count heaviness)
./build/gr_sa -b -l bfs  input.wg index.bin    # BFS order
./build/gr_sa -b -l dfs  input.wg index.bin    # DFS pre-order

Query

# Single pattern — prints co-lex Wheeler ID range
./build/gr_sa -q index.bin ACGTACGT

# Interactive mode (reads patterns from stdin)
./build/gr_sa -q index.bin

# Batch: patterns from stdin, existence counts (0/1) to file
./build/gr_sa -t index.bin counts.txt

# Batch: patterns from stdin, Step-3 GFA states traversed per query to file
./build/gr_sa -d index.bin traversal.txt

Batch modes read one DNA pattern per line from stdin:

cat queries.txt | ./build/gr_sa -t index.bin output.txt

Layout options

Flag Description
hld (default) Heavy-light decomposition of the inf-predecessor forest; heaviness = subtree label length. Keeps the inf_preds backward walk cache-local.
hldn Same as hld but heaviness = subtree node count. A/B-testable alternative.
bfs BFS path-compression order. Fast to build; good forward-traversal locality.
dfs DFS pre-order. Every node's forward-reachable subtree is a contiguous ID range.

Input format (.wg)

n m sigma p            # states, edges, alphabet size, source count
<char>                 # sigma lines, one label per line  (e.g. A C G T)
<in-degree>...         # n space-separated in-degrees
<out-degree>...        # n space-separated out-degrees
<edge-labels>          # m edge labels in Wheeler order (no separators)
<path-string>          # full concatenated path string

Sample graphs are in graphs/. DNA.wg and out.wg are small test graphs. chr-22-4.txt is a large genomic dataset (pair it with the query set q-1500-5.txt).

Query algorithm

Queries follow a three-step pipeline defined in WDFAIndex::count / locate:

  1. Prefix binary search_longest_prefix_match binary-searches for the longest prefix of the pattern present in the Wheeler graph. A pre-computed tabulation table (default: all 16-mers) seeds the co-lex interval; exact k-length matches skip the oracle entirely.

  2. Oracle extensioncompute_T returns the co-lex interval [inf, sup] for the matched prefix. compute_candidate finds the single candidate state for prefix + next_char. LabelsMap::has_label confirms the transition exists.

  3. Graph traversalGFA::match_pattern walks the path-compressed graph from the candidate state to verify the remaining suffix. The -d mode counts GFA nodes visited in this step.

Query API

Method Returns Notes
count(pattern) size_t (0 or 1) Existence check; Step 3 walks a single candidate only
locate(pattern) optional<pair<GFA_position, GFA_position>> Co-lex range as GFA positions
locate_range(pattern) pair<uint64_t, uint64_t> Co-lex range as raw Wheeler IDs; used by -q mode

Note: count() returns 0 or 1, not true multiplicity — Step 3 only confirms existence via a single candidate walk.

Architecture

WDFA file → parse adjacency list → GFA (path-compressed)
          → GFAOracle (inf/sup predecessor arrays)
          → LabelsMap (Elias-Fano edge labels)
          → tabulated_DNA (k-mer table)
          → WDFAIndex (serialized binary)

Key components

Component Files Role
GFA src/GFA/GFA.cpp, include/GFA/GFA.hpp Path-compressed graph. Consecutive degree-1 nodes are merged. Labels stored in a pooled DnaPacked blob (CSR); hot fields in 16-byte NodeHot AoS records.
GFAOracle src/InfSupOracles/GFAOracle.cpp Flat uint32 inf_preds/sup_preds arrays for co-lex predecessor navigation.
WDFAIndex src/WDFAIndex.hpp Header-only template on OracleType; owns query algorithm and binary serialization.
tabulated_DNA src/InfSupOracles/tabulation.cpp Pre-computed co-lex intervals for all k-length DNA patterns (default k=16).
DnaPacked include/bit_packing/DnaPacked.hpp 2-bit DNA encoding (A=00 C=01 G=10 T=11); SIMD comparison via AVX2/NEON.
LabelsMap include/LabelsMap/LabelsMap.hpp Elias-Fano bitvector per alphabet symbol; O(1) has_label range queries.
elias_fano include/elias_fano/ Succinct bitvector with rank/select (external library).

Type system

Two distinct node ID spaces exist throughout the code:

  • StateID (uint64_t) — original co-lex Wheeler-order node ID; the canonical identity used by the algorithm.
  • GFAId / gfa_index_t (uint32_t) — index into the path-compressed GFA's internal arrays. One GFA node may represent many original Wheeler states.

Use GFA::get_state_position(StateID) to convert a StateID to a GFA_position{GFAId, offset}.

Debug macros

Defined directly in source (not via CMake). Uncomment to enable:

Macro Location Effect
DEBUG src/main.cpp, src/WDFAIndex.hpp Verbose step-by-step output during build and query
DEBUG_PRINT src/WDFAIndex.hpp Full memory breakdown from get_memory_usage()
ENABLE_PERF_TIMING src/WDFAIndex.hpp Per-step timing stats printed at end of -t mode
ENABLE_PREFIX_LOGGING src/WDFAIndex.hpp Writes longest-prefix length per query to <output>.prefix_len

Structural verification

include/testing.hpp provides verify_navigation(edges, gfa, oracle) — a friend-access sanity check that walks every original edge and confirms GFA adjacency and inf/sup bracket consistency. Useful during development after building the index.

Project layout

src/
  main.cpp                       CLI entry point and batch I/O
  WDFAIndex.hpp                  Header-only index (template on OracleType)
  GFA/GFA.cpp
  InfSupOracles/GFAOracle.cpp
  InfSupOracles/tabulation.cpp
include/
  GFA/GFA.hpp
  InfSupOracles/GFAOracle.hpp
  InfSupOracles/tabulation.hpp
  LabelsMap/LabelsMap.hpp
  bit_packing/DnaPacked.hpp
  utils/utils.hpp
  elias_fano/                    Succinct bitvector library (external)
  testing.hpp
graphs/                          Sample .wg files and query sets

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors