Lazy Column Reading with Early Predicate Evaluation

**Part of**: #29 (SCAN_REL_TABLE Performance Optimization)

## Priority: 🎯 HIGH IMPACT

## Problem Statement

Currently, `SCAN_REL_TABLE` reads ALL property columns for ALL relationships in a CSR list, even when most will be filtered out. For workloads with:
- Low cardinality: < 100 rels/node
- Very selective filters: < 1% pass rate

This results in reading ~100 relationships but only 1 passes the filter = **99% wasted I/O**.

## Solution

Implement lazy column reading with early predicate evaluation:
1. Read ONLY predicate columns first
2. Evaluate predicates immediately
3. Read remaining columns ONLY for rows that pass predicates

**Expected Impact**: 90-95% reduction in I/O for selective queries

---

## Implementation Details

### 1.1 Modify CSRNodeGroup Scan Logic

**File**: `src/storage/table/csr_node_group.cpp`

**Current Flow** (lines 138-170):
```cpp
// Current: Read ALL columns for ALL relationships
1. Read CSR header (offset + length)
2. Read ALL property columns for entire CSR list
3. Return to ScanRelTable
4. Filter happens in separate Filter operator
```

**New Flow**:
```cpp
1. Read CSR header (offset + length)
2. Identify predicate columns from columnPredicateSets
3. FOR each relationship in CSR list:
   a. Read ONLY predicate columns
   b. Evaluate predicates immediately
   c. If predicates PASS:
      - Read remaining non-predicate columns
      - Add to selection vector
   d. If predicates FAIL:
      - Skip reading remaining columns
      - Mark as filtered in selection vector
4. Return only rows in selection vector
```

### 1.2 Add Method to Separate Predicate vs Non-Predicate Columns

```cpp
// New method in CSRNodeGroup
struct ColumnReadPlan {
    std::vector<column_id_t> predicateColumns;     // Read first
    std::vector<column_id_t> nonPredicateColumns;  // Read only if predicate passes
    std::vector<size_t> predicateIndices;          // Index mapping
};

ColumnReadPlan planColumnReads(
    const std::vector<column_id_t>& allColumns,
    const std::vector<ColumnPredicateSet>& predicateSets) {
    // Logic to identify which columns have predicates
}
```

### 1.3 Modify Scan Loop to Read Columns in Phases

```cpp
// In CSRNodeGroup::scan() around line 150
void CSRNodeGroup::scan(TableScanState& scanState) {
    auto plan = planColumnReads(scanState.columnIDs, scanState.columnPredicateSets);
    
    for (size_t relIdx = 0; relIdx < csrLength; relIdx++) {
        // Phase 1: Read predicate columns
        for (auto colID : plan.predicateColumns) {
            readColumn(colID, relIdx, tempVectors[colID]);
        }
        
        // Phase 2: Evaluate predicates
        bool passesFilter = evaluatePredicates(
            plan.predicateColumns, 
            scanState.columnPredicateSets,
            tempVectors);
        
        if (!passesFilter) {
            continue; // Skip this relationship
        }
        
        // Phase 3: Read remaining columns (only for passing rows)
        for (auto colID : plan.nonPredicateColumns) {
            readColumn(colID, relIdx, outputVectors[colID]);
        }
        
        selVector.append(relIdx);
    }
}
```

### 1.4 Add Inline Predicate Evaluation

```cpp
// New method in CSRNodeGroup or TableScanState
bool evaluatePredicates(
    const std::vector<column_id_t>& predicateColumns,
    const std::vector<ColumnPredicateSet>& predicateSets,
    const std::vector<ValueVector*>& vectors) {
    
    for (size_t i = 0; i < predicateColumns.size(); i++) {
        auto& predicateSet = predicateSets[predicateColumns[i]];
        if (predicateSet.empty()) continue;
        
        for (auto& predicate : predicateSet.predicates) {
            if (!predicate->evaluate(vectors[i], currentRow)) {
                return false; // Early exit on first failed predicate
            }
        }
    }
    return true;
}
```

### 1.5 Update ColumnPredicate Interface

**File**: `src/include/storage/predicate/column_predicate.h`

**Add evaluate method**:
```cpp
class ColumnPredicate {
public:
    // Existing: checkZoneMap for chunk-level filtering
    
    // NEW: Row-level evaluation during scan
    virtual bool evaluate(ValueVector* vector, sel_t pos) const = 0;
};
```

**Implement in ConstantPredicate**:

**File**: `src/storage/predicate/constant_predicate.cpp`
```cpp
bool ColumnConstantPredicate::evaluate(ValueVector* vector, sel_t pos) const {
    auto value = vector->getValue(pos);
    switch (comparisonOperator) {
        case EQUALS: return value == val;
        case GREATER_THAN: return value > val;
        case LESS_THAN: return value < val;
        // ... etc
    }
}
```

---

## Testing Strategy

### Unit Tests

**File**: `test/storage/table/test_csr_scan_lazy_columns.cpp` (NEW)

```cpp
TEST(CSRScanTest, LazyColumnReading) {
    // Setup: Create rel table with 3 properties: type, weight, timestamp
    // Insert 100 relationships, only 1 has type='SPECIAL'
    
    // Test 1: Scan with predicate on 'type' column
    auto predicate = makeEqualityPredicate("type", "SPECIAL");
    auto result = scanWithPredicate({predicate});
    
    // Verify:
    EXPECT_EQ(result.size(), 1);  // Only 1 row returned
    
    // Check internal metrics (NEW metrics to add):
    EXPECT_EQ(metrics.columnsRead["type"], 100);       // Read all for filtering
    EXPECT_EQ(metrics.columnsRead["weight"], 1);       // Only read for 1 passing row
    EXPECT_EQ(metrics.columnsRead["timestamp"], 1);    // Only read for 1 passing row
}

TEST(CSRScanTest, MultiplePredicates) {
    // Test with predicates on multiple columns
    // Verify short-circuit: if first predicate fails, don't evaluate second
}

TEST(CSRScanTest, NoPredicates) {
    // Verify backward compatibility: with no predicates, read all columns normally
    auto result = scanWithoutPredicate();
    EXPECT_EQ(result.size(), 100);
}
```

### Integration Tests

**File**: `test/processor/test_scan_rel_table_predicates.cpp` (NEW)

```cpp
TEST(ScanRelTableTest, EndToEndPredicatePushdown) {
    // Setup graph with nodes and relationships
    // Execute query: MATCH (a)-[r:KNOWS WHERE r.since > 2020]->(b)
    
    auto result = executeQuery(query);
    
    // Verify correct results
    EXPECT_EQ(result.size(), expectedCount);
    
    // Verify performance: check that we didn't read all columns
    auto metrics = getQueryMetrics();
    EXPECT_LT(metrics.bytesRead, fullScanBytes * 0.2);  // < 20% of full scan
}
```

### Correctness Tests

```cpp
TEST(CorrectnessTest, VerifyResultsMatchFullScan) {
    // Run same query with and without optimization
    auto optimizedResult = scanWithLazyColumns();
    auto fullScanResult = scanWithoutOptimization();
    
    // Results must be identical
    EXPECT_EQ(optimizedResult, fullScanResult);
}
```

---

## Benchmarking Strategy

### Micro Benchmarks

**File**: `benchmark/storage/scan_rel_table_benchmark.cpp` (NEW)

```cpp
class ScanRelTableBenchmark : public benchmark::Fixture {
    void SetUp() {
        // Create graph with varying selectivity
        createGraph(numNodes, relsPerNode, selectivity);
    }
};

BENCHMARK_F(ScanRelTableBenchmark, VaryingSelectivity)(benchmark::State& state) {
    double selectivity = state.range(0) / 100.0;  // 0.1%, 1%, 10%, 50%, 100%
    
    for (auto _ : state) {
        auto result = scanWithPredicate(selectivity);
    }
    
    state.SetLabel(fmt::format("selectivity={}%", selectivity));
    state.counters["rows_scanned"] = numRelationships;
    state.counters["rows_returned"] = result.size();
    state.counters["bytes_read"] = getBytesRead();
}

BENCHMARK_REGISTER_F(ScanRelTableBenchmark, VaryingSelectivity)
    ->Args({0.1})   // 0.1% selectivity
    ->Args({1})     // 1% selectivity  
    ->Args({10})    // 10% selectivity
    ->Args({50})    // 50% selectivity
    ->Args({100});  // 100% selectivity (no filtering)
```

### End-to-End Query Benchmarks

**File**: `benchmark/query/graph_pattern_benchmark.cpp`

```cypher
// Benchmark queries:
Q1: MATCH (a:Person)-[r:KNOWS WHERE r.since > 2020]->(b) RETURN count(*)
Q2: MATCH (a)-[r:TRANSACTION WHERE r.amount > 10000]->(b) RETURN a, b, r
Q3: MATCH (a)-[r WHERE r.weight < 0.1]->(b)-[r2]->(c) RETURN *
```

**Metrics to Collect**:
```cpp
struct BenchmarkMetrics {
    double executionTimeMs;
    uint64_t bytesReadFromStorage;
    uint64_t rowsScanned;
    uint64_t rowsFiltered;
    uint64_t columnsRead;
    double filteringRatio;  // rows_filtered / rows_scanned
    double ioReduction;     // compared to baseline
};
```

### Baseline Comparison

```bash
# Establish baseline BEFORE optimization
./benchmark --benchmark_filter="ScanRelTable.*" --benchmark_out=baseline.json

# After implementing Phase 1
./benchmark --benchmark_filter="ScanRelTable.*" --benchmark_out=phase1.json

# Compare
python scripts/compare_benchmarks.py baseline.json phase1.json
```

**Expected Results**:
- **0.1% selectivity**: 90-95% reduction in bytes read
- **1% selectivity**: 85-90% reduction in bytes read
- **10% selectivity**: 50-70% reduction in bytes read
- **50% selectivity**: 20-30% reduction in bytes read
- **100% selectivity**: ~0% change (no filter overhead)

---

## Files to Modify

- [ ] `src/storage/table/csr_node_group.cpp` - Main scan logic implementation
- [ ] `src/include/storage/table/csr_node_group.h` - Add ColumnReadPlan struct
- [ ] `src/include/storage/predicate/column_predicate.h` - Add evaluate() method
- [ ] `src/storage/predicate/constant_predicate.cpp` - Implement evaluate()
- [ ] `src/storage/predicate/null_predicate.cpp` - Implement evaluate()
- [ ] `test/storage/table/test_csr_scan_lazy_columns.cpp` - Unit tests (NEW)
- [ ] `test/processor/test_scan_rel_table_predicates.cpp` - Integration tests (NEW)
- [ ] `benchmark/storage/scan_rel_table_benchmark.cpp` - Benchmarks (NEW)

---

## Success Criteria

- [ ] All unit tests pass
- [ ] 80%+ I/O reduction for <1% selectivity workloads
- [ ] No correctness regressions (differential testing passes)
- [ ] <5% overhead for 100% selectivity (no filter penalty)
- [ ] Backward compatible: works with existing code when no predicates present

---

## Risk Mitigation

### Feature Flag
Add configuration option to enable/disable:
```cpp
if (context->getClientConfig()->enableLazyColumnReading) {
    // Use optimized path
} else {
    // Use original path
}
```

### Performance Risk
If selectivity is high (>50%), optimization may have overhead. Consider dynamic switching based on selectivity estimates.

---

## Related Issues

- Parent: #29
- Works with: Phase 2 (#TBD) - predicate pushdown
- Enables: Phase 3 (#TBD) - short-circuit traversal
- Synergizes with: Phase 4 (#TBD) - zone maps

---

## Estimated Effort

**Complexity**: Medium  
**Estimated Time**: 1-2 weeks  
**Dependencies**: None (can start immediately)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lazy Column Reading with Early Predicate Evaluation #30

Priority: 🎯 HIGH IMPACT

Problem Statement

Solution

Implementation Details

1.1 Modify CSRNodeGroup Scan Logic

1.2 Add Method to Separate Predicate vs Non-Predicate Columns

1.3 Modify Scan Loop to Read Columns in Phases

1.4 Add Inline Predicate Evaluation

1.5 Update ColumnPredicate Interface

Testing Strategy

Unit Tests

Integration Tests

Correctness Tests

Benchmarking Strategy

Micro Benchmarks

End-to-End Query Benchmarks

Baseline Comparison

Files to Modify

Success Criteria

Risk Mitigation

Feature Flag

Performance Risk

Related Issues

Estimated Effort

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Lazy Column Reading with Early Predicate Evaluation #30

Description

Priority: 🎯 HIGH IMPACT

Problem Statement

Solution

Implementation Details

1.1 Modify CSRNodeGroup Scan Logic

1.2 Add Method to Separate Predicate vs Non-Predicate Columns

1.3 Modify Scan Loop to Read Columns in Phases

1.4 Add Inline Predicate Evaluation

1.5 Update ColumnPredicate Interface

Testing Strategy

Unit Tests

Integration Tests

Correctness Tests

Benchmarking Strategy

Micro Benchmarks

End-to-End Query Benchmarks

Baseline Comparison

Files to Modify

Success Criteria

Risk Mitigation

Feature Flag

Performance Risk

Related Issues

Estimated Effort

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions