
Commit 798501e

Release v0.2.0: Zero-Copy Feature Store
1 parent a15bc06 commit 798501e

8 files changed

Lines changed: 464 additions & 43 deletions

File tree

.gitignore

Lines changed: 1 addition & 0 deletions
```diff
@@ -1,6 +1,7 @@
 .vscode/
 venv/
 build/
+dist/
 benchmark/dataset

 *.exe
```

CODE-DOCS.md

Lines changed: 198 additions & 2 deletions
@@ -73,18 +73,68 @@ Performs uniform neighbour sampling for a single node using **reservoir sampling**

* **Behavior:** Returns up to `K` neighbour IDs sampled uniformly at random. If the node degree is <= `K`, all neighbours are returned.
* **Return shape & dtype:** 1-D `ndarray` of length `<= K`, dtype `np.int64`.
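The behaviour described above can be sketched with textbook reservoir sampling in plain Python. This is an illustrative stand-in for the idea, not the library's C++ implementation:

```python
import random

def reservoir_sample(neighbors, k, rng=random):
    """Uniformly sample up to k items from a sequence in one pass."""
    reservoir = list(neighbors[:k])      # fill the reservoir with the first k items
    for i in range(k, len(neighbors)):
        j = rng.randint(0, i)            # inclusive bounds: 0..i
        if j < k:
            reservoir[j] = neighbors[i]  # item i replaces a slot with prob k/(i+1)
    return reservoir

# A node with degree <= k simply returns all of its neighbours
print(reservoir_sample([7, 8, 9], k=5))  # -> [7, 8, 9]
```

Each neighbour ends up in the reservoir with equal probability `k / degree`, which is why the sampler stays uniform while touching each edge only once.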

## 🗄️ Feature Engine: `FeatureStore` & `DataType`

The main entry point for zero-copy node feature matrices. It maps massive datasets directly into NumPy/PyTorch without consuming RAM.

```python
fs = gz.FeatureStore("path/to/features.gd")
```

### Properties

| Property | Type | Description |
| --- | --- | --- |
| `fs.num_nodes` | `int` | Total number of nodes (rows). |
| `fs.feature_dim` | `int` | Number of features per node (columns). |

### Methods

#### `get_data(node_id: int) -> numpy.ndarray`

Returns the features for a single node ID.

* **Return shape & dtype:** 1-D `ndarray` of shape `(feature_dim,)`. The underlying dtype matches the `DataType` used during conversion.

#### `get_tensor() -> numpy.ndarray`

Returns a zero-copy view of the *entire* feature matrix.

* **Behavior:** Hands Python a direct pointer to the memory-mapped file, so the call itself consumes **0 bytes of RAM**. Data is only paged into memory by the OS when PyTorch actively indexes a specific row during training.
* **Return shape & dtype:** 2-D `ndarray` of shape `(num_nodes, feature_dim)`.
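The zero-copy behaviour of `get_tensor()` is the same mechanism that `numpy.memmap` exposes. Here is a minimal sketch of the underlying idea using plain NumPy; for simplicity it assumes a headerless raw binary file, not the actual `.gd` layout:

```python
import os
import tempfile
import numpy as np

# Write a small binary feature matrix to disk (stand-in for a .gd file).
num_nodes, feature_dim = 4, 3
feats = np.arange(num_nodes * feature_dim, dtype=np.float32).reshape(num_nodes, feature_dim)
path = os.path.join(tempfile.mkdtemp(), "features.bin")
feats.tofile(path)

# Memory-map it back: no data is read from disk until a row is indexed.
view = np.memmap(path, dtype=np.float32, mode="r", shape=(num_nodes, feature_dim))
row = np.asarray(view[2])  # only the pages backing this row are faulted in
print(row)                 # -> [6. 7. 8.]
```

Indexing the mapped view triggers OS page faults for just the touched rows, which is why even a matrix far larger than RAM can be "loaded" instantly.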
## 🛠️ Utilities

#### `gz.convert_csv_to_gl(input_csv: str, output_bin: str, directed: bool)`

Converts a raw edge-list CSV into the optimized GraphLite binary format (`.gl`).

* **Input CSV Format:** Two or three columns: Source, Destination, and an optional Weight. Headers are ignored if present.
* **Process:**
  1. **Pass 1:** Scans the file to count degrees (memory footprint: low).
  2. **Allocation:** Creates the `.gl` file and `mmap`s it.
  3. **Pass 2:** Reads the CSV again and places edges into the correct memory buckets.
* **Note:** This process handles graphs larger than RAM.
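The three steps above amount to a classic two-pass CSR (compressed sparse row) build: count degrees, compute prefix sums, then place each edge into its node's bucket. A minimal in-memory sketch of the idea in plain Python (not the mmap-backed C++ converter):

```python
def build_csr(edges, num_nodes):
    """Two-pass CSR construction: count degrees, then place edges."""
    # Pass 1: count out-degree of each node
    offsets = [0] * (num_nodes + 1)
    for src, _ in edges:
        offsets[src + 1] += 1
    # Prefix sums turn counts into per-node bucket offsets
    for i in range(num_nodes):
        offsets[i + 1] += offsets[i]
    # Pass 2: place each destination into its source node's bucket
    targets = [0] * len(edges)
    cursor = offsets[:]  # per-node write position
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets

offsets, targets = build_csr([(0, 1), (0, 2), (2, 0)], num_nodes=3)
print(offsets)  # -> [0, 2, 2, 3]
print(targets)  # -> [1, 2, 0]
```

Because Pass 1 only keeps one counter per node, memory stays low even for edge lists far larger than RAM; the real converter writes `targets` directly into the `mmap`ed `.gl` file.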

#### `DataType` Enum

`DataType` is an enumeration that defines the supported data types for feature storage. It ensures that the binary files created by `convert_csv_to_gd` have a consistent and optimized memory layout.

Available data types:

- `gz.DataType.INT32`: 32-bit signed integer.
- `gz.DataType.INT64`: 64-bit signed integer.
- `gz.DataType.FLOAT32`: 32-bit floating-point number.
- `gz.DataType.FLOAT64`: 64-bit floating-point number.

#### `gz.convert_csv_to_gd(csv_path: str, out_path: str, dtype: gz.DataType)`

Converts a raw feature CSV into the optimized GraphZero Data format (`.gd`).

* **Input CSV Format:** The first column must be the `NodeID`, followed by its features separated by commas (e.g., `0, 0.5, 0.1, 0.9...`).
* **Arguments:**
  - `dtype` (`DataType`): Strictly enforces the memory layout of the resulting binary file (e.g., `gz.DataType.FLOAT32`).
* **Process:**
  1. **Pass 1 (Zero-Allocation):** Fast-scans the CSV using C++ `string_view` to find the maximum node ID and feature dimension without triggering heap allocations.
  2. **Allocation:** `mmap`s a perfectly sized, C-contiguous binary file.
  3. **Pass 2:** Parses and writes the features. Missing node IDs are handled automatically by leaving their rows zero-padded.
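The two-pass feature conversion can be sketched in Python with an in-memory matrix standing in for the `mmap`ed `.gd` file; `convert_rows` below is a hypothetical helper for illustration, not part of the gz API:

```python
import numpy as np

def convert_rows(rows, dtype=np.float32):
    """Two passes over 'NodeID,f1,f2,...' rows: size, then fill."""
    # Pass 1: find the maximum node ID and the feature dimension
    max_id, dim = -1, 0
    for row in rows:
        parts = row.split(",")
        max_id = max(max_id, int(parts[0]))
        dim = max(dim, len(parts) - 1)
    # Allocation: a zeroed, C-contiguous matrix (stand-in for the mmap'd file)
    out = np.zeros((max_id + 1, dim), dtype=dtype)
    # Pass 2: parse and write; absent node IDs remain zero-padded
    for row in rows:
        parts = row.split(",")
        out[int(parts[0])] = [float(v) for v in parts[1:]]
    return out

mat = convert_rows(["0,0.5,0.1", "2,1.0,2.0"])  # node 1 is missing
print(mat.shape)  # -> (3, 2)
print(mat[1])     # -> [0. 0.]
```

Sizing the output from Pass 1 is what lets the real converter allocate the binary file once, exactly, before any feature data is parsed.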
# 🧠 Example: Training Node2Vec with PyTorch
@@ -222,4 +272,150 @@ print("✅ Training Complete.")

This example showcases how `GraphZero` can be seamlessly integrated into a PyTorch training loop, enabling efficient data loading for massive graphs. The C++ engine handles the heavy lifting of random-walk generation, freeing Python to focus on model training.

Here is a screenshot of the output when running the script:

![Training Output](benchmark/images/examplecode.png)
# 🧠 Example 2: End-to-End GraphSAGE with Zero-Copy Features

This script is a complete, runnable example. It generates a synthetic graph dataset, compiles it into GraphZero's zero-copy formats (`.gl` and `.gd`), and trains a GraphSAGE model.

Notice how we use `gz.DataType.FLOAT32` for the node features and `gz.DataType.INT64` for the classification labels. Both are memory-mapped directly into PyTorch without consuming system RAM.

**File:** `train_graphsage.py`
```python
import os
import time
import torch
import torch.nn as nn
import torch.optim as optim
import graphzero as gz
import numpy as np
from torch.utils.data import DataLoader, Dataset

# --- 1. CONFIGURATION & DATA GENERATION ---
NUM_NODES = 50_000
NUM_EDGES = 200_000
FEATURE_DIM = 32
NUM_CLASSES = 10
FANOUT_K = 5
BATCH_SIZE = 1024

def generate_synthetic_data():
    """Generates synthetic CSVs if they don't exist yet."""
    if os.path.exists("dataset/edges.csv"):
        return
    os.makedirs("dataset", exist_ok=True)

    print("Generating synthetic dataset (CSVs)...")
    # Edges
    src = np.random.randint(0, NUM_NODES, NUM_EDGES)
    dst = np.random.randint(0, NUM_NODES, NUM_EDGES)
    with open("dataset/edges.csv", "w") as f:
        for s, d in zip(src, dst):
            f.write(f"{s},{d}\n")

    # Features (Float32)
    with open("dataset/features.csv", "w") as f:
        for i in range(NUM_NODES):
            feats = ",".join(f"{np.random.randn():.4f}" for _ in range(FEATURE_DIM))
            f.write(f"{i},{feats}\n")

    # Labels (Int64)
    with open("dataset/labels.csv", "w") as f:
        for i in range(NUM_NODES):
            f.write(f"{i},{np.random.randint(0, NUM_CLASSES)}\n")

generate_synthetic_data()

# --- 2. GRAPHZERO CONVERSION (CSV -> Binary) ---
print("\nConverting CSVs to GraphZero formats...")
if not os.path.exists("graph.gl"):
    gz.convert_csv_to_gl("dataset/edges.csv", "graph.gl", directed=True)
if not os.path.exists("features.gd"):
    gz.convert_csv_to_gd("dataset/features.csv", "features.gd", dtype=gz.DataType.FLOAT32)
if not os.path.exists("labels.gd"):
    gz.convert_csv_to_gd("dataset/labels.csv", "labels.gd", dtype=gz.DataType.INT64)

# --- 3. ZERO-COPY MOUNTING ---
print("\nMounting Zero-Copy Engines...")
g = gz.Graph("graph.gl")
fs_feats = gz.FeatureStore("features.gd")
fs_labels = gz.FeatureStore("labels.gd")

print(f"Graph Mounted. Nodes: {g.num_nodes:,} | Edges: {g.num_edges:,}")

# Instantly map SSD data into PyTorch (RAM used: 0 bytes)
X = torch.from_numpy(fs_feats.get_tensor())
Y = torch.from_numpy(fs_labels.get_tensor()).squeeze()  # Squeeze (N, 1) to (N,)

print(f"Feature Tensor: {X.shape} ({X.dtype})")
print(f"Label Tensor:   {Y.shape} ({Y.dtype})")

# --- 4. PYTORCH DATALOADER & COLLATOR ---
class TargetNodeDataset(Dataset):
    def __len__(self):
        return NUM_NODES

    def __getitem__(self, idx):
        return idx

def collate_neighborhoods(batch_nodes):
    targets = [int(n) for n in batch_nodes]
    # Fast C++ neighbour sampling (releases the GIL)
    neighbors = g.batch_random_fanout(targets, FANOUT_K)
    return torch.tensor(targets, dtype=torch.long), torch.tensor(neighbors, dtype=torch.long)

loader = DataLoader(
    TargetNodeDataset(), batch_size=BATCH_SIZE,
    collate_fn=collate_neighborhoods, shuffle=True
)

# --- 5. THE GRAPHSAGE MODEL ---
class GraphSAGE(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim * 2, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, out_dim)
        self.relu = nn.ReLU()

    def forward(self, target_nodes, neighbor_nodes):
        # OS page-fault magic: indexing the memory-mapped tensor pulls
        # only the required pages from disk into memory.
        target_feats = X[target_nodes]
        neighbor_feats = X[neighbor_nodes]

        # Mean-pool the neighbours' features
        agg_neighbor_feats = neighbor_feats.mean(dim=1)

        # Concat [Target || Aggregated] and pass through the NN
        combined = torch.cat([target_feats, agg_neighbor_feats], dim=1)
        return self.classifier(self.relu(self.fc(combined)))

# --- 6. TRAINING LOOP ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = GraphSAGE(FEATURE_DIM, 64, NUM_CLASSES).to(device)
# Note: .to(device) copies the mapped tensors into device memory;
# on a CPU run the zero-copy mapping is used directly.
X, Y = X.to(device), Y.to(device)
optimizer = optim.Adam(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

print("\n🚀 Starting GraphSAGE Training...")
t0 = time.time()

for epoch in range(3):
    total_loss = 0
    for targets, neighbors in loader:
        targets, neighbors = targets.to(device), neighbors.to(device)

        optimizer.zero_grad()
        logits = model(targets, neighbors)
        loss = criterion(logits, Y[targets])  # Fetch labels from the .gd mapping

        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    print(f"Epoch {epoch+1}/3 | Avg Loss: {total_loss/len(loader):.4f}")

print(f"✅ Training Complete in {time.time() - t0:.2f} seconds.")
```
This example demonstrates a complete end-to-end workflow using `GraphZero` for a GNN training task: the synthetic dataset is generated, converted to the optimized binary formats, and seamlessly integrated into a PyTorch training loop with zero-copy data access. The C++ engine handles all graph sampling efficiently, leaving the GPU free to train the model.

![GraphSAGE Training Output](benchmark/images/graphsage_output.png)