# GraphZero API Reference 📘

This document details the Python API exposed by the `graphzero` C++ engine.

## 📦 Core Class: `Graph`

The main entry point for interacting with the graph.

```python
import graphzero as gz

g = gz.Graph("path/to/graph.gl")
```
### Properties

| Property | Type | Description |
| --- | --- | --- |
| `g.num_nodes` | `int` | Total number of nodes in the graph. |
| `g.num_edges` | `int` | Total number of edges (directed). |

### Methods

#### `get_degree(node_id: int) -> int`

Returns the out-degree (number of neighbours) of a specific node.

* **Usage:** checking whether a node is a dead end before walking.
#### `get_neighbours(node_id: int) -> numpy.ndarray`

Returns a **1-D NumPy ndarray** of neighbour node IDs (dtype `np.int64`). The array is returned from the C++ layer as a zero-copy buffer and can be used directly with NumPy or PyTorch.

* **Notes:**
  - The binding uses the British spelling `get_neighbours`; this is the name exposed in the Python API.
  - For very high-degree nodes, prefer `sample_neighbours` or `batch_random_fanout` to avoid materializing large arrays.

---

### 🎲 Sampling Methods (The Engine)

These functions use OpenMP multithreading on the C++ side and release the GIL, so they can fully saturate CPU and disk bandwidth. All batch functions return a **NumPy ndarray** of dtype `np.int64`.
#### `batch_random_walk_uniform(start_nodes: List[int], walk_length: int) -> numpy.ndarray`

**The Speed King.** Performs unbiased uniform random walks.

* **Return shape & dtype:** `ndarray` of shape `(len(start_nodes), walk_length)`, dtype `np.int64`.
* **Algorithm:** at every step, pick a neighbour uniformly at random.
* **Use case:** DeepWalk, uniform-walk baselines, and fast data generation for training.
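The per-step semantics can be sketched in pure Python. This is a reference of the behaviour, not the C++ engine; the toy adjacency map and the function name are hypothetical, and we assume the walk matrix stores the start node in column 0:

```python
import numpy as np

# Hypothetical toy adjacency: node -> array of neighbour IDs.
adj = {
    0: np.array([1, 2], dtype=np.int64),
    1: np.array([0, 2], dtype=np.int64),
    2: np.array([0, 1], dtype=np.int64),
}

def batch_random_walk_uniform_ref(start_nodes, walk_length, rng=None):
    """Pure-Python reference: at each step pick a neighbour uniformly."""
    rng = rng or np.random.default_rng(0)
    walks = np.empty((len(start_nodes), walk_length), dtype=np.int64)
    for i, node in enumerate(start_nodes):
        walks[i, 0] = node
        for step in range(1, walk_length):
            node = rng.choice(adj[int(node)])  # uniform neighbour pick
            walks[i, step] = node
    return walks

walks = batch_random_walk_uniform_ref([0, 1, 2], walk_length=5)
print(walks.shape)  # (3, 5)
```

The C++ version does the same thing per row, but across OpenMP threads with the GIL released.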
#### `batch_random_walk(start_nodes: List[int], walk_length: int, p: float = 1.0, q: float = 1.0) -> numpy.ndarray`

**The Biased Walker.** Performs Node2Vec-style second-order random walks.

* **Arguments:**
  - `p` (return parameter): low values make backtracking to the previous node likely, keeping the walk local (BFS-like).
  - `q` (in-out parameter): low values push the walk outward to distant nodes (DFS-like).
* **Return shape & dtype:** `ndarray` of shape `(len(start_nodes), walk_length)`, dtype `np.int64`.
* **Performance:** slower than uniform walks because each step computes second-order transition probabilities.
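For intuition, here is a minimal sketch of the second-order weighting scheme (the standard Node2Vec rule: weight `1/p` for backtracking, `1` for neighbours of the previous node, `1/q` for moving outward). The graph and function name are illustrative, not the engine's internals:

```python
import numpy as np

def node2vec_weights(prev, curr_neighbours, prev_neighbours, p, q):
    """Normalized 2nd-order transition probabilities out of the current
    node, given the previous node `prev` of the walk."""
    prev_set = set(prev_neighbours)
    w = np.empty(len(curr_neighbours))
    for i, x in enumerate(curr_neighbours):
        if x == prev:           # backtrack to where we came from: 1/p
            w[i] = 1.0 / p
        elif x in prev_set:     # stays at distance 1 from prev: 1
            w[i] = 1.0
        else:                   # moves outward to distance 2: 1/q
            w[i] = 1.0 / q
    return w / w.sum()

# prev = 0, current node's neighbours = [0, 2, 3], neighbours of 0 = [1, 2]
probs = node2vec_weights(0, [0, 2, 3], [1, 2], p=4.0, q=0.25)
print(probs)  # high p and low q both push the walk outward, toward node 3
```

The extra bookkeeping (each step needs the previous node's neighbour set) is exactly why this call is slower than `batch_random_walk_uniform`.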
#### `batch_random_fanout(start_nodes: List[int], K: int) -> numpy.ndarray`

Performs uniform neighbour *fanout* sampling for a batch of start nodes (useful for GNN neighbour sampling).

* **Behavior:** for each start node, returns `K` sampled neighbour IDs (using reservoir sampling / uniform sampling without replacement where possible).
* **Return shape & dtype:** `ndarray` of shape `(len(start_nodes), K)`, dtype `np.int64`.
#### `sample_neighbours(start_node: int, K: int) -> numpy.ndarray`

Performs uniform neighbour sampling for a single node using **reservoir sampling**.

* **Behavior:** returns up to `K` neighbour IDs sampled uniformly at random. If the node's degree is `<= K`, all neighbours are returned.
* **Return shape & dtype:** 1-D `ndarray` of length `<= K`, dtype `np.int64`.
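Reservoir sampling (Algorithm R) fits in a few lines. This is a pure-Python illustration of the idea the docs reference, not the engine's code:

```python
import random

def reservoir_sample(items, K, rng=None):
    """Uniform sample of up to K items in a single streaming pass."""
    rng = rng or random.Random(0)
    sample = []
    for i, x in enumerate(items):
        if i < K:
            sample.append(x)        # fill the reservoir first
        else:
            j = rng.randrange(i + 1)
            if j < K:
                sample[j] = x       # replace with probability K/(i+1)
    return sample

print(len(reservoir_sample(range(1000), 5)))  # 5
```

The one-pass property is what makes this cheap for high-degree nodes: the neighbour list is scanned once and never fully copied.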
## 🛠️ Utilities

#### `gz.convert_csv_to_gl(input_csv: str, output_bin: str, directed: bool)`

Converts a raw edge-list CSV into the optimized GraphLite binary format (`.gl`).

* **Input CSV format:** two columns (source, destination). A header row, if present, is ignored.
* **Process:**
  1. **Pass 1:** scan the file to count node degrees (low memory footprint).
  2. **Allocation:** create the `.gl` file and `mmap` it.
  3. **Pass 2:** re-read the CSV and place each edge into its source node's bucket.
* **Note:** because the output is memory-mapped, this process handles graphs larger than RAM.
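The two-pass bucket layout above is essentially a CSR (compressed sparse row) build. Here is an in-memory sketch of that logic; the real converter streams the CSV and writes through `mmap`, and the function name is illustrative:

```python
import numpy as np

def build_csr(edges, num_nodes):
    """Two-pass build mirroring the converter: count degrees,
    prefix-sum into offsets, then place each destination into
    its source node's bucket."""
    degrees = np.zeros(num_nodes, dtype=np.int64)
    for src, _ in edges:                  # Pass 1: degree count
        degrees[src] += 1
    offsets = np.zeros(num_nodes + 1, dtype=np.int64)
    np.cumsum(degrees, out=offsets[1:])   # bucket boundaries
    cursor = offsets[:-1].copy()          # next free slot per node
    dests = np.empty(len(edges), dtype=np.int64)
    for src, dst in edges:                # Pass 2: fill buckets
        dests[cursor[src]] = dst
        cursor[src] += 1
    return offsets, dests

offsets, dests = build_csr([(0, 1), (0, 2), (1, 2)], num_nodes=3)
print(dests[offsets[0]:offsets[1]])  # neighbours of node 0: [1 2]
```

With this layout, `get_neighbours(n)` is just the slice `dests[offsets[n]:offsets[n+1]]`, which is why it can be served zero-copy.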

# 🧠 Example: Training Node2Vec with PyTorch

This script demonstrates how to use `GraphZero` to train a real Node2Vec model.
Since `GraphZero` handles the **data loading** (the bottleneck), the GPU can focus entirely on **training** (the math).

**File:** `train_node2vec.py`

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import graphzero as gz
import numpy as np
from torch.utils.data import DataLoader, Dataset

# --- CONFIGURATION ---
GRAPH_PATH = "papers100M.gl"  # The beast
EMBEDDING_DIM = 128
WALK_LENGTH = 20
WALKS_PER_EPOCH = 100_000  # Number of walk starts per epoch
BATCH_SIZE = 1024
EPOCHS = 5

print(f"Initializing GraphZero Engine on {GRAPH_PATH}...")
g = gz.Graph(GRAPH_PATH)
print(f"   Nodes: {g.num_nodes:,} | Edges: {g.num_edges:,}")

# --- 1. THE DATASET (Powered by GraphZero) ---
class GraphZeroWalkDataset(Dataset):
    """
    Generates random walks on-the-fly using the C++ engine.
    """
    def __init__(self, graph_engine, num_walks, walk_len):
        self.g = graph_engine
        self.num_walks = num_walks
        self.walk_len = walk_len

    def __len__(self):
        # In a real scenario this might be num_nodes;
        # for this demo we define an arbitrary epoch size.
        return self.num_walks

    def __getitem__(self, idx):
        # We don't generate single walks here (too slow).
        # The DataLoader batches indices and the collate_fn calls C++,
        # so we just return a random start node.
        return np.random.randint(0, self.g.num_nodes)

# --- 2. CUSTOM COLLATE FUNCTION (The Secret Sauce) ---
def collate_walks(batch_start_nodes):
    """
    This is where the magic happens.
    Instead of looping in Python, we hand the whole batch of start nodes
    to C++ and get back the full walk matrix in one call.
    """
    # 1. Convert the batch to a plain list of ints for C++
    start_nodes = [int(x) for x in batch_start_nodes]

    # 2. Call the C++ engine (releases the GIL, runs OpenMP).
    #    Returns an ndarray of shape (len(start_nodes), WALK_LENGTH), dtype int64.
    walks = g.batch_random_walk_uniform(start_nodes, WALK_LENGTH)

    # 3. Wrap for PyTorch without copying
    return torch.from_numpy(walks)

# --- CONFIGURATION ADJUSTMENT ---
# We hash the graph's huge node-ID space down to 1M embedding rows to save RAM.
HASH_SIZE = 1_000_000
# RAM usage: 1M * 128 * 4 bytes ≈ 512 MB per table (~1 GB for in + out embeddings)

# --- 3. THE MODEL (Hashed Skip-Gram) ---
class Node2Vec(nn.Module):
    def __init__(self, num_nodes, embed_dim):
        super().__init__()
        # INSTEAD OF: self.in_embed = nn.Embedding(num_nodes, embed_dim)
        # WE USE:
        self.in_embed = nn.Embedding(HASH_SIZE, embed_dim)
        self.out_embed = nn.Embedding(HASH_SIZE, embed_dim)

    def forward(self, target, context):
        # Hashing trick: map a massive ID -> a small ID.
        # A real app would use a better hash; modulo is fine for a demo.
        t_hashed = target % HASH_SIZE
        c_hashed = context % HASH_SIZE

        v_in = self.in_embed(t_hashed)
        v_out = self.out_embed(c_hashed)

        return torch.sum(v_in * v_out, dim=1)

# --- 4. TRAINING LOOP ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Node2Vec(g.num_nodes, EMBEDDING_DIM).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.01)

# The PyTorch DataLoader wraps our C++ engine
loader = DataLoader(
    GraphZeroWalkDataset(g, WALKS_PER_EPOCH, WALK_LENGTH),
    batch_size=BATCH_SIZE,
    collate_fn=collate_walks,  # <--- Connects PyTorch to GraphZero
    num_workers=0  # Windows needs 0; Linux can use more
)

print("\nStarting Training...")

for epoch in range(EPOCHS):
    total_loss = 0

    for batch_walks in loader:
        # batch_walks shape: [1024, 20]
        batch_walks = batch_walks.to(device)

        # Simple positive-pair generation: (current, next).
        # Real implementations use sliding windows and negative sampling;
        # this is simplified for brevity.
        target = batch_walks[:, :-1].flatten()
        context = batch_walks[:, 1:].flatten()

        optimizer.zero_grad()
        # Positive-pair skip-gram loss (no negative sampling, demo only)
        loss = -F.logsigmoid(model(target, context)).mean()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    print(f"Epoch {epoch+1}/{EPOCHS} | Avg Loss: {total_loss/len(loader):.4f}")

print("✅ Training Complete.")
```

This example shows how `GraphZero` integrates into a standard PyTorch training loop: the C++ engine handles the heavy lifting of random-walk generation, freeing Python to focus on model training.

Below is a screenshot of the output from running the script:

![output](images/train_output.png)