Commit f9d2a9d

GraphZero-v0.1.0 Release Commit

1 parent dd60b3a · commit f9d2a9d

21 files changed: 1,150 additions & 81 deletions

.github/workflows/release.yml

Lines changed: 59 additions & 0 deletions
```yaml
name: Build and Publish Wheels

on:
  release:
    types: [published]
  workflow_dispatch: # Allows manual triggering for testing

jobs:
  build_wheels:
    name: Build wheels on ${{ matrix.os }}
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]

    steps:
      - uses: actions/checkout@v4

      # Build the wheels
      - name: Build wheels
        uses: pypa/[email protected]
        env:
          # Skip old Python versions and PyPy to save time
          CIBW_SKIP: "cp36-* cp37-* pp*"
          # Force the C++20 standard for Linux builds (Nanobind needs it)
          CIBW_ENVIRONMENT_LINUX: "CXXFLAGS='-std=c++20'"

      - uses: actions/upload-artifact@v4
        with:
          name: cibw-wheels-${{ matrix.os }}-${{ strategy.job-index }}
          path: ./wheelhouse/*.whl

  build_sdist:
    name: Build source distribution
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build sdist
        run: pipx run build --sdist
      - uses: actions/upload-artifact@v4
        with:
          name: cibw-sdist
          path: dist/*.tar.gz

  publish_to_pypi:
    needs: [build_wheels, build_sdist]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          pattern: cibw-*
          path: dist
          merge-multiple: true

      - name: Publish to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          user: __token__
          password: ${{ secrets.PYPI_PASSWORD }}
```

.gitignore

Lines changed: 8 additions & 3 deletions
```diff
@@ -1,5 +1,10 @@
 .vscode/
-*.gl
-graph.gl
 venv/
-*.exe
+build/
+benchmark/dataset
+
+*.exe
+*.gl
+*.zip
+*.npz
+benchmark/*.csv
```

CODE-DOCS.md

Lines changed: 225 additions & 0 deletions
# GraphZero API Reference 📘

This document details the Python API exposed by the `graphzero` C++ engine.

## 📦 Core Class: `Graph`

The main entry point for interacting with the graph.

```python
import graphzero as gz

g = gz.Graph("path/to/graph.gl")
```

### Properties

| Property | Type | Description |
| --- | --- | --- |
| `g.num_nodes` | `int` | Total number of nodes in the graph. |
| `g.num_edges` | `int` | Total number of edges (directed). |

### Methods

#### `get_degree(node_id: int) -> int`

Returns the out-degree (number of neighbours) for a specific node.

* **Usage:** checking whether a node is a dead end before walking.

#### `get_neighbours(node_id: int) -> numpy.ndarray`

Returns a **1-D NumPy ndarray** of neighbour node IDs (dtype: `np.int64`). The C++ layer returns this as a fast zero-copy buffer, so it can be used directly with NumPy/PyTorch.

* **Notes:**
  - The binding uses the British spelling `get_neighbours` (this is the function name exposed in the Python API).
  - For very high-degree nodes, prefer `sample_neighbours` or `batch_random_fanout` to avoid copying large arrays.

---
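The zero-copy behaviour described above can be sketched in plain NumPy. This is a toy CSR adjacency with made-up data, not the engine's actual internals; the real lookup happens in C++ behind the same signature:

```python
import numpy as np

# Toy CSR adjacency: offsets[i]..offsets[i+1] delimit node i's slice of indices.
offsets = np.array([0, 2, 4, 5], dtype=np.int64)
indices = np.array([1, 2, 0, 2, 0], dtype=np.int64)

def get_neighbours(node_id: int) -> np.ndarray:
    # NumPy basic slicing returns a view into the shared buffer: no copy is made.
    return indices[offsets[node_id]:offsets[node_id + 1]]

print(get_neighbours(1))                             # neighbours of node 1
print(np.shares_memory(get_neighbours(1), indices))  # True: zero-copy view
```

Because the returned array is a view, it stays valid only as long as the underlying buffer (here `indices`; in the real engine, the mapped `.gl` file) is alive.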
### 🎲 Sampling Methods (The Engine)

These functions use OpenMP multithreading on the C++ side and release the GIL to fully saturate CPU/disk bandwidth. All batch functions return a **NumPy ndarray** of dtype `np.int64`.

#### `batch_random_walk_uniform(start_nodes: List[int], walk_length: int) -> numpy.ndarray`

**The Speed King.** Performs unbiased uniform random walks.

* **Return shape & dtype:** `ndarray` with shape `(len(start_nodes), walk_length)` and dtype `np.int64`.
* **Algorithm:** At every step, pick a neighbour uniformly at random.
* **Use Case:** DeepWalk, uniform-walk baselines, and fast data generation for training.

#### `batch_random_walk(start_nodes: List[int], walk_length: int, p: float = 1.0, q: float = 1.0) -> numpy.ndarray`

**The Biased Walker.** Performs Node2Vec-style second-order random walks.

* **Arguments:**
  - `p` (return parameter): low values keep the walk local (BFS-like).
  - `q` (in-out parameter): low values explore far away (DFS-like).
* **Return shape & dtype:** `ndarray` with shape `(len(start_nodes), walk_length)` and dtype `np.int64`.
* **Performance:** Slower than uniform walks due to the additional transition-probability calculations.

#### `batch_random_fanout(start_nodes: List[int], K: int) -> numpy.ndarray`

Performs uniform neighbour *fanout* sampling for a batch of start nodes (useful for GNN neighbour sampling).

* **Behavior:** For each start node, returns `K` sampled neighbour IDs (using reservoir sampling / uniform sampling without replacement where possible).
* **Return shape & dtype:** `ndarray` with shape `(len(start_nodes), K)`, dtype `np.int64`.

#### `sample_neighbours(start_node: int, K: int) -> numpy.ndarray`

Performs uniform neighbour sampling for a single node using **reservoir sampling**.

* **Behavior:** Returns up to `K` neighbour IDs sampled uniformly at random. If the node's degree is `<= K`, all neighbours are returned.
* **Return shape & dtype:** 1-D `ndarray` of length `<= K`, dtype `np.int64`.
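For reference, the reservoir sampling mentioned above works like this one-pass sketch (plain-Python Algorithm R; the production version runs in C++ over the neighbour slice):

```python
import random

def reservoir_sample(neighbours, k, rng=random):
    """Uniformly sample up to k items from a stream in a single pass (Algorithm R)."""
    sample = []
    for i, item in enumerate(neighbours):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randrange(i + 1)   # keep item with probability k / (i + 1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(5), 8))          # degree <= K: all neighbours returned
print(len(reservoir_sample(range(1000), 8)))  # otherwise: exactly K samples
```

The appeal for this engine is that the neighbour list is visited exactly once, which matches a sequential scan over the memory-mapped adjacency data.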
## 🛠️ Utilities

#### `gz.convert_csv_to_gl(input_csv: str, output_bin: str, directed: bool)`

Converts a raw edge-list CSV into the optimized GraphLite binary format (`.gl`).

* **Input CSV Format:** Two columns (source, destination). A header row, if present, is ignored.
* **Process:**
  1. **Pass 1:** Scans the file to count degrees (memory usage: low).
  2. **Allocation:** Creates the `.gl` file and `mmap`s it.
  3. **Pass 2:** Reads the CSV again and places edges into the correct memory buckets.
* **Note:** Because the output file is memory-mapped, this process handles graphs larger than RAM.
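The two-pass strategy can be sketched in NumPy. This in-memory version is an illustration of the count/allocate/place idea only; the real converter streams the CSV and writes into the mmap'd `.gl` file instead of Python lists and arrays:

```python
import numpy as np

def edges_to_csr(edges, num_nodes):
    """Two-pass CSR build mirroring the count -> allocate -> place strategy."""
    # Pass 1: count out-degrees (cheap; no edge storage needed yet).
    degrees = np.zeros(num_nodes, dtype=np.int64)
    for src, _ in edges:
        degrees[src] += 1

    # Allocation: prefix sums give each node its bucket in the edge array.
    offsets = np.zeros(num_nodes + 1, dtype=np.int64)
    np.cumsum(degrees, out=offsets[1:])
    indices = np.empty(offsets[-1], dtype=np.int64)

    # Pass 2: re-read the edges and drop each destination into its bucket.
    cursor = offsets[:-1].copy()  # next free slot per source node
    for src, dst in edges:
        indices[cursor[src]] = dst
        cursor[src] += 1
    return offsets, indices

offsets, indices = edges_to_csr([(0, 1), (1, 2), (0, 2), (2, 0)], 3)
print(offsets)  # [0 2 3 4]
print(indices)  # [1 2 2 0]
```

Pass 1 needs only one counter per node, which is why the converter's memory footprint stays low even for edge lists far larger than RAM.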
# 🧠 Example: Training Node2Vec with PyTorch

This script demonstrates how to use `GraphZero` to train a real Node2Vec model. Since `GraphZero` handles the **data loading** (the bottleneck), the GPU can focus entirely on **training** (the math).

**File:** `train_node2vec.py`

```python
import torch
import torch.nn as nn
import torch.optim as optim
import graphzero as gz
import numpy as np
from torch.utils.data import DataLoader, Dataset

# --- CONFIGURATION ---
GRAPH_PATH = "papers100M.gl"  # The beast
EMBEDDING_DIM = 128
WALK_LENGTH = 20
WALKS_PER_EPOCH = 100_000  # Number of walk starts per epoch
BATCH_SIZE = 1024
EPOCHS = 5

print(f"Initializing GraphZero Engine on {GRAPH_PATH}...")
g = gz.Graph(GRAPH_PATH)
print(f" Nodes: {g.num_nodes:,} | Edges: {g.num_edges:,}")

# --- 1. THE DATASET (Powered by GraphZero) ---
class GraphZeroWalkDataset(Dataset):
    """Generates random walks on the fly using the C++ engine."""

    def __init__(self, graph_engine, num_walks, walk_len):
        self.g = graph_engine
        self.num_walks = num_walks
        self.walk_len = walk_len

    def __len__(self):
        # In a real scenario this might be num_nodes;
        # for this demo we define an arbitrary epoch size.
        return self.num_walks

    def __getitem__(self, idx):
        # We don't generate single walks here (too slow). We let the
        # DataLoader batch the indices and call C++ in the collate_fn,
        # so each item is just a random start node.
        return np.random.randint(0, self.g.num_nodes)

# --- 2. CUSTOM COLLATE FUNCTION (The Secret Sauce) ---
def collate_walks(batch_start_nodes):
    """
    This is where the magic happens. Instead of looping in Python, we hand
    the whole batch of start nodes to C++ and get the walk matrix back at once.
    """
    # 1. Convert the batch to a list of plain Python ints for C++
    start_nodes = [int(x) for x in batch_start_nodes]

    # 2. Call the C++ engine (releases the GIL, runs OpenMP).
    #    Result is an int64 ndarray of shape (len(start_nodes), WALK_LENGTH).
    walks = g.batch_random_walk_uniform(start_nodes, WALK_LENGTH)

    # 3. Hand the walks to PyTorch as a (batch_size, walk_length) LongTensor
    walks_tensor = torch.as_tensor(walks, dtype=torch.long)
    walks_tensor = walks_tensor.view(len(start_nodes), WALK_LENGTH)

    return walks_tensor

# --- CONFIGURATION ADJUSTMENT ---
# We map 204M node IDs -> 1M unique embeddings to save RAM.
HASH_SIZE = 1_000_000
# RAM usage: 1M * 128 * 4 bytes = ~512 MB (very safe)

# --- 3. THE MODEL (Hashed Skip-Gram) ---
class Node2Vec(nn.Module):
    def __init__(self, num_nodes, embed_dim):
        super().__init__()
        # INSTEAD OF: self.in_embed = nn.Embedding(num_nodes, embed_dim)
        # WE USE fixed-size hashed embedding tables:
        self.in_embed = nn.Embedding(HASH_SIZE, embed_dim)
        self.out_embed = nn.Embedding(HASH_SIZE, embed_dim)

    def forward(self, target, context):
        # Hashing trick: map a massive node ID -> a small embedding index.
        # A real app would use a better hash; modulo is fine for a demo.
        t_hashed = target % HASH_SIZE
        c_hashed = context % HASH_SIZE

        v_in = self.in_embed(t_hashed)
        v_out = self.out_embed(c_hashed)

        return torch.sum(v_in * v_out, dim=1)

# --- 4. TRAINING LOOP ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Node2Vec(g.num_nodes, EMBEDDING_DIM).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.01)

# The PyTorch DataLoader wraps our C++ engine
loader = DataLoader(
    GraphZeroWalkDataset(g, WALKS_PER_EPOCH, WALK_LENGTH),
    batch_size=BATCH_SIZE,
    collate_fn=collate_walks,  # <--- Connects PyTorch to GraphZero
    num_workers=0  # Windows needs 0; Linux can use more
)

print("\nStarting Training...")

for epoch in range(EPOCHS):
    total_loss = 0

    for batch_walks in loader:
        # batch_walks shape: [1024, 20]
        batch_walks = batch_walks.to(device)

        # Simple positive-pair generation: (current, next).
        # Real implementations use sliding windows; simplified for brevity.
        target = batch_walks[:, :-1].flatten()
        context = batch_walks[:, 1:].flatten()

        optimizer.zero_grad()
        loss = -model(target, context).mean()  # Dummy loss for the demo
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    print(f"Epoch {epoch+1}/{EPOCHS} | Avg Loss: {total_loss/len(loader):.4f}")

print("✅ Training Complete.")
```

This example shows how `GraphZero` integrates into a PyTorch training loop: the C++ engine handles the heavy lifting of random-walk generation, freeing Python to focus on model training.

Here is a screenshot of the output from running the script:

![Training Output](benchmark/images/examplecode.png)
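The `(target, context)` slicing used in the training loop can be sanity-checked in isolation. This sketch substitutes NumPy for torch purely to exercise the shape logic, with made-up node IDs:

```python
import numpy as np

# Pretend batch: 3 walks of length 4, node IDs 0..11.
walks = np.arange(12).reshape(3, 4)

target = walks[:, :-1].ravel()   # every step except the last
context = walks[:, 1:].ravel()   # the step that follows it

# Each walk of length L yields L - 1 (current, next) positive pairs.
pairs = list(zip(target.tolist(), context.tolist()))
print(pairs[:3])  # pairs from the first walk: [(0, 1), (1, 2), (2, 3)]
```

A sliding-window variant would pair each step with every step within some window, producing more pairs per walk at the cost of extra indexing.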

LICENSE

Lines changed: 21 additions & 0 deletions
MIT License

Copyright (c) 2025 Krish

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
