The main entry point for zero-copy node feature matrices. It maps massive datasets directly into NumPy/PyTorch without loading them into RAM up front.

```python
fs = gz.FeatureStore("path/to/features.gd")
```

### Properties

| Property | Type | Description |
| --- | --- | --- |
| `fs.num_nodes` | `int` | Total number of nodes (rows). |
| `fs.feature_dim` | `int` | Number of features per node (columns). |

### Methods
#### `get_data(node_id: int) -> numpy.ndarray`

Returns the features for a single node ID.

* **Return shape & dtype:** 1-D `ndarray` of shape `(feature_dim,)`. The dtype matches the `DataType` used during conversion.
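
To make the single-row lookup concrete, here is a minimal sketch of the offset arithmetic behind reading one node's row from a row-major matrix. It is not GraphZero's implementation: `read_row` is a hypothetical helper, and the buffer is assumed to be headerless little-endian float32 data.

```python
import struct

def read_row(buf, node_id, feature_dim, fmt="f"):
    """Return one node's features from a headerless, row-major buffer.

    Hypothetical helper for illustration; the real .gd layout may differ.
    """
    itemsize = struct.calcsize("<" + fmt)
    # Row i starts at i * feature_dim * itemsize in a C-contiguous matrix.
    offset = node_id * feature_dim * itemsize
    return struct.unpack_from(f"<{feature_dim}{fmt}", buf, offset)

# Three nodes with two features each, stored row after row.
buf = struct.pack("<6f", 0.0, 1.0, 2.0, 3.0, 4.0, 5.0)
row_1 = read_row(buf, node_id=1, feature_dim=2)
```

Because every row has the same width, the lookup is a single multiplication followed by a fixed-size read, independent of the number of nodes.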
#### `get_tensor() -> numpy.ndarray`

Returns a zero-copy view of the *entire* feature matrix.

* **Behavior:** Hands Python a direct pointer to the memory-mapped file. The call itself copies **no feature data into RAM**; the OS pages data into memory only when PyTorch indexes a specific row during training.
* **Return shape & dtype:** 2-D `ndarray` of shape `(num_nodes, feature_dim)`.
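
The lazy-paging behavior described above can be illustrated with Python's standard `mmap` module. This is a sketch of the general mechanism only, not GraphZero's code: the file name, the tiny sizes, and the `row` helper are assumptions made for the example.

```python
import mmap
import os
import struct
import tempfile

NUM_NODES, FEATURE_DIM = 4, 3  # tiny illustrative sizes

# Write a small row-major float32 matrix to disk (stand-in for a feature file).
values = [float(i) for i in range(NUM_NODES * FEATURE_DIM)]
path = os.path.join(tempfile.mkdtemp(), "features.bin")
with open(path, "wb") as f:
    f.write(struct.pack(f"{len(values)}f", *values))

# Map the file; the mapping stays valid after the file object is closed.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# Casting the mapping to a float view copies nothing.
matrix = memoryview(mm).cast("f")

def row(node_id):
    """Zero-copy slice covering one node's features (hypothetical helper)."""
    start = node_id * FEATURE_DIM
    return matrix[start:start + FEATURE_DIM]

# Only here are the bytes for node 2 actually read and converted.
row_2 = row(2).tolist()
```

Creating `matrix` is cheap regardless of file size; the cost is paid per accessed page, which is what makes the approach suitable for matrices larger than RAM.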
2. **Allocation:** Creates the `.gl` file and `mmap`s it.
3. **Pass 2:** Reads the CSV again and places edges into the correct memory buckets.

* **Note:** This process handles graphs larger than RAM.
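
The two-pass idea can be sketched in pure Python as a CSR-style (offsets plus neighbors) build. This is an assumption-heavy illustration: the first step and the actual `.gl` layout are not shown in this excerpt, so the sketch assumes pass 1 counts per-node degrees, and `build_csr` is a hypothetical name.

```python
def build_csr(edges, num_nodes):
    """Hypothetical two-pass adjacency build; not GraphZero's actual code."""
    # Pass 1 (assumed): count how many edges leave each node.
    degree = [0] * num_nodes
    for src, _dst in edges:
        degree[src] += 1

    # Allocation: a prefix sum over degrees gives each node's bucket start,
    # and the neighbor array can be sized exactly.
    offsets = [0] * (num_nodes + 1)
    for node in range(num_nodes):
        offsets[node + 1] = offsets[node] + degree[node]
    neighbors = [0] * offsets[-1]

    # Pass 2: drop every edge into the next free slot of its source bucket.
    cursor = offsets[:-1]  # slicing copies, so offsets stays intact
    for src, dst in edges:
        neighbors[cursor[src]] = dst
        cursor[src] += 1
    return offsets, neighbors

edges = [(0, 1), (2, 0), (0, 2), (1, 2)]
offsets, neighbors = build_csr(edges, num_nodes=3)
```

Reading the input twice means the edge list never has to be held in memory at once, which is consistent with the note about graphs larger than RAM.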
#### `DataType` Enum

`DataType` is an enumeration that defines the supported data types for feature storage. It ensures that the binary files created by `convert_csv_to_gd` have a consistent and optimized memory layout.

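
As a rough illustration of why a fixed element type matters, the sketch below uses a hypothetical stand-in enum (not the real `gz.DataType`) to compute the payload size of a feature file. It assumes a headerless layout and includes only the two types mentioned in this document.

```python
import enum
import struct

class DataType(enum.Enum):
    """Hypothetical stand-in; values are little-endian struct format codes."""
    FLOAT32 = "<f"
    INT64 = "<q"

def payload_bytes(num_nodes, feature_dim, dtype):
    """Size of a headerless (num_nodes, feature_dim) matrix of this dtype."""
    return num_nodes * feature_dim * struct.calcsize(dtype.value)

features_size = payload_bytes(50_000, 32, DataType.FLOAT32)
labels_size = payload_bytes(50_000, 1, DataType.INT64)
```

With a single element type per file, every row has the same byte width, so a row's position follows directly from its node ID.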

Converts a raw feature CSV into the optimized GraphZero Data format (`.gd`).

* **Input CSV Format:** The first column must be the `NodeID`, followed by its features separated by commas (e.g., `0, 0.5, 0.1, 0.9...`).
* **Arguments:**
  - `dtype` (`DataType`): Strictly enforces the memory layout of the resulting binary file (e.g., `gz.DataType.FLOAT32`).
* **Process:**
  1. **Pass 1 (Zero-Allocation):** Fast-scans the CSV using C++ `string_view` to find the maximum Node ID and feature dimension without triggering heap allocations.
  2. **Allocation:** `mmap`s a perfectly sized, C-contiguous binary file.
  3. **Pass 2:** Parses and writes the features. Missing Node IDs are handled automatically: their rows stay padded with zeroes.
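
The three steps above can be sketched with the Python standard library. This is only an approximation of the idea, not the C++ converter: `convert_csv_sketch` is a hypothetical name, the output is assumed to be headerless little-endian float32, and the sample CSV is made up for the demonstration.

```python
import csv
import mmap
import os
import struct
import tempfile

def convert_csv_sketch(csv_path, out_path):
    """Hypothetical two-pass CSV -> float32 converter (not GraphZero's code)."""
    # Pass 1: scan for the largest node ID and the feature dimension.
    max_id, dim = -1, 0
    with open(csv_path, newline="") as src:
        for rec in csv.reader(src):
            if rec:
                max_id = max(max_id, int(rec[0]))
                dim = max(dim, len(rec) - 1)
    num_nodes, row_bytes = max_id + 1, dim * 4
    if num_nodes == 0 or dim == 0:
        raise ValueError("no feature rows found")

    # Allocation: size the file exactly. The extended region reads back as
    # zeroes on common platforms, which pads rows of missing node IDs.
    with open(out_path, "wb") as out:
        out.truncate(num_nodes * row_bytes)

    # Pass 2: write each row at byte offset node_id * row_bytes.
    with open(out_path, "r+b") as out, open(csv_path, newline="") as src:
        with mmap.mmap(out.fileno(), 0) as mm:
            for rec in csv.reader(src):
                if rec:
                    feats = [float(x) for x in rec[1:]]
                    offset = int(rec[0]) * row_bytes
                    struct.pack_into(f"<{len(feats)}f", mm, offset, *feats)
            mm.flush()
    return num_nodes, dim

workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, "features.csv")
out_path = os.path.join(workdir, "features.bin")
with open(csv_path, "w", newline="") as f:
    f.write("0, 0.5, 0.25\n2, 1.0, 2.0\n")  # node 1 is intentionally missing

shape = convert_csv_sketch(csv_path, out_path)
with open(out_path, "rb") as f:
    data = struct.unpack("<6f", f.read())
```

In the demonstration, node 1 never appears in the CSV, so its row in the output stays zero-filled while nodes 0 and 2 land at their computed offsets.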
# 🧠 Example: Training Node2Vec with PyTorch
This example showcases how `GraphZero` can be seamlessly integrated into a PyTorch training loop, allowing for efficient data loading and processing of massive graphs. The C++ engine handles the heavy lifting of random walk generation, freeing up Python to focus on model training.

Here is a screenshot of the output from running the script:

# 🧠 Example 2: End-to-End GraphSAGE with Zero-Copy Features
This script is a complete, runnable example. It generates a synthetic graph dataset, compiles it into GraphZero's zero-copy formats (`.gl` and `.gd`), and trains a GraphSAGE model.

Notice how we use `gz.FLOAT32` for the node features and `gz.INT64` for the classification labels. Both are memory-mapped directly into PyTorch without first being copied into system RAM.

**File:** `train_graphsage.py`

```python
import os
import time
import torch
import torch.nn as nn
import torch.optim as optim
import graphzero as gz
import numpy as np
from torch.utils.data import DataLoader, Dataset

# --- 1. CONFIGURATION & DATA GENERATION ---
NUM_NODES = 50_000
NUM_EDGES = 200_000
FEATURE_DIM = 32
NUM_CLASSES = 10
FANOUT_K = 5
BATCH_SIZE = 1024

def generate_synthetic_data():
    """Generates synthetic CSVs if they don't exist yet."""
    # ... (lines omitted in this excerpt) ...
print(f"✅ Training Complete in {time.time() - t0:.2f} seconds.")
```
This example demonstrates a complete end-to-end workflow using `GraphZero` for a GNN training task. The synthetic dataset is generated, converted to the optimized binary formats, and then seamlessly integrated into a PyTorch training loop with zero-copy data access. The C++ engine handles all graph sampling efficiently, allowing the GPU to focus on training the model.