Learning how perception works with the Waymo Open Perception Dataset by performing 3D semantic segmentation and sensor fusion from scratch, without the Waymo Python package and with only minimal LLM assistance.
- Motivation
- Dataset
- What I've Built So Far
- Technical Deep-Dives
- Pipeline & Project Structure
- Sensor Fusion
- Challenges & Lessons Learned
- Roadmap
- Setup & Running
- References
I am very interested in autonomous vehicles and want to pursue a career in the field. I believe there is a strong potential market for them because of their game-changing safety and convenience, and I find the tech beyond fascinating. Having grown up in Phoenix, AZ, I've seen Waymo grow tremendously, and I use their service whenever I have the chance. This is the future.
Therefore, I want to learn how self-driving cars work, and what better way to do that than by recreating their functions?
This is one of the first steps in my journey of learning the ins and outs of self-driving cars, and I'm having a great time.
Waymo Open Dataset v2.0 (waymo_open_dataset_v_2_0_0 GCS bucket).
This holds the lidar points along with their labels, calibration data, and the corresponding image frames. These are the folders I am working with for now; I will be using more as the project goes on.
| Folder | What it holds |
|---|---|
| lidar/ | Range Image 3D points |
| lidar_segmentation/ | Lidar Labels |
| lidar_calibration/ | Extrinsic Matrices per Laser |
| camera_image/ | Images per Timestamp |
| camera_calibration/ | Extrinsic / Intrinsic Matrices |
- GCS authentication & parquet access
- Range-image decode (spherical → Cartesian)
- Extrinsic transform to a global frame
- Multi-laser fusion into one point cloud
- Segmentation-label decoding & per-point coloring
- Memory-safe, timestamp-aligned data loading
- Bird's-eye + 3D (Plotly) visualization
- Scene animation (matplotlib / ffmpeg)
- LiDAR → camera projection
- LiDAR-camera fused overlay video
- Labeled sensor-fusion render
- ...
Converting spherical coordinates (phi, theta, rho: the range-image format) to Cartesian (x, y, z) was a refresher from Calculus III, and a welcome one, since I finally found a worthy application for it. The conversion is necessary for plotting in 3D space, as well as for future model training.
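As a minimal sketch of that conversion: the function name is hypothetical, and it assumes the inclination (theta) is measured up from the horizontal plane and that all angles are in radians.

```python
import numpy as np

def spherical_to_cartesian(rho, azimuth, inclination):
    """Convert range (rho), azimuth (phi), and inclination (theta) to x, y, z.

    Assumption: inclination is measured up from the horizontal plane.
    The three inputs only need to broadcast against each other.
    """
    rho = np.asarray(rho, dtype=np.float64)
    cos_incl = np.cos(inclination)
    x = rho * cos_incl * np.cos(azimuth)
    y = rho * cos_incl * np.sin(azimuth)
    z = rho * np.sin(inclination)
    # Put x, y, z on the last axis: (..., 3).
    return np.stack([x, y, z], axis=-1)
```

For a range image of shape (H, W), passing `inclination[:, None]` (one angle per row) and `azimuth[None, :]` (one angle per column) lets broadcasting produce an (H, W, 3) array of points.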
This was a challenge that I only discovered deep into development: I didn't know that the sensor had a mounting yaw, which has to be applied to the azimuth (phi) calculation. Without it, everything swung onto the wrong bearing, which ruined the segmentation plotting. Applying this correction was necessary to get the correct image.
For beam inclination, I assumed the beam values were in descending order, but they were actually ascending, so my plotted heights were capped at the wrong value.
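The two corrections above can be sketched as follows. This is an illustration under stated assumptions, not the project's exact code: the function name is hypothetical, the mounting yaw is read from the rotation block of the laser's 4x4 extrinsic matrix, columns are assumed to sweep from +pi down to -pi, and row 0 of the range image is assumed to be the top beam.

```python
import numpy as np

def build_angle_grids(extrinsic, beam_inclinations, width):
    """Return per-column azimuths and per-row inclinations for a range image."""
    extrinsic = np.asarray(extrinsic, dtype=np.float64)
    # Mounting yaw of the sensor, taken from the rotation part of the extrinsic.
    yaw = np.arctan2(extrinsic[1, 0], extrinsic[0, 0])
    # Column centers as fractions of a full turn (assumed order: +pi to -pi).
    ratios = (np.arange(width, 0, -1) - 0.5) / width
    azimuth = (ratios * 2.0 - 1.0) * np.pi - yaw
    # Inclinations are stored ascending; reverse so row 0 is the top beam
    # (assumed row order).
    inclination = np.asarray(beam_inclinations, dtype=np.float64)[::-1]
    return azimuth, inclination
```

Skipping the yaw subtraction rotates the whole cloud about the vertical axis, and skipping the reversal flips the scene vertically, which matches the two symptoms described above.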
The extrinsic matrix exists for both the camera and lidar sensors. It describes where each sensor sits and how it is oriented, so that points measured relative to a sensor can be re-expressed in a shared global frame (or, using the inverse, moved from the global frame back into a sensor's own frame).
To use it with the points, we stack the X, Y, and Z coordinates in a numpy array, add a column of ones on the right, and transpose to get a 4xN (homogeneous) array. We then multiply the extrinsic matrix by this array, and finally drop the added fourth row.
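A minimal sketch of those steps, assuming the points arrive as an N x 3 array and the extrinsic is a 4x4 matrix (the function name is hypothetical):

```python
import numpy as np

def apply_extrinsic(points_xyz, extrinsic):
    """Transform N x 3 points with a 4 x 4 extrinsic matrix."""
    points_xyz = np.asarray(points_xyz, dtype=np.float64)
    ones = np.ones((points_xyz.shape[0], 1))
    # N x 4 after appending the ones column, 4 x N after transposing.
    homogeneous = np.hstack([points_xyz, ones]).T
    transformed = np.asarray(extrinsic, dtype=np.float64) @ homogeneous
    # Drop the homogeneous row and return to N x 3.
    return transformed[:3].T
```

The ones column is what lets a single matrix multiply apply both the rotation and the translation stored in the extrinsic.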
After inspecting the data with pandas, I noticed that the segmentation labels are pretty sparse: only about 30 timestamps have labels, compared to 198 for lidar, and only laser 1 contains labels.
To get these labels, a mask has to be applied after converting, so that only valid values are kept (returns that are actually visible, i.e. non-negative).
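The masking step might look like the sketch below. The function name and array layouts are assumptions for illustration: `ranges` is the (H, W) range channel with missing returns marked by a non-positive value, and `seg_image` is an (H, W, 2) label image whose second channel is assumed to hold the semantic class.

```python
import numpy as np

def labeled_points(points_xyz, ranges, seg_image):
    """Keep only pixels with a real lidar return, plus their semantic labels.

    points_xyz: (H, W, 3) Cartesian points
    ranges:     (H, W) range channel, non-positive where there is no return
    seg_image:  (H, W, 2) labels; channel 1 assumed to be the semantic class
    """
    valid = np.asarray(ranges) > 0
    semantic = np.asarray(seg_image)[..., 1]
    # Boolean indexing flattens the image grid into a list of valid points.
    return np.asarray(points_xyz)[valid], semantic[valid]
```

Because the same boolean mask indexes both arrays, each returned point stays paired with its label, which is what per-point coloring relies on.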
To express the already-processed 3D global coordinates relative to the camera we are taking frames from, and then project them onto the image, we must:
- Multiply the 3D global coords by the inverse of the extrinsic matrix for the camera.
- Divide by the depth (X axis in this case) to get a normalized set of 2D coords (u, v).
- Scale by the intrinsic values of the camera (focal length, lens center point).

Finally, take these (u, v) coordinates and apply a mask that keeps only the points within the bounds of the image dimensions.
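The projection steps above can be sketched as a simple pinhole model. Several things here are assumptions rather than confirmed details of the project: the function and parameter names are hypothetical, the camera frame is taken to be X forward, Y left, Z up (hence the sign flips), lens distortion is ignored, and points behind the camera are dropped before the bounds check.

```python
import numpy as np

def project_to_image(points_xyz, cam_extrinsic, fx, fy, cx, cy, width, height):
    """Project N x 3 global points into pixel coordinates (pinhole sketch)."""
    pts = np.asarray(points_xyz, dtype=np.float64)
    homogeneous = np.hstack([pts, np.ones((pts.shape[0], 1))]).T  # 4 x N
    # The inverse extrinsic moves points from the global frame into the camera frame.
    cam = (np.linalg.inv(cam_extrinsic) @ homogeneous)[:3].T  # N x 3
    depth = cam[:, 0]  # X axis is the viewing direction (assumed)
    in_front = depth > 0
    safe_depth = np.where(in_front, depth, 1.0)  # avoid dividing by zero
    # Normalize by depth, then scale and shift with the intrinsics.
    u = fx * (-cam[:, 1] / safe_depth) + cx
    v = fy * (-cam[:, 2] / safe_depth) + cy
    in_image = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)
    return u[in_image], v[in_image], depth[in_image]
```

Returning the depth alongside (u, v) keeps it available for coloring the overlaid points later.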
| File | Role |
|---|---|
| semseg.ipynb | Semantic Segmentation |
| semseg_functions.py | SemSeg functions |
| sensor_fusion.ipynb | Sensor Fusion learning |
| sensor_fusion_functions.py | Fusion functions |
| media/ | videos/plots generated |
This implementation takes an Early Fusion approach, but I will try Late Fusion in the future. Depth is used to color the lidar points overlaid on top of the image (coloring by segmentation label is actively in progress, TBD).
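One simple way to turn depth into a color is a linear ramp between two colors. The sketch below is illustrative only: the function name, the red-to-blue ramp, and the 75 m cutoff are hypothetical choices, and any colormap could be substituted.

```python
import numpy as np

def depth_to_rgb(depth, max_depth=75.0):
    """Map depths to RGB in [0, 1]: near points red, far points blue.

    max_depth is a hypothetical cutoff; anything farther is clamped.
    """
    t = np.clip(np.atleast_1d(np.asarray(depth, dtype=np.float64)) / max_depth, 0.0, 1.0)
    near = np.array([1.0, 0.0, 0.0])
    far = np.array([0.0, 0.0, 1.0])
    # Linear blend between the near and far colors, one row per point.
    return (1.0 - t)[:, None] * near + t[:, None] * far
```

The resulting N x 3 array can be passed as per-point colors to a scatter plot drawn over the camera image.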
- Memory Usage
  - My first implementations of the data retrieval and processing algorithms were far from optimal and often crashed my kernel, so I looked for ways to minimize memory usage while still getting demonstrative results.
  - Issues included retrieving large files multiple times for only a small portion of their data, holding large dataframes in memory for too long, and loading columns that went unused.
  - FIX: Being memory efficient by processing/projecting immediately after loading, so that not too much data is held in memory at once. I also used the "del" keyword and the garbage collector to free data that was no longer necessary inside the loop.
- Unaligned LiDAR and Camera for Fusion video
  - The lidar points were too high up on the image, and it took me a while to figure out that the root issue was in my lidar processing function, in both the height correction and the azimuth.
  - FIX: I had to reverse the theta series array because I assumed it would be in descending order, but it was actually in ascending order, which changed the direction of my point cloud as I iterated over timestamps.
  - FIX: I had to apply a small transformation to the azimuth calculation to factor in the yaw of the sensor, which rotated the points to the correct angle relative to the camera.
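The memory fix described above follows a load, process, release pattern. Here is a minimal sketch of it; `load_frame` and `process_frame` are hypothetical callables standing in for the project's own loading and projection functions.

```python
import gc
import numpy as np

def process_frames(timestamps, load_frame, process_frame):
    """Process one timestamp at a time, keeping only the small results."""
    results = []
    for ts in timestamps:
        frame = load_frame(ts)                # large, short-lived data
        results.append(process_frame(frame))  # small result worth keeping
        # Drop the reference before the next iteration loads more data,
        # then ask the garbage collector to clean up anything left over.
        del frame
        gc.collect()
    return results

# Usage with stand-in functions (assumptions for illustration only).
means = process_frames(
    [1, 2, 3],
    load_frame=lambda ts: np.full((1000, 3), float(ts)),
    process_frame=lambda frame: float(frame.mean()),
)
```

Reading only the columns that are actually needed (for example through the `columns` argument of `pyarrow.parquet.read_table`) shrinks each loaded frame further.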
- Next: Implementing predefined segmentation labels into the sensor fusion pipeline.
- Later: Creating a model to detect labels from each relevant sensor output.
- Eventually: Running optimized versions of these perception functions in a CARLA simulator to evaluate my progress, and iterate from there.
Environment: Python 3.10, run inside Jupyter (these are notebook-driven).
Python dependencies: pyarrow (parquet + GCS filesystem access), pandas,
numpy, matplotlib, Pillow (JPEG decode), plotly (interactive 3D),
tensorflow and open3d (imported by the helper module), plus gcsfs and
google-cloud-storage. Install them into a Python 3.10 environment with your
package manager of choice.
System dependency: ffmpeg must be on your PATH — the scene/fusion
animations are written to disk with matplotlib's FFMpegWriter.
Google Cloud authentication: The data is read live from the public GCS
bucket waymo_open_dataset_v_2_0_0 — there are no local copies. Authentication
goes through the Google Cloud SDK: sign in once with the gcloud auth login flow,
and make sure gcloud is installed (this project expects it at /usr/bin/gcloud,
with config under /home/jacob/.config/gcloud — adjust the two paths at the top
of semseg_functions.py for your machine). On import, the helper module shells
out to gcloud auth print-access-token and builds the GcsFileSystem from that
token. The token expires after one hour, so for long sessions you'll need to
re-import the module (or re-run its first cell) to refresh it.
Running it: Open semseg.ipynb for the LiDAR-only segmentation pipeline, or
sensor_fusion.ipynb for the LiDAR-camera fusion work, and run the cells top to
bottom. Generated videos and plots land in media/.
Dataset
- Waymo Open Dataset — official site
- Waymo Open Dataset v2.0 documentation
- Sun et al., Scalability in Perception for Autonomous Driving: Waymo Open Dataset, CVPR 2020 — arXiv:1912.04838
Data access
- Apache Arrow / PyArrow — reading Parquet
- PyArrow filesystems — Google Cloud Storage
- gcloud auth print-access-token reference
Geometry & projection
- OpenCV camera calibration & 3D reconstruction (pinhole model, intrinsics/extrinsics)
- Spherical coordinate system (azimuth / inclination / radius)
- Homogeneous coordinates
- ...
