3D Semantic Segmentation & LiDAR-Camera Sensor Fusion - by Jacob Igo

Learning how perception works with the Waymo Open Perception Dataset by performing 3D semantic segmentation and sensor fusion from scratch, without the Waymo Python package and with extremely minimal LLM assistance.

Motivation

I am very interested in autonomous vehicles and want to pursue them as a career. I believe there is a strong potential market for them one day because of their game-changing safety and convenience, and because I think the tech is beyond fascinating. Having grown up in Phoenix, AZ, I've seen Waymo grow tremendously, and I always use their services when I have the chance. This is the future.

Therefore, I want to learn how self-driving cars work, and what better way to do it than by recreating their functions?

This is one of the first steps in my journey of learning the ins and outs of self-driving cars, and I'm having a great time.

Dataset

Waymo Open Dataset v2.0 (waymo_open_dataset_v_2_0_0 GCS bucket).

This holds the lidar points along with their labels, calibration data, and the corresponding image frames. These are the folders I am working with for now; I will be using more throughout this project.

Folder	What it holds
`lidar/`	Range Image 3D points
`lidar_segmentation/`	Lidar Labels
`lidar_calibration/`	Extrinsic Matrices per Laser
`camera_image/`	Images per Timestamp
`camera_calibration/`	Extrinsic / Intrinsic Matrices

What I've Built So Far

Technical Deep-Dives

Range Images → 3D Points (Spherical → Cartesian)

Converting spherical coordinates (phi = azimuth, theta = inclination, rho = range: the range image format) to Cartesian (x, y, z) was a refresher from Calculus III, and a welcome one since I found a worthy application of it. This is necessary for plotting in 3D space, as well as for future model training.
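As a minimal sketch of that conversion (not the exact code in `semseg_functions.py`), assuming theta is measured as elevation above the horizontal plane rather than as the polar angle from the z axis used in most Calculus III texts, and using a hypothetical function name:

```python
import numpy as np

def spherical_to_cartesian(rho, phi, theta):
    """Convert range/azimuth/inclination to x, y, z.

    rho:   range in meters
    phi:   azimuth in radians (rotation about the z axis)
    theta: inclination in radians, measured up from the horizontal plane
           (assumption; swap sin/cos if theta is measured from the z axis)
    """
    rho, phi, theta = np.broadcast_arrays(rho, phi, theta)
    x = rho * np.cos(theta) * np.cos(phi)
    y = rho * np.cos(theta) * np.sin(phi)
    z = rho * np.sin(theta)
    # Stack along a new last axis: a (H, W) range image becomes (H, W, 3)
    return np.stack([x, y, z], axis=-1)
```

Because the inputs are broadcast, the same function works for a single point or for a whole range image at once.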

Beam Inclination & Azimuth Correction

This was a challenge that I only discovered deep into development: I didn't know that the sensor had a mounting yaw that has to be applied to the azimuth (phi) calculation. Without it, everything swung onto the wrong bearing, which ruined the segmentation plotting. Applying this correction was necessary to get the correct image.

For beam inclination, I assumed the beam values were in descending order, but they were actually ascending, so this capped my height at a wrong value when plotting.
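Both corrections can be sketched together when building the per-pixel angle grids. This is an illustration under stated assumptions, not the project's exact code: the function name is hypothetical, row 0 of the range image is assumed to be the topmost beam, columns are assumed to sweep from +pi to -pi, and the mounting yaw is read from the rotation part of the lidar extrinsic.

```python
import numpy as np

def build_angle_grids(beam_inclinations, width, extrinsic):
    """Return (phi, theta) grids of shape (num_beams, width)."""
    # Inclinations are stored ascending (lowest beam first); flip them so
    # that row 0 of the range image lines up with the highest beam.
    inclinations = np.asarray(beam_inclinations, dtype=float)[::-1]

    # Mounting yaw of the sensor, taken from the extrinsic's rotation block.
    extrinsic = np.asarray(extrinsic, dtype=float)
    yaw = np.arctan2(extrinsic[1, 0], extrinsic[0, 0])

    # Column centers mapped to azimuth in (+pi, -pi), then yaw-corrected.
    ratios = (width - (np.arange(width) + 0.5)) / width
    azimuth = (ratios * 2.0 - 1.0) * np.pi - yaw

    # meshgrid with default 'xy' indexing gives (num_beams, width) outputs.
    phi, theta = np.meshgrid(azimuth, inclinations)
    return phi, theta
```

Skipping the flip mirrors the cloud vertically, and skipping the yaw term rotates the whole cloud about the vertical axis, which matches the two symptoms described above.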

Extrinsic Transform (Sensor → Vehicle/Global Frame)

Each camera and lidar sensor has an extrinsic matrix that encodes the sensor's position and orientation, so that points measured relative to the sensor can be re-expressed in a shared vehicle/global frame instead.

To use it with the points, we stack the X, Y, and Z coordinates into an Nx3 numpy array, append a column of ones on the right, and transpose to get a 4xN homogeneous array. We then matrix-multiply the extrinsic matrix by this array and finally drop the added fourth component.
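A minimal sketch of that homogeneous transform, assuming the points arrive as an (N, 3) array and the extrinsic as a 4x4 matrix (the function name is hypothetical):

```python
import numpy as np

def apply_extrinsic(points_xyz, extrinsic):
    """Transform (N, 3) sensor-frame points with a 4x4 extrinsic matrix."""
    points_xyz = np.asarray(points_xyz, dtype=float)
    ones = np.ones((points_xyz.shape[0], 1))
    # (N, 3) -> (N, 4) -> (4, N) homogeneous coordinates
    homogeneous = np.hstack([points_xyz, ones]).T
    transformed = np.asarray(extrinsic, dtype=float) @ homogeneous
    # Drop the homogeneous row and return to (N, 3)
    return transformed[:3].T
```

For example, an extrinsic that rotates 90 degrees about z and translates by (1, 2, 3) sends the point (1, 0, 0) to (1, 3, 3).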

Segmentation Labels

After inspecting the data with pandas, I noticed that the segmentation labels are pretty sparse: only about 30 timestamps have labels compared to 198 for lidar, and only laser 1 contains labels.

To get these labels, a mask has to be applied after converting so that only valid values remain (points that are actually visible, i.e. non-negative).
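The masking step might look like the sketch below. It assumes that pixels with no lidar return carry a negative range, and that the range image, label image, and per-pixel points share the same (H, W) layout; the function and argument names are hypothetical.

```python
import numpy as np

def mask_valid_points(ranges, labels, points_xyz):
    """Keep only pixels with a real lidar return.

    ranges:     (H, W) range image, negative where there is no return (assumption)
    labels:     (H, W) per-pixel segmentation class ids
    points_xyz: (H, W, 3) Cartesian points for the same pixels
    """
    valid = np.asarray(ranges) > 0
    # Boolean indexing flattens the image: results are (M, 3) and (M,)
    return np.asarray(points_xyz)[valid], np.asarray(labels)[valid]
```

Using one boolean mask for both arrays keeps every surviving point paired with its own label.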

LiDAR → Camera Projection

To get the already-processed 3D global coordinates relative to the camera we are taking frames from, we must:

Multiply the 3D global coords by the inverse of the extrinsic matrix for the camera.
Divide by the depth (X axis in this case) to get a normalized set of 2D coords (u, v)
Scale by the intrinsic values of the camera (focal length, lens centerpoint)

Finally, you take these (u, v) coordinates and apply a mask that keeps only the points within the bounds of the image dimensions.
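The projection steps above can be sketched as a simple pinhole model. This is an illustration, not the project's exact function: it ignores lens distortion, and it assumes a camera frame with x forward, y left, and z up, which is why y and z are negated to get image coordinates that grow right and down. All names are hypothetical.

```python
import numpy as np

def project_to_image(points_xyz, cam_extrinsic, fx, fy, cx, cy, width, height):
    """Project (N, 3) points into pixel coordinates; returns (u, v, depth)."""
    pts = np.asarray(points_xyz, dtype=float)
    homogeneous = np.hstack([pts, np.ones((pts.shape[0], 1))]).T  # (4, N)

    # Step 1: move points into the camera frame with the inverse extrinsic.
    cam = (np.linalg.inv(cam_extrinsic) @ homogeneous)[:3].T      # (N, 3)

    # Step 2: divide by depth (the x axis here); guard points behind the camera.
    depth = cam[:, 0]
    in_front = depth > 0
    safe_depth = np.where(in_front, depth, 1.0)

    # Step 3: scale by focal length and shift by the principal point.
    u = -cam[:, 1] / safe_depth * fx + cx
    v = -cam[:, 2] / safe_depth * fy + cy

    # Final mask: in front of the camera and inside the image bounds.
    inside = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)
    return u[inside], v[inside], depth[inside]
```

Returning the depth alongside (u, v) is convenient because the same values can be reused to color the overlaid points.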

Pipeline & Project Structure

File	Role
`semseg.ipynb`	Semantic Segmentation
`semseg_functions.py`	SemSeg functions
`sensor_fusion.ipynb`	Sensor Fusion learning
`sensor_fusion_functions.py`	Fusion functions
`media/`	videos/plots generated

Sensor Fusion

This implementation takes an Early Fusion approach, but I will try Late Fusion in the future. Depth is used to color the lidar points overlaid on top of the image (actively working on segmentation labels, TBD).
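One simple way to turn depth into a color for the overlay is a linear ramp, sketched below. The function name, the near-red/far-blue choice, and the `max_depth` default are all illustrative assumptions; a matplotlib colormap would serve the same purpose.

```python
import numpy as np

def depth_to_rgb(depth, max_depth=75.0):
    """Map depth in meters to RGB in [0, 1]: near points red, far points blue."""
    # Normalize to [0, 1], clamping anything beyond max_depth
    t = np.clip(np.asarray(depth, dtype=float) / max_depth, 0.0, 1.0)
    return np.stack([1.0 - t, np.zeros_like(t), t], axis=-1)
```

The resulting (N, 3) array can be passed as per-point colors when scattering the projected (u, v) coordinates over the camera frame.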

Challenges & Lessons Learned

Memory Usage

My first implementations of the data retrieval and processing algorithms were very suboptimal, and my kernel crashed quite often, so I looked for ways to minimize my memory usage while still getting demonstrative results.
Issues included retrieving large files multiple times for only a small portion of their data, holding large dataframes in memory for too long, and loading unnecessary columns that went unused.
FIX: Being memory efficient by processing/projecting immediately after loading so that not too much data is held in memory, and using the `del` keyword and the garbage collector to free data that wasn't necessary in the loop.
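The load, process, and release pattern can be sketched like this. `load_frame` is a hypothetical callable standing in for whatever reads one timestamp's data, and the per-frame mean height is just a placeholder for the real processing.

```python
import gc
import numpy as np

def summarize_frames(load_frame, timestamps):
    """Process one frame at a time and keep only the small result."""
    results = []
    for ts in timestamps:
        frame = load_frame(ts)  # large (N, 3) array for this timestamp
        # Process immediately; store only a single float per frame
        results.append(float(frame[:, 2].mean()))
        # Drop the reference before the next load and ask the GC to collect
        del frame
        gc.collect()
    return results
```

The point is that at most one large frame is alive at a time. When the loader reads parquet through pandas, passing only the needed names to the `columns` argument of `pandas.read_parquet` trims memory further.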

Unaligned LiDAR and Camera for Fusion video

The lidar points were too high up on the image, and it took me a while to figure out that the root issue was in my lidar processing function and had to do with height correction along with azimuth.
FIX: I had to reverse the theta array because I assumed it would be in descending order when it was actually ascending, which changed the direction of my point cloud when iterating over timestamps.
FIX: I had to apply a small transformation to the azimuth calculation to factor in the yaw of the sensor, which rotated the points so that they are visualized at the correct angle relative to the camera.

Roadmap

Next: Implementing predefined segmentation labels into the sensor fusion pipeline.
Later: Creating a model to detect labels from each relevant sensor output.
Eventually: Running optimized versions of these perception functions in a CARLA simulator to evaluate my progress, and iterate from there.

Setup & Running

Environment: Python 3.10, run inside Jupyter (these are notebook-driven).

Python dependencies: pyarrow (parquet + GCS filesystem access), pandas, numpy, matplotlib, Pillow (JPEG decode), plotly (interactive 3D), tensorflow and open3d (imported by the helper module), plus gcsfs and google-cloud-storage. Install them into a Python 3.10 environment with your package manager of choice.

System dependency: ffmpeg must be on your PATH — the scene/fusion animations are written to disk with matplotlib's FFMpegWriter.

Google Cloud authentication: The data is read live from the public GCS bucket waymo_open_dataset_v_2_0_0 — there are no local copies. Authentication goes through the Google Cloud SDK: sign in once with the gcloud auth login flow, and make sure gcloud is installed (this project expects it at /usr/bin/gcloud, with config under /home/jacob/.config/gcloud — adjust the two paths at the top of semseg_functions.py for your machine). On import, the helper module shells out to gcloud auth print-access-token and builds the GcsFileSystem from that token. The token expires after one hour, so for long sessions you'll need to re-import the module (or re-run its first cell) to refresh it.
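For long sessions, one option is to rebuild the filesystem when the token is close to expiring instead of re-importing the module. The sketch below is an assumption-laden illustration, not the code in `semseg_functions.py`: the function names and the five-minute margin are hypothetical, and it relies on `pyarrow.fs.GcsFileSystem` accepting `access_token` together with `credential_token_expiration`.

```python
import subprocess
from datetime import datetime, timedelta, timezone

TOKEN_LIFETIME = timedelta(hours=1)  # tokens expire after one hour

def fetch_access_token(gcloud_path="/usr/bin/gcloud"):
    """Shell out to gcloud and return a fresh access token."""
    result = subprocess.run(
        [gcloud_path, "auth", "print-access-token"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def needs_refresh(expires_at, now=None, margin=timedelta(minutes=5)):
    """True once we are within `margin` of the token's expiry."""
    now = now or datetime.now(timezone.utc)
    return now >= expires_at - margin

def make_filesystem(gcloud_path="/usr/bin/gcloud"):
    """Build a GcsFileSystem from a fresh token; returns (fs, expires_at)."""
    from pyarrow import fs  # imported lazily so the helpers work without pyarrow
    token = fetch_access_token(gcloud_path)
    expires_at = datetime.now(timezone.utc) + TOKEN_LIFETIME
    gcs = fs.GcsFileSystem(access_token=token,
                           credential_token_expiration=expires_at)
    return gcs, expires_at
```

A notebook loop could then call `needs_refresh(expires_at)` before each batch of reads and call `make_filesystem()` again when it returns True.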

Running it: Open semseg.ipynb for the LiDAR-only segmentation pipeline, or sensor_fusion.ipynb for the LiDAR-camera fusion work, and run the cells top to bottom. Generated videos and plots land in media/.

References

Dataset

Waymo Open Dataset — official site
Waymo Open Dataset v2.0 documentation
Sun et al., Scalability in Perception for Autonomous Driving: Waymo Open Dataset, CVPR 2020 — arXiv:1912.04838

Data access

Geometry & projection

...
