PeopleCounter

Real-time AI crowd counting on a 4K camera stream. Multiple inference modes (object detection, tiling, density estimation) run entirely inside a Docker container on an NVIDIA GPU. A web UI served by the container lets you switch modes live, watch overlays on the video feed, and monitor GPU utilisation, person count, and end-to-end latency.

The current codebase is app_v2/ — a TensorRT-only, GPU-first orchestrator. The legacy pipeline lives under app_v1/ for reference.

Hardware Requirements

app_v2 was developed and tested on:

Component	Tested configuration
GPU	NVIDIA RTX 5060 Ti (Blackwell sm_120, 16 GB GDDR7, 448 GB/s)
CUDA	13.1
TensorRT	10.15.1
OS (host)	Ubuntu 24.04 (bare metal or WSL2)

GeForce RTX 50 series (Blackwell): recommended. All FP8-QDQ TensorRT engines and model conversions have been optimised for the Blackwell architecture.

GeForce RTX 40 series (Ada Lovelace): should work. FP8 is supported, but the engines must be recompiled locally because TensorRT engines are hardware-specific. Real-world performance on Ada has not been measured.

OpenVINO / Intel GPU / NPU: not supported in app_v2. WSL2 and Docker do not expose Intel accelerators. OpenVINO integration was part of app_v1, which ran natively on Windows and leveraged Intel NPU/GPU via the OpenVINO runtime. The conversion script (convert_pth_to_openvino.py) is preserved for app_v1 reference only.

Inference Modes

Mode	Model	Precision	Output
`passthrough`	—	—	raw video
`yolo_global`	yolo26x (decoder: YOLOv8)	FP8-QDQ	bbox + seg masks
`yolo_tiles`	yolo26n (decoder: YOLOv8)	FP8-QDQ	bbox (3×2 tiles)
`density`	DM-Count QNRF (VGG16)	FP16	density heatmap
`crowd_global`	YOLO-CROWD (YOLOv5)	FP16	bbox
`crowd_tiles`	YOLO-CROWD (YOLOv5)	FP16	bbox (3×2 tiles)
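
The two `*_tiles` modes run the detector on a 3×2 grid of tiles instead of on the whole frame. The following is a minimal sketch of how such a grid could be computed and how a box found inside a tile maps back to full-frame coordinates. It assumes a 3840×2160 frame, 3 columns by 2 rows, and no overlap between tiles; the README states neither the orientation nor the overlap, and the function names are hypothetical rather than app_v2's own. Rescaling between the tile size and the 640×640 network input is left out.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def tile_grid(width: int, height: int, cols: int = 3, rows: int = 2) -> List[Box]:
    """Return non-overlapping tile rectangles in row-major order."""
    tiles = []
    for r in range(rows):
        y0, y1 = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            tiles.append((x0, y0, x1, y1))
    return tiles


def to_frame_coords(box: Box, tile: Box) -> Box:
    """Shift a box expressed in tile pixels into full-frame coordinates."""
    ox, oy = tile[0], tile[1]
    return (box[0] + ox, box[1] + oy, box[2] + ox, box[3] + oy)


tiles = tile_grid(3840, 2160)  # assumed 4K frame size
print(len(tiles), tiles[0], tiles[-1])
print(to_frame_coords((10, 20, 50, 80), tiles[4]))
```

With these assumptions each tile is 1280×1080 pixels. A real tiling scheme would usually add some overlap and merge duplicate detections along tile borders.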

Model naming note: The inference models belong to the yolo26 family (yolo26n/s/m/l/x). "YOLOv8" refers only to the TensorRT decoder engine format: its decoding logic is compatible with the output tensors of YOLOv8, YOLO11, and yolo26 models.

Quick Start

Prerequisites — Linux / WSL2 (Ubuntu 24.04)

app_v2 runs inside Docker. Clone and run on a machine with:

Ubuntu 24.04 (native or WSL2)
NVIDIA GPU with drivers that expose CUDA inside WSL/containers
Docker Engine + NVIDIA Container Toolkit

Windows users: app_v2 does not run directly on Windows. Clone the repository under WSL2 (Ubuntu 24.04) to build and run the Docker container. The windows/ sub-directory contains a separate video streaming tool; see the Windows Video Streamer section.

1. Clone (Linux / WSL2)

git clone <repo-url> PeopleCounter
cd PeopleCounter

2. Build the Docker image

Run once. Takes ~30–60 minutes (downloads CUDA base image, builds OpenCV from source).

./0_build_image.sh

This produces people-counter:gpu-final.

3. Install the runtime toolchain layer

Installs TensorRT 10.15, Python packages, NVIDIA ModelOpt, and CUDA bindings on top of the base image. Takes ~5–10 minutes.

./1_prepare.sh

4. Build the NVDEC layer (required for app_v2)

Compiles PyNvCodec / VPF from source and commits a new image layer. NVDEC hardware video decode is required by the app_v2 orchestrator.

./2_prepare_nvdec.sh

Produces people-counter:gpu-final-nvdec.

The installer looks for external/Video_Codec_SDK_13.0.37 inside the repo. If the directory is absent, it downloads the archive from NVIDIA's servers (sign-in may be required). Set VIDEO_CODEC_SDK_DIR=/path/to/sdk to use a local copy.
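
The lookup described above can be sketched in a few lines of Python. This is an illustration only, not the installer's actual code: it assumes that VIDEO_CODEC_SDK_DIR takes precedence over the copy under external/, and the function name `resolve_sdk_dir` is hypothetical.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

SDK_DIRNAME = "Video_Codec_SDK_13.0.37"  # directory name given in this README


def resolve_sdk_dir(repo_root: Path,
                    env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Return the SDK directory to use, or None if a download would be needed."""
    env = os.environ if env is None else env
    override = env.get("VIDEO_CODEC_SDK_DIR")
    # Assumption: an explicit override wins over the repo-local copy.
    if override and Path(override).is_dir():
        return Path(override)
    bundled = repo_root / "external" / SDK_DIRNAME
    return bundled if bundled.is_dir() else None


print(resolve_sdk_dir(Path.cwd()))
```

A return value of `None` corresponds to the case where the installer would fall back to downloading the archive.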

5. (Optional) Build TensorRT engines

Pre-built *.engine files for RTX 5060 Ti are included in models/tensorrt/. Run this step only when you need to regenerate them (e.g. on a different GPU or after a TensorRT version change):

./3_prepare_models.sh

To rebuild individual model families:

# All YOLO26 variants (FP32 + FP16 + FP8-QDQ):
./3_prepare_models.sh

# FP16 only (skip FP8):
SKIP_FP8=1 ./3_prepare_models.sh

# FP8 only (skip FP16):
SKIP_FP16=1 ./3_prepare_models.sh

# Specific seg models only:
SEG_MODELS="yolo26x-seg" ./3_prepare_models.sh

# Specific bbox models only, FP8 only (yolo_tiles):
BBOX_MODELS="yolo26n" SKIP_FP16=1 ./3_prepare_models.sh

# Skip density model rebuild:
SKIP_DENSITY=1 ./3_prepare_models.sh

For density model ONNX export and TensorRT conversion, see export_density_to_onnx.py and convert_onnx_to_trt.py.
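
When the rebuild is driven from a script rather than typed by hand, the environment variables shown above can be assembled programmatically. The sketch below is a hypothetical Python wrapper (the function names are not part of the repo) that uses only the variables listed in this section. Model names are passed through as a single string, because the separator for several models is not documented here.

```python
import os
import subprocess
from typing import Dict, Optional


def rebuild_env(skip_fp8: bool = False,
                skip_fp16: bool = False,
                skip_density: bool = False,
                seg_models: Optional[str] = None,
                bbox_models: Optional[str] = None) -> Dict[str, str]:
    """Build the environment overrides understood by 3_prepare_models.sh."""
    overrides: Dict[str, str] = {}
    if skip_fp8:
        overrides["SKIP_FP8"] = "1"
    if skip_fp16:
        overrides["SKIP_FP16"] = "1"
    if skip_density:
        overrides["SKIP_DENSITY"] = "1"
    if seg_models:
        overrides["SEG_MODELS"] = seg_models
    if bbox_models:
        overrides["BBOX_MODELS"] = bbox_models
    return overrides


def run_rebuild(overrides: Dict[str, str],
                script: str = "./3_prepare_models.sh") -> None:
    """Run the script from the repo root with the overrides applied."""
    subprocess.run([script], env={**os.environ, **overrides}, check=True)


# Same effect as: BBOX_MODELS="yolo26n" SKIP_FP16=1 ./3_prepare_models.sh
print(rebuild_env(skip_fp16=True, bbox_models="yolo26n"))
```

`run_rebuild` is not called here because it launches the real engine build.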

6. Run the application

./4_run_app.sh 'http://<windows-ip>:5002/video_feed' --app-version v2
./4_run_app.sh 'http://127.0.0.1:5002/live' --app-version v2

Then open http://localhost:5000 in your browser.

Other source examples:

# Local USB camera:
./4_run_app.sh /dev/video0 --app-version v2

# RTSP stream:
./4_run_app.sh 'rtsp://192.168.1.100:554/stream' --app-version v2

# Pass extra environment variables:
./4_run_app.sh 'http://...' --app-version v2 -e PERF_LOG=1

# CUDA profiling (nsys):
./4_run_app.sh 'http://...' --app-version v2 --cuda-profile
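
For automation, the launch command can be composed from the pieces shown in the examples above. This is a small sketch assuming only the options that appear in this README (`--app-version v2`, `-e NAME=VALUE`, `--cuda-profile`); the helper name `build_run_command` is hypothetical.

```python
import shlex
from typing import Dict, List, Optional


def build_run_command(source: str,
                      extra_env: Optional[Dict[str, str]] = None,
                      cuda_profile: bool = False) -> List[str]:
    """Return the argv list for launching app_v2 on a given video source."""
    cmd = ["./4_run_app.sh", source, "--app-version", "v2"]
    for name, value in (extra_env or {}).items():
        cmd += ["-e", f"{name}={value}"]  # extra environment variables
    if cuda_profile:
        cmd.append("--cuda-profile")      # nsys profiling run
    return cmd


cmd = build_run_command("rtsp://192.168.1.100:554/stream", {"PERF_LOG": "1"})
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` from the repo root would start the container; printing it first makes it easy to verify the quoting of URLs.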

7. Run tests

./5_run_tests.sh

Runs the unit and integration test suite inside people-counter:gpu-final-nvdec. Expected output: 199 passed, 1 skipped.

Windows Video Streamer

The windows/ directory contains standalone streaming tools that run on Windows and expose camera, image, or video sources for the rest of the project.

Legacy bridge: Python + FFmpeg over HTTP MPEG-TS (documented below)
New bridge: Python UI + GStreamer + MediaMTX over RTSP, under active development

For the new Windows GStreamer workflow, installation notes, manual recovery steps, and plugin verification commands, see windows/README_GST_BRIDGE.md.

Legacy technology: Python + FFmpeg (auto-downloaded on first run). Does not require CUDA or Docker on the Windows machine.

Setup and launch (interactive)

Double-click windows/setup_and_run.bat, or run it from a terminal:

windows\setup_and_run.bat

On first run it:

Creates a venv_bridge Python virtual environment
Installs dependencies (Flask, OpenCV)
Auto-downloads FFmpeg if not found in windows/bin/
Asks you to choose a webcam, resolution, and H.264 encoder
Starts streaming on http://<your-ip>:5002/video_feed

Streaming a reference image or video file

Use windows/setup_and_run_ref_image.bat (pre-configured for 4K@30fps via Intel QSV encoder):

windows\setup_and_run_ref_image.bat

Or pass arguments directly to camera_bridge.py:

REM Stream an image (looped endlessly):
venv_bridge\Scripts\python.exe windows\camera_bridge.py ^
    --input-file "windows\ref_images\people_walking.jpg" ^
    --resolution 4K ^
    --fps 30 ^
    --bitrate 20000 ^
    --encoder h264_qsv

REM Stream a video file (looped):
venv_bridge\Scripts\python.exe windows\camera_bridge.py ^
    --input-file "C:\Videos\crowd.mp4" ^
    --fps 25

REM Specify a port:
venv_bridge\Scripts\python.exe windows\camera_bridge.py --port 5003

camera_bridge.py arguments

Argument	Default	Description
`--input-file PATH`	—	Image (`.jpg`/`.png`) or video file to stream; looped endlessly. Omit to use a webcam.
`--resolution`	source	Output resolution: `1080p`, `4K`, etc. Scales/letterboxes the source.
`--fps N`	30	Output frame rate for a looped image, or frame-rate cap for a video file.
`--bitrate KBPS`	auto	Target bitrate in kbps (encoder-dependent if omitted).
`--encoder ENC`	auto	H.264 encoder: `h264_nvenc` (NVENC), `h264_qsv` (Intel QSV), `libx264` (CPU). Auto-detected from available hardware.
`--port N`	5002	HTTP listen port.

The stream URL to pass to 4_run_app.sh is http://<windows-ip>:<port>/video_feed.
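
To make the argument table concrete, here is an argparse sketch that mirrors it and derives the stream URL from the chosen port. It is a reconstruction from the table only, not the actual source of camera_bridge.py; the value types, the use of `None` for "auto" and "source", and the helper name `stream_url` are assumptions, and 192.168.1.20 is a placeholder address.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented camera_bridge.py arguments."""
    p = argparse.ArgumentParser(prog="camera_bridge.py")
    p.add_argument("--input-file", metavar="PATH", default=None,
                   help="image or video file to loop; omit to use a webcam")
    p.add_argument("--resolution", default=None,
                   help="output resolution such as 1080p or 4K (default: source)")
    p.add_argument("--fps", type=int, default=30, metavar="N",
                   help="output frame rate")
    p.add_argument("--bitrate", type=int, default=None, metavar="KBPS",
                   help="target bitrate in kbps (default: encoder-dependent)")
    p.add_argument("--encoder", default=None, metavar="ENC",
                   choices=["h264_nvenc", "h264_qsv", "libx264"],
                   help="H.264 encoder (default: auto-detected)")
    p.add_argument("--port", type=int, default=5002, metavar="N",
                   help="HTTP listen port")
    return p


def stream_url(host: str, port: int = 5002) -> str:
    """URL to hand to 4_run_app.sh for a bridge running on host:port."""
    return f"http://{host}:{port}/video_feed"


args = build_parser().parse_args(
    ["--input-file", "crowd.mp4", "--fps", "25", "--port", "5003"])
print(stream_url("192.168.1.20", args.port))
```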

Configuration (YAML)

app_v2 is configured via YAML files — not .env profiles. Key files:

File	Contents
`app_v2/config/pipeline.yaml`	Fusion strategy, CUDA stream IDs, tile groups, tensor pool, enabled models
`app_v2/config/model_inference.yaml`	Per-mode confidence thresholds, NMS IoU, density peak weight
`app_v2/config/test_config.yaml`	Test harness settings (stream URL, timeouts)

See app_v2/docs/README_ARCHI.md for a full description of every configuration key.
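
As background for the confidence and NMS IoU settings in model_inference.yaml, the sketch below shows what the two thresholds generally control in a bounding-box detector: low-scoring boxes are dropped first, then greedy non-maximum suppression removes boxes that overlap a higher-scoring box by more than the IoU threshold. This is a plain-Python illustration of the general technique, not app_v2's GPU implementation, and the default values 0.25 and 0.45 are placeholders rather than the project's settings.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        conf_thres: float = 0.25, iou_thres: float = 0.45) -> List[int]:
    """Return indices of the boxes kept, highest score first."""
    order = sorted((i for i, s in enumerate(scores) if s >= conf_thres),
                   key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap a kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_thres for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.1]
print(nms(boxes, scores))
```

In dense crowds a lower IoU threshold suppresses more neighbouring boxes, which can merge people standing close together, so this value is worth tuning per mode.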

Important Model Notes

Model family: yolo26 (not YOLOv8). The decoder is named yolo_v8_decoder because it handles the YOLOv8-compatible output tensor format, which is also produced by yolo26 and YOLO11 models.
TensorRT engines are hardware-specific: the .engine files included in this repo were compiled for RTX 5060 Ti (sm_120, CUDA 13.1, TRT 10.15). Running them on a different GPU requires recompiling with ./3_prepare_models.sh.
Opset 18: ONNX exports target opset 18 so that models exported with Ultralytics remain compatible with TensorRT conversion.
OpenVINO IR: artifacts in models/openvino/ are for app_v1 only. app_v2 does not use OpenVINO.
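
Because the bundled engines target sm_120, a quick check of the local GPU's compute capability tells you whether a rebuild is needed. The following sketch assumes a driver recent enough for `nvidia-smi --query-gpu=compute_cap`; the helper names are hypothetical. A matching capability is necessary but not sufficient, since engines also depend on the TensorRT version.

```python
import subprocess
from typing import List

BUILT_FOR = "12.0"  # compute capability of the bundled engines (sm_120)


def sm_tag(compute_cap: str) -> str:
    """Convert a capability such as '8.9' into the 'sm_89' form."""
    major, minor = compute_cap.strip().split(".")
    return f"sm_{int(major)}{int(minor)}"


def engines_need_rebuild(compute_cap: str, built_for: str = BUILT_FOR) -> bool:
    """True when the local GPU differs from the one the engines target."""
    return sm_tag(compute_cap) != sm_tag(built_for)


def query_compute_caps() -> List[str]:
    """Ask nvidia-smi for the compute capability of every visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


# Static example: an Ada Lovelace card reports 8.9.
print(sm_tag("8.9"), engines_need_rebuild("8.9"))
```

On a machine with an NVIDIA driver installed, `[engines_need_rebuild(c) for c in query_compute_caps()]` gives the answer per GPU.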

Performance Optimization

FP8-QDQ (production standard): All YOLO26 TensorRT engines use FP8 with quantize-dequantize (Q/DQ) nodes inserted by NVIDIA ModelOpt. This is the recommended precision for Blackwell GPUs, which have dedicated FP8 Tensor Core acceleration.
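
To illustrate what a Q/DQ pair does numerically: the quantize node divides a tensor by a scale and rounds it to FP8, and the dequantize node multiplies by the same scale again. The sketch below simulates this in pure Python. It assumes the E4M3 format that TensorRT uses for FP8 (3 mantissa bits, largest finite value 448) and a per-tensor scale of amax / 448; it shows the numerics only and is not how ModelOpt or TensorRT implement them. The function names are hypothetical.

```python
import math
from typing import List, Sequence

E4M3_MAX = 448.0      # largest finite E4M3 value
MIN_NORMAL_EXP = -6   # exponent of the smallest normal E4M3 value


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at 448."""
    if x == 0.0 or math.isnan(x):
        return x
    mag = min(abs(x), E4M3_MAX)
    # floor(log2(mag)), clamped so that subnormals share a fixed step.
    exp = max(math.frexp(mag)[1] - 1, MIN_NORMAL_EXP)
    step = 2.0 ** (exp - 3)  # spacing of values with 3 mantissa bits
    return math.copysign(round(mag / step) * step, x)


def fake_quant(values: Sequence[float], amax: float) -> List[float]:
    """Quantize to FP8 and dequantize again using a per-tensor scale."""
    scale = amax / E4M3_MAX
    return [round_to_e4m3(v / scale) * scale for v in values]


print(round_to_e4m3(0.3), round_to_e4m3(17.0), round_to_e4m3(1000.0))
print(fake_quant([0.0, 0.3, 1.0, -2.0], amax=2.0))
```

The difference between the input and the output of `fake_quant` is the quantization error that calibration tries to keep small by choosing a suitable amax.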

INT8 quantization: On paper, INT8 promises a 2–4× speedup over FP16. In practice, on Blackwell RTX 50 series GPUs, which are optimised for FP8, calibrated INT8 engines showed no significant advantage over FP8-QDQ. Gains were marginal on the larger models (yolo26m/l/x) and essentially absent on yolo26n. FP8-QDQ via ModelOpt is therefore the recommended path for this hardware.

Production numbers (FP8-QDQ engines):

Mode	Model	Input	Latency
`yolo_global`	yolo26x FP8-QDQ	640×640	~3–5 ms
`yolo_tiles`	yolo26n FP8-QDQ	6 × 640×640	~10–15 ms
`density`	DM-Count QNRF FP16	1920×1088	22.9 ms
`crowd_global`	YOLO-CROWD FP16	640×640	~6–10 ms
`crowd_tiles`	YOLO-CROWD FP16	6 × 640×640	~20–30 ms

See app_v2/docs/README_PERFORMANCE_ANALYSIS.md for the full benchmark, latency budget, and metric reference.
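
A quick way to read the latency table is to convert each figure into the highest frame rate it allows and compare it with the per-frame budget. The sketch below uses the upper bound of each range from the table and assumes a 30 fps source (33.3 ms per frame); the target frame rate is an assumption, and the figures cover the listed latency only, so decode, pre-processing, and overlay time would reduce the headroom further.

```python
from typing import Dict

# Upper bounds of the latency ranges in the table above, in milliseconds.
LATENCY_MS: Dict[str, float] = {
    "yolo_global": 5.0,
    "yolo_tiles": 15.0,
    "density": 22.9,
    "crowd_global": 10.0,
    "crowd_tiles": 30.0,
}


def max_fps(latency_ms: float) -> float:
    """Highest frame rate sustainable if each frame costs latency_ms."""
    return 1000.0 / latency_ms


def fits_budget(latency_ms: float, target_fps: float = 30.0) -> bool:
    """True when one frame fits inside the per-frame time budget."""
    return latency_ms <= 1000.0 / target_fps


for mode, ms in LATENCY_MS.items():
    print(f"{mode}: up to {max_fps(ms):.1f} fps, "
          f"fits 30 fps budget: {fits_budget(ms)}")
```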

Credits & Licenses

LWCC — Lightweight Crowd Counting

Used for DM-Count, CSRNet, Bayesian crowd counting, and SFANet models.

This library is a result of research by Matija Teršek and Maša Kljun. Although the paper has not been published yet, please provide the link to the GitHub repository if you use LWCC in your research.

Repository: https://github.com/tersekmatija/lwcc
License: MIT. Model licenses (CSRNet, Bayesian, DM-Count, SFANet) are inherited from their respective upstream repositories.

YOLO-CROWD

Used for crowd-optimised bounding box detection (crowd_global, crowd_tiles).

Repository: https://github.com/zaki1003/YOLO-CROWD
License: MIT
References: YOLOv5 · yolov5-face · mmdetection · repulsion_loss_pytorch

Ultralytics (YOLOv8 decoder / yolo26 architecture)

Repository: https://github.com/ultralytics/ultralytics
License: AGPL-3.0 for open-source use. For commercial deployment, an Ultralytics Enterprise License is required; see https://www.ultralytics.com/license.
