A computer vision pipeline for automated behavioural analysis from video — body pose estimation and facial expression classification using MediaPipe.
The pipeline processes two independent camera feeds:
- Body camera — 33-point body landmarks for movement tracking and activity segmentation.
- Face camera — 468-point facial mesh for expression and head movement classification across 7 classes (neutral, smiling, mouth open, eyebrows raised, eyes closed, head turn, head tilt).
Outputs include per-frame landmark and feature CSVs, annotated videos, and publication-ready figures.
cv-behavioural-analysis/
├── notebooks/
│ └── body_pose_demo.ipynb # Body pose estimation
├── scripts/
│ └── face_classifier.py # Facial expression + head movement classifier
├── figures/ # Sample outputs
├── requirements.txt
├── setup_env.sh # Conda setup
└── setup_env_venv.sh # venv setup
videos/ and output/ are created at runtime and are gitignored.
# Option A: venv
bash setup_env_venv.sh
source cv-env/bin/activate
# Option B: Conda
bash setup_env.sh
conda activate cv-behavioural
# CUDA 12.4
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
Place your video in videos/, then:
jupyter notebook notebooks/body_pose_demo.ipynb
Update VIDEO_FILENAME in the notebook to match your file, then run all cells. The notebook generates pose figures, a landmark CSV, and an annotated video.
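The notebook's internals are not reproduced here. As a rough illustration of how wrist velocity and activity segmentation could be derived from per-frame landmarks, the sketch below computes frame-to-frame speed and applies a single threshold. The function names, the sample coordinates, and the threshold value are all hypothetical and are not taken from the notebook.

```python
import math


def wrist_speeds(points, fps):
    """Speed between consecutive (x, y) positions, in coordinate units per second."""
    if not points:
        return []
    speeds = [0.0]  # the first frame has no predecessor, so its speed is set to 0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        speeds.append(math.hypot(x1 - x0, y1 - y0) * fps)
    return speeds


def segment_activity(speeds, threshold):
    """Label a frame 'moving' when its speed exceeds the threshold, else 'still'."""
    return ["moving" if s > threshold else "still" for s in speeds]


# Hypothetical normalised wrist positions over five frames at 30 fps
track = [(0.50, 0.50), (0.50, 0.50), (0.53, 0.54), (0.56, 0.58), (0.56, 0.58)]
speeds = wrist_speeds(track, fps=30)
labels = segment_activity(speeds, threshold=0.5)  # illustrative threshold only
print(labels)
```

In practice the speed trace would likely need smoothing before thresholding, since landmark jitter can produce short spurious "moving" segments.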
Place your face video in videos/, then:
export TF_USE_LEGACY_KERAS=1
python scripts/face_classifier.py
The script extracts mouth aspect ratio (MAR), eye aspect ratio (EAR), eyebrow height, head pose (via solvePnP), and nose-to-chin geometry, then classifies each frame into one of 7 expression/movement classes. It outputs figures, a CSV, and an annotated video.
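For readers unfamiliar with the two aspect-ratio features, here is a minimal sketch of how eye aspect ratio (EAR) and mouth aspect ratio (MAR) are commonly computed from 2D landmark coordinates. The EAR formula is the widely used six-point definition. The MAR definition (lip gap over mouth width) is one common variant and is an assumption here; the script may define it differently. Points are passed in directly, so no MediaPipe landmark indices are assumed.

```python
import math


def dist(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def eye_aspect_ratio(p1, p2, p3, p4, p5, p6):
    """EAR = (|p2-p6| + |p3-p5|) / (2 * |p1-p4|).

    p1 and p4 are the eye corners; p2, p3 lie on the upper lid and
    p6, p5 on the lower lid. Assumes the corners are distinct points.
    """
    return (dist(p2, p6) + dist(p3, p5)) / (2.0 * dist(p1, p4))


def mouth_aspect_ratio(left, right, top, bottom):
    """MAR as vertical lip gap over mouth width (assumed definition)."""
    return dist(top, bottom) / dist(left, right)


# Hypothetical pixel coordinates for an open and a nearly closed eye
open_eye = [(0, 0), (10, -5), (20, -5), (30, 0), (20, 5), (10, 5)]
closed_eye = [(0, 0), (10, -1), (20, -1), (30, 0), (20, 1), (10, 1)]
ear_open = eye_aspect_ratio(*open_eye)
ear_closed = eye_aspect_ratio(*closed_eye)

# Hypothetical mouth corners and inner-lip midpoints
mar = mouth_aspect_ratio((0, 0), (40, 0), (20, -10), (20, 10))
print(ear_open, ear_closed, mar)
```

Both ratios are scale-invariant, which makes them less sensitive to how far the subject sits from the camera.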
The figures/ directory contains:
- pose_estimation_samples.png: body landmark overlays
- landmark_trajectories.png: body joint trajectories over time
- movement_velocity.png: wrist velocity with activity segmentation
- behavioural_segmentation.png: movement state classification
- face_mesh_samples.png: facial mesh with expression labels (7 classes)
- feature_exploration.png: raw facial features with phase annotations
- expression_features.png: feature traces with classification thresholds
- MediaPipe is optimised for single-subject scenes. Multi-person video requires additional tracking.
- The face mesh degrades under low light or strong backlighting.
- Head-pose estimation uses solvePnP; yaw estimates approaching ±90° become unreliable (gimbal-lock-adjacent behaviour).
- The 7-class facial classifier uses heuristic thresholds calibrated on the sample recording. Thresholds may need re-tuning for other recording conditions.
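To make the re-tuning point concrete, the sketch below shows one way a threshold-based classifier could keep its thresholds in a single object that is easy to override per recording. It is not the logic in scripts/face_classifier.py: the feature keys, the rule order, and every numeric value are hypothetical. The smiling class is omitted because this document does not say which feature drives it.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Thresholds:
    # All values are hypothetical placeholders, not the script's calibrated settings.
    yaw_deg: float = 20.0
    roll_deg: float = 15.0
    ear_closed: float = 0.15
    mar_open: float = 0.5
    brow_raised: float = 0.30


def classify(f, t=Thresholds()):
    """Return one label for a frame's features; earlier rules take priority."""
    if abs(f["yaw"]) > t.yaw_deg:
        return "head turn"
    if abs(f["roll"]) > t.roll_deg:
        return "head tilt"
    if f["ear"] < t.ear_closed:
        return "eyes closed"
    if f["mar"] > t.mar_open:
        return "mouth open"
    if f["brow"] > t.brow_raised:
        return "eyebrows raised"
    return "neutral"


# Hypothetical per-frame features
frame = {"yaw": 3.0, "roll": 1.0, "ear": 0.18, "mar": 0.2, "brow": 0.1}
default_label = classify(frame)

# Re-tuning for a recording where closed eyes still measure around 0.18
retuned = replace(Thresholds(), ear_closed=0.20)
retuned_label = classify(frame, retuned)
print(default_label, retuned_label)
```

Keeping the thresholds in an immutable dataclass means a new recording condition only needs a `replace(...)` call rather than edits scattered through the classification code.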
Arjun Vinayak Chikkankod
MIT — see LICENSE.
