TinhNguyen2312/SignBart

SignBart introduces a novel method for Isolated Sign Language Recognition (ISLR) using skeleton sequences, focusing on decoupling the x and y coordinates and leveraging a lightweight encoder-decoder architecture based on BART.

Highlights

  • Independent encoding of coordinates: x and y coordinates are independently encoded to better capture their unique spatial characteristics.
  • Cross-Attention mechanism: Cross-Attention integrates information between x and y after independent encoding.
  • Lightweight model: Only ~750K parameters (749,888 for the LSA-64 model), several times fewer than the skeleton-based baselines it is compared against.
  • High generalization ability: Evaluated on diverse datasets including LSA-64, WLASL, and ASL-Citizen, with competitive accuracy at a fraction of the parameter count.
  • Efficient skeleton sequence processing: Lower computational cost than RNN-, LSTM-, and GCN-based models.
  • Strong ablation results: Highlights the importance of normalization, coordinate projection, and multi-part skeleton input.

About the Model

SignBart addresses the limitations of treating skeleton keypoints as inseparable x-y pairs. Instead, it proposes:

  • Separate Coordinate Encoding:

    • x-coordinates are encoded by the Encoder.
    • y-coordinates are encoded by the Decoder.
  • Attention Mechanisms:

    • Self-Attention for x-coordinate encoding, allowing rich bidirectional context learning.
    • Self-Causal-Attention for y-coordinate encoding, maintaining temporal causality.
    • Cross-Attention to integrate information from x into y, preserving relational dependency.
  • Input Format:

    • Skeleton data extracted using Mediapipe.
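    • Shape: (T, 75, 2), where T = frames, 75 = keypoints per frame, 2 = (x, y). The full Mediapipe pose and hand output is 33 + 21 + 21 = 75 keypoints; the model's inputs are 6 body + 21 left-hand + 21 right-hand keypoints.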
  • Normalization:

    • Each component (body, left hand, right hand) normalized independently based on its local bounding box.
    • Enhances model generalization and reduces overfitting.
  • Projection:

    • Before entering attention layers, keypoints are linearly projected to a higher-dimensional space (d_model) to enrich feature representation.
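
The data flow described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the repository's implementation: it uses single-head attention with random weights standing in for learned ones, omits the learned query/key/value projections, layer normalization, and feed-forward blocks of a real BART layer, and picks T = 16 and d_model = 32 arbitrarily.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, causal=False):
    """Single-head scaled dot-product attention over the time axis."""
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (T_q, T_k)
    if causal:
        # Mask the strict upper triangle so frame t cannot see frames > t.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
T, K, d_model = 16, 75, 32                        # illustrative sizes (assumption)
skeleton = rng.random((T, K, 2))                  # stand-in for a (T, 75, 2) input

# Decouple the coordinates: each stream is a (T, K) sequence.
x_seq, y_seq = skeleton[..., 0], skeleton[..., 1]

# Linear projection to d_model; random matrices stand in for learned weights.
W_x = rng.normal(scale=K ** -0.5, size=(K, d_model))
W_y = rng.normal(scale=K ** -0.5, size=(K, d_model))
x_emb, y_emb = x_seq @ W_x, y_seq @ W_y

# Encoder side: bidirectional self-attention over the x stream.
enc = x_emb + attention(x_emb, x_emb, x_emb)
# Decoder side: causal self-attention over the y stream.
dec = y_emb + attention(y_emb, y_emb, y_emb, causal=True)
# Cross-attention: y queries attend to the encoded x stream.
out = dec + attention(dec, enc, enc)              # (T, d_model)
```

The causal mask is what distinguishes the two streams: perturbing the last frame of the y stream leaves every earlier row of `dec` unchanged, while any change to the x stream can affect every row of `enc`.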

Dataset and Keypoints Extraction

| Dataset | Videos | Words | Signers | Language |
|---|---|---|---|---|
| LSA-64 | 3,200 | 64 | 10 | Argentinian Sign Language |
| WLASL | 21,083 | 2,000 | 119 | American Sign Language |
| ASL-Citizen | >84,000 | 2,731 | 52 | Community-sourced American Sign Language |

| Keypoint Extraction Process | Details |
|---|---|
| Extraction Tool | Google Mediapipe |
| Keypoints | 6 body + full left & right hand |
| Missing Keypoints | Filled with (0, 0) |
| Coordinate Normalization | Scaled to [0, 1] relative to frame size |
| Further Normalization | Local bounding boxes for body, left hand, and right hand |
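
The per-part bounding-box normalization can be sketched as follows. This is a minimal illustration rather than the repository's code: the index layout in `PARTS` is hypothetical, and the handling of missing keypoints (excluding (0, 0) points from the box and leaving them at zero) is an assumption about how such points might be treated.

```python
import numpy as np

# Hypothetical index layout: 6 body, 21 left-hand, 21 right-hand keypoints.
PARTS = {
    "body": slice(0, 6),
    "left_hand": slice(6, 27),
    "right_hand": slice(27, 48),
}

def normalize_parts(frame, parts=PARTS, eps=1e-6):
    """Rescale each part of one (K, 2) frame to its own bounding box."""
    out = np.zeros_like(frame, dtype=float)
    for sl in parts.values():
        pts = frame[sl].astype(float)
        present = ~np.all(pts == 0, axis=1)   # (0, 0) marks a missing keypoint
        if not present.any():
            continue                          # whole part missing: leave zeros
        lo = pts[present].min(axis=0)
        hi = pts[present].max(axis=0)
        scaled = (pts - lo) / np.maximum(hi - lo, eps)
        scaled[~present] = 0.0                # assumption: missing points stay (0, 0)
        out[sl] = scaled
    return out

# Example: one frame with frame-relative coordinates and no right hand detected.
rng = np.random.default_rng(0)
frame = rng.uniform(0.2, 0.8, size=(48, 2))
frame[PARTS["right_hand"]] = 0.0
normalized = normalize_parts(frame)
```

Because each part gets its own box, a hand that occupies a small region of the frame still spans the full [0, 1] range after normalization, independent of where the signer stands.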

Pretrained Weights

| Name | Weight | Config |
|---|---|---|
| LSA-64 | LSA-64.pth | LSA-64.yaml |
| WLASL-100 | WLASL-100.pth | WLASL-100.yaml |
| WLASL-300 | WLASL-300.pth | WLASL-300.yaml |
| WLASL-1000 | WLASL-1000.pth | WLASL-1000.yaml |
| WLASL-2000 | WLASL-2000.pth | WLASL-2000.yaml |
| ASL-Citizen-100 | ASL-Citizen-100.pth | ASL-Citizen-100.yaml |
| ASL-Citizen-200 | ASL-Citizen-200.pth | ASL-Citizen-200.yaml |
| ASL-Citizen-400 | ASL-Citizen-400.pth | ASL-Citizen-400.yaml |
| ASL-Citizen-1000 | ASL-Citizen-1000.pth | ASL-Citizen-1000.yaml |
| ASL-Citizen-2731 | ASL-Citizen-2731.pth | ASL-Citizen-2731.yaml |

Installation

Prerequisites

  • Python >= 3.8
  • pip

Setup

# Clone and enter project
git clone https://github.com/tinh2044/SignBart.git
cd SignBart

# (Optional) create virtual environment
python -m venv venv
# macOS/Linux: source venv/bin/activate
# Windows PowerShell: venv\Scripts\Activate.ps1

# Install dependencies
pip install -r requirements.txt

Data Preparation

  1. Download each dataset (LSA-64, WLASL, ASL-Citizen) from its source.
  2. Extract into data/ with structure:
    data/lsa-64/{label2id.json,id2label.json,train/,test/}
    data/wlasl/{...}
    data/asl-citizen/{...}
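
Before training, it can help to confirm that a dataset directory matches the layout above. The helper below is a hypothetical sketch, not part of the repository. It assumes that label2id.json maps each label to an integer id and that id2label.json maps each id (as a string key, since JSON object keys are strings) back to its label; the actual file contents may differ.

```python
import json
from pathlib import Path

# Entries expected directly under a dataset directory such as data/lsa-64.
REQUIRED = ("label2id.json", "id2label.json", "train", "test")

def check_dataset_dir(root):
    """Return the required entries that are missing from a dataset directory."""
    root = Path(root)
    return [name for name in REQUIRED if not (root / name).exists()]

def load_label_maps(root):
    """Load both label maps and confirm they are inverses of each other."""
    root = Path(root)
    label2id = json.loads((root / "label2id.json").read_text(encoding="utf-8"))
    id2label = json.loads((root / "id2label.json").read_text(encoding="utf-8"))
    # Compare ids as strings because JSON object keys are always strings.
    for label, idx in label2id.items():
        if id2label.get(str(idx)) != label:
            raise ValueError(f"label maps disagree on {label!r}")
    return label2id, id2label
```

Calling `check_dataset_dir("data/lsa-64")` returns an empty list when all four entries are present.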
    

Usage

Below are two ways to run SignBart: via provided shell scripts or by calling main.py directly.

Using shell scripts

The scripts/ directory includes dataset-specific training and evaluation scripts. For example:

# Training on LSA-64
bash scripts/train_LSA-64.sh
# Evaluation on LSA-64
bash scripts/eval_LSA-64.sh

You can replace dataset names to run other scripts (e.g., train_WLASL-100.sh, eval_ASL-Citizen-100.sh).

Using Python entry point

Train:

python main.py --task train \
  --experiment_name my_experiment \
  --config_path configs/lsa-64.yaml \
  --data_path data/lsa-64 \
  --epochs 200 \
  --lr 2e-5 \
  --seed 379

Evaluate:

python main.py --task eval \
  --experiment_name my_experiment \
  --config_path configs/lsa-64.yaml \
  --pretrained_path checkpoints/my_experiment/epoch_X.pth \
  --data_path data/lsa-64 \
  --seed 379

Optional flags:

  • --resume_checkpoints PATH
  • --scheduler_factor FACTOR
  • --scheduler_patience PATIENCE
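
The names --scheduler_factor and --scheduler_patience suggest a reduce-on-plateau learning-rate schedule, although the scheduler actually used by main.py is not documented here. Assuming that interpretation, the pure-Python sketch below shows how the two values would interact: the learning rate is multiplied by the factor once the monitored loss has failed to improve for more than `patience` consecutive epochs. The function name and default values are illustrative.

```python
def plateau_schedule(losses, lr, factor=0.1, patience=2):
    """Return the learning rate in effect after each epoch's monitored loss."""
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in losses:
        if loss < best:
            # Improvement: remember it and reset the stall counter.
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs > patience:
                # Stalled for more than `patience` epochs: shrink the rate.
                lr *= factor
                bad_epochs = 0
        history.append(lr)
    return history

# Example with the training command's initial rate of 2e-5.
rates = plateau_schedule([1.0, 0.9, 0.9, 0.9, 0.9, 0.8], lr=2e-5, factor=0.5, patience=2)
```

A smaller factor makes each reduction more aggressive, while a larger patience tolerates longer plateaus before reducing.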

Experiments and Results

LSA-64 Dataset

| Model | Accuracy | Parameters |
|---|---|---|
| SPOTER | 100% | 5,918,848 |
| HWGATE | 98.59% | 10,758,354 |
| ST-GCN | 92.81% | 3,604,180 |
| SL-GCN | 98.13% | 4,872,306 |
| SignBart | 96.04% | 749,888 |
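
To make the size comparison concrete, the short script below computes how many times larger each baseline is than SignBart, using only the parameter counts reported in the LSA-64 table. It is plain arithmetic on those figures and makes no claim about runtime or memory use.

```python
# Parameter counts copied from the LSA-64 comparison table.
params = {
    "SPOTER": 5_918_848,
    "HWGATE": 10_758_354,
    "ST-GCN": 3_604_180,
    "SL-GCN": 4_872_306,
    "SignBart": 749_888,
}

# Size of each model relative to SignBart.
ratios = {name: count / params["SignBart"] for name, count in params.items()}

for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<9} {params[name]:>12,} parameters  {ratio:5.1f}x SignBart")
```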

WLASL Dataset

| Subset | SignBart Accuracy |
|---|---|
| WLASL-100 | 78.00% |
| WLASL-300 | 78.50% |
| WLASL-1000 | 81.45% |
| WLASL-2000 | 68.00% |

ASL-Citizen Dataset

| Subset | Accuracy | Parameters |
|---|---|---|
| ASL-Citizen-100 | 80.32% | 754,532 |
| ASL-Citizen-200 | 81.49% | 2,845,384 |
| ASL-Citizen-400 | 78.96% | 3,424,144 |
| ASL-Citizen-1000 | 81.45% | 3,578,344 |
| ASL-Citizen-2731 | 75.22% | 4,548,523 |

Ablation Studies

| Projection Effect | Accuracy |
|---|---|
| Without projection | 62.08% |
| With projection | 96.04% |

| Normalization Effect | Accuracy |
|---|---|
| No normalization | 82.50% |
| One bounding box | 90.52% |
| Two bounding boxes | 90.41% |
| Three bounding boxes (body, left hand, right hand) | 96.04% |

| Skeleton Components | Accuracy |
|---|---|
| Only body | 86.97% |
| Only left hand | 23.02% |
| Only right hand | 70.20% |
| Both hands | 91.35% |
| All components (body + left + right) | 96.04% |

About

SignBart is an efficient Isolated Sign Language Recognition model that decouples x and y coordinates using a lightweight encoder-decoder architecture. It reaches 96.04% accuracy on LSA-64 with fewer than 750K parameters, a fraction of the size of models such as SPOTER, ST-GCN, and SL-GCN, and is also evaluated on WLASL and ASL-Citizen.
