SignBart introduces a novel method for Isolated Sign Language Recognition (ISLR) using skeleton sequences, focusing on decoupling the x and y coordinates and leveraging a lightweight encoder-decoder architecture based on BART.
- Independent encoding of coordinates: x and y coordinates are independently encoded to better capture their unique spatial characteristics.
- Cross-Attention mechanism: Cross-Attention integrates information between x and y after independent encoding.
- Lightweight model: Roughly 750K parameters in the smallest configuration (749,888), several times fewer than the compared SLR models; the larger ASL-Citizen variants grow to about 4.5M.
- High generalization ability: Achieves strong accuracy across diverse datasets, including LSA-64, WLASL, and ASL-Citizen.
- Efficient skeleton sequence processing: Lower computational cost than RNN-, LSTM-, and GCN-based models.
- Strong ablation results: Highlights the importance of normalization, coordinate projection, and multi-part skeleton input.

SignBart addresses the limitations of treating skeleton keypoints as inseparable x-y pairs. Instead, it proposes:
- Separate Coordinate Encoding:
  - x-coordinates are encoded by the Encoder.
  - y-coordinates are encoded by the Decoder.
- Attention Mechanisms:
  - Self-Attention for x-coordinate encoding, allowing rich bidirectional context learning.
  - Self-Causal-Attention for y-coordinate encoding, maintaining temporal causality.
  - Cross-Attention to integrate information from x into y, preserving relational dependency.
- Input Format:
  - Skeleton data extracted using Mediapipe.
  - Shape: `(T, 75, 2)`, where T = frames, 75 = keypoints (6 body + 21 left hand + 21 right hand), 2 = (x, y).
- Normalization:
  - Each component (body, left hand, right hand) normalized independently based on its local bounding box.
  - Enhances model generalization and reduces overfitting.
- Projection:
  - Before entering attention layers, keypoints are linearly projected to a higher-dimensional space (`d_model`) to enrich feature representation.

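
The flow described above can be sketched in NumPy. This is a minimal single-head illustration under stated assumptions, not the repository's implementation: the weights are random, `d_model = 16` is an arbitrary choice, the helper names (`softmax`, `attention`) are hypothetical, and treating each frame as one token whose features are its keypoints' x (or y) values is an assumption. Separate query/key/value projections, layer normalization, feed-forward blocks, and the classification head are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, causal=False):
    """Scaled dot-product attention; returns (output, weights)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        # Mask future frames: position t may only attend to positions <= t.
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)
    return weights @ v, weights

T, K, d_model = 8, 75, 16          # frames, keypoints, hidden size (d_model is assumed)
skeleton = rng.random((T, K, 2))   # stand-in for a normalized (T, 75, 2) sequence

# Decouple coordinates: one (T, K) sequence per axis.
x_seq, y_seq = skeleton[..., 0], skeleton[..., 1]

# Linear projection of each axis to d_model (random weights for illustration).
W_x = rng.normal(size=(K, d_model))
W_y = rng.normal(size=(K, d_model))
x_emb, y_emb = x_seq @ W_x, y_seq @ W_y

# Encoder side: bidirectional self-attention over x.
x_enc, w_self = attention(x_emb, x_emb, x_emb)

# Decoder side: causal self-attention over y, then cross-attention into x.
y_dec, w_causal = attention(y_emb, y_emb, y_emb, causal=True)
fused, w_cross = attention(y_dec, x_enc, x_enc)

print(fused.shape)  # (8, 16)
```

The causal weights `w_causal` are lower-triangular, so each y frame sees only earlier y frames, while `w_cross` lets every y frame attend to the full encoded x sequence.
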

| Dataset | Videos | Words | Signers | Language |
|---|---|---|---|---|
| LSA-64 | 3,200 | 64 | 10 | Argentinian Sign Language |
| WLASL | 21,083 | 2,000 | 119 | American Sign Language |
| ASL-Citizen | >84,000 | 2,731 | 52 | Community-sourced American Sign Language |

| Keypoint Extraction Process | Details |
|---|---|
| Extraction Tool | Google Mediapipe |
| Keypoints | 6 body + full left & right hand |
| Missing Keypoints | Filled with (0, 0) |
| Coordinate Normalization | Scaled to [0, 1] relative to frame size |
| Further Normalization | Local bounding boxes for body, left hand, and right hand |
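
The local bounding-box step in the table can be sketched as below. This is an illustration under assumptions rather than the project's preprocessing code: the index layout in `PARTS` (6 body keypoints, then 21 per hand) is hypothetical, the box is computed per frame, and keypoints stored as (0, 0) are treated as missing and left at (0, 0).

```python
import numpy as np

# Hypothetical index layout: 6 body, then 21 left-hand, then 21 right-hand keypoints.
PARTS = {"body": slice(0, 6), "left_hand": slice(6, 27), "right_hand": slice(27, 48)}

def normalize_parts(frame, parts=PARTS, eps=1e-6):
    """Rescale each part of one (K, 2) frame into its own bounding box.

    Keypoints equal to (0, 0) are treated as missing: they are ignored when
    computing the box and stay at (0, 0) in the output (an assumption).
    """
    out = np.zeros_like(frame, dtype=float)
    for idx in parts.values():
        pts = frame[idx].astype(float)
        present = ~np.all(pts == 0, axis=1)
        if not present.any():
            continue  # the whole part is missing in this frame
        lo = pts[present].min(axis=0)
        hi = pts[present].max(axis=0)
        scaled = (pts - lo) / np.maximum(hi - lo, eps)  # eps guards a zero-size box
        scaled[~present] = 0.0
        out[idx] = scaled
    return out

rng = np.random.default_rng(0)
frame = rng.uniform(0.2, 0.8, size=(48, 2))  # coordinates already scaled to [0, 1]
frame[6:27] = 0.0                            # simulate an undetected left hand
normalized = normalize_parts(frame)
```

After this step the body and right hand each span their own unit box regardless of where the signer stands in the frame, and the undetected left hand stays at zero.
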

| Name | Weight | Config |
|---|---|---|
| LSA-64 | LSA-64.pth | LSA-64.yaml |
| WLASL-100 | WLASL-100.pth | WLASL-100.yaml |
| WLASL-300 | WLASL-300.pth | WLASL-300.yaml |
| WLASL-1000 | WLASL-1000.pth | WLASL-1000.yaml |
| WLASL-2000 | WLASL-2000.pth | WLASL-2000.yaml |
| ASL-Citizen-100 | ASL-Citizen-100.pth | ASL-Citizen-100.yaml |
| ASL-Citizen-200 | ASL-Citizen-200.pth | ASL-Citizen-200.yaml |
| ASL-Citizen-400 | ASL-Citizen-400.pth | ASL-Citizen-400.yaml |
| ASL-Citizen-1000 | ASL-Citizen-1000.pth | ASL-Citizen-1000.yaml |
| ASL-Citizen-2731 | ASL-Citizen-2731.pth | ASL-Citizen-2731.yaml |

Prerequisites
- Python >= 3.8
- pip
Setup
```bash
# Clone and enter project
git clone https://github.com/tinh2044/SignBart.git
cd SignBart
# (Optional) create virtual environment
python -m venv venv
# macOS/Linux: source venv/bin/activate
# Windows PowerShell: venv\Scripts\Activate.ps1
# Install dependencies
pip install -r requirements.txt
```

Data Preparation
- Download each dataset (LSA-64, WLASL, ASL-Citizen) from its source.
- Extract into `data/` with structure:

```
data/lsa-64/{label2id.json,id2label.json,train/,test/}
data/wlasl/{...}
data/asl-citizen/{...}
```

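
Before training, it can help to confirm that a dataset folder matches the layout above. The following is a small standard-library sketch; the helper names (`missing_entries`, `load_label_map`) are hypothetical and not part of the repository, and the assumption that `label2id.json` holds a flat label-to-id JSON object is not confirmed by the source.

```python
import json
from pathlib import Path

# Entries listed in the layout above for data/lsa-64.
EXPECTED = ("label2id.json", "id2label.json", "train", "test")

def missing_entries(dataset_dir, expected=EXPECTED):
    """Return the expected files/directories that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in expected if not (root / name).exists()]

def load_label_map(dataset_dir):
    """Load label2id.json, assuming it is a flat JSON object of label -> id."""
    with open(Path(dataset_dir) / "label2id.json", encoding="utf-8") as f:
        return json.load(f)

if __name__ == "__main__":
    missing = missing_entries("data/lsa-64")
    print("missing:", missing or "nothing")
```
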
Below are two ways to run SignBart: via provided shell scripts or by calling main.py directly.
The scripts/ directory includes dataset-specific training and evaluation scripts. For example:
```bash
# Training on LSA-64
bash scripts/train_LSA-64.sh
# Evaluation on LSA-64
bash scripts/eval_LSA-64.sh
```

You can replace dataset names to run other scripts (e.g., train_WLASL-100.sh, eval_ASL-Citizen-100.sh).

Train:
```bash
python main.py --task train \
  --experiment_name my_experiment \
  --config_path configs/lsa-64.yaml \
  --data_path data/lsa-64 \
  --epochs 200 \
  --lr 2e-5 \
  --seed 379
```

Evaluate:

```bash
python main.py --task eval \
  --experiment_name my_experiment \
  --config_path configs/lsa-64.yaml \
  --pretrained_path checkpoints/my_experiment/epoch_X.pth \
  --data_path data/lsa-64 \
  --seed 379
```

Optional flags:
- `--resume_checkpoints PATH`
- `--scheduler_factor FACTOR`
- `--scheduler_patience PATIENCE`

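
When launching several runs, the commands above can also be assembled from Python. The sketch below only builds an argument list from the flags documented in this section; `build_command` and `run` are hypothetical helpers, not part of the repository, and how `main.py` interprets each flag is defined by the repository itself.

```python
import subprocess
import sys

def build_command(task, experiment_name, config_path, data_path, seed=379, **extra):
    """Assemble an argv list for main.py from the flags documented above.

    Extra keyword arguments become `--name value` pairs, e.g. epochs=200.
    """
    cmd = [sys.executable, "main.py", "--task", task,
           "--experiment_name", experiment_name,
           "--config_path", config_path,
           "--data_path", data_path,
           "--seed", str(seed)]
    for name, value in extra.items():
        cmd += [f"--{name}", str(value)]
    return cmd

def run(cmd):
    """Launch the command from the repository root; raises if main.py fails."""
    subprocess.run(cmd, check=True)

cmd = build_command("train", "my_experiment", "configs/lsa-64.yaml", "data/lsa-64",
                    epochs=200, lr=2e-5)
print(" ".join(cmd))
```

Passing a list rather than a single shell string avoids quoting problems when paths contain spaces.
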

| Model | Accuracy | Parameters |
|---|---|---|
| SPOTER | 100% | 5,918,848 |
| HWGATE | 98.59% | 10,758,354 |
| ST-GCN | 92.81% | 3,604,180 |
| SL-GCN | 98.13% | 4,872,306 |
| SignBart | 96.04% | 749,888 |
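
To make the size/accuracy trade-off in this table concrete, the short script below computes how many times larger each compared model is than SignBart, and the accuracy gap in percentage points. The numbers are copied from the table; the variable names are illustrative only.

```python
# Accuracy (%) and parameter counts copied from the table above.
results = {
    "SPOTER": (100.00, 5_918_848),
    "HWGATE": (98.59, 10_758_354),
    "ST-GCN": (92.81, 3_604_180),
    "SL-GCN": (98.13, 4_872_306),
    "SignBart": (96.04, 749_888),
}

base_acc, base_params = results["SignBart"]
ratios, gaps = {}, {}
for name, (acc, params) in results.items():
    if name == "SignBart":
        continue
    ratios[name] = params / base_params  # how many times larger than SignBart
    gaps[name] = acc - base_acc          # accuracy difference in percentage points
    print(f"{name}: {ratios[name]:.1f}x parameters, {gaps[name]:+.2f} points")
```

The compared models carry roughly 4.8x to 14.3x as many parameters as SignBart; three of them score higher (by about 2 to 4 points), while ST-GCN scores lower.
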

| Subset | SignBart Accuracy |
|---|---|
| WLASL-100 | 78.00% |
| WLASL-300 | 78.50% |
| WLASL-1000 | 81.45% |
| WLASL-2000 | 68.00% |

| Subset | Accuracy | Parameters |
|---|---|---|
| ASL-Citizen-100 | 80.32% | 754,532 |
| ASL-Citizen-200 | 81.49% | 2,845,384 |
| ASL-Citizen-400 | 78.96% | 3,424,144 |
| ASL-Citizen-1000 | 81.45% | 3,578,344 |
| ASL-Citizen-2731 | 75.22% | 4,548,523 |

| Projection Effect | Accuracy |
|---|---|
| Without projection | 62.08% |
| With projection | 96.04% |

| Normalization Effect | Accuracy |
|---|---|
| No normalization | 82.50% |
| One bounding box | 90.52% |
| Two bounding boxes | 90.41% |
| Three bounding boxes (body, left hand, right hand) | 96.04% |

| Skeleton Components | Accuracy |
|---|---|
| Only body | 86.97% |
| Only left hand | 23.02% |
| Only right hand | 70.20% |
| Both hands | 91.35% |
| All components (body + left + right) | 96.04% |
