Real-time, two-way ASL ↔ Speech communication
A deaf or hard-of-hearing person signs into their webcam → the app translates it into spoken words. A hearing person speaks back → their words appear as text for the signer to read.
sequenceDiagram
actor S as 🤟 Signer
participant B as Browser
participant WS as WebSocket
participant MP as MediaPipe Hands
participant RF as Random Forest
participant UI as Sentence Builder
actor L as 👂 Listener
Note over S,L: ── ASL → Speech (Signer to Listener) ──────────────────────
S->>B: Show ASL sign to webcam
B->>WS: JPEG frame (binary · 15 fps)
WS->>MP: Raw frame bytes
MP->>MP: Extract 21 hand landmarks (63 floats · wrist-normalised)
MP->>RF: Landmark feature vector
RF-->>WS: Predicted letter + confidence (0 – 1)
WS-->>UI: Prediction JSON
UI->>UI: Hold-ring fills over 1.5 s → letter confirmed → sentence grows
UI->>L: Speak & Send → speechSynthesis reads sentence aloud
Note over S,L: ── Speech → Text (Listener to Signer) ──────────────────────
L->>B: Speaks into microphone
B->>B: Web Speech API — continuous SpeechRecognition
B-->>S: Transcript appears in conversation feed
| Feature | Detail |
|---|---|
| Real-time sign detection | 21-landmark MediaPipe pipeline, one inference per WebSocket round-trip |
| Hold-to-confirm | Hold any ASL sign for 1.5 s to commit the letter — prevents accidental input |
| Two-way communication | Signer → TTS speaks it aloud · Listener → STT transcribes back to text |
| Dual theme | Neumorphic dark + light themes, persisted across sessions |
| 91 automated tests | 59 frontend (Vitest) + 32 backend (pytest) |
| Zero-queue latency | Back-pressure: only one frame in-flight at a time — no buffer buildup |
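The back-pressure rule in the last row can be illustrated with a small sketch. This is not the project's `useWebSocket` hook: `FrameTransport`, `createBackPressureSender` and `onReply` are hypothetical names, and dropping a new frame (rather than replacing the pending one) while a frame is in flight is an assumption.

```typescript
// Hypothetical minimal transport; a browser WebSocket has a compatible send().
interface FrameTransport {
  send(data: Uint8Array): void;
}

// Allows at most one frame in flight. While a reply is outstanding,
// new frames are dropped instead of queued, so no buffer can build up.
function createBackPressureSender(transport: FrameTransport) {
  let inFlight = false;

  return {
    // Returns true if the frame was sent, false if it was dropped.
    sendFrame(frame: Uint8Array): boolean {
      if (inFlight) return false;
      inFlight = true;
      transport.send(frame);
      return true;
    },
    // Call when the prediction for the outstanding frame arrives
    // (or when the connection resets) to allow the next frame.
    onReply(): void {
      inFlight = false;
    },
  };
}
```

With this shape, a capture loop can call `sendFrame` at a fixed polling rate and the effective throughput settles at one frame per round-trip.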
| Layer | Technology | Purpose |
|---|---|---|
| Frontend | React 19 + Vite + TypeScript | UI and camera capture |
| Styling | Tailwind CSS v4 + custom neumorphic system | Dark/light neumorphism theme |
| Animations | Framer Motion | Transitions, ripple rings, entrance choreography |
| Backend | FastAPI + Uvicorn | WebSocket endpoint, inference API |
| Hand Tracking | MediaPipe Hands | 21-landmark extraction (63 floats per frame) |
| Classifier | scikit-learn Random Forest | Letter prediction from landmarks |
| Real-time | WebSocket binary frames | JPEG bytes → JSON prediction |
| Speech | Web Speech API | TTS (speechSynthesis) + STT (SpeechRecognition) |
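To make the "21 landmarks, 63 floats, wrist-normalised" figure concrete, here is a hedged sketch of one way such a feature vector could be built. The backend does this in Python (`_normalise()` in `predictor.py`), which is not reproduced here. The TypeScript below is only an illustration: it assumes that normalisation means subtracting the wrist position from every landmark, and `Landmark` and `toFeatureVector` are hypothetical names.

```typescript
type Landmark = { x: number; y: number; z: number };

const NUM_LANDMARKS = 21; // MediaPipe Hands reports 21 points per hand
const WRIST = 0;          // landmark 0 is the wrist in MediaPipe's hand model

// Flattens 21 (x, y, z) landmarks into 63 numbers, expressed relative
// to the wrist so the result does not depend on where the hand is in frame.
// Assumption: translation only; any scaling the real backend applies is omitted.
function toFeatureVector(landmarks: Landmark[]): number[] {
  const wrist = landmarks[WRIST];
  if (landmarks.length !== NUM_LANDMARKS || wrist === undefined) {
    throw new Error(`expected ${NUM_LANDMARKS} landmarks, got ${landmarks.length}`);
  }
  const features: number[] = [];
  for (const p of landmarks) {
    features.push(p.x - wrist.x, p.y - wrist.y, p.z - wrist.z);
  }
  return features; // 21 * 3 = 63 values; the first three are always 0
}
```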
Sign-Language-Translator/
│
├── backend/
│ ├── model/
│ │ ├── preprocess.py ← Runs MediaPipe on dataset → landmarks.csv
│ │ ├── train.py ← Trains Random Forest → asl_model.pkl
│ │ └── asl_model.pkl ← Trained model (not in git — you generate it)
│ │
│ ├── tests/
│ │ ├── conftest.py ← Shared fixtures (TestClient, JPEG frames)
│ │ ├── test_predictor.py ← 14 tests — _normalise(), predict_from_bytes()
│ │ └── test_websocket.py ← 8 tests — WebSocket lifecycle + protocol
│ │
│ ├── main.py ← FastAPI app: GET /health · WS /ws
│ ├── predictor.py ← MediaPipe + RF inference engine
│ ├── requirements.txt ← Runtime dependencies
│ ├── requirements-dev.txt ← + pytest, httpx for testing
│ └── pytest.ini
│
├── frontend/
│ ├── public/
│ │ └── favicon.svg ← Hand-icon SVG favicon
│ │
│ ├── src/
│ │ ├── components/
│ │ │ ├── CameraPanel.tsx ← Webcam feed + glassmorphic prediction overlay
│ │ │ ├── SentenceBuilder.tsx ← Hold-to-confirm ring + letter trough
│ │ │ ├── TranscriptPanel.tsx ← Chat-style message feed (signer + listener)
│ │ │ ├── SpeechInput.tsx ← Mic toggle with animated ripple rings
│ │ │ └── __tests__/ ← 28 component tests
│ │ │
│ │ ├── hooks/
│ │ │ ├── useWebSocket.ts ← WS connection + back-pressure sendFrame
│ │ │ ├── useSpeech.ts ← useTTS + useSTT (Web Speech API)
│ │ │ ├── useCamera.ts ← getUserMedia + JPEG capture loop
│ │ │ └── __tests__/ ← 31 hook tests
│ │ │
│ │ ├── types/index.ts ← Prediction, Message, ConnectionStatus
│ │ ├── lib/utils.ts ← cn() utility (clsx + tailwind-merge)
│ │ ├── index.css ← Neumorphic design system (dark + light)
│ │ ├── App.tsx ← Root layout + theme toggle
│ │ └── test/setup.ts ← Vitest global setup
│ │
│ ├── index.html
│ ├── vite.config.ts
│ └── package.json
│
└── README.md
| Requirement | Version | How to check |
|---|---|---|
| Python | 3.10 – 3.12 | python --version |
| Node.js | 18+ | node --version |
| npm | 9+ | npm --version |
| Webcam | Any | — |
| Chrome or Edge | Latest | For Speech-to-Text support |
git clone https://github.com/pvchaitanya8/Sign-Language-Translator.git
cd Sign-Language-Translator
cd backend
# Create a virtual environment
python -m venv venv
# Activate it
venv\Scripts\activate # Windows PowerShell / CMD
source venv/bin/activate # macOS / Linux
# Install dependencies
pip install -r requirements.txt
Skip this step if you already have backend/model/asl_model.pkl.
Download the dataset
- Visit Kaggle — ASL Alphabet
- Download and extract the archive
- Arrange the files so the folder structure looks like this:
backend/dataset/
├── train/
│ ├── A/ ← 3,000 images per class
│ ├── B/
│ ├── C/
│ ├── ...
│ └── Z/
└── test/
├── A_test.jpg
├── B_test.jpg
└── ... ← 1 test image per class
Run preprocessing + training
# Make sure you are inside backend/ with the venv active
# Step 1 — MediaPipe extracts landmarks from every training image
# Produces: model/landmarks.csv (takes ~2-5 minutes)
python model/preprocess.py
# Step 2 — Random Forest is trained on the landmark CSV
# Produces: model/asl_model.pkl (takes ~30 seconds)
python model/train.py
You should see output like:
[preprocess] Processing class A ... 3000 samples
[preprocess] Processing class B ... 3000 samples
...
[preprocess] Saved 87000 rows to model/landmarks.csv
[train] Training RandomForestClassifier(n_estimators=200) ...
[train] Test accuracy : 99.2 %
[train] Saved → model/asl_model.pkl
[train] Classes (29): ['A', 'B', ... 'Z', 'del', 'nothing', 'space']
# Inside backend/ with venv active
uvicorn main:app --reload --port 8000
Expected output:
INFO: Uvicorn running on http://127.0.0.1:8000
[predictor] Model loaded — 29 classes: ['A', 'B', ..., 'space', 'del']
INFO: Application startup complete.
Verify it works: open http://localhost:8000/health — should return {"status":"ok"}.
Open a new terminal (keep the backend running):
cd frontend
npm install
npm run dev
Open http://localhost:5173 in Chrome or Edge.
When the page loads:
- The status chip (top of the camera panel) transitions: OFFLINE → CONNECTING → LIVE (green)
- Click Start Camera and allow browser camera permission
- Your webcam feed appears in the left panel, mirrored like a selfie camera
- Show your hand to the camera
- A prediction overlay slides up at the bottom of the feed — it shows the detected letter and confidence percentage
- The ring disc in the Sentence Builder section starts filling clockwise as you hold the sign
- Hold for 1.5 seconds → the letter is confirmed with a spring-bounce animation and appended to your sentence
- Repeat to build a full sentence
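The hold-to-confirm behaviour described in the steps above can be sketched as a small pure function. This is an illustration under stated assumptions, not the code in `SentenceBuilder.tsx`: `HoldState`, `HoldResult` and `updateHold` are hypothetical names, the two constants reuse the values this README lists for `HOLD_MS` and `MIN_CONFIDENCE`, and resetting the ring on a low-confidence prediction is an assumption. Handling of the `nothing` class is left out.

```typescript
const HOLD_MS = 1500;        // hold duration quoted in this README
const MIN_CONFIDENCE = 0.45; // threshold quoted in this README

interface HoldState {
  label: string | null; // sign currently being held, if any
  since: number;        // timestamp (ms) when that sign was first seen
}

interface HoldResult {
  state: HoldState;
  progress: number;         // ring fill, from 0 to 1
  confirmed: string | null; // set only on the update that completes the hold
}

function updateHold(state: HoldState, label: string, confidence: number, now: number): HoldResult {
  // Assumption: a low-confidence prediction resets the ring.
  if (confidence < MIN_CONFIDENCE) {
    return { state: { label: null, since: now }, progress: 0, confirmed: null };
  }
  // A different sign restarts the timer from zero.
  if (label !== state.label) {
    return { state: { label, since: now }, progress: 0, confirmed: null };
  }
  const progress = Math.min((now - state.since) / HOLD_MS, 1);
  if (progress >= 1) {
    // Commit, then reset so the sign must be held again to repeat the letter.
    return { state: { label: null, since: now }, progress: 1, confirmed: label };
  }
  return { state, progress, confirmed: null };
}
```

Feeding each incoming prediction through `updateHold` with the current time yields both the ring progress to draw and, once per completed hold, the letter to append.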
Special signs:
| Sign | Action |
|---|---|
| SPACE | Adds a space between words |
| DEL | Removes the last character (ring turns red) |
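As a rough sketch of how a confirmed sign could change the sentence, the function below maps the special classes to their actions. `applySign` is a hypothetical name, and the lower-case labels `space`, `del` and `nothing` are taken from the class list in this README. The assumption is that the classifier emits exactly those strings.

```typescript
// Returns the sentence after applying one confirmed sign.
function applySign(sentence: string, label: string): string {
  switch (label) {
    case "space":
      return sentence + " ";
    case "del":
      return sentence.slice(0, -1); // no-op on an empty sentence
    case "nothing":
      return sentence;              // assumption: "nothing" never edits the text
    default:
      return sentence + label;      // ordinary letters A to Z
  }
}
```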
- When your sentence is ready, click Speak & Send (green button):
- The sentence is spoken aloud by the browser's text-to-speech engine
- It appears in the conversation feed as a green-accented signer message
- Click the mic button at the bottom-right — two animated ripple rings pulse outward
- Speak naturally — interim text appears as you talk
- Click the mic button again to stop
- Your speech is transcribed and appears as a blue-accented listener message in the feed
Speech recognition requires Chrome or Edge. Firefox does not ship the SpeechRecognition half of the Web Speech API, and Safari's implementation is only partial.
| Control | Where | What it does |
|---|---|---|
| Start / Stop Camera | Left panel, bottom | Toggle webcam capture |
| Hold ring disc | Sentence Builder | Fills as you hold — confirms letter at 100% |
| 🗑 Clear | Sentence Builder | Erase the current sentence |
| 📋 Copy | Sentence Builder | Copy sentence to clipboard |
| Speak & Send | Sentence Builder | Speak aloud + add to conversation |
| 🔊 (on hover) | Any message bubble | Re-read that message aloud |
| ☀ / 🌙 | Top-right nav | Toggle dark / light theme (persisted) |
The model recognises 29 classes:
A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z
space del nothing
J and Z involve motion (drawn in the air). The static-landmark model handles these with reduced accuracy — this is a known limitation of single-frame landmark classification.
cd backend
# Install dev dependencies (pytest + httpx)
pip install -r requirements-dev.txt
# Run all tests
pytest -v
Expected: 32 tests, all passing
tests/test_predictor.py::test_model_bundle_has_required_keys PASSED
tests/test_predictor.py::test_normalise_anchor_at_wrist PASSED
tests/test_predictor.py::test_predict_from_bytes_empty_buffer PASSED
...
tests/test_websocket.py::test_ws_connects PASSED
tests/test_websocket.py::test_ws_prediction_with_hand PASSED
...
========================= 32 passed in 4.31s =========================
cd frontend
npm run test:run # single run, all tests
npm run test # watch mode
npm run test:coverage # with v8 coverage report
Expected: 59 tests, all passing
✓ useWebSocket — connection lifecycle 6 tests
✓ useWebSocket — prediction parsing 2 tests
✓ useWebSocket — sendFrame 2 tests
✓ useTTS 8 tests
✓ useSTT 13 tests
✓ SentenceBuilder — initial render 4 tests
✓ SentenceBuilder — hold-to-confirm 6 tests
✓ SentenceBuilder — action buttons 2 tests
✓ TranscriptPanel — empty state 2 tests
✓ TranscriptPanel — message rendering 6 tests
✓ TranscriptPanel — speak button 1 test
✓ SpeechInput — supported browser 4 tests
✓ SpeechInput — unsupported browser 3 tests
Test Files 5 passed
Tests 59 passed
GET /health
Simple liveness check.
Response:
{ "status": "ok" }
WS /ws
Binary WebSocket endpoint for real-time sign prediction.
Client → Server
Raw JPEG bytes of the current webcam frame (binary WebSocket frame, no base64).
Server → Client
JSON text on every received frame:
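The exact payload is not reproduced in this section, so the following is only a hedged sketch of how a client might validate each message. The field names `letter` and `confidence` are assumptions based on the "predicted letter + confidence (0 – 1)" step in the sequence diagram. `PredictionMessage` and `parsePrediction` are hypothetical names, and how the server reports a frame with no detected hand is not covered.

```typescript
// Hypothetical message shape; the real schema may differ.
interface PredictionMessage {
  letter: string;     // assumed field name
  confidence: number; // assumed field name, expected in [0, 1]
}

// Parses one text frame from the server. Returns null for anything that
// is not valid JSON or does not match the assumed shape.
function parsePrediction(raw: string): PredictionMessage | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const { letter, confidence } = data as Record<string, unknown>;
  if (typeof letter !== "string" || typeof confidence !== "number") return null;
  if (confidence < 0 || confidence > 1) return null;
  return { letter, confidence };
}
```

Validating at the boundary keeps malformed frames from reaching UI state.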
Create frontend/.env.local:
VITE_WS_URL=wss://your-app.onrender.com/ws
frontend/src/components/SentenceBuilder.tsx:
const HOLD_MS = 1500 // ms to hold a sign before confirming
const MIN_CONFIDENCE = 0.45 // signs below this confidence are ignored
frontend/src/hooks/useCamera.ts:
const CAPTURE_FPS = 15 // polling rate (back-pressure limits actual throughput)
const JPEG_QUALITY = 0.65 // 0.0–1.0 (lower = smaller payload, faster)
| Symptom | Likely cause | Fix |
|---|---|---|
| Status chip stays CONNECTING | Backend not running | Run uvicorn main:app --port 8000 |
| Status chip shows ERROR | Wrong WebSocket URL | Check VITE_WS_URL in .env.local |
| Camera unavailable error | Permission denied | Allow camera in browser settings → refresh |
| No prediction overlay appears | Hand out of frame | Centre your hand, improve lighting |
| Confidence stuck below 40% | Background clutter | Use a plain background, move hand closer |
| Mic button does not work | Non-Chrome browser | Switch to Chrome or Edge |
| FileNotFoundError: asl_model.pkl | Model not trained | Run preprocess.py then train.py |
| Very high latency | Old code without back-pressure | Pull latest, restart backend + frontend |
- J and Z — motion-based letters; single-frame landmarks cannot capture the trajectory
- Single hand only — the model processes the first detected hand; two-hand signs are not supported
- Lighting sensitivity — very dark or overexposed frames reduce MediaPipe detection confidence
- Speech-to-Text browser support — Chrome and Edge only (Web Speech API)
- No user accounts — conversation history is session-only; refreshing clears it
- Phase 1 — Project scaffold and tooling
- Phase 2 — Dataset pipeline, landmark extraction, Random Forest training (99.2% accuracy)
- Phase 3 — FastAPI backend with binary WebSocket inference endpoint
- Phase 4 — React frontend: camera, WebSocket hook, core components
- Phase 5 — Two-way speech: TTS + STT with continuous mode fix
- Phase 6 — Neumorphic UI, dark/light themes, Framer Motion animations
- Phase 6.5 — 91 automated tests, performance tuning, latency fixes
- Phase 7 — Deployment: Docker, Render (backend), Vercel (frontend)
- Fork the repository
- Create a feature branch: git checkout -b feat/your-feature
- Make your changes with tests
- Verify: pytest -v and npm run test:run must both pass
- Open a pull request against dev
Apache-2.0 license — see LICENSE.