MithraKL/Code_mixed

Code-Mixed Machine Translation with NLLB-200

Model: facebook/nllb-200-distilled-600M (Meta AI)
Dataset: LinCE — Linguistic Code-switching Evaluation Benchmark
Task: Translate Hinglish (Hindi-English) and Spanglish (Spanish-English) → English


Project Structure

codemixed_mt/
├── config.py           # All hyperparameters and language code mappings
├── data_pipeline.py    # LinCE loader, text cleaning, NLLB tokenization
├── trainer.py          # Model loading, metrics (BLEU/ChrF/COMET), Seq2SeqTrainer
├── inference.py        # Translation pipeline, ONNX export, interactive demo
├── main.py             # CLI entry point (train / evaluate / translate / demo)
├── notebook.ipynb      # End-to-end Jupyter walkthrough
└── requirements.txt    # Python dependencies

Architecture

Code-Mixed Input (Hinglish/Spanglish)
          │
          ▼
┌─────────────────────┐
│  Text Cleaner       │  Unicode normalization, URL/emoji removal,
│                     │  repeated-char normalization, Romanized Hindi norms
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  NLLB Tokenizer     │  src_lang=hin_Deva / spa_Latn
│  AutoTokenizer      │  Handles multilingual subword tokenization
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  NLLB-200 (600M)    │  Encoder-Decoder Transformer
│  Distilled Model    │  Fine-tuned on code-mixed parallel data
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Beam Search        │  forced_bos_token_id=eng_Latn
│  (num_beams=4)      │  Guides output to target language
└────────┬────────────┘
         │
         ▼
    English Translation
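
The Text Cleaner stage above could look roughly like the following. This is a minimal sketch, not the code in data_pipeline.py: the function name clean_text, the regular expressions, and the choice of NFKC are assumptions, and the Romanized Hindi normalization rules are project-specific and left out.

```python
import re
import unicodedata

# Assumed patterns; the project's own cleaner may use different ones.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji coverage: pictograph blocks, misc symbols, flags, joiners.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF\uFE0F\u200D]"
)
# Three or more repeats of the same letter collapse to two ("sooooo" -> "soo").
# Digits are excluded so that numbers such as 10000 are left alone.
REPEAT_RE = re.compile(r"([^\W\d_])\1{2,}")


def clean_text(text: str) -> str:
    """Normalize Unicode, drop URLs and emoji, and squeeze repeated letters."""
    text = unicodedata.normalize("NFKC", text)
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = REPEAT_RE.sub(r"\1\1", text)
    return " ".join(text.split())  # collapse runs of whitespace
```

Collapsing repeats to two letters rather than one keeps legitimate double letters ("too", "accha") intact while still shortening elongated spellings.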

NLLB Language Codes

| Language | Script | NLLB Code |
|----------|------------|-----------|
| English | Latin | eng_Latn |
| Hindi | Devanagari | hin_Deva |
| Spanish | Latin | spa_Latn |
| French | Latin | fra_Latn |
| Arabic | Arabic | arb_Arab |

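Important: NLLB uses FLORES-200 language codes: an ISO 639-3 language code followed by an ISO 15924 script code (for example, hin_Deva). Always set forced_bos_token_id to the ID of the target-language token so that generation is steered toward that language.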
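
As an illustration of how these codes are typically used with the Hugging Face transformers API, here is a hedged sketch. The LANG_PAIRS mapping, the "es-en" key, and the translate function are hypothetical names for this example; the project's config.py and inference.py may be organized differently.

```python
# Hypothetical mapping from --lang-pair values to (source, target) NLLB codes.
LANG_PAIRS = {
    "hi-en": ("hin_Deva", "eng_Latn"),
    "es-en": ("spa_Latn", "eng_Latn"),  # assumed key for Spanglish
}


def translate(texts, lang_pair="hi-en",
              model_name="facebook/nllb-200-distilled-600M"):
    """Translate a list of strings; downloads the model on first use."""
    # Imported lazily so the mapping above is usable without transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    src_lang, tgt_lang = LANG_PAIRS[lang_pair]
    tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang=src_lang)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    inputs = tokenizer(texts, return_tensors="pt", padding=True,
                       truncation=True, max_length=128)
    output_ids = model.generate(
        **inputs,
        # The first decoded token is forced to be the target-language tag.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        num_beams=4,
        max_new_tokens=128,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

Calling translate(["Mujhe bahut hunger lag raha hai"]) with the base checkpoint shows the untuned behaviour; passing a fine-tuned checkpoint directory as model_name should work the same way, assuming it was saved with save_pretrained.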


Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Train (Hinglish → English)

python main.py --mode train --lang-pair hi-en --epochs 5 --batch-size 16

3. Train with Local CSV Data

python main.py \
  --mode train \
  --lang-pair hi-en \
  --train-csv data/train.csv \
  --val-csv data/val.csv \
  --test-csv data/test.csv

CSV format:

source,target
"Mujhe bahut hunger lag raha hai","I am feeling very hungry"

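Before training on a local file, it can help to check that it matches this format. The reader below is a small sketch using only the standard library; read_pairs is a hypothetical helper, not part of data_pipeline.py.

```python
import csv
import io


def read_pairs(handle):
    """Yield (source, target) tuples from a CSV with a source,target header."""
    reader = csv.DictReader(handle)
    missing = {"source", "target"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"CSV is missing columns: {sorted(missing)}")
    for row in reader:
        source = (row["source"] or "").strip()
        target = (row["target"] or "").strip()
        if source and target:  # skip rows where either side is empty
            yield source, target


# In-memory example; for a real file use
# open("data/train.csv", newline="", encoding="utf-8").
sample = io.StringIO(
    "source,target\n"
    '"Mujhe bahut hunger lag raha hai","I am feeling very hungry"\n'
)
pairs = list(read_pairs(sample))
```

Quoting both fields, as in the example row, keeps commas inside a sentence from being read as column separators.
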
4. Translate a Single Sentence

python main.py \
  --mode translate \
  --model-path ./nllb_codemixed_output \
  --lang-pair hi-en \
  --text "Mujhe bahut zyada hunger lag raha hai"

5. Batch Translate a File

python main.py \
  --mode translate \
  --model-path ./nllb_codemixed_output \
  --lang-pair hi-en \
  --input-file inputs.txt \
  --output-file translations.txt
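
File translation of this kind usually reads one sentence per line and sends the lines to the model in fixed-size batches. The sketch below shows that pattern; read_lines, batched, and translate_file are hypothetical helpers, and translate_fn stands in for whatever callable maps a list of source strings to a list of translations.

```python
def read_lines(path):
    """Return the non-empty lines of a UTF-8 text file, stripped."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_file(input_path, output_path, translate_fn, batch_size=16):
    """Write one translation per non-empty input line, in input order."""
    lines = read_lines(input_path)
    with open(output_path, "w", encoding="utf-8") as out:
        for batch in batched(lines, batch_size):
            for translation in translate_fn(batch):
                out.write(translation + "\n")
```

Because blank input lines are dropped, line N of the output corresponds to the Nth non-empty line of the input.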

6. Evaluate on Test Set

python main.py \
  --mode evaluate \
  --model-path ./nllb_codemixed_output \
  --lang-pair hi-en

7. Interactive Demo

python main.py --mode demo --model-path ./nllb_codemixed_output

Obtaining the LinCE Dataset

Option 1 — HuggingFace Hub:

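from datasets import load_dataset

# "mt_hineng" is the config name used by this project. If the LinCE dataset
# on the Hub does not expose it in your environment, use Option 3 instead.
dataset = load_dataset("lince", "mt_hineng")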

Option 2 — Kaggle:

pip install kaggle
kaggle datasets download -d <lince-dataset-slug>

Option 3 — Official Website:
Download from https://ritual.uh.edu/lince/ and convert to CSV format with source,target columns.


Training Configuration

| Parameter | Default | Notes |
|-----------|---------|-------|
| Model | nllb-200-distilled-600M | ~600M params |
| Max Input Length | 128 | Tokens |
| Max Target Length | 128 | Tokens |
| Batch Size | 16 | Per GPU |
| Gradient Accumulation | 2 | Effective batch = 32 |
| Learning Rate | 3e-5 | AdamW |
| Warmup Ratio | 5% | Of total steps |
| Epochs | 5 | With early stopping |
| FP16 | True | GPU only |
| Scheduler | Linear decay | With warmup |

GPU Memory Requirements:

  • FP16 training: ~12 GB VRAM (batch=16)
  • FP16 training: ~8 GB VRAM (batch=8, grad_accum=4)
  • CPU training: Supported but very slow
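
Both FP16 configurations above keep the same effective batch size, which is why the second one can trade memory for extra accumulation steps without changing the optimization schedule much. The arithmetic can be sketched as follows, assuming a single GPU; TrainConfig is a hypothetical dataclass (not the one in config.py), and the dataset size of 8000 is made up for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class TrainConfig:
    batch_size: int = 16        # per-GPU batch size
    grad_accum: int = 2         # gradient accumulation steps
    epochs: int = 5
    warmup_ratio: float = 0.05  # 5% of total optimizer steps

    @property
    def effective_batch(self) -> int:
        # Examples consumed per optimizer step on one GPU.
        return self.batch_size * self.grad_accum

    def total_steps(self, num_examples: int) -> int:
        # Approximate optimizer steps over the whole run.
        return math.ceil(num_examples / self.effective_batch) * self.epochs

    def warmup_steps(self, num_examples: int) -> int:
        return math.ceil(self.total_steps(num_examples) * self.warmup_ratio)


default = TrainConfig()
low_memory = TrainConfig(batch_size=8, grad_accum=4)
print(default.effective_batch, low_memory.effective_batch)
print(default.total_steps(8000), default.warmup_steps(8000))
```

The step counts are approximate: the trainer's own bookkeeping can differ by a step or so per epoch, and early stopping may end the run before all epochs complete.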

Evaluation Metrics

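| Metric | Description | Range |
|--------|-------------|-------|
| SacreBLEU | BLEU with standardized tokenization, so scores are comparable across setups | 0–100 (higher is better) |
| ChrF | Character n-gram F-score | 0–100 (higher is better) |
| COMET | Learned neural metric (GPU recommended) | Roughly 0–1 for recent models (higher is better) |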

Typical score ranges on code-mixed MT (indicative only; actual results depend on the data and the training run):

  • SacreBLEU: 20–40 (task is harder than clean MT)
  • ChrF: 40–60
  • COMET: 0.5–0.8
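
To make the ChrF row concrete, here is a simplified, sentence-level character n-gram F-score in plain Python. It is an illustrative sketch only: simple_chrf is a hypothetical function, it is not the sacreBLEU implementation, and reported numbers should come from the project's metric code rather than from this.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Average n-gram precision and recall over orders 1..max_n, then F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one string is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped match counts
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

With beta = 2, recall counts more than precision, so a translation that drops content is penalized more than one that adds a few extra characters.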

Advanced: Memory Optimization

Gradient Checkpointing (saves ~30% VRAM at the cost of slower training steps, because activations are recomputed during the backward pass)

python main.py --mode train --gradient-checkpointing

8-bit Quantization (inference only)

# Assumes CodeMixedTranslator is defined in inference.py
from inference import CodeMixedTranslator

translator = CodeMixedTranslator(
    model_path="./nllb_codemixed_output",
    use_quantization=True,  # Requires bitsandbytes
)

ONNX Export (optimized inference)

from inference import export_to_onnx
export_to_onnx(
    model_path="./nllb_codemixed_output",
    output_dir="./onnx_model",
)
# Requires: pip install optimum[onnxruntime]

Key Design Decisions

  1. forced_bos_token_id: Critical for NLLB — forces the decoder's first token to be the target language ID, steering generation to the correct language.

  2. Label padding with -100: PyTorch's cross-entropy loss uses ignore_index=-100 by default, so label positions set to -100 (the padding) do not contribute to the loss.

  3. group_by_length=True: Groups similar-length sequences in batches, reducing padding and improving efficiency.

  4. EarlyStoppingCallback: Stops training if BLEU doesn't improve for 3 evaluations, preventing overfitting.

  5. Code-mixed src_lang: We use the dominant language's code (e.g., hin_Deva for Hinglish). This is a simplification: handling code-mixed input properly would require token-level language identification (LID).
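
Design decision 2 can be illustrated with a few lines of plain Python. This is a sketch with a hypothetical helper (mask_padding) and made-up token IDs; in practice the same effect is commonly obtained through the data collator's label padding value rather than by hand.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding(label_ids, pad_token_id):
    """Replace padding token IDs in each label sequence with IGNORE_INDEX."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]


# Made-up IDs: three content tokens, an end-of-sequence token (2),
# then two padding tokens (1).
labels = mask_padding([[256047, 59, 259, 2, 1, 1]], pad_token_id=1)
print(labels)
```

Only the padding positions are rewritten; the end-of-sequence token stays in the labels so the model still learns where to stop.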


Citation

@inproceedings{aguilar2020lince,
  title={LinCE: A Centralized Benchmark for Linguistic Code-switching Evaluation},
  author={Aguilar, Gustavo and others},
  booktitle={LREC},
  year={2020}
}

@article{nllb2022,
  title={No Language Left Behind: Scaling Human-Centered Machine Translation},
  author={{NLLB Team}},
  journal={arXiv preprint arXiv:2207.04672},
  year={2022}
}

About

Fine-tuning NLLB-200 for Hinglish & Spanglish → English translation using the LinCE benchmark — with training, BLEU/ChrF/COMET evaluation, batch inference, and ONNX export.
