Code-Mixed Machine Translation with NLLB-200

Model: facebook/nllb-200-distilled-600M (Meta AI)
Dataset: LinCE — Linguistic Code-switching Evaluation Benchmark
Task: Translate Hinglish (Hindi-English) and Spanglish (Spanish-English) → English

Project Structure

codemixed_mt/
├── config.py           # All hyperparameters and language code mappings
├── data_pipeline.py    # LinCE loader, text cleaning, NLLB tokenization
├── trainer.py          # Model loading, metrics (BLEU/ChrF/COMET), Seq2SeqTrainer
├── inference.py        # Translation pipeline, ONNX export, interactive demo
├── main.py             # CLI entry point (train / evaluate / translate / demo)
├── notebook.ipynb      # End-to-end Jupyter walkthrough
└── requirements.txt    # Python dependencies

Architecture

Code-Mixed Input (Hinglish/Spanglish)
          │
          ▼
┌─────────────────────┐
│  Text Cleaner       │  Unicode normalization, URL/emoji removal,
│                     │  repeated-char normalization, Romanized Hindi norms
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  NLLB Tokenizer     │  src_lang=hin_Deva / spa_Latn
│  AutoTokenizer      │  Handles multilingual subword tokenization
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  NLLB-200 (600M)    │  Encoder-Decoder Transformer
│  Distilled Model    │  Fine-tuned on code-mixed parallel data
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│  Beam Search        │  forced_bos_token_id=eng_Latn
│  (num_beams=4)      │  Guides output to target language
└────────┬────────────┘
         │
         ▼
    English Translation
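
The Text Cleaner stage in the diagram can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in data_pipeline.py: the function name `clean_text`, the emoji ranges, and the choice to collapse repeated characters down to two are all assumptions, and the Romanized Hindi normalization step is not covered.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Common emoji and pictograph ranges (assumed, not exhaustive).
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF\uFE0F\u200D]"
)
# Three or more repeats of one character collapse to two ("sooooo" -> "soo").
REPEAT_RE = re.compile(r"(.)\1{2,}")


def clean_text(text: str) -> str:
    """Normalize Unicode, drop URLs and emoji, and squeeze repeated characters."""
    text = unicodedata.normalize("NFKC", text)
    text = URL_RE.sub(" ", text)
    text = EMOJI_RE.sub(" ", text)
    text = REPEAT_RE.sub(r"\1\1", text)
    return " ".join(text.split())  # collapse leftover whitespace


cleaned = clean_text("Yaar this movie is sooooo good 😍 https://example.com")
```

Collapsing to two characters rather than one keeps legitimate doubled letters such as the "oo" in "good" intact.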

NLLB Language Codes

Language	Script	NLLB Code
English	Latin	`eng_Latn`
Hindi	Devanagari	`hin_Deva`
Spanish	Latin	`spa_Latn`
French	Latin	`fra_Latn`
Arabic	Arabic	`arb_Arab`

Important: NLLB uses FLORES-200 codes: an ISO 639-3 language code plus an ISO 15924 script code, joined by an underscore (e.g. hin_Deva). Always set forced_bos_token_id to the ID of the target language token to steer generation.
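
As a sketch of how the language codes and forced_bos_token_id fit together with the Hugging Face transformers API: the helper names `resolve_lang_pair` and `translate` and the short code "es" are assumptions made for illustration, and the project's own inference.py may be organized differently. The transformers import is deferred so that the code mapping can be used on its own.

```python
LANG_CODES = {"en": "eng_Latn", "hi": "hin_Deva", "es": "spa_Latn"}


def resolve_lang_pair(lang_pair: str) -> tuple[str, str]:
    """Map a pair such as "hi-en" to NLLB (source, target) codes."""
    src, tgt = lang_pair.split("-")
    return LANG_CODES[src], LANG_CODES[tgt]


def translate(texts, lang_pair="hi-en", model_name="facebook/nllb-200-distilled-600M"):
    # Imported lazily so resolve_lang_pair works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    src_code, tgt_code = resolve_lang_pair(lang_pair)
    tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang=src_code)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    inputs = tokenizer(texts, return_tensors="pt", padding=True,
                       truncation=True, max_length=128)
    output_ids = model.generate(
        **inputs,
        # The target language token becomes the first generated token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_code),
        num_beams=4,
        max_new_tokens=128,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

Calling `translate(["Mujhe bahut hunger lag raha hai"])` downloads the base checkpoint on first use; pass a local fine-tuned path as `model_name` to use trained weights instead.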

Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Train (Hinglish → English)

python main.py --mode train --lang-pair hi-en --epochs 5 --batch-size 16

3. Train with Local CSV Data

python main.py \
  --mode train \
  --lang-pair hi-en \
  --train-csv data/train.csv \
  --val-csv data/val.csv \
  --test-csv data/test.csv

CSV format:

source,target
"Mujhe bahut hunger lag raha hai","I am feeling very hungry"

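Before training on a local file, it can help to check that it matches the format above. The following is a small standard-library sketch; `read_parallel_csv` is a hypothetical helper, not a function from this repository.

```python
import csv
import io


def read_parallel_csv(handle):
    """Return (source, target) pairs, skipping rows with an empty field."""
    reader = csv.DictReader(handle)
    missing = {"source", "target"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    pairs = []
    for row in reader:
        src = (row["source"] or "").strip()
        tgt = (row["target"] or "").strip()
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


# In-memory sample in the documented format; use open(path, newline="") for real files.
sample = io.StringIO(
    'source,target\n'
    '"Mujhe bahut hunger lag raha hai","I am feeling very hungry"\n'
)
pairs = read_parallel_csv(sample)
```
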
4. Translate a Single Sentence

python main.py \
  --mode translate \
  --model-path ./nllb_codemixed_output \
  --lang-pair hi-en \
  --text "Mujhe bahut zyada hunger lag raha hai"

5. Batch Translate a File

python main.py \
  --mode translate \
  --model-path ./nllb_codemixed_output \
  --lang-pair hi-en \
  --input-file inputs.txt \
  --output-file translations.txt
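
File translation usually means reading one sentence per line and sending the lines to the model in fixed-size batches. How main.py does this is not shown here, so the generator below is only an assumed sketch with a hypothetical name.

```python
def batched_lines(lines, batch_size=16):
    """Yield lists of up to batch_size stripped lines, preserving order."""
    batch = []
    for line in lines:
        batch.append(line.strip())
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch


batches = list(batched_lines(["a\n", "b\n", "c\n"], batch_size=2))
```

Blank lines are kept rather than skipped, so line N of the output file can still correspond to line N of the input file.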

6. Evaluate on Test Set

python main.py \
  --mode evaluate \
  --model-path ./nllb_codemixed_output \
  --lang-pair hi-en

7. Interactive Demo

python main.py --mode demo --model-path ./nllb_codemixed_output

Obtaining the LinCE Dataset

Option 1 — HuggingFace Hub:

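from datasets import load_dataset
# Check the config name against the dataset card first; the Hub copy of
# LinCE may not include the machine-translation splits.
dataset = load_dataset("lince", "mt_hineng")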

Option 2 — Kaggle:

pip install kaggle
kaggle datasets download -d <lince-dataset-slug>

Option 3 — Official Website:
Download from https://ritual.uh.edu/lince/ and convert to CSV format with source,target columns.

Training Configuration

Parameter	Default	Notes
Model	nllb-200-distilled-600M	~600M params
Max Input Length	128	Tokens
Max Target Length	128	Tokens
Batch Size	16	Per GPU
Gradient Accumulation	2	Effective batch = 32
Learning Rate	3e-5	AdamW
Warmup Ratio	5%	Of total steps
Epochs	5	With early stopping
FP16	True	GPU only
Scheduler	Linear decay	With warmup

GPU Memory Requirements:

FP16 training: ~12 GB VRAM (batch=16)
FP16 training: ~8 GB VRAM (batch=8, grad_accum=4)
CPU training: Supported but very slow
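
The arithmetic behind the effective batch size and the warmup length can be made explicit. The dataclass below is an illustrative sketch, not the contents of config.py; the field names are assumptions, the 8,000-example dataset size is a made-up figure, and the step count is an approximation of what the Trainer computes.

```python
import math
from dataclasses import dataclass


@dataclass
class TrainConfig:
    batch_size: int = 16        # per GPU
    grad_accum: int = 2
    epochs: int = 5
    warmup_ratio: float = 0.05
    num_gpus: int = 1

    @property
    def effective_batch(self) -> int:
        return self.batch_size * self.grad_accum * self.num_gpus

    def total_steps(self, num_examples: int) -> int:
        # One optimizer step per effective batch; a final partial batch still counts.
        return math.ceil(num_examples / self.effective_batch) * self.epochs

    def warmup_steps(self, num_examples: int) -> int:
        return math.ceil(self.warmup_ratio * self.total_steps(num_examples))


default = TrainConfig()
low_memory = TrainConfig(batch_size=8, grad_accum=4)
steps = default.total_steps(8000)     # hypothetical dataset size
warmup = default.warmup_steps(8000)
```

Both memory settings listed above give the same effective batch of 32, so the lower-memory option changes the VRAM footprint without changing the optimization schedule.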

Evaluation Metrics

Metric	Description	Range
SacreBLEU	BLEU with standardized tokenization for reproducible scores	0–100 (higher=better)
ChrF	Character n-gram F-score	0–100 (higher=better)
COMET	Learned neural metric (GPU recommended; slow on CPU)	Typically about 0–1 (higher=better)

Typical scores on code-mixed MT:

SacreBLEU: 20–40 (task is harder than clean MT)
ChrF: 40–60
COMET: 0.5–0.8
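
To make the character n-gram F-score concrete, here is a simplified sentence-level version written from the metric's definition. It is not sacreBLEU's implementation and will not reproduce its numbers exactly; the function names are hypothetical, and reported scores should come from the sacrebleu package.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    chars = "".join(text.split())  # drop whitespace before counting
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Average n-gram precision and recall over orders 1..max_n, then take F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # this order is longer than one of the strings
        overlap = sum((hyp & ref).values())  # clipped matching n-grams
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)
```

With beta=2, recall is weighted more heavily than precision. Working at the character level gives partial credit for near-miss spellings, which is helpful for noisy romanized text.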

Advanced: Memory Optimization

Gradient Checkpointing (saves ~30% VRAM)

python main.py --mode train --gradient-checkpointing
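
What the flag does internally is not shown here. With a Hugging Face model, a plausible sketch is the helper below; the function name is an assumption, while `gradient_checkpointing_enable()` and `config.use_cache` are standard transformers attributes.

```python
def enable_gradient_checkpointing(model):
    """Trade extra compute for lower activation memory on a PreTrainedModel."""
    model.gradient_checkpointing_enable()
    # The decoder key/value cache is not usable with checkpointing during training.
    model.config.use_cache = False
    return model
```

Setting `gradient_checkpointing=True` in `Seq2SeqTrainingArguments` is an alternative when training through the Trainer API.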

8-bit Quantization (inference only)

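# Assumes the class is defined in inference.py, alongside export_to_onnx.
from inference import CodeMixedTranslator

translator = CodeMixedTranslator(
    model_path="./nllb_codemixed_output",
    use_quantization=True,  # Requires bitsandbytes
)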

ONNX Export (optimized inference)

from inference import export_to_onnx
export_to_onnx(
    model_path="./nllb_codemixed_output",
    output_dir="./onnx_model",
)
# Requires: pip install optimum[onnxruntime]

Key Design Decisions

forced_bos_token_id: Critical for NLLB — forces the decoder's first token to be the target language ID, steering generation to the correct language.
Label padding with -100: PyTorch's cross-entropy loss skips targets equal to its ignore_index, which defaults to -100, so padded label positions do not contribute to the loss.
group_by_length=True: Groups similar-length sequences in batches, reducing padding and improving efficiency.
EarlyStoppingCallback: Stops training if BLEU doesn't improve for 3 evaluations, preventing overfitting.
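Code-mixed src_lang: We use the dominant language's code (e.g., hin_Deva for Hinglish). This is a simplification: Hinglish is often written in Latin script, which the Devanagari tag does not reflect, and true code-mixed handling would require token-level language identification (LID).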
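
Two of the decisions above, label masking and patience-based early stopping, are easy to illustrate in plain Python. The names below are hypothetical and the pad ID of 1 is just an example value; in the project these roles are filled by the tokenization code and by transformers' EarlyStoppingCallback.

```python
def mask_pad_labels(label_ids, pad_token_id, ignore_index=-100):
    """Replace padding IDs so the loss skips those positions."""
    return [[ignore_index if tok == pad_token_id else tok for tok in seq]
            for seq in label_ids]


class PatienceTracker:
    """Signal a stop after `patience` evaluations without a new best score."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("-inf")
        self.bad_evals = 0

    def should_stop(self, score):
        if score > self.best:
            self.best, self.bad_evals = score, 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


masked = mask_pad_labels([[5, 6, 1, 1]], pad_token_id=1)
tracker = PatienceTracker(patience=3)
decisions = [tracker.should_stop(bleu) for bleu in [20.0, 22.0, 21.0, 22.0, 21.5]]
```

In the example scores, the best BLEU of 22.0 is reached at the second evaluation. The three evaluations after it do not beat it, so the tracker signals a stop on the fifth.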

Citation

@inproceedings{aguilar2020lince,
  title={LinCE: A Centralized Benchmark for Linguistic Code-switching Evaluation},
  author={Aguilar, Gustavo and others},
  booktitle={LREC},
  year={2020}
}

@article{nllb2022,
  title={No Language Left Behind: Scaling Human-Centered Machine Translation},
  author={{NLLB Team, Meta AI}},
  journal={arXiv preprint arXiv:2207.04672},
  year={2022}
}
