A complete machine translation pipeline comparing LSTM Seq2Seq vs Transformer (MarianMT) on the OPUS-100 French-English dataset.
| Details | |
|---|---|
| Task | Machine Translation (FR → EN) |
| Dataset | OPUS-100 (fr-en) — HuggingFace |
| Model 1 | LSTM Seq2Seq (trained from scratch) |
| Model 2 | Transformer (Helsinki-NLP/opus-mt-fr-en, fine-tuned) |

| Metric | LSTM | Transformer |
|---|---|---|
| BLEU Score | 0.23 | 39.38 |
| Test Loss | 5.97 | 1.54 |
| Best Epoch | 6/10 | 3/3 |
| Overfitting | Yes | No |

    NLP-Machine-Translation/
    ├── Notebook1_EDA_Preprocessing.ipynb
    ├── Notebook2_LSTM.ipynb
    ├── Notebook3_Transformer.ipynb
    ├── Notebook4_Comparison_Demo.ipynb
    ├── requirements.txt
    └── README.md

1. Clone the repository: `git clone https://github.com/medattia/NLP-Machine-Translation.git`, then `cd NLP-Machine-Translation`
2. Install dependencies: `pip install -r requirements.txt`
3. Run the notebooks in order: Notebook 1 → Notebook 2 → Notebook 3 → Notebook 4

Note: The notebooks were built and trained on Google Colab with a T4 GPU. A Google Drive mount is required to save and load data and models between notebooks.

LSTM Seq2Seq:

- 2-layer Encoder LSTM + 2-layer Decoder LSTM
- Embedding dim: 256 | Hidden dim: 512
- 27.8M trainable parameters
- Word-level tokenization (20k vocabulary)
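
As a rough sanity check on the parameter count, the figures above can be combined by hand. The sketch below is not taken from the notebooks. It assumes separate source and target embeddings of exactly 20,000 entries each, unidirectional LSTMs with PyTorch-style parameters (two bias vectors per layer), and a single linear projection from the decoder hidden state to the target vocabulary. The helper `lstm_params` is a hypothetical name introduced for illustration.

```python
def lstm_params(input_dim, hidden_dim, num_layers):
    """Count the weights of a stacked unidirectional LSTM (PyTorch-style layout, assumed)."""
    total = 0
    for layer in range(num_layers):
        # Only the first layer sees the embedding; later layers see the hidden state.
        in_dim = input_dim if layer == 0 else hidden_dim
        # 4 gates, each with input weights, recurrent weights, and two bias vectors.
        total += 4 * (in_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)
    return total


VOCAB = 20_000  # assumed identical for French and English
EMB = 256
HID = 512
LAYERS = 2

embeddings = 2 * VOCAB * EMB              # source + target embedding tables
encoder = lstm_params(EMB, HID, LAYERS)
decoder = lstm_params(EMB, HID, LAYERS)
projection = HID * VOCAB + VOCAB          # output layer weights + bias

total = embeddings + encoder + decoder + projection
print(f"{total:,}")  # 27,856,416
```

Under these assumptions the total comes out near the reported 27.8M. The small difference may come from special tokens, a slightly different vocabulary size, or a different layer layout in the actual notebook.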

Transformer (MarianMT):

- Pretrained Helsinki-NLP/opus-mt-fr-en
- Fine-tuned for 3 epochs on OPUS-100 subset
- 74M parameters
- SentencePiece BPE tokenization (59k subwords)

Dataset:

- Source: OPUS-100 (fr-en) via HuggingFace
- Training pairs: 75,218 (after filtering)
- Validation pairs: 1,650
- Test pairs: 1,638
- Max sentence length: 50 words
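
The length filter could look roughly like the sketch below. It is an illustration, not the code from Notebook 1. It assumes words are counted by splitting on whitespace, that the 50-word limit applies to both sides of a pair, and that empty sentences are dropped. The names `within_length` and `filter_pairs` are hypothetical.

```python
def within_length(sentence, max_words=50):
    """True if the sentence is non-empty and has at most max_words whitespace-separated words."""
    n_words = len(sentence.split())
    return 0 < n_words <= max_words


def filter_pairs(pairs, max_words=50):
    """Keep only (french, english) pairs where both sides pass the length check."""
    return [
        (fr, en)
        for fr, en in pairs
        if within_length(fr, max_words) and within_length(en, max_words)
    ]


pairs = [
    ("Bonjour le monde", "Hello world"),  # kept
    ("mot " * 60, "word " * 60),          # dropped: 60 words on each side
    ("", "Empty source"),                 # dropped: empty French side
]
kept = filter_pairs(pairs)
print(len(kept))  # 1
```

Filtering on both sides keeps source and target lengths bounded, which matters for the word-level LSTM where every sequence in a batch is padded to the longest one.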
The project includes an interactive Gradio demo (Notebook 4) where you can type any French sentence and see both models translate it simultaneously.
Author: Muhammed Attia (GitHub: https://github.com/medattia)