
medattia/NLP-Machine-Translation


NLP Final Project — French to English Machine Translation

A complete machine translation pipeline that compares an LSTM Seq2Seq model trained from scratch with a fine-tuned Transformer (MarianMT) on the OPUS-100 French-English dataset.


Project Overview

| | Details |
| --- | --- |
| Task | Machine Translation (FR → EN) |
| Dataset | OPUS-100 (fr-en), via HuggingFace |
| Model 1 | LSTM Seq2Seq (trained from scratch) |
| Model 2 | Transformer: Helsinki-NLP/opus-mt-fr-en (fine-tuned) |

Results

| Metric | LSTM | Transformer |
| --- | --- | --- |
| BLEU Score | 0.23 | 39.38 |
| Test Loss | 5.97 | 1.54 |
| Best Epoch | 6/10 | 3/3 |
| Overfitting | Yes | No |
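
For readers unfamiliar with the BLEU metric in the table, here is a minimal pure-Python sketch of corpus-level BLEU. It is an illustration only: it assumes a single reference per sentence, whitespace tokenization, no smoothing, and a 0 to 100 scale. The README does not say how the reported scores were computed, so this should not be read as the notebooks' method, and library implementations will generally give different numbers.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU (0-100): one reference per hypothesis, no smoothing."""
    matches = [0] * max_n  # clipped n-gram matches, per order
    totals = [0] * max_n   # n-grams produced by the system, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp_tokens, n)
            ref_counts = ngram_counts(ref_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, a zero precision at any order gives BLEU = 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalize output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)
```

A hypothesis identical to its reference scores 100, and a hypothesis that shares no 4-gram with its reference scores 0 under this unsmoothed variant.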

Repository Structure

NLP-Machine-Translation/
├── Notebook1_EDA_Preprocessing.ipynb
├── Notebook2_LSTM.ipynb
├── Notebook3_Transformer.ipynb
├── Notebook4_Comparison_Demo.ipynb
├── requirements.txt
└── README.md


How to Run

  1. Clone the repository:

         git clone https://github.com/medattia/NLP-Machine-Translation.git
         cd NLP-Machine-Translation

  2. Install dependencies:

         pip install -r requirements.txt

  3. Run the notebooks in order: Notebook 1 → Notebook 2 → Notebook 3 → Notebook 4

Note: The notebooks were built and trained on Google Colab with a T4 GPU. A Google Drive mount is required to save and load data and models between notebooks.
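
The Drive mount mentioned in the note could look roughly like the sketch below. The helper name `get_project_dir`, the folder name, and the local fallback are assumptions for illustration; the notebooks may organize their paths differently.

```python
from pathlib import Path


def get_project_dir(folder="NLP-Machine-Translation", local_base="."):
    """Return a directory for shared data and models.

    On Colab this mounts Google Drive; elsewhere it falls back to a local
    folder. `folder` and `local_base` are hypothetical names, not taken
    from the notebooks.
    """
    try:
        from google.colab import drive  # only importable inside Colab
        drive.mount("/content/drive")
        base = Path("/content/drive/MyDrive")
    except ImportError:
        base = Path(local_base)

    project_dir = base / folder
    project_dir.mkdir(parents=True, exist_ok=True)
    return project_dir
```

Calling the same helper at the top of each notebook would give all four a common location for preprocessed data and checkpoints.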


Architecture

LSTM Seq2Seq

  • 2-layer Encoder LSTM + 2-layer Decoder LSTM
  • Embedding dim: 256 | Hidden dim: 512
  • 27.8M trainable parameters
  • Word-level tokenization (20k vocabulary)
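
As a sanity check on the parameter count, the arithmetic can be sketched as follows. This assumes separate source and target embeddings with exactly 20,000 entries each, unidirectional LSTMs with PyTorch-style weights (two bias vectors per layer), no attention, and a single linear output layer over the target vocabulary. None of these details are confirmed by the README.

```python
def lstm_layer_params(input_dim, hidden_dim):
    """Parameters of one LSTM layer: 4 gates, each with input weights,
    recurrent weights, and two bias vectors (PyTorch convention)."""
    return 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)


def seq2seq_params(vocab_size=20_000, emb_dim=256, hidden_dim=512, num_layers=2):
    """Total parameters of an assumed encoder-decoder LSTM without attention."""
    embedding = vocab_size * emb_dim
    # The first layer reads embeddings; deeper layers read hidden states.
    lstm_stack = lstm_layer_params(emb_dim, hidden_dim) + (
        num_layers - 1
    ) * lstm_layer_params(hidden_dim, hidden_dim)
    output_layer = hidden_dim * vocab_size + vocab_size  # weights + bias
    # Encoder and decoder each have their own embedding and LSTM stack.
    return 2 * embedding + 2 * lstm_stack + output_layer


total = seq2seq_params()
print(f"{total:,} parameters (~{total / 1e6:.2f}M)")
```

Under these assumptions the total is 27,856,416, which is in line with the 27.8M figure above. Most of the budget sits in the two embedding tables and the output projection, not in the recurrent layers.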

Transformer (MarianMT)

  • Pretrained Helsinki-NLP/opus-mt-fr-en
  • Fine-tuned for 3 epochs on OPUS-100 subset
  • 74M parameters
  • SentencePiece BPE tokenization (59k subwords)
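
A minimal sketch of running the pretrained checkpoint with the Hugging Face `transformers` library is shown below. It loads the public `Helsinki-NLP/opus-mt-fr-en` weights, not the fine-tuned weights from Notebook 3, and assumes `transformers`, `torch`, and `sentencepiece` are installed. The `translate` helper and its `max_length` default are illustrative choices.

```python
MODEL_NAME = "Helsinki-NLP/opus-mt-fr-en"


def translate(sentences, model_name=MODEL_NAME, max_length=64):
    """Translate a list of French sentences to English with MarianMT."""
    # Imported lazily so this file can be loaded without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)

    # Tokenize into padded subword ID tensors.
    batch = tokenizer(
        sentences,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_length,
    )
    generated = model.generate(**batch, max_new_tokens=max_length)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

For example, `translate(["Bonjour, comment allez-vous ?"])` returns a one-element list with the English translation. The first call downloads the checkpoint, so it needs network access.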

Dataset

  • Source: OPUS-100 (fr-en) via HuggingFace
  • Training pairs: 75,218 (after filtering)
  • Validation pairs: 1,650
  • Test pairs: 1,638
  • Max sentence length: 50 words
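
The length filter could be sketched as below. The exact rule is an assumption: here a pair is kept only if both sides are non-empty and at most 50 whitespace-separated words. The record layout (`{"translation": {"fr": ..., "en": ...}}`) follows the OPUS-100 format on HuggingFace; the sample sentences are made up for illustration.

```python
MAX_WORDS = 50


def keep_pair(example, max_words=MAX_WORDS):
    """Keep a pair only if both sides are non-empty and within the word limit."""
    pair = example["translation"]
    fr_len = len(pair["fr"].split())
    en_len = len(pair["en"].split())
    return 0 < fr_len <= max_words and 0 < en_len <= max_words


# Made-up records in the OPUS-100 layout.
examples = [
    {"translation": {"fr": "Bonjour le monde", "en": "Hello world"}},
    {"translation": {"fr": "mot " * 51, "en": "word"}},  # French side too long
    {"translation": {"fr": "Merci", "en": ""}},          # empty English side
]

filtered = [ex for ex in examples if keep_pair(ex)]
print(len(filtered))  # 1
```

With the `datasets` library, the same predicate can be passed to `Dataset.filter`.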

Demo

The project includes an interactive Gradio demo (Notebook 4) where you can type any French sentence and see both models translate it simultaneously.
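
A side-by-side demo of this kind could be wired up roughly as follows. This is a sketch, not the code in Notebook 4: `lstm_translate` and `transformer_translate` are hypothetical callables that take a French string and return an English string, and `gradio` is assumed to be installed.

```python
def compare_translations(french_text, lstm_translate, transformer_translate):
    """Run both translators on the same input and return their outputs."""
    text = french_text.strip()
    if not text:
        return "", ""
    return lstm_translate(text), transformer_translate(text)


def build_demo(lstm_translate, transformer_translate):
    """Build a Gradio interface with one input box and two output boxes."""
    import gradio as gr  # imported lazily; only needed when the demo is built

    return gr.Interface(
        fn=lambda text: compare_translations(
            text, lstm_translate, transformer_translate
        ),
        inputs=gr.Textbox(label="French sentence"),
        outputs=[
            gr.Textbox(label="LSTM Seq2Seq"),
            gr.Textbox(label="Transformer (MarianMT)"),
        ],
        title="French to English translation",
    )
```

Calling `build_demo(lstm_translate, transformer_translate).launch()` starts the interface. Keeping the comparison logic in a plain function makes it testable without Gradio.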


Author

Muhammed Attia
GitHub: https://github.com/medattia
