Candle is a Transformer-based model that corrects character elongation (tamdeed) in Arabic text. It takes noisy, user-generated input where characters are repeated (e.g. الممملكككة) and restores it to the correct form (e.g. المملكة). This is a common phenomenon in informal Arabic text on social media and in other user-generated content.
The model uses a CTC (Connectionist Temporal Classification) objective and is trained on Arabic newspaper text. A lighter student model is also provided, trained via knowledge distillation from the full teacher model.
You can install the Candle models as a package using pip:
pip install candle-deduplicator
Then import the classes and use them directly. Here is an example:
from candle_deduplicator import CandleDeduplicator, CandleDeduplicatorDistilled
full_model = CandleDeduplicator()
distilled_model = CandleDeduplicatorDistilled()
print(full_model.deduplicate(' الممملكككة الأررررردنييية ررراااننيييااا'))
# الملكة الأردنية رانيا
print(distilled_model.deduplicate('الممملكككة الأررررردنييية الللهههااااششميية'))
# المملكة الأردنية الهاشمية
Input text is first normalized by rewriting every character run, including single characters, to exactly 2 repetitions (e.g. الممملكككة → االلممللككةة). This normalized form is fed into a Transformer encoder, and the CTC output is decoded greedily to produce the corrected text.
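The two steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the package's code: the function names `normalize_runs` and `ctc_greedy_decode`, the blank id `0`, and the choice to leave spaces as single separators are all hypothetical.

```python
from itertools import groupby


def normalize_runs(text, run_len=2):
    """Rewrite every run of identical characters to exactly `run_len` copies.

    Assumption: spaces stay single separators; the real pipeline may
    treat whitespace differently.
    """
    words = text.split(" ")
    return " ".join(
        "".join(ch * run_len for ch, _ in groupby(word)) for word in words
    )


def ctc_greedy_decode(frame_ids, blank=0):
    """Standard greedy CTC decoding: merge repeated ids, then drop blanks.

    `frame_ids` is the per-frame argmax over the output classes.
    Assumption: id 0 is the CTC blank.
    """
    decoded = []
    previous = None
    for idx in frame_ids:
        if idx != previous and idx != blank:
            decoded.append(idx)
        previous = idx
    return decoded


print(normalize_runs("الممملكككة"))  # االلممللككةة
print(ctc_greedy_decode([0, 1, 1, 0, 1, 2, 2, 0]))  # [1, 1, 2]
```

Note that in greedy CTC decoding a repeated output character survives only when a blank separates its two frames, as in the `1, 0, 1` portion of the example.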
A word-level guard ensures that words with no elongation in the original input are never modified, preventing the model from accidentally altering already-clean words that appear alongside elongated ones in the same sequence.
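A guard of this kind could look like the sketch below. It rests on two assumptions that the text above does not confirm: a word counts as "elongated" when any character repeats back to back, and the prediction has the same number of words as the input. The name `apply_word_guard` is hypothetical.

```python
import re

# Assumption: a word is treated as elongated if any character repeats
# back to back. The actual criterion used by Candle may differ.
_REPEATED = re.compile(r"(.)\1")


def apply_word_guard(original, predicted):
    """Keep the original word wherever the input word had no repeated run."""
    orig_words = original.split()
    pred_words = predicted.split()
    if len(orig_words) != len(pred_words):
        # Words cannot be aligned one-to-one; return the prediction as-is.
        return predicted
    guarded = [
        pred if _REPEATED.search(orig) else orig
        for orig, pred in zip(orig_words, pred_words)
    ]
    return " ".join(guarded)


# "cat" has no repeated run, so the (wrong) prediction "cot" is discarded.
print(apply_word_guard("cat dooog", "cot dog"))  # cat dog
```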
candle/
├── candle_tokenizer.py # Character-level tokenizer
├── candle_dataset.py # Dataset and DataLoader with overlapping chunking
├── candle_pl.py # Teacher model (CandleModel) — PyTorch Lightning
├── candle_pl_distill.py # Student model (CandleModelDistill) with knowledge distillation
├── transformer.py # Transformer Encoder implementation (with Flash Attention support)
├── xer.py # WER / CER metric utilities
├── train.py # Training script for the teacher model
├── train_distill.py # Training script for the student model
├── predict_cli.py # Inference script (supports both teacher and distilled models)
├── compute_xer_cli.py # CLI tool for computing WER / CER / SER across benchmark files
├── compute_fertility.py # Tokenizer fertility analysis across multiple models and tokenizers
├── export_to_onnx.py # ONNX exporter
├── run_predict_cli.sh # Example shell script for running inference
└── run_eval.sh # Example shell script for running evaluation
- Python 3.8+
- PyTorch
- PyTorch Lightning
- tqdm
- kaldialign (for evaluation metrics)
pip install torch pytorch-lightning tqdm
pip install git+https://github.com/pzelasko/kaldialign.git
To train the teacher model, run:
python train.py
The model is initialized from a pretrained CharBERT checkpoint (CATT) and fine-tuned using a phased unfreezing strategy:
- Phase 1: Only the last two encoder layers and the linear head are trainable. Checkpoints are saved to models/candle_model_phase_1/.
- Phase 2: Unfreeze one additional layer. Initialize from the best Phase 1 checkpoint. Checkpoints are saved to models/candle_model_phase_2/.
- Phase 3: Unfreeze all layers. Initialize from the best Phase 2 checkpoint. Checkpoints are saved to models/candle_model_phase_3/.
To switch phases, comment/uncomment the relevant freeze/unfreeze calls and dirpath in train.py.
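As a quick reference, the helper below lists which encoder layers would be trainable in each phase. It is only a sketch: the function name is hypothetical, layers are assumed to be indexed from 0 at the input side, and the "additional layer" in Phase 2 is assumed to be the one directly below the last two. The linear head is trainable in every phase.

```python
def trainable_layer_indices(phase, n_layers=6):
    """Return the encoder layer indices that are trainable in a given phase."""
    if phase == 1:
        n_trainable = 2          # last two encoder layers
    elif phase == 2:
        n_trainable = 3          # one additional layer (assumed: the next one down)
    elif phase == 3:
        n_trainable = n_layers   # all layers
    else:
        raise ValueError(f"unknown phase: {phase}")
    return list(range(n_layers - n_trainable, n_layers))


for phase in (1, 2, 3):
    print(phase, trainable_layer_indices(phase))
# 1 [4, 5]
# 2 [3, 4, 5]
# 3 [0, 1, 2, 3, 4, 5]
```

In PyTorch, freezing the remaining layers typically amounts to setting `requires_grad = False` on their parameters.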
Key hyperparameters:
| Parameter | Value | Description |
|---|---|---|
| actual_max_seq_len | 256 | Max sequence length during training |
| model_max_seq_len | 1024 | Max sequence length the model can handle at inference |
| d_model | 512 | Transformer hidden dimension |
| n_layers | 6 | Number of Transformer encoder layers |
| n_heads | 16 | Number of attention heads |
| batch_size | 256 | Training batch size |
Expected data layout:
train_data/
├── <your_train_file>.txt
└── <your_val_file>.txt
One Arabic sentence per line. Update the train_text_file and val_text_file paths in train.py accordingly.
Checkpoints are saved to the configured dirpath, monitored by val_ser (Sentence Error Rate).
To train the distilled student model, run:
python train_distill.py
The student model has 2 encoder layers instead of 6, and is trained with a combination of:
- Hard loss — CTC loss against ground-truth labels
- Soft loss — KL divergence against the teacher's output distributions, scaled by temperature²
Student weights are initialized by copying matching layers from the teacher: the teacher's first layer loads into the student's first layer, and the teacher's last layer loads into the student's second layer.
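The layer copy can be pictured as a key remapping over the teacher's state dict. The sketch below is illustrative only: the key prefix `encoder.layers.` and the function name are hypothetical, and copying non-layer weights (embeddings, head) unchanged is an assumption.

```python
def remap_teacher_layers(teacher_state, layer_map, prefix="encoder.layers."):
    """Build a student state dict from selected teacher layers.

    `layer_map` maps teacher layer index -> student layer index.
    Assumption: keys outside `prefix` are shared and copied unchanged.
    """
    student_state = {}
    for key, value in teacher_state.items():
        if not key.startswith(prefix):
            student_state[key] = value
            continue
        index_str, _, rest = key[len(prefix):].partition(".")
        teacher_idx = int(index_str)
        if teacher_idx in layer_map:
            student_state[f"{prefix}{layer_map[teacher_idx]}.{rest}"] = value
    return student_state


# Toy state dict with strings standing in for tensors.
teacher = {
    "embed.weight": "E",
    "encoder.layers.0.attn.weight": "A0",
    "encoder.layers.3.attn.weight": "A3",
    "encoder.layers.5.attn.weight": "A5",
}
# Teacher first layer -> student first layer, teacher last -> student second.
print(remap_teacher_layers(teacher, {0: 0, 5: 1}))
```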
Key distillation hyperparameters:
| Parameter | Value | Description |
|---|---|---|
| temperature | 3.0 | Softens the teacher's probability distributions |
| alpha | 0.7 | Weight of the soft (distillation) loss; 1 - alpha weights the hard (CTC) loss |
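To make the roles of temperature and alpha concrete, here is a small pure-Python sketch of the combined loss for a single output frame. It is not the training code: the function names are hypothetical, the KL direction (teacher relative to student) is the usual convention and is assumed here, and the real implementation operates on batched tensors.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_loss(student_logits, teacher_logits, temperature=3.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    teacher = softmax_with_temperature(teacher_logits, temperature)
    student = softmax_with_temperature(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(teacher, student) if t > 0)
    return kl * temperature ** 2


def distillation_loss(hard_loss, soft, alpha=0.7):
    """Weighted sum: alpha on the soft loss, 1 - alpha on the hard (CTC) loss."""
    return alpha * soft + (1.0 - alpha) * hard_loss


soft = soft_loss([2.0, 0.5, -1.0], [1.5, 1.0, -0.5])
print(distillation_loss(hard_loss=0.8, soft=soft))
```

The temperature² factor compensates for the smaller gradients produced by softened distributions, so the soft and hard terms stay on a comparable scale.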
Update ckpt_path in train_distill.py to point to your best teacher checkpoint. Checkpoints are saved to models/candle_model_distilled_v1/.
To run inference on a text file:
python predict_cli.py INPUT_FILE OUTPUT_FILE MODEL_PATH BATCH_SIZE IS_DISTILLED
| Argument | Description |
|---|---|
| INPUT_FILE | Path to input .txt file (one line per sample) |
| OUTPUT_FILE | Path to write deduplicated output |
| MODEL_PATH | Path to model checkpoint (.ckpt) |
| BATCH_SIZE | Number of lines to process per batch |
| IS_DISTILLED | true to load the 2-layer distilled model, false for the 6-layer teacher |
Example using the provided shell script (runs both teacher and distilled models on the same benchmarks):
bash run_predict_cli.sh
Lines longer than max_seq_len are automatically split on word boundaries and rejoined after inference, so the script can process lines of any length.
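The split-and-rejoin behaviour can be approximated as follows. This is a sketch under assumptions, not the script's code: lengths are measured in characters, a single word longer than the limit is kept as its own chunk, and `infer` stands in for whatever callable runs the model on one chunk.

```python
def split_on_words(line, max_len):
    """Greedily pack whole words into chunks of at most `max_len` characters."""
    chunks, current = [], ""
    for word in line.split():
        candidate = f"{current} {word}" if current else word
        if current and len(candidate) > max_len:
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def process_long_line(line, max_len, infer):
    """Run `infer` on each chunk and rejoin the results with single spaces."""
    return " ".join(infer(chunk) for chunk in split_on_words(line, max_len))


print(split_on_words("aa bb cc dd", 5))                # ['aa bb', 'cc dd']
print(process_long_line("aa bb cc dd", 5, str.upper))  # AA BB CC DD
```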
If you have a trained Candle PyTorch Lightning checkpoint (.ckpt), you can export it to ONNX using the included export_to_onnx.py script. The following example exports both the full model and the distilled version:
pip install onnx onnxruntime
# Full 6-layer model
python export_to_onnx.py \
--checkpoint best_full_model.ckpt \
--output full_model.onnx \
--validate
# Distilled 2-layer model
python export_to_onnx.py \
--checkpoint best_distilled_model.ckpt \
--output distilled_model.onnx \
--distilled \
--validate
The --validate flag runs a numerical check comparing PyTorch and ONNX outputs. A maximum absolute difference below 1e-4 indicates a clean export.
Then load the exported file locally:
model = CandleDeduplicator(model_path="full_model.onnx")
To compute WER, CER, and SER against a reference file:
python compute_xer_cli.py REF.txt HYP.txt
Results are reported separately for sentences that contain character duplications and those that do not, as well as overall. To run all benchmark evaluations at once:
bash run_eval.sh
compute_fertility.py measures the fertility (tokens per word) of various HuggingFace tokenizers across the raw, ground-truth, and model-deduplicated benchmark files. This is useful for quantifying how much character elongation increases token counts for downstream LLMs.
Configure the MODELS and TOKENIZERS lists at the top of the script, then run:
python compute_fertility.py
Results are printed as a formatted table and optionally saved to a CSV file.
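Fertility itself is a simple ratio, sketched below with a toy tokenizer so the example runs without downloads. The function names and the toy tokenizer are hypothetical; compute_fertility.py may count words or tokens differently.

```python
def fertility(lines, tokenize):
    """Average number of tokens per whitespace-separated word."""
    n_tokens = 0
    n_words = 0
    for line in lines:
        n_words += len(line.split())
        n_tokens += len(tokenize(line))
    return n_tokens / n_words if n_words else 0.0


def toy_tokenize(text, piece_len=3):
    """Stand-in for a subword tokenizer: chop each word into 3-character pieces."""
    return [
        word[i:i + piece_len]
        for word in text.split()
        for i in range(0, len(word), piece_len)
    ]


print(fertility(["abc def"], toy_tokenize))       # 1.0
print(fertility(["abcccccc def"], toy_tokenize))  # 2.0
```

The second call shows the effect being measured: the elongated word splits into more pieces, so fertility rises. With a HuggingFace tokenizer, its `tokenize` method could be passed as the `tokenize` callable.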
Both training scripts log the following metrics per epoch:
| Metric | Description |
|---|---|
| val_loss | CTC loss on the validation set |
| val_ser | Sentence Error Rate: fraction of sentences with any error |
| val_wer | Word Error Rate |
| val_cer | Character Error Rate |
Logs are written to TensorBoard under the configured logs_path.
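For reference, the three error rates can be computed with a plain Levenshtein distance as sketched below. The repository relies on kaldialign for alignment; this stand-in uses a hand-written edit distance instead, counts spaces as characters for CER, and uses hypothetical function names.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def error_rates(refs, hyps):
    """Corpus-level WER, CER and SER over paired reference/hypothesis lines."""
    word_errs = word_total = char_errs = char_total = sent_errs = 0
    for ref, hyp in zip(refs, hyps):
        word_errs += edit_distance(ref.split(), hyp.split())
        word_total += len(ref.split())
        char_errs += edit_distance(ref, hyp)
        char_total += len(ref)
        sent_errs += int(ref != hyp)
    return {
        "wer": word_errs / max(word_total, 1),
        "cer": char_errs / max(char_total, 1),
        "ser": sent_errs / max(len(refs), 1),
    }


print(error_rates(["a b c", "d e"], ["a x c", "d e"]))
# {'wer': 0.2, 'cer': 0.125, 'ser': 0.5}
```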
Training and inference expect plain .txt files with one Arabic sentence per line. The preprocessing pipeline:
- Strips tashkeel (diacritics) and tatweel (elongation marks)
- Removes non-Arabic characters
- Filters out lines shorter than min_sentence_len characters
- Splits long lines into overlapping chunks with a configurable stride_ratio (default 0.5)
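The four steps above can be sketched as follows. This is an approximation rather than the code in candle_dataset.py: the Unicode ranges, the word-based chunking, the parameter name `chunk_words`, and the default values for `min_sentence_len` and `chunk_words` are all assumptions made for illustration.

```python
import re

# Assumed ranges: U+064B-U+0652 for tashkeel, U+0640 for tatweel,
# U+0621-U+064A for Arabic letters.
_DIACRITICS_AND_TATWEEL = re.compile("[\u064B-\u0652\u0640]")
_NON_ARABIC = re.compile("[^\u0621-\u064A ]")


def clean_line(line):
    """Strip tashkeel and tatweel, then replace non-Arabic characters with spaces."""
    line = _DIACRITICS_AND_TATWEEL.sub("", line)
    line = _NON_ARABIC.sub(" ", line)
    return " ".join(line.split())  # collapse repeated whitespace


def overlapping_chunks(words, size, stride_ratio=0.5):
    """Slide a window of `size` items, advancing by size * stride_ratio."""
    stride = max(1, int(size * stride_ratio))
    chunks, start = [], 0
    while True:
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break
        start += stride
    return chunks


def preprocess(lines, min_sentence_len=5, chunk_words=50, stride_ratio=0.5):
    """Clean, filter, and chunk raw lines into training samples."""
    samples = []
    for line in lines:
        cleaned = clean_line(line)
        if len(cleaned) < min_sentence_len:
            continue  # too short after cleaning
        for chunk in overlapping_chunks(cleaned.split(), chunk_words, stride_ratio):
            samples.append(" ".join(chunk))
    return samples


print(overlapping_chunks(list(range(10)), 4))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

With a stride_ratio of 0.5, consecutive chunks share half of their content, so text near a chunk boundary is seen with context on both sides.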
- Tokenizer adapted from CATT (CharBERT)
- Pretrained CharBERT weights from CATT
- Transformer implementation based on transformer-pytorch by Hyunwoong Ko
- WER/CER metrics via kaldialign