DVCTr extends DICTr to Digital Volume Correlation (DVC). It uses a transformer architecture to match volumetric image pairs precisely, enabling accurate displacement field estimation.
DVCTr consists of the following key components:
- ResNet Encoder: Extracts multi-scale features from input volume images
- Transformer Encoder: Enhances feature representation using Swin Transformer
- Matching Block: Performs both global and local feature matching
The network processes input volumes at multiple resolutions, starting with global matching at lower resolution and refining with local matching at higher resolution.
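The coarse-to-fine idea can be illustrated with a toy that has nothing to do with the network internals: a 1D integer-shift search that first scans all shifts on downsampled signals, then refines within a small radius at full resolution. Everything here is an assumption made for illustration (1D signals, mean absolute difference as the matching cost, the helper names, `factor=4`, `radius=4`); DVCTr itself matches learned 3D features.

```python
import random


def sad(a, b, shift):
    """Mean absolute difference between a[i] and b[i + shift] over the valid overlap."""
    pairs = [(a[i], b[i + shift]) for i in range(len(a)) if 0 <= i + shift < len(b)]
    return sum(abs(x - y) for x, y in pairs) / len(pairs)


def best_shift(a, b, candidates):
    """Candidate shift with the lowest matching cost."""
    return min(candidates, key=lambda s: sad(a, b, s))


def downsample(sig, factor):
    """Average non-overlapping blocks of `factor` samples."""
    return [sum(sig[i:i + factor]) / factor
            for i in range(0, len(sig) - factor + 1, factor)]


def coarse_to_fine_shift(a, b, factor=4, radius=4):
    # Global stage: test every shift (up to half the signal) at low resolution.
    a_lo, b_lo = downsample(a, factor), downsample(b, factor)
    n_lo = len(a_lo)
    coarse = best_shift(a_lo, b_lo, range(-(n_lo // 2), n_lo // 2 + 1))
    # Local stage: search only a small window around the upscaled coarse estimate.
    center = coarse * factor
    return best_shift(a, b, range(center - radius, center + radius + 1))


rng = random.Random(0)
a = [rng.random() for _ in range(64)]
true_shift = 8
b = [0.0] * true_shift + a[:-true_shift]  # b[i + 8] == a[i]
estimated = coarse_to_fine_shift(a, b)
```

The global stage keeps the search cheap because it runs on a quarter of the samples; the local stage only has to correct the residual error left by the coarse estimate.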
The unsupervised loss function combines photometric consistency with multi-resolution displacement gradient consistency (MrDGC), enabling training without labeled displacement data or specific assumptions about the smoothness of the displacement field.
$l_{g}=\sum_{j=0}^{2} \frac{w_{j}}{N K_{g}} \sum_{i=0}^{N}\left\| I_{0}\left(x_{i}\right)-I_{1}^{j}\left(x_{i}-u_{j}\left(x_{i}\right)\right)\right\|_{1}$

- Weights ($w_j$): $w_{j}=\frac{0.9^j}{\sum_{k=0}^{2} 0.9^{k}}$ (exponential decay with base 0.9, i.e., intensity differences at higher resolution are assigned higher weights).
- Normalized by $N$ (total number of pixels) and $K_g=255$ (max grayscale value for 8-bit image).

$l_{m}=\frac{h\left\| g_{h}-g_{f}\right\|_{1}+q\left\| g_{q}-g_{f}\right\|_{1}}{N K_{m}}$

- Displacement gradients calculated via central difference (forward/backward for edge pixels); weighted as $h=0.9$ (1/2 resolution) and $q=0.1$ (1/4 resolution) to preserve genuine high-frequency deformation.
- Normalized by $N$ (total number of pixels) and $K_m$ (a dimensionless factor determined via 1-epoch trial training).

$l_{I}=w_{g} l_{g}+w_{m} l_{m}$

- Optimal weight ratio for DVCTr: $w_g : w_m = 2 : 1$.
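As a rough numerical illustration of how the three loss terms combine, the sketch below evaluates them on tiny flattened lists. It is not the repository's implementation. It assumes the warped images $I_1^j(x - u_j(x))$ and the gradient fields are already computed and brought to a common size, and it reads $g_h$, $g_q$, $g_f$ as displacement gradients from the half-, quarter- and full-resolution predictions, which is an interpretation of the subscripts. The function names and the placeholder `k_m=1.0` are hypothetical.

```python
def l1(a, b):
    """Sum of absolute differences between two equally long flat sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def photometric_weights(levels=3, base=0.9):
    """w_j = base**j / sum_k base**k, so level j = 0 gets the largest weight."""
    raw = [base ** j for j in range(levels)]
    total = sum(raw)
    return [r / total for r in raw]


def photometric_loss(ref, warped_levels, k_g=255.0):
    """l_g: weighted L1 difference between the reference image and the warped
    deformed image of each level, normalized by N and K_g."""
    n = len(ref)
    weights = photometric_weights(len(warped_levels))
    return sum(w * l1(ref, warped) / (n * k_g)
               for w, warped in zip(weights, warped_levels))


def mrdgc_loss(g_h, g_q, g_f, k_m, h=0.9, q=0.1):
    """l_m: weighted L1 gap between lower-resolution and full-resolution gradients."""
    n = len(g_f)
    return (h * l1(g_h, g_f) + q * l1(g_q, g_f)) / (n * k_m)


def total_loss(l_g, l_m, w_g=2 / 3, w_m=1 / 3):
    """l_I with the 2:1 ratio reported for DVCTr."""
    return w_g * l_g + w_m * l_m


# Toy values only; K_m = 1.0 is a placeholder, not a trained factor.
ref = [10.0, 20.0, 30.0, 40.0]
warped_levels = [
    [10.0, 21.0, 30.0, 40.0],
    [11.0, 21.0, 30.0, 40.0],
    [11.0, 21.0, 31.0, 40.0],
]
l_g = photometric_loss(ref, warped_levels)
l_m = mrdgc_loss(g_h=[0.1, 0.2, 0.3, 0.4], g_q=[0.1, 0.1, 0.3, 0.3],
                 g_f=[0.1, 0.2, 0.2, 0.4], k_m=1.0)
loss = total_loss(l_g, l_m)
```

With base 0.9 and three levels the weights come out close to 0.369, 0.332 and 0.299, so the decay across levels is mild.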
We recommend creating a Conda environment from the YAML file provided in the repository:
conda env create -f environment.yaml
conda activate dvctr

Generate synthetic speckle datasets using the MATLAB scripts provided:
cd ./dataset/SpeckleDataset
matlab -nosplash -nodesktop -r main

Key parameters in main.m:
% Total number of training and validation samples
batch_size_total = 3200;
% Control displacement magnitude
sigma_d
% Control displacement gradient
% Multiple gradient displacements to enrich the dataset
% Smaller values represent larger relative displacement gradients
grid_size_list = [6, 8, 16, 24];
% Number of speckles
% Determined based on duty cycle
NUMSPECKLE = 6000;
% Control speckle radius
sigma_r = 1.1;
% Noise parameters
% 2% Gaussian noise
myparamnoise = paramnoise('G',0.02);

Execute the following command in the root directory of the repository using the provided script:
sh ./train.sh

Key parameters in train.sh:
# Batch size per GPU
--batch_size 4
# Number of transformer layers
--num_transformer_layers 10
# Attention splits list
--attn_splits_list 2 8
# Correlation radius list
--corr_radius_list -1 4
# Total training steps
--num_steps 100000
# Loss weight for photometric loss
# lm: MrDGC loss, lg+lm=1
--lg 2/3
# Loss weight for half-resolution gradient difference loss
# q: quarter-resolution, q+h=1
--h 0.9

The script uses distributed training with the PyTorch launcher for efficient multi-GPU training.
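The comments in train.sh state that `lg + lm = 1` and `q + h = 1`, so only one weight of each pair is passed on the command line. The sketch below shows how the complementary weights follow from `--lg 2/3` and `--h 0.9`. How the training entry point actually parses these flags is not shown here, so `complementary_weights` is a hypothetical helper rather than repository code.

```python
from fractions import Fraction


def complementary_weights(value):
    """Parse a weight such as '2/3' or '0.9' and return (weight, 1 - weight)."""
    w = Fraction(value)  # accepts both 'a/b' and decimal strings
    if not 0 <= w <= 1:
        raise ValueError(f"weight must lie in [0, 1], got {value!r}")
    return float(w), float(1 - w)


# --lg 2/3 gives the photometric weight; the MrDGC weight is the remainder.
w_g, w_m = complementary_weights("2/3")
# --h 0.9 gives the half-resolution weight; the quarter-resolution weight is the remainder.
h, q = complementary_weights("0.9")
```

This reproduces the 2:1 ratio between `w_g` and `w_m` and the 0.9/0.1 split between the half- and quarter-resolution gradient terms.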
To improve training efficiency and reduce grayscale overfitting in unsupervised learning, an early stopping strategy is used. The reduction in the grayscale difference loss saturates after sufficient training, so its relative reduction on the validation set can serve as the stopping criterion. Our experiments suggest that a threshold of 2% is a reasonable choice. For example, if the relative reduction of the mean absolute error of grayscale values stays below 2% for three consecutive epochs, training can be stopped, and the model from the beginning of those three epochs is used as the fully trained model for inference.
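A minimal sketch of this stopping rule, assuming one validation loss per epoch (0-based index) and reading "the model at the beginning of the three epochs" as the checkpoint saved just before the first of the three flat epochs. The function name and the list-based interface are hypothetical.

```python
def early_stop_epoch(val_losses, threshold=0.02, patience=3):
    """Return the index of the epoch whose checkpoint should be kept,
    or None if the stopping criterion has not been met yet.

    val_losses[k] is the validation grayscale loss after epoch k (assumed > 0).
    """
    streak = 0
    for epoch in range(1, len(val_losses)):
        prev, cur = val_losses[epoch - 1], val_losses[epoch]
        rel_reduction = (prev - cur) / prev
        # A loss increase also counts as "no meaningful reduction".
        streak = streak + 1 if rel_reduction < threshold else 0
        if streak == patience:
            # Checkpoint from just before the flat stretch began.
            return epoch - patience
    return None


losses = [10.0, 8.0, 6.0, 5.0, 4.95, 4.92, 4.90, 4.89]
keep = early_stop_epoch(losses)
```

In this toy history the loss flattens after epoch 3, so the three following epochs each improve by less than 2% and the checkpoint from epoch 3 is the one retained.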
For reference, DVCTr was trained on a high-performance computing server with four NVIDIA A800 SXM4 GPUs (80 GB VRAM). The batch size was 4, and training took around 30 hours.
Execute the following command in the root directory of the repository using the provided script:
sh ./experiment.sh

Key parameters in experiment.sh:
# Path to the pretrained model
--resume checkpoints/UnsupervisedDVCTr.pth
# Types of experiments to run (tfm, rotation, simulate)
--exp_type tfm rotation simulate
# Number of transformer layers
--num_transformer_layers 10
# Attention splits list
--attn_splits_list 2 8
# Correlation radius list
--corr_radius_list -1 4

This script runs the model on predefined test cases to evaluate its performance on different deformation scenarios.
Example inference results:
The above shows the displacement inference result for a 128×128×32 voxel volume rotated by 5° in the x-y plane.
The above shows the inference result for a 128×128×32 voxel volume with relatively random complex deformation.
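For the rotation case, the predicted field can be compared with an analytical reference. The sketch below computes the in-plane displacement of a voxel under a rigid rotation in the x-y plane. The rotation center at the middle of the 128×128 slice, the counter-clockwise sign convention and the function name are all assumptions; the repository's rotation test may use different conventions.

```python
import math


def rotation_displacement(x, y, angle_deg, center):
    """In-plane displacement (u, v) of point (x, y) rotated about `center`.

    The out-of-plane component is zero for a rotation in the x-y plane.
    """
    t = math.radians(angle_deg)
    dx, dy = x - center[0], y - center[1]
    u = dx * math.cos(t) - dy * math.sin(t) - dx
    v = dx * math.sin(t) + dy * math.cos(t) - dy
    return u, v


# Center of a 128 x 128 slice with voxel indices 0..127 (assumed).
center = (63.5, 63.5)
u, v = rotation_displacement(0.0, 0.0, 5.0, center)
corner_magnitude = math.hypot(u, v)
```

Under these assumptions the displacement grows linearly with distance from the center and reaches roughly 7.8 voxels at the slice corners for a 5° rotation, which gives a sense of the displacement range the model has to recover in this test.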
The pretrained models of supervised and unsupervised DVCTr used in the paper are provided in the repository:
- Supervised model: ./checkpoints/SupervisedDVCTr.pth
- Unsupervised model: ./checkpoints/UnsupervisedDVCTr.pth
These models will be loaded in the default experiment script.
@article{HE2026114939,
title = {Unsupervised Transformer-based deep learning for digital image correlation and digital volume correlation},
journal = {Optics \& Laser Technology},
volume = {198},
pages = {114939},
year = {2026},
issn = {0030-3992},
doi = {https://doi.org/10.1016/j.optlastec.2026.114939},
url = {https://www.sciencedirect.com/science/article/pii/S0030399226002902},
author = {He, Haoyang and Zhou, Yifei and Zhang, Yajing and Cai, Yuqi and Li, Rui and Liu, Yiping and Tang, Liqun and Sun, Taolin and Jiang, Zhenyu}
}


