Hi,
I was trying to use the -resoverlap / -seqoverlap options to clean a very large alignment of closely related viral sequences (~200,000 sequences; ~5,000 nt). trimAl did not finish the task within 16 hours of runtime.
I took a quick look at the calculateSpuriousVector function in alignment.cpp, which computes the overlap for each position of each sequence. It seems to me that the slowdown comes from the three nested loops in this function. Instead of comparing every site of every sequence against all the other sequences, the number of nt/aa, gaps, and unidentified bases could be counted once per column and then reused for every sequence.
I implemented this in Python (using NumPy + Numba), and it runs in approximately 5 minutes on the same alignment (I will attach the code once I have tested it further).
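
A minimal NumPy-only sketch of the per-column counting idea is below. It is separate from the Numba implementation mentioned above and is not trimAl's code. It assumes that two symbols count as a "hit" when they are identical or when both are determined residues (neither a gap nor an unidentified base), and that a column passes for a sequence when its hits reach the residue-overlap threshold rounded up. The function name `overlap_scores` and its parameters are hypothetical. Under those assumptions, the hits of a sequence at a column depend only on how many sequences share its class (residue, gap, unidentified) in that column, so the work grows with the number of cells rather than with the square of the number of sequences.

```python
import math
import numpy as np


def overlap_scores(aln, res_overlap, gap="-", indet="N"):
    """Per-sequence fraction of columns that pass the residue-overlap test.

    aln         : 2D array of single characters, shape (n_seqs, n_cols)
    res_overlap : residue-overlap threshold in [0, 1]
    ASSUMPTION  : two symbols "hit" when they are identical or when both are
                  determined residues (neither gap nor unidentified base).
    """
    aln = np.asarray(aln)
    n_seqs, n_cols = aln.shape

    # Classify every cell once: gap, unidentified base, or residue.
    is_gap = aln == gap
    is_indet = aln == indet
    is_res = ~(is_gap | is_indet)

    # Number of *other* sequences that must hit for a column to pass.
    needed = math.ceil(res_overlap * (n_seqs - 1))

    # Count each class once per column; subtract one to exclude the
    # sequence itself, then decide per column whether that class passes.
    res_ok = (is_res.sum(axis=0) - 1) >= needed
    gap_ok = (is_gap.sum(axis=0) - 1) >= needed
    indet_ok = (is_indet.sum(axis=0) - 1) >= needed

    # A cell passes if the class it belongs to passes in its column.
    passed = (
        (is_res & res_ok).sum(axis=1)
        + (is_gap & gap_ok).sum(axis=1)
        + (is_indet & indet_ok).sum(axis=1)
    )
    return passed / n_cols


# Tiny hypothetical alignment: 4 sequences, 4 columns.
seqs = ["ACGT", "ACGT", "A-GN", "--GT"]
aln = np.array([list(s) for s in seqs])
print(overlap_scores(aln, res_overlap=0.5).tolist())  # [0.75, 0.75, 0.5, 0.5]
```

The sketch keeps three boolean matrices in memory, so for an alignment of the size described above it would probably need to process the rows in chunks after the per-column counts have been accumulated.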
Maybe this optimization could be considered for a future release?
Thanks.