-resoverlap / -seqoverlap option is extremely slow for very large alignments #127

@Mvila035

Hi,

I was trying to use the -resoverlap / -seqoverlap options to clean a very large alignment of closely related viral sequences (~200,000 sequences, each ~5,000 nt long). trimAl did not finish the task within 16 hours of runtime.

I took a quick look at the calculateSpuriousVector function in alignment.cpp, which computes the overlap for each position of each sequence. It seems to me that the problem comes from the three nested loops in this function: for every site of every sequence, the other sequences are scanned again, so the cost appears to grow quadratically with the number of sequences. Instead, the number of nt/aa, gaps, and unidentified bases could be computed once for each column and then reused for every sequence, rather than being recomputed for each site of each sequence.

I implemented this in Python (using NumPy + Numba), and it runs in approximately 5 minutes on the same alignment (I will attach the code once I have tested it further).
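To illustrate the idea, here is a minimal pure-Python sketch. It is not the NumPy + Numba code mentioned above, and it is not trimAl's implementation. It assumes a simplified overlap rule: a site "matches" another sequence at a column when both characters fall in the same class (residue, gap, or unidentified base). The names `classify` and `overlap_scores`, and the `indet` parameter, are hypothetical and only used for this example.

```python
from collections import Counter

GAP = "-"


def classify(ch, indet="N"):
    """Map a character to one of three classes: 'gap', 'indet' or 'res'."""
    if ch == GAP:
        return "gap"
    if ch.upper() == indet.upper():
        return "indet"
    return "res"


def overlap_scores(seqs, res_overlap, indet="N"):
    """Return, for each sequence, the fraction of columns whose overlap
    with the other sequences is >= res_overlap.

    Assumed rule (a simplification, not taken from trimAl's source):
    two characters in a column match when they belong to the same class.
    """
    n_seqs = len(seqs)
    if n_seqs < 2:
        raise ValueError("need at least two sequences")
    n_cols = len(seqs[0])
    if any(len(s) != n_cols for s in seqs):
        raise ValueError("all sequences must have the same length")

    # Single pass over the alignment: class counts per column.
    col_counts = [
        Counter(classify(s[j], indet) for s in seqs) for j in range(n_cols)
    ]

    scores = []
    for s in seqs:
        passed = 0
        for j, ch in enumerate(s):
            # Matches among the *other* sequences: reuse the column count
            # and subtract the sequence itself.
            hits = col_counts[j][classify(ch, indet)] - 1
            if hits / (n_seqs - 1) >= res_overlap:
                passed += 1
        scores.append(passed / n_cols)
    return scores
```

With this layout the work is proportional to (number of sequences) x (number of columns) instead of (number of sequences)^2 x (number of columns). For example, `overlap_scores(["ACGT", "AC-T", "A--T", "NCGT"], 0.6)` returns `[0.75, 0.75, 0.5, 0.5]`; sequences scoring below the -seqoverlap threshold would then be the ones to drop.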

Maybe it's an improvement that could be considered for a future release?

Thanks.
