Update sequence coordinates after trimming gappy ends #119

@vagkaratzas

When trimming multiple sequence alignments (MSAs) at the ends, the output FASTA headers should reflect the new coordinate ranges relative to the original (untrimmed) sequences. This can prevent confusion in alignment-trimming workflows. This functionality cannot be applied when gappy columns are trimmed from the middle of the alignment, only when the ends are clipped. Gap characters inside a sequence should be ignored (not counted as residues) when renumbering the sequence slices.

Example:
SEQ_1 original length (without gaps): 49 aa
SEQ_2 original length (without gaps): 46 aa
SEQ_3 original length (without gaps): 48 aa
SEQ_4 original length (without gaps): 46 aa

>SEQ_1
ACGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGTACGTAC
>SEQ_2/50-95
-CGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGTACGT--
>SEQ_3/50-97
GCGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGTACGTC-
>SEQ_4
-CGTACGTACGTACGTA-CTACGTACGTACGTACGTACGTACGTACGT--

After trimming 1 column from the start and 2 columns from the end, the expected output should be:

>SEQ_1/2-47
CGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGTACGT
>SEQ_2/50-95
CGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGTACGT
>SEQ_3/51-96
CGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGTACGT
>SEQ_4
CGTACGTACGTACGTA-CTACGTACGTACGTACGTACGTACGTACGT

This naming scheme makes clear which portion of the original sequence remains after trimming.
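
To make the request concrete, here is a minimal Python sketch of the renumbering rule illustrated above. It is not taken from the tool's code base: the function name `trim_ends_with_coordinates`, its parameters, and the `name/start-end` header pattern are assumptions based only on the example headers. It also assumes that a sequence with no coordinates starts at residue 1, and that a header with no coordinates stays unchanged when no residues are removed (as with SEQ_4).

```python
import re

# Optional "/start-end" suffix on a FASTA identifier (convention assumed
# from the example headers in this issue, e.g. "SEQ_2/50-95").
COORD_SUFFIX = re.compile(r"^(?P<name>.+)/(?P<start>\d+)-(?P<end>\d+)$")


def trim_ends_with_coordinates(header, aligned_seq, trim_start, trim_end,
                               gap_chars="-"):
    """Clip columns from both ends; return (new_header, trimmed_seq).

    Hypothetical helper: `header` is the identifier without the leading ">".
    """
    match = COORD_SUFFIX.match(header)
    if match:
        name = match.group("name")
        start, end = int(match.group("start")), int(match.group("end"))
    else:
        # Assumption: no coordinates means the full sequence, numbered from 1.
        name = header
        start = 1
        end = sum(1 for c in aligned_seq if c not in gap_chars)

    cut = len(aligned_seq) - trim_end
    head, trimmed, tail = aligned_seq[:trim_start], aligned_seq[trim_start:cut], aligned_seq[cut:]

    # Only residues shift the coordinates; clipped gap characters do not.
    removed_head = sum(1 for c in head if c not in gap_chars)
    removed_tail = sum(1 for c in tail if c not in gap_chars)

    # Assumption (from SEQ_4 above): leave an unnumbered header untouched
    # when no residues were clipped.
    if not match and removed_head == 0 and removed_tail == 0:
        return header, trimmed
    return f"{name}/{start + removed_head}-{end - removed_tail}", trimmed
```

Called on the SEQ_3 record above with `trim_start=1` and `trim_end=2`, this returns the header `SEQ_3/51-96`, because the trailing gap character in the clipped tail is not counted as a residue.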

Would you consider adding this functionality in a future release? Perhaps with an --update-coordinates flag that takes effect only when clipping the ends?
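
One way such a flag could decide whether it applies is to check that the retained columns form a single contiguous block. The sketch below assumes the trimmer can expose the 0-based indices of the columns it keeps; `end_trim_counts` and `kept_columns` are hypothetical names, not part of any existing API.

```python
def end_trim_counts(kept_columns, alignment_length):
    """Return (trim_start, trim_end) if only the ends were clipped, else None.

    Hypothetical helper: `kept_columns` is an iterable of 0-based indices
    of the alignment columns that survive trimming.
    """
    kept = sorted(set(kept_columns))
    if not kept:
        return None  # nothing retained, so there is nothing to renumber
    first, last = kept[0], kept[-1]
    # End-only clipping leaves every column between first and last intact.
    if len(kept) != last - first + 1:
        return None
    return first, alignment_length - 1 - last
```

For the 50-column example above, keeping columns 1 through 47 gives `(1, 2)`, while any result with a dropped interior column gives `None`, in which case the headers would be left as they are.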

Thanks!
