Description
When trimming multiple sequence alignments (MSAs) at the ends, the output FASTA headers should reflect the new coordinate ranges relative to the original (untrimmed) sequences. This would prevent confusion in alignment-trimming workflows. The renumbering cannot be applied when gappy columns are trimmed from the middle of the alignment, because the remaining residues would no longer form a single contiguous range. Gap characters inside the retained region should be ignored when renumbering the sequence slices.
Example:
SEQ_1 original length (without gaps): 49 residues
SEQ_2 original length (without gaps): 46 residues
SEQ_3 original length (without gaps): 48 residues
SEQ_4 original length (without gaps): 46 residues
>SEQ_1
ACGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGTACGTAC
>SEQ_2/50-95
-CGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGTACGT--
>SEQ_3/50-97
GCGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGTACGTC-
>SEQ_4
-CGTACGTACGTACGTA-CTACGTACGTACGTACGTACGTACGTACGT--
After trimming 1 column from the start and 2 columns from the end, the expected output should be:
>SEQ_1/2-47
CGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGTACGT
>SEQ_2/50-95
CGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGTACGT
>SEQ_3/51-96
CGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGTACGT
>SEQ_4
CGTACGTACGTACGTA-CTACGTACGTACGTACGTACGTACGTACGT
This naming scheme makes clear which portion of the original sequence remains after trimming.
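To make the request concrete, here is a rough Python sketch of the renumbering I have in mind. It is not taken from the tool's code base: the function name `trim_ends_with_coords` is hypothetical, and it assumes headers follow a `NAME/START-END` convention and are rewritten only when residues are actually removed (which is why SEQ_2 and SEQ_4 keep their headers in the example above).

```python
import re

# Assumed header convention: NAME/START-END (e.g. "SEQ_2/50-95").
COORD_RE = re.compile(r"^(?P<name>.+)/(?P<start>\d+)-(?P<end>\d+)$")


def trim_ends_with_coords(records, n_start, n_end, gap_chars="-"):
    """Clip n_start/n_end columns from the alignment ends and renumber headers.

    records is a list of (header, aligned_sequence) tuples.
    Hypothetical helper for illustration only.
    """
    def residues(segment):
        # Gap characters do not count towards sequence coordinates.
        return sum(1 for ch in segment if ch not in gap_chars)

    trimmed = []
    for header, seq in records:
        match = COORD_RE.match(header)
        if match:
            name = match["name"]
            start, end = int(match["start"]), int(match["end"])
        else:
            # No coordinates in the header: assume the full sequence, 1..length.
            name, start, end = header, 1, residues(seq)

        stop = len(seq) - n_end
        lost_left = residues(seq[:n_start])   # residues clipped at the start
        lost_right = residues(seq[stop:])     # residues clipped at the end

        # Assumption: leave the header untouched if only gaps were clipped.
        if lost_left or lost_right:
            header = f"{name}/{start + lost_left}-{end - lost_right}"
        trimmed.append((header, seq[n_start:stop]))
    return trimmed


records = [
    ("SEQ_1", "ACGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGTACGTAC"),
    ("SEQ_2/50-95", "-CGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGTACGT--"),
    ("SEQ_3/50-97", "GCGTACGTACGTACGTA-GTACGTACGTACGTACGTACGTACGTACGTC-"),
    ("SEQ_4", "-CGTACGTACGTACGTA-CTACGTACGTACGTACGTACGTACGTACGT--"),
]
result = trim_ends_with_coords(records, n_start=1, n_end=2)
for header, seq in result:
    print(f">{header}\n{seq}")
```

Because only the clipped end segments are inspected, the gap in the middle of the retained region does not affect the numbering.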
Would you consider adding this functionality in a future release? Perhaps as an --update-coordinates flag that takes effect only when clipping the ends?
Thanks!