kjaisingh
Contributor

@kjaisingh kjaisingh commented Oct 9, 2024

This PR introduces several new tools related to the CleanVcf workflow in GATK-SV; the use of these tools is documented in broadinstitute/gatk-sv#733. The tools are intended to offer several enhancements over the existing implementation, including but not limited to:

  • Introduce various unit and integration tests into the workflow.
  • Create more robust and generalizable tools that can be used independent of CleanVcf.
  • Improve runtime and execution speed by leveraging Java.

@kjaisingh kjaisingh requested a review from mwalker174 October 16, 2024 17:13
Collaborator

@mwalker174 mwalker174 left a comment

This looks great overall. I have some suggestions on style and some places where you can reuse other classes. I did not mark all places where you can add final to variable declarations, just a few cases. You may not need to create a separate "Engine" class for the internals here unless you think some of the components would be reusable in another step or if it makes testing easier.

Comment on lines 144 to 148
@Argument(
fullName = OUTPUT_REVISED_EVENTS_LIST_LONG_NAME,
doc="Output list of revised genotyped events"
)
private GATKPath outputRevisedEventsList;
Collaborator

Will it be possible to build this info directly into the VCF rather than have this extra file?

Contributor Author

Updated accordingly - added this as a new flag in the INFO field.


private void processSVType(VariantContext variant, VariantContextBuilder builder) {
final String svType = variant.getAttributeAsString(GATKSVVCFConstants.SVTYPE, null);
if (svType != null && variant.getAlleles().stream().noneMatch(allele -> allele.getDisplayString().contains(GATKSVVCFConstants.ME))) {
Collaborator

Better to parse the alleles with GATKSVVariantContextUtils.getSymbolicAlleleSymbols() and check for ME. Sometimes the alt is just <INS:ME>.
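
To illustrate the difference between substring matching and symbol-level matching, here is a minimal, self-contained sketch. It is a plain-Java stand-in and not the actual `GATKSVVariantContextUtils.getSymbolicAlleleSymbols()` implementation: the class name `SymbolicAlleleSketch`, its methods, and the example allele `<INS:MEGA>` are made up for illustration. The assumption is that a symbolic allele is written as colon-separated symbols inside angle brackets.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for illustration only; not the GATK implementation.
final class SymbolicAlleleSketch {
    private SymbolicAlleleSketch() {}

    // Splits a symbolic allele such as "<INS:ME:ALU>" into ["INS", "ME", "ALU"].
    // Returns an empty list for anything that is not wrapped in angle brackets.
    static List<String> symbols(final String alleleString) {
        if (alleleString == null || alleleString.length() < 3
                || !alleleString.startsWith("<") || !alleleString.endsWith(">")) {
            return Collections.emptyList();
        }
        final String inner = alleleString.substring(1, alleleString.length() - 1);
        return Arrays.asList(inner.split(":"));
    }

    // True only if one of the parsed symbols equals the requested symbol exactly.
    static boolean hasSymbol(final String alleleString, final String symbol) {
        return symbols(alleleString).contains(symbol);
    }
}
```

With this sketch, `hasSymbol("<INS:ME>", "ME")` and `hasSymbol("<INS:ME:ALU>", "ME")` are both true, while a made-up allele like `<INS:MEGA>` is rejected even though a plain `contains("ME")` on the display string would accept it.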

Contributor Author

Updated accordingly - thanks for the pointer.

Comment on lines +166 to +167
failSet = readLastColumn(failList);
passSet = readLastColumn(passList);
Collaborator

You could use TableReader here. It would require you to add header lines to the lists in the WDL, which would be okay too. See TableUtils.reader() and look at some implementations to see examples.
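
As a rough picture of what a headered list buys you, here is a plain-Java sketch that selects a column by name instead of by position. It deliberately does not use `TableReader` or `TableUtils.reader()`, so it says nothing about their signatures; the class `HeaderedListSketch`, the method `readColumn`, and the column names in the usage note are hypothetical. It assumes a tab-separated file whose first line is the header.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical plain-Java stand-in; not GATK's TableReader.
final class HeaderedListSketch {
    private HeaderedListSketch() {}

    // Reads a tab-separated table whose first line is a header and returns the
    // distinct values of the named column, in order of first appearance.
    static Set<String> readColumn(final Reader source, final String columnName) {
        final Set<String> values = new LinkedHashSet<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            final String headerLine = reader.readLine();
            if (headerLine == null) {
                throw new IllegalArgumentException("Missing header line");
            }
            final List<String> header = Arrays.asList(headerLine.split("\t", -1));
            final int index = header.indexOf(columnName);
            if (index < 0) {
                throw new IllegalArgumentException("Column not found: " + columnName);
            }
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // tolerate blank lines
                }
                final String[] fields = line.split("\t", -1);
                if (fields.length != header.size()) {
                    throw new IllegalArgumentException("Malformed row: " + line);
                }
                values.add(fields[index]);
            }
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
        return values;
    }
}
```

For example, `readColumn(new StringReader("chrom\tvariant_id\nchr1\tvar_a\n"), "variant_id")` would return a set containing `var_a`. Selecting by name keeps the tool working if columns are later added or reordered in the WDL output, which a "last column" convention does not guarantee.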

Contributor Author

Updated accordingly - thanks for the tip. For visibility, I have made corresponding changes to GATK-SV in broadinstitute/gatk-sv#733.

@kjaisingh kjaisingh self-assigned this Oct 17, 2024