-
Notifications
You must be signed in to change notification settings - Fork 144
Fix POOL_SHORT_READS receiving reads in wrong roder resulting in faulty pooling #894
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix POOL_SHORT_READS receiving reads in wrong roder resulting in faulty pooling #894
Conversation
|
|
@nf-core-bot fix linting |
…es-contain-unequal-number-of-reads
subworkflows/local/assembly/main.nf
Outdated
| // We have to merge reads together to match tuple structure of POOL_SHORT_READS/ | ||
| // This MUST be in a interleaved structure (s1_r1, s1_r2, s2_r1, s2_r2, ...) | ||
| // So we merge the two list of R1 and R2s, and sort them to ensure correct order above | ||
| ch_short_reads_grouped_for_pooling = ch_short_reads_grouped.map { meta, reads1, reads2 -> [meta, [reads1 + reads2].flatten().sort()] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we assume that the reads files here are standardly-named such that a sort() won't break the order?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think we can assume that because all those files are renamed for ${prefix} at that point, imho. Or is it possible to skip the complete QC so that original files names come through? Not entirely sure...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do think it's possible to basically completely skip QC...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How likely do you think it would be that people don't have a _R1 / _R2, _1 / _2, _F, _R in their FASTQ files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be wary of assuming anything about file names unless we have strictly controlled it. One way to do that would be also to force a schema like the above in the samplesheet validation, so we stop early before errors.
Otherwise we have to be careful with channel order, etc.?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typical Illumina output from the sequencing facilities & companies I know is <sample>_R1_<lane>.fastq.gz. Single-end read files might not have any of those pattern to identify direction (R1/1F/whatever).
I think that makes it already more complicated to catch? I am not an regex expert though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@d4straub 's pattern the one of the most common patterns I've seen too... and other people append the adapter index sequence to the end after the lane ID too... so I really don't think this will be simply be solvable.
But for me that isnt needed, potentially we could add a comment (Warning) in the docs about the sorting issue with SPAdes & skipping all QC & file names, maybe to the co-assembly step (https://nf-co.re/mag/5.1.0/docs/usage/#the-group-column?),
I'm erring for this, but I want this to be a democracy.
@dialvarezs any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleted my previous comment, as I wasn't sure it actually worked, but I think it does:
ch_a = Channel.of(["meta", ["a", "c", "b"], ["d", "b", "f"]])
ch_a.map { meta, f1, f2 ->
def transposed_pairs = [f1, f2].transpose()
println transposed_pairs
def sorted_pairs = transposed_pairs.sort { it[0] }
println sorted_pairs
def interleaved = sorted_pairs.flatten()
return [meta, interleaved]
}.view()
transposed: [[a, d], [c, b], [b, f]]
sorted: [[a, d], [b, f], [c, b]]
output: [meta, [a, d, b, f, c, b]]
So we can just sort on fasta1's name, avoiding issues with naming entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried that code above and it seems fine to me. It also works with e.g.
ch_a = Channel.of(["meta", ["a_s1_R1_a", "c_s3_R1_c", "b_s2_R1_b"], ["d_s1_R2_d", "b_s3_R2_b", "f_s2_R2_f"]])
that is sorted to
[meta, [a_s1_R1_a, d_s1_R2_d, b_s2_R1_b, f_s2_R2_f, c_s3_R1_c, b_s3_R2_b]]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK nice thanks for the cross-validation @d4straub ! When I am more functional I will try to implement it!
…es-contain-unequal-number-of-reads
…-unequal-number-of-reads' of github.com:nf-core/mag into 890-metaspades-exit-status-21-paired-read-files-contain-unequal-number-of-reads
subworkflows/local/assembly/main.nf
Outdated
| // We have to merge reads together to match tuple structure of POOL_SHORT_READS/ | ||
| // This MUST be in a interleaved structure (s1_r1, s1_r2, s2_r1, s2_r2, ...) | ||
| // So we merge the two list of R1 and R2s, and sort them to ensure correct order above | ||
| ch_short_reads_grouped_for_pooling = ch_short_reads_grouped.map { meta, reads1, reads2 -> [meta, [reads1 + reads2].flatten().sort()] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ch_short_reads_grouped_for_pooling = ch_short_reads_grouped.map { meta, reads1, reads2 -> [meta, [reads1 + reads2].flatten().sort()] } | |
| ch_short_reads_grouped_for_pooling = ch_short_reads_grouped.map { meta, reads1, reads2 -> [meta, [reads1, reads2].transpose().sort { it[0].getName() }.flatten()] } |
…es-contain-unequal-number-of-reads
|
This was almost ready TBH. I just implemented the suggestions from @prototaxites and @d4straub in the discussion and verified that everything works as expected. The only change I made was moving the sorting to the grouping step, so the files start sorted from there. Then, as |
d4straub
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that looks fine, but the .tranpose().sort { it[0] }.flatten() is a little less obvious now (which is probably fine). @prototaxites came up with that fix, so may he judge ;)
subworkflows/local/assembly/main.nf
Outdated
| // We have to merge reads together to match tuple structure of POOL_SHORT_READS/ | ||
| // This MUST be in a interleaved structure (s1_r1, s1_r2, s2_r1, s2_r2, ...) | ||
| // So we merge the two list of R1 and R2s, and sort them to ensure correct order above |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but that sorting isnt done here now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I will update that line.
In fact, the sorting step isn’t required for this to work, because since we preserve the tuple order, the r1 and r2 pairs will be in the same order. I added it anyway for the sake of result stability (to get snapshots working, I have been adding sorting steps basically everywhere).
Co-authored-by: Daniel Straub <[email protected]>
|
I'm ok with this, but can't approve my own PR @d4straub could you give a ✔️ if you're happy, and ofc @prototaxites ! |
prototaxites
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
|
Let's go! @dialvarezs thanks for wrapping this up 🙏 I suggest we wait one more week to see if any more last minute bug fixes come up, otherwise let's get this out in a release! |
So we can continue with our flood of bi-weekly releases 😂 |
Exactly |
Closes #890
--coassembly_groupparameter was missed out in the new config structuresTODO:
PR checklist
nf-core pipelines lint).nextflow run . -profile test,docker --outdir <OUTDIR>).nextflow run . -profile debug,test,docker --outdir <OUTDIR>).docs/usage.mdis updated.docs/output.mdis updated.CHANGELOG.mdis updated.README.mdis updated (including new tool citations and authors/contributors).