Skip to content

Conversation

@jfy133
Copy link
Member

@jfy133 jfy133 commented Oct 25, 2025

Closes #890

  • Essentially the input tuple for pooling was in the wrong format meaning only R1s were being pooled and out of order (i.e, what was meant to be a samples R2, was the second samples R1)
  • This was missed as the --coassembly_group parameter was missed out in the new config structures

TODO:

  • Run tests for all other configs to make sure nothing else changed
  • Regenerate snapshot for test_alternative now coassembly activated

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@jfy133 jfy133 marked this pull request as draft October 25, 2025 06:05
@github-actions
Copy link

github-actions bot commented Oct 25, 2025

nf-core pipelines lint overall result: Passed ✅ ⚠️

Posted for pipeline commit bb20e9a

+| ✅ 379 tests passed       |+
#| ❔   1 tests were ignored |#
!| ❗   6 tests had warnings |!
Details

❗ Test warnings:

  • pipeline_todos - TODO string in main.nf: Remove this line if you don't need a FASTA file [TODO: try and test using for --host_fasta and --host_genome]
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • pipeline_todos - TODO string in nextflow.config: Specify any additional parameters here

❔ Tests ignored:

  • files_unchanged - File ignored due to lint config: .github/PULL_REQUEST_TEMPLATE.md

✅ Tests passed:

Run details

  • nf-core/tools version 3.4.1
  • Run at 2025-11-17 12:43:26

@jfy133
Copy link
Member Author

jfy133 commented Oct 25, 2025

@nf-core-bot fix linting

// We have to merge reads together to match tuple structure of POOL_SHORT_READS/
// This MUST be in a interleaved structure (s1_r1, s1_r2, s2_r1, s2_r2, ...)
// So we merge the two list of R1 and R2s, and sort them to ensure correct order above
ch_short_reads_grouped_for_pooling = ch_short_reads_grouped.map { meta, reads1, reads2 -> [meta, [reads1 + reads2].flatten().sort()] }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we assume that the reads files here are standardly-named such that a sort() won't break the order?

Copy link
Collaborator

@d4straub d4straub Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think we can assume that because all those files are renamed for ${prefix} at that point, imho. Or is it possible to skip the complete QC so that original files names come through? Not entirely sure...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think it's possible to basically completely skip QC...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How likely do you think it would be that people don't have a _R1 / _R2, _1 / _2, _F, _R in their FASTQ files?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be wary of assuming anything about file names unless we have strictly controlled it. One way to do that would be also to force a schema like the above in the samplesheet validation, so we stop early before errors.

Otherwise we have to be careful with channel order, etc.?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typical Illumina output from the sequencing facilities & companies I know is <sample>_R1_<lane>.fastq.gz. Single-end read files might not have any of those pattern to identify direction (R1/1F/whatever).
I think that makes it already more complicated to catch? I am not an regex expert though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@d4straub 's pattern the one of the most common patterns I've seen too... and other people append the adapter index sequence to the end after the lane ID too... so I really don't think this will be simply be solvable.

But for me that isnt needed, potentially we could add a comment (Warning) in the docs about the sorting issue with SPAdes & skipping all QC & file names, maybe to the co-assembly step (https://nf-co.re/mag/5.1.0/docs/usage/#the-group-column?),

I'm erring for this, but I want this to be a democracy.

@dialvarezs any thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted my previous comment, as I wasn't sure it actually worked, but I think it does:

ch_a = Channel.of(["meta", ["a", "c", "b"], ["d", "b", "f"]])
ch_a.map { meta, f1, f2 ->
    def transposed_pairs = [f1, f2].transpose()
    println transposed_pairs
    
    def sorted_pairs = transposed_pairs.sort { it[0] }  
    println sorted_pairs
      
    def interleaved = sorted_pairs.flatten()

    return [meta, interleaved] 
}.view()
transposed: [[a, d], [c, b], [b, f]]
sorted: [[a, d], [b, f], [c, b]]

output: [meta, [a, d, b, f, c, b]]

So we can just sort on fasta1's name, avoiding issues with naming entirely.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that code above and it seems fine to me. It also works with e.g.
ch_a = Channel.of(["meta", ["a_s1_R1_a", "c_s3_R1_c", "b_s2_R1_b"], ["d_s1_R2_d", "b_s3_R2_b", "f_s2_R2_f"]])
that is sorted to
[meta, [a_s1_R1_a, d_s1_R2_d, b_s2_R1_b, f_s2_R2_f, c_s3_R1_c, b_s3_R2_b]]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK nice thanks for the cross-validation @d4straub ! When I am more functional I will try to implement it!

…-unequal-number-of-reads' of github.com:nf-core/mag into 890-metaspades-exit-status-21-paired-read-files-contain-unequal-number-of-reads
@jfy133 jfy133 marked this pull request as ready for review October 29, 2025 06:20
// We have to merge reads together to match tuple structure of POOL_SHORT_READS/
// This MUST be in a interleaved structure (s1_r1, s1_r2, s2_r1, s2_r2, ...)
// So we merge the two list of R1 and R2s, and sort them to ensure correct order above
ch_short_reads_grouped_for_pooling = ch_short_reads_grouped.map { meta, reads1, reads2 -> [meta, [reads1 + reads2].flatten().sort()] }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ch_short_reads_grouped_for_pooling = ch_short_reads_grouped.map { meta, reads1, reads2 -> [meta, [reads1 + reads2].flatten().sort()] }
ch_short_reads_grouped_for_pooling = ch_short_reads_grouped.map { meta, reads1, reads2 -> [meta, [reads1, reads2].transpose().sort { it[0].getName() }.flatten()] }

@dialvarezs
Copy link
Member

dialvarezs commented Nov 16, 2025

This was almost ready TBH. I just implemented the suggestions from @prototaxites and @d4straub in the discussion and verified that everything works as expected. The only change I made was moving the sorting to the grouping step, so the files start sorted from there. Then, as .transpose() keeps the order, just flattening the result is enough to get the interleaved reads for pooling.

Copy link
Collaborator

@d4straub d4straub left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that looks fine, but the .tranpose().sort { it[0] }.flatten() is a little less obvious now (which is probably fine). @prototaxites came up with that fix, so may he judge ;)

Comment on lines 49 to 51
// We have to merge reads together to match tuple structure of POOL_SHORT_READS/
// This MUST be in a interleaved structure (s1_r1, s1_r2, s2_r1, s2_r2, ...)
// So we merge the two list of R1 and R2s, and sort them to ensure correct order above
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but that sorting isnt done here now?

Copy link
Member

@dialvarezs dialvarezs Nov 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I will update that line.
In fact, the sorting step isn’t required for this to work, because since we preserve the tuple order, the r1 and r2 pairs will be in the same order. I added it anyway for the sake of result stability (to get snapshots working, I have been adding sorting steps basically everywhere).

Co-authored-by: Daniel Straub <[email protected]>
@jfy133
Copy link
Member Author

jfy133 commented Nov 17, 2025

I'm ok with this, but can't approve my own PR @d4straub could you give a ✔️ if you're happy, and ofc @prototaxites !

Copy link
Contributor

@prototaxites prototaxites left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jfy133
Copy link
Member Author

jfy133 commented Nov 17, 2025

Let's go! @dialvarezs thanks for wrapping this up 🙏

I suggest we wait one more week to see if any more last minute bug fixes come up, otherwise let's get this out in a release!

@jfy133 jfy133 merged commit 052e43c into dev Nov 17, 2025
22 checks passed
@jfy133 jfy133 deleted the 890-metaspades-exit-status-21-paired-read-files-contain-unequal-number-of-reads branch November 17, 2025 13:54
@dialvarezs
Copy link
Member

dialvarezs commented Nov 17, 2025

I suggest we wait one more week to see if any more last minute bug fixes come up, otherwise let's get this out in a release!

So we can continue with our flood of bi-weekly releases 😂

@jfy133
Copy link
Member Author

jfy133 commented Nov 17, 2025

I suggest we wait one more week to see if any more last minute bug fixes come up, otherwise let's get this out in a release!

So we can continue with our flood of bi-weekly releases 😂

Exactly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants