Skip to content

Conversation

@dialvarezs
Copy link
Member

@dialvarezs dialvarezs commented Nov 4, 2025

Fix for an issue reported on Slack.
The source of the issue is that .collect() is returning a string, not a list, so .size() will give > 1 always.
I tried adding flat: false but didn't work either, so I used .toList().

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

Copy link
Member

@jfy133 jfy133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for the rapid fix @dialvarezs !

@jfy133
Copy link
Member

jfy133 commented Nov 4, 2025

Looks like quay.io is having problems...

@dialvarezs
Copy link
Member Author

dialvarezs commented Nov 4, 2025

Not now BUSCO, not now! >:(

@dialvarezs
Copy link
Member Author

@jfy133 I excluded bin_summary.tsv from the snapshot, because as I related in #905 it seems impossible to keep it stable, unless we add a sorting step aftewards.

If that's ok, I'll merge.

@jfy133
Copy link
Member

jfy133 commented Nov 4, 2025

@jfy133 I excluded bin_summary.tsv from the snapshot, because as I related in #905 it seems impossible to keep it stable, unless we add a sorting step aftewards.

If that's ok, I'll merge.

I'm not following (sorry if it's illness-brain), What exactly is not stable? The md5sum?

I would not exclude entirely as it's a critical file. At a minimum check number of rows and maybe a few strings (you can have multiple assertions for the same file).

But is the problem that bin_summary.csv has rows in different order? If so I think your .sort() fix for the mag_depths script was a good one.

@dialvarezs
Copy link
Member Author

@jfy133 My bad here, I did this in a hurry and for some reason I thought it was busco_summary instead of bin_summary.
And yes, the problem is the row ordering, but in the bin summary is solvable, it just need a sort in the Python script.
I'm reverting this now and aplying the proper fix.

@dialvarezs dialvarezs force-pushed the fix-platform-validation branch from 77430f4 to 83a108c Compare November 4, 2025 14:43
@dialvarezs dialvarezs mentioned this pull request Nov 4, 2025
11 tasks
GenomeBinning/MetaBAT2/unbinned/discarded/*.unbinned.pooled.fa.gz
GenomeBinning/QC/CheckM2/**/DIAMOND_RESULTS.tsv
GenomeBinning/QC/CheckM2/*/checkm2.log
GenomeBinning/QC/busco_summary.tsv
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't find this satisfying, can we check for the number of rows instead? But I'll give a preemptive approval on case that's not possible

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely, I agree 100%.

But I guess we can address it later in a snapshot related PR? I can add it to my PR for test_assembly_input and copy it from all the snapshots using BUSCO.

The information of that file es contained on bin_summary.tsv, so we're checking it someway. The problem is that BUSCO in batch mode definitively can't produce stable results.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we check for the number of rows instead

A quick heads up from my own metagenome pipeline - I have found that the binning tools can output varying numbers of bins between runs. So you need to be a bit clever and can't just hard-code a number

Copy link
Member Author

@dialvarezs dialvarezs Nov 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick heads up from my own metagenome pipeline - I have found that the binning tools can output varying numbers of bins between runs. So you need to be a bit clever and can't just hard-code a number

Our binning results seem stable (maybe because the dataset small and simple enough). We have snapshots for all the bin FASTA files, and the number is consistent too. So I guess it's not so bad to have the number harcoded until we change the dataset. But if you have a better idea @prototaxites, we can implement it that way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not specifically (I'm just comparing counts across files) - I just point it out because in my experience MetaBat2 was the big culprit and did not consistently put out the same files or in the same order...

But if it's working here that's fine!

@dialvarezs dialvarezs merged commit b60e1f6 into nf-core:dev Nov 4, 2025
25 checks passed
@dialvarezs dialvarezs mentioned this pull request Nov 5, 2025
11 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants