Add HTSLIB/BGZIPTABIX and deprecate redundant modules#11571
Conversation
…ession with tests
…oved compatibility and add new tests for name clash handling
…multiple test files
- Updated tests across multiple modules to include empty input arrays for compatibility. - Introduced new test cases for compressing VCF files with regions and indexing. - Deprecated the tabix module, replacing its functionality with HTSLIB/BGZIPTABIX. - Adjusted workflow and test configurations to reflect the new module structure. - Updated metadata and tags in various subworkflows to align with the new HTSLIB module.
…metadata consistency
…port in bgziptabix module
…to tabixupdate
|
@nf-core-bot fix linting |
|
The latest version of TABIX/TABIX had a piece of code (main.nf line 24) that sometimes made it transplant additional parts of the compressed file name (e.g. if the input file is called |
famosab
left a comment
There was a problem hiding this comment.
My few cents, I like it generally. I think I need to take another look at the module logic but my main point is that I am not entirely convinced that having only one module is the way to go here, I would rather have one for the compress and one for the decompress direction but that might be personal preference.
| val action | ||
| val make_index | ||
| val out_ext |
There was a problem hiding this comment.
what is the reason to have them all separated in channels? I feel like they could be in a tuple but thats not a strong opinion. in the pipeline i would control these values via the meta map anyway and set this then in the modules config I think
There was a problem hiding this comment.
I feel like the practice with the most support in the community is to have this type of "setting" inputs is to have them as separate val channels. Some (e.g. Simon) would push much of this into the config, but it's not fully accepted as it stands.
| tuple val(meta), path("${outfile}") , emit: output | ||
| tuple val(meta), path("${outfile}.{tbi,csi}"), emit: index , optional: true | ||
| // all htslib tools have the same version, we use bgzip | ||
| tuple val("${task.process}"), val('htslib'), eval("bgzip --version | sed '1! d; s/bgzip (htslib) //'"), topic: versions, emit: versions_htslib |
There was a problem hiding this comment.
we should add the xz version here as well
There was a problem hiding this comment.
I would use
nextflow lint -format -sort-declarations -spaces 4 -harshil-alignmenton this file to clean this up nicely. (mainly setting {}
There was a problem hiding this comment.
Done, it broke Harshil alignment though
| - infile_tbi: | ||
| type: file | ||
| description: Optional tabix index for the input file. Required when decompressing | ||
| and creating an index for the output file. |
There was a problem hiding this comment.
do we really need an index when we decompress? 🤔
There was a problem hiding this comment.
My bad, too much tab-driven development, fixed it
| - action: | ||
| type: string | ||
| description: Action to perform, either `compress` or `decompress` | ||
| - make_index: |
There was a problem hiding this comment.
I would argue that we want this index to always be created automatically and that its just either used then in the pipeline downstream with some kind of .join or ignored. But in most cases I think the index is needed rather than not needed.
There was a problem hiding this comment.
I tried to remove this option, but it causes issues in cases where the format is legitimate bgzip input, but not supported by tabix (i.e. any non-tabular file). I know it's a minor use case, but I see no other way to make it a comprehensive bgzip module.
| ontologies: | ||
| - edam: http://edamontology.org/format_3475 # TSV | ||
| - edam: http://edamontology.org/format_3003 # BED | ||
| - action: |
There was a problem hiding this comment.
I am not 100% sure if I like the approach that one module does both directions, i feel like having one compress and one decompress module would make it a bit easier to reuse but I see the maintenance burden again.
There was a problem hiding this comment.
It's a philosophical question, we can discuss it with the others on Slack. It was Maxime's idea to merge everything into one module and I think it makes sense maintenance-wise (without putting that much burden on pipeline developers).
| touch ${prefix}.paraphase.bam | ||
| touch ${prefix}.paraphase.bam.bai | ||
| echo '' | gzip > ${prefix}_paraphase_vcfs/${prefix}_stub.vcf.gz | ||
| echo "" | gzip > ${prefix}_paraphase_vcfs/${prefix}_stub.vcf.gz |
There was a problem hiding this comment.
| echo "" | gzip > ${prefix}_paraphase_vcfs/${prefix}_stub.vcf.gz | |
| echo | gzip > ${prefix}_paraphase_vcfs/${prefix}_stub.vcf.gz |
why is there something added to the prefix here 🤔
There was a problem hiding this comment.
No idea, I did not make this module, only changed the quotes to fix a linting failure
There was a problem hiding this comment.
You suggestion fails in nf-core modules lint
There was a problem hiding this comment.
yeah it should not, i already updated that in the docs und have an open issue with tools but then leave it as it was before
…iption for optional input file index
…to tabixupdate
…to tabixupdate
maxulysse
left a comment
There was a problem hiding this comment.
It's all I ever dreamed it should be
Changes
Rationale
This change was initially proposed by Maxime. bgzip-related operations are currently spread across 4 different modules with largely overlapping functionality. The modules are not consistent in their API nor behavior. This makes it hard for developers to quickly distinguish their purposes and choose the right tools. Additionally, it creates an unnecessary maintenance burden. Finally, none of the tools is placed in a correctly named subdirectory. A single, multi-purpose module will guarantee a predictable and consistent behavior, and facilitate maintenance in the future.
Design and expected behavior
Purpose
The new module can be used for:
Inputs
The module accepts the following input channels:
meta, infile, index, regions: meta map, arbitrary input file, optional tabix indexaction: explicit action input:compressordecompressmake_index: boolean, whether to generate a tabix indexout_ext: output file extension (Mahesh style, as discussed on Slack)Intended behavior
If
action = "compress", the module should:If a tabix index is passed, it is returned as-is and no new index is generated (for consistency with prior modules).
If
action = "decompress", the module should:Any other action is invalid.
Error handling
actionis notcompressordecompress, an error is raisedactionisdecompressandmake_indexis true, a warning is emittedValidation plan
New module
The following behaviors are covered by tests:
The following errors are tested:
actionThe following are not tested due to nf-test limitations:
compressis valid bgzipDeprecated modules
All tests in deprecated modules have been replaced with one normal and one stub test, with assertions compliant with the deprecation instruction.
Other affected components
All module and subworkflow CI tests that relied on the deprecated modules have been refactored to use the new module. Snapshots have been updated where relevant, verifying that changes only affect file naming. All relevant CI tests pass.
Approval plan
REVIEWER: DO NOT MERGE. Due to the substantial scope of this PR, I want to get at least 3 positive reviews from core/maintainer reviewers before merging. I also want to receive feedback from certain key people (Maxime, Rike, Famke, Luisa).
PR checklist
topic: versions- See version_topicslabelnf-core modules test <MODULE> --profile dockernf-core modules test <MODULE> --profile singularitynf-core modules test <MODULE> --profile conda