Skip to content

Add HTSLIB/BGZIPTABIX and deprecate redundant modules#11571

Open
itrujnara wants to merge 33 commits into
nf-core:masterfrom
itrujnara:tabixupdate
Open

Add HTSLIB/BGZIPTABIX and deprecate redundant modules#11571
itrujnara wants to merge 33 commits into
nf-core:masterfrom
itrujnara:tabixupdate

Conversation

@itrujnara
Copy link
Copy Markdown
Contributor

⚠️ Heads up: This PR contains numerous significant changes. Proper review might take time.

Changes

  • Added HTSLIB/BGZIPTABIX module for bgzip compression, decompression, and indexing
  • Deprecated SAMTOOLS/BGZIP
  • Deprecated TABIX/BGZIP
  • Deprecated TABIX/BGZIPTABIX
  • Deprecated TABIX/TABIX
  • Replaced all deprecated modules with HTSLIB/BGZIPTABIX in subworkflows and CI setup blocks

⚠️ Breaking changes

  • HTSLIB/BGZIPTABIX is not an exact drop-in replacement for any of the deprecated modules; some channel tweaks and/or additional inputs are required
  • Automatic renaming from TABIX/TABIX was not ported; pipelines must control the naming of its outputs through configs

Rationale

This change was initially proposed by Maxime. bgzip-related operations are currently spread across 4 different modules with largely overlapping functionality. The modules are not consistent in their API nor behavior. This makes it hard for developers to quickly distinguish their purposes and choose the right tools. Additionally, it creates an unnecessary maintenance burden. Finally, none of the tools is placed in a correctly named subdirectory. A single, multi-purpose module will guarantee a predictable and consistent behavior, and facilitate maintenance in the future.

Design and expected behavior

Purpose

The new module can be used for:

  • compressing files to bgzip
  • converting other compression types to bgzip (currently supported: gz, bz2, xz)
  • decompressing bgzip and several other formats
  • indexing compressed tabular files with tabix

Inputs

The module accepts the following input channels:

  • meta, infile, index, regions: meta map, arbitrary input file, optional tabix index
  • action: explicit action input: compress or decompress
  • make_index: boolean, whether to generate a tabix index
  • out_ext: output file extension (Mahesh style, as discussed on Slack)

Intended behavior

If action = "compress", the module should:

  • compress an uncompressed file with bgzip
  • decompress a non-bgzip compressed file and recompress with bgzip
  • return a bgzip-compressed file as-is
  • if requested, index the resulting bgzip with tabix
  • if regions are passed, performs reindexing with tabix
    If a tabix index is passed, it is returned as-is and no new index is generated (for consistency with prior modules).

If action = "decompress", the module should:

  • decompress a bgzip or other compressed file
  • return an uncompressed file as-is
  • not create any index

Any other action is invalid.

Error handling

  • If action is not compress or decompress, an error is raised
  • If action is decompress and make_index is true, a warning is emitted

Validation plan

New module

The following behaviors are covered by tests:

  • Decompressing an uncompressed file (no-op)
  • Compressing an uncompressed file with indexing
  • Compressing and indexing with regions input
  • Decompressing a bgzip
  • Compressing a bgzip (no-op)
  • Decompressing a gzip
  • Recompressing a gzip to bgzip
  • Decompressing a bzip2
  • Recompressing a bzip2 to bgzip
  • Decompressing an xzip
  • Recompressing an xzip to bgzip

The following errors are tested:

  • Invalid action
  • Name clash in gzip → bgzip conversion

The following are not tested due to nf-test limitations:

  • The output of compress is valid bgzip
  • The output is bit-identical to the input in case of a no-op

Deprecated modules

All tests in deprecated modules have been replaced with one normal and one stub test, with assertions compliant with the deprecation instruction.

Other affected components

All module and subworkflow CI tests that relied on the deprecated modules have been refactored to use the new module. Snapshots have been updated where relevant, verifying that changes only affect file naming. All relevant CI tests pass.

Approval plan

REVIEWER: DO NOT MERGE. Due to the substantial scope of this PR, I want to get at least 3 positive reviews from core/maintainer reviewers before merging. I also want to receive feedback from certain key people (Maxime, Rike, Famke, Luisa).

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs
  • Remove all TODO statements.
  • Broadcast software version numbers to topic: versions - See version_topics
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • nf-core modules test <MODULE> --profile docker
      • nf-core modules test <MODULE> --profile singularity
      • nf-core modules test <MODULE> --profile conda

itrujnara added 10 commits May 7, 2026 08:29
…oved compatibility and add new tests for name clash handling
- Updated tests across multiple modules to include empty input arrays for compatibility.
- Introduced new test cases for compressing VCF files with regions and indexing.
- Deprecated the tabix module, replacing its functionality with HTSLIB/BGZIPTABIX.
- Adjusted workflow and test configurations to reflect the new module structure.
- Updated metadata and tags in various subworkflows to align with the new HTSLIB module.
@itrujnara itrujnara requested a review from maxulysse as a code owner May 8, 2026 13:28
@itrujnara itrujnara changed the title Tabixupdate Add HTSLIB/BGZIPTABIX and deprecate redundant modules May 8, 2026
@itrujnara
Copy link
Copy Markdown
Contributor Author

@nf-core-bot fix linting

@itrujnara itrujnara added new module Adding a new module size/xl labels May 8, 2026
@SPPearce
Copy link
Copy Markdown
Contributor

SPPearce commented May 8, 2026

Automatic renaming from TABIX/TABIX was not ported; pipelines must control the naming of its outputs through configs
Can you elaborate on this?

@itrujnara
Copy link
Copy Markdown
Contributor Author

itrujnara commented May 8, 2026

Automatic renaming from TABIX/TABIX was not ported; pipelines must control the naming of its outputs through configs
Can you elaborate on this?

The latest version of TABIX/TABIX had a piece of code (main.nf line 24) that sometimes made it transplant additional parts of the compressed file name (e.g. if the input file is called test.annotated.vcf.gz, it would call the index ${prefix}.annotated.vcf.gz.tbi). I am not sure if this was intentional. Since this module has a broader functionality, I decided to use a more standardized procedure, where the pipeline determines the prefix through configuration and suffix through module input. This adds a tiny bit of extra work for pipeline maintainers, but should not cause major issues.

Copy link
Copy Markdown
Contributor

@famosab famosab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My few cents, I like it generally. I think I need to take another look at the module logic but my main point is that I am not entirely convinced that having only one module is the way to go here, I would rather have one for the compress and one for the decompress direction but that might be personal preference.

Comment on lines +12 to +14
val action
val make_index
val out_ext
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the reason to have them all separated in channels? I feel like they could be in a tuple but thats not a strong opinion. in the pipeline i would control these values via the meta map anyway and set this then in the modules config I think

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like the practice with the most support in the community is to have this type of "setting" inputs is to have them as separate val channels. Some (e.g. Simon) would push much of this into the config, but it's not fully accepted as it stands.

tuple val(meta), path("${outfile}") , emit: output
tuple val(meta), path("${outfile}.{tbi,csi}"), emit: index , optional: true
// all htslib tools have the same version, we use bgzip
tuple val("${task.process}"), val('htslib'), eval("bgzip --version | sed '1! d; s/bgzip (htslib) //'"), topic: versions, emit: versions_htslib
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should add the xz version here as well

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use

nextflow lint -format -sort-declarations -spaces 4 -harshil-alignment

on this file to clean this up nicely. (mainly setting {}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, it broke Harshil alignment though

Comment on lines +34 to +37
- infile_tbi:
type: file
description: Optional tabix index for the input file. Required when decompressing
and creating an index for the output file.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need an index when we decompress? 🤔

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, too much tab-driven development, fixed it

- action:
type: string
description: Action to perform, either `compress` or `decompress`
- make_index:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would argue that we want this index to always be created automatically and that its just either used then in the pipeline downstream with some kind of .join or ignored. But in most cases I think the index is needed rather than not needed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to remove this option, but it causes issues in cases where the format is legitimate bgzip input, but not supported by tabix (i.e. any non-tabular file). I know it's a minor use case, but I see no other way to make it a comprehensive bgzip module.

ontologies:
- edam: http://edamontology.org/format_3475 # TSV
- edam: http://edamontology.org/format_3003 # BED
- action:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not 100% sure if I like the approach that one module does both directions, i feel like having one compress and one decompress module would make it a bit easier to reuse but I see the maintenance burden again.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a philosophical question, we can discuss it with the others on Slack. It was Maxime's idea to merge everything into one module and I think it makes sense maintenance-wise (without putting that much burden on pipeline developers).

touch ${prefix}.paraphase.bam
touch ${prefix}.paraphase.bam.bai
echo '' | gzip > ${prefix}_paraphase_vcfs/${prefix}_stub.vcf.gz
echo "" | gzip > ${prefix}_paraphase_vcfs/${prefix}_stub.vcf.gz
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
echo "" | gzip > ${prefix}_paraphase_vcfs/${prefix}_stub.vcf.gz
echo | gzip > ${prefix}_paraphase_vcfs/${prefix}_stub.vcf.gz

why is there something added to the prefix here 🤔

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No idea, I did not make this module, only changed the quotes to fix a linting failure

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You suggestion fails in nf-core modules lint

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah it should not, i already updated that in the docs und have an open issue with tools but then leave it as it was before

Copy link
Copy Markdown
Member

@maxulysse maxulysse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's all I ever dreamed it should be

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

new module Adding a new module size/xl

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants