Novartis/AVIDbase

Repository for the AVIDbase nanobody-antigen dataset source code

This repository contains the Python source code used to create the AVIDbase (Antibody VHH Interaction Database) as part of the associated publication (link coming soon!).

Authors: Tadej Medved, Jurij Lah, Goran Miličić & San Hadži

AVIDbase is a curated dataset of existing nanobody-antigen crystal structures, with emphasis on reporting the "true" biological assembly (with respect to the nanobody-antigen interface). It is produced by the following workflow:

  • Information on existing PDB structures containing nanobodies in complex with other protein molecules is initially obtained from SAbDab, then the structures are downloaded from the RCSB PDB directly and input into the AVIDbase processing pipeline.
  • Putative interfaces are automatically compiled (assembly_annotation.py) into an Excel file (/example/annotation_input_table_UNMODIFIED.xlsx), then manually checked for structural or annotation errors by the authors.
  • The amended table, annotation_input_table_MANUAL.xlsx, is then used as input to assembly_generation.py, which generates the initial database and processes the structures and metadata (/example/assembly_generation_UNMODIFIED.xlsx). In this step, .pdb files of individual interfaces are extracted from the raw CIF files and standardized with respect to chain IDs and numbering; expression tags, crystallization agents, etc. are removed. All biologically relevant non-protein species, such as posttranslational modifications and cofactors, along with missing density segments near the interface, are explicitly annotated.
  • In the final step, the structures are sorted into subsets based on interface integrity (see AVIDbase-prot below), any remaining annotation errors are corrected, and structural redundancy is computed. The final data is stored in AVIDbase.xlsx.

Existing dataset availability & structure

All data and structures for the 9th Feb 2026 version of AVIDbase are available on Zenodo: 10.5281/zenodo.20488703.

Data tables (cutoff date: 9th Feb 2026):

  • annotation_input_table_MANUAL.xlsx - exhaustive manual annotation of all nanobody-antigen interfaces identified in PDB structures available in SAbDab. Columns with manual annotations are highlighted in yellow.
  • AVIDbase.xlsx - full dataset of biologically accurate nanobody-antigen interfaces. Each row corresponds to a single interface.

AVIDbase.xlsx is divided into 4 subsets:

  • AVIDbase-nr - nonredundant nanobody-antigen dataset (689 total structures).
  • AVIDbase-prot - subset of AVIDbase-nr; contains only interfaces that:
    • do not contain missing density near the interface
    • do not possess significant contacts with non-protein moieties
    • have high or medium interface integrity (see paper and SI)
    • have resolution ≤ 3.5 Å
  • AVIDbase-low - remaining low integrity nonredundant interfaces
  • AVIDbase-r - dataset of structures redundant to those in AVIDbase-nr, which are either:
    • copies of equivalent interfaces found in the asymmetric unit
    • interfaces from equivalent crystal structures with lower resolution or interface integrity (missing density/glycans/cofactors closer to the interface)
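The four AVIDbase-prot criteria listed above can be read as a simple row filter. The following is a minimal sketch, not the pipeline's actual implementation: interface_integrity is a column computed by the pipeline, but its string values ("high", "medium", "low") and all other column names here are assumptions chosen for illustration, and the rows are toy data.

```python
from typing import Mapping

MAX_RESOLUTION = 3.5  # in Å, taken from the AVIDbase-prot criteria


def is_prot_candidate(row: Mapping[str, object]) -> bool:
    """Return True if a nonredundant row passes all four AVIDbase-prot criteria."""
    return (
        not row["missing_density_near_interface"]  # hypothetical column name
        and not row["nonprotein_contacts"]  # hypothetical column name
        and row["interface_integrity"] in {"high", "medium"}  # values assumed
        and float(row["resolution"]) <= MAX_RESOLUTION  # hypothetical column name
    )


# Toy rows with placeholder IDs; these are not real AVIDbase entries.
rows = [
    {"id": "A", "missing_density_near_interface": False,
     "nonprotein_contacts": False, "interface_integrity": "high", "resolution": 2.1},
    {"id": "B", "missing_density_near_interface": False,
     "nonprotein_contacts": False, "interface_integrity": "low", "resolution": 1.8},
    {"id": "C", "missing_density_near_interface": False,
     "nonprotein_contacts": False, "interface_integrity": "medium", "resolution": 3.9},
]

prot_ids = [row["id"] for row in rows if is_prot_candidate(row)]
print(prot_ids)
```

Only row A passes: B fails on integrity and C on resolution.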

Environment setup

Currently tested platforms:

  • linux-64
  • win-64

conda create -n avidbase
conda activate avidbase
conda install --file requirements.txt

Additional requirements:

  • Rosetta 3.x

Running the code

Below is a step-by-step guide for recreating the final AVIDbase dataset.

For example output structures, see /example.

Step 1: Automatic assembly annotation

Generates an Excel table based on the given input PDB codes. SAbDab summary files can also be passed, in which case they are pre-filtered exclusively for X-ray crystal structures. Any structures not already present in the structure directory are downloaded from the RCSB PDB.

python assembly_annotation.py -o output/intermediary_tables -s output/structures/CIF_files --sabdab-files input_files/sabdab/20260209_protein_summary.tsv input_files/sabdab/20260209_peptide_summary.tsv --verbose

All options:

python assembly_annotation.py --help

Output: output/intermediary_tables/annotation_input_table_UNMODIFIED.xlsx (for an example, see the /example directory).
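The two preparatory actions described in Step 1 (pre-filtering SAbDab summaries for X-ray crystal structures, and downloading only those structures missing from the structure directory) could look roughly like the sketch below. This is an illustration, not the code in assembly_annotation.py. It assumes the SAbDab summary has pdb and method columns with X-ray entries labelled "X-RAY DIFFRACTION", and that structures are stored as <pdb_id>.cif; the function names and the toy PDB IDs are hypothetical.

```python
import csv
import io
import tempfile
from pathlib import Path

# Public RCSB download endpoint for mmCIF files.
RCSB_URL = "https://files.rcsb.org/download/{pdb_id}.cif"


def xray_pdb_ids(summary_tsv: str) -> list[str]:
    """Return the unique PDB IDs of X-ray entries in a SAbDab-style TSV string."""
    reader = csv.DictReader(io.StringIO(summary_tsv), delimiter="\t")
    ids = {
        row["pdb"].lower()
        for row in reader
        if row["method"].strip().upper() == "X-RAY DIFFRACTION"  # assumed label
    }
    return sorted(ids)


def missing_downloads(pdb_ids: list[str], structure_dir: Path) -> dict[str, str]:
    """Map each PDB ID without a local CIF file to its RCSB download URL."""
    return {
        pdb_id: RCSB_URL.format(pdb_id=pdb_id.upper())
        for pdb_id in pdb_ids
        if not (structure_dir / f"{pdb_id}.cif").exists()  # file naming assumed
    }


# Toy summary with placeholder IDs (not real SAbDab content).
toy_summary = (
    "pdb\tmethod\n"
    "1abc\tX-RAY DIFFRACTION\n"
    "2xyz\tELECTRON MICROSCOPY\n"
    "1abc\tX-RAY DIFFRACTION\n"
)
ids = xray_pdb_ids(toy_summary)

with tempfile.TemporaryDirectory() as tmp:
    structure_dir = Path(tmp)
    (structure_dir / "1abc.cif").touch()  # pretend this one is already present
    todo = missing_downloads(["1abc", "3def"], structure_dir)

print(ids)
print(todo)
```

The sketch only builds the list of URLs; the download itself is left out so the example runs offline.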

Step 1.5: Manual review of assembly annotation

Before proceeding to Step 2, it is recommended to manually assess the correctness of the assigned biological interfaces and to modify or exclude rows where necessary.

An example is given in annotation_input_table_MANUAL.xlsx; columns expected to be modified by the user are highlighted in yellow. For an explanation of the column names, see Supplementary Table S1 of the associated paper.

Specific rows are excluded from further consideration by changing the value in the keep column from 1 to 0.
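In effect, the keep column acts as a row mask for the next step. A minimal sketch of that behaviour, assuming the table has been loaded as a list of dictionaries; the interface_id column and its values are hypothetical, and only keep comes from the table described above:

```python
def rows_to_process(rows: list[dict]) -> list[dict]:
    """Return only the rows whose keep flag is 1."""
    return [row for row in rows if int(row["keep"]) == 1]


# Toy rows; interface_id is a hypothetical column used for illustration.
annotated = [
    {"interface_id": "entry_1", "keep": 1},
    {"interface_id": "entry_2", "keep": 0},  # excluded during manual review
    {"interface_id": "entry_3", "keep": 1},
]

kept = [row["interface_id"] for row in rows_to_process(annotated)]
print(kept)
```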

Step 2: Generation of dataset table & standardization of structures

This step extracts relevant metadata for each structure record and writes the standardized PDB structures into a single directory. It also computes interface_integrity and intra_redundancy, but does not sort or filter the dataset.

python assembly_generation.py -f annotation_input_table_MANUAL.xlsx -o output/intermediary_tables -s output/structures/CIF_files --output-structure-dir output/structures/AVIDbase --output-structures --verbose

All options:

python assembly_generation.py --help

Output: output/intermediary_tables/assembly_generation_UNMODIFIED.xlsx (for an example, see the /example directory).

Step 3: Compilation of final dataset

This step computes inter_redundancy and epitope_cluster, then sorts the full dataset into the nonredundant (AVIDbase-nr), redundant (AVIDbase-r), high- and medium-integrity protein-only (AVIDbase-prot), and low-integrity/low-resolution (AVIDbase-low) subsets.

python assembly_sorting.py -f output/intermediary_tables/assembly_generation_UNMODIFIED.xlsx -o "." -s output/structures/AVIDbase --output-structures --verbose

To rerun the computation when the structures have already been sorted:

python assembly_sorting.py -f output/intermediary_tables/assembly_generation_UNMODIFIED.xlsx -o "." -s output/structures/AVIDbase/AVIDbase-nr/full --verbose

All options:

python assembly_sorting.py --help

Output: AVIDbase.xlsx (individual sheets: AVIDbase-nr, AVIDbase-r, AVIDbase-prot, AVIDbase-low)

Step 4: Rosetta relax

This step relaxes the protein-only subset AVIDbase-prot with a custom Rosetta XML script. The input file avidbase_prot_list.txt should contain the path to each structure in output/structures/AVIDbase/AVIDbase-nr/AVIDbase-prot, one path per line. See:

python pdblist_gen.py -f avidbase_prot_list.txt -pdb output/structures/AVIDbase/AVIDbase-nr/AVIDbase-prot
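For readers who want to see what such a list file looks like, the sketch below writes one .pdb path per line for a given directory. It illustrates the expected file format only and is not the code in pdblist_gen.py; the function name and the toy file names are hypothetical.

```python
import tempfile
from pathlib import Path


def write_pdb_list(pdb_dir: Path, list_file: Path) -> list[str]:
    """Write the paths of all .pdb files in pdb_dir to list_file, one per line."""
    paths = sorted(str(path) for path in pdb_dir.glob("*.pdb"))
    list_file.write_text("".join(f"{path}\n" for path in paths))
    return paths


# Demonstration on a temporary directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    pdb_dir = Path(tmp) / "AVIDbase-prot"
    pdb_dir.mkdir()
    for name in ("b.pdb", "a.pdb", "notes.txt"):
        (pdb_dir / name).touch()

    list_file = Path(tmp) / "avidbase_prot_list.txt"
    written = write_pdb_list(pdb_dir, list_file)
    lines = list_file.read_text().splitlines()

print([Path(line).name for line in lines])
```

Files without the .pdb extension are skipped, and the paths are sorted so the list is reproducible between runs.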

(Linux or WSL)

ROSETTASCRIPTS=<path to RosettaScripts executable>
ROSETTA_N_CPUS=<number of CPUs for parallelization (if applicable)>

mpiexec -np $ROSETTA_N_CPUS $ROSETTASCRIPTS -l avidbase_prot_list.txt -out:path:pdb output/structures/AVIDbase/AVIDbase-nr/AVIDbase-prot-relaxed -out:path:score output/structures/AVIDbase/AVIDbase-nr/AVIDbase-prot-relaxed -parser:protocol relax_and_score.xml @rosetta_flags.txt

(optional) Step 5: Custom calculation of epitope clusters

This step computes epitope clusters for a custom slice of AVIDbase. It can optionally output an interactive 3D plot visualizing the individual clusters (with structure IDs).

Example:

python cluster_epitopes.py -f AVIDbase.xlsx -o output/cluster_plots -s output/structures/AVIDbase/AVIDbase-nr/full --dataset-slice 'ag_name|lysozyme,ag_source_organism|gallus-gallus' --plot-clusters --verbose

All options:

python cluster_epitopes.py --help
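The --dataset-slice value in the example above appears to be a comma-separated list of column|value pairs. The sketch below shows one way such a string could be parsed and applied to rows. It is an assumption about the format, not the parser used by cluster_epitopes.py; in particular, the exact, case-insensitive matching and both function names are hypothetical.

```python
def parse_dataset_slice(spec: str) -> dict[str, str]:
    """Parse 'column|value,column|value' into a {column: value} dictionary."""
    filters: dict[str, str] = {}
    for pair in spec.split(","):
        column, separator, value = pair.partition("|")
        if not separator or not column.strip() or not value.strip():
            raise ValueError(f"Expected 'column|value', got {pair!r}")
        filters[column.strip()] = value.strip()
    return filters


def matches(row: dict, filters: dict[str, str]) -> bool:
    """Return True if the row equals every filter value (case-insensitive; assumed)."""
    return all(
        str(row.get(column, "")).lower() == value.lower()
        for column, value in filters.items()
    )


filters = parse_dataset_slice("ag_name|lysozyme,ag_source_organism|gallus-gallus")

# Toy rows for illustration only.
rows = [
    {"ag_name": "Lysozyme", "ag_source_organism": "gallus-gallus"},
    {"ag_name": "lysozyme", "ag_source_organism": "homo-sapiens"},
]
selected = [row for row in rows if matches(row, filters)]

print(filters)
print(len(selected))
```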

About

Computational pipeline for the preparation of the AVIDbase dataset of nanobody-antigen crystal structures from the PDB.
