Repository for the AVIDbase nanobody-antigen dataset source code

This repository contains the Python source code used to create the AVIDbase (Antibody VHH Interaction Database) as part of the associated publication (link coming soon!).

Authors: Tadej Medved, Jurij Lah, Goran Miličić & San Hadži

AVIDbase is a curated dataset of existing nanobody-antigen crystal structures, with emphasis on reporting the "true" biological assembly (with respect to the nanobody-antigen interface). It is built with the following workflow:

Information on existing PDB structures containing nanobodies in complex with other protein molecules is initially obtained from SAbDab, then the structures are downloaded from the RCSB PDB directly and input into the AVIDbase processing pipeline.
Putative interfaces are automatically compiled (assembly_annotation.py) into an Excel file (/example/annotation_input_table_UNMODIFIED.xlsx), then manually checked for structural or annotation errors by the authors.
The amended table, annotation_input_table_MANUAL.xlsx, is then used as input to assembly_generation.py to generate the initial database and process the structures and metadata (/example/assembly_generation_UNMODIFIED.xlsx). In this step, .pdb files of individual interfaces are extracted from the raw CIF files and standardized with respect to chain IDs and numbering, with expression tags, crystallization agents, etc. removed. All biologically relevant non-protein species, such as posttranslational modifications and cofactors, along with missing density segments near the interface, are explicitly annotated.
In the final step, the structures are sorted into subsets based on interface integrity (see AVIDbase-prot below), any annotation errors are corrected, and structural redundancy is computed. The final data are stored in AVIDbase.xlsx.

Existing dataset availability & structure

All data and structures for the 9th Feb 2026 version of AVIDbase are available on Zenodo: 10.5281/zenodo.20488703.

Data tables (cutoff date: 9th Feb 2026):

annotation_input_table_MANUAL.xlsx - exhaustive manual annotation of all nanobody-antigen interfaces identified in PDB structures available in SAbDab. Columns with manual annotations are highlighted in yellow.
AVIDbase.xlsx - full dataset of biologically accurate nanobody-antigen interfaces. Each row corresponds to a single interface.

AVIDbase.xlsx is divided into 4 subsets:

AVIDbase-nr - nonredundant nanobody-antigen dataset (689 total structures).
AVIDbase-prot - subset of AVIDbase-nr; contains only interfaces that:
- do not contain missing density near the interface
- do not possess significant contacts with non-protein moieties
- have high or medium interface integrity (see paper and SI)
- have resolution <= 3.5 Å
AVIDbase-low - remaining low integrity nonredundant interfaces
AVIDbase-r - structures redundant to those in AVIDbase-nr, either:
- copies of equivalent interfaces found in the asymmetric unit
- interfaces from equivalent crystal structures with lower resolution or interface integrity (missing density/glycans/cofactors closer to the interface)
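The AVIDbase-prot criteria listed above can be read as a simple conjunction of four checks. The sketch below illustrates that reading only; it is not the logic in assembly_sorting.py. The class and all field names except interface_integrity are hypothetical, and the integrity labels "high", "medium" and "low" are assumed spellings.

```python
from dataclasses import dataclass


@dataclass
class Interface:
    # Hypothetical field names; they are not AVIDbase.xlsx column names.
    missing_density_near_interface: bool
    nonprotein_contacts: bool
    interface_integrity: str  # assumed labels: "high", "medium" or "low"
    resolution: float  # in angstroms


def qualifies_for_prot(iface: Interface, max_resolution: float = 3.5) -> bool:
    """Return True if an interface meets all four AVIDbase-prot criteria."""
    return (
        not iface.missing_density_near_interface
        and not iface.nonprotein_contacts
        and iface.interface_integrity in {"high", "medium"}
        and iface.resolution <= max_resolution
    )
```

Under this reading, a nonredundant interface that fails any one check would fall outside AVIDbase-prot.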

Environment setup

Currently tested platforms:

linux-64
win-64

conda create -n avidbase
conda activate avidbase
conda install --file requirements.txt

Additional requirements:

Rosetta 3.x

Running the code

Below is a step-by-step guide for recreating the final AVIDbase dataset.

For example output structures see /example.

Step 1: Automatic assembly annotation

Generates an Excel table based on the given input PDB codes. SAbDab summary files can also be passed, in which case they are pre-filtered exclusively for X-ray crystal structures. Any structures not already present in the structure directory are downloaded from the RCSB PDB.

python assembly_annotation.py -o output/intermediary_tables -s output/structures/CIF_files --sabdab-files input_files/sabdab/20260209_protein_summary.tsv input_files/sabdab/20260209_peptide_summary.tsv --verbose

All options:

python assembly_annotation.py --help

Output: output/intermediary_tables/annotation_input_table_UNMODIFIED.xlsx (for example see /example directory).
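To illustrate the X-ray pre-filtering of SAbDab summary files mentioned in this step, here is a minimal standalone sketch. It is not the code in assembly_annotation.py. It assumes a tab-separated file with "pdb" and "method" columns and that crystal structures contain "X-RAY" in the method field; the function name is hypothetical.

```python
import csv
import io


def xray_pdb_codes(tsv_text: str) -> list[str]:
    """Collect unique PDB codes of X-ray entries, preserving first-seen order."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    codes: list[str] = []
    for row in reader:
        code = row["pdb"].lower()
        # Assumption: crystal structures are labelled with a method containing "X-RAY".
        if "X-RAY" in row["method"].upper() and code not in codes:
            codes.append(code)
    return codes


# Tiny made-up table for demonstration; these are not real SAbDab rows.
sample = (
    "pdb\tmethod\n"
    "1abc\tX-RAY DIFFRACTION\n"
    "2xyz\tELECTRON MICROSCOPY\n"
    "1abc\tX-RAY DIFFRACTION\n"
)
print(xray_pdb_codes(sample))
```

Duplicates are dropped because a summary file can list one PDB entry on several rows.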

Step 1.5: Manual review of assembly annotation

Before proceeding to step 2, it's recommended to manually assess the correctness of the assigned biological interfaces and modify/exclude rows where necessary.

An example is given in annotation_input_table_MANUAL.xlsx; columns expected to be modified by the user are highlighted in yellow. For an explanation of the column names, see Supplementary Table S1 of the associated paper.

Specific rows are excluded from further consideration by setting the keep column from 1 to 0.
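As a rough illustration of how the keep column acts as a filter, the sketch below drops rows marked 0 from a list of row dictionaries. This is an assumption about the effect, not the repository's implementation; in practice the rows would come from the Excel table, and the "pdb_id" key in the usage example is hypothetical.

```python
def rows_to_process(rows: list[dict]) -> list[dict]:
    """Return only the rows whose `keep` value is 1."""
    return [row for row in rows if int(row["keep"]) == 1]


# Made-up rows for demonstration; "pdb_id" is a hypothetical column name.
table = [
    {"pdb_id": "1abc", "keep": 1},
    {"pdb_id": "2xyz", "keep": 0},
]
kept = rows_to_process(table)
```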

Step 2: Generation of dataset table & standardization of structures

This step extracts relevant metadata for each structure record and writes standardized PDB structures into a single directory. It also computes interface_integrity and intra_redundancy, but does not sort or filter the dataset.

python assembly_generation.py -f annotation_input_table_MANUAL.xlsx -o output/intermediary_tables -s output/structures/CIF_files --output-structure-dir output/structures/AVIDbase --output-structures --verbose

All options:

python assembly_generation.py --help

Output: output/intermediary_tables/assembly_generation_UNMODIFIED.xlsx (for example see /example directory).

Step 3: Compilation of final dataset

Computes inter_redundancy and epitope_cluster, then sorts the full dataset into nonredundant (AVIDbase-nr), redundant (AVIDbase-r), high+medium integrity protein-only (AVIDbase-prot), and low integrity/low resolution (AVIDbase-low) subsets.

python assembly_sorting.py -f output/intermediary_tables/assembly_generation_UNMODIFIED.xlsx -o "." -s output/structures/AVIDbase --output-structures --verbose

Rerunning the computation when structures were already sorted:

python assembly_sorting.py -f output/intermediary_tables/assembly_generation_UNMODIFIED.xlsx -o "." -s output/structures/AVIDbase/AVIDbase-nr/full --verbose

All options:

python assembly_sorting.py --help

Output: AVIDbase.xlsx (individual sheets: AVIDbase-nr, AVIDbase-r, AVIDbase-prot, AVIDbase-low)

Step 4: Rosetta relax

Relaxes the protein-only subset AVIDbase-prot with a custom Rosetta XML script. The input avidbase_prot_list.txt should contain paths to each structure in output/structures/AVIDbase/AVIDbase-nr/AVIDbase-prot, one path per line. See pdblist_gen.py:

python pdblist_gen.py -f avidbase_prot_list.txt -pdb output/structures/AVIDbase/AVIDbase-nr/AVIDbase-prot
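For reference, a list file in the format described above (one structure path per line) could be produced as sketched below. This is a standalone illustration and not the contents of pdblist_gen.py; the function name and the "*.pdb" pattern are assumptions.

```python
from pathlib import Path


def write_pdb_list(pdb_dir: str, list_file: str) -> int:
    """Write the sorted paths of all .pdb files in pdb_dir, one per line.

    Returns the number of paths written.
    """
    paths = sorted(Path(pdb_dir).glob("*.pdb"))
    Path(list_file).write_text("".join(f"{path}\n" for path in paths))
    return len(paths)
```

Sorting keeps the list stable between runs, which makes it easier to compare Rosetta outputs.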

(Linux or WSL)

ROSETTASCRIPTS=<path to RosettaScripts executable>
ROSETTA_N_CPUS=<number of CPUs for parallelization (if applicable)>

mpiexec -np $ROSETTA_N_CPUS $ROSETTASCRIPTS -l avidbase_prot_list.txt -out:path:pdb output/structures/AVIDbase/AVIDbase-nr/AVIDbase-prot-relaxed -out:path:score output/structures/AVIDbase/AVIDbase-nr/AVIDbase-prot-relaxed -parser:protocol relax_and_score.xml @rosetta_flags.txt

(optional) Step 5: Custom calculation of epitope clusters

For custom filtering of AVIDbase. Optionally outputs an interactive 3D plot visualizing the individual clusters (with structure IDs).

Example:

python cluster_epitopes.py -f AVIDbase.xlsx -o output/cluster_plots -s output/structures/AVIDbase/AVIDbase-nr/full --dataset-slice 'ag_name|lysozyme,ag_source_organism|gallus-gallus' --plot-clusters --verbose

All options:

python cluster_epitopes.py --help
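The --dataset-slice value in the example above looks like a comma-separated list of column|value pairs. Assuming that format (check --help for the authoritative description), the sketch below shows how such a string could be parsed and applied to a row. Both function names are hypothetical and this is not the parser used by cluster_epitopes.py.

```python
def parse_dataset_slice(spec: str) -> dict[str, str]:
    """Parse 'col|value,col|value' into a {column: value} mapping."""
    filters: dict[str, str] = {}
    for pair in spec.split(","):
        column, sep, value = pair.partition("|")
        if not sep or not column.strip() or not value.strip():
            raise ValueError(f"expected 'column|value', got {pair!r}")
        filters[column.strip()] = value.strip()
    return filters


def matches(row: dict[str, str], filters: dict[str, str]) -> bool:
    """Return True if the row equals the requested value in every column."""
    return all(row.get(column) == value for column, value in filters.items())


filters = parse_dataset_slice("ag_name|lysozyme,ag_source_organism|gallus-gallus")
```

Exact string equality is used here for simplicity; the actual script may match values differently.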
