antipasti.preprocessing package
Submodules
antipasti.preprocessing.preprocessing module
This module contains the pre-processing class.
- Authors:
Kevin Michalewicz <k.michalewicz22@imperial.ac.uk>
- class antipasti.preprocessing.preprocessing.Preprocessing(data_path='../data/', scripts_path='../scripts/', structures_path='/Users/kevinmicha/Documents/all_structures/chothia/', df='sabdab_summary_all.tsv', modes=30, chain_lengths_path='chain_lengths/', dccm_map_path='dccm_maps/', residues_path='lists_of_residues/', file_type_input='.pdb', selection='_fv', pathological=None, renew_maps=False, renew_residues=False, cmaps=False, cmaps_thr=8.0, ag_agnostic=False, affinity_entries_only=True, stage='training', test_data_path=None, test_dccm_map_path=None, test_residues_path=None, test_structure_path=None, test_pdb_id='1t66', alphafold=False, h_offset=0, l_offset=0, ag_residues=0)
Bases: object
Generates the residue normal mode correlation maps.
- Parameters:
data_path (str) – Path to the data folder.
scripts_path (str) – Path to the scripts folder.
structures_path (str) – Path to the PDB files.
df (str) – Name of the database containing the PDB entries.
modes (int) – Number of considered normal modes.
chain_lengths_path (str) – Path to the folder containing arrays with the chain lengths.
dccm_map_path (str) – Path to the normal mode correlation maps.
residues_path (str) – Path to the folder containing the list of residues per entry.
file_type_input (str) – Filename extension of input structures.
selection (str) – Considered portion of antibody chains.
pathological (list) – PDB identifiers of antibodies that need to be excluded.
renew_maps (bool) – Compute all the normal mode correlation maps.
renew_residues (bool) – Retrieve the lists of residues for each entry.
cmaps (bool) – If True, ANTIPASTI computes the contact maps of the complexes instead of the normal mode correlation maps.
cmaps_thr (float) – Thresholding distance for alpha (α) carbons to build the contact maps.
ag_agnostic (bool) – If True, normal mode correlation maps are computed in complete absence of the antigen.
affinity_entries_only (bool) – This is True in general, but the ANTIPASTI pipeline could be used for other types of projects and thus consider data without affinity values.
stage (str) – Choose between training and predicting.
test_data_path (str) – Path to the test data folder.
test_dccm_map_path (str) – Path to the test normal mode correlation maps.
test_residues_path (str) – Path to the folder containing the list of residues for a test sample.
test_structure_path (str) – Path to the test PDB file.
test_pdb_id (str) – Test PDB ID.
alphafold (bool) – True if the test structure was folded using AlphaFold.
h_offset (int) – Number of absent residues between positions 1 and 25 in the heavy chain.
l_offset (int) – Number of absent residues between positions 1 and 23 in the light chain.
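As an illustration of what cmaps and cmaps_thr control, the following is a minimal sketch of a contact map built from alpha-carbon coordinates. It is not ANTIPASTI's own implementation: the function name contact_map, the inclusive cut-off and the example coordinates are assumptions made for this sketch, and the threshold is taken to be in ångströms, the unit of PDB coordinates.

```python
import math


def contact_map(ca_coords, threshold=8.0):
    """Return a binary matrix with 1 where two alpha carbons are in contact."""
    n = len(ca_coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Assumption: the cut-off is inclusive; the library may use a strict one.
            if math.dist(ca_coords[i], ca_coords[j]) <= threshold:
                cmap[i][j] = 1
    return cmap


# Three hypothetical alpha-carbon positions on a line, 5 units apart.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
cmap = contact_map(coords, threshold=8.0)
for row in cmap:
    print(row)
```

With the default threshold, neighbouring points are in contact while the two end points, 10 units apart, are not.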
- clean_df()
Cleans the database containing the PDB entries.
- Returns:
df_pdbs (list) – PDB entries.
df_kds (list) – Binding affinities.
df (pandas.DataFrame) – Cleaned database.
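The kind of cleaning described above can be sketched with the standard library alone. This is a hedged illustration, not the body of clean_df: the helper name clean_entries, the column names pdb and affinity, the duplicate handling and the sample rows are all assumptions, and the real method returns a pandas.DataFrame rather than a list of dictionaries.

```python
import csv
import io


def clean_entries(tsv_text, pathological=(), affinity_entries_only=True):
    """Return (pdb_ids, affinities, cleaned_rows) from a tab-separated summary."""
    excluded = {p.lower() for p in pathological}
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        pdb_id = row["pdb"].lower()
        if pdb_id in excluded or pdb_id in seen:
            continue  # skip excluded structures and repeated entries
        try:
            affinity = float(row["affinity"])
        except (TypeError, ValueError):
            affinity = None  # missing or non-numeric affinity value
        if affinity_entries_only and affinity is None:
            continue
        seen.add(pdb_id)
        cleaned.append({"pdb": pdb_id, "affinity": affinity})
    pdb_ids = [r["pdb"] for r in cleaned]
    affinities = [r["affinity"] for r in cleaned]
    return pdb_ids, affinities, cleaned


# Made-up table: one duplicate, one entry without affinity, one excluded entry.
table = "pdb\taffinity\n1abc\t1.5e-09\n2xyz\tNone\n1abc\t1.5e-09\n3bad\t2.0e-08\n"
pdb_ids, kds, cleaned = clean_entries(table, pathological=["3bad"])
print(pdb_ids, kds)
```

Passing affinity_entries_only=False keeps the entry without an affinity value, which mirrors the use case mentioned for that parameter.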
- generate_fv_pdb(path, keepABC=True, lresidues=False, hupsymchain=None, lupsymchain=None)
Generates a new PDB file containing the antigen residues and the antibody variable region.
- Parameters:
path (str) – Path of a Chothia-numbered PDB file.
keepABC (bool) – Keeps residues whose name ends with a letter from ‘A’ to ‘Z’.
lresidues (bool) – The names of each residue are stored in self.residues_path.
hupsymchain (int) – Upper limit of heavy chain residues due to a change in the numbering convention. Only useful when using AlphaFold.
lupsymchain (int) – Upper limit of light chain residues due to a change in the numbering convention. Only useful when using AlphaFold.
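To make the selection of the variable region more concrete, here is a minimal sketch of filtering PDB ATOM records by chain and residue number. It relies only on the fixed-column PDB layout (chain identifier in column 22, residue number in columns 23 to 26, insertion code in column 27). The helper names, the chain identifiers H and L, and the upper limits 113 and 107 are assumptions for this example, and keep_abc is read here as "keep residues carrying an insertion code"; the actual behaviour of generate_fv_pdb may differ.

```python
def make_atom_line(chain, resseq, icode=" ", resname="ALA", name="CA", serial=1):
    """Build the fixed-width prefix of a PDB ATOM record (columns 1-27)."""
    return f"ATOM  {serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}{icode}"


def keep_in_fv(line, heavy_id="H", light_id="L", h_max=113, l_max=107, keep_abc=True):
    """Decide whether an ATOM record belongs to the antigen or the variable region."""
    if not line.startswith("ATOM"):
        return False
    chain = line[21]
    resseq = int(line[22:26])
    icode = line[26:27].strip()
    if icode and not keep_abc:
        return False  # drop residues such as 100A when insertions are not wanted
    if chain == heavy_id:
        return resseq <= h_max  # hypothetical heavy-chain upper limit
    if chain == light_id:
        return resseq <= l_max  # hypothetical light-chain upper limit
    return True  # any other chain is treated as antigen and kept


lines = [
    make_atom_line("H", 52),
    make_atom_line("H", 100, icode="A"),
    make_atom_line("H", 150),  # beyond the assumed heavy-chain limit
    make_atom_line("L", 107),
    make_atom_line("L", 120),  # beyond the assumed light-chain limit
    make_atom_line("A", 300),  # antigen chain
]
fv = [line for line in lines if keep_in_fv(line)]
print(len(fv))
```

In this sketch h_max and l_max play a role similar to the one described for hupsymchain and lupsymchain, although that correspondence is itself an assumption.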
- generate_masked_image(img, idx, test_h=None, test_l=None)
Generates a masked normal mode correlation map.
- Parameters:
- Returns:
masked (numpy.ndarray) – Masked normal mode correlation map.
mask (numpy.ndarray) – Mask itself.
- get_lists_of_lengths(selected_entries)
Retrieves lists with the lengths of the heavy and light chains.
- Parameters:
selected_entries (list) – Valid PDB entries.
- Returns:
heavy (list) – Lengths of the heavy chains. In the context of the prediction stage, this list has one element.
light (list) – Lengths of the light chains. In the context of the prediction stage, this list has one element.
selected_entries (list) – Valid PDB entries. In the context of the prediction stage, this list has one element.
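One way to obtain a heavy and a light chain length for a single structure is to count distinct residues per chain in its ATOM records. The sketch below does only that; how ANTIPASTI actually derives or stores these lengths (for instance under chain_lengths_path) is not shown in this page, and the helper names and chain identifiers H and L are assumptions.

```python
def atom_line(chain, resseq, icode=" "):
    """Build a minimal ATOM record prefix (PDB columns 1-27) for the demo."""
    return f"ATOM  {1:>5} {'CA':<4} ALA {chain}{resseq:>4}{icode}"


def chain_lengths(pdb_lines, heavy_id="H", light_id="L"):
    """Count distinct residues (number plus insertion code) on each chain."""
    residues = {heavy_id: set(), light_id: set()}
    for line in pdb_lines:
        if line.startswith("ATOM") and line[21] in residues:
            residues[line[21]].add((int(line[22:26]), line[26:27]))
    return len(residues[heavy_id]), len(residues[light_id])


# Hypothetical records: residue 1 appears twice (two atoms), 2 and 2A are distinct.
records = [
    atom_line("H", 1),
    atom_line("H", 1),
    atom_line("H", 2),
    atom_line("H", 2, "A"),
    atom_line("L", 1),
    atom_line("A", 5),  # antigen chain, ignored
]
heavy, light = chain_lengths(records)
print(heavy, light)
```

For the prediction stage, wrapping each value in a one-element list would match the shape described in the return values above.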
- initialisation(renew_maps, renew_residues)
Computes the normal mode correlation maps and retrieves lists with the lengths of the heavy and light chains.
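For orientation, a normal mode correlation map between residues i and j is commonly defined as below. This is the textbook normal mode expression, given here as an assumption about what the maps contain; whether the ANTIPASTI scripts use exactly this weighting and normalisation is not stated in this page. M stands for the number of modes (the modes parameter), \lambda_k for the k-th nonzero eigenvalue of the Hessian, and \mathbf{u}_k^{(i)} for the three components of the k-th eigenvector that belong to residue i.

```latex
C_{ij} =
\frac{\sum_{k=1}^{M} \lambda_k^{-1}\, \mathbf{u}_k^{(i)} \cdot \mathbf{u}_k^{(j)}}
     {\sqrt{\sum_{k=1}^{M} \lambda_k^{-1}\, \lVert \mathbf{u}_k^{(i)} \rVert^{2}}\;
      \sqrt{\sum_{k=1}^{M} \lambda_k^{-1}\, \lVert \mathbf{u}_k^{(j)} \rVert^{2}}}
```

Under this definition every entry lies between -1 and 1, and the diagonal entries equal 1.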
Module contents
This subpackage contains preprocessing classes and functions.