antipasti.preprocessing package

Submodules

antipasti.preprocessing.preprocessing module

This module contains the pre-processing class.

Authors:

Kevin Michalewicz <k.michalewicz22@imperial.ac.uk>

class antipasti.preprocessing.preprocessing.Preprocessing(data_path='../data/', scripts_path='../scripts/', structures_path='/Users/kevinmicha/Documents/all_structures/chothia/', df='sabdab_summary_all.tsv', modes=30, chain_lengths_path='chain_lengths/', dccm_map_path='dccm_maps/', residues_path='lists_of_residues/', file_type_input='.pdb', selection='_fv', pathological=None, renew_maps=False, renew_residues=False, cmaps=False, cmaps_thr=8.0, ag_agnostic=False, affinity_entries_only=True, stage='training', test_data_path=None, test_dccm_map_path=None, test_residues_path=None, test_structure_path=None, test_pdb_id='1t66', alphafold=False, h_offset=0, l_offset=0, ag_residues=0)[source]

Bases: object

Generates the residue normal mode correlation maps.

Parameters:
  • data_path (str) – Path to the data folder.

  • scripts_path (str) – Path to the scripts folder.

  • structures_path (str) – Path to the PDB files.

  • df (str) – Name of the database containing the PDB entries.

  • modes (int) – Number of considered normal modes.

  • chain_lengths_path (str) – Path to the folder containing arrays with the chain lengths.

  • dccm_map_path (str) – Path to the normal mode correlation maps.

  • residues_path (str) – Path to the folder containing the list of residues per entry.

  • file_type_input (str) – Filename extension of input structures.

  • selection (str) – Considered portion of antibody chains.

  • pathological (list) – PDB identifiers of antibodies that need to be excluded.

  • renew_maps (bool) – Compute all the normal mode correlation maps.

  • renew_residues (bool) – Retrieve the lists of residues for each entry.

  • cmaps (bool) – If True, ANTIPASTI computes the contact maps of the complexes instead of the normal mode correlation maps.

  • cmaps_thr (float) – Distance threshold between alpha (α) carbons used to build the contact maps.

  • ag_agnostic (bool) – If True, normal mode correlation maps are computed in the complete absence of the antigen.

  • affinity_entries_only (bool) – If True (the default), only entries with affinity values are considered. Set it to False when the ANTIPASTI pipeline is used for other types of projects that include data without affinity values.

  • stage (str) – Choose between training and predicting.

  • test_data_path (str) – Path to the test data folder.

  • test_dccm_map_path (str) – Path to the test normal mode correlation maps.

  • test_residues_path (str) – Path to the folder containing the list of residues for a test sample.

  • test_structure_path (str) – Path to the test PDB file.

  • test_pdb_id (str) – Test PDB ID.

  • alphafold (bool) – True if the test structure was folded using AlphaFold.

  • h_offset (int) – Number of absent residues between positions 1 and 25 in the heavy chain.

  • l_offset (int) – Number of absent residues between positions 1 and 23 in the light chain.
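
To make the cmaps and cmaps_thr parameters concrete, the following is a minimal sketch of a thresholded contact map built from alpha-carbon coordinates. It is not taken from the ANTIPASTI source: the function name contact_map, the strict "less than" comparison and the toy coordinates are assumptions made for illustration only.

```python
import numpy as np


def contact_map(ca_coords, threshold=8.0):
    """Binary contact map from alpha-carbon coordinates (hypothetical helper).

    ca_coords: array of shape (n_residues, 3).
    threshold: distance below which two residues count as in contact
               (plays the role of cmaps_thr; the strict '<' is an assumption).
    """
    # Pairwise difference vectors, shape (n, n, 3).
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    # Euclidean distance between every pair of residues.
    dist = np.linalg.norm(diff, axis=-1)
    return (dist < threshold).astype(float)


# Toy example: four residues spaced 3.8 units apart along a line.
ca_coords = np.array(
    [[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0], [11.4, 0.0, 0.0]]
)
cmap = contact_map(ca_coords, threshold=8.0)
```

With these toy coordinates, the first residue is in contact with itself and the next two residues, but not with the fourth one.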

clean_df()[source]

Cleans the database containing the PDB entries.

Returns:

  • df_pdbs (list) – PDB entries.

  • df_kds (list) – Binding affinities.

  • df (pandas.DataFrame) – Cleaned database.
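
The kind of filtering that a cleaning step like this might perform can be sketched with the standard library alone. This is an illustration under stated assumptions, not the actual implementation: the column names pdb and affinity, the helper name clean_entries, the removal of duplicates and the PDB identifiers 2abc and 3xyz are all hypothetical, and plain lists of dictionaries stand in for the pandas.DataFrame.

```python
import csv
import io


def clean_entries(tsv_text, pathological=(), affinity_entries_only=True):
    """Filter a tab-separated summary table (hypothetical helper).

    Returns the kept PDB identifiers, their affinities and the kept rows.
    """
    pdbs, kds, rows = [], [], []
    seen = set()
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        pdb_id = row["pdb"]  # assumed column name
        # Skip duplicates and entries flagged for exclusion.
        if pdb_id in seen or pdb_id in pathological:
            continue
        try:
            kd = float(row["affinity"])  # assumed column name
        except ValueError:
            kd = None  # affinity missing or not numeric
        if kd is None and affinity_entries_only:
            continue
        seen.add(pdb_id)
        pdbs.append(pdb_id)
        kds.append(kd)
        rows.append(row)
    return pdbs, kds, rows


# Toy table: one duplicate, one entry without affinity, one excluded entry.
tsv_text = "pdb\taffinity\n1t66\t1e-9\n1t66\t1e-9\n2abc\tNone\n3xyz\t5e-8\n"
pdbs, kds, rows = clean_entries(tsv_text, pathological=["3xyz"])
```

In this toy run only 1t66 survives: the duplicate row is dropped, 2abc has no numeric affinity and 3xyz is listed as pathological.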

generate_fv_pdb(path, keepABC=True, lresidues=False, hupsymchain=None, lupsymchain=None)[source]

Generates a new PDB file containing the antigen residues and the antibody variable region.

Parameters:
  • path (str) – Path of a Chothia-numbered PDB file.

  • keepABC (bool) – Keeps residues whose name ends with a letter from ‘A’ to ‘Z’.

  • lresidues (bool) – If True, the names of the residues are stored in self.residues_path.

  • hupsymchain (int) – Upper limit of heavy chain residues due to a change in the numbering convention. Only useful when using AlphaFold.

  • lupsymchain (int) – Upper limit of light chain residues due to a change in the numbering convention. Only useful when using AlphaFold.

generate_maps()[source]

Generates the normal mode correlation maps.
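
For readers unfamiliar with normal mode correlation maps, the sketch below shows one common way to obtain a residue-by-residue correlation matrix from normal modes, using a simple anisotropic network model (ANM) in NumPy. This is a swapped-in technique for illustration and is not claimed to be what generate_maps does internally; the function name anm_correlation_map, the cutoff of 15.0 and the random coordinates are assumptions. Only the idea of keeping a fixed number of modes mirrors the modes parameter.

```python
import numpy as np


def anm_correlation_map(coords, modes=30, cutoff=15.0):
    """Residue correlation map from the lowest ANM normal modes (sketch)."""
    n = coords.shape[0]
    hessian = np.zeros((3 * n, 3 * n))
    for i in range(n):
        for j in range(i + 1, n):
            d = coords[j] - coords[i]
            r2 = d @ d
            if r2 > cutoff ** 2:
                continue  # residues too far apart to be connected by a spring
            block = -np.outer(d, d) / r2  # off-diagonal 3x3 block, unit spring
            hessian[3 * i:3 * i + 3, 3 * j:3 * j + 3] = block
            hessian[3 * j:3 * j + 3, 3 * i:3 * i + 3] = block
            hessian[3 * i:3 * i + 3, 3 * i:3 * i + 3] -= block
            hessian[3 * j:3 * j + 3, 3 * j:3 * j + 3] -= block
    eigvals, eigvecs = np.linalg.eigh(hessian)
    # Discard the six rigid-body modes (near-zero eigenvalues), keep `modes`.
    vals = eigvals[6:6 + modes]
    vecs = eigvecs[:, 6:6 + modes]
    # Covariance of displacements: sum over modes of v v^T / eigenvalue.
    cov = (vecs / vals) @ vecs.T
    # Collapse each 3x3 block to its trace to get one value per residue pair.
    res_cov = np.einsum("iaja->ij", cov.reshape(n, 3, n, 3))
    diag = np.sqrt(np.diag(res_cov))
    return res_cov / np.outer(diag, diag)


# Toy input: 12 pseudo-residues at random positions (not a real structure).
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 8.0, size=(12, 3))
corr = anm_correlation_map(coords, modes=10)
```

The result is a symmetric matrix with ones on the diagonal and entries between -1 and 1, where positive values indicate residues that tend to move together.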

generate_masked_image(img, idx, test_h=None, test_l=None)[source]

Generates a masked normal mode correlation map.

Parameters:
  • img (numpy.ndarray) – Original array containing no blank pixels.

  • idx (int) – Input index.

  • test_h (int) – Length of the heavy chain of an antibody in the test set.

  • test_l (int) – Length of the light chain of an antibody in the test set.

Returns:

  • masked (numpy.ndarray) – Masked normal mode correlation map.

  • mask (numpy.ndarray) – Mask itself.
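
As a rough illustration of what a masked map and its mask can look like, the sketch below places a small map into a fixed-size canvas and records which pixels hold real data. It is a deliberately simplified assumption: the helper name pad_and_mask, its arguments and the layout (heavy chain first, then light chain, blank pixels at the end of each chain) are hypothetical and ignore any residue alignment or antigen handling that the real method may perform.

```python
import numpy as np


def pad_and_mask(img, h_len, l_len, max_h, max_l):
    """Embed a (h_len + l_len)-sized map in a (max_h + max_l)-sized canvas.

    Returns the padded map and a binary mask marking the filled pixels.
    """
    size = max_h + max_l
    masked = np.zeros((size, size))
    mask = np.zeros((size, size))
    # Heavy-chain residues fill the first slots; light-chain residues start
    # at index max_h. Unused slots stay blank (zero).
    idx = np.r_[0:h_len, max_h:max_h + l_len]
    masked[np.ix_(idx, idx)] = img
    mask[np.ix_(idx, idx)] = 1.0
    return masked, mask


# Toy example: heavy chain of 3 residues, light chain of 2, canvas of 4 + 3.
img = np.arange(25, dtype=float).reshape(5, 5) + 1.0
masked, mask = pad_and_mask(img, h_len=3, l_len=2, max_h=4, max_l=3)
```

Here the 5 x 5 input ends up in a 7 x 7 array, and the mask contains exactly 25 ones marking where the original values were placed.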

get_lists_of_lengths(selected_entries)[source]

Retrieves lists with the lengths of the heavy and light chains.

Parameters:
  • selected_entries (list) – Valid PDB entries.

Returns:

  • heavy (list) – Lengths of the heavy chains. In the context of the prediction stage, this list has one element.

  • light (list) – Lengths of the light chains. In the context of the prediction stage, this list has one element.

  • selected_entries (list) – Valid PDB entries. In the context of the prediction stage, this list has one element.

get_max_min_chains()[source]

Returns the longest and shortest possible chains.

initialisation(renew_maps, renew_residues)[source]

Computes the normal mode correlation maps and retrieves lists with the lengths of the heavy and light chains.

Parameters:
  • renew_maps (bool) – Compute all the normal mode correlation maps.

  • renew_residues (bool) – Retrieve the lists of residues for each entry.

Returns:

  • heavy (list) – Lengths of the heavy chains.

  • light (list) – Lengths of the light chains.

  • selected_entries (list) – Valid PDB entries.

load_test_image()[source]

Returns a test normal mode correlation map which is masked according to the existing residues in the training set.

load_training_images()[source]

Returns the input/output pairs of the model and their corresponding labels.

Module contents

This subpackage contains preprocessing classes and functions.