Get Started

This section is a first guide to using the package, with protein-protein interface (PPI) queries as the example. For a more thorough walkthrough, we provide in-depth tutorial notebooks for generating PPI data, generating single-residue variant (SRV) data, and running the training pipeline.

Data Generation

For each protein-protein complex (or protein structure containing a missense variant), a Query can be created and added to the QueryCollection object, to be processed later on. Two subtypes of Query exist: ProteinProteinInterfaceQuery and SingleResidueVariantQuery.

A Query takes as inputs:

- A .pdb file, representing the molecular structure.
- The resolution ("residue" or "atom"), i.e. whether each node should represent an amino acid residue or an atom.
- chain_ids, the chain ID or IDs (generally single capital letters).
  - SingleResidueVariantQuery takes a single ID, which represents the chain containing the variant residue.
  - ProteinProteinInterfaceQuery takes a pair of IDs, which represent the chains between which the interface exists.
  - Note that in either case this does not limit the structure to residues from the given chain(s). The structure contained in the .pdb file can have any number of chains, and residues from all of these chains will be included in the graphs and grids produced by DeepRank2 (if they are within the influence_radius).
- Optionally, the corresponding position-specific scoring matrices (PSSMs), in the form of .pssm files.

from deeprank2.query import QueryCollection, ProteinProteinInterfaceQuery

queries = QueryCollection()

# Append data points
queries.add(ProteinProteinInterfaceQuery(
    pdb_path = "tests/data/pdb/1ATN/1ATN_1w.pdb",
    resolution = "residue",
    chain_ids = ["A", "B"],
    targets = {
        "binary": 0
    },
    pssm_paths = {
        "A": "tests/data/pssm/1ATN/1ATN.A.pdb.pssm",
        "B": "tests/data/pssm/1ATN/1ATN.B.pdb.pssm"
    }
))
queries.add(ProteinProteinInterfaceQuery(
    pdb_path = "tests/data/pdb/1ATN/1ATN_2w.pdb",
    resolution = "residue",
    chain_ids = ["A", "B"],
    targets = {
        "binary": 1
    },
    pssm_paths = {
        "A": "tests/data/pssm/1ATN/1ATN.A.pdb.pssm",
        "B": "tests/data/pssm/1ATN/1ATN.B.pdb.pssm"
    }
))
queries.add(ProteinProteinInterfaceQuery(
    pdb_path = "tests/data/pdb/1ATN/1ATN_3w.pdb",
    resolution = "residue",
    chain_ids = ["A", "B"],
    targets = {
        "binary": 0
    },
    pssm_paths = {
        "A": "tests/data/pssm/1ATN/1ATN.A.pdb.pssm",
        "B": "tests/data/pssm/1ATN/1ATN.B.pdb.pssm"
    }
))
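
When many models share the same layout, the repeated arguments above can be generated in a loop. The following is a minimal sketch that only builds the keyword arguments; the helper name ppi_query_kwargs and its directory parameters are hypothetical and not part of DeepRank2.

```python
from pathlib import PurePosixPath


def ppi_query_kwargs(model_name, binary_target,
                     pdb_dir="tests/data/pdb/1ATN",
                     pssm_dir="tests/data/pssm/1ATN"):
    """Hypothetical helper: keyword arguments for one PPI query of the 1ATN example."""
    chain_ids = ["A", "B"]
    return {
        "pdb_path": str(PurePosixPath(pdb_dir) / f"{model_name}.pdb"),
        "resolution": "residue",
        "chain_ids": chain_ids,
        "targets": {"binary": binary_target},
        # One PSSM file per chain, following the naming used in the example above
        "pssm_paths": {
            chain: str(PurePosixPath(pssm_dir) / f"1ATN.{chain}.pdb.pssm")
            for chain in chain_ids
        },
    }


# Model name -> binary target, as in the three queries above
labels = {"1ATN_1w": 0, "1ATN_2w": 1, "1ATN_3w": 0}
all_kwargs = [ppi_query_kwargs(name, target) for name, target in labels.items()]

for kwargs in all_kwargs:
    print(kwargs["pdb_path"], kwargs["targets"])
```

Each dictionary can then be unpacked into a query, e.g. queries.add(ProteinProteinInterfaceQuery(**kwargs)).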

The user is free to implement a custom query class. Each implementation requires the build method to be present.
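
The requirement of a build method can be pictured with a small stand-in. This sketch does not use the DeepRank2 base class: QuerySketch, WholeStructureQuery, and the build(feature_modules) signature are assumptions for illustration only, and the actual base class and signature should be checked in deeprank2.query.

```python
from abc import ABC, abstractmethod


class QuerySketch(ABC):
    """Stand-in for a query base class (hypothetical, not the DeepRank2 class)."""

    def __init__(self, pdb_path, resolution):
        self.pdb_path = pdb_path
        self.resolution = resolution

    @abstractmethod
    def build(self, feature_modules):
        """Turn this data point into a graph-like object."""


class WholeStructureQuery(QuerySketch):
    """Hypothetical custom query; a real one would parse the structure here."""

    def build(self, feature_modules):
        # Placeholder result: only records what would be built
        return {
            "source": self.pdb_path,
            "resolution": self.resolution,
            "feature_modules": [str(module) for module in feature_modules],
        }


# A subclass without build() cannot be instantiated
try:
    QuerySketch("<structure.pdb>", "residue")
    missing_build_rejected = False
except TypeError:
    missing_build_rejected = True

graph = WholeStructureQuery("<structure.pdb>", "residue").build(["components"])
print(missing_build_rejected, graph["feature_modules"])
```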

The queries can then be processed into graphs only or both graphs and 3D grids, depending on which kind of network will be used later for training.

from deeprank2.features import components, conservation, contact, exposure, irc, surfacearea
from deeprank2.utils.grid import GridSettings, MapMethod

feature_modules = [components, conservation, contact, exposure, irc, surfacearea]

# Save data into 3D-graphs only
hdf5_paths = queries.process(
    "<output_folder>/<prefix_for_outputs>",
    feature_modules = feature_modules)

# Save data into 3D-graphs and 3D-grids
hdf5_paths = queries.process(
    "<output_folder>/<prefix_for_outputs>",
    feature_modules = feature_modules,
    grid_settings = GridSettings(
        # the number of points on the x, y, z edges of the cube
        points_counts = [20, 20, 20],
        # x, y, z sizes of the box in Å
        sizes = [1.0, 1.0, 1.0]),
    grid_map_method = MapMethod.GAUSSIAN)
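
To get a feeling for what points_counts and sizes imply, the sketch below computes the coordinates along each axis and the total number of grid points. How DeepRank2 places the points inside the box is not stated here; the sketch assumes evenly spaced points that include both faces of a box centred on a given point, and axis_points is a hypothetical helper.

```python
def axis_points(center, size, count):
    """Evenly spaced coordinates spanning [center - size/2, center + size/2] (assumed layout)."""
    if count == 1:
        return [center]
    start = center - size / 2
    step = size / (count - 1)
    return [start + i * step for i in range(count)]


points_counts = [20, 20, 20]
sizes = [1.0, 1.0, 1.0]  # in Å
center = (0.0, 0.0, 0.0)  # assumed grid centre

xs, ys, zs = (
    axis_points(c, size, count)
    for c, size, count in zip(center, sizes, points_counts)
)

total_points = len(xs) * len(ys) * len(zs)
spacing = xs[1] - xs[0]
print(total_points, round(spacing, 4))
```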

Data Exploration

As a representative example, the following is the HDF5 structure generated by the previous phase for 1ATN_1w.pdb, i.e. for a single data point in the graph + grid case:

└── ppi-1ATN_1w:A-B
    │
    ├── edge_features
    │   ├── _index
    │   ├── _name
    │   ├── covalent
    │   ├── distance
    │   ├── electrostatic
    │   ├── same_chain
    │   └── vanderwaals
    │
    ├── node_features
    │   ├── _chain_id
    │   ├── _name
    │   ├── _position
    │   ├── bsa
    │   ├── hse
    │   ├── info_content
    │   ├── res_depth
    │   ├── pssm
    │   ├── ...
    │   └── sasa
    │
    ├── grid_points
    │   ├── center
    │   ├── x
    │   ├── y
    │   └── z
    │
    ├── mapped_features
    │   ├── _position_000
    │   ├── _position_001
    │   ├── _position_002
    │   ├── bsa
    │   ├── covalent
    │   ├── distance
    │   ├── electrostatic
    │   ├── hse_000
    │   ├── ...
    │   └── vanderwaals
    │
    └── target_values
        └── binary

This entry represents the interface between the two proteins contained in the .pdb file, at the residue level. edge_features and node_features are specific to the graph-like representation of the PPI, while grid_points and mapped_features refer to the grid mapped from the graph. Every data point generated by DeepRank2 has this layout; only the features and targets differ, since they are specified by the user.
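
Entry names such as ppi-1ATN_1w:A-B can be taken apart when entries need to be grouped, for instance by model. The sketch below assumes that all entry names follow the pattern seen in this single example (a prefix, a hyphen, the model name, a colon, and hyphen-separated chain IDs); the function parse_entry_id is hypothetical.

```python
import re

# Assumed pattern, inferred from the example entry "ppi-1ATN_1w:A-B"
ENTRY_ID_PATTERN = re.compile(r"^(?P<kind>[^-]+)-(?P<model>[^:]+):(?P<chains>.+)$")


def parse_entry_id(entry_id):
    """Split an entry name into query kind, model name and chain IDs."""
    match = ENTRY_ID_PATTERN.match(entry_id)
    if match is None:
        raise ValueError(f"Unexpected entry name: {entry_id!r}")
    return {
        "kind": match["kind"],
        "model": match["model"],
        "chains": match["chains"].split("-"),
    }


parsed = parse_entry_id("ppi-1ATN_1w:A-B")
print(parsed)
```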

It is always good practice to explore the data first, and only then decide how to split them into training, validation and test sets. For this purpose, users can either use HDFView, a visual tool written in Java for browsing and editing HDF5 files, or Python packages such as h5py. A few examples for the latter:

import h5py

with h5py.File("<hdf5_path.hdf5>", "r") as hdf5:
    # List of all graphs in the HDF5 file, each graph representing a PPI
    ids = list(hdf5.keys())
    # List of all node features
    node_features = list(hdf5[ids[0]]["node_features"])
    # List of all edge features
    edge_features = list(hdf5[ids[0]]["edge_features"])
    # List of all targets
    targets = list(hdf5[ids[0]]["target_values"])
    # BSA feature for ids[0], numpy.ndarray
    node_feat_bsa = hdf5[ids[0]]["node_features"]["bsa"][:]
    # Electrostatic feature for ids[0], numpy.ndarray
    edge_feat_electrostatic = hdf5[ids[0]]["edge_features"]["electrostatic"][:]

Datasets

The data can be split into sets in whatever way suits the specific application. Once the training, validation and testing IDs have been chosen (they are keys of the HDF5 file(s)), the DeeprankDataset objects can be defined.
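
One simple way to choose the IDs is a seeded random split. This is only a sketch of one possible strategy, not a DeepRank2 function: split_ids, its default fractions, and the placeholder entry names are assumptions, and applications with related entries (e.g. several models of the same complex) may need a grouped split instead.

```python
import random


def split_ids(ids, valid_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle the IDs reproducibly and cut them into train/valid/test lists."""
    shuffled = sorted(ids)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_valid = round(len(shuffled) * valid_fraction)
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, valid, test


# Placeholder entry names; in practice these come from the HDF5 keys
example_ids = [f"ppi-model_{i}:A-B" for i in range(20)]
train_ids, valid_ids, test_ids = split_ids(example_ids, valid_fraction=0.2, test_fraction=0.2)
print(len(train_ids), len(valid_ids), len(test_ids))
```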

GraphDataset

For training GNNs the user can create a GraphDataset instance:

from deeprank2.dataset import GraphDataset

node_features = ["bsa", "res_depth", "hse", "info_content", "pssm"]
edge_features = ["distance"]
target = "binary"

# Creating GraphDataset objects
train_ids = [<ids>]
valid_ids = [<ids>]
test_ids = [<ids>]
dataset_train = GraphDataset(
    hdf5_path = hdf5_paths,
    subset = train_ids,
    node_features = node_features,
    edge_features = edge_features,
    target = target
)
dataset_val = GraphDataset(
    hdf5_path = hdf5_paths,
    subset = valid_ids,
    train_source = dataset_train
)
dataset_test = GraphDataset(
    hdf5_path = hdf5_paths,
    subset = test_ids,
    train_source = dataset_train
)

Transforming Features

For the GraphDataset class, a dictionary can be passed to indicate which transformations to apply to the features. A transformation consists of a lambda function and/or standardization. If standardize is True, standardization is applied after the transform, when one is given. Example:

import numpy as np
from deeprank2.dataset import GraphDataset

node_features = ["bsa", "res_depth", "hse", "info_content", "pssm"]
edge_features = ["distance"]
target = "binary"

features_transform = {
    'bsa': {'transform': lambda t: np.log(t+1), 'standardize': True},
    'electrostatic': {'transform': lambda t: np.sqrt(t), 'standardize': True},
    'hse': {'transform': lambda t: np.log(t+1), 'standardize': False}
}
train_ids = [<ids>]
dataset_train = GraphDataset(
    hdf5_path = hdf5_paths,
    subset = train_ids,
    node_features = node_features,
    edge_features = edge_features,
    features_transform = features_transform,
    target = target
)
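
The order of operations (transform first, then standardize) can be illustrated on a plain list of values. This is a sketch of the idea rather than the GraphDataset implementation: the function name and the sample values are made up, and the use of the population standard deviation is an assumption.

```python
import math
import statistics


def transform_then_standardize(values, transform=None, standardize=False):
    """Apply the optional transform, then optionally rescale to zero mean and unit std."""
    if transform is not None:
        values = [transform(v) for v in values]
    if standardize:
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)  # assumption: population standard deviation
        if std == 0:
            raise ValueError("Cannot standardize a constant feature")
        values = [(v - mean) / std for v in values]
    return values


bsa_values = [0.0, 3.5, 12.0, 40.2]  # made-up values
processed = transform_then_standardize(
    bsa_values,
    transform=lambda t: math.log(t + 1),
    standardize=True,
)
print([round(v, 3) for v in processed])
```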

The key all can be used to apply the same transform and standardize settings to every feature present in the dataset. Example:

features_transform = {'all':
    {'transform': lambda t: np.log(t+1), 'standardize': True}
}
train_ids = [<ids>]
dataset_train = GraphDataset(
    hdf5_path = hdf5_paths,
    subset = train_ids,
    node_features = node_features,
    edge_features = edge_features,
    features_transform = features_transform,
    target = target
)

If standardization is used, the validation and testing sets must be standardized with the means and standard deviations computed on the training set. This is done through the train_source parameter of the GraphDataset class. Example:

features_transform = {'all':
    {'transform': lambda t: np.log(t+1), 'standardize': True}
}
train_ids = [<ids>]
valid_ids = [<ids>]
test_ids = [<ids>]
# `train_source` defaults to `None`
dataset_train = GraphDataset(
    hdf5_path = hdf5_paths,
    subset = train_ids,
    node_features = node_features,
    edge_features = edge_features,
    features_transform = features_transform,
    target = target
)
dataset_val = GraphDataset(
    hdf5_path = hdf5_paths,
    subset = valid_ids,
    train_source = dataset_train # dataset_train means and stds will be used
)
dataset_test = GraphDataset(
    hdf5_path = hdf5_paths,
    subset = test_ids,
    train_source = dataset_train # dataset_train means and stds will be used
)
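
The reason for passing train_source is that statistics are fitted on the training data only and then reused unchanged. A minimal sketch of that separation follows; fit_stats, apply_stats and the numbers are illustrative assumptions, not DeepRank2 code.

```python
import statistics


def fit_stats(train_values):
    """Mean and standard deviation computed on the training values only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)


def apply_stats(values, mean, std):
    """Standardize any set with statistics that were fitted elsewhere."""
    return [(v - mean) / std for v in values]


train_values = [1.0, 2.0, 3.0, 4.0]  # made-up feature values
valid_values = [2.5, 5.0]

mean, std = fit_stats(train_values)
valid_standardized = apply_stats(valid_values, mean, std)
print(mean, [round(v, 3) for v in valid_standardized])
```

Note that the validation values are not centred on their own mean: a value equal to the training mean maps to 0, whatever the rest of the validation set looks like.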

GridDataset

For training CNNs the user can create a GridDataset instance:

from deeprank2.dataset import GridDataset

features = ["bsa", "res_depth", "hse", "info_content", "pssm", "distance"]
target = "binary"

# Creating GridDataset objects
train_ids = [<ids>]
valid_ids = [<ids>]
test_ids = [<ids>]
dataset_train = GridDataset(
    hdf5_path = hdf5_paths,
    subset = train_ids,
    features = features,
    target = target
)
dataset_val = GridDataset(
    hdf5_path = hdf5_paths,
    subset = valid_ids,
    train_source = dataset_train
)
dataset_test = GridDataset(
    hdf5_path = hdf5_paths,
    subset = test_ids,
    train_source = dataset_train
)

Training

Let’s define a Trainer instance, using as an example the already existing VanillaNetwork. Because VanillaNetwork is a GNN, it requires a dataset instance of type GraphDataset.

from deeprank2.trainer import Trainer
from deeprank2.neuralnets.gnn.vanilla_gnn import VanillaNetwork

trainer = Trainer(
    VanillaNetwork,
    dataset_train,
    dataset_val,
    dataset_test
)

The same can be done using a CNN, for example CnnClassification. Here a dataset instance of type GridDataset is required.

from deeprank2.trainer import Trainer
from deeprank2.neuralnets.cnn.model3d import CnnClassification

trainer = Trainer(
    CnnClassification,
    dataset_train,
    dataset_val,
    dataset_test
)

By default, the Trainer class creates the folder ./output to store the prediction information collected during training and testing. HDF5OutputExporter is the exporter used by default, but the user can specify any other implemented exporter or implement a custom one.

The optimizer (torch.optim.Adam by default) and the loss function can be set through dedicated methods:

import torch

trainer.configure_optimizers(torch.optim.Adamax, lr = 0.001, weight_decay = 1e-04)

The model can then be trained and tested. By default, the model with the lowest validation loss is saved; the user can change this behaviour, or choose where the model is saved through the filename parameter of the train() method.

trainer.train(
    nepoch = 50,
    batch_size = 64,
    validate = True,
    filename = "<my_folder/model.pth.tar>")
trainer.test()
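
The selection rule "best model in terms of validation loss" amounts to keeping track of the lowest loss seen so far. The sketch below shows only that rule on made-up numbers; it is not the Trainer code, and numbering epochs from 1 is an assumption.

```python
def best_epoch(validation_losses):
    """Return (epoch, loss) of the lowest validation loss, counting epochs from 1."""
    best_index, best_loss = None, float("inf")
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss < best_loss:
            # This is the point where a checkpoint would be written
            best_index, best_loss = epoch, loss
    return best_index, best_loss


losses = [0.71, 0.64, 0.59, 0.61, 0.60]  # made-up validation losses
epoch, loss = best_epoch(losses)
print(epoch, loss)
```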

Results Export and Visualization

The user can specify a DeepRank2 exporter or a custom one in the output_exporters parameter of the Trainer class, together with the path where the results should be saved. Exporters store the prediction information collected during training and testing. Example:

from deeprank2.trainer import Trainer
from deeprank2.neuralnets.gnn.vanilla_gnn import VanillaNetwork
from deeprank2.utils.exporters import HDF5OutputExporter

trainer = Trainer(
    VanillaNetwork,
    dataset_train,
    dataset_val,
    dataset_test,
    output_exporters = [HDF5OutputExporter("<output_folder_path>")]
)

By default, the Trainer class creates the folder ./output and uses HDF5OutputExporter. With HDF5OutputExporter, results are saved in output_exporter.hdf5 during both the training (.train()) and the testing (.test()) phases. The file contains one group per phase that was run: training and testing if both were run, only one of them otherwise. The training group includes the validation results as well. The HDF5 file can then be read as a pandas DataFrame:

import os
import pandas as pd

output_train = pd.read_hdf(os.path.join("<output_folder_path>", "output_exporter.hdf5"), key="training")
output_test = pd.read_hdf(os.path.join("<output_folder_path>", "output_exporter.hdf5"), key="testing")

The dataframes contain phase, epoch, entry, output, target, and loss columns, and can be easily used to visualize the results.

For classification tasks, the output column contains a list with the probability of each class, and each list sums to 1 (for more details, please see the documentation on the softmax function). Note that the order of the classes in the list depends on the classes attribute of the DeeprankDataset instances. If classes is not specified (as in this example), it defaults to [0, 1].
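
As a small illustration of how such a list of probabilities relates to a predicted class, the sketch below applies a softmax to two raw scores and picks the most probable class. The scores are made up and the helper names are hypothetical; the classes order [0, 1] is the default mentioned above.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predicted_class(probabilities, classes=(0, 1)):
    """Return the class whose position holds the highest probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return classes[best]


probabilities = softmax([0.3, 1.2])  # made-up raw scores
label = predicted_class(probabilities)
print([round(p, 3) for p in probabilities], label)
```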

Example for plotting training loss curves using Plotly Express:

import plotly.express as px

fig = px.line(
    output_train,
    x='epoch',
    y='loss',
    color='phase',
    markers=True)

fig.update_layout(
    xaxis_title='Epoch #',
    yaxis_title='Loss',
    title='Loss vs epochs'
)

Run a Pre-trained Model on New Data

If you want to run a pre-trained model on new PDB files, the first step is to process and save them into HDF5 files. Let’s suppose that the model has been trained with ProteinProteinInterfaceQuery queries mapped to graphs:

from deeprank2.query import QueryCollection, ProteinProteinInterfaceQuery

queries = QueryCollection()

# Append data points
queries.add(ProteinProteinInterfaceQuery(
    pdb_path = "<new_pdb_file1.pdb>",
    resolution = "residue",
    chain_ids = ["A", "B"]
))
queries.add(ProteinProteinInterfaceQuery(
    pdb_path = "<new_pdb_file2.pdb>",
    resolution = "residue",
    chain_ids = ["A", "B"]
))

hdf5_paths = queries.process(
    "<output_folder>/<prefix_for_outputs>",
    feature_modules = 'all')

Then, the GraphDataset instance for the newly processed data can be created. Do this by specifying the path to the pre-trained model in train_source, together with the path to the HDF5 files just created. Note that there is no need to set the dataset’s parameters, since they are inherited from the information saved in the pre-trained model.

from deeprank2.dataset import GraphDataset

dataset_test = GraphDataset(
    hdf5_path = "<output_folder>/<prefix_for_outputs>",
    train_source = "<pretrained_model_path>"
)

Finally, the Trainer instance can be defined and the new data can be tested:

from deeprank2.trainer import Trainer
from deeprank2.neuralnets.gnn.vanilla_gnn import VanillaNetwork
from deeprank2.utils.exporters import HDF5OutputExporter

trainer = Trainer(
    VanillaNetwork,
    dataset_test = dataset_test,
    pretrained_model = "<pretrained_model_path>",
    output_exporters = [HDF5OutputExporter("<output_folder_path>")]
)

trainer.test()

The results can then be read into a pandas DataFrame and visualized:

import os
import pandas as pd

output = pd.read_hdf(os.path.join("<output_folder_path>", "output_exporter.hdf5"), key="testing")
output.head()