Get Started
The following section serves as a first guide to using the package, with protein-protein interface (PPI) queries as an example. For an enhanced learning experience, we provide in-depth tutorial notebooks for generating PPI data, generating SRV data, and for the training pipeline.
Data Generation
For each protein-protein complex (or protein structure containing a missense variant), a Query can be created and added to the QueryCollection object, to be processed later on. Two subtypes of Query exist: ProteinProteinInterfaceQuery and SingleResidueVariantQuery.
A Query takes as inputs:
- A .pdb file, representing the molecular structure.
- The resolution ("residue" or "atom"), i.e. whether each node should represent an amino acid residue or an atom.
- chain_ids, the chain ID or IDs (generally single capital letter(s)).
  - SingleResidueVariantQuery takes a single ID, which represents the chain containing the variant residue.
  - ProteinProteinInterfaceQuery takes a pair of IDs, which represent the chains between which the interface exists.
  - Note that in either case this does not limit the structure to residues from this chain or these chains. The structure contained in the .pdb file can thus have any number of chains, and residues from these chains will be included in the graphs and grids produced by DeepRank2 (if they are within the influence_radius).
- Optionally, the corresponding position-specific scoring matrices (PSSMs), in the form of .pssm files.
from deeprank2.query import QueryCollection, ProteinProteinInterfaceQuery
queries = QueryCollection()
# Append data points
queries.add(ProteinProteinInterfaceQuery(
    pdb_path = "tests/data/pdb/1ATN/1ATN_1w.pdb",
    resolution = "residue",
    chain_ids = ["A", "B"],
    targets = {
        "binary": 0
    },
    pssm_paths = {
        "A": "tests/data/pssm/1ATN/1ATN.A.pdb.pssm",
        "B": "tests/data/pssm/1ATN/1ATN.B.pdb.pssm"
    }
))
queries.add(ProteinProteinInterfaceQuery(
    pdb_path = "tests/data/pdb/1ATN/1ATN_2w.pdb",
    resolution = "residue",
    chain_ids = ["A", "B"],
    targets = {
        "binary": 1
    },
    pssm_paths = {
        "A": "tests/data/pssm/1ATN/1ATN.A.pdb.pssm",
        "B": "tests/data/pssm/1ATN/1ATN.B.pdb.pssm"
    }
))
queries.add(ProteinProteinInterfaceQuery(
    pdb_path = "tests/data/pdb/1ATN/1ATN_3w.pdb",
    resolution = "residue",
    chain_ids = ["A", "B"],
    targets = {
        "binary": 0
    },
    pssm_paths = {
        "A": "tests/data/pssm/1ATN/1ATN.A.pdb.pssm",
        "B": "tests/data/pssm/1ATN/1ATN.B.pdb.pssm"
    }
))
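The three calls above differ only in the model file and the target value. As a minimal sketch, the keyword arguments could be generated in a loop and then unpacked into ProteinProteinInterfaceQuery. The helper name build_query_kwargs and the models dictionary are illustrative and not part of DeepRank2; the file paths follow the pattern used in the example above.

```python
# Hypothetical helper (not part of DeepRank2): builds the keyword arguments
# used in the example above for one model of a complex.
def build_query_kwargs(model_name, binary_target, complex_id="1ATN", chains=("A", "B")):
    return {
        "pdb_path": f"tests/data/pdb/{complex_id}/{model_name}.pdb",
        "resolution": "residue",
        "chain_ids": list(chains),
        "targets": {"binary": binary_target},
        "pssm_paths": {
            chain: f"tests/data/pssm/{complex_id}/{complex_id}.{chain}.pdb.pssm"
            for chain in chains
        },
    }

# Same models and targets as in the example above.
models = {"1ATN_1w": 0, "1ATN_2w": 1, "1ATN_3w": 0}
all_kwargs = [build_query_kwargs(name, target) for name, target in models.items()]

# With DeepRank2 installed, each dictionary can be unpacked into a query:
# for kwargs in all_kwargs:
#     queries.add(ProteinProteinInterfaceQuery(**kwargs))
```

Keeping the arguments in plain dictionaries makes it easy to inspect them before any structure is processed.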
The user is free to implement a custom query class. Each implementation requires the build method to be present.
The queries can then be processed into graphs only or both graphs and 3D grids, depending on which kind of network will be used later for training.
from deeprank2.features import components, conservation, contact, exposure, irc, surfacearea
from deeprank2.utils.grid import GridSettings, MapMethod
feature_modules = [components, conservation, contact, exposure, irc, surfacearea]
# Save data into 3D-graphs only
hdf5_paths = queries.process(
    "<output_folder>/<prefix_for_outputs>",
    feature_modules = feature_modules)
# Save data into 3D-graphs and 3D-grids
hdf5_paths = queries.process(
    "<output_folder>/<prefix_for_outputs>",
    feature_modules = feature_modules,
    grid_settings = GridSettings(
        # the number of points on the x, y, z edges of the cube
        points_counts = [20, 20, 20],
        # x, y, z sizes of the box in Å
        sizes = [1.0, 1.0, 1.0]),
    grid_map_method = MapMethod.GAUSSIAN)
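To make the two GridSettings arguments more concrete, the sketch below computes the point coordinates along one axis of a box with a given size and number of points, and a Gaussian weight that decays with the distance between a grid point and a feature position. The even spacing between the box edges, the centring on a chosen position, and the weight formula exp(-beta * d²) are assumptions made for illustration; they are not taken from the DeepRank2 source, and the function names are hypothetical.

```python
import math

def axis_coordinates(center, size, points_count):
    """Evenly spaced coordinates spanning [center - size/2, center + size/2]."""
    if points_count == 1:
        return [center]
    step = size / (points_count - 1)
    start = center - size / 2
    return [start + i * step for i in range(points_count)]

def gaussian_weight(grid_point, feature_position, beta=1.0):
    """Weight in (0, 1] that shrinks as the grid point moves away from the feature."""
    squared_distance = sum((g - f) ** 2 for g, f in zip(grid_point, feature_position))
    return math.exp(-beta * squared_distance)

# One axis of the box from the example above: 20 points over 1.0 Å.
xs = axis_coordinates(center=0.0, size=1.0, points_count=20)
spacing = xs[1] - xs[0]  # distance between neighbouring points, in Å

# A feature located exactly on a grid point contributes fully,
# a feature 3 Å away contributes almost nothing.
weight_same = gaussian_weight((0.0, 0.0, 0.0), (0.0, 0.0, 0.0))
weight_far = gaussian_weight((0.0, 0.0, 0.0), (3.0, 0.0, 0.0))
```

Under these assumptions, increasing points_counts for a fixed size gives a finer grid, while increasing sizes for a fixed count covers more of the structure at a coarser spacing.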
Data Exploration
As a representative example, the following is the HDF5 structure generated by the previous phase for 1ATN_1w.pdb, i.e. for a single graph, in the graph + grid case:
└── ppi-1ATN_1w:A-B
    │
    ├── edge_features
    │   ├── _index
    │   ├── _name
    │   ├── covalent
    │   ├── distance
    │   ├── electrostatic
    │   ├── same_chain
    │   └── vanderwaals
    │
    ├── node_features
    │   ├── _chain_id
    │   ├── _name
    │   ├── _position
    │   ├── bsa
    │   ├── hse
    │   ├── info_content
    │   ├── res_depth
    │   ├── pssm
    │   ├── ...
    │   └── sasa
    │
    ├── grid_points
    │   ├── center
    │   ├── x
    │   ├── y
    │   └── z
    │
    ├── mapped_features
    │   ├── _position_000
    │   ├── _position_001
    │   ├── _position_002
    │   ├── bsa
    │   ├── covalent
    │   ├── distance
    │   ├── electrostatic
    │   ├── hse_000
    │   ├── ...
    │   └── vanderwaals
    │
    └── target_values
        └── binary
This entry represents the interface between the two proteins contained in the .pdb file, at the residue level. edge_features and node_features are specific to the graph-like representation of the PPI, while grid_points and mapped_features refer to the grid mapped from the graph. Each data point generated by DeepRank2 has the above structure, apart from the features and the target, which are specified by the user.
It is always good practice to first explore the data, and then decide how to split them into training, validation and test sets. For this purpose, users can either use HDFView, a visual tool written in Java for browsing and editing HDF5 files, or Python packages such as h5py. A few examples for the latter:
import h5py
with h5py.File("<hdf5_path.hdf5>", "r") as hdf5:
    # List of all graphs in hdf5, each graph representing a ppi
    ids = list(hdf5.keys())
    # List of all node features
    node_features = list(hdf5[ids[0]]["node_features"])
    # List of all edge features
    edge_features = list(hdf5[ids[0]]["edge_features"])
    # List of all targets
    targets = list(hdf5[ids[0]]["target_values"])
    # BSA feature for ids[0], numpy.ndarray
    node_feat_bsa = hdf5[ids[0]]["node_features"]["bsa"][:]
    # Electrostatic feature for ids[0], numpy.ndarray
    edge_feat_electrostatic = hdf5[ids[0]]["edge_features"]["electrostatic"][:]
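For a quick overview of a whole file, a small recursive helper can list every group and dataset name with indentation, similar to the tree shown earlier. The sketch below is written against the generic mapping interface and is demonstrated on a nested dictionary that stands in for an open file. h5py groups behave like mappings, so passing an open h5py.File should work the same way, but that is an assumption here and has only been checked with plain dictionaries. The function name tree_lines is hypothetical.

```python
from collections.abc import Mapping

def tree_lines(node, indent=0):
    """Return one line per key, indented by nesting depth."""
    lines = []
    for key in node:
        lines.append("  " * indent + str(key))
        child = node[key]
        # Only descend into group-like children; leaves (datasets) are just listed.
        if isinstance(child, Mapping):
            lines.extend(tree_lines(child, indent + 1))
    return lines

# Stand-in for an open HDF5 file: a nested dict with a few of the keys shown above.
example = {
    "ppi-1ATN_1w:A-B": {
        "edge_features": {"distance": None, "electrostatic": None},
        "node_features": {"bsa": None, "hse": None},
        "target_values": {"binary": None},
    }
}
lines = tree_lines(example)
print("\n".join(lines))
```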
Datasets
Data can be split into sets using custom splits that suit the specific application. Once the training, validation and testing IDs have been chosen (they are keys of the HDF5 file/s), the DeeprankDataset objects can be defined.
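As one possible way of choosing the IDs, the sketch below shuffles a list of keys reproducibly and cuts it into three lists. The function name split_ids, the fractions and the seed are illustrative choices, not DeepRank2 defaults. A purely random split may be unsuitable when entries are related (for example several models of the same complex), in which case the split should be made at the level of the complex instead.

```python
import random

def split_ids(ids, valid_fraction=0.2, test_fraction=0.1, seed=42):
    """Shuffle ids reproducibly and cut them into train/valid/test lists."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_valid = int(len(shuffled) * valid_fraction)
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, valid, test

# Placeholder ids; in practice these would be the keys read from the HDF5 file(s).
ids = [f"entry_{i}" for i in range(10)]
train_ids, valid_ids, test_ids = split_ids(ids)
```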
GraphDataset
For training GNNs the user can create a GraphDataset instance:
from deeprank2.dataset import GraphDataset
node_features = ["bsa", "res_depth", "hse", "info_content", "pssm"]
edge_features = ["distance"]
target = "binary"
# Creating GraphDataset objects
train_ids = [<ids>]
valid_ids = [<ids>]
test_ids = [<ids>]
dataset_train = GraphDataset(
    hdf5_path = hdf5_paths,
    subset = train_ids,
    node_features = node_features,
    edge_features = edge_features,
    target = target
)
dataset_val = GraphDataset(
    hdf5_path = hdf5_paths,
    subset = valid_ids,
    train_source = dataset_train
)
dataset_test = GraphDataset(
    hdf5_path = hdf5_paths,
    subset = test_ids,
    train_source = dataset_train
)
Transforming Features
For the GraphDataset class it is possible to define a dictionary that indicates which transformations to apply to the features; a transformation can be a lambda function and/or standardization. If standardize is True, standardization is applied after the transformation, if the latter is present. Example:
import numpy as np
from deeprank2.dataset import GraphDataset
node_features = ["bsa", "res_depth", "hse", "info_content", "pssm"]
edge_features = ["distance"]
target = "binary"
features_transform = {
    'bsa': {'transform': lambda t: np.log(t+1), 'standardize': True},
    'electrostatic': {'transform': lambda t: np.sqrt(t), 'standardize': True},
    'hse': {'transform': lambda t: np.log(t+1), 'standardize': False}
}
train_ids = [<ids>]
dataset_train = GraphDataset(
    hdf5_path = hdf5_paths,
    subset = train_ids,
    node_features = node_features,
    edge_features = edge_features,
    features_transform = features_transform,
    target = target
)
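The order of operations described above (transformation first, standardization second) can be illustrated without DeepRank2. The sketch below uses only the standard library; the function name transform_then_standardize is hypothetical, the input values are made up, and the use of the population standard deviation is an assumption rather than a statement about what the library computes.

```python
import math
import statistics

def transform_then_standardize(values, transform=None, standardize=False):
    """Apply an optional element-wise transform, then optional standardization."""
    if transform is not None:
        values = [transform(v) for v in values]
    if standardize:
        # Statistics are computed on the already transformed values.
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)
        values = [(v - mean) / std for v in values]
    return values

raw = [0.0, 1.0, 3.0, 7.0]  # made-up feature values, for illustration only
processed = transform_then_standardize(
    raw,
    transform=lambda t: math.log(t + 1),
    standardize=True,
)
```

After this processing the values have zero mean and unit standard deviation, which is the usual goal of standardization.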
An all key can be set to apply the same standardize and transform settings to all the features present in the dataset. Example:
features_transform = {'all':
    {'transform': lambda t: np.log(t+1), 'standardize': True}
}
train_ids = [<ids>]
dataset_train = GraphDataset(
    hdf5_path = hdf5_paths,
    subset = train_ids,
    node_features = node_features,
    edge_features = edge_features,
    features_transform = features_transform,
    target = target
)
If the standardize functionality is used, the validation and testing sets need the means and standard deviations of the relevant features as computed on the training set, so that the same values are used to standardize the validation and testing features. This can be done using the train_source parameter of the GraphDataset class. Example:
features_transform = {'all':
    {'transform': lambda t: np.log(t+1), 'standardize': True}
}
train_ids = [<ids>]
valid_ids = [<ids>]
test_ids = [<ids>]
# `train_source` defaults to `None`
dataset_train = GraphDataset(
    hdf5_path = hdf5_paths,
    subset = train_ids,
    node_features = node_features,
    edge_features = edge_features,
    features_transform = features_transform,
    target = target
)
dataset_val = GraphDataset(
    hdf5_path = hdf5_paths,
    subset = valid_ids,
    train_source = dataset_train # dataset_train means and stds will be used
)
dataset_test = GraphDataset(
    hdf5_path = hdf5_paths,
    subset = test_ids,
    train_source = dataset_train # dataset_train means and stds will be used
)
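The reason for passing the training set as train_source can be seen in a small stand-alone sketch: the mean and standard deviation are computed once on the training values and then reused unchanged for the other sets. The function names and the numbers below are made up for illustration, and the population standard deviation is an assumption.

```python
import statistics

def fit_standardizer(train_values):
    """Compute mean and standard deviation on the training values only."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)

def apply_standardizer(values, mean, std):
    """Standardize any set of values with previously computed statistics."""
    return [(v - mean) / std for v in values]

train_values = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up numbers
valid_values = [2.0, 6.0]                 # made-up numbers

mean, std = fit_standardizer(train_values)
train_scaled = apply_standardizer(train_values, mean, std)
valid_scaled = apply_standardizer(valid_values, mean, std)
```

The validation values are not centred on zero after scaling, which is expected: they are expressed on the scale of the training data, so the network sees all sets in the same units.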
GridDataset
For training CNNs the user can create a GridDataset instance:
from deeprank2.dataset import GridDataset
features = ["bsa", "res_depth", "hse", "info_content", "pssm", "distance"]
target = "binary"
# Creating GridDataset objects
train_ids = [<ids>]
valid_ids = [<ids>]
test_ids = [<ids>]
dataset_train = GridDataset(
    hdf5_path = hdf5_paths,
    subset = train_ids,
    features = features,
    target = target
)
dataset_val = GridDataset(
    hdf5_path = hdf5_paths,
    subset = valid_ids,
    train_source = dataset_train
)
dataset_test = GridDataset(
    hdf5_path = hdf5_paths,
    subset = test_ids,
    train_source = dataset_train
)
Training
Let’s define a Trainer instance, using for example one of the already existing networks, VanillaNetwork. Because VanillaNetwork is a GNN, it requires a dataset instance of type GraphDataset.
from deeprank2.trainer import Trainer
from deeprank2.neuralnets.gnn.vanilla_gnn import VanillaNetwork
trainer = Trainer(
    VanillaNetwork,
    dataset_train,
    dataset_val,
    dataset_test
)
The same can be done using a CNN, for example CnnClassification. Here a dataset instance of type GridDataset is required.
from deeprank2.trainer import Trainer
from deeprank2.neuralnets.cnn.model3d import CnnClassification
trainer = Trainer(
    CnnClassification,
    dataset_train,
    dataset_val,
    dataset_test
)
By default, the Trainer class creates the folder ./output for storing the prediction information collected during training and testing. HDF5OutputExporter is the exporter used by default, but the user can specify any other implemented exporter or implement a custom one.
The optimizer (torch.optim.Adam by default) and the loss function can be defined using dedicated methods:
import torch
trainer.configure_optimizers(torch.optim.Adamax, lr = 0.001, weight_decay = 1e-04)
Then the model can be trained and tested; the best model in terms of validation loss is saved by default, and the user can change this or indicate where to save the model using the filename parameter of the train() method.
trainer.train(
    nepoch = 50,
    batch_size = 64,
    validate = True,
    filename = "<my_folder/model.pth.tar>")
trainer.test()
Results Export and Visualization
The user can specify a DeepRank2 exporter or a custom one in the output_exporters parameter of the Trainer class, together with the path where the results should be saved. Exporters are used for storing the prediction information collected during training and testing. Example:
from deeprank2.trainer import Trainer
from deeprank2.neuralnets.gnn.vanilla_gnn import VanillaNetwork
from deeprank2.utils.exporters import HDF5OutputExporter
trainer = Trainer(
    VanillaNetwork,
    dataset_train,
    dataset_val,
    dataset_test,
    output_exporters = [HDF5OutputExporter("<output_folder_path>")]
)
By default, the Trainer class creates the folder ./output and uses HDF5OutputExporter. With this exporter, results are saved in output_exporter.hdf5 both during the training (.train()) and during the testing (.test()) phases. output_exporter.hdf5 contains Groups which refer to each phase, e.g. training and testing if both are run, or only one of them otherwise. The training phase includes validation results as well. The HDF5 file can then be read as a Pandas DataFrame:
import os
import pandas as pd
output_train = pd.read_hdf(os.path.join("<output_folder_path>", "output_exporter.hdf5"), key="training")
output_test = pd.read_hdf(os.path.join("<output_folder_path>", "output_exporter.hdf5"), key="testing")
The dataframes contain phase, epoch, entry, output, target, and loss columns, and can be easily used to visualize the results.
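As an illustration of how these columns can be used, the sketch below averages the loss per epoch for one phase and picks the epoch with the lowest mean validation loss. It works on plain dictionaries with made-up values; the phase label "validation" and the function name mean_loss_per_epoch are assumptions for this example.

```python
from collections import defaultdict

def mean_loss_per_epoch(rows, phase):
    """Average the 'loss' values over all rows of one phase, grouped by epoch."""
    losses = defaultdict(list)
    for row in rows:
        if row["phase"] == phase:
            losses[row["epoch"]].append(row["loss"])
    return {epoch: sum(vals) / len(vals) for epoch, vals in sorted(losses.items())}

# Made-up rows with a subset of the columns listed above.
rows = [
    {"phase": "training", "epoch": 1, "loss": 0.9},
    {"phase": "validation", "epoch": 1, "loss": 0.8},
    {"phase": "validation", "epoch": 1, "loss": 0.6},
    {"phase": "training", "epoch": 2, "loss": 0.5},
    {"phase": "validation", "epoch": 2, "loss": 0.4},
]
valid_loss = mean_loss_per_epoch(rows, "validation")
best_epoch = min(valid_loss, key=valid_loss.get)
```

A Pandas DataFrame can be converted to such a list of rows with its to_dict("records") method, so the same function can be applied to the training output after loading it.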
For classification tasks, the output column contains a list with the probability of each class, and each list sums to 1 (for more details, please see the documentation on the softmax function). Note that the order of the classes in the list depends on the classes attribute of the DeeprankDataset instances. For classification tasks, if classes is not specified (as in this example), it defaults to [0, 1].
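Turning such probability lists into predicted labels amounts to taking the class at the position of the highest probability. The sketch below shows this for the default class order [0, 1]; the probability lists and true labels are made up, and the function name predicted_class is hypothetical.

```python
def predicted_class(probabilities, classes=(0, 1)):
    """Return the class whose probability is highest."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return classes[best_index]

outputs = [[0.9, 0.1], [0.2, 0.8], [0.35, 0.65]]  # made-up probability lists
targets = [0, 1, 0]                                # made-up true labels

predictions = [predicted_class(p) for p in outputs]
accuracy = sum(p == t for p, t in zip(predictions, targets)) / len(targets)
```

If a different classes order was set on the dataset, the same order has to be passed to the helper, otherwise the labels are swapped.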
Example for plotting training loss curves using Plotly Express:
import plotly.express as px
fig = px.line(
    output_train,
    x='epoch',
    y='loss',
    color='phase',
    markers=True)
fig.update_layout(
    xaxis_title='Epoch #',
    yaxis_title='Loss',
    title='Loss vs epochs'
)
Run a Pre-trained Model on New Data
If you want to run a pre-trained model on new PDB files, the first step is to process and save them into HDF5 files. Let’s suppose that the model has been trained with ProteinProteinInterfaceQuery queries mapped to graphs:
from deeprank2.query import QueryCollection, ProteinProteinInterfaceQuery
queries = QueryCollection()
# Append data points
queries.add(ProteinProteinInterfaceQuery(
    pdb_path = "<new_pdb_file1.pdb>",
    # use the same resolution as for the training data
    resolution = "residue",
    chain_ids = ["A", "B"]
))
queries.add(ProteinProteinInterfaceQuery(
    pdb_path = "<new_pdb_file2.pdb>",
    resolution = "residue",
    chain_ids = ["A", "B"]
))
hdf5_paths = queries.process(
    "<output_folder>/<prefix_for_outputs>",
    feature_modules = 'all')
Then, the GraphDataset instance for the newly processed data can be created. Do this by specifying the path to the pre-trained model in train_source, together with the path to the HDF5 files just created. Note that there is no need to set the dataset’s parameters, since they are inherited from the information saved in the pre-trained model.
from deeprank2.dataset import GraphDataset
dataset_test = GraphDataset(
    hdf5_path = "<output_folder>/<prefix_for_outputs>",
    train_source = "<pretrained_model_path>"
)
Finally, the Trainer instance can be defined and the new data can be tested:
from deeprank2.trainer import Trainer
from deeprank2.neuralnets.gnn.vanilla_gnn import VanillaNetwork
from deeprank2.utils.exporters import HDF5OutputExporter
trainer = Trainer(
    VanillaNetwork,
    dataset_test = dataset_test,
    pretrained_model = "<pretrained_model_path>",
    output_exporters = [HDF5OutputExporter("<output_folder_path>")]
)
trainer.test()
The results can then be read in a Pandas Dataframe and visualized:
import os
import pandas as pd
output = pd.read_hdf(os.path.join("<output_folder_path>", "output_exporter.hdf5"), key="testing")
output.head()