Distributional Graphormer (DiG): Predicting equilibrium distributions for molecular systems with deep learning

Deep learning has transformed molecular structure prediction, with AlphaFold and deep learning-based docking methods achieving high accuracy in areas like protein design and drug discovery. However, these methods predict only a single static structure and neglect the flexibility of protein structures. DiG (Distributional Graphormer) uses a diffusion-based approach to predict the equilibrium distribution of dynamic proteins efficiently. It generates diverse conformations and estimates state densities orders of magnitude faster than traditional methods. The model can be trained on experimental data or on molecular dynamics (MD) simulations.


Applications include:

A. Protein conformation sampling

B. Ligand structure sampling

C. Catalyst–adsorbate sampling

D. Property-guided structure generation

DiG Demonstrates Significant Efficiency Gains:

  • DiG achieves a remarkable 1000-fold speedup compared to Folding@Home on a 2.6-ms MD simulation of SARS-CoV-2 main protease conformation sampling.

  • DiG accomplishes this in 18 GPU days, whereas the Folding@Home simulation ran on approximately 70 GPUs for 365 days.

  • This makes DiG much cheaper than current sampling methods.
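
As a quick sanity check on the quoted speedup, the GPU-day figures can be compared directly. This is only arithmetic on the numbers given in this post (18 GPU days for DiG, roughly 70 GPUs for 365 days for Folding@Home); the variable names are illustrative, and the GPU counts are approximate.

```python
# Figures quoted in the post; the Folding@Home GPU count is approximate.
dig_gpu_days = 18   # DiG sampling cost in GPU-days
fah_gpus = 70       # approximate number of GPUs in the Folding@Home run
fah_days = 365      # duration of that run in days

# Total Folding@Home cost in GPU-days, and the ratio to DiG's cost.
fah_gpu_days = fah_gpus * fah_days
speedup = fah_gpu_days / dig_gpu_days

print(f"Folding@Home: {fah_gpu_days} GPU-days")
print(f"DiG: {dig_gpu_days} GPU-days")
print(f"Ratio: about {speedup:.0f}x")
```

The ratio comes out on the order of a thousand, which is consistent with the "1000-fold" figure being a rounded estimate.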

Data, databases and software used in DiG:

  • PDB version used for training: downloaded on 25 December 2020

  • Template search used PDB70 database: downloaded on 13 May 2020

  • MSA lookup used Uniclust30 v.2018_08

  • Simulation trajectories: 238 from GPCRmd dataset

  • Protein–ligand docked complexes: CrossDocked2020 dataset v1.3

  • Programming languages and libraries: Python, PyTorch, NumPy, fairseq, torch-geometric, RDKit

  • MSA and PDB70 template searches: HHblits and HHsearch from HH-suite

  • MD simulations: GROMACS

  • Energy function training: OpenMM, pdbfixer, amber14 force field

  • DFT calculations for carbon polymorphs dataset: VASP

Protein conformation sampling

DiG's performance was assessed against two references: conformational distributions from extensive MD simulations of SARS-CoV-2 proteins (RBD and main protease), and proteins with experimentally determined multiple conformations.

  • DiG-generated structures closely resembled the diverse conformations observed in MD simulations; 70% of the RBD conformations sampled by simulation were covered by just 10,000 DiG-generated structures.

  • DiG captured multiple functional states for various proteins, including adenylate kinase (r.m.s.d. < 1.0 Å), LmrP (r.m.s.d. < 2.0 Å), BRAF kinase, and D-ribose binding protein.
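
The r.m.s.d. and coverage figures above can be made concrete with a small sketch. It assumes structures are already superimposed (no alignment step is performed) and defines coverage as the fraction of reference structures that have at least one generated structure within a cutoff. The function names, the toy coordinates and the 1.0 Å cutoff are hypothetical and are not taken from the DiG paper or code.

```python
import math

def rmsd(a, b):
    """Root-mean-square deviation between two equally sized lists of (x, y, z).

    Assumes the two structures are already superimposed; no alignment is done.
    """
    if len(a) != len(b):
        raise ValueError("structures must have the same number of atoms")
    sq = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(sq / len(a))

def coverage(reference, generated, cutoff):
    """Fraction of reference structures with a generated structure within cutoff."""
    hits = sum(
        1 for ref in reference
        if any(rmsd(ref, gen) <= cutoff for gen in generated)
    )
    return hits / len(reference)

# Hypothetical two-atom toy "conformations" (coordinates in Å).
ref_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
ref_b = [(0.0, 0.0, 0.0), (0.0, 1.5, 0.0)]
ref_c = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.5)]
references = [ref_a, ref_b, ref_c]

generated = [
    [(0.1, 0.0, 0.0), (1.6, 0.0, 0.0)],  # close to ref_a
    [(0.0, 0.1, 0.0), (0.0, 1.4, 0.0)],  # close to ref_b
]

cov = coverage(references, generated, cutoff=1.0)
print(f"coverage at 1.0 Å: {cov:.2f}")
```

In this toy case two of the three reference conformations have a nearby generated structure, so the third one counts as not covered. A real evaluation would first superimpose the structures, for example with the Kabsch algorithm.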

Ligand structure sampling around binding sites

  • DiG model trained on 1,500 complexes from MD simulations.

  • DiG evaluated on 409 protein-ligand systems not in training dataset.

  • Inputs: protein pocket information (atomic type and position), ligand descriptor (SMILES string).

  • Outputs: atomic coordinate distributions of both ligand and protein pocket.

  • Protein pocket flexibility reflected in up to 1.0 Å r.m.s.d. changes in atomic positions.

  • Conformationally, generated structures are highly similar to crystal ligands (r.m.s.d. 1.74 Å).

  • When binding pose deviations are included, generated structures are within 2.0 Å r.m.s.d. of experimental data for nearly all 409 systems.

Catalyst–adsorbate sampling

  • DiG trained on MD trajectories from the Open Catalyst Project.

  • Evaluated on random combinations of adsorbates and surfaces not in the training set.

  • DiG predicts adsorption sites and stable adsorbate configurations with probabilities.

  • Example: DiG predicted the adsorption configurations of an acyl group on a stepped TiIr alloy surface.

  • DiG found all stable sites identified by a grid search using DFT methods.

  • The predicted adsorption configurations were close to the DFT results (r.m.s.d. 0.5–0.8 Å).

  • DiG predicts adsorption sites and probabilities for single N or O atoms on ten metallic surfaces.

  • Achieves 81% site coverage compared with DFT grid search results.

  • Predictions agree closely with adsorption energies from DFT.

  • DiG is much faster than DFT (1 minute vs >2 hours for a single relaxation).
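
To illustrate how site probabilities can be compared with adsorption energies, here is a minimal sketch that converts energies into equilibrium probabilities by Boltzmann weighting. Boltzmann weighting is a standard relation at thermal equilibrium, but whether the paper's comparison uses exactly this form is an assumption here. The site names, energies and temperature are hypothetical, and this is not DiG's code.

```python
import math

K_B_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K

def boltzmann_probabilities(energies_ev, temperature_k=300.0):
    """Map site -> energy (eV) to site -> equilibrium probability."""
    kt = K_B_EV_PER_K * temperature_k
    # Shift by the minimum energy so the exponentials stay numerically stable.
    e_min = min(energies_ev.values())
    weights = {
        site: math.exp(-(e - e_min) / kt) for site, e in energies_ev.items()
    }
    z = sum(weights.values())  # partition function over the listed sites
    return {site: w / z for site, w in weights.items()}

# Hypothetical adsorption energies in eV (more negative = more stable).
energies = {"hollow": -1.20, "bridge": -1.15, "top": -0.90}

probs = boltzmann_probabilities(energies)
for site, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{site}: {p:.3f}")
```

Under this weighting, lower-energy sites get higher probability, so a model whose predicted site probabilities rank sites in the same order as the DFT energies would be in qualitative agreement with DFT.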


Paper:

Code and Model weights:

Presentation on DiG:


Thank you for spending time on my blog! I would love to hear which other topics, tools, tutorials or discussions you would like me to cover in future posts.

Please do not hesitate to connect via LinkedIn.

Thanks for reading Protein Design Studio! This post is public so feel free to share it.
