Geometric Representation Learning on Molecular Graphs
Date
2024-07-28
Type of Degree
PhD Dissertation
Department
Computer Science and Software Engineering
Restriction Status
EMBARGOED
Restriction Type
Full
Date Available
07-28-2027
Abstract
Graphs have recently attracted significant attention as a data structure. Representation learning on geometric graphs has achieved great success in many domains, including molecular, social, and financial networks. It is natural to represent proteins as graphs in which nodes represent residues and edges represent pairwise interactions between residues. However, 3D protein structures have rarely been studied directly as graphs. The challenges include: 1) Proteins are complex macromolecules composed of thousands of atoms, which makes them much harder to model than small molecules. 2) Capturing long-range pairwise relations for protein structure modeling remains under-explored. 3) Few studies have focused on jointly learning the different attributes of proteins. 4) Existing graph neural networks (GNNs) have limitations in capturing complex multi-level structural information and in handling molecular structures of variable size. In this dissertation, we propose four geometric representation learning frameworks to address these challenges in different scenarios. First, we introduce the Protein Graph-GNN (PG-GNN) architecture for protein backbone structure modeling, which uses geometric graph convolution blocks to generate distance geometric graph representations and can handle protein graphs of variable size dynamically. This is a significant advantage because the network opens a new path from sequence to structure. Second, we develop the Attention-based Protein-drug Interaction Prediction (APIP) framework for interpretable protein-ligand interface prediction, which handles different input types separately and models long-range dependencies in protein sequences. Third, we present the Explainable framework for drug-target Interaction prediction (EIR), which incorporates both intrinsic and extrinsic information to enhance interpretability and accuracy in drug screening.
Finally, we propose the Subgraph Aggregation Module Network (SAMNet), a rotation-equivariant network with provably lossless geometric encoding for molecular representation learning, which captures complex geometry across spatial dimensions using a subgraph sampling policy and a drop-in geometric Subgraph Aggregation Module (SAM). We conduct extensive experiments on benchmark datasets, demonstrating the effectiveness of the proposed methods for in silico structural biology and rational drug discovery and their ability to address the limitations of existing GNNs in molecular representation learning. By developing these approaches, we contribute to advancing the life sciences and pave the way for more accurate, interpretable, and generalizable machine learning models in protein structure prediction, drug discovery, and molecular representation learning.
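
As a rough illustration of the residue-graph view described in the abstract (nodes as residues, edges as pairwise interactions), the following minimal sketch builds a graph from 3D coordinates using a distance cutoff. It is not taken from the dissertation: the function names `residue_graph` and `rotate_z`, the 8.0 angstrom cutoff, and the toy coordinates are all assumptions made for illustration. It also shows one well-known property that distance-based representations rely on: pairwise distances, and hence the resulting edges, are unchanged when the whole structure is rotated.

```python
import math

def residue_graph(coords, cutoff=8.0):
    """Build an undirected residue graph from 3D coordinates.

    coords: list of (x, y, z) tuples, one per residue (e.g. C-alpha atoms).
    cutoff: hypothetical distance threshold in angstroms; an assumption,
            not a value taken from the dissertation.
    Returns a dict mapping each edge (i, j) with i < j to its distance.
    """
    edges = {}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = math.dist(coords[i], coords[j])
            if d <= cutoff:
                edges[(i, j)] = d
    return edges

def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

# Toy coordinates, made up for illustration (not real protein data).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]

graph = residue_graph(coords)
rotated = residue_graph([rotate_z(p, 0.7) for p in coords])

# The edge set and edge distances agree up to floating-point error,
# because rigid rotations preserve pairwise distances.
same_edges = set(graph) == set(rotated)
```

In this toy case the fourth point lies beyond the cutoff from all others, so it ends up as an isolated node. A real pipeline would typically attach node features (such as residue type) and may choose edges differently, for example by k-nearest neighbors.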