This Is Auburn Electronic Theses and Dissertations

Geometric Representation Learning on Molecular Graphs

Date

2024-07-28

Author

Tian, Xia

Type of Degree

PhD Dissertation

Department

Computer Science and Software Engineering

Restriction Status

EMBARGOED

Restriction Type

Full

Date Available

2027-07-28

Abstract

Graphs have recently attracted significant attention as a data structure, and representation learning on geometric graphs has achieved great success in many domains, including molecular, social, and financial networks. It is natural to represent proteins as graphs in which nodes represent residues and edges represent pairwise interactions between residues. However, 3D protein structures have rarely been studied directly as graphs. The challenges include: 1) Proteins are complex macromolecules composed of thousands of atoms, which makes them much harder to model than small molecules. 2) Capturing long-range pairwise relations for protein structure modeling remains under-explored. 3) Few studies have focused on jointly learning the different attributes of proteins. 4) Existing graph neural networks (GNNs) are limited in capturing complex multi-level structural information and in handling molecular structures of variable size. In this dissertation, we propose four geometric representation learning frameworks that address these challenges under different scenarios. First, we introduce the Protein Graph-GNN (PG-GNN) architecture for protein backbone structure modeling, which uses geometric graph convolution blocks to generate distance-geometric graph representations and can dynamically handle protein graphs of variable size. This is a significant advantage because the network opens a new path from sequence to structure. Second, we develop the Attention-based Protein-drug Interaction Prediction (APIP) framework for interpretable protein-ligand interface prediction, which handles different input types separately and models long-range dependencies in protein sequences. Third, we present the Explainable framework for drug-target Interaction prediction (EIR), which incorporates both intrinsic and extrinsic information to improve interpretability and accuracy in drug screening.
Finally, we propose the Subgraph Aggregation Module Network (SAMNet), a rotation-equivariant network with provably lossless geometric encoding for molecular representation learning, which captures complex geometry across spatial dimensions using a subgraph sampling policy and a drop-in geometric Subgraph Aggregation Module (SAM). We conducted extensive experiments on benchmark datasets, which demonstrated the effectiveness of the proposed methods for in silico structural biology and rational drug discovery and showed that they address the limitations of existing GNNs in molecular representation learning. By developing these approaches, we contribute to advancing the life sciences and pave the way for more accurate, interpretable, and generalizable machine learning models in protein structure prediction, drug discovery, and molecular representation learning.
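
The graph view of proteins described in the abstract (residues as nodes, pairwise interactions as edges) can be illustrated with a minimal sketch. This is not the dissertation's method: the 8 Å distance cutoff, the toy coordinates, and the helper names `build_residue_graph` and `rotate_z` are all assumptions chosen for illustration. The sketch also shows why distance-based edge features are a convenient geometric representation: they do not change when the whole structure is rotated.

```python
import math

def build_residue_graph(coords, cutoff=8.0):
    """Return edges (i, j, distance) for residue pairs within `cutoff`.

    `coords` holds one 3D point per residue (for example, a C-alpha atom).
    The 8.0 angstrom cutoff is an assumed, illustrative value.
    """
    edges = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = math.dist(coords[i], coords[j])
            if d <= cutoff:
                edges.append((i, j, d))
    return edges

def rotate_z(point, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

# Hypothetical toy coordinates in angstroms, not taken from a real protein.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]

edges = build_residue_graph(coords)
rotated_edges = build_residue_graph([rotate_z(p, math.pi / 3) for p in coords])

# The same residue pairs are connected before and after rotation,
# and their distances agree up to floating-point error.
same_pairs = [(i, j) for i, j, _ in edges] == [(i, j) for i, j, _ in rotated_edges]
same_dists = all(abs(a[2] - b[2]) < 1e-9 for a, b in zip(edges, rotated_edges))
print(same_pairs, same_dists)
```

Because the graph is built per structure from however many residues are present, the same routine applies to inputs of any size, which is the property the abstract refers to as handling variable-size protein graphs.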